Prediction(6)PyLib and Machine Learning

Article link: https://blog.youkuaiyun.com/magic_dreamer/article/details/84743888


1. Introduction
An ensemble method creates a model composed of a set of other base models. Gradient-Boosted Trees (GBTs) and Random Forests both use decision trees as their base models.

GBTs train one tree at a time, so they can take longer to train than Random Forests, which can train multiple trees in parallel. On the other hand, GBTs usually work with smaller (shallower) trees than Random Forests do, and smaller trees take less time to train.

Training more trees in a Random Forest reduces the likelihood of overfitting. Training more trees with GBTs increases the likelihood of overfitting.

Put another way, Random Forests reduce variance by using more trees, while GBTs reduce bias by using more trees.
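
The variance claim can be illustrated with a toy simulation in plain Python (no Spark needed). This is only a sketch under strong assumptions: each "tree" is modelled as the true value plus independent Gaussian noise, which real, correlated trees are not, and the names forest_prediction and prediction_variance are made up for this illustration. It says nothing about the bias reduction of GBTs.

```python
import random


def forest_prediction(num_trees, rng, true_value=1.0, noise=0.5):
    """Average the outputs of num_trees noisy, independent base predictors."""
    total = 0.0
    for _ in range(num_trees):
        # Assumption: each base predictor is unbiased with Gaussian noise.
        total += true_value + rng.gauss(0.0, noise)
    return total / num_trees


def prediction_variance(num_trees, trials=2000, seed=42):
    """Estimate the variance of the averaged prediction over many trials."""
    rng = random.Random(seed)
    preds = [forest_prediction(num_trees, rng) for _ in range(trials)]
    mean = sum(preds) / trials
    return sum((p - mean) ** 2 for p in preds) / trials


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, prediction_variance(n))
```

Under these assumptions the variance of the averaged prediction shrinks roughly in proportion to 1/numTrees, which is the intuition behind adding more trees to a Random Forest.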

2. Try with Random Forests

Error Message in Zeppelin:
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark.py", line 162, in <module>
    eval(compiledCode)
  File "<string>", line 1, in <module>
  File "/opt/spark/python/pyspark/mllib/__init__.py", line 25, in <module>
    import numpy
ImportError: No module named numpy

Solution:
http://stackoverflow.com/questions/7818811/import-error-no-module-named-numpy

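Download the latest file from http://sourceforge.net/projects/numpy/files/NumPy/, unpack it, and install it from the source directory.

> wget http://tcpdiag.dl.sourceforge.net/project/numpy/NumPy/1.10.0/numpy-1.10.0.tar.gz

> tar zxvf numpy-1.10.0.tar.gz

> cd numpy-1.10.0

> sudo python setup.py install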

Verify the installation:
> python
>>> import numpy
>>> exit()
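
The same check can be scripted, which is handy when several machines have to be verified. The following is a minimal sketch; the helper name module_available is made up for this example, and it only tells you whether the interpreter that runs it can import the module, which may differ from the interpreter Zeppelin launches.

```python
import importlib


def module_available(name):
    """Return True if the current interpreter can import the named module."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # numpy is the module pyspark.mllib failed to import above.
    print("numpy available:", module_available("numpy"))
```

Run it with the same python binary that the Spark workers use (here /usr/bin/python, according to the error messages).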

Error Message in Zeppelin Logs:
ERROR [2015-10-06 14:14:40,447] ({qtp1852584274-48} NotebookServer.java[runParagraph]:630) - Exception from run
org.apache.zeppelin.interpreter.InterpreterException: pyspark interpreter not found
    at org.apache.zeppelin.notebook.NoteInterpreterLoader.get(NoteInterpreterLoader.java:148)
    at org.apache.zeppelin.notebook.Note.run(Note.java:282)
    at org.apache.zeppelin.socket.NotebookServer.runParagraph(NotebookServer.java:628)
    at org.apache.zeppelin.socket.NotebookServer.onMessage(NotebookServer.java:126)

Solution: rebuild Zeppelin with PySpark enabled, following
http://blog.cloudera.com/blog/2015/07/how-to-install-apache-zeppelin-on-cdh/

>mvn clean package -Pspark-1.5 -Dpyspark -Dspark.version=1.5.0 -Dhadoop.version=2.7.0 -Phadoop-2.6 -Pyarn -DskipTests -P build-distr

Try this code in Zeppelin:
%pyspark
sc.parallelize([1,2,3]).count()

Exception:
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /home/carl/tool/hadoop-2.7.1/temp/nm-local-dir/usercache/carl/filecache/20/spark-assembly-1.5.0-hadoop2.6.0.jar
java.io.EOFException

Solution:
http://stackoverflow.com/questions/30824818/what-to-set-spark-home-to

Add this to the Zeppelin configuration file:
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH"
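
The py4j version in that path (0.8.2.1) is hardcoded and changes between Spark releases. As a rough sketch of how the value could be derived instead, the hypothetical helper below globs for the py4j source zip under $SPARK_HOME/python/lib. The function name build_pythonpath and the directory layout assumption (python/lib/py4j-*-src.zip) are illustrative and based only on the path shown above.

```python
import glob
import os


def build_pythonpath(spark_home, existing=""):
    """Build a PYTHONPATH value for PySpark from a Spark install directory."""
    python_dir = os.path.join(spark_home, "python")
    pattern = os.path.join(python_dir, "lib", "py4j-*-src.zip")
    py4j_zips = sorted(glob.glob(pattern))
    if not py4j_zips:
        raise IOError("no py4j-*-src.zip found under " + python_dir)
    # If several zips match, take the lexicographically last one.
    parts = [python_dir, py4j_zips[-1]]
    if existing:
        parts.append(existing)
    return os.pathsep.join(parts)


if __name__ == "__main__":
    spark_home = os.environ.get("SPARK_HOME", "/opt/spark")
    try:
        print(build_pythonpath(spark_home, os.environ.get("PYTHONPATH", "")))
    except IOError as err:
        print("Could not build PYTHONPATH:", err)
```

The printed value can be pasted into the export line, so the configuration does not silently break when the bundled py4j version changes.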

That should have fixed it, but the VMs are slow, so I did not get it working perfectly. I may try this again with a later version.

3. Set up Single Mode
Follow the notes here: http://sillycat.iteye.com/blog/2247102
Only this configuration is needed for Zeppelin in local mode:
export SPARK_HOME="/opt/spark"
. ${SPARK_HOME}/conf/spark-env.sh
export ZEPPELIN_CLASSPATH="${SPARK_CLASSPATH}"
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH"

Single mode works great for me, and it is also much faster than running in the VMs.

4. Random Forest Sample on Zeppelin
%pyspark

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils

%pyspark

data = MLUtils.loadLibSVMFile(sc, "/opt/spark/data/mllib/sample_libsvm_data.txt")
(trainingData, testData) = data.randomSplit([0.7, 0.3])

%pyspark

model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},
                                    numTrees=3, featureSubsetStrategy="auto",
                                    impurity='variance', maxDepth=4, maxBins=32)

%pyspark

# Predict on the test features, then pair each true label with its prediction.
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
# Index into the (label, prediction) pair; "lambda (v, p):" is a syntax error on Python 3.
testMSE = labelsAndPredictions.map(lambda vp: (vp[0] - vp[1]) ** 2).sum() / float(testData.count())
print('Test Mean Squared Error = ' + str(testMSE))
print('Learned regression forest model:')
print(model.toDebugString())
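
To make the testMSE line easier to follow, here is the same calculation on an ordinary Python list instead of an RDD. It is a sketch for illustration only: the function name mean_squared_error and the sample pairs are made up, not output from the model above.

```python
def mean_squared_error(labels_and_predictions):
    """Average of (label - prediction)^2 over (label, prediction) pairs."""
    pairs = list(labels_and_predictions)
    if not pairs:
        raise ValueError("need at least one (label, prediction) pair")
    return sum((v - p) ** 2 for v, p in pairs) / float(len(pairs))


if __name__ == "__main__":
    # Made-up pairs: one exact prediction and one that is off by 1.0.
    sample = [(1.0, 1.0), (0.0, 1.0)]
    print("MSE =", mean_squared_error(sample))
```

The Spark version does the same thing in a distributed way: map computes the squared error per pair, sum adds them up, and the division by testData.count() takes the mean.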

References:
http://spark.apache.org/docs/latest/mllib-ensembles.html

Setup Zeppelin Again with Python
http://sillycat.iteye.com/blog/2247102