Prediction(6)PyLib and Machine Learning

本文详细介绍了如何在Python环境中利用Zeppelin进行Spark操作,包括安装配置、单节点模式设置、随机森林算法应用及遇到错误的解决方法。重点覆盖了Spark集成与Zeppelin的整合步骤,提供了从基础配置到复杂算法应用的全面指南。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

Prediction(6)PyLib and Machine Learning

1. Introduction
An ensemble method will create a model composed of a set of other base models. Gradientboostedtrees and RandomForest both use decision trees as their base models.

GBTs train one tree at a time, so they can take longer to train than random forests. Random Forests in parallel. (smaller trees with GBTs)

Training more trees in a Random Forest reduces the likelihood of overfitting. More trees with GBTs increases the likelihood of overfitting.

Random Forests reduce variance by using more trees, GBTs reduce bias by using more trees.

2. Try with Random Forests

Error Message in Zeppelin:
Traceback (most recent call last): File "/tmp/zeppelin_pyspark.py", line 162, in <module> eval(compiledCode) File "<string>", line 1, in <module> File "/opt/spark/python/pyspark/mllib/__init__.py", line 25, in <module> import numpy ImportError: No module named numpy

Solution:
http://stackoverflow.com/questions/7818811/import-error-no-module-named-numpy

Download the latest file from http://sourceforge.net/projects/numpy/files/NumPy/

> wget http://tcpdiag.dl.sourceforge.net/project/numpy/NumPy/1.10.0/numpy-1.10.0.tar.gz

> sudo python setup.py install

Verify the installation
>python
python>>>import numpy
python>>>exit()

Error Message in Zeppelin Logs
ERROR [2015-10-06 14:14:40,447] ({qtp1852584274-48} NotebookServer.java[runParagraph]:630) - Exception from run
org.apache.zeppelin.interpreter.InterpreterException: pyspark interpreter not found
at org.apache.zeppelin.notebook.NoteInterpreterLoader.get(NoteInterpreterLoader.java:148)
at org.apache.zeppelin.notebook.Note.run(Note.java:282)
at org.apache.zeppelin.socket.NotebookServer.runParagraph(NotebookServer.java:628)
at org.apache.zeppelin.socket.NotebookServer.onMessage(NotebookServer.java:126)

Solution:
http://blog.cloudera.com/blog/2015/07/how-to-install-apache-zeppelin-on-cdh/

>mvn clean package -Pspark-1.5 -Dpyspark -Dspark.version=1.5.0 -Dhadoop.version=2.7.0 -Phadoop-2.6 -Pyarn -DskipTests -P build-distr

Try these codes in zeppelin.
%pyspark
sc.parallelize([1,2,3]).count()

Exception:
Error from python worker: /usr/bin/python: No module named pyspark PYTHONPATH was: /home/carl/tool/hadoop-2.7.1/temp/nm-local-dir/usercache/carl/filecache/20/spark-assembly-1.5.0-hadoop2.6.0.jar java.io.EOFException

Solution:
http://stackoverflow.com/questions/30824818/what-to-set-spark-home-to

Add this in zeppelin configuration file.
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH"

It should be right there. But the VMs are slow. So I did not make it perfectly working. I may try this in later version.

3. Set up Single Mode
Following the nodes here http://sillycat.iteye.com/blog/2247102
Only these configuration for zeppelin in local MODE
export SPARK_HOME="/opt/spark"
. ${SPARK_HOME}/conf/spark-env.sh
export ZEPPELIN_CLASSPATH="${SPARK_CLASSPATH}"
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH"

And the single mode is working great for me. And the speed is also much better than in the VMs.

4. Random Forest Sample on Zeppelin
%pyspark

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils

%pyspark

data = MLUtils.loadLibSVMFile(sc, "/opt/spark/data/mllib/sample_libsvm_data.txt")
(trainingData, testData) = data.randomSplit([0.7, 0.3])

%pyspark

model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},
numTrees=3, featureSubsetStrategy="auto",
impurity='variance', maxDepth=4, maxBins=32)

%pyspark

predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testMSE = labelsAndPredictions.map(lambda (v, p): (v - p) * (v - p)).sum() / float(testData.count())
print('Test Mean Squared Error = ' + str(testMSE))
print('Learned regression forest model:')
print(model.toDebugString())

References:
http://spark.apache.org/docs/latest/mllib-ensembles.html

Setup Zeppelin Again with Python
http://sillycat.iteye.com/blog/2247102
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值