1. Install Python; once it is installed, check the Python version:
$ python --version
Python 2.7.6
As the following snippet from the pyspark launch script (bin/pyspark) shows, Python 2.7 is used by default when it is installed (the Spark version here is spark-1.6.0-bin-hadoop2.6):
if hash python2.7 2>/dev/null; then
  # Attempt to use Python 2.7, if installed:
  DEFAULT_PYTHON="python2.7"
else
  DEFAULT_PYTHON="python"
fi
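As a quick sanity check of which interpreter the driver actually picked, the following can be typed into the pyspark shell; these are plain standard-library calls, nothing Spark-specific:
import sys
print(sys.version)      # should report 2.7.x on this setup
print(sys.executable)   # path of the interpreter the launcher chose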
2. Run pyspark
/usr/local/spark$ bin/pyspark
The shell reports that IPYTHON and IPYTHON_OPTS have been replaced, so set these environment variables instead:
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
16/01/24 09:34:51 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-bf84dcd6-0789-4ceb-b950-288d6617955c
16/01/24 09:34:51 INFO MemoryStore: MemoryStore started with capacity 517.4 MB
16/01/24 09:34:51 INFO SparkEnv: Registering OutputCommitCoordinator
16/01/24 09:34:52 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/01/24 09:34:52 INFO SparkUI: Started SparkUI at http://192.168.0.101:4040
16/01/24 09:34:52 INFO Executor: Starting executor ID driver on host localhost
16/01/24 09:34:52 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 54466.
16/01/24 09:34:52 INFO NettyBlockTransferService: Server created on 54466
16/01/24 09:34:52 INFO BlockManagerMaster: Trying to register BlockManager
16/01/24 09:34:52 INFO BlockManagerMasterEndpoint: Registering block manager localhost:54466 with 517.4 MB RAM, BlockManagerId(driver, localhost, 54466)
16/01/24 09:34:52 INFO BlockManagerMaster: Registered BlockManager
3. lines=sc.textFile("README.md")
It turns out that sc is not defined, so fall back to importing and creating it by hand:
from pyspark import SparkContext, SparkConf
sc = SparkContext()
At the same time, remove (unset) the following environment variable:
export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"
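For reference, a minimal sketch of building the context explicitly through a SparkConf; "local[2]" mirrors the PYSPARK_SUBMIT_ARGS value above, while the app name is only an illustrative placeholder:
from pyspark import SparkContext, SparkConf

conf = SparkConf().setMaster("local[2]").setAppName("readme-count")  # app name is a made-up example
sc = SparkContext(conf=conf)
lines = sc.textFile("README.md")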
16/01/24 09:35:44 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
16/01/24 09:35:44 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 147.1 KB, free 147.1 KB)
16/01/24 09:35:44 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 14.3 KB, free 161.4 KB)
16/01/24 09:35:44 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:54466 (size: 14.3 KB, free: 517.4 MB)
16/01/24 09:35:44 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2
>>> lines.count()
First check whether the file location is correct: is README.md in the default directory or under a custom one such as /homedir/README.md?
Then look at the following error message in the log:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/spark/python/pyspark/rdd.py", line 1004, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/usr/local/spark/python/pyspark/rdd.py", line 995, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "/usr/local/spark/python/pyspark/rdd.py", line 869, in fold
    vals = self.mapPartitions(func).collect()
  File "/usr/local/spark/python/pyspark/rdd.py", line 771, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/local/spark/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError
This error is most likely caused by the local Hadoop cluster not having been started yet.
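One way to rule out a path problem is to name the file system explicitly; the two paths below are only examples, pieced together from the directories that appear elsewhere in these notes (/usr/local/spark locally, hdfs://namenode:9000/user/tizen on HDFS):
# Read from the local file system explicitly (install path assumed from the traceback above):
local_lines = sc.textFile("file:///usr/local/spark/README.md")
# Read from HDFS explicitly (namenode address and path as they appear in the step-6 logs):
hdfs_lines = sc.textFile("hdfs://namenode:9000/user/tizen/README.md")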
4.
from pyspark import SparkContext, SparkConf
sc = SparkContext()
lines = sc.textFile("/homedir/README.md")
Error:
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=PySparkShell, master=local[*]) created by __init__ at <ipython-input-2-0621806439bc>:2
Fix:
Replace sc = SparkContext() with sc = SparkContext.getOrCreate().
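Putting the fix together with the snippet above, a minimal sketch of step 4 becomes:
from pyspark import SparkContext

sc = SparkContext.getOrCreate()              # reuses the context the shell already created
lines = sc.textFile("/homedir/README.md")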
5.
Exit pyspark with exit().
Start the Hadoop cluster:
/usr/local/hadoop/sbin/start-all.sh
Running jps confirms that the daemons are up:
3704 ResourceManager
3541 SecondaryNameNode
3194 NameNode
4155 Jps
3329 DataNode
3839 NodeManager
6. Start pyspark again
/usr/local/spark$ bin/pyspark
lines=sc.textFile("README.md")
>>> lines.count()
16/01/24 09:41:44 INFO HadoopRDD: Input split: hdfs://namenode:9000/user/tizen/README.md:1679+1680
16/01/24 09:41:44 INFO HadoopRDD: Input split: hdfs://namenode:9000/user/tizen/README.md:0+1679
16/01/24 09:41:44 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/01/24 09:41:44 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/01/24 09:41:44 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/01/24 09:41:44 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/01/24 09:41:44 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/01/24 09:41:46 INFO PythonRunner: Times: total = 1903, boot = 1648, init = 254, finish = 1
16/01/24 09:41:46 INFO PythonRunner: Times: total = 51, boot = 5, init = 45, finish = 1
16/01/24 09:41:46 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2124 bytes result sent to driver
16/01/24 09:41:46 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2124 bytes result sent to driver
16/01/24 09:41:46 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2361 ms on localhost (1/2)
16/01/24 09:41:46 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 2307 ms on localhost (2/2)
16/01/24 09:41:46 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/01/24 09:41:46 INFO DAGScheduler: ResultStage 0 (count at <stdin>:1) finished in 2.618 s
16/01/24 09:41:46 INFO DAGScheduler: Job 0 finished: count at <stdin>:1, took 2.970501 s
95
The count comes back as expected.
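As an optional extra check (not part of the original run), peeking at the data makes it clear that count() is counting lines of README.md:
lines.take(3)    # first three lines of the file, as plain strings
lines.first()    # just the first line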
PS:
If sc is again not found, it can be created manually:
from pyspark import SparkContext
sc=SparkContext()
lines=sc.textFile("README.md")
lines.count()
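If it is unclear whether the shell already provides sc, a small guard like the following (plain Python, no extra Spark API) avoids both the NameError here and the duplicate-context error from step 4:
from pyspark import SparkContext

try:
    sc                     # already defined if the shell set it up
except NameError:
    sc = SparkContext()    # otherwise create it by hand, as above
lines = sc.textFile("README.md")
lines.count()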