Spark command-line environment with Python

This article describes how to set up Python for Spark 1.6.0: install Python 2.7.6, then set the PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS environment variables to use the IPython notebook. When the `sc` object cannot be found, it can be recovered with `SparkContext.getOrCreate()`. While starting and using pyspark, also make sure the local Hadoop cluster is running. The article also shows how to fix the undefined `sc` problem by importing `SparkContext` manually and creating an instance.


1. Install Python, then check the Python version

$ python --version
Python 2.7.6

As the snippet below from the pyspark launch script (bin/pyspark) shows, Python 2.7 is used by default when it is installed (the Spark version here is spark-1.6.0-bin-hadoop2.6):

if hash python2.7 2>/dev/null; then
  # Attempt to use Python 2.7, if installed:
  DEFAULT_PYTHON="python2.7"
else
  DEFAULT_PYTHON="python"
fi
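Once the shell is up, you can double-check from inside PySpark which interpreter the driver is actually running on. A minimal sketch (plain Python, nothing Spark-specific):

import sys

# Print the path and version of the interpreter the PySpark driver is using.
print(sys.executable)
print(sys.version)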

2. Run pyspark

/usr/local/spark$ bin/pyspark

It warns that IPYTHON and IPYTHON_OPTS have been replaced, so set the new environment variables instead:

export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
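To confirm these variables are actually visible to the driver process, they can be inspected from the notebook or shell. A minimal sketch (PYSPARK_PYTHON is listed only for completeness; it is not set above):

import os

# Show the PySpark-related environment variables as the driver sees them.
for name in ("PYSPARK_DRIVER_PYTHON", "PYSPARK_DRIVER_PYTHON_OPTS", "PYSPARK_PYTHON"):
    print(name + " = " + str(os.environ.get(name)))

With these variables set, restarting bin/pyspark launches the IPython notebook, and the startup log includes lines like these: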


 

16/01/24 09:34:51 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-bf84dcd6-0789-4ceb-b950-288d6617955c
16/01/24 09:34:51 INFO MemoryStore: MemoryStore started with capacity 517.4 MB
16/01/24 09:34:51 INFO SparkEnv: Registering OutputCommitCoordinator
16/01/24 09:34:52 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/01/24 09:34:52 INFO SparkUI: Started SparkUI at http://192.168.0.101:4040
16/01/24 09:34:52 INFO Executor: Starting executor ID driver on host localhost
16/01/24 09:34:52 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 54466.
16/01/24 09:34:52 INFO NettyBlockTransferService: Server created on 54466
16/01/24 09:34:52 INFO BlockManagerMaster: Trying to register BlockManager
16/01/24 09:34:52 INFO BlockManagerMasterEndpoint: Registering block manager localhost:54466 with 517.4 MB RAM, BlockManagerId(driver, localhost, 54466)
16/01/24 09:34:52 INFO BlockManagerMaster: Registered BlockManager
3. Load a file with sc

lines = sc.textFile("README.md")

 

This fails because sc is not defined, so first import and create the object explicitly:

from pyspark import SparkContext, SparkConf
sc = SparkContext()
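Since SparkConf is imported above, the context can also be built from an explicit configuration if you want to pin the master URL and application name (local[2] mirrors the master used in PYSPARK_SUBMIT_ARGS below; the app name is just illustrative). A minimal sketch:

from pyspark import SparkConf, SparkContext

# Build an explicit configuration instead of relying on the shell defaults.
conf = SparkConf().setMaster("local[2]").setAppName("readme-example")  # illustrative values
sc = SparkContext(conf=conf)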

At the same time, remove the following environment variable:

export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"


 

16/01/24 09:35:44 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
16/01/24 09:35:44 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 147.1 KB, free 147.1 KB)
16/01/24 09:35:44 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 14.3 KB, free 161.4 KB)
16/01/24 09:35:44 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:54466 (size: 14.3 KB, free: 517.4 MB)
16/01/24 09:35:44 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2 

>>> lines.count()

First check that the file path is correct: is README.md in the default directory, or under a custom path such as /homedir/README.md?
Then look at the error reported in the log:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/spark/python/pyspark/rdd.py", line 1004, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/usr/local/spark/python/pyspark/rdd.py", line 995, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "/usr/local/spark/python/pyspark/rdd.py", line 869, in fold
    vals = self.mapPartitions(func).collect()
  File "/usr/local/spark/python/pyspark/rdd.py", line 771, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/local/spark/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError


The error above is most likely because the local Hadoop cluster has not been started yet.
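If you only want to read a file from the local filesystem while the Hadoop cluster is down, you can pass an explicit file:// URI so Spark does not resolve the relative path against the default HDFS filesystem. A sketch, assuming README.md sits in the Spark install directory:

from pyspark import SparkContext

# Reuse the shell's context if one already exists (see the fix in step 4 below).
sc = SparkContext.getOrCreate()

# file:/// forces the local filesystem instead of the default HDFS one;
# the path below is illustrative, adjust it to wherever README.md lives.
lines = sc.textFile("file:///usr/local/spark/README.md")
print(lines.count())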

4. Fix "Cannot run multiple SparkContexts"

from pyspark import SparkContext, SparkConf
sc = SparkContext()
lines = sc.textFile("/homedir/README.md")

Error:

ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=PySparkShell, master=local[*]) created by __init__ at <ipython-input-2-0621806439bc>:2 

Fix:

Replace sc = SparkContext() with sc = SparkContext.getOrCreate().
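Put together, the cell then works whether or not a context has already been created in the shell:

from pyspark import SparkContext

# getOrCreate() returns the SparkContext the pyspark shell already created,
# or builds a new one if none exists, so the ValueError above cannot occur.
sc = SparkContext.getOrCreate()
lines = sc.textFile("/homedir/README.md")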

5. Start the Hadoop cluster

Exit pyspark with exit().

Start the Hadoop cluster, then use jps to confirm the daemons are running:

/usr/local/hadoop/sbin/start-all.sh

3704 ResourceManager
3541 SecondaryNameNode
3194 NameNode
4155 Jps
3329 DataNode
3839 NodeManager

6. Start pyspark again

/usr/local/spark$ bin/pyspark
>>> lines = sc.textFile("README.md")
>>> lines.count()
16/01/24 09:41:44 INFO HadoopRDD: Input split: hdfs://namenode:9000/user/tizen/README.md:1679+1680
16/01/24 09:41:44 INFO HadoopRDD: Input split: hdfs://namenode:9000/user/tizen/README.md:0+1679
16/01/24 09:41:44 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/01/24 09:41:44 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/01/24 09:41:44 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/01/24 09:41:44 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/01/24 09:41:44 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/01/24 09:41:46 INFO PythonRunner: Times: total = 1903, boot = 1648, init = 254, finish = 1
16/01/24 09:41:46 INFO PythonRunner: Times: total = 51, boot = 5, init = 45, finish = 1
16/01/24 09:41:46 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2124 bytes result sent to driver
16/01/24 09:41:46 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2124 bytes result sent to driver
16/01/24 09:41:46 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2361 ms on localhost (1/2)
16/01/24 09:41:46 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 2307 ms on localhost (2/2)
16/01/24 09:41:46 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
16/01/24 09:41:46 INFO DAGScheduler: ResultStage 0 (count at <stdin>:1) finished in 2.618 s
16/01/24 09:41:46 INFO DAGScheduler: Job 0 finished: count at <stdin>:1, took 2.970501 s
95

The count comes back correctly.
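The input-split lines in the log show that README.md was actually resolved against HDFS (hdfs://namenode:9000/user/tizen/README.md). The same file can be read with the fully qualified URI; a sketch assuming that namenode address and user directory:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Fully qualified HDFS URI taken from the input-split log lines above;
# adjust the namenode host/port and user directory for your cluster.
lines = sc.textFile("hdfs://namenode:9000/user/tizen/README.md")
print(lines.count())  # prints 95 in the run above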


PS:

If sc is still not found, you can import and create it manually (or use SparkContext.getOrCreate() if a context may already exist, as in step 4):

from pyspark import SparkContext

sc = SparkContext()
lines = sc.textFile("README.md")
lines.count()



