1. Configuration Process
For the detailed configuration steps, see: Detailed steps for setting up a Spark + Python development environment on a Windows PC.
Following the steps above with Anaconda 5.1 (Python 3.6) + Java 1.7.0_79 + Spark 2.0.1 + Hadoop 2.6.0, the following error appears:
AttributeError: 'module' object has no attribute 'bool_'
Possible causes of this error:
- The software was not downloaded exactly as the tutorial specifies, so compatibility with the other components already on the system is suspect
- Spark 2.0.1 does not yet support Python 3.6
Solution:
Configuration succeeded with Anaconda 4.2.0 (Python 3.5) + Java 1.7.0_79 + Spark 2.0.1 + Hadoop 2.6.0.
Note: when installing py4j during the tutorial steps, close Jupyter Notebook first.
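
To confirm which interpreter and JVM actually end up on the path (a quick sanity check, not part of the original tutorial), something like the following can be run from a Python prompt:

import sys
import subprocess

# Illustrative environment check: with the working combination above,
# the interpreter should report Python 3.5.x and the JVM 1.7.0_79.
print(sys.version)
subprocess.run(['java', '-version'])   # java prints its version to stderr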
2. Verifying the Installation
from pyspark.sql import SparkSession

# Build (or reuse) a local SparkSession
spark = SparkSession.builder \
    .appName('My_App') \
    .master('local') \
    .getOrCreate()

# Read a CSV file, treating the first line as the header
df = spark.read.csv('example.csv', header=True)
df.printSchema()
The output is the schema of the data:
root
|-- SHEDID: string (nullable = true)
|-- time: string (nullable = true)
|-- RT: string (nullable = true)
|-- LEASE: string (nullable = true)
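
Beyond the schema, the contents can be inspected as well. Continuing in the same session, and assuming the example.csv and columns shown above, something like:

# Supplementary check (not in the original tutorial): look at the
# data itself, not just the schema.
df.show(5)                             # print the first 5 rows
print(df.count())                      # total number of rows
df.select('SHEDID', 'RT').show(3)      # project a couple of columns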
3. WordCount Test
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonWordCount")

    # Read the input file as an RDD of lines
    lines = sc.textFile('words.txt')

    # Split each line into words, map each word to (word, 1),
    # then sum the counts per word
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)

    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))

    sc.stop()
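
As a cross-check (not part of the original test), the same word count can also be expressed with the DataFrame API from step 2; the file name words.txt is reused from above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

# Illustrative DataFrame version of the same word count (assumes the
# same words.txt as above).
spark = SparkSession.builder \
    .appName('WordCountDF') \
    .master('local') \
    .getOrCreate()

lines = spark.read.text('words.txt')   # one string column named 'value'
words = lines.select(explode(split(lines.value, ' ')).alias('word'))
words.groupBy('word').count().show()

spark.stop()

The RDD version makes each stage of the computation explicit, while the DataFrame version lets Spark's optimizer plan the aggregation; for a quick installation test, either is sufficient.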