Reference book: 《Spark快速大数据分析》 (Learning Spark: Lightning-Fast Big Data Analysis)
Link: https://pan.baidu.com/s/1-zb8J4nUbJ2fDQLuWrN46Q  Extraction code: 5eab
1. Basic operations
Example 1:
Data: student.txt
1 2 3 4 5
6 7 8 9 10
3 5 8
Code:
# encoding: utf-8
import findspark
findspark.init()
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("app")
    sc = SparkContext(conf=conf)
    lines = sc.textFile('./student.txt')                 # read the file as an RDD of lines
    print(lines.first())                                 # first line
    print(lines.count())                                 # number of lines
    lines_1 = lines.filter(lambda x: "5" in x)           # keep lines containing "5"
    lines_2 = lines.filter(lambda x: "3" in x)           # keep lines containing "3"
    lines_union = lines_1.union(lines_2)                 # concatenate the two RDDs (duplicates are kept)
    print(lines_union.collect())
    lines_flat = lines.flatMap(lambda x: x.split(" "))   # flatten: one element per number string
    print('lines_flat:', lines_flat.collect())
    lines_map = lines_flat.map(lambda x: float(x))       # map each string to a float
    print('lines_map: ', lines_map.collect())
    # reduce applies a binary function to all elements of the RDD and returns a single value.
    # (By contrast, reduceByKey works on an RDD of (K, V) pairs: values sharing the same key
    # are reduced with the binary function, so each key ends up paired with a single value.)
    lines_reduce = lines_map.reduce(lambda x, y: x + y)  # sum of all numbers
    print(lines_reduce)
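The comment above describes reduceByKey, but the example itself only calls reduce. For comparison, here is a minimal reduceByKey sketch, assuming the sc and lines defined above; the (token, 1) pairing is only an illustration and not part of the original example.

pairs = lines.flatMap(lambda x: x.split(" ")).map(lambda w: (w, 1))   # build (token, 1) pairs
counts = pairs.reduceByKey(lambda a, b: a + b)                        # sum the 1s per distinct token
print(counts.collect())   # e.g. ('3', 2) and ('5', 2), since those tokens appear twice in student.txt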
Result: