RDD Transformation Operations
# Create intRDD
intRDD = sc.parallelize([3,1,2,5,5,6])
intRDD.collect()
[3, 1, 2, 5, 5, 6]
# Create stringRDD
stringRDD = sc.parallelize(['apple','pen','banana'])
stringRDD.collect()
['apple', 'pen', 'banana']
# map applies a function to every element
def multiplyByThree(x):
    return x * 3
intRDD.map(multiplyByThree).collect()
intRDD.map(lambda x:x+1).collect()
stringRDD.map(lambda x:'first:'+x).collect()
[9, 3, 6, 15, 15, 18]
[4, 2, 3, 6, 6, 7]
['first:apple', 'first:pen', 'first:banana']
# filter keeps only the elements matching the predicate
intRDD.filter(lambda x: x>2).collect()
intRDD.filter(lambda x:0<x<5).collect()
intRDD.filter(lambda x:x>=5 or x<3).collect()
stringRDD.filter(lambda x: 'a' in x).collect()
[3, 5, 6]
[3, 1, 2]
[1, 2, 5, 6]
['apple', 'banana']
# distinct removes duplicate elements
intRDD.distinct().collect()
# randomSplit randomly splits the elements into multiple RDDs by the given proportions
sRDD = intRDD.randomSplit([0.4,0.6])
sRDD[0].collect()
sRDD[1].collect()
# groupBy splits the elements into groups according to the supplied function
gRDD = intRDD.groupBy(lambda x: "even" if(x%2 == 0) else "odd").collect()
gRDD
[1, 5, 2, 6, 3]
[2, 6]
[3, 1, 5, 5]
[('even', <pyspark.resultiterable.ResultIterable at 0x7f7a3c387c18>),
('odd', <pyspark.resultiterable.ResultIterable at 0x7f7a3c387320>)]
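groupBy returns lazy ResultIterable values (hence the reprs above); to see the grouped elements you must materialize each one, e.g. `[(k, list(v)) for k, v in gRDD]`. A plain-Python sketch of the same even/odd grouping:

```python
from collections import defaultdict

# Plain-Python sketch of groupBy's even/odd split (no Spark required).
data = [3, 1, 2, 5, 5, 6]

groups = defaultdict(list)
for x in data:
    groups["even" if x % 2 == 0 else "odd"].append(x)

print(dict(groups))  # {'odd': [3, 1, 5, 5], 'even': [2, 6]}
```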
intRDD1 = sc.parallelize([1,2,3,4])
intRDD2 = sc.parallelize([2,3,4])
intRDD3 = sc.parallelize([3,4,5])
# union concatenates the RDDs (duplicates are kept)
intRDD1.union(intRDD2).union(intRDD3).collect()
# intersection: elements common to the RDDs
intRDD1.intersection(intRDD2).intersection(intRDD3).collect()
# subtract: elements of intRDD1 not present in intRDD2
intRDD1.subtract(intRDD2).collect()
# cartesian: Cartesian product (all element pairings, not numeric multiplication)
intRDD1.cartesian(intRDD2).collect()
[1, 2, 3, 4, 2, 3, 4, 3, 4, 5]
[3, 4]
[1]
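Note in the outputs above that union keeps duplicates while intersection de-duplicates; a plain-Python sketch of the three set operations:

```python
# Plain-Python sketch of union / intersection / subtract (no Spark required).
a, b, c = [1, 2, 3, 4], [2, 3, 4], [3, 4, 5]

union = a + b + c                             # union keeps duplicates
inter = sorted(set(a) & set(b) & set(c))      # intersection de-duplicates
subtract = [x for x in a if x not in set(b)]  # elements of a not in b

print(union)     # [1, 2, 3, 4, 2, 3, 4, 3, 4, 5]
print(inter)     # [3, 4]
print(subtract)  # [1]
```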
[(1, 2),
(1, 3),
(1, 4),
(2, 2),
(2, 3),
(2, 4),
(3, 2),
(3, 3),
(3, 4),
(4, 2),
(4, 3),
(4, 4)]
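The cartesian output above is exactly what `itertools.product` yields locally; a quick sketch:

```python
from itertools import product

# Plain-Python sketch of cartesian: every pairing of an element from the
# first collection with one from the second (no multiplication involved).
pairs = list(product([1, 2, 3, 4], [2, 3, 4]))

print(len(pairs))  # 12 pairs: 4 elements x 3 elements
print(pairs[:3])   # [(1, 2), (1, 3), (1, 4)]
```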
RDD Basic Action Operations
# first() returns the first element
intRDD.first()
# take(2) returns the first 2 elements
intRDD.take(2)
# takeOrdered(3): the 3 smallest elements, ascending
intRDD.takeOrdered(3)
# with a negated key: the 3 largest elements, descending
intRDD.takeOrdered(3,key=lambda x:-x)
# summary statistics
intRDD.stats()
display(intRDD.min())
display(intRDD.max())
display(intRDD.stdev())
display(intRDD.count())
display(intRDD.sum())
intRDD.mean()
3
[3, 1]
[1, 2, 3]
[6, 5, 5]
(count: 6, mean: 3.6666666666666665, stdev: 1.7950549357115015, max: 6.0, min: 1.0)
1
6
1.7950549357115015
6
22
3.6666666666666665
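Note that Spark's stdev() divides by n (the population standard deviation), which matches Python's statistics.pstdev rather than statistics.stdev. A quick local check of the figures above:

```python
import statistics

# Local check of the stats() output above (no Spark required).
# Spark's stdev() is the population stdev (divide by n), i.e. pstdev.
data = [3, 1, 2, 5, 5, 6]

print(sum(data))                # 22
print(sum(data) / len(data))    # 3.6666666666666665
print(statistics.pstdev(data))  # 1.7950549357115015
```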
Key-Value RDD Transformation Operations
kvRDD = sc.parallelize([(1,2),(3,4),(4,5)])
kvRDD.collect()
[(1, 2), (3, 4), (4, 5)]
# extract the keys and the values separately
display(kvRDD.keys().collect())
kvRDD.values().collect()
# filter on the key
print(kvRDD.filter(lambda x:x[0]>2).collect())
# filter on the value
kvRDD.filter(lambda x:x[1]>3).collect()
# apply map to the values only
kvRDD.mapValues(lambda x: x*2).collect()
# sort by key
print(kvRDD.sortByKey(ascending=True).collect())
print(kvRDD.sortByKey(ascending=False).collect())
# sort by value
kvRDD.sortBy(lambda x:x[1],ascending=False).collect()
[1, 3, 4]
[2, 4, 5]
[(3, 4), (4, 5)]
[(3, 4), (4, 5)]
[(1, 4), (3, 8), (4, 10)]
[(1, 2), (3, 4), (4, 5)]
[(4, 5), (3, 4), (1, 2)]
[(4, 5), (3, 4), (1, 2)]
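sortByKey and sortBy correspond directly to Python's sorted with a key function; a small local sketch:

```python
# Plain-Python sketch of sortByKey / sortBy (no Spark required).
pairs = [(1, 2), (3, 4), (4, 5)]

by_key_desc = sorted(pairs, key=lambda x: x[0], reverse=True)
by_value_desc = sorted(pairs, key=lambda x: x[1], reverse=True)

print(by_key_desc)    # [(4, 5), (3, 4), (1, 2)]
print(by_value_desc)  # [(4, 5), (3, 4), (1, 2)]
```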
kvRDD1 = sc.parallelize([(1,2),(1,3),(3,4),(3,5),(4,6)])
# reduceByKey merges the values of each key with the given function (here, addition)
kvRDD1.reduceByKey(lambda x,y:x+y).collect()
# join combines the values of matching keys
kvRDD2 = sc.parallelize([(3,8)])
print(kvRDD1.join(kvRDD2).collect())
# leftOuterJoin keeps every key of the left RDD
print(kvRDD1.leftOuterJoin(kvRDD2).collect())
# rightOuterJoin keeps every key of the right RDD
kvRDD1.rightOuterJoin(kvRDD2).collect()
# subtractByKey removes from kvRDD1 the entries whose keys also appear in kvRDD2
kvRDD1.subtractByKey(kvRDD2).collect()
[(4, 6), (1, 5), (3, 9)]
[(3, (4, 8)), (3, (5, 8))]
[(1, (2, None)), (1, (3, None)), (3, (4, 8)), (3, (5, 8)), (4, (6, None))]
[(3, (4, 8)), (3, (5, 8))]
[(1, 2), (1, 3), (4, 6)]
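The reduceByKey and join semantics above can be sketched locally in plain Python:

```python
from collections import defaultdict

# Plain-Python sketch of reduceByKey and join (no Spark required).
kv1 = [(1, 2), (1, 3), (3, 4), (3, 5), (4, 6)]
kv2 = [(3, 8)]

# reduceByKey: fold the values of each key together (here with +)
reduced = defaultdict(int)
for k, v in kv1:
    reduced[k] += v
print(dict(reduced))  # {1: 5, 3: 9, 4: 6}

# join: one output pair per (left value, right value) combination
# of every shared key
joined = [(k1, (v1, v2)) for k1, v1 in kv1 for k2, v2 in kv2 if k1 == k2]
print(joined)         # [(3, (4, 8)), (3, (5, 8))]
```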
Key-Value RDD Action Operations
display(kvRDD1.first())
display(kvRDD1.take(3))
# count the number of entries per key
kvRDD1.countByKey()
# collect as a key-value dictionary (for duplicate keys, the later value wins)
kv = kvRDD1.collectAsMap()
kv
# look up all values for a given key
kvRDD1.lookup(1)
# RDD persistence
intRDD.persist()
# check whether the RDD is marked for caching
intRDD.is_cached
# cancel persistence
intRDD.unpersist()
# e.g. persist to memory, spilling to disk when memory is insufficient
from pyspark import StorageLevel
intRDDdisk = sc.parallelize([3,2,5,3])
intRDDdisk.persist(StorageLevel.MEMORY_AND_DISK)
intRDDdisk.is_cached
intRDDdisk.unpersist()
(1, 2)
[(1, 2), (1, 3), (3, 4)]
defaultdict(int, {1: 2, 3: 2, 4: 1})
{1: 3, 3: 5, 4: 6}
[2, 3]
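countByKey, collectAsMap, and lookup have simple plain-Python counterparts, which make their semantics (including the duplicate-key behavior of collectAsMap) easy to see:

```python
from collections import Counter

# Plain-Python sketch of countByKey, collectAsMap and lookup (no Spark).
kv1 = [(1, 2), (1, 3), (3, 4), (3, 5), (4, 6)]

counts = Counter(k for k, _ in kv1)        # countByKey
as_map = dict(kv1)                         # collectAsMap: later value wins
looked_up = [v for k, v in kv1 if k == 1]  # lookup(1)

print(dict(counts))  # {1: 2, 3: 2, 4: 1}
print(as_map)        # {1: 3, 3: 5, 4: 6}
print(looked_up)     # [2, 3]
```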
ParallelCollectionRDD[15] at parallelize at PythonRDD.scala:475
True
ParallelCollectionRDD[15] at parallelize at PythonRDD.scala:475
ParallelCollectionRDD[263] at parallelize at PythonRDD.scala:475
True
ParallelCollectionRDD[263] at parallelize at PythonRDD.scala:475