RDD Operations
1. Definitions
1.1 transformations
Transformations create a new dataset from an existing one (RDDs are immutable), for example RDDA => RDDB.
Transformations are lazy: in a chain like rdd.map().filter().map().filter(), both map() and filter() trigger no computation. They only record the transformation lineage and produce no result.
1.2 actions
Actions run the computation described by the transformations and return a result value.
Transformation example:
RDDA (1,2,3,4,5) ==> map(_+1) ==> RDDB(2,3,4,5,6)
Action example:
RDDA (1,2,3,4,5) ==> reduce(_ + _) ==> 15
This lazy evaluation lets Spark defer and combine work, which makes the computation more efficient.
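A minimal sketch of the two concepts in the spark-shell (assuming the usual sc provided by the shell; the variable names are only for illustration):
// transformations: only the lineage is recorded, no job runs yet
val rddA = sc.parallelize(1 to 5)
val rddB = rddA.map(_ + 1)          // lazy
val rddC = rddB.filter(_ % 2 == 0)  // lazy
// actions: trigger the actual computation and return a value to the driver
rddC.collect()      // Array(2, 4, 6)
rddA.reduce(_ + _)  // 15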
1.3 persist/cache
An RDD can be persisted in memory or on disk with persist or cache, so that when its elements are needed again later, Spark can reuse them across the cluster much faster.
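A minimal sketch of how this is used (assuming the spark-shell's sc; the file path and the chosen storage level are only illustrative):
import org.apache.spark.storage.StorageLevel
val lines = sc.textFile("file:///opt/scripts/shell_test/test.txt")
lines.cache()                                  // shorthand for persist(StorageLevel.MEMORY_ONLY)
// lines.persist(StorageLevel.MEMORY_AND_DISK) // or pick a storage level explicitly
lines.count()   // the first action reads the file and fills the cache
lines.count()   // later actions reuse the cached data and run much faster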
2. Common RDD operators
1. Transformation operators
1.1 map
Return a new distributed dataset formed by passing each element of the source through a function func.
Applies a function to every element of the RDD and returns a new dataset.
scala> val a = sc.parallelize(1 to 9)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[25] at parallelize at <console>:24
// multiply every element of the RDD by 2
scala> val b = a.map(x => x * 2)
b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[27] at map at <console>:25
// run an action
scala> b.collect
res17: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18)
scala> val a = sc.parallelize(List("dog","tiger","lion","cat","panda"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[28] at parallelize at <console>:24
scala> val b = a.map(x => (x,1))
b: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[29] at map at <console>:25
scala> b.collect
res18: Array[(String, Int)] = Array((dog,1), (tiger,1), (lion,1), (cat,1), (panda,1))
1.2 filter
Return a new dataset formed by selecting those elements of the source on which func returns true.
Filters the elements of the RDD and returns a new dataset containing only those that pass the predicate.
scala> val a = sc.parallelize(1 to 10)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[30] at parallelize at <console>:24
// keep only the even numbers
scala> a.filter(_ % 2 == 0).collect
res20: Array[Int] = Array(2, 4, 6, 8, 10)
scala> val a = sc.parallelize(1 to 6)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[32] at parallelize at <console>:24
// multiply every element of the RDD by 2
scala> val mapRdd = a.map(_ * 2)
mapRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[33] at map at <console>:25
// keep only the elements greater than 5
scala> val filterRDD = mapRdd.filter(_ > 5)
filterRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[34] at filter at <console>:25
// run an action on the RDD
scala> filterRDD.collect
res22: Array[Int] = Array(6, 8, 10, 12)
Chained style:
rdd.map().filter().map().filter().collect
scala> sc.parallelize(1 to 6).map(_ * 2).filter(_ > 5).collect
res23: Array[Int] = Array(6, 8, 10, 12)
1.3 mapValues
Leaves the keys unchanged and applies the function only to the values.
scala> val a = sc.parallelize(List("dog","tiger","lion","cat","panda"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[38] at parallelize at <console>:24
scala> val b = a.map(x => (x,x.length))
b: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[39] at map at <console>:25
scala> b.collect
res27: Array[(String, Int)] = Array((dog,3), (tiger,5), (lion,4), (cat,3), (panda,5))
scala> b.mapValues(_ + 1).collect
res28: Array[(String, Int)] = Array((dog,4), (tiger,6), (lion,5), (cat,4), (panda,6))
1.4 flatMap
Simply put, flatMap applies the function to every element of the collection and then pulls the elements out of each collection the function returns, combining them into a single new collection.
The flattening happens as part of the operation, so it can be understood as flatMap = map + flatten.
scala> sc.parallelize(List(1,2,3,4,5)).flatMap(1 to _).collect
res2: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5)
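To see the map + flatten reading of the same example (a sketch; an RDD has no flatten method, so the flattening step is written as a flatMap over the identity):
val nested = sc.parallelize(List(1,2,3,4,5)).map(1 to _)   // an RDD of ranges: (1), (1,2), (1,2,3), ...
nested.flatMap(x => x).collect                              // same result: Array(1, 1, 2, 1, 2, 3, ...)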
2. Action operators
2.1 count
Counts the number of elements in the dataset.
val a = sc.parallelize(List("dog","tiger","lion","cat","panda"))
scala> a.count
res29: Long = 5
2.2 reduce
Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
Reduces the elements of the collection into a single value with the given function.
// sum the elements
scala> val a = sc.parallelize(1 to 10).reduce(_ + _)
a: Int = 55
// replacing + with - happens to compute 1-2-3-4-...-10 here, but subtraction is neither commutative nor associative, so the result can change with partitioning
scala> val a = sc.parallelize(1 to 10).reduce(_ - _)
a: Int = -53
2.3 first
Return the first element of the dataset (similar to take(1)).
Returns the first element of the dataset.
scala> sc.parallelize(List("dog","tiger","lion","cat","panda")).first
res31: String = dog
// similar to take(1), but not identical: first returns the element itself, take(1) returns an Array
scala> sc.parallelize(List("dog","tiger","lion","cat","panda")).take(1)
res32: Array[String] = Array(dog)
2.4 top
// returns the top N elements, sorted in descending order by default
scala> sc.parallelize(Array(6,7,8,9,10)).top(2)
res34: Array[Int] = Array(10, 9)
scala> sc.parallelize(List("dog","tiger","lion","cat","panda")).top(2)
res35: Array[String] = Array(tiger, panda)
// for ascending order, bring a custom (reversed) Ordering into scope as an implicit
scala> implicit val myOrder = implicitly[Ordering[Int]].reverse
myOrder: scala.math.Ordering[Int] = scala.math.Ordering$$anon$4@5973d19c
scala> sc.parallelize(Array(6,7,8,9,10)).top(2)
res36: Array[Int] = Array(6, 7)
3. Using joins in Spark Core
scala> val a = sc.parallelize(Array(("A","a1"),("B","b1"),("C","c1"),("D","d1"),("E","e1"),("F","f1"),("F","f2")))
a: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> val b = sc.parallelize(Array(("A","a2"),("B","b1"),("C","c2"),("C","c3"),("E","e2"),("F","f1")))
b: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[1] at parallelize at <console>:24
scala> a.join(b).collect
res2: Array[(String, (String, String))] = Array((A,(a1,a2)), (B,(b1,b1)), (C,(c1,c2)), (C,(c1,c3)), (E,(e1,e2)), (F,(f1,f1)), (F,(f2,f1)))
This join behaves like an inner join: it returns only the keys that match on both sides.
scala> a.leftOuterJoin(b).collect
res4: Array[(String, (String, Option[String]))] = Array((A,(a1,Some(a2))), (B,(b1,Some(b1))), (C,(c1,Some(c2))), (C,(c1,Some(c3))), (D,(d1,None)), (E,(e1,Some(e2))), (F,(f1,Some(f1))), (F,(f2,Some(f1))))
leftOuterJoin returns every key from the left side: the left elements always appear, but the matching right element may be missing (None). In short, all of the left side is joined against the right.
scala> a.rightOuterJoin(b).collect
res8: Array[(String, (Option[String], String))] = Array((A,(Some(a1),a2)), (B,(Some(b1),b1)), (C,(Some(c1),c2)), (C,(Some(c1),c3)), (E,(Some(e1),e2)), (F,(Some(f1),f1)), (F,(Some(f2),f1)))
rightOuterJoin is the opposite of leftOuterJoin: it returns every key from the right side; the right elements always appear, while the left element may be missing.
scala> a.fullOuterJoin(b).collect
res9: Array[(String, (Option[String], Option[String]))] = Array((A,(Some(a1),Some(a2))), (B,(Some(b1),Some(b1))), (C,(Some(c1),Some(c2))), (C,(Some(c1),Some(c3))), (D,(Some(d1),None)), (E,(Some(e1),Some(e2))), (F,(Some(f1),Some(f1))), (F,(Some(f2),Some(f1))))
fullOuterJoin keeps all keys from both sides.
4. A word count walkthrough with Spark Core
// read from an external data source
scala> val log = sc.textFile("file:///opt/scripts/shell_test/test.txt")
log: org.apache.spark.rdd.RDD[String] = file:///opt/scripts/shell_test/test.txt MapPartitionsRDD[18] at textFile at <console>:24
// split the file contents on spaces to get individual words
scala> val splits = log.flatMap(_.split(" "))
splits: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[19] at flatMap at <console>:25
// turn each word into (word, 1) so occurrences can be counted
scala> splits.map(x => (x,1)).collect
res22: Array[(String, Int)] = Array((hello,1), (world,1), (hello,1), (world,1), (welcome,1), (hello,1))
// reduceByKey sums the counts for each key
scala> splits.map(x => (x,1)).reduceByKey(_+_).collect
res29: Array[(String, Int)] = Array((hello,3), (welcome,1), (world,2))
// sort the result by count, ascending
scala> splits.map(x => (x,1)).reduceByKey(_+_).sortBy(_._2).collect
res30: Array[(String, Int)] = Array((welcome,1), (world,2), (hello,3))
// to sort descending, first bring a reversed implicit Ordering into scope
scala> implicit val myOrder = implicitly[Ordering[Int]].reverse
myOrder: scala.math.Ordering[Int] = scala.math.Ordering$$anon$4@343627e1
scala> splits.map(x => (x,1)).reduceByKey(_+_).sortBy(_._2).collect
res40: Array[(String, Int)] = Array((hello,3), (world,2), (welcome,1))
During the reduceByKey step, the following happens:
(hello,1) (hello,1) (hello,1) ==> (hello,3)
(world,1) (world,1) ==> (world,2)
(welcome,1) ==> (welcome,1)
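Putting the steps together, the whole word count can be written as one chain (a sketch against the same test.txt used above; descending order is requested here with sortBy's ascending flag instead of the reversed implicit Ordering):
val wordCounts = sc.textFile("file:///opt/scripts/shell_test/test.txt")
  .flatMap(_.split(" "))             // split each line into words
  .map(word => (word, 1))            // mark every occurrence with a count of 1
  .reduceByKey(_ + _)                // sum the counts per word
  .sortBy(_._2, ascending = false)   // order by count, highest first
wordCounts.collect()                 // Array((hello,3), (world,2), (welcome,1))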
5. Additional operators
5.1 subtract
subtract performs a set difference.
Return an RDD with the elements from this that are not in other.
Returns the elements that are in this RDD but not in the other one.
scala> val a = sc.parallelize(1 to 5)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[53] at parallelize at <console>:26
scala> val b = sc.parallelize(2 to 3)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[54] at parallelize at <console>:26
scala> a.subtract(b).collect
res43: Array[Int] = Array(1, 4, 5)
5.2 intersection
Return the intersection of this RDD and another one.
Returns the elements this RDD has in common with another one, i.e. the set intersection.
scala> a.intersection(b).collect
res0: Array[Int] = Array(2, 3)
5.3 cartesian
Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in this and b is in other.
Returns the Cartesian product of this RDD and another one as an RDD of pairs.
scala> a.cartesian(b).collect
res1: Array[(Int, Int)] = Array((1,2), (1,3), (2,2), (2,3), (3,2), (3,3), (4,2), (4,3), (5,2), (5,3))