Spark 5: RDD Operations

This article covers RDD operations in Spark Core. It defines transformations, actions, and persist/cache, walks through commonly used operators such as the map and filter transformations and the count and reduce actions, and then looks at joins in Spark Core, a breakdown of the Word Count example, and additional operators such as subtract and intersection.


RDD Operations

1. Definitions

1.1 transformations

A transformation creates a new dataset from an existing one (RDDs are immutable), for example RDDA => RDDB.

Transformations are lazy. In rdd.map().filter().map().filter(), for example, both map() and filter() are lazy operations: no computation takes place, the transformations are merely recorded, and no result is produced.

1.2 actions

An action executes the RDD built up by the transformations and returns a result value.

Transformation example:

RDDA (1,2,3,4,5) ==> map(_+1) ==> RDDB(2,3,4,5,6)

Action example:

RDDA (1,2,3,4,5) ==> reduce(_ + _) ==> 15

This lazy-evaluation model is what makes Spark's computation more efficient.
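
A quick sketch (not part of the original shell session) of how this laziness plays out in practice:

//building the pipeline returns immediately; nothing is computed yet
val pipeline = sc.parallelize(1 to 5).map(_ + 1).filter(_ > 3)
//only the action triggers the actual computation
pipeline.collect   // Array(4, 5, 6)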

1.3 persist/cache

An RDD can be persisted to memory or disk with persist or cache, so that Spark can reuse its elements much faster when they are needed again at some later point.
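
A minimal sketch of how cache fits into a pipeline (the file path below is hypothetical):

//the path is a placeholder for any external data source
val lines = sc.textFile("file:///path/to/data.txt")
//mark the filtered RDD to be kept in memory once it has been computed
val errors = lines.filter(_.contains("ERROR")).cache()
errors.count()    //the first action computes the RDD and caches it
errors.take(10)   //later actions reuse the cached data instead of re-reading the file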

2. Common RDD Operators

1. Transformation operators
1.1 map

Return a new distributed dataset formed by passing each element of the source through a function func.

Applies a function to every element of the RDD and returns a new dataset.

scala> val a = sc.parallelize(1 to 9)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[25] at parallelize at <console>:24
//multiply every element of the RDD by 2
scala> val b = a.map(x => x * 2)
b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[27] at map at <console>:25
//run an action to trigger the computation
scala> b.collect
res17: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18)
scala> val a = sc.parallelize(List("dog","tiger","lion","cat","panda"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[28] at parallelize at <console>:24

scala> val b = a.map(x => (x,1))
b: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[29] at map at <console>:25

scala> b.collect
res18: Array[(String, Int)] = Array((dog,1), (tiger,1), (lion,1), (cat,1), (panda,1))
1.2 filter

Return a new dataset formed by selecting those elements of the source on which func returns true.

Filters the elements of an RDD according to a predicate and returns a new dataset.

scala> val a = sc.parallelize(1 to 10)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[30] at parallelize at <console>:24
//keep only the even numbers in the RDD
scala> a.filter(_ % 2 == 0).collect
res20: Array[Int] = Array(2, 4, 6, 8, 10)
scala> val a = sc.parallelize(1 to 6)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[32] at parallelize at <console>:24
//multiply each element of the RDD by 2
scala> val mapRdd = a.map(_ * 2)
mapRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[33] at map at <console>:25
//keep only the elements greater than 5
scala> val filterRDD = mapRdd.filter(_ > 5)
filterRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[34] at filter at <console>:25
//run an action to trigger the computation
scala> filterRDD.collect
res22: Array[Int] = Array(6, 8, 10, 12)

Chained calls: rdd.map().filter().map().filter().collect

scala> sc.parallelize(1 to 6).map(_ * 2).filter(_ > 5).collect
res23: Array[Int] = Array(6, 8, 10, 12)
1.3 mapValues

Applies the function only to the values, leaving the keys unchanged.

scala> val a = sc.parallelize(List("dog","tiger","lion","cat","panda"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[38] at parallelize at <console>:24

scala> val b = a.map(x => (x,x.length))

scala> b.collect
res27: Array[(String, Int)] = Array((dog,3), (tiger,5), (lion,4), (cat,3), (panda,5))

scala> b.mapValues(_ + 1).collect
res28: Array[(String, Int)] = Array((dog,4), (tiger,6), (lion,5), (cat,4), (panda,6))
1.4 flatMap

In short, flatMap applies the function to every element of the sequence and then takes the elements out of each collection the function produces to build a single new collection.

The result is flattened along the way, so you can think of it as flatMap = flatten + map.

scala> sc.parallelize(List(1,2,3,4,5)).flatMap(1 to _).collect
res2: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5)
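
flatMap is also the usual way to split lines of text into words, which is exactly what the Word Count example in section 4 relies on. A small sketch (not from the original session):

//split each line into words and flatten the per-line arrays into a single RDD
sc.parallelize(List("hello world", "hello spark")).flatMap(_.split(" ")).collect
//expected result: Array(hello, world, hello, spark)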
2. Action operators
2.1 count

Counts the number of elements in the dataset.

scala> val a = sc.parallelize(List("dog","tiger","lion","cat","panda"))
scala> a.count
res29: Long = 5
2.2 reduce

Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

Performs a reduction over the elements of the dataset.

//sum the elements
scala> val a = sc.parallelize(1 to 10).reduce(_ + _)
a: Int = 55

//swapping + for - turns the computation into 1-2-3-4-...-10
scala> val a = sc.parallelize(1 to 10).reduce(_ - _)
a: Int = -53
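
As the quoted documentation notes, the function passed to reduce should be commutative and associative; subtraction is neither, so the result above is only guaranteed when the data sits in a single partition. A minimal sketch:

//a single partition is reduced strictly left to right: 1-2-3-...-10 = -53
sc.parallelize(1 to 10, 1).reduce(_ - _)
//with several partitions, each partition is reduced first and the partial results
//are then combined, so the value may differ
sc.parallelize(1 to 10, 4).reduce(_ - _)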
2.3 first

Return the first element of the dataset (similar to take(1)).

Returns the first element of the dataset.

scala> sc.parallelize(List("dog","tiger","lion","cat","panda")).first
res31: String = dog
//similar to take(1), except that first returns the element itself while take(1) returns an Array
scala> sc.parallelize(List("dog","tiger","lion","cat","panda")).take(1)
res32: Array[String] = Array(dog)
2.4 top
//take the top N elements; by default they are returned in descending order
scala> sc.parallelize(Array(6,7,8,9,10)).top(2)
res34: Array[Int] = Array(10, 9)

scala> sc.parallelize(List("dog","tiger","lion","cat","panda")).top(2)
res35: Array[String] = Array(tiger, panda)
//for ascending order, supply a custom Ordering through an implicit value
scala> implicit val myOrder = implicitly[Ordering[Int]].reverse
myOrder: scala.math.Ordering[Int] = scala.math.Ordering$$anon$4@5973d19c

scala> sc.parallelize(Array(6,7,8,9,10)).top(2)
res36: Array[Int] = Array(6, 7)

3. Using Join in Spark Core

scala> val a = sc.parallelize(Array(("A","a1"),("B","b1"),("C","c1"),("D","d1"),("E","e1"),("F","f1"),("F","f2")))
a: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[0] at parallelize at <console>:24

val b = sc.parallelize(Array(("A","a2"),("B","b1"),("C","c2"),("C","c3"),("E","e2"),("F","f1")))
b: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[1] at parallelize at <console>:24

scala> a.join(b).collect
res2: Array[(String, (String, String))] = Array((A,(a1,a2)), (B,(b1,b1)), (C,(c1,c2)), (C,(c1,c3)), (E,(e1,e2)), (F,(f1,f1)), (F,(f2,f1)))

Here join behaves like an inner join: only the keys that match on both sides are returned.

scala> a.leftOuterJoin(b).collect
res4: Array[(String, (String, Option[String]))] = Array((A,(a1,Some(a2))), (B,(b1,Some(b1))), (C,(c1,Some(c2))), (C,(c1,Some(c3))), (D,(d1,None)), (E,(e1,Some(e2))), (F,(f1,Some(f1))), (F,(f2,Some(f1))))

leftOuterJoin returns everything from the left side: every left key appears in the result, while the matching right value may or may not exist. In short, all of the left RDD is joined against the right.

scala> a.rightOuterJoin(b).collect
res8: Array[(String, (Option[String], String))] = Array((A,(Some(a1),a2)), (B,(Some(b1),b1)), (C,(Some(c1),c2)), (C,(Some(c1),c3)), (E,(Some(e1),e2)), (F,(Some(f1),f1)), (F,(Some(f2),f1)))

rightOuterJoin is the opposite of leftOuterJoin: it returns everything from the right side, so every right key appears in the result, while the left value may be missing.

scala> a.fullOuterJoin(b).collect
res9: Array[(String, (Option[String], Option[String]))] = Array((A,(Some(a1),Some(a2))), (B,(Some(b1),Some(b1))), (C,(Some(c1),Some(c2))), (C,(Some(c1),Some(c3))), (D,(Some(d1),None)), (E,(Some(e1),Some(e2))), (F,(Some(f1),Some(f1))), (F,(Some(f2),Some(f1))))

fullOuterJoin keeps the keys from both sides; a value missing on either side shows up as None.

4. A Breakdown of Word Count in Spark Core

//read from an external data source
scala> val log = sc.textFile("file:///opt/scripts/shell_test/test.txt")
log: org.apache.spark.rdd.RDD[String] = file:///opt/scripts/shell_test/test.txt MapPartitionsRDD[18] at textFile at <console>:24
//split each line on spaces into words
scala> val splits = log.flatMap(_.split(" "))
//map every word to (word, 1) so occurrences can be counted
scala> splits.map(x => (x,1)).collect
res22: Array[(String, Int)] = Array((hello,1), (world,1), (hello,1), (world,1), (welcome,1), (hello,1))
//sum the counts per key with reduceByKey
scala> splits.map(x => (x,1)).reduceByKey(_+_).collect
res29: Array[(String, Int)] = Array((hello,3), (welcome,1), (world,2))
//sort the result by count in ascending order
scala> splits.map(x => (x,1)).reduceByKey(_+_).sortBy(_._2).collect
res30: Array[(String, Int)] = Array((welcome,1), (world,2), (hello,3))
//to sort in descending order, first define a reversed implicit Ordering
scala> implicit val myOrder = implicitly[Ordering[Int]].reverse
myOrder: scala.math.Ordering[Int] = scala.math.Ordering$$anon$4@343627e1

scala> splits.map(x => (x,1)).reduceByKey(_+_).sortBy(_._2).collect
res40: Array[(String, Int)] = Array((hello,3), (world,2), (welcome,1))

The reduceByKey step performs the following aggregation:

(hello,1) (hello,1) (hello,1) ==> (hello,3)

(world,1) (world,1) ==> (world,2)

(welcome,1) ==> (welcome,1)
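
Conceptually this per-key aggregation is like grouping by key and then summing each group, although reduceByKey also combines values on the map side before the shuffle, which is why it is usually preferred. A minimal sketch of the equivalence (not from the original session):

//reduceByKey: combine the values for each key, partially on the map side before the shuffle
splits.map(x => (x, 1)).reduceByKey(_ + _)
//conceptually equivalent, but groups all values first and only then sums them
splits.map(x => (x, 1)).groupByKey().mapValues(_.sum)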

5. Additional Operators

5.1 subtract

subtract performs a set-subtraction operation.

Return an RDD with the elements from this that are not in other.

Returns an RDD containing the elements that are in this RDD but not in the other.

scala> val a = sc.parallelize(1 to 5)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[53] at parallelize at <console>:26

scala> val b = sc.parallelize(2 to 3)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[54] at parallelize at <console>:26

scala> a.subtract(b).collect
res43: Array[Int] = Array(1, 4, 5)
5.2 intersection

Return the intersection of this RDD and another one.

Returns the elements this RDD has in common with another one; in short, the set intersection.

scala> a.intersection(b).collect
res0: Array[Int] = Array(2, 3)
5.3 cartesian

Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in this and b is in the other.

Returns the key-value pairs that make up the Cartesian product of this RDD and another RDD.

scala> a.cartesian(b).collect
res1: Array[(Int, Int)] = Array((1,2), (1,3), (2,2), (2,3), (3,2), (3,3), (4,2), (4,3), (5,2), (5,3))