RDD Operators, Part 2

This article works through a series of examples of basic and more advanced Apache Spark RDD operations, including data partitioning, aggregation, key-value processing, and set operations, to help readers understand how the Spark RDD API is used.

http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html

// First, print out the contents of the RDD together with partition labels
scala> val z = sc.parallelize(List(1,2,3,4,5,6), 2)
z: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> def myfunc(index: Int, iter: Iterator[(Int)]) : Iterator[String] = {
     |   iter.map(x => "[partID:" +  index + ", val: " + x + "]")
     | }
myfunc: (index: Int, iter: Iterator[Int])Iterator[String]

scala> z.mapPartitionsWithIndex(myfunc).collect
res0: Array[String] = Array([partID:0, val: 1], [partID:0, val: 2], [partID:0, val: 3], [partID:1, val: 4], [partID:1, val: 5], [partID:1, val: 6])
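
mapPartitionsWithIndex runs the supplied function once per partition, passing the partition index and an iterator over that partition's elements, which makes it handy for inspecting how data is distributed. A minimal sketch (the helper name is illustrative) that counts elements per partition instead of printing them:

// Hypothetical helper: report how many elements land in each partition
def partitionSizes(index: Int, iter: Iterator[Int]): Iterator[(Int, Int)] =
  Iterator((index, iter.size))

z.mapPartitionsWithIndex(partitionSizes).collect   // e.g. Array((0,3), (1,3))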

scala> z.aggregate(0)(math.max(_, _), _ + _)
res1: Int = 9                                                                   

scala> z.aggregate(5)(math.max(_, _), _ + _)
res2: Int = 16
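
aggregate applies the first function (seqOp) inside each partition, starting from the zero value, and the second function (combOp) across partition results, again starting from the zero value. With zero = 0 the two partitions contribute maxima 3 and 6, so the sum is 9; with zero = 5 the per-partition maxima become 5 and 6, and the combiner's own zero adds another 5, giving 5 + 5 + 6 = 16. A local plain-Scala sketch (partitions modeled as lists) tracing the second result:

// Simulate z's two partitions to trace aggregate(5)(math.max(_, _), _ + _)
val parts = List(List(1, 2, 3), List(4, 5, 6))
val perPartition = parts.map(_.foldLeft(5)(math.max))   // List(5, 6)
val combined = perPartition.foldLeft(5)(_ + _)          // 5 + 5 + 6 = 16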

scala> val z = sc.parallelize(List("a","b","c","d","e","f"),2)
z: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[2] at parallelize at <console>:24

scala> def myfunc(index: Int, iter: Iterator[(String)]) : Iterator[String] = {
     |   iter.map(x => "[partID:" +  index + ", val: " + x + "]")
     | }
myfunc: (index: Int, iter: Iterator[String])Iterator[String]

scala> z.mapPartitionsWithIndex(myfunc).collect
res3: Array[String] = Array([partID:0, val: a], [partID:0, val: b], [partID:0, val: c], [partID:1, val: d], [partID:1, val: e], [partID:1, val: f])

scala> z.aggregate("")(_ + _, _+_)
res4: String = defabc

scala> z.aggregate("")(_ + _, _+_)
res5: String = defabc

scala> z.aggregate("")(_ + _, _+_)
res6: String = defabc

scala> z.aggregate("x")(_ + _, _+_)
res7: String = xxabcxdef
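
The string results can differ between runs because partition results are merged in whatever order the tasks finish, which is why both "defabc" and "abcdef" are possible. With zero value "x", each partition result is prefixed with "x" and the combiner prepends one more. A one-line trace of that assembly (the last two operands may swap between runs):

val assembled = "x" + ("x" + "abc") + ("x" + "def")   // "xxabcxdef"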

scala> val z = sc.parallelize(List("12","23","345","4567"),2)
z: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[4] at parallelize at <console>:24

scala> z.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)
res8: String = 42

scala> z.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)
res9: String = 42

scala> z.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)
res10: String = 24

scala> z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res11: String = 11

scala> z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res12: String = 11

scala> z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res13: String = 11

scala> val z = sc.parallelize(List("12","23","345",""),2)
z: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[5] at parallelize at <console>:24

scala> z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res14: String = 10

scala> z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res15: String = 10

scala> z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res16: String = 10

scala> z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res17: String = 01

scala> val z = sc.parallelize(List("12","23","","345"),2)
z: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[6] at parallelize at <console>:24

scala> z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res18: String = 11                                                              
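
In these length-based examples seqOp compares string lengths and returns the result as a string, so the running accumulator is always a one-character string whose own length feeds the next comparison; that is why an empty element can drag a partition's result down to "0", and why the final output is just the two partition digits concatenated in either order. A local trace of the last case, assuming the same two-element partitions:

// Trace res18: partition 0 holds "12","23"; partition 1 holds "","345"
val p0 = List("12", "23").foldLeft("")((x, y) => math.min(x.length, y.length).toString)  // "1"
val p1 = List("", "345").foldLeft("")((x, y) => math.min(x.length, y.length).toString)   // "1"
val out = "" + p0 + p1   // "11" either way, since both partitions yield "1"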

scala> val pairRDD = sc.parallelize(List( ("cat",2), ("cat", 5), ("mouse", 4),("cat", 12), ("dog", 12), ("mouse", 2)), 2)
pairRDD: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[7] at parallelize at <console>:24

scala> def myfunc(index: Int, iter: Iterator[(String, Int)]) : Iterator[String] = {
     |   iter.map(x => "[partID:" +  index + ", val: " + x + "]")
     | }
myfunc: (index: Int, iter: Iterator[(String, Int)])Iterator[String]

scala> pairRDD.mapPartitionsWithIndex(myfunc).collect
res19: Array[String] = Array([partID:0, val: (cat,2)], [partID:0, val: (cat,5)], [partID:0, val: (mouse,4)], [partID:1, val: (cat,12)], [partID:1, val: (dog,12)], [partID:1, val: (mouse,2)])

scala> pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect
res20: Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))              

scala> pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect
res21: Array[(String, Int)] = Array((dog,100), (cat,200), (mouse,200))
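
aggregateByKey applies the zero value once per key per partition but, unlike aggregate, never in the cross-partition combine. With zero = 0, the per-partition maxima summed across partitions give cat 5 + 12 = 17; with zero = 100 every per-partition maximum is clamped up to 100, so keys present in both partitions sum to 200 while dog, present in only one, stays at 100. A small sketch computing per-key (sum, count) in one pass, assuming the same pairRDD:

// Per-key (sum, count) with aggregateByKey; the zero value is (0, 0)
val sumCount = pairRDD.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),      // within a partition
  (a, b)   => (a._1 + b._1, a._2 + b._2)     // across partitions
)
sumCount.collect   // e.g. Array((dog,(12,1)), (cat,(19,3)), (mouse,(6,2)))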

scala> val x = sc.parallelize(List(1,2,3,4,5))
x: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11] at parallelize at <console>:24

scala> val y = sc.parallelize(List(6,7,8,9,10))
y: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at parallelize at <console>:24

scala> x.cartesian(y).collect
res22: Array[(Int, Int)] = Array((1,6), (1,7), (1,8), (1,9), (1,10), (2,6), (2,7), (2,8), (2,9), (2,10), (3,6), (3,7), (3,8), (3,9), (3,10), (4,6), (5,6), (4,7), (5,7), (4,8), (5,8), (4,9), (4,10), (5,9), (5,10))
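
cartesian pairs every element of the first RDD with every element of the second, so the output size is the product of the two counts and the ordering shown above is not guaranteed. Worth keeping in mind before calling it on large inputs:

// Result size grows multiplicatively: 5 * 5 = 25 pairs here
x.cartesian(y).count   // 25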

scala> val y = sc.parallelize(1 to 10, 10)
y: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[14] at parallelize at <console>:24

scala> val z = y.coalesce(2, false)
z: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[15] at coalesce at <console>:26

scala> z.partitions.length
res23: Int = 2
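
coalesce(n, false) merges existing partitions without a shuffle, so it can only decrease the partition count; passing true (or calling repartition) triggers a shuffle and can also increase it. A quick check of both behaviors, assuming the same y:

y.coalesce(20, false).partitions.length   // stays 10: no shuffle, cannot grow
y.coalesce(20, true).partitions.length    // 20: the shuffle allows increasing
y.repartition(4).partitions.length        // 4: repartition always shuffles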

scala> val a = sc.parallelize(List(1, 2, 1, 3), 1)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at parallelize at <console>:24

scala> val b = a.map((_, "b"))
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[17] at map at <console>:26

scala> val c = a.map((_, "c"))
c: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[18] at map at <console>:26

scala> b.cogroup(c).collect
res24: Array[(Int, (Iterable[String], Iterable[String]))] = Array((1,(CompactBuffer(b, b),CompactBuffer(c, c))), (3,(CompactBuffer(b),CompactBuffer(c))), (2,(CompactBuffer(b),CompactBuffer(c))))

scala> val b = a.map((_, "b")).collect
b: Array[(Int, String)] = Array((1,b), (2,b), (1,b), (3,b))

scala> val c = a.map((_, "c")).collect
c: Array[(Int, String)] = Array((1,c), (2,c), (1,c), (3,c))

scala> val d = a.map((_, "d"))
d: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[23] at map at <console>:26

scala> val a = sc.parallelize(List(1, 2, 1, 3), 1)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[24] at parallelize at <console>:24

scala> val b = a.map((_, "b"))
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[25] at map at <console>:26

scala> val c = a.map((_, "c"))
c: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[26] at map at <console>:26

scala> val d = a.map((_, "d"))
d: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[27] at map at <console>:26

scala> b.cogroup(c, d).collect
res26: Array[(Int, (Iterable[String], Iterable[String], Iterable[String]))] = Array((1,(CompactBuffer(b, b),CompactBuffer(c, c),CompactBuffer(d, d))), (3,(CompactBuffer(b),CompactBuffer(c),CompactBuffer(d))), (2,(CompactBuffer(b),CompactBuffer(c),CompactBuffer(d))))

scala> val x = sc.parallelize(List((1, "apple"), (2, "banana"), (3, "orange"), (4, "kiwi")), 2)
x: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[30] at parallelize at <console>:24

scala> val y = sc.parallelize(List((5, "computer"), (1, "laptop"), (1, "desktop"), (4, "iPad")), 2)
y: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[31] at parallelize at <console>:24

scala> x.cogroup(y).collect
res27: Array[(Int, (Iterable[String], Iterable[String]))] = Array((4,(CompactBuffer(kiwi),CompactBuffer(iPad))), (2,(CompactBuffer(banana),CompactBuffer())), (1,(CompactBuffer(apple),CompactBuffer(laptop, desktop))), (3,(CompactBuffer(orange),CompactBuffer())), (5,(CompactBuffer(),CompactBuffer(computer))))
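
cogroup groups the values for each key from every participating RDD into a tuple of Iterables; keys missing on one side show up with an empty buffer, which is what distinguishes it from an inner join. A sketch turning the cogrouped result into a full-outer-join-like view, assuming the same x and y:

// Expand the cogrouped buffers into (key, (Option[left], Option[right])) pairs
val full = x.cogroup(y).flatMap { case (k, (ls, rs)) =>
  if (ls.isEmpty)      rs.map(r => (k, (Option.empty[String], Option(r))))
  else if (rs.isEmpty) ls.map(l => (k, (Option(l), Option.empty[String])))
  else                 for (l <- ls; r <- rs) yield (k, (Option(l), Option(r)))
}
full.collect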

scala> val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[34] at parallelize at <console>:24

scala> c.collect()
res32: Array[String] = Array(Gnu, Cat, Rat, Dog, Gnu, Rat)

scala> val a = sc.parallelize(List(1, 2, 1, 3), 1)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[35] at parallelize at <console>:24

scala> a.zip(a)
res33: org.apache.spark.rdd.RDD[(Int, Int)] = ZippedPartitionsRDD2[36] at zip at <console>:27

scala> a.zip(a).collect
res34: Array[(Int, Int)] = Array((1,1), (2,2), (1,1), (3,3))

scala> val b = a.zip(a)
b: org.apache.spark.rdd.RDD[(Int, Int)] = ZippedPartitionsRDD2[38] at zip at <console>:26

scala> b.collectAsMap
res35: scala.collection.Map[Int,Int] = Map(2 -> 2, 1 -> 1, 3 -> 3)
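
collectAsMap pulls the whole pair RDD to the driver as a Map, so duplicate keys collapse and only one value per key survives; that is why the result above has three entries even though the RDD has four elements. A small check, assuming the same b:

b.count                 // 4 elements in the RDD
b.collectAsMap.size     // 3: the duplicate key 1 keeps only one value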

scala> val a = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[39] at parallelize at <console>:24

scala> val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[40] at parallelize at <console>:24

scala> val c = b.zip(a)
c: org.apache.spark.rdd.RDD[(Int, String)] = ZippedPartitionsRDD2[41] at zip at <console>:28

scala> c.collect
res36: Array[(Int, String)] = Array((1,dog), (1,cat), (2,gnu), (2,salmon), (2,rabbit), (1,turkey), (2,wolf), (2,bear), (2,bee))

scala> val d = c.combineByKey(List(_), (x:List[String], y:String) => y :: x, (x:List[String], y:List[String]) => x ::: y)
d: org.apache.spark.rdd.RDD[(Int, List[String])] = ShuffledRDD[42] at combineByKey at <console>:34

scala> d.collect
res37: Array[(Int, List[String])] = Array((1,List(cat, dog, turkey)), (2,List(rabbit, salmon, gnu, bee, bear, wolf)))
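
combineByKey takes three functions: createCombiner turns the first value seen for a key within a partition into an accumulator, mergeValue folds further values of that key into the accumulator, and mergeCombiners joins accumulators from different partitions. A common variation is a per-key average; a minimal sketch assuming a hypothetical scores RDD:

// Per-key average via combineByKey: the accumulator is (sum, count)
val scores = sc.parallelize(List(("cat", 2.0), ("cat", 5.0), ("mouse", 4.0), ("mouse", 2.0)), 2)
val avg = scores.combineByKey(
  (v: Double) => (v, 1),                                             // createCombiner
  (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1),       // mergeValue
  (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2) // mergeCombiners
).mapValues { case (sum, n) => sum / n }
avg.collect   // e.g. Array((cat,3.5), (mouse,3.0))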

scala> val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[43] at parallelize at <console>:24

scala> c.context
res38: org.apache.spark.SparkContext = org.apache.spark.SparkContext@2895e34d

scala> val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[44] at parallelize at <console>:24

scala> c.count
res39: Long = 4

scala> val a = sc.parallelize(1 to 10000, 20)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[45] at parallelize at <console>:24

scala> val b = a++a++a++a++a
b: org.apache.spark.rdd.RDD[Int] = UnionRDD[49] at $plus$plus at <console>:26

scala> b.count
res40: Long = 50000                                                             

scala> b.countApproxDistinct(0.1)
res41: Long = 8224                                                              

scala> b.countApproxDistinct(0.05)
res42: Long = 9760                                                              

scala> b.countApproxDistinct(0.01)
res43: Long = 9947                                                              

scala> b.countApproxDistinct(0.001)
res44: Long = 10000                                                             
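
countApproxDistinct estimates the number of distinct elements with a HyperLogLog-style sketch; the argument is the target relative standard deviation, so smaller values use more memory but land closer to the exact answer of 10000 (b is five copies of 1 to 10000 unioned together). An exact count requires deduplicating first, at the cost of a shuffle:

// Exact alternative: deduplicate, then count (more expensive than the sketch)
b.distinct.count   // 10000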

scala> val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[50] at parallelize at <console>:24

scala> b.countByValue
res45: scala.collection.Map[Int,Long] = Map(5 -> 1, 1 -> 6, 6 -> 1, 2 -> 3, 7 -> 1, 3 -> 1, 8 -> 1, 4 -> 2)
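
countByValue returns the frequency of every distinct element as a Map on the driver, so it is only appropriate when the number of distinct values is small; for high-cardinality data the same counts can stay distributed. A sketch of the distributed equivalent, assuming the same b:

// Distributed equivalent of countByValue: keep the counts in an RDD
val counts = b.map(v => (v, 1L)).reduceByKey(_ + _)
counts.collect   // e.g. Array((1,6), (2,3), (4,2), ...)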

 
