1、map vs mapPartitions
map: applies the function to each element ==> it runs once per element
mapPartitions: applies the function to each partition ==> it runs once per partition
==> For expensive setup such as creating a database connection or a heavyweight object, prefer mapPartitions (see the connection sketch after the example below)
mapPartitionsWithIndex: same as mapPartitions, but also passes in the partition index
val rdd = sc.parallelize(List(1, 2, 3), 2)
rdd.map(x => {
  println("map iteration ----------") // runs three times: once per element
}).collect()
rdd.mapPartitions(partition => {
  println("mapPartitions iteration -----------------") // runs twice: once per partition
  partition.filter(_ < 0)
}).collect()
rdd.mapPartitionsWithIndex((index, partition) => {
  partition.map(x => println("partition " + index + ": " + x)) // prints each element's partition index
}).collect()
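As promised above, a minimal sketch of the "expensive setup" case: with mapPartitions the connection is created once per partition instead of once per element. Conn and createConnection() are hypothetical stand-ins for a real database client.
// hypothetical stand-in for a real database client; replace with your own
class Conn {
  def save(x: Int): Int = { println("saving " + x); x }
  def close(): Unit = println("connection closed")
}
def createConnection(): Conn = new Conn
rdd.mapPartitions(partition => {
  val conn = createConnection() // opened once per partition, not once per element
  val out = partition.map(conn.save).toList // materialize before closing: the iterator is lazy
  conn.close()
  out.iterator
}).collect()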
2、filter: keep the elements that satisfy a predicate
Chained conditions: rdd.filter(x).filter(y)... ==> filter(x && y && ...)
val rdd = sc.parallelize(List(1, 2, 3))
rdd.filter(_ % 2 != 0).filter(_ > 1).foreach(println(_))
rdd.filter(x => x % 2 != 0 && x > 1).foreach(println(_))
3、zip: both RDDs must have the same number of elements in each partition and the same number of partitions
val rdd1 = sc.parallelize(List(1, 2, 3))
val rdd2 = sc.parallelize(List("a", "b", "c"))
rdd1.zip(rdd2).foreach(println(_))
// different element counts ==> Can only zip RDDs with same number of elements in each partition
val rdd3 = sc.parallelize(List("a", "b", "c", "d"))
rdd1.zip(rdd3).foreach(println(_))
// different partition counts ==> Can't zip RDDs with unequal numbers of partitions: List(1, 3)
val rdd4 = sc.parallelize(List("a", "b", "c"), 3)
rdd1.zip(rdd4).foreach(println(_))
rdd1.zipWithIndex().collect() // like rdd1.zip(0,1,2,...) ==> assigns an index to each element (see the sketch below)
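To make that comment concrete: zipWithIndex behaves like zipping with a range RDD, as long as both sides slice into partitions identically. A rough sketch; the identical-slicing assumption holds here because parallelize splits two in-memory sequences of equal length into equal-sized slices:
val indices = sc.parallelize(0L until 3L, rdd1.partitions.length)
rdd1.zip(indices).collect() // Array((1,0), (2,1), (3,2)), same pairs as rdd1.zipWithIndex()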
4、set operations: union, intersection, subtract
val rddLeft = sc.parallelize(List(1, 2, 3, 4, 5))
val rddRight = sc.parallelize(List(4, 5, 6, 7, 8, 8))
rddLeft.union(rddRight).collect() // union: no shuffle, no dedup; result partitions = left partitions + right partitions
rddLeft.intersection(rddRight).collect() // intersection: deduplicated; built on cogroup under the hood (see the sketch below)
rddLeft.subtract(rddRight).collect() // difference: elements of rddLeft that do not appear in rddRight
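A rough reconstruction of the intersection comment above, built on cogroup; the real source differs in details, but the shape is this:
rddLeft.map((_, null))
  .cogroup(rddRight.map((_, null)))
  .filter { case (_, (left, right)) => left.nonEmpty && right.nonEmpty }
  .keys
  .collect() // 4 and 5, deduplicated just like intersection itself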
5、distinct: deduplication
How would you deduplicate without the distinct operator?
val rddRight = sc.parallelize(List(4, 5, 6, 7, 8, 8))
rddRight.distinct().collect() // deduplicate
// without distinct:
rddRight.map((_, null)).reduceByKey((x, y) => x).map(_._1).collect()
rddRight.map((_, null)).groupByKey().map(_._1).collect()
// how distinct itself is implemented in the source:
// rdd.map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
6、sorting
sortBy: available on any RDD
sortByKey: available on key-value RDDs, brought in through an implicit conversion
Ascending by default; pass ascending = false for descending order
val rdd = sc.parallelize(List(7, 5, 7, 6, 8, 8))
rdd.sortBy(x => x).collect()
val kvrdd = sc.parallelize(List(("a", 12), ("c", 20), ("b", 1)))
kvrdd.sortByKey().collect()
// ways to sort kvrdd by value instead of by key:
kvrdd.sortBy(_._2).collect() // option 1: sortBy on the value
kvrdd.map(x => (x._2, x._1)).sortByKey().map(x => (x._2, x._1)).collect() // option 2: swap key and value, sortByKey, swap back
kvrdd.sortBy(_._2, false).collect() // descending, the standard way: pass ascending = false
kvrdd.sortBy(-_._2).collect() // descending by negating the key; a handy shortcut for numeric values
7、reduceByKey vs groupByKey
reduceByKey: applies the function to all values of the same key; it does a local (map-side) combine within each partition before the shuffle
groupByKey: groups values by key; every element is shuffled, with no local combine
==> Both can implement a word count, but reduceByKey shuffles less data (see the aggregateByKey sketch after the example below)
val rdd = sc.textFile("file:///opt/mydata/olddata/data1.txt") //file: 4.0K
rdd.flatMap(_.split("\t")).map((_, 1))
.groupByKey().mapValues(_.sum).collect() //Shuffle Read:200B
rdd.flatMap(_.split("\t")).map((_, 1))
.reduceByKey(_ + _).collect() //Shuffle Read:198.0 B
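The two-stage combine that makes reduceByKey cheaper can be spelled out with aggregateByKey, where the map-side and reduce-side steps are separate parameters:
rdd.flatMap(_.split("\t")).map((_, 1))
  .aggregateByKey(0)(
    (localSum, v) => localSum + v, // seqOp: runs inside each partition, before the shuffle
    (sum1, sum2) => sum1 + sum2 // combOp: merges partial sums after the shuffle
  ).collect()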
8、join: inner, left outer, right outer, full outer
val rddLeft = sc.parallelize(List(("a", "hz"), ("b", "sh"), ("c", "bj")))
val rddRight = sc.parallelize(List(("a", 1), ("b", 2), ("d", 3)))
rddLeft.join(rddRight).collect() // inner join; built on cogroup under the hood
rddLeft.leftOuterJoin(rddRight).collect() // left outer join
rddLeft.rightOuterJoin(rddRight).collect() // right outer join
rddLeft.fullOuterJoin(rddRight).collect() // full outer join (see the Option-flattening sketch below)
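The outer joins wrap the possibly-missing side in an Option, so a common follow-up is to flatten it; the default value 0 below is an arbitrary choice for this sketch:
rddLeft.leftOuterJoin(rddRight)
  .mapValues { case (city, idOpt) => (city, idOpt.getOrElse(0)) }
  .collect() // includes ("c", ("bj", 0)) because "c" has no match on the right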
9、repartitioning
coalesce(numPartitions: Int, shuffle: Boolean = false): shrinks the partition count by default (no shuffle); pass shuffle = true to be able to increase it (with a shuffle)
repartition: calls coalesce(numPartitions, shuffle = true) under the hood ==> shuffles whether growing or shrinking
val rdd = sc.parallelize(List(7, 5, 7, 6, 8, 8), 3)
println("原始分区数========>" + rdd.partitions.length)
rdd.mapPartitionsWithIndex((index, partition) => {
partition.map(x => println(index + "==>" + x))
}).collect()
println(
  "coalesce to 2 ========> " +
    rdd.coalesce(2).partitions.length // shrinking: no shuffle needed
)
println(
  "coalesce to 3 with shuffle ========> " +
    rdd.coalesce(3, true).partitions.length // pass true to allow increasing the partition count
)
println(
  "repartition to 4 ========> " +
    rdd.repartition(4).partitions.length // calls coalesce(numPartitions, shuffle = true) under the hood
)
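One detail worth checking: without shuffle = true, coalesce can only shrink, so asking for more partitions than the current 3 silently keeps the count unchanged:
println(
  "coalesce to 4, no shuffle ========> " +
    rdd.coalesce(4).partitions.length // still 3: growing requires shuffle = true
)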