coalesce()
def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T]
This function repartitions an RDD; when a shuffle is performed, the data is redistributed using a HashPartitioner.
The first parameter is the target number of partitions; the second indicates whether to perform a shuffle, and defaults to false.
Passing only the first parameter reduces the number of partitions in the RDD to numPartitions, so numPartitions must be smaller than the RDD's current partition count.
If numPartitions is greater than the RDD's current partition count, the second parameter must be set to true; otherwise the call has no effect and the partition count stays unchanged.
scala> val data = sc.textFile("/data/spark_rdd.txt", 2)
data: org.apache.spark.rdd.RDD[String] = /data/spark_rdd.txt MapPartitionsRDD[19] at textFile at <console>:24

scala> data.partitions.size
res11: Int = 2

scala> val new_data = data.coalesce(1)
new_data: org.apache.spark.rdd.RDD[String] = CoalescedRDD[20] at coalesce at <console>:26

scala> new_data.partitions.size
res12: Int = 1

scala> val new_data = data.coalesce(4)
new_data: org.apache.spark.rdd.RDD[String] = CoalescedRDD[21] at coalesce at <console>:26

scala> new_data.partitions.size
res13: Int = 2

scala> val new_data = data.coalesce(4, true)
new_data: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[25] at coalesce at <console>:26

scala> new_data.partitions.size
res14: Int = 4
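For completeness, the same three cases can be sketched as a small standalone application instead of the shell. This is a minimal sketch; the object name, the local[4] master URL, and the parallelize-based input are assumptions chosen for illustration:

import org.apache.spark.{SparkConf, SparkContext}

object CoalesceExample {
  def main(args: Array[String]): Unit = {
    // Hypothetical local setup, for illustration only
    val conf = new SparkConf().setAppName("CoalesceExample").setMaster("local[4]")
    val sc = new SparkContext(conf)

    val data = sc.parallelize(1 to 100, 8)  // start with 8 partitions

    // Narrowing the partition count works without a shuffle
    println(data.coalesce(2).partitions.length)                   // 2

    // Growing the count with shuffle = false is silently ignored
    println(data.coalesce(16).partitions.length)                  // 8

    // With shuffle = true the partition count can actually grow
    println(data.coalesce(16, shuffle = true).partitions.length)  // 16

    sc.stop()
  }
}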
repartition()
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]
Reshuffles the data in the RDD to create either more or fewer partitions and balances the data across them, keeping each partition roughly the same size.
This operation always shuffles all of the data over the network.
This function is in fact implemented as coalesce() with its second parameter set to true:

scala> val data = sc.textFile("/data/spark_rdd.txt", 2)
data: org.apache.spark.rdd.RDD[String] = /data/spark_rdd.txt MapPartitionsRDD[1] at textFile at <console>:24

scala> data.partitions.size
res0: Int = 2

scala> val data_1 = data.repartition(1)
data_1: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[5] at repartition at <console>:26

scala> data_1.partitions.size
res2: Int = 1

scala> val data_2 = data.repartition(4)
data_2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at repartition at <console>:26

scala> data_2.partitions.size
res3: Int = 4
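Since repartition(n) simply delegates to coalesce(n, shuffle = true), the two calls are interchangeable when increasing the partition count. Below is a minimal sketch verifying this; the object name, local master URL, and synthetic input are assumptions for illustration:

import org.apache.spark.{SparkConf, SparkContext}

object RepartitionExample {
  def main(args: Array[String]): Unit = {
    // Hypothetical local setup, for illustration only
    val sc = new SparkContext(
      new SparkConf().setAppName("RepartitionExample").setMaster("local[4]"))

    val data = sc.parallelize(1 to 100, 2)  // start with 2 partitions

    // Both calls shuffle the data and yield 4 partitions
    println(data.repartition(4).partitions.length)               // 4
    println(data.coalesce(4, shuffle = true).partitions.length)  // 4

    sc.stop()
  }
}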