Spark RDD Operations (2): coalesce and repartition

Latest recommended article published on 2021-01-12 23:00:38


License: CC 4.0 BY-SA


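This article examines how the Spark RDD operators coalesce and repartition work and when to use each. It compares how they redistribute data and adjust the number of partitions, so that readers can choose the right operator for a given scenario and process data more efficiently.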


1. Concepts

First, here is how the official documentation describes these two operators:

coalesce(numPartitions)	Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
repartition(numPartitions)	Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.

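coalesce
Reduces the number of partitions in the RDD to numPartitions. It is useful for running later operations more efficiently after a filter has shrunk a large dataset.
repartition
Redistributes the data in the RDD to create more or fewer partitions and balances the data across them. It always performs a full shuffle, so all of the data moves across the network.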

1.1 Differences

Both operators change how an RDD is partitioned. repartition is simply a convenience wrapper that calls coalesce with shuffle = true (see the source code in section 1.2).

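coalesce merges the existing partitions of an RDD. For example, if the RDD has 5 partitions and we call coalesce(3), the result has 3 partitions. With the default shuffle = false, coalesce can only reduce the partition count; asking for more partitions than the RDD already has leaves the count unchanged.
repartition redistributes the data of the RDD through a shuffle and can either increase or decrease the partition count to the specified number. For example, if the RDD has 5 partitions and we call repartition(10), the result has 10 partitions.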
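
The partition counts described above can be checked directly with getNumPartitions. The following is a minimal sketch that assumes a local Spark setup; the object name PartitionCountSketch, the helper counts, and the sample data are hypothetical and chosen only for illustration. It also covers the case where coalesce is asked for more partitions than the RDD has.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, not part of the original article.
object PartitionCountSketch {
  // Partition counts after, in order:
  // coalesce(3), coalesce(10), coalesce(10, shuffle = true), repartition(10)
  def counts(sc: SparkContext): Seq[Int] = {
    val rdd = sc.parallelize(1 to 100, 5) // start with 5 partitions
    Seq(
      rdd.coalesce(3).getNumPartitions,                  // merge down to 3
      rdd.coalesce(10).getNumPartitions,                 // no shuffle: stays at 5
      rdd.coalesce(10, shuffle = true).getNumPartitions, // shuffle: grows to 10
      rdd.repartition(10).getNumPartitions               // same as the line above
    )
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PartitionCountSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(counts(sc).mkString(", ")) // prints 3, 5, 10, 10
    finally sc.stop()
  }
}
```

The second value is the one to watch: without a shuffle, coalesce never splits an input partition, so it cannot produce more partitions than it was given.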

1.2 How the two operators relate in the source code

repartition

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }

coalesce

def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](
          mapPartitionsWithIndexInternal(distributePartition, isOrderSensitive = true),
          new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }

As the source shows, repartition is just a thin wrapper around coalesce with shuffle = true.
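
The key assignment in the shuffle branch can be tried out without Spark. The sketch below is a plain Scala imitation, not Spark's code: assignKeys copies the logic of distributePartition, while targetPartition is a stand-in that assumes HashPartitioner takes the non-negative remainder of the key's hash code (for an Int, the value itself, as the source comment notes). All names in RoundRobinSketch are hypothetical.

```scala
import scala.util.Random
import scala.util.hashing.byteswap32

// Hypothetical, Spark-free imitation of the shuffle branch of coalesce.
object RoundRobinSketch {
  // Same idea as distributePartition: a start position seeded by the
  // input partition index, then incremented by one for every element.
  def assignKeys[T](index: Int, items: Seq[T], numPartitions: Int): Seq[(Int, T)] = {
    var position = new Random(byteswap32(index)).nextInt(numPartitions)
    items.map { t =>
      position = position + 1
      (position, t)
    }
  }

  // Stand-in for HashPartitioner (assumption): non-negative key modulo numPartitions.
  def targetPartition(key: Int, numPartitions: Int): Int = {
    val raw = key % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Number of elements of one input partition that land in each output partition.
  def partitionSizes[T](index: Int, items: Seq[T], numPartitions: Int): Map[Int, Int] =
    assignKeys(index, items, numPartitions)
      .groupBy { case (key, _) => targetPartition(key, numPartitions) }
      .map { case (partition, pairs) => partition -> pairs.size }

  def main(args: Array[String]): Unit = {
    val sizes = partitionSizes(0, Seq("a", "b", "c", "d", "e", "f", "g"), 5)
    println(sizes.toSeq.sortBy(_._1))
  }
}
```

Because the keys of one input partition are consecutive integers, its elements are dealt out round-robin, so the per-output-partition counts for that input partition differ by at most one.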

2. Hands-on examples

Specifying the number of RDD partitions

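import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer

object RDDOpAPP {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RDDOpAPP").setMaster("local[2]")
    val sc = new SparkContext(conf)

    val stus = new ListBuffer[String]()

    for (i <- 1 to 10) {
      stus.append("employee ID: " + i)
    }
    // Distribute the 10 employees across 3 partitions
    val ss = sc.parallelize(stus, 3)
    // Tag each element with the 1-based index of the partition that holds it
    ss.mapPartitionsWithIndex((index, partition) => {
      val emps = new ListBuffer[String]
      while (partition.hasNext) {
        emps += ("~~~" + partition.next() + " , original dept:[" + (index + 1) + "]")
      }
      emps.iterator
    }).foreach(println)

    sc.stop()
  }
}
/**
Output (lines from different partitions may interleave differently between runs):
~~~employee ID: 1 , original dept:[1]
~~~employee ID: 4 , original dept:[2]
~~~employee ID: 2 , original dept:[1]
~~~employee ID: 5 , original dept:[2]
~~~employee ID: 6 , original dept:[2]
~~~employee ID: 3 , original dept:[1]
~~~employee ID: 7 , original dept:[3]
~~~employee ID: 8 , original dept:[3]
~~~employee ID: 9 , original dept:[3]
~~~employee ID: 10 , original dept:[3]
**/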

As the output shows, we specified 3 partitions for the employee list, so the employees are distributed across 3 departments.
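
Since foreach(println) runs inside the tasks, lines from different partitions can interleave (and on a cluster they go to the executors' output rather than the driver's). A way to inspect the layout on the driver is glom(), which turns each partition into an array. The following is a minimal sketch assuming a local Spark setup; GlomSketch and partitionContents are hypothetical names.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, not part of the original article.
object GlomSketch {
  // One inner array per partition, in partition order.
  def partitionContents(sc: SparkContext): Array[Array[String]] = {
    val stus = (1 to 10).map(i => "employee ID: " + i)
    sc.parallelize(stus, 3).glom().collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("GlomSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      partitionContents(sc).zipWithIndex.foreach { case (part, i) =>
        println(s"partition ${i + 1}: ${part.mkString(", ")}")
      }
    } finally sc.stop()
  }
}
```

This collects every element to the driver, so it is only suitable for small datasets like the one in this example.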

coalesce

val ss = sc.parallelize(stus, 3)
// Merge the 3 partitions into 2; index is now the partition index after coalesce
ss.coalesce(2).mapPartitionsWithIndex((index, partition) => {
  val emps = new ListBuffer[String]
  while (partition.hasNext) {
    emps += ("~~~" + partition.next() + " , new dept:[" + (index + 1) + "]")
  }
  emps.iterator
}).foreach(println)
/*
~~~employee ID: 4 , new dept:[2]
~~~employee ID: 1 , new dept:[1]
~~~employee ID: 5 , new dept:[2]
~~~employee ID: 2 , new dept:[1]
~~~employee ID: 6 , new dept:[2]
~~~employee ID: 3 , new dept:[1]
~~~employee ID: 7 , new dept:[2]
~~~employee ID: 8 , new dept:[2]
~~~employee ID: 9 , new dept:[2]
~~~employee ID: 10 , new dept:[2]
*/

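coalesce merges the existing partitions: in the output above, employees 4 to 10, which previously sat in partitions 2 and 3, all end up in partition 2.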
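
To see which original partitions feed each merged partition, every element can be tagged with its partition index before and after the coalesce. The following is a minimal sketch assuming a local Spark setup; CoalesceMergeSketch and mapping are hypothetical names, and plain integers stand in for the employee strings.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, not part of the original article.
object CoalesceMergeSketch {
  // One (newPartition, originalPartition) pair per element, both 1-based.
  def mapping(sc: SparkContext): Array[(Int, Int)] =
    sc.parallelize(1 to 10, 3)
      .mapPartitionsWithIndex((orig, it) => it.map(_ => orig + 1)) // remember the original index
      .coalesce(2)                                                 // merge without a shuffle
      .mapPartitionsWithIndex((cur, it) => it.map(orig => (cur + 1, orig)))
      .collect()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CoalesceMergeSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try mapping(sc).distinct.foreach { case (cur, orig) =>
      println(s"original partition $orig -> new partition $cur")
    } finally sc.stop()
  }
}
```

Each original partition maps to exactly one new partition, which is the property that lets coalesce avoid a shuffle: whole partitions are grouped, never split.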

repartition

val ss = sc.parallelize(stus, 3)
// Reshuffle the 3 partitions into 5; index is now the partition index after repartition
ss.repartition(5).mapPartitionsWithIndex((index, partition) => {
  val emps = new ListBuffer[String]
  while (partition.hasNext) {
    emps += ("~~~" + partition.next() + " , new dept:[" + (index + 1) + "]")
  }
  emps.iterator
}).foreach(println)

/*
~~~employee ID: 6 , new dept:[1]
~~~employee ID: 1 , new dept:[2]
~~~employee ID: 10 , new dept:[1]
~~~employee ID: 3 , new dept:[4]
~~~employee ID: 4 , new dept:[4]
~~~employee ID: 8 , new dept:[4]
~~~employee ID: 2 , new dept:[3]
~~~employee ID: 7 , new dept:[3]
~~~employee ID: 5 , new dept:[5]
~~~employee ID: 9 , new dept:[5]
*/

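repartition redistributes the data across the new partitions through a shuffle, so elements that shared an original partition (for example employees 1, 2, and 3) can end up in different partitions.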
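
Whether a shuffle is involved can also be read from the RDD lineage. The following is a minimal sketch assuming a local Spark setup; LineageSketch and lineages are hypothetical names. It relies on toDebugString listing the RDD classes in the lineage, so the repartition lineage should mention ShuffledRDD while the plain coalesce lineage should not.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, not part of the original article.
object LineageSketch {
  // (lineage of coalesce(2), lineage of repartition(5)) for the same parent RDD.
  def lineages(sc: SparkContext): (String, String) = {
    val rdd = sc.parallelize(1 to 10, 3)
    (rdd.coalesce(2).toDebugString, rdd.repartition(5).toDebugString)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LineageSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (coalesced, repartitioned) = lineages(sc)
      println("coalesce(2):\n" + coalesced)
      println("repartition(5):\n" + repartitioned)
    } finally sc.stop()
  }
}
```

In practice this is the main trade-off: coalesce without a shuffle is cheap but can leave partitions unevenly sized, while repartition pays for a full shuffle to balance the data.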