Shrinking Partitions with coalesce and Growing Partitions with repartition on Spark RDDs

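This article covers how to adjust the number of partitions of an RDD in Spark: coalesce to reduce the partition count and repartition to increase it. It walks through the parameters of both methods and their source code, to help readers balance task load and avoid data skew.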

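The RDD is one of Spark's core data structures. In day-to-day use, if each partition holds very little data but there are too many partitions, Spark launches a large number of tasks and resource usage goes up. Conversely, if the data volume is large but there are only a few partitions, the executors run few tasks, but each task is big and takes longer to finish. So we want to adjust the partition count until the number of tasks, and the size of each task, is moderate.
We usually use coalesce to reduce the number of partitions and repartition to increase it.
The source of coalesce is as follows: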

def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  if (shuffle) {
    /** Distributes elements evenly across output partitions, starting from a random partition. */
    val distributePartition = (index: Int, items: Iterator[T]) => {
      var position = (new Random(index)).nextInt(numPartitions)
      items.map { t =>
        // Note that the hash code of the key will just be the key itself. The HashPartitioner
        // will mod it with the number of total partitions.
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]

    // include a shuffle step so that our upstream tasks are still distributed
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
      new HashPartitioner(numPartitions)),
      numPartitions).values
  } else {
    new CoalescedRDD(this, numPartitions)
  }
}

The source of repartition is as follows:

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}

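The coalesce source takes two parameters: the first is the number of partitions after repartitioning, and the second controls whether a shuffle is performed. In more detail:
1. The first parameter is the target number of partitions.
2. The second parameter says whether to shuffle (false means no shuffle, true means shuffle; the default is false).
3. Without a shuffle, several existing partitions are merged directly into the remaining ones, so the resulting partitions can hold different amounts of data, which may cause data skew.
4. With a shuffle, elements are redistributed roughly evenly across the new partitions, which avoids that skew.
5. Calling coalesce with a larger partition count and shuffle = false has no effect: nothing is redistributed and the number of partitions does not grow. To increase the partition count with coalesce, shuffle must be set to true.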
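
The points above can be checked with a small local job. The following is a minimal sketch, assuming spark-core is on the classpath; the object name `CoalesceDemo`, the helper `partitionCounts`, the `local[2]` master, and the sample data are all made up for illustration and are not from the original article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; names and sample data are illustrative only.
object CoalesceDemo {

  // Builds an 8-partition RDD and reports the partition count after each operation.
  def partitionCounts(sc: SparkContext): Seq[(String, Int)] = {
    val rdd = sc.parallelize(1 to 1000, 8)
    Seq(
      "original"                     -> rdd.partitions.length,
      "coalesce(2)"                  -> rdd.coalesce(2).partitions.length,
      // Growing without a shuffle is ignored: the count stays at 8.
      "coalesce(16)"                 -> rdd.coalesce(16).partitions.length,
      "coalesce(16, shuffle = true)" -> rdd.coalesce(16, shuffle = true).partitions.length,
      "repartition(16)"              -> rdd.repartition(16).partitions.length
    )
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("coalesce-demo")
    val sc = new SparkContext(conf)
    try {
      partitionCounts(sc).foreach { case (op, n) => println(s"$op -> $n partitions") }

      // glom() turns each partition into an array, so this prints elements per partition.
      val sizes = sc.parallelize(1 to 1000, 8).coalesce(2).glom().map(_.length).collect()
      println(s"elements per partition after coalesce(2): ${sizes.mkString(", ")}")
    } finally {
      sc.stop()
    }
  }
}
```

The partition counts should come out as 8, 2, 8, 16 and 16, in that order. The third value is the case described in point 5: asking for more partitions without a shuffle leaves the RDD at its current partition count.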

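In the repartition source, the work is delegated to coalesce. The only difference is that shuffle is fixed to true, so the data ends up balanced after repartitioning.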
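
To see why the shuffle path balances the data, the key assignment in the quoted `distributePartition` can be simulated without Spark. This is only a sketch under stated assumptions: `RoundRobinSketch`, `assignKeys`, and `simulate` are hypothetical names, the input sizes are invented, and a plain `key % numPartitions` stands in for `HashPartitioner` (valid here because the keys are non-negative Ints, whose hash code is the value itself).

```scala
import scala.util.Random

// Hypothetical, Spark-free simulation of the key assignment shown in the coalesce source.
object RoundRobinSketch {

  // Same idea as distributePartition: start at a pseudo-random position seeded by
  // the partition index, then hand out consecutive integer keys.
  def assignKeys[T](index: Int, items: Iterator[T], numPartitions: Int): Iterator[(Int, T)] = {
    var position = new Random(index).nextInt(numPartitions)
    items.map { t =>
      position += 1
      (position, t)
    }
  }

  // inputSizes(i) is the number of elements in input partition i.
  // Returns how many elements land in each output partition.
  def simulate(inputSizes: Seq[Int], numPartitions: Int): Array[Int] = {
    val out = Array.fill(numPartitions)(0)
    inputSizes.zipWithIndex.foreach { case (size, index) =>
      assignKeys(index, Iterator.range(0, size), numPartitions).foreach { case (key, _) =>
        out(key % numPartitions) += 1 // stand-in for HashPartitioner on a non-negative Int key
      }
    }
    out
  }

  def main(args: Array[String]): Unit = {
    // Heavily skewed input: 1000, 10 and 1 elements in three partitions.
    println(simulate(Seq(1000, 10, 1), 4).mkString(", "))
  }
}
```

Because each input partition hands out consecutive keys, its elements are spread round-robin over the output partitions, so every input partition contributes at most one extra element to any single output. With the three skewed inputs above, the four output sizes differ by at most 2.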