Spark-Core: Shuffle Explained

This article explains Spark's shuffle mechanism in detail: what triggers it, the background, its performance impact, and how data can be redistributed with operators such as coalesce and repartition. It also walks through code examples of the ByKey operators, including reduceByKey, groupByKey, and aggregateByKey.



1.Shuffle operations

Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.


Example: call records ==> how many calls were made this month (a count); this is essentially a word-count style aggregation.

(day + caller, 1) ==> reduceByKey (to get the result); records with the same "day + caller" key ==> shuffled to the same reduce task, where they are summed.

So for data keyed by some characteristic, records sharing the same key must be sent to the same node before the accumulation can happen.
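
A minimal sketch of this in Scala (the callRecords data below is made up purely for illustration; assume a running SparkContext sc):

// Hypothetical call records: (day, caller) pairs -- sample values only.
val callRecords = sc.parallelize(Seq(
  ("2019-06-01", "138xxxx0001"),
  ("2019-06-01", "138xxxx0001"),
  ("2019-06-02", "138xxxx0002")
))

// (day + caller, 1) ==> reduceByKey: records with the same key are shuffled
// to the same reduce task and summed there.
val callCounts = callRecords
  .map { case (day, caller) => ((day, caller), 1) }
  .reduceByKey(_ + _)

callCounts.collect().foreach(println)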

Why is the shuffle a costly and complex operation?

Because it involves copying data across executors and machines, which brings disk I/O, data serialization, and network I/O.

2.Background

To understand what happens during the shuffle we can consider the example of the reduceByKey operation. The reduceByKey operation generates a new RDD where all values for a single key are combined into a tuple - the key and the result of executing a reduce function against all values associated with that key. The challenge is that not all values for a single key necessarily reside on the same partition, or even the same machine, but they must be co-located to compute the result.


Operations which can cause a shuffle include repartition operations like repartition and coalesce, ‘ByKey operations (except for counting) like groupByKey and reduceByKey, and join operations like cogroup and join.

Performance Impact

The Shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O. To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark’s map and reduce operations.


Internally, results from individual map tasks are kept in memory until they can’t fit. Then, these are sorted based on the target partition and written to a single file. On the reduce side, tasks read the relevant sorted blocks.


Certain shuffle operations can consume significant amounts of heap memory since they employ in-memory data structures to organize records before or after transferring them. Specifically, reduceByKey and aggregateByKey create these structures on the map side, and 'ByKey operations generate these on the reduce side. When data does not fit in memory Spark will spill these tables to disk, incurring the additional overhead of disk I/O and increased garbage collection.


Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected. This is done so the shuffle files don’t need to be re-created if the lineage is re-computed. Garbage collection may happen only after a long period of time, if the application retains references to these RDDs or if GC does not kick in frequently. This means that long-running Spark jobs may consume a large amount of disk space. The temporary storage directory is specified by the spark.local.dir configuration parameter when configuring the Spark context.

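For example, when building the SparkContext you can point that temporary directory at a dedicated disk (a sketch; the path below is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Shuffle and spill files are written under spark.local.dir (example path -- adjust for your environment).
val conf = new SparkConf()
  .setAppName("ShuffleLocalDirDemo")
  .setMaster("local[2]")
  .set("spark.local.dir", "/data/spark-tmp")

val sc = new SparkContext(conf)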

3.Shuffle test

scala> val info = sc.textFile("file:///opt/scripts/test_data/test.txt")
info: org.apache.spark.rdd.RDD[String] = file:///opt/scripts/test_data/test.txt MapPartitionsRDD[1] at textFile at <console>:24
//use coalesce to change the RDD's partition count to 4
scala> val info3 = info.coalesce(4,true)
info3: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at coalesce at <console>:25
//check the RDD's partition count
scala> info3.partitions.length
res1: Int = 4
//run an action to trigger the job (convenient for observing it in the Web UI)
scala> info3.collect
res2: Array[String] = Array(world       welcome hello, hello    world   hello) 

In the Web UI you can then see two stages: because we called coalesce with shuffle = true, the data is shuffled into the new partitions. Using the repartition operator also triggers a shuffle; in essence, both operators redistribute the data.

Tip: if you want to reduce the number of partitions of an RDD, prefer coalesce (with its default shuffle = false), because it can avoid the shuffle altogether.
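
A quick comparison, continuing with the info RDD from the shell session above:

// Reducing partitions with coalesce (default shuffle = false) is a narrow dependency: no shuffle, no extra stage.
val narrowed = info.coalesce(2)

// Increasing partitions requires a shuffle; repartition(n) is simply coalesce(n, shuffle = true).
val widened = info.repartition(8)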

4.Usage in code

A Spark application written in Scala to demonstrate the shuffle.

The mapPartitionsWithIndex operator gives access to each partition's index (id).

package com.ruozedata.spark.scala

import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable.ListBuffer

object ReparitionApp {

  def main (args: Array[String]): Unit = {

    val sparkConf = new SparkConf()
    sparkConf.setAppName("ReparitionApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    val onepiece = sc.parallelize(List("路飞","山治","索隆","乔巴","娜美","乌索普","罗宾","布鲁克"),3)

    onepiece.mapPartitionsWithIndex((index,partition) => {

      val people = new ListBuffer[String]
      while(partition.hasNext){
        people += ("小组编号:" + (index + 1) + "---成员:" + partition.next())
      }

      people.iterator

    }).foreach(println)

    sc.stop()
  }

}

Using coalesce to set the number of partitions to 2:

package com.ruozedata.spark.scala

import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable.ListBuffer

object ReparitionApp {

  def main (args: Array[String]): Unit = {

    val sparkConf = new SparkConf()
    sparkConf.setAppName("ReparitionApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    val onepiece = sc.parallelize(List("路飞","山治","索隆","乔巴","娜美","乌索普","罗宾","布鲁克"),3)

    onepiece.coalesce(2).mapPartitionsWithIndex((index,partition) => {

      val people = new ListBuffer[String]
      while(partition.hasNext){
        people += ("小组编号:" + (index + 1) + "---成员:" + partition.next())
      }

      people.iterator

    }).foreach(println)

    sc.stop()
  }

}

Using repartition to set the number of partitions to 4:

package com.ruozedata.spark.scala

import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable.ListBuffer

object ReparitionApp {

  def main (args: Array[String]): Unit = {

    val sparkConf = new SparkConf()
    sparkConf.setAppName("ReparitionApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    val onepiece = sc.parallelize(List("路飞","山治","索隆","乔巴","娜美","乌索普","罗宾","布鲁克","甚平"),3)

    onepiece.repartition(4).mapPartitionsWithIndex((index,partition) => {

      val people = new ListBuffer[String]
      while(partition.hasNext){
        people += ("小组编号:" + (index + 1) + "---成员:" + partition.next())
      }

      people.iterator

    }).foreach(println)

    sc.stop()
  }

}

The main use of coalesce is merging small files. For example, suppose an RDD has 300 partitions and we apply a filter. filter is a narrow-dependency operation, so the partition count does not change; after filtering there are still 300 partitions, and since the partition count determines the number of output files, we would write a huge number of files, each holding very little data - the classic small-files problem. In that situation we can use coalesce to shrink the number of partitions; conversely, repartition can be used to raise the partition count and increase parallelism.
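
A sketch of that small-files pattern (the paths and the filter predicate below are hypothetical):

// filter is a narrow dependency, so the partition count stays the same (e.g. 300),
// even though most partitions now hold very little data.
val errors = sc.textFile("hdfs:///logs/app/2019/06/*")
  .filter(_.contains("ERROR"))

// Shrink to a handful of partitions before writing, so we don't emit 300 tiny files.
errors.coalesce(10).saveAsTextFile("hdfs:///output/app-errors")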

5.Using the ByKey operators

scala>sc.textFile("file:///opt/scripts/test_data/test.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey(_+_).collect()
res4: Array[(String, Int)] = Array((hello,3), (welcome,1), (world,2))

In the Web UI we can see that a shuffle happens when the reduceByKey operator is reached, splitting the job into two stages.

//the reduceByKey operator
scala> sc.textFile("file:///opt/scripts/test_data/test.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey(_+_)
res4: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[18] at reduceByKey at <console>:25
//the groupByKey operator
scala> sc.textFile("file:///opt/scripts/test_data/test.txt").flatMap(_.split("\t")).map((_,1)).groupByKey()
res5: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[18] at groupByKey at <console>:25

After reduceByKey the RDD's type is org.apache.spark.rdd.RDD[(String, Int)], whereas after groupByKey it is org.apache.spark.rdd.RDD[(String, Iterable[Int])]. groupByKey's output shape resembles the parameters of a Reducer in MapReduce, where the values for a key arrive as a sequence like (1, 1, 1, 1, 1).

scala> sc.textFile("file:///opt/scripts/test_data/test.txt").flatMap(_.split("\t")).map((_,1)).groupByKey().collect.foreach(println)
(hello,CompactBuffer(1, 1, 1))
(welcome,CompactBuffer(1))
(world,CompactBuffer(1, 1))

To get the word count from here, we just need to sum the values for each key:

scala>     sc.textFile("file:///opt/scripts/test_data/test.txt").flatMap(_.split("\t")).map((_,1)).groupByKey().map(i => (i._1,i._2.sum)).collect()
res11: Array[(String, Int)] = Array((hello,3), (welcome,1), (world,2))

Now look at the execution of the groupByKey job in the Web UI. The difference between the two is that reduceByKey first performs a map-side combine and shuffles only the combined results, while groupByKey shuffles the raw records directly. That is why reduceByKey shuffles less data than groupByKey.

6.A look at the underlying source code

  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
  }

  def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
    // groupByKey shouldn't use map side combine because map side combine does not
    // reduce the amount of data shuffled and requires all map side data be inserted
    // into a hash table, leading to more objects in the old gen.
    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
    val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
    bufs.asInstanceOf[RDD[(K, Iterable[V])]]
  }

Looking at the underlying source, we can see that both operators call combineByKeyWithClassTag; the difference is that groupByKey passes mapSideCombine = false, i.e. it performs no map-side combine.
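
For comparison, here is the same word count written directly against combineByKey (which delegates to combineByKeyWithClassTag); a sketch that spells out the three functions reduceByKey passes:

val wc = sc.textFile("file:///opt/scripts/test_data/test.txt")
  .flatMap(_.split("\t"))
  .map((_, 1))
  .combineByKey(
    (v: Int) => v,                 // createCombiner: first value seen for a key in a partition
    (c: Int, v: Int) => c + v,     // mergeValue: map-side combine within a partition
    (c1: Int, c2: Int) => c1 + c2  // mergeCombiners: merge partial results after the shuffle
  )

wc.collect()   // Array((hello,3), (welcome,1), (world,2)), as before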

7.The aggregateByKey operator

scala>     sc.textFile("file:///opt/scripts/test_data/test.txt").flatMap(_.split("\t")).map((_,1)).aggregateByKey(0)(_+_,_+_).collect
res26: Array[(String, Int)] = Array((hello,3), (welcome,1), (world,2))

aggregateByKey aggregates the values that share the same key in a pair RDD, starting from a neutral initial value - the 0 inside the parentheses of aggregateByKey(0). The two functions that follow (both _+_ in the example above) are the within-partition aggregation and the cross-partition merge, respectively.
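
A small sketch that makes the two functions easier to tell apart, using something other than plain summation (computing the maximum value per key; the pairs data is made up):

val pairs = sc.parallelize(Seq(("a", 3), ("a", 7), ("b", 5)), 2)

val maxPerKey = pairs.aggregateByKey(Int.MinValue)(
  (acc, v) => math.max(acc, v),         // within a partition
  (acc1, acc2) => math.max(acc1, acc2)  // merging partial results across partitions
)

maxPerKey.collect()   // Array((a,7), (b,5)) -- order may vary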
