spark-core_21: Source Code Analysis of SortShuffleManager Initialization

License: CC 4.0 BY-SA

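This article examines how Spark's SortShuffleManager works and what it gains over HashShuffleManager. SortShuffleManager improves shuffle performance by reducing the number of disk files and by simplifying how shuffle data is read.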

1. When SparkEnv is initialized, SortShuffleManager is instantiated by reflection as the default shuffle manager

// Let the user specify short names for shuffle managers.
// "sort" is the default, so sort-based shuffle carries the data between mappers and reducers.
val shortShuffleMgrNames = Map(
  "hash" -> "org.apache.spark.shuffle.hash.HashShuffleManager",
  "sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager",
  "tungsten-sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager")
val shuffleMgrName = conf.get("spark.shuffle.manager", "sort")
val shuffleMgrClass = shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase, shuffleMgrName)
val shuffleManager = instantiateClass[ShuffleManager](shuffleMgrClass)
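
The lookup above can be tried without a Spark installation. What follows is a minimal sketch, not Spark's own code: `ShuffleManagerNames` and `resolveShuffleManagerClass` are hypothetical names, and the map only mirrors the three aliases shown in the snippet. It illustrates that short names are matched case-insensitively, while any other value is passed through unchanged and treated as a fully qualified class name.

```scala
// Sketch of the short-name resolution done in SparkEnv (object and method names are hypothetical).
object ShuffleManagerNames {
  // The same three aliases as in the snippet above.
  val shortShuffleMgrNames: Map[String, String] = Map(
    "hash" -> "org.apache.spark.shuffle.hash.HashShuffleManager",
    "sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager",
    "tungsten-sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager")

  // `configured` stands in for the value of spark.shuffle.manager; None means "not set".
  def resolveShuffleManagerClass(configured: Option[String]): String = {
    val name = configured.getOrElse("sort")
    // Aliases are looked up in lower case; unknown names fall through as class names.
    shortShuffleMgrNames.getOrElse(name.toLowerCase, name)
  }
}
```

With this sketch, `resolveShuffleManagerClass(Some("SORT"))` and `resolveShuffleManagerClass(None)` both resolve to the SortShuffleManager class name, whereas a value such as `com.example.MyShuffleManager` (a made-up class) is returned as-is for reflection to load.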

2. First, look at ShuffleManager, the trait that SortShuffleManager extends

/**
 * Pluggable interface for shuffle systems. A ShuffleManager is created in SparkEnv on the driver
 * and on each executor, based on the spark.shuffle.manager setting. The driver registers shuffles
 * with it, and executors (or tasks running locally in the driver) can ask to read and write data.
 *
 * NOTE: this will be instantiated by SparkEnv so its constructor can take a SparkConf and
 * boolean isDriver as parameters.
 *
 * Before Spark 1.2 the default ShuffleManager was HashShuffleManager, which had two shortcomings:
 * 1. Too many shuffle files. They put pressure on the file system and lower IO throughput.
 *    Number of files generated = map tasks * reduce tasks.
 * 2. Every bucket needs its own DiskBlockObjectWriter, and the buffer memory held by those writers
 *    grows with the number of mappers and reducers. With many reducers the buffer overhead
 *    becomes substantial.
 * spark.shuffle.consolidateFiles was introduced later. It writes buckets of the same kind (those
 * destined for the same reducer) into one shared file, which brings the file count down to
 * number of cores * reduce tasks.
 *
 * From Spark 1.2 on, the default ShuffleManager is SortShuffleManager. Its advantage:
 * a task may still produce many temporary disk files during the shuffle, but at the end all of
 * them are merged into a single disk file, so each ShuffleMapTask outputs exactly one data file.
 * When a shuffle read task of the next stage fetches its data, it uses the index to read only
 * its own portion of each of those files.
 *
 * A summary from other readers:
 * The main job of ShuffleManager is to bridge mapper output and reducer input, so getWriter and
 * getReader are its main interfaces.
 * Overall flow:
 *  1) Consumer side: when a stage depends on the result of a shuffle map stage, the dependency is
 *     identified while the DAG is split into stages and is registered with the shuffleManager.
 *  2) Producer side: when a shuffle map task finishes, it registers its result with the
 *     shuffleManager and reports that it is done.
 *  3) In this way the shuffleManager connects the two ends of the shuffle.
 */
private[spark] trait ShuffleManager {
  /**
   * Register a shuffle with the manager and obtain a handle for it to pass to tasks.
   */
  def registerShuffle[K, V, C](
      shuffleId: Int,
      numMaps: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle

  /** Get a writer for a given partition. Called on executors by map tasks. */
  def getWriter[K, V](handle: ShuffleHandle, mapId: Int, context: TaskContext): ShuffleWriter[K, V]

  /**
   * Get a reader for a range of reduce partitions (startPartition to endPartition-1, inclusive).
   * Called on executors by reduce tasks.
   */
  def getReader[K, C](
      handle: ShuffleHandle,
      startPartition: Int,
      endPartition: Int,
      context: TaskContext): ShuffleReader[K, C]

  /**
   * Remove a shuffle's metadata from the ShuffleManager.
   * @return true if the metadata removed successfully, otherwise false.
   */
  def unregisterShuffle(shuffleId: Int): Boolean

  /**
   * Return a resolver capable of retrieving shuffle block data based on block coordinates.
   */
  def shuffleBlockResolver: ShuffleBlockResolver

  /** Shut down this ShuffleManager. */
  def stop(): Unit
}
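
The file-count formulas quoted in the comment on ShuffleManager are easier to compare with concrete numbers. The sketch below only encodes that arithmetic; `ShuffleFileCounts` and its method names are made up for illustration, and the index files that the sort-based path also writes are deliberately left out of the count.

```scala
// Back-of-the-envelope file counts for the three strategies described above
// (illustrative helper, not part of Spark).
object ShuffleFileCounts {
  // HashShuffleManager without consolidation: one file per (map task, reduce task) pair.
  def hashShuffle(mapTasks: Long, reduceTasks: Long): Long = mapTasks * reduceTasks

  // HashShuffleManager with spark.shuffle.consolidateFiles: one file per (core, reduce task) pair.
  def consolidatedHashShuffle(cores: Long, reduceTasks: Long): Long = cores * reduceTasks

  // SortShuffleManager: one merged data file per map task (index files not counted).
  def sortShuffle(mapTasks: Long): Long = mapTasks
}
```

For a job with 1000 map tasks, 200 reduce tasks and 16 cores, the three functions give 200000, 3200 and 1000 files respectively, which is the reduction the comment refers to.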

3. Next, a brief look at SortShuffleManager and its comments

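/**
 * Each reducer can take input from several maps. The reducer fetches its block from every map
 * output; this process is the shuffle, and each shuffle is identified by a shuffleId.
 * 1. MapOutputTrackerMaster records the map output of the ShuffleMapTasks of each stage.
 *   a. Before reading shuffle files, the shuffle reader asks MapOutputTrackerMaster where the
 *      data it has to process is located.
 *   b. The tracker answers with a list of map output locations (address, port and so on).
 * 2. MapOutputTrackerWorker only serves as a cache of that information while the shuffle is
 *    being computed.
 * ######################################################
 * I. In the sort-based shuffle of SortShuffleManager, incoming records are sorted by their
 * target partition ids and then written to a single map output file.
 * Reducers fetch contiguous regions of this file to read their portion of the map output.
 * When the map output is too large to fit in memory, sorted subsets are spilled to disk and
 * those on-disk files are merged into the final output file.
 *
 * II. SortShuffleManager has two different write paths for producing the map output file.
 * Serialized sorting: used when all three of the following conditions hold:
 * 1. The shuffle dependency specifies no aggregation or output ordering.
 * 2. The shuffle serializer supports relocation of serialized values (currently supported by
 *    KryoSerializer and Spark SQL's custom serializers).
 * 3. The shuffle produces fewer than 16777216 output partitions.
 * Deserialized sorting: used to handle all other cases.
 *
 * Before Spark 1.2 the default ShuffleManager was HashShuffleManager, which had one very
 * serious drawback: it produced a large number of intermediate disk files, and the resulting
 * volume of disk IO hurt performance.
 *
 * From Spark 1.2 on, the default ShuffleManager is SortShuffleManager. Its advantage:
 * a task may still produce many temporary disk files during the shuffle, but at the end all of
 * them are merged into a single disk file, so each ShuffleMapTask outputs exactly one data file.
 * When a shuffle read task of the next stage fetches its data, it uses the index to read only
 * its own portion of each of those files.
 */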
private[spark] class SortShuffleManager(conf: SparkConf) extends ShuffleManager with Logging {
  /**
   * Tuning note on spark.shuffle.spill:
   * when true, memory use during the shuffle is limited by spilling data to disk. The spill
   * threshold is set by spark.shuffle.memoryFraction.
   */
  if (!conf.getBoolean("spark.shuffle.spill", true)) {
    // The setting is ignored from Spark 1.6 on: the shuffle still spills to disk when necessary.
    logWarning(
      "spark.shuffle.spill was set to false, but this configuration is ignored as of Spark 1.6+." +
        " Shuffle will continue to spill to disk when necessary.")
  }

  /**
   * A mapping from shuffle ids to the number of mappers producing output for those shuffles.
   */
  private[this] val numMapsForShuffle = new ConcurrentHashMap[Int, Int]()

  /**
   * Through the IndexShuffleBlockResolver it holds, SortShuffleManager indirectly uses the
   * DiskBlockManager inside BlockManager to write map results to local disk, and writes the
   * index file according to shuffleId and mapId.
   * Files can also be read from the local node or from remote nodes by way of the mapStatuses
   * maintained in MapOutputTrackerMaster.
   */
  override val shuffleBlockResolver = new IndexShuffleBlockResolver(conf)
  /**
   * Register a shuffle with the manager and obtain a handle for it to pass to tasks.
   *
   * How SortShuffleManager runs: there are two main mechanisms, the normal one and the bypass
   * one. The bypass mechanism is used when no map-side aggregation is needed and the number of
   * shuffle read tasks is less than or equal to spark.shuffle.sort.bypassMergeThreshold
   * (default 200).
   *
   * This method is called when a ShuffleDependency is initialized.
   */
  override def registerShuffle[K, V, C](
      shuffleId: Int,
      numMaps: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
    if (SortShuffleWriter.shouldBypassMergeSort(SparkEnv.get.conf, dependency)) {
      // If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
      // need map-side aggregation, then write numPartitions files directly and just concatenate
      // them at the end. This avoids doing serialization and deserialization twice to merge
      // together the spilled files, which would happen with the normal code path. The downside is
      // having multiple files open at a time and thus more memory allocated to buffers.
      new BypassMergeSortShuffleHandle[K, V](
        shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
      // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
      new SerializedShuffleHandle[K, V](
        shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } else {
      // Otherwise, buffer map outputs in a deserialized form:
      new BaseShuffleHandle(shuffleId, numMaps, dependency)
    }
  }

  /**
   * The main job of ShuffleManager is to bridge mapper output and reducer input, so getWriter
   * and getReader are its main interfaces.
   *
   * Get a reader for a range of reduce partitions (startPartition to endPartition-1, inclusive).
   * Called on executors by reduce tasks.
   */
  override def getReader[K, C](
      handle: ShuffleHandle,
      startPartition: Int,
      endPartition: Int,
      context: TaskContext): ShuffleReader[K, C] = {
    new BlockStoreShuffleReader(
      handle.asInstanceOf[BaseShuffleHandle[K, _, C]], startPartition, endPartition, context)
  }

  /** Get a writer for a given partition. Called on executors by map tasks. */
  override def getWriter[K, V](
      handle: ShuffleHandle,
      mapId: Int,
      context: TaskContext): ShuffleWriter[K, V] = {
    numMapsForShuffle.putIfAbsent(
      handle.shuffleId, handle.asInstanceOf[BaseShuffleHandle[_, _, _]].numMaps)
    val env = SparkEnv.get
    handle match {
      case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>
        new UnsafeShuffleWriter(
          env.blockManager,
          shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
          context.taskMemoryManager(),
          unsafeShuffleHandle,
          mapId,
          context,
          env.conf)
      case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] =>
        new BypassMergeSortShuffleWriter(
          env.blockManager,
          shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
          bypassMergeSortHandle,
          mapId,
          context,
          env.conf)
      case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
        new SortShuffleWriter(shuffleBlockResolver, other, mapId, context)
    }
  }

  /** Remove a shuffle's metadata from the ShuffleManager. */
  override def unregisterShuffle(shuffleId: Int): Boolean = {
    Option(numMapsForShuffle.remove(shuffleId)).foreach { numMaps =>
      (0 until numMaps).foreach { mapId =>
        shuffleBlockResolver.removeDataByMap(shuffleId, mapId)
      }
    }
    true
  }

  /** Shut down this ShuffleManager. */
  override def stop(): Unit = {
    shuffleBlockResolver.stop()
  }
}

private[spark] object SortShuffleManager extends Logging {

  /**
   * The maximum number of shuffle output partitions that SortShuffleManager supports when
   * buffering map outputs in a serialized form. This is an extreme defensive programming measure,
   * since it's extremely unlikely that a single shuffle produces over 16 million output
   * partitions.
   */
  val MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE =
    PackedRecordPointer.MAXIMUM_PARTITION_ID + 1

  /**
   * Helper method for determining whether a shuffle should use an optimized serialized shuffle
   * path or whether it should fall back to the original path that operates on deserialized
   * objects.
   */
  def canUseSerializedShuffle(dependency: ShuffleDependency[_, _, _]): Boolean = {
    val shufId = dependency.shuffleId
    val numPartitions = dependency.partitioner.numPartitions
    val serializer = Serializer.getSerializer(dependency.serializer)
    if (!serializer.supportsRelocationOfSerializedObjects) {
      log.debug(s"Can't use serialized shuffle for shuffle $shufId because the serializer, " +
        s"${serializer.getClass.getName}, does not support object relocation")
      false
    } else if (dependency.aggregator.isDefined) {
      log.debug(
        s"Can't use serialized shuffle for shuffle $shufId because an aggregator is defined")
      false
    } else if (numPartitions > MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) {
      log.debug(s"Can't use serialized shuffle for shuffle $shufId because it has more than " +
        s"$MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE partitions")
      false
    } else {
      log.debug(s"Can use serialized shuffle for shuffle $shufId")
      true
    }
  }
}
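
Taken together, registerShuffle and canUseSerializedShuffle form a three-way decision. The sketch below models that decision with plain values so it can run without Spark. Everything in it is illustrative: `DepModel`, `WritePath` and `WritePathChooser` are invented names, the bypass rule is an assumption reconstructed from the comments (no map-side aggregation and a partition count at or below the threshold, default 200), and the limit of `1 << 24` assumes a 24-bit partition id behind `PackedRecordPointer.MAXIMUM_PARTITION_ID`, which matches the 16777216 quoted in the comments.

```scala
// Simplified stand-in for a ShuffleDependency; none of these names are Spark classes.
final case class DepModel(
    numPartitions: Int,
    mapSideCombine: Boolean,
    hasAggregator: Boolean,
    serializerSupportsRelocation: Boolean)

sealed trait WritePath
case object BypassMergeSort extends WritePath
case object SerializedSort extends WritePath
case object DeserializedSort extends WritePath

object WritePathChooser {
  // Assumed to equal PackedRecordPointer.MAXIMUM_PARTITION_ID + 1 (24-bit partition id).
  val MaxPartitionsForSerializedMode: Int = 1 << 24

  // Assumed bypass rule: no map-side aggregation and at most `threshold` reduce partitions.
  def shouldBypass(dep: DepModel, threshold: Int): Boolean =
    !dep.mapSideCombine && dep.numPartitions <= threshold

  // The same three checks as canUseSerializedShuffle, in the same order.
  def canUseSerialized(dep: DepModel): Boolean =
    dep.serializerSupportsRelocation &&
      !dep.hasAggregator &&
      dep.numPartitions <= MaxPartitionsForSerializedMode

  // Same branch order as registerShuffle: bypass first, then serialized, then the fallback.
  def choose(dep: DepModel, threshold: Int = 200): WritePath =
    if (shouldBypass(dep, threshold)) BypassMergeSort
    else if (canUseSerialized(dep)) SerializedSort
    else DeserializedSort
}
```

Under these assumptions, a shuffle with 100 partitions and no map-side combine takes the bypass path even if its serializer supports relocation, because the bypass check comes first. A shuffle with 1000 partitions, a relocatable serializer and no aggregator takes the serialized path, and defining an aggregator moves it to the deserialized path.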

/**
 * Subclass of [[BaseShuffleHandle]], used to identify when we've chosen to use the
 * serialized shuffle.
 */
private[spark] class SerializedShuffleHandle[K, V](
    shuffleId: Int,
    numMaps: Int,
    dependency: ShuffleDependency[K, V, V])
  extends BaseShuffleHandle(shuffleId, numMaps, dependency) {
}

/**
 * Subclass of [[BaseShuffleHandle]], used to identify when we've chosen to use the
 * bypass merge sort shuffle path.
 */
private[spark] class BypassMergeSortShuffleHandle[K, V](
    shuffleId: Int,
    numMaps: Int,
    dependency: ShuffleDependency[K, V, V])
  extends BaseShuffleHandle(shuffleId, numMaps, dependency) {
}
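
The two handle subclasses add no fields or behaviour; they act purely as type tags that getWriter later matches on. Because both extend BaseShuffleHandle, the order of the cases matters: the base case has to come last or it would capture every handle. A minimal sketch of that dispatch, using hypothetical stand-in classes (`BaseHandle`, `SerializedHandle`, `BypassHandle`) and returning writer names as strings instead of constructing real writers:

```scala
// Hypothetical stand-ins for BaseShuffleHandle and its two marker subclasses.
class BaseHandle(val shuffleId: Int)
class SerializedHandle(shuffleId: Int) extends BaseHandle(shuffleId)
class BypassHandle(shuffleId: Int) extends BaseHandle(shuffleId)

object WriterDispatch {
  // Specific subclasses are matched first; the base class is the fallback case.
  def writerFor(handle: BaseHandle): String = handle match {
    case _: SerializedHandle => "UnsafeShuffleWriter"
    case _: BypassHandle => "BypassMergeSortShuffleWriter"
    case _: BaseHandle => "SortShuffleWriter"
  }
}
```

Calling `WriterDispatch.writerFor(new BypassHandle(1))` picks the bypass writer, while a plain `new BaseHandle(2)` falls through to the sort writer, which mirrors the three cases in getWriter.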