I. Preface
The goal is to analyze the fetch-block flow of the Spark Shuffle Read stage and see whether there is room for optimization or any configuration parameters worth tuning.
1. Relevant versions: Spark master branch (October 2018, compiled as spark-2.5.0; the tests set spark.shuffle.sort.bypassMergeThreshold=1 and ran in YARN-client mode), HiBench-6.0, and Hadoop-2.7.1.
2. It helps to first be familiar with the basic concepts of Spark's RDD, DAG, Shuffle (MapOutputTracker / MapOutputTrackerMaster / MapOutputTrackerWorker), and memory management.
3. This walkthrough is based on the HiBench Terasort test case. With spark.shuffle.sort.bypassMergeThreshold set to 1, the code takes the SerializedShuffleHandle (i.e. UnsafeShuffle) path; a configuration sketch follows this list.
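For reference, the test setup described above could be expressed roughly as follows. This is only a minimal sketch under stated assumptions: the application name is hypothetical, and HiBench normally drives the job through its own submission scripts rather than an inline SparkConf.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: mirrors the setup described in the preface.
val conf = new SparkConf()
  .setAppName("terasort-shuffle-read-analysis")        // hypothetical app name
  .setMaster("yarn")                                    // YARN; client deploy mode as in the tests
  .set("spark.submit.deployMode", "client")
  // With the threshold at 1 the bypass-merge path is effectively disabled, so an
  // eligible shuffle is handled by SerializedShuffleHandle (UnsafeShuffle).
  .set("spark.shuffle.sort.bypassMergeThreshold", "1")
val sc = new SparkContext(conf)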
II. The story starts with ResultTask...
1) Inside ResultTask, read() is called, i.e. BlockStoreShuffleReader.read(), which corresponds to the read:102, BlockStoreShuffleReader (org.apache.spark.shuffle) frame in the call stack below:
next:400, ShuffleBlockFetcherIterator (org.apache.spark.storage)
hasNext:31, CompletionIterator (org.apache.spark.util)
hasNext:37, InterruptibleIterator (org.apache.spark)
insertAll:199, ExternalSorter (org.apache.spark.util.collection)
read:102, BlockStoreShuffleReader (org.apache.spark.shuffle)
compute:105, ShuffledRDD (org.apache.spark.rdd)
computeOrReadCheckpoint:324, RDD (org.apache.spark.rdd)
iterator:288, RDD (org.apache.spark.rdd)
compute:52, MapPartitionsRDD (org.apache.spark.rdd)
computeOrReadCheckpoint:324, RDD (org.apache.spark.rdd)
iterator:288, RDD (org.apache.spark.rdd)
runTask:90, ResultTask (org.apache.spark.scheduler)
run:121, Task (org.apache.spark.scheduler)
apply:402, Executor$TaskRunner$$anonfun$10 (org.apache.spark.executor)
tryWithSafeFinally:1360, Utils$ (org.apache.spark.util)
run:408, Executor$TaskRunner (org.apache.spark.executor)
2) BlockStoreShuffleReader.read() first calls mapOutputTracker.getMapSizesByExecutorId(handle.shuffleId, startPartition, endPartition) to obtain the map output statuses, an Iterator[(BlockManagerId, Seq[(BlockId, Long)])]. This iterator is passed as a constructor argument to ShuffleBlockFetcherIterator; it is essentially the map output metadata, in the format [(BlockManagerId, Seq[(BlockId, Long)])]:
override def read(): Iterator[Product2[K, C]] = {
  val wrappedStreams = new ShuffleBlockFetcherIterator(
    context,
    blockManager.shuffleClient,
    blockManager,
    mapOutputTracker.getMapSizesByExecutorId(handle.shuffleId, startPartition, endPartition),
    serializerManager.wrapStream,
    // Note: we use getSizeAsMb when no suffix is provided for backwards compatibility
    SparkEnv.get.conf.getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024,
    SparkEnv.get.conf.getInt("spark.reducer.maxReqsInFlight", Int.MaxValue),
    SparkEnv.get.conf.get(config.REDUCER_MAX_BLOCKS_IN_FLIGHT_PER_ADDRESS),
    SparkEnv.get.conf.get(config.MAX_REMOTE_BLOCK_SIZE_FETCH_TO_MEM),
    SparkEnv.get.conf.getBoolean("spark.shuffle.detectCorrupt", true))
  ...
}
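The constructor arguments above are exactly the reduce-side fetch knobs this analysis is looking for. As a hedged illustration (the values below are arbitrary examples, not recommendations derived from the measurements), they correspond to the following configuration keys:

import org.apache.spark.SparkConf

// Illustrative values only.
val tuned = new SparkConf()
  .set("spark.reducer.maxSizeInFlight", "96m")              // total bytes of in-flight fetches
  .set("spark.reducer.maxReqsInFlight", "64")               // number of in-flight fetch requests
  .set("spark.reducer.maxBlocksInFlightPerAddress", "128")  // blocks requested from one address at a time
  .set("spark.maxRemoteBlockSizeFetchToMem", "200m")        // larger remote blocks are fetched to disk
  .set("spark.shuffle.detectCorrupt", "true")               // detect corruption in fetched blocks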
In cluster mode, it is the MapOutputTrackerWorker that is used; its getMapSizesByExecutorId is defined as follows:
/**
 * Executor-side client for fetching map output info from the driver's MapOutputTrackerMaster.
 * Note that this is not used in local-mode; instead, local-mode Executors access the
 * MapOutputTrackerMaster directly (which is possible because the master and worker share a common
 * superclass).
 */
private[spark] class MapOutputTrackerWorker(conf: SparkConf) extends MapOutputTracker(conf) {
  val mapStatuses: Map[Int, Array[MapStatus]] =
    new ConcurrentHashMap[Int, Array[MapStatus]]().asScala

  /** Remembers which map output locations are currently being fetched on an executor. */
  private val fetching = new HashSet[Int]

  // Get blocks sizes by executor Id. Note that zero-sized blocks are excluded in the result.
  override def getMapSizesByExecutorId(shuffleId: Int, startPartition: Int, endPartition: Int)
      : Iterator[(BlockManagerId, Seq[(BlockId, Long)])] = {
    logDebug(s"Fetching outputs for shuffle $shuffleId, partitions $startPartition-$endPartition")
    val statuses = getStatuses(shuffleId)
    try {
      MapOutputTracker.convertMapStatuses(shuffleId, startPartition, endPartition, statuses)
    } catch {
      case e: MetadataFetchFailedException =>
        // We experienced a fetch failure so our mapStatuses cache is outdated; clear it:
        mapStatuses.clear()
        throw e
    }
  }
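To make the shape of the returned Iterator[(BlockManagerId, Seq[(BlockId, Long)])] concrete, here is a small sketch; the helper summarise is mine, not part of Spark, and simply consumes such an iterator and reports what has to be fetched from each BlockManager:

import org.apache.spark.storage.{BlockId, BlockManagerId}

// Hypothetical helper: walk the map output metadata and summarise it per BlockManager.
def summarise(mapSizes: Iterator[(BlockManagerId, Seq[(BlockId, Long)])]): Unit = {
  mapSizes.foreach { case (bmId, blocks) =>
    val totalBytes = blocks.map(_._2).sum
    println(s"executor ${bmId.executorId} @ ${bmId.host}:${bmId.port} -> " +
      s"${blocks.size} blocks, $totalBytes bytes")
  }
}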
3) BlockStoreShuffleReader.read() constructs a new ShuffleBlockFetcherIterator. The map output metadata in the format [(BlockManagerId, Seq[(BlockId, Long)])] obtained above is used to initialize the iterator's private fields, and object construction then calls the private method initialize():
private[this] def initialize(): Unit = {
  // Add a task completion callback (called in both success case and failure case) to cleanup.
  context.addTaskCompletionListener[Unit](_ => cleanup())

  // Split local and remote blocks.
  val remoteRequests = splitLocalRemoteBlocks()
  // Add the remote requests into our queue in a random order
  fetchRequests ++= Utils.randomize(remoteRequests)
  assert ((0 == reqsInFlight) == (0 == bytesInFlight),
    "expected reqsInFlight = 0 but found reqsInFlight = " + reqsInFlight +
    ", expected bytesInFlight = 0 but found bytesInFlight = " + bytesInFlight)

  // Send out initial requests for blocks, up to our maxBytesInFlight
  fetchUpToMaxBytes()

  val numFetches = remoteRequests.size - fetchRequests.size
  logInfo("Started " + numFetches + " remote fetches in" + Utils.getUsedTimeMs(startTime))

  // Get Local Blocks
  fetchLocalBlocks()
  logDebug("Got local blocks in " + Utils.getUsedTimeMs(startTime))
}
4) Next, let's look at the ShuffleBlockFetcherIterator.initialize() method. Using the map statuses (i.e. the Iterator[(BlockManagerId, Seq[(BlockId, Long)])]), it first calls splitLocalRemoteBlocks() to separate the blocks into localBlocks and remoteBlocks. The criterion is the executor ID: if (address.executorId == blockManager.blockManagerId.executorId) the blocks are local; otherwise they are remote.
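A minimal sketch of that splitting idea, under my own simplifying assumptions, is shown below; it is not the actual splitLocalRemoteBlocks code, which additionally groups the remote blocks into FetchRequests whose size is capped at roughly maxBytesInFlight / 5 so that fetches can be spread over several nodes in parallel:

import org.apache.spark.storage.{BlockId, BlockManagerId}

// Simplified sketch of the local/remote split, assuming we know our own BlockManagerId.
def splitBlocks(
    localBmId: BlockManagerId,
    mapSizes: Seq[(BlockManagerId, Seq[(BlockId, Long)])])
  : (Seq[BlockId], Map[BlockManagerId, Seq[(BlockId, Long)]]) = {
  val (local, remote) = mapSizes.partition { case (address, _) =>
    // The criterion described above: blocks written by an executor with our executorId are local.
    address.executorId == localBmId.executorId
  }
  (local.flatMap(_._2.map(_._1)), remote.toMap)
}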