Source-Code Analysis of fetch block in Spark's Shuffle Read Phase

Spark Shuffle Read Optimization in Practice

 

I. Preface

The goal is to analyze the fetch block flow in Spark's Shuffle Read phase and see whether there is room for optimization, or configuration parameters worth tuning.

1. Versions used: Spark master branch (as of 2018.10, compiled version spark-2.5.0; the tests were run with spark.shuffle.sort.bypassMergeThreshold = 1 in YARN-client mode), HiBench-6.0, and Hadoop-2.7.1.

2. It helps to first understand the basics of Spark's RDD, DAG, Shuffle (MapOutputTracker / MapOutputTrackerMaster / MapOutputTrackerWorker), and memory management.

3. This walkthrough is based on HiBench's TeraSort test case. With spark.shuffle.sort.bypassMergeThreshold = 1, the code takes the SerializedShuffleHandle path (i.e. UnsafeShuffle).
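To see why that setting forces the serialized path, here is a simplified, self-contained sketch of the handle-selection logic performed by SortShuffleManager.registerShuffle (the object and method names below are made up for illustration; only the decision thresholds mirror the real code):

// Hypothetical sketch, not the actual Spark API.
object HandleSelectionSketch {
  sealed trait ShuffleChoice
  case object BypassMergeSort extends ShuffleChoice   // small shuffles without map-side combine
  case object SerializedShuffle extends ShuffleChoice // a.k.a. UnsafeShuffle
  case object BaseSortShuffle extends ShuffleChoice   // general fallback

  def choose(
      numPartitions: Int,
      bypassMergeThreshold: Int,
      mapSideCombine: Boolean,
      serializerSupportsRelocation: Boolean): ShuffleChoice = {
    if (!mapSideCombine && numPartitions <= bypassMergeThreshold) {
      BypassMergeSort
    } else if (serializerSupportsRelocation && !mapSideCombine &&
        numPartitions <= (1 << 24)) { // serialized shuffle's partition-count limit
      SerializedShuffle
    } else {
      BaseSortShuffle
    }
  }

  def main(args: Array[String]): Unit = {
    // With bypassMergeThreshold = 1, any shuffle with more than one reduce
    // partition skips the bypass path; given a relocatable serializer (as in
    // this TeraSort case) it lands on the serialized (Unsafe) shuffle.
    println(choose(numPartitions = 200, bypassMergeThreshold = 1,
      mapSideCombine = false, serializerSupportsRelocation = true))
    // => SerializedShuffle
  }
}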

II. The Story Starts with ResultTask...

1) Inside ResultTask, read() is called, i.e. BlockStoreShuffleReader.read(), which appears as the read:102, BlockStoreShuffleReader (org.apache.spark.shuffle) frame in the call stack below:

next:400, ShuffleBlockFetcherIterator (org.apache.spark.storage)
hasNext:31, CompletionIterator (org.apache.spark.util)
hasNext:37, InterruptibleIterator (org.apache.spark)
insertAll:199, ExternalSorter (org.apache.spark.util.collection)
read:102, BlockStoreShuffleReader (org.apache.spark.shuffle)
compute:105, ShuffledRDD (org.apache.spark.rdd)
computeOrReadCheckpoint:324, RDD (org.apache.spark.rdd)
iterator:288, RDD (org.apache.spark.rdd)
compute:52, MapPartitionsRDD (org.apache.spark.rdd)
computeOrReadCheckpoint:324, RDD (org.apache.spark.rdd)
iterator:288, RDD (org.apache.spark.rdd)
runTask:90, ResultTask (org.apache.spark.scheduler)
run:121, Task (org.apache.spark.scheduler)
apply:402, Executor$TaskRunner$$anonfun$10 (org.apache.spark.executor)
tryWithSafeFinally:1360, Utils$ (org.apache.spark.util)
run:408, Executor$TaskRunner (org.apache.spark.executor)


2) BlockStoreShuffleReader.read() first calls mapOutputTracker.getMapSizesByExecutorId(handle.shuffleId, startPartition, endPartition) to obtain the map statuses, an Iterator[(BlockManagerId, Seq[(BlockId, Long)])]. This is the map output's metadata, and it is passed straight into the ShuffleBlockFetcherIterator constructor:

  override def read(): Iterator[Product2[K, C]] = {
    val wrappedStreams = new ShuffleBlockFetcherIterator(
      context,
      blockManager.shuffleClient,
      blockManager,
      mapOutputTracker.getMapSizesByExecutorId(handle.shuffleId, startPartition, endPartition),
      serializerManager.wrapStream,
      // Note: we use getSizeAsMb when no suffix is provided for backwards compatibility
      SparkEnv.get.conf.getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024,
      SparkEnv.get.conf.getInt("spark.reducer.maxReqsInFlight", Int.MaxValue),
      SparkEnv.get.conf.get(config.REDUCER_MAX_BLOCKS_IN_FLIGHT_PER_ADDRESS),
      SparkEnv.get.conf.get(config.MAX_REMOTE_BLOCK_SIZE_FETCH_TO_MEM),
      SparkEnv.get.conf.getBoolean("spark.shuffle.detectCorrupt", true))
    // ...
}
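All of the limits passed to the constructor above come straight from configuration; note that getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024 turns the default "48m" into 48 * 1024 * 1024 = 50331648 bytes for maxBytesInFlight. As a hedged illustration (the values are placeholders, not recommendations), these can be overridden when building the SparkConf:

import org.apache.spark.SparkConf

// Illustrative values only.
val conf = new SparkConf()
  .set("spark.reducer.maxSizeInFlight", "96m")            // default "48m": cap on bytes being fetched concurrently
  .set("spark.reducer.maxReqsInFlight", "64")             // default Int.MaxValue: cap on concurrent fetch requests
  .set("spark.reducer.maxBlocksInFlightPerAddress", "32") // default Int.MaxValue: per-remote-host block cap
  .set("spark.maxRemoteBlockSizeFetchToMem", "200m")      // blocks larger than this stream to disk instead of memory
  .set("spark.shuffle.detectCorrupt", "true")             // default true: check fetched blocks for corruption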

On a cluster it is MapOutputTrackerWorker that is used; its getMapSizesByExecutorId is defined as follows:

/**
 * Executor-side client for fetching map output info from the driver's MapOutputTrackerMaster.
 * Note that this is not used in local-mode; instead, local-mode Executors access the
 * MapOutputTrackerMaster directly (which is possible because the master and worker share a common
 * superclass).
 */
private[spark] class MapOutputTrackerWorker(conf: SparkConf) extends MapOutputTracker(conf) {

  val mapStatuses: Map[Int, Array[MapStatus]] =
    new ConcurrentHashMap[Int, Array[MapStatus]]().asScala

  /** Remembers which map output locations are currently being fetched on an executor. */
  private val fetching = new HashSet[Int]

  // Get blocks sizes by executor Id. Note that zero-sized blocks are excluded in the result.
  override def getMapSizesByExecutorId(shuffleId: Int, startPartition: Int, endPartition: Int)
      : Iterator[(BlockManagerId, Seq[(BlockId, Long)])] = {
    logDebug(s"Fetching outputs for shuffle $shuffleId, partitions $startPartition-$endPartition")
    val statuses = getStatuses(shuffleId)
    try {
      MapOutputTracker.convertMapStatuses(shuffleId, startPartition, endPartition, statuses)
    } catch {
      case e: MetadataFetchFailedException =>
        // We experienced a fetch failure so our mapStatuses cache is outdated; clear it:
        mapStatuses.clear()
        throw e
    }
  }
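The getStatuses(shuffleId) call above is where the driver round-trip happens: the worker first consults its local mapStatuses cache, and on a miss uses the fetching set so that only one thread per shuffleId actually asks the driver while concurrent callers wait for the result. A minimal, self-contained sketch of that pattern (simplified names, plain strings instead of MapStatus, and a fake RPC):

import scala.collection.mutable
import java.util.concurrent.ConcurrentHashMap

object FetchDedupSketch {
  private val cache = new ConcurrentHashMap[Int, Array[String]]()
  private val fetching = new mutable.HashSet[Int]

  // Hypothetical stand-in for the RPC that fetches serialized statuses from the driver.
  private def askDriver(shuffleId: Int): Array[String] = Array(s"status-for-$shuffleId")

  def getStatuses(shuffleId: Int): Array[String] = {
    val cached = cache.get(shuffleId)
    if (cached != null) return cached
    fetching.synchronized {
      // Another thread is already fetching this shuffle's statuses: wait for it.
      while (fetching.contains(shuffleId)) fetching.wait()
      val nowCached = cache.get(shuffleId)
      if (nowCached != null) return nowCached
      fetching += shuffleId               // we won the race, so we do the fetch
    }
    try {
      val statuses = askDriver(shuffleId)
      cache.put(shuffleId, statuses)
      statuses
    } finally {
      fetching.synchronized {
        fetching -= shuffleId
        fetching.notifyAll()              // wake waiters; they will now hit the cache
      }
    }
  }
}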


3) BlockStoreShuffleReader.read() then constructs a ShuffleBlockFetcherIterator. The map output metadata of shape [(BlockManagerId, Seq[(BlockId, Long)])] obtained above initializes the iterator's private fields, and construction ends by calling the private method initialize():

 private[this] def initialize(): Unit = {
    // Add a task completion callback (called in both success case and failure case) to cleanup.
    context.addTaskCompletionListener[Unit](_ => cleanup())

    // Split local and remote blocks.
    val remoteRequests = splitLocalRemoteBlocks()
    // Add the remote requests into our queue in a random order
    fetchRequests ++= Utils.randomize(remoteRequests)
    assert ((0 == reqsInFlight) == (0 == bytesInFlight),
      "expected reqsInFlight = 0 but found reqsInFlight = " + reqsInFlight +
      ", expected bytesInFlight = 0 but found bytesInFlight = " + bytesInFlight)

    // Send out initial requests for blocks, up to our maxBytesInFlight
    fetchUpToMaxBytes()

    val numFetches = remoteRequests.size - fetchRequests.size
    logInfo("Started " + numFetches + " remote fetches in" + Utils.getUsedTimeMs(startTime))

    // Get Local Blocks
    fetchLocalBlocks()
    logDebug("Got local blocks in " + Utils.getUsedTimeMs(startTime))
  }
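Note the fetchUpToMaxBytes() call above: it drains the fetchRequests queue only while the in-flight limits allow. A minimal sketch of that throttling loop under the two byte/request limits (simplified: the real loop also always lets a request through when nothing is in flight, even if it alone exceeds maxBytesInFlight, and additionally enforces the per-address caps):

import scala.collection.mutable

// FetchRequest is simplified here; the real one bundles an address with (BlockId, size) pairs.
case class FetchRequest(size: Long)

class ThrottleSketch(maxBytesInFlight: Long, maxReqsInFlight: Int) {
  val fetchRequests = new mutable.Queue[FetchRequest]
  private var bytesInFlight = 0L
  private var reqsInFlight = 0

  private def send(req: FetchRequest): Unit = {
    bytesInFlight += req.size
    reqsInFlight += 1
    // ... issue the async fetch; its completion callback decrements the two
    // counters and calls fetchUpToMaxBytes() again to keep the pipeline full.
  }

  def fetchUpToMaxBytes(): Unit = {
    while (fetchRequests.nonEmpty &&
        reqsInFlight + 1 <= maxReqsInFlight &&
        bytesInFlight + fetchRequests.head.size <= maxBytesInFlight) {
      send(fetchRequests.dequeue())
    }
  }
}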


4) Now for ShuffleBlockFetcherIterator.initialize() itself. Using the mapStatus metadata (the Iterator[(BlockManagerId, Seq[(BlockId, Long)])]), it first calls splitLocalRemoteBlocks() to separate localBlocks from remoteBlocks. The split condition is the executorId: if (address.executorId == blockManager.blockManagerId.executorId) the blocks are local and can be read directly from the local BlockManager; otherwise they are remote and get chunked into FetchRequests, as in the sketch below.
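A simplified, self-contained sketch of that split (with made-up stand-in types; in the real code the target request size is maxBytesInFlight / 5, so fetches can run against up to five remote hosts in parallel):

import scala.collection.mutable

object SplitSketch {
  // Hypothetical stand-ins for BlockManagerId and (BlockId, size).
  case class Address(executorId: String)
  case class Block(id: String, size: Long)
  case class FetchRequest(address: Address, blocks: Seq[Block])

  def splitLocalRemoteBlocks(
      blocksByAddress: Iterator[(Address, Seq[Block])],
      localExecutorId: String,
      maxBytesInFlight: Long): (Seq[Block], Seq[FetchRequest]) = {
    // Keep each request under 1/5 of the cap so up to 5 hosts are hit in parallel.
    val targetRequestSize = math.max(maxBytesInFlight / 5, 1L)
    val localBlocks = mutable.ArrayBuffer[Block]()
    val remoteRequests = mutable.ArrayBuffer[FetchRequest]()

    for ((address, blocks) <- blocksByAddress) {
      if (address.executorId == localExecutorId) {
        localBlocks ++= blocks.filter(_.size > 0)   // zero-sized blocks are excluded
      } else {
        var curBlocks = mutable.ArrayBuffer[Block]()
        var curSize = 0L
        for (block <- blocks if block.size > 0) {
          curBlocks += block
          curSize += block.size
          if (curSize >= targetRequestSize) {       // close this chunk, start the next
            remoteRequests += FetchRequest(address, curBlocks.toSeq)
            curBlocks = mutable.ArrayBuffer[Block]()
            curSize = 0L
          }
        }
        if (curBlocks.nonEmpty) remoteRequests += FetchRequest(address, curBlocks.toSeq)
      }
    }
    (localBlocks.toSeq, remoteRequests.toSeq)
  }
}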
