I. Preface
The goal is to analyze the fetch-block flow of the Spark Shuffle Read stage and see whether there is room for optimization or any configuration parameters worth tuning.
1. Relevant versions: Spark master branch (October 2018, compiled as spark-2.5.0; the tests set spark.shuffle.sort.bypassMergeThreshold=1 and ran in YARN-client mode), HiBench-6.0, and Hadoop-2.7.1.
2. It helps to first be familiar with the basic concepts of Spark's RDD, DAG, Shuffle (MapOutputTracker / MapOutputTrackerMaster / MapOutputTrackerWorker), and memory management.
3. This walkthrough is based on the HiBench Terasort test case. With spark.shuffle.sort.bypassMergeThreshold set to 1, the code takes the SerializedShuffleHandle (i.e. UnsafeShuffle) path; a configuration sketch follows this list.
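For reference, the test setup described above could be expressed roughly as follows. This is only a minimal sketch under stated assumptions: the application name is hypothetical, and HiBench normally drives the job through its own submission scripts rather than an inline SparkConf.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: mirrors the setup described in the preface.
val conf = new SparkConf()
  .setAppName("terasort-shuffle-read-analysis")        // hypothetical app name
  .setMaster("yarn")                                    // YARN; client deploy mode as in the tests
  .set("spark.submit.deployMode", "client")
  // With the threshold at 1 the bypass-merge path is effectively disabled, so an
  // eligible shuffle is handled by SerializedShuffleHandle (UnsafeShuffle).
  .set("spark.shuffle.sort.bypassMergeThreshold", "1")
val sc = new SparkContext(conf)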
II. The story starts with ResultTask...
1) Inside ResultTask, read() is called, i.e. BlockStoreShuffleReader.read(), which corresponds to the read:102, BlockStoreShuffleReader (org.apache.spark.shuffle) frame in the call stack below:
next:400, ShuffleBlockFetcherIterator (org.apache.spark.storage)
hasNext:31, CompletionIterator (org.apache.spark.util)
hasNext:37, InterruptibleIterator (org.apache.spark)
insertAll:199, ExternalSorter (org.apache.spark.util.collection)
read:102, BlockStoreShuffleReader (org.apache.spark.shuffle)
compute:105, ShuffledRDD (org.apache.spark.rdd)
computeOrReadCheckpoint:324, RDD (org.apache.spark.rdd)
iterator:288, RDD (org.apache.spark.rdd)
compute:52, MapPartitionsRDD (org.apache.spark.rdd)
computeOrReadCheckpoint:324, RDD (org.apache.spark.rdd)
iterator:288, RDD (org.apache.spark.rdd)
runTask:90, ResultTask (org.apache.spark.scheduler)
run:121, Task (org.apache.spark.scheduler)
apply:402, Executor$TaskRunner$$anonfun$10 (org.apache.spark.executor)
tryWithSafeFinally:1360, Utils$ (org.apache.spark.util)
run:408, Executor$TaskRunner (org.apache.spark.executor)
2) BlockStoreShuffleReader.read() first calls mapOutputTracker.getMapSizesByExecutorId(handle.shuffleId, startPartition, endPartition) to obtain the map output statuses, an Iterator[(BlockManagerId, Seq[(BlockId, Long)])]. This iterator is passed as a constructor argument to ShuffleBlockFetcherIterator; it is essentially the map output metadata, in the format [(BlockManagerId, Seq[(BlockId, Long)])]:
override def read(): Iterator[Product2[K, C]] = {
  val wrappedStreams = new ShuffleBlockFetcherIterator(
    context,
    blockManager.shuffleClient,
    blockManager,
    mapOutputTracker.getMapSizesByExecutorId(handle.shuffleId, startPartition, endPartition),
    serializerManager.wrapStream,
    // Note: we use getSizeAsMb when no suffix is provided for backwards compatibility
    SparkEnv.get.conf.getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024,
    SparkEnv.get.conf.getInt("spark.reducer.maxReqsInFlight", Int.MaxValue),
    SparkEnv.get.conf.get(config.REDUCER_MAX_BLOCKS_IN_FLIGHT_PER_ADDRESS),
    SparkEnv.get.conf.get(config.MAX_REMOTE_BLOCK_SIZE_FETCH_TO_MEM),
    SparkEnv.get.conf.getBoolean("spark.shuffle.detectCorrupt", true))
  ...
}
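The constructor arguments above are exactly the reduce-side fetch knobs this analysis is looking for. As a hedged illustration (the values below are arbitrary examples, not recommendations derived from the measurements), they correspond to the following configuration keys:

import org.apache.spark.SparkConf

// Illustrative values only.
val tuned = new SparkConf()
  .set("spark.reducer.maxSizeInFlight", "96m")              // total bytes of in-flight fetches
  .set("spark.reducer.maxReqsInFlight", "64")               // number of in-flight fetch requests
  .set("spark.reducer.maxBlocksInFlightPerAddress", "128")  // blocks requested from one address at a time
  .set("spark.maxRemoteBlockSizeFetchToMem", "200m")        // larger remote blocks are fetched to disk
  .set("spark.shuffle.detectCorrupt", "true")               // detect corruption in fetched blocks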
In cluster mode, it is the MapOutputTrackerWorker that is used; its getMapSizesByExecutorId is defined as follows:
/**
 * Executor-side client for fetching map output info from the driver's MapOutputTrackerMaster.
 * Note that this is not used in local-mode; instead, local-mode Executors access the
 * MapOutputTrackerMaster directly (which is possible because the master and worker share a common
 * superclass).
 */
private[spark] class MapOutputTrackerWorker(conf: SparkConf) extends MapOutputTracker(conf) {
  val mapStatuses: Map[Int, Array[MapStatus]] =
    new ConcurrentHashMap[Int, Array[MapStatus]]().asScala

  /** Remembers which map output locations are currently being fetched on an executor. */
  private val fetching = new HashSet[Int]

  // Get blocks sizes by executor Id. Note that zero-sized blocks are excluded in the result.
  override def getMapSizesByExecutorId(shuffleId: Int, startPartition: Int, endPartition: Int)
      : Iterator[(BlockManagerId, Seq[(BlockId, Long)])] = {
    logDebug(s"Fetching outputs for shuffle $shuffleId, partitions $startPartition-$endPartition")
    val statuses = getStatuses(shuffleId)
    try {
      MapOutputTracker.convertMapStatuses(shuffleId, startPartition, endPartition, statuses)
    } catch {
      case e: MetadataFetchFailedException =>
        // We experienced a fetch failure so our mapStatuses cache is outdated; clear it:
        mapStatuses.clear()
        throw e
    }
  }
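To make the shape of the returned Iterator[(BlockManagerId, Seq[(BlockId, Long)])] concrete, here is a small sketch; the helper summarise is mine, not part of Spark, and simply consumes such an iterator and reports what has to be fetched from each BlockManager:

import org.apache.spark.storage.{BlockId, BlockManagerId}

// Hypothetical helper: walk the map output metadata and summarise it per BlockManager.
def summarise(mapSizes: Iterator[(BlockManagerId, Seq[(BlockId, Long)])]): Unit = {
  mapSizes.foreach { case (bmId, blocks) =>
    val totalBytes = blocks.map(_._2).sum
    println(s"executor ${bmId.executorId} @ ${bmId.host}:${bmId.port} -> " +
      s"${blocks.size} blocks, $totalBytes bytes")
  }
}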
3) BlockStoreShuffleReader.read() constructs a new ShuffleBlockFetcherIterator. The map output metadata in the format [(BlockManagerId, Seq[(BlockId, Long)])] obtained above is used to initialize the iterator's private fields, and object construction then calls the private method initialize():
private[this] def initialize(): Unit = {
  // Add a task completion callback (called in both success case and failure case) to cleanup.
  context.addTaskCompletionListener[Unit](_ => cleanup())

  // Split local and remote blocks.
  val remoteRequests = splitLocalRemoteBlocks()
  // Add the remote requests into our queue in a random order
  fetchRequests ++= Utils.randomize(remoteRequests)
  assert ((0 == reqsInFlight) == (0 == bytesInFlight),
    "expected reqsInFlight = 0 but found reqsInFlight = " + reqsInFlight +
    ", expected bytesInFlight = 0 but found bytesInFlight = " + bytesInFlight)

  // Send out initial requests for blocks, up to our maxBytesInFlight
  fetchUpToMaxBytes()

  val numFetches = remoteRequests.size - fetchRequests.size
  logInfo("Started " + numFetches + " remote fetches in" + Utils.getUsedTimeMs(startTime))

  // Get Local Blocks
  fetchLocalBlocks()
  logDebug("Got local blocks in " + Utils.getUsedTimeMs(startTime))
}
4) Next, let's look at the ShuffleBlockFetcherIterator.initialize() method. Using the map statuses (i.e. the Iterator[(BlockManagerId, Seq[(BlockId, Long)])]), it first calls splitLocalRemoteBlocks() to separate the blocks into localBlocks and remoteBlocks. The criterion is the executor ID: if (address.executorId == blockManager.blockManagerId.executorId) the blocks are local; otherwise they are remote.
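A minimal sketch of that splitting idea, under my own simplifying assumptions, is shown below; it is not the actual splitLocalRemoteBlocks code, which additionally groups the remote blocks into FetchRequests whose size is capped at roughly maxBytesInFlight / 5 so that fetches can be spread over several nodes in parallel:

import org.apache.spark.storage.{BlockId, BlockManagerId}

// Simplified sketch of the local/remote split, assuming we know our own BlockManagerId.
def splitBlocks(
    localBmId: BlockManagerId,
    mapSizes: Seq[(BlockManagerId, Seq[(BlockId, Long)])])
  : (Seq[BlockId], Map[BlockManagerId, Seq[(BlockId, Long)]]) = {
  val (local, remote) = mapSizes.partition { case (address, _) =>
    // The criterion described above: blocks written by an executor with our executorId are local.
    address.executorId == localBmId.executorId
  }
  (local.flatMap(_._2.map(_._1)), remote.toMap)
}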