Spark Source Code (Part 3): DAGScheduler Source Code Analysis
As mentioned earlier in this series, once an RDD has been defined we can use it in actions. Actions are the operations that return a value to the application or export data to a storage system, such as count, collect, and save. In Spark, an RDD is computed only when an action first uses it; this is lazy evaluation, and it lets the runtime pipeline multiple transformations while the lineage is being built.
An action operator triggers this deferred computation, and we call one such computation a Job. We have also covered narrow and wide dependencies before:
- A narrow dependency means each partition of the parent RDD is used by at most one partition of the child RDD.
- A wide dependency means a partition of the parent RDD is used by multiple partitions of the child RDD.
With narrow dependencies, the partitions of the child RDD can be generated in parallel; with wide dependencies, all parent partitions must finish their shuffle computation before the next step can proceed.
Since a large share of the consecutive transformations in an RDD lineage are narrow dependencies, those partitions can be computed in parallel, while computation across a wide dependency cannot. Spark therefore uses wide dependencies as the dividing line and splits a Job into multiple Stages. The transformation work inside a Stage can be abstracted as a TaskSet: a TaskSet contains multiple Tasks that all perform the same transformations, each on a different subset of the data.
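To make the stage boundary concrete, here is a minimal sketch (the input path, object name, and variable names are illustrative, not taken from this article): flatMap and map are narrow dependencies and stay inside one stage, while reduceByKey introduces a ShuffleDependency and therefore starts a new stage.
```scala
import org.apache.spark.{SparkConf, SparkContext}

object StageBoundarySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("StageBoundarySketch"))

    val counts = sc.textFile("hdfs:///tmp/input.txt") // hypothetical input path
      .flatMap(_.split(" "))                          // narrow dependency
      .map(word => (word, 1))                         // narrow dependency
      .reduceByKey(_ + _)                             // wide (shuffle) dependency -> new stage

    // toDebugString prints the lineage, with the ShuffleDependency marking the boundary,
    // so the collect() action below runs as two stages: a ShuffleMapStage and a ResultStage.
    println(counts.toDebugString)
    counts.collect()
    sc.stop()
  }
}
```
Running such a job and checking the Spark UI would show one ShuffleMapStage feeding one ResultStage for this single job.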
Once the Job has been divided into Stages, it is ready to be submitted.

As shown in the figure above, two schedulers deserve close attention when it comes to task execution: the DAGScheduler and the TaskScheduler.
The TaskScheduler schedules tasks for the SparkContext that created it: it receives the tasks of each Stage from the DAGScheduler and submits them to the cluster.
The DAGScheduler implementation is the same across the different resource-management frameworks; once it has divided the work into a set of Tasks, it hands that set over to the TaskScheduler.
The TaskScheduler then launches the tasks, via the Cluster Manager, on the Executors of the cluster's Workers.
Starting from the beginning, let's look at how these two schedulers are created in SparkContext:
```scala
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
```
createTaskScheduler is called first to create the TaskScheduler, and then a new DAGScheduler is constructed. In a Spark program, calling an action operator on an RDD triggers a Spark job. Take the reduce operator as an example:
```scala
import scala.math.random

import org.apache.spark._

object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Pi")
    val spark = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
```
Stepping into the source of the reduce operator, we can see that it calls SparkContext's runJob method:
```scala
/**
 * Reduces the elements of this RDD using the specified commutative and
 * associative binary operator.
 */
def reduce(f: (T, T) => T): T = withScope {
  val cleanF = sc.clean(f)
  val reducePartition: Iterator[T] => Option[T] = iter => {
    if (iter.hasNext) {
      Some(iter.reduceLeft(cleanF))
    } else {
      None
    }
  }
  var jobResult: Option[T] = None
  val mergeResult = (index: Int, taskResult: Option[T]) => {
    if (taskResult.isDefined) {
      jobResult = jobResult match {
        case Some(value) => Some(f(value, taskResult.get))
        case None => taskResult
      }
    }
  }
  sc.runJob(this, reducePartition, mergeResult)
  // Get the final result out of our Option, or throw an exception if the RDD was empty
  jobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))
}
```
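Before moving on, note how the work is split between reducePartition (run inside each task) and mergeResult (run on the driver as results arrive). The plain-Scala snippet below is only a semantic sketch of that two-level structure, not Spark's actual execution path; the local collections stand in for partitions.
```scala
// Semantic sketch of reduce's two-level structure (illustrative only):
// each "partition" is reduced locally, then the defined per-partition results
// are merged, mirroring reducePartition and mergeResult above.
val partitions = Seq(Seq(1, 2, 3), Seq.empty[Int], Seq(4, 5))
val perPartition: Seq[Option[Int]] =
  partitions.map(p => if (p.nonEmpty) Some(p.reduceLeft(_ + _)) else None) // reducePartition
val jobResult: Option[Int] =
  perPartition.flatten.reduceLeftOption(_ + _)                             // mergeResult
assert(jobResult.contains(15))
```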
SparkContext's runJob in turn calls through several layers of overloaded runJob methods.
The first layer:
```scala
/**
 * Run a job on all partitions in an RDD and pass the results to a handler function.
 *
 * @param rdd target RDD to run tasks on
 * @param processPartition a function to run on each partition of the RDD
 * @param resultHandler callback to pass each result to
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    processPartition: Iterator[T] => U,
    resultHandler: (Int, U) => Unit)
{
  val processFunc = (context: TaskContext, iter: Iterator[T]) => processPartition(iter)
  runJob[T, U](rdd, processFunc, 0 until rdd.partitions.length, resultHandler)
}
```
The second layer of runJob:
```scala
/**
 * Run a function on a given set of partitions in an RDD and pass the results to the given
 * handler function. This is the main entry point for all actions in Spark.
 *
 * @param rdd target RDD to run tasks on
 * @param func a function to run on each partition of the RDD
 * @param partitions set of partitions to run on; some jobs may not want to compute on all
 *   partitions of the target RDD, e.g. for operations like `first()`
 * @param resultHandler callback to pass each result to
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
```
Here is the key point: SparkContext.runJob ultimately delegates to dagScheduler.runJob. Let's follow that call:
```scala
/**
 * Run an action job on the given RDD and pass all the results to the resultHandler function as
 * they arrive.
 *
 * @param rdd target RDD to run tasks on
 * @param func a function to run on each partition of the RDD
 * @param partitions set of partitions to run on; some jobs may not want to compute on all
 *   partitions of the target RDD, e.g. for operations like first()
 * @param callSite where in the user program this job was called
 * @param resultHandler callback to pass each result to
 * @param properties scheduler properties to attach to this job, e.g. fair scheduler pool name
 *
 * @note Throws `Exception` when the job fails
 */
def runJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): Unit = {
  val start = System.nanoTime
  val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
  ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)
  waiter.completionFuture.value.get match {
    case scala.util.Success(_) =>
      logInfo("Job %d finished: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
    case scala.util.Failure(exception) =>
      logInfo("Job %d failed: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      // SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
      val callerStackTrace = Thread.currentThread().getStackTrace.tail
      exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
      throw exception
  }
}
```
We can see that DAGScheduler.runJob calls submitJob to submit the job; submitJob returns a JobWaiter, which is used to wait for the job to finish. Let's step into submitJob and look at its logic:
```scala
/**
 * Submit an action job to the scheduler.
 *
 * @param rdd target RDD to run tasks on
 * @param func a function to run on each partition of the RDD
 * @param partitions set of partitions to run on; some jobs may not want to compute on all
 *   partitions of the target RDD, e.g. for operations like first()
 * @param callSite where in the user program this job was called
 * @param resultHandler callback to pass each result to
 * @param properties scheduler properties to attach to this job, e.g. fair scheduler pool name
 *
 * @return a JobWaiter object that can be used to block until the job finishes executing
 *         or can be used to cancel the job.
 *
 * @throws IllegalArgumentException when partitions ids are illegal
 */
def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // ... partition validation, jobId allocation and JobWaiter creation are omitted here ...
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    SerializationUtils.clone(properties)))
  waiter
}
```
```scala
/**
 * Put the event into the event queue. The event thread will process it later.
 */
def post(event: E): Unit = {
  eventQueue.put(event)
}
```
Only the key code is shown here; parts that are not relevant to our discussion have been omitted. submitJob calls eventProcessLoop.post to add a JobSubmitted event to the DAGScheduler's event queue.
Stepping into eventProcessLoop, we find that it is a DAGSchedulerEventProcessLoop instance:
```scala
private[scheduler] val eventProcessLoop = new DAGSchedulerEventProcessLoop(this)
taskScheduler.setDAGScheduler(this)
```
In this structure, the events we posted earlier via the post method are received and handled by the onReceive method.
```scala
/**
 * The main event loop of the DAG scheduler.
 */
override def onReceive(event: DAGSchedulerEvent): Unit = {
  val timerContext = timer.time()
  try {
    doOnReceive(event)
  } finally {
    timerContext.stop()
  }
}
```
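The post / onReceive pair is a classic single-threaded event loop: post enqueues events from caller threads, and a dedicated thread drains the queue and dispatches each event to onReceive. Below is a stripped-down sketch of that pattern; it is not Spark's actual org.apache.spark.util.EventLoop class, and the class name is made up for illustration.
```scala
import java.util.concurrent.LinkedBlockingDeque

// Minimal event-loop sketch (illustrative only): post() enqueues events from
// caller threads, while a single daemon thread drains the queue and hands each
// event to onReceive().
abstract class SimpleEventLoop[E](name: String) {
  private val eventQueue = new LinkedBlockingDeque[E]()

  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (true) {
          onReceive(eventQueue.take()) // blocks until an event is available
        }
      } catch {
        case _: InterruptedException => // stop() interrupts the thread to end the loop
      }
    }
  }

  def start(): Unit = eventThread.start()
  def stop(): Unit = eventThread.interrupt()

  // Put the event into the event queue; the event thread will process it later.
  def post(event: E): Unit = eventQueue.put(event)

  // Subclasses (DAGSchedulerEventProcessLoop plays this role in the real code)
  // pattern-match on the event type here.
  protected def onReceive(event: E): Unit
}
```
DAGSchedulerEventProcessLoop corresponds to the subclass in this sketch: its onReceive pattern-matches on the event type, which is exactly the doOnReceive logic shown next.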
Now let's look at doOnReceive:
```scala
private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
  case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
    dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
  ...
}
```
So the JobSubmitted event we posted ends up invoking DAGScheduler's handleJobSubmitted method. We have finally reached the core entry point of DAGScheduler's scheduling: handleJobSubmitted.
```scala
private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  var finalStage: ResultStage = null
  try {
    // New stage creation may throw an exception if, for example, jobs are run on a
    // HadoopRDD whose underlying HDFS files have been deleted.
    // Create the finalStage from the last RDD of the job that was triggered
    finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
  } catch {
    case e: Exception =>
      logWarning("Creating new stage failed due to exception - job: " + jobId, e)
      listener.jobFailed(e)
      return
  }
  // Create a job from finalStage; the last stage of this job is finalStage
  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
  clearCacheLocs()
  logInfo("Got job %s (%s) with %d output partitions".format(
    job.jobId, callSite.shortForm, partitions.length))
  logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
  logInfo("Parents of final stage: " + finalStage.parents)
  logInfo("Missing parents: " + getMissingParentStages(finalStage))
  // Cache the job-related information in memory
  val jobSubmissionTime = clock.getTimeMillis()
  jobIdToActiveJob(jobId) = job
  activeJobs += job
  finalStage.setActiveJob(job)
  val stageIds = jobIdToStageIds(jobId).toArray
  val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
  listenerBus.post(
    SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
  // Submit finalStage via submitStage. This ends up submitting the first stage
  // and placing the remaining stages in the waitingStages queue
  submitStage(finalStage)
}
```
- Create finalStage from the last RDD of the job that was triggered. There are two kinds of stages: ShuffleMapStage and ResultStage. A ShuffleMapStage produces data for a shuffle, while the ResultStage is the final stage to run and wraps up the job. Every stage before the ResultStage is a ShuffleMapStage.
- Create a Job from finalStage; the last stage of this job is exactly the finalStage.
- Cache the job-related information in memory.
- Finally, submit finalStage via submitStage.
Next, let's dig into the logic of submitStage:
```scala
/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      // The recursion keeps going until it reaches the earliest stage, which has no
      // parent stage; that earliest stage (stage 0) is the first one to be submitted
      if (missing.isEmpty) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        submitMissingTasks(stage, jobId.get)
      } else {
        // If there are missing parent stages, recursively call submitStage to submit them
        for (parent <- missing) {
          submitStage(parent)
        }
        // and add the current stage to the waitingStages queue, to be executed later
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}
```
Its logic is:
- Look up the stage's active job id; if none exists, call abortStage to abort the stage.
- Before submitting, check whether the stage is already in waitingStages, runningStages, or failedStages; if it is in any of these sets, it is simply skipped for now.
- Otherwise, call getMissingParentStages to get all of the stage's unsubmitted parent stages. If there are none, call submitMissingTasks to submit all of the stage's outstanding tasks. If there are unsubmitted parent stages, recursively call submitStage on each of them (as long as a stage still has unsubmitted parents, the recursion keeps walking up the lineage until it reaches stage 0), and add the current stage to the waitingStages queue.
The getMissingParentStages method it calls is where the stage-division algorithm lives. Here is the code first:
```scala
/*
 * Gets the missing parent stages of a stage.
 * For a given stage, if all the dependencies of its last RDD are narrow, no new stage is created.
 * But as soon as an RDD in this stage has a wide (shuffle) dependency on some RDD, a new stage
 * is created from that RDD, and the new stages are returned.
 */
private def getMissingParentStages(stage: Stage): List[Stage] = {
  val missing = new HashSet[Stage]
  val visited = new HashSet[RDD[_]]
  // We are manually maintaining a stack here to prevent StackOverflowError
  // caused by recursively visiting
  val waitingForVisit = new Stack[RDD[_]]
  def visit(rdd: RDD[_]) {
    if (!visited(rdd)) {
      visited += rdd
      val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
      // Iterate over the RDD's dependencies
      if (rddHasUncachedPartitions) {
        for (dep <- rdd.dependencies) {
          dep match {
            case shufDep: ShuffleDependency[_, _, _] =>
              // For a wide (shuffle) dependency, create a stage from its RDD
              // via getOrCreateShuffleMapStage.
              // The last stage is not a ShuffleMapStage;
              // every stage before finalStage is a ShuffleMapStage
              val mapStage = getOrCreateShuffleMapStage(shufDep, stage.firstJobId)
              if (!mapStage.isAvailable) {
                missing += mapStage
              }
            // For a narrow dependency, push the RDD onto the stack without creating a new stage
            case narrowDep: NarrowDependency[_] =>
              waitingForVisit.push(narrowDep.rdd)
          }
        }
      }
    }
  }
  // First push the stage's last RDD (stage.rdd) onto the stack
  waitingForVisit.push(stage.rdd)
  while (waitingForVisit.nonEmpty) {
    // Call the locally defined visit() method on each RDD popped from the stack
    visit(waitingForVisit.pop())
  }
  missing.toList
}
```
The logic of the stage-division algorithm (a standalone sketch of this traversal follows the list):
- Create a stack that holds RDDs
- Push the stage's last RDD onto the stack
- While the stack is not empty, pop an RDD and call the locally defined visit method on it
- Inside visit:
  - Iterate over the RDD's dependencies
  - If a dependency is a ShuffleDependency, call getOrCreateShuffleMapStage to create a new Stage
  - If it is a NarrowDependency, just push its RDD onto the stack
- Return the list of new (missing) stages
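To see the manual-stack traversal on its own, here is a small, self-contained sketch over a toy dependency graph. The Node and Edge classes and the lineage are invented for illustration; only the pattern (iterative DFS with an explicit stack that follows narrow edges and stops at shuffle edges) mirrors getMissingParentStages.
```scala
import scala.collection.mutable

// Toy dependency graph: each node knows its parent edges and whether the edge is a shuffle.
// This is an illustrative stand-in for RDDs and their Dependency objects.
case class Edge(parent: Node, isShuffle: Boolean)
case class Node(name: String, deps: Seq[Edge])

// Iterative DFS with an explicit stack, like getMissingParentStages:
// narrow edges are followed within the current "stage", while shuffle edges are
// recorded as boundaries (where a parent ShuffleMapStage would be created).
def shuffleBoundaries(last: Node): List[Node] = {
  val boundaries = mutable.ListBuffer[Node]()
  val visited = mutable.HashSet[Node]()
  val waitingForVisit = mutable.Stack[Node](last)

  while (waitingForVisit.nonEmpty) {
    val node = waitingForVisit.pop()
    if (visited.add(node)) {
      node.deps.foreach {
        case Edge(parent, true)  => boundaries += parent         // wide: stage boundary
        case Edge(parent, false) => waitingForVisit.push(parent) // narrow: keep walking
      }
    }
  }
  boundaries.toList
}

// Example lineage: textFile -> map (narrow) -> reduceByKey (shuffle) -> result
val source   = Node("textFile", Seq.empty)
val mapped   = Node("map", Seq(Edge(source, isShuffle = false)))
val shuffled = Node("reduceByKey", Seq(Edge(mapped, isShuffle = true)))
println(shuffleBoundaries(shuffled).map(_.name)) // List(map)
```
For this toy lineage the only boundary is on the parent side of the shuffle edge, which is exactly where a ShuffleMapStage would be created in the real traversal.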
Next, let's focus on getOrCreateShuffleMapStage, which is responsible for creating stages.
It in turn calls createShuffleMapStage:
```scala
/**
 * Gets a shuffle map stage if one exists in shuffleIdToMapStage. Otherwise, if the
 * shuffle map stage doesn't already exist, this method will create the shuffle map stage in
 * addition to any missing ancestor shuffle map stages.
 */
private def getOrCreateShuffleMapStage(
    shuffleDep: ShuffleDependency[_, _, _],
    firstJobId: Int): ShuffleMapStage = {
  shuffleIdToMapStage.get(shuffleDep.shuffleId) match {
    case Some(stage) =>
      stage
    case None =>
      // Create stages for all missing ancestor shuffle dependencies.
      getMissingAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
        // Even though getMissingAncestorShuffleDependencies only returns shuffle dependencies
        // that were not already in shuffleIdToMapStage, it's possible that by the time we
        // get to a particular dependency in the foreach loop, it's been added to
        // shuffleIdToMapStage by the stage creation process for an earlier dependency. See
        // SPARK-13902 for more information.
        if (!shuffleIdToMapStage.contains(dep.shuffleId)) {
          createShuffleMapStage(dep, firstJobId)
        }
      }
      // Finally, create a stage for the given shuffle dependency.
      createShuffleMapStage(shuffleDep, firstJobId)
  }
}
```
createShuffleMapStage instantiates a new ShuffleMapStage:
```scala
/**
 * Creates a ShuffleMapStage that generates the given shuffle dependency's partitions. If a
 * previously run stage generated the same shuffle data, this function will copy the output
 * locations that are still available from the previous shuffle to avoid unnecessarily
 * regenerating data.
 */
def createShuffleMapStage(shuffleDep: ShuffleDependency[_, _, _], jobId: Int): ShuffleMapStage = {
  ...
  val stage = new ShuffleMapStage(
    id, rdd, numTasks, parents, jobId, rdd.creationSite, shuffleDep, mapOutputTracker)
  ...
  stage
}
```
At this point, the creation and division of stages is complete.
Summary
The DAGScheduler's stage-division algorithm matters a great deal in practice: understanding it lets us, when analyzing application performance, pinpoint which specific stage is running slowly or throwing errors, and troubleshoot that stage directly.
This article has walked through how the DAGScheduler works in Spark, covering how stages are divided, how jobs are submitted and executed, and how the core components interact.