Spark Source Code (Part 3): DAGScheduler Source Code Analysis
As mentioned earlier in this series, once an RDD has been defined we can use it in actions. Actions are the operations that return a value to the application or export data to a storage system, such as count, collect, and save. In Spark, an RDD is computed only when an action first uses it; this is lazy evaluation, and it lets the runtime pipeline multiple transformations while the lineage is being built.
An action operator triggers this deferred computation, and we call one such computation a Job. We have also covered narrow and wide dependencies before:
- A narrow dependency means each partition of the parent RDD is used by at most one partition of the child RDD.
- A wide dependency means a partition of the parent RDD is used by multiple partitions of the child RDD.
With narrow dependencies, the partitions of the child RDD can be generated in parallel; with wide dependencies, all parent partitions must finish their shuffle computation before the next step can proceed.
Since a large share of the consecutive transformations in an RDD lineage are narrow dependencies, those partitions can be computed in parallel, while computation across a wide dependency cannot. Spark therefore uses wide dependencies as the dividing line and splits a Job into multiple Stages. The transformation work inside a Stage can be abstracted as a TaskSet: a TaskSet contains multiple Tasks that all perform the same transformations, each on a different subset of the data.
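To make the stage boundary concrete, here is a minimal sketch (the input path, object name, and variable names are illustrative, not taken from this article): flatMap and map are narrow dependencies and stay inside one stage, while reduceByKey introduces a ShuffleDependency and therefore starts a new stage.
```scala
import org.apache.spark.{SparkConf, SparkContext}

object StageBoundarySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("StageBoundarySketch"))

    val counts = sc.textFile("hdfs:///tmp/input.txt") // hypothetical input path
      .flatMap(_.split(" "))                          // narrow dependency
      .map(word => (word, 1))                         // narrow dependency
      .reduceByKey(_ + _)                             // wide (shuffle) dependency -> new stage

    // toDebugString prints the lineage, with the ShuffleDependency marking the boundary,
    // so the collect() action below runs as two stages: a ShuffleMapStage and a ResultStage.
    println(counts.toDebugString)
    counts.collect()
    sc.stop()
  }
}
```
Running such a job and checking the Spark UI would show one ShuffleMapStage feeding one ResultStage for this single job.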
Once the Job has been divided into Stages, it is ready to be submitted.

As shown in the figure above, two schedulers deserve close attention when it comes to task execution: the DAGScheduler and the TaskScheduler.
The TaskScheduler schedules tasks for the SparkContext that created it: it receives the tasks of each Stage from the DAGScheduler and submits them to the cluster.
The DAGScheduler implementation is the same across the different resource-management frameworks; once it has divided the work into a set of Tasks, it hands that set over to the TaskScheduler.
The TaskScheduler then launches the tasks, via the Cluster Manager, on the Executors of the cluster's Workers.
Starting from the beginning, let's look at how these two schedulers are created in SparkContext:
```scala
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
```
createTaskScheduler is called first to create the TaskScheduler, and then a new DAGScheduler is constructed. In a Spark program, calling an action operator on an RDD triggers a Spark job. Take the reduce operator as an example:
```scala
import scala.math.random

import org.apache.spark._

object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Pi")
    val spark = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
```
Stepping into the source of the reduce operator, we can see that it calls SparkContext's runJob method:
```scala
/**
 * Reduces the elements of this RDD using the specified commutative and
 * associative binary operator.
 */
def reduce(f: (T, T) => T): T = withScope {
  val cleanF = sc.clean(f)
  val reducePartition: Iterator[T] => Option[T] = iter => {
    if (iter.hasNext) {
      Some(iter.reduceLeft(cleanF))
    } else {
      None
    }
  }
  var jobResult: Option[T] = None
  val mergeResult = (index: Int, taskResult: Option[T]) => {
    if (taskResult.isDefined) {
      jobResult = jobResult match {
        case Some(value) => Some(f(value, taskResult.get))
        case None => taskResult
      }
    }
  }
  sc.runJob(this, reducePartition, mergeResult)
  // Get the final result out of our Option, or throw an exception if the RDD was empty
  jobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))
}
```
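Before moving on, note how the work is split between reducePartition (run inside each task) and mergeResult (run on the driver as results arrive). The plain-Scala snippet below is only a semantic sketch of that two-level structure, not Spark's actual execution path; the local collections stand in for partitions.
```scala
// Semantic sketch of reduce's two-level structure (illustrative only):
// each "partition" is reduced locally, then the defined per-partition results
// are merged, mirroring reducePartition and mergeResult above.
val partitions = Seq(Seq(1, 2, 3), Seq.empty[Int], Seq(4, 5))
val perPartition: Seq[Option[Int]] =
  partitions.map(p => if (p.nonEmpty) Some(p.reduceLeft(_ + _)) else None) // reducePartition
val jobResult: Option[Int] =
  perPartition.flatten.reduceLeftOption(_ + _)                             // mergeResult
assert(jobResult.contains(15))
```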
SparkContext's runJob in turn calls through several layers of overloaded runJob methods.
The first layer:
```scala
/**
 * Run a job on all partitions in an RDD and pass the results to a handler function.
 *
 * @param rdd target RDD to run tasks on
 * @param processPartition a function to run on each partition of the RDD
 * @param resultHandler callback to pass each result to
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    processPartition: Iterator[T] => U,
    resultHandler: (Int, U) => Unit)
{
  val processFunc = (context: TaskContext, iter: Iterator[T]) => processPartition(iter)
  runJob[T, U](rdd, processFunc, 0 until rdd.partitions.length, resultHandler)
}
```
The second layer of runJob:
```scala
/**
 * Run a function on a given set of partitions in an RDD and pass the results to the given
 * handler function. This is the main entry point for all actions in Spark.
 *
 * @param rdd target RDD to run tasks on
 * @param func a function to run on each partition of the RDD
 * @param partitions set of partitions to run on; some jobs may not want to compute on all
 *   partitions of the target RDD, e.g. for operations like `first()`
 * @param resultHandler callback to pass each result to
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
```
Here is the key point: SparkContext.runJob ultimately delegates to dagScheduler.runJob. Let's follow that call:
```scala
/**
 * Run an action job on the given RDD and pass all the results to the resultHandler function as
 * they arrive.
 *
 * @param rdd target RDD to run tasks on
 * @param func a function to run on each partition of the RDD
 * @param partitions set of partitions to run on; some jobs may not want to compute on all
 *   partitions of the target RDD, e.g. for operations like first()
 * @param callSite where in the user program this job was called
 * @param resultHandler callback to pass each result to
 * @param properties scheduler properties to attach to this job, e.g. fair scheduler pool name
 *
 * @note Throws `Exception` when the job fails
 */
def runJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): Unit = {
  val start = System.nanoTime
  val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
  ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)
  waiter.completionFuture.value.get match {
    case scala.util.Success(_) =>
      logInfo("Job %d finished: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
    case scala.util.Failure(exception) =>
      logInfo("Job %d failed: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      // SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
      val callerStackTrace = Thread.currentThread().getStackTrace.tail
      exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
      throw exception
  }
}
```
We can see that DAGScheduler.runJob calls submitJob to submit the job; submitJob returns a JobWaiter, which is used to wait for the job to finish. Let's step into submitJob and look at its logic:
```scala
/**
 * Submit an action job to the scheduler.
 *
 * @param rdd target RDD to run tasks on
 * @param func a function to run on each partition of the RDD
 * @param partitions set of partitions to run on; some jobs may not want to compute on all
 *   partitions of the target RDD, e.g. for operations like first()
 * @param callSite where in the user program this job was called
 * @param resultHandler callback to pass each result to
 * @param properties scheduler properties to attach to this job, e.g. fair scheduler pool name
 *
 * @return a JobWaiter object that can be used to block until the job finishes executing
 *         or can be used to cancel the job.
 *
 * @throws IllegalArgumentException when partitions ids are illegal
 */
def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // ... partition validation, jobId allocation and JobWaiter creation are omitted here ...
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    SerializationUtils.clone(properties)))
  waiter
}
```
```scala
/**
 * Put the event into the event queue. The event thread will process it later.
 */
def post(event: E): Unit = {
  eventQueue.put(event)
}
```
Only the key code is shown here; parts that are not relevant to our discussion have been omitted. submitJob calls eventProcessLoop.post to add a JobSubmitted event to the DAGScheduler's event queue.
Stepping into eventProcessLoop, we find that it is a DAGSchedulerEventProcessLoop instance:
```scala
private[scheduler] val eventProcessLoop = new DAGSchedulerEventProcessLoop(this)
taskScheduler.setDAGScheduler(this)
```
In this structure, the events we posted earlier via the post method are received and handled by the onReceive method.
```scala
/**
 * The main event loop of the DAG scheduler.
 */
override def onReceive(event: DAGSchedulerEvent): Unit = {
  val timerContext = timer.time()
  try {
    doOnReceive(event)
  } finally {
    timerContext.stop()
  }
}
```
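The post / onReceive pair is a classic single-threaded event loop: post enqueues events from caller threads, and a dedicated thread drains the queue and dispatches each event to onReceive. Below is a stripped-down sketch of that pattern; it is not Spark's actual org.apache.spark.util.EventLoop class, and the class name is made up for illustration.
```scala
import java.util.concurrent.LinkedBlockingDeque

// Minimal event-loop sketch (illustrative only): post() enqueues events from
// caller threads, while a single daemon thread drains the queue and hands each
// event to onReceive().
abstract class SimpleEventLoop[E](name: String) {
  private val eventQueue = new LinkedBlockingDeque[E]()

  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (true) {
          onReceive(eventQueue.take()) // blocks until an event is available
        }
      } catch {
        case _: InterruptedException => // stop() interrupts the thread to end the loop
      }
    }
  }

  def start(): Unit = eventThread.start()
  def stop(): Unit = eventThread.interrupt()

  // Put the event into the event queue; the event thread will process it later.
  def post(event: E): Unit = eventQueue.put(event)

  // Subclasses (DAGSchedulerEventProcessLoop plays this role in the real code)
  // pattern-match on the event type here.
  protected def onReceive(event: E): Unit
}
```
DAGSchedulerEventProcessLoop corresponds to the subclass in this sketch: its onReceive pattern-matches on the event type, which is exactly the doOnReceive logic shown next.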
Now let's look at doOnReceive:
```scala
private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
  case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
    dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
  ...
}
```
So the JobSubmitted event we posted ends up invoking DAGScheduler's handleJobSubmitted method. We have finally reached the core entry point of DAGScheduler's scheduling: handleJobSubmitted.
```scala
private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  var finalStage: ResultStage = null
  try {
    // New stage creation may throw an exception if, for example, jobs are run on a
    // HadoopRDD whose underlying HDFS files have been deleted.
    // Create the finalStage from the last RDD of the job that was triggered
    finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
  } catch {
    case e: Exception =>
      logWarning("Creating new stage failed due to exception - job: " + jobId, e)
      listener.jobFailed(e)
      return
  }
  // Create a job from finalStage; the last stage of this job is finalStage
  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
  clearCacheLocs()
  logInfo("Got job %s (%s) with %d output partitions".format(
    job.jobId, callSite.shortForm, partitions.length))
  logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
  logInfo("Parents of final stage: " + finalStage.parents)
  logInfo("Missing parents: " + getMissingParentStages(finalStage))
  // Cache the job-related information in memory
  val jobSubmissionTime = clock.getTimeMillis()
  jobIdToActiveJob(jobId) = job
  activeJobs += job
  finalStage.setActiveJob(job)
  val stageIds = jobIdToStageIds(jobId).toArray
  val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
  listenerBus.post(
    SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
  // Submit finalStage via submitStage. This ends up submitting the first stage
  // and placing the remaining stages in the waitingStages queue
  submitStage(finalStage)
}
```
- Create finalStage from the last RDD of the job that was triggered. There are two kinds of stages: ShuffleMapStage and ResultStage. A ShuffleMapStage produces data for a shuffle, while the ResultStage is the final stage to run and wraps up the job. Every stage before the ResultStage is a ShuffleMapStage.
- Create a Job from finalStage; the last stage of this job is exactly the finalStage.
- Cache the job-related information in memory.
- Finally, submit finalStage via submitStage.
Next, let's dig into the logic of submitStage:
```scala
/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      // The recursion keeps going until it reaches the earliest stage, which has no
      // parent stage; that earliest stage (stage 0) is the first one to be submitted
      if (missing.isEmpty) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        submitMissingTasks(stage, jobId.get)
      } else {
        // If there are missing parent stages, recursively call submitStage to submit them
        for (parent <- missing) {
          submitStage(parent)
        }
        // and add the current stage to the waitingStages queue, to be executed later
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}
```
Its logic is:
- Look up the stage's active job id; if none exists, call abortStage to abort the stage.
- Before submitting, check whether the stage is already in waitingStages, runningStages, or failedStages; if it is in any of these sets, it is simply skipped for now.
- Otherwise, call getMissingParentStages to get all of the stage's unsubmitted parent stages. If there are none, call submitMissingTasks to submit all of the stage's outstanding tasks. If there are unsubmitted parent stages, recursively call submitStage on each of them (as long as a stage still has unsubmitted parents, the recursion keeps walking up the lineage until it reaches stage 0), and add the current stage to the waitingStages queue.
The getMissingParentStages method it calls is where the stage-division algorithm lives. Here is the code first:
```scala
/*
 * Gets the missing parent stages of a stage.
 * For a given stage, if all the dependencies of its last RDD are narrow, no new stage is created.
 * But as soon as an RDD in this stage has a wide (shuffle) dependency on some RDD, a new stage
 * is created from that RDD, and the new stages are returned.
 */
private def getMissingParentStages(stage: Stage): List[Stage] = {
  val missing = new HashSet[Stage]
  val visited = new HashSet[RDD[_]]
  // We are manually maintaining a stack here to prevent StackOverflowError
  // caused by recursively visiting
  val waitingForVisit = new Stack[RDD[_]]
  def visit(rdd: RDD[_]) {
    if (!visited(rdd)) {
      visited += rdd
      val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
      // Iterate over the RDD's dependencies
      if (rddHasUncachedPartitions) {
        for (dep <- rdd.dependencies) {
          dep match {
            case shufDep: ShuffleDependency[_, _, _] =>
              // For a wide (shuffle) dependency, create a stage from its RDD
              // via getOrCreateShuffleMapStage.
              // The last stage is not a ShuffleMapStage;
              // every stage before finalStage is a ShuffleMapStage
              val mapStage = getOrCreateShuffleMapStage(shufDep, stage.firstJobId)
              if (!mapStage.isAvailable) {
                missing += mapStage
              }
            // For a narrow dependency, push the RDD onto the stack without creating a new stage
            case narrowDep: NarrowDependency[_] =>
              waitingForVisit.push(narrowDep.rdd)
          }
        }
      }
    }
  }
  // First push the stage's last RDD (stage.rdd) onto the stack
  waitingForVisit.push(stage.rdd)
  while (waitingForVisit.nonEmpty) {
    // Call the locally defined visit() method on each RDD popped from the stack
    visit(waitingForVisit.pop())
  }
  missing.toList
}
```
The logic of the stage-division algorithm (a standalone sketch of this traversal follows the list):
- Create a stack that holds RDDs
- Push the stage's last RDD onto the stack
- While the stack is not empty, pop an RDD and call the locally defined visit method on it
- Inside visit:
  - Iterate over the RDD's dependencies
  - If a dependency is a ShuffleDependency, call getOrCreateShuffleMapStage to create a new Stage
  - If it is a NarrowDependency, just push its RDD onto the stack
- Return the list of new (missing) stages
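To see the manual-stack traversal on its own, here is a small, self-contained sketch over a toy dependency graph. The Node and Edge classes and the lineage are invented for illustration; only the pattern (iterative DFS with an explicit stack that follows narrow edges and stops at shuffle edges) mirrors getMissingParentStages.
```scala
import scala.collection.mutable

// Toy dependency graph: each node knows its parent edges and whether the edge is a shuffle.
// This is an illustrative stand-in for RDDs and their Dependency objects.
case class Edge(parent: Node, isShuffle: Boolean)
case class Node(name: String, deps: Seq[Edge])

// Iterative DFS with an explicit stack, like getMissingParentStages:
// narrow edges are followed within the current "stage", while shuffle edges are
// recorded as boundaries (where a parent ShuffleMapStage would be created).
def shuffleBoundaries(last: Node): List[Node] = {
  val boundaries = mutable.ListBuffer[Node]()
  val visited = mutable.HashSet[Node]()
  val waitingForVisit = mutable.Stack[Node](last)

  while (waitingForVisit.nonEmpty) {
    val node = waitingForVisit.pop()
    if (visited.add(node)) {
      node.deps.foreach {
        case Edge(parent, true)  => boundaries += parent         // wide: stage boundary
        case Edge(parent, false) => waitingForVisit.push(parent) // narrow: keep walking
      }
    }
  }
  boundaries.toList
}

// Example lineage: textFile -> map (narrow) -> reduceByKey (shuffle) -> result
val source   = Node("textFile", Seq.empty)
val mapped   = Node("map", Seq(Edge(source, isShuffle = false)))
val shuffled = Node("reduceByKey", Seq(Edge(mapped, isShuffle = true)))
println(shuffleBoundaries(shuffled).map(_.name)) // List(map)
```
For this toy lineage the only boundary is on the parent side of the shuffle edge, which is exactly where a ShuffleMapStage would be created in the real traversal.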
Next, let's focus on getOrCreateShuffleMapStage, which is responsible for creating stages.
It in turn calls createShuffleMapStage:
```scala
/**
 * Gets a shuffle map stage if one exists in shuffleIdToMapStage. Otherwise, if the
 * shuffle map stage doesn't already exist, this method will create the shuffle map stage in
 * addition to any missing ancestor shuffle map stages.
 */
private def getOrCreateShuffleMapStage(
    shuffleDep: ShuffleDependency[_, _, _],
    firstJobId: Int): ShuffleMapStage = {
  shuffleIdToMapStage.get(shuffleDep.shuffleId) match {
    case Some(stage) =>
      stage
    case None =>
      // Create stages for all missing ancestor shuffle dependencies.
      getMissingAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
        // Even though getMissingAncestorShuffleDependencies only returns shuffle dependencies
        // that were not already in shuffleIdToMapStage, it's possible that by the time we
        // get to a particular dependency in the foreach loop, it's been added to
        // shuffleIdToMapStage by the stage creation process for an earlier dependency. See
        // SPARK-13902 for more information.
        if (!shuffleIdToMapStage.contains(dep.shuffleId)) {
          createShuffleMapStage(dep, firstJobId)
        }
      }
      // Finally, create a stage for the given shuffle dependency.
      createShuffleMapStage(shuffleDep, firstJobId)
  }
}
```
createShuffleMapStage instantiates a new ShuffleMapStage:
```scala
/**
 * Creates a ShuffleMapStage that generates the given shuffle dependency's partitions. If a
 * previously run stage generated the same shuffle data, this function will copy the output
 * locations that are still available from the previous shuffle to avoid unnecessarily
 * regenerating data.
 */
def createShuffleMapStage(shuffleDep: ShuffleDependency[_, _, _], jobId: Int): ShuffleMapStage = {
  ...
  val stage = new ShuffleMapStage(
    id, rdd, numTasks, parents, jobId, rdd.creationSite, shuffleDep, mapOutputTracker)
  ...
  stage
}
```
At this point, the creation and division of stages is complete.
Summary
The DAGScheduler's stage-division algorithm matters a great deal in practice: understanding it lets us, when analyzing application performance, pinpoint which specific stage is running slowly or throwing errors, and troubleshoot that stage directly.
This article has walked through how the DAGScheduler works in Spark, covering how stages are divided, how jobs are submitted and executed, and how the core components interact.