Background
To understand how the DAGScheduler submits a job, we first need to understand what a job and a stage are. A program may call many Spark operators, but as we know, no computation actually happens until an action operator is reached; the computation triggered by an action is a job. In other words, every action submits exactly one job. For example, both collect and first end up calling SparkContext's runJob to submit a job:
def collect(): Array[T] = withScope {
  // delegates to SparkContext.runJob
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}
How a job is split into stages depends on the type of each RDD dependency, wide (shuffle) or narrow. A job usually consists of one or more stages, and every one of those stages carries the jobId generated by the driver for that job. The runJob source of the SparkContext object is walked through below.
For an introduction to Application, Driver, Job, Task and Stage, see: https://www.cnblogs.com/superhedantou/p/5699201.html
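To make this concrete, here is a minimal sketch (not from the Spark source; the input path is hypothetical). Each action below submits one job, and because reduceByKey introduces a wide (shuffle) dependency, each job is split into a ShuffleMapStage followed by a ResultStage:
val words = sc.textFile("input.txt")        // hypothetical input path
  .flatMap(_.split(" "))
  .map(w => (w, 1))
val counts = words.reduceByKey(_ + _)       // wide dependency => stage boundary

counts.collect()  // action #1: submits a job via SparkContext.runJob
counts.first()    // action #2: submits a second job (already-computed shuffle stages can be skipped)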
Walkthrough
1. SparkContext.runJob
This method mainly delegates to DAGScheduler.runJob, handing the job over to the DAGScheduler for upper-level scheduling. Its main steps are:
(1) Clean the closure (an introductory link is included at the end of the original article)
(2) Call dagScheduler.runJob
(3) Call rdd.doCheckpoint()
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  // core call: hand the job to the DAGScheduler
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  // persist the RDD to reliable storage if checkpoint() was called on it
  rdd.doCheckpoint()
}
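For reference, this public runJob overload can also be called directly. The sketch below (illustrative only, the variable names are mine) computes a per-partition sum for just two of the partitions and collects the results on the driver through the resultHandler callback:
import org.apache.spark.TaskContext
import scala.collection.mutable.ArrayBuffer

val data = sc.parallelize(1 to 100, 4)
val partialSums = ArrayBuffer[Int]()

// run the job only on partitions 0 and 1; resultHandler is invoked on the
// driver once per finished partition with (partition index, result)
sc.runJob(
  data,
  (ctx: TaskContext, iter: Iterator[Int]) => iter.sum,
  Seq(0, 1),
  (index: Int, sum: Int) => partialSums += sum)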
2. DAGScheduler.runJob
This method calls submitJob to submit the job and then waits for its result:
def runJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): Unit = {
  val start = System.nanoTime
  // submit the job: what is actually submitted is the rdd plus func
  // submitJob posts to the event-loop thread, not over the network
  val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
  // wait for the job to finish and handle the result
  ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)
  waiter.completionFuture.value.get match {
    case scala.util.Success(_) =>
      logInfo("Job %d finished: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
    case scala.util.Failure(exception) =>
      logInfo("Job %d failed: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      // SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
      val callerStackTrace = Thread.currentThread().getStackTrace.tail
      exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
      throw exception
  }
}
3. DAGScheduler.submitJob
The core of this method is posting a job-submission event to the event-loop thread. Its main steps are:
(1) Validate the partitions
(2) Generate a new jobId for this job
(3) Post a JobSubmitted event via eventProcessLoop.post
def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // Check to make sure we are not launching a task on a partition that does not exist.
  val maxPartitions = rdd.partitions.length
  partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
    throw new IllegalArgumentException(
      "Attempting to access a non-existent partition: " + p + ". " +
        "Total number of partitions: " + maxPartitions)
  }
  // allocate a new jobId
  val jobId = nextJobId.getAndIncrement()
  if (partitions.size == 0) {
    // Return immediately if the job is running 0 tasks
    return new JobWaiter[U](this, jobId, 0, resultHandler)
  }
  assert(partitions.size > 0)
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  // post the event to the event-loop thread
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    SerializationUtils.clone(properties)))
  waiter
}
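The validation in step (1) means an out-of-range partition index fails fast on the driver, before any event is posted. A quick illustration, reusing the 4-partition `data` RDD from the earlier sketch:
// valid indices are 0..3, so partition 9 does not exist; this throws
// IllegalArgumentException("Attempting to access a non-existent partition: 9. ...")
sc.runJob(data, (iter: Iterator[Int]) => iter.size, Seq(9))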
4. EventLoop
At the heart of EventLoop is a thread-safe LinkedBlockingQueue. Submitting a job to the EventLoop really just adds a JobSubmitted event to that queue; when an event is taken off the queue, the loop calls back into DAGScheduler.onReceive(). That's right: the DAGScheduler posts an event to the event loop, and the event loop calls back into the DAGScheduler. The likely reason for this indirection is that it decouples posting events from processing them and serializes all event handling on a single thread, so the DAGScheduler can update its internal state without extra locking.
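To make the mechanism concrete, here is a minimal, simplified sketch of the same idea (not Spark's actual EventLoop class): post() enqueues an event, and a single background thread dequeues events and invokes onReceive for each one.
import java.util.concurrent.LinkedBlockingQueue

abstract class SimpleEventLoop[E](name: String) {
  private val eventQueue = new LinkedBlockingQueue[E]()
  @volatile private var stopped = false

  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      while (!stopped) {
        try {
          val event = eventQueue.take() // blocks until an event is available
          onReceive(event)              // every event is handled on this one thread
        } catch {
          case _: InterruptedException => // stop() interrupts the blocking take()
        }
      }
    }
  }

  def start(): Unit = eventThread.start()
  def stop(): Unit = { stopped = true; eventThread.interrupt() }
  def post(event: E): Unit = eventQueue.put(event)

  // subclasses (DAGSchedulerEventProcessLoop in Spark) implement the handling logic
  protected def onReceive(event: E): Unit
}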
5. DAGSchedulerEventProcessLoop.doOnReceive
The event loop's onReceive() simply delegates to doOnReceive, so we analyze doOnReceive directly. It pattern-matches on the event type and dispatches each event to the corresponding handler. Our event matches JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties), which is handled by dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties). The source is as follows:
private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
  // the case matched by the job submission we are tracing
  case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
    dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
  case MapStageSubmitted(jobId, dependency, callSite, listener, properties) =>
    dagScheduler.handleMapStageSubmitted(jobId, dependency, callSite, listener, properties)
  case StageCancelled(stageId, reason) =>
    dagScheduler.handleStageCancellation(stageId, reason)
  case JobCancelled(jobId, reason) =>
    dagScheduler.handleJobCancellation(jobId, reason)
  case JobGroupCancelled(groupId) =>
    dagScheduler.handleJobGroupCancelled(groupId)
  // remaining cases omitted
}
6. DAGScheduler.handleJobSubmitted
private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  var finalStage: ResultStage = null
  try {
    // New stage creation may throw an exception if, for example, jobs are run on a
    // HadoopRDD whose underlying HDFS files have been deleted.
    // Create the final stage.
    // 1. createResultStage also creates every parent stage it depends on, directly or transitively.
    // 2. Each created stage is associated with this jobId.
    finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
  } catch {
    case e: Exception =>
      logWarning("Creating new stage failed due to exception - job: " + jobId, e)
      listener.jobFailed(e)
      return
  }
  // create the ActiveJob for this jobId
  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
  // clear the cached RDD partition location info
  clearCacheLocs()
  logInfo("Got job %s (%s) with %d output partitions".format(
    job.jobId, callSite.shortForm, partitions.length))
  logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
  logInfo("Parents of final stage: " + finalStage.parents)
  logInfo("Missing parents: " + getMissingParentStages(finalStage))
  val jobSubmissionTime = clock.getTimeMillis()
  // record the job
  jobIdToActiveJob(jobId) = job
  activeJobs += job
  // set the active job on the final stage
  finalStage.setActiveJob(job)
  val stageIds = jobIdToStageIds(jobId).toArray
  // latestInfo records each stage's most recent attempt; the attempt id only advances
  // when a stage is resubmitted after a failure
  val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
  // broadcast on the listener bus: every listener registered for SparkListenerJobStart
  // will receive this event
  listenerBus.post(
    SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
  // submit the final stage (its parents are submitted recursively)
  submitStage(finalStage)
}
7. DAGScheduler.submitStage
This method calls itself recursively: as long as the current stage has missing parent stages, those parents are submitted first; once a stage has no missing parents, submitMissingTasks() is called to actually submit its tasks.
private def submitStage(stage: Stage) {
  // find the jobId associated with this stage; handleJobSubmitted created every stage
  // of this job and associated it with the jobId
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    // skip stages that have already been submitted (waiting, running, or failed)
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      if (missing.isEmpty) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        // no missing parents: submit this stage's TaskSet
        submitMissingTasks(stage, jobId.get)
      } else {
        for (parent <- missing) {
          // missing parent stages: submit them recursively first
          submitStage(parent)
        }
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}
8. DAGScheduler.submitMissingTasks
private def submitMissingTasks(stage: Stage, jobId: Int) {
  logDebug("submitMissingTasks(" + stage + ")")
  // First figure out the indexes of partition ids to compute.
  val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
  // Use the scheduling pool, job group, description, etc. from an ActiveJob associated
  // with this Stage
  val properties = jobIdToActiveJob(jobId).properties
  runningStages += stage
  // SparkListenerStageSubmitted should be posted before testing whether tasks are
  // serializable. If tasks are not serializable, a SparkListenerStageCompleted event
  // will be posted, which should always come after a corresponding SparkListenerStageSubmitted
  // event.
  stage match {
    case s: ShuffleMapStage =>
      outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
    case s: ResultStage =>
      outputCommitCoordinator.stageStart(
        stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
  }
  // compute task locality by calling getPreferredLocs for each partition
  val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
    stage match {
      case s: ShuffleMapStage =>
        partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
      case s: ResultStage =>
        partitionsToCompute.map { id =>
          val p = s.partitions(id)
          (id, getPreferredLocs(stage.rdd, p))
        }.toMap
    }
  } catch {
    case NonFatal(e) =>
      stage.makeNewStageAttempt(partitionsToCompute.size)
      listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
      abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }
  // start a new stage attempt and update the attempt record (_latestInfo)
  stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
  // broadcast the stage submission on the listener bus
  listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
  // (task creation code omitted here)
  // submit the TaskSet via taskScheduler.submitTasks
  if (tasks.size > 0) {
    // hand the TaskSet to the lower-level scheduler
    taskScheduler.submitTasks(new TaskSet(
      tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
    stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
  } else {
    // Because we posted SparkListenerStageSubmitted earlier, we should mark
    // the stage as completed here in case there are no tasks to run
    markStageAsFinished(stage, None)
    val debugString = stage match {
      case stage: ShuffleMapStage =>
        s"Stage ${stage} is actually done; " +
          s"(available: ${stage.isAvailable}," +
          s"available outputs: ${stage.numAvailableOutputs}," +
          s"partitions: ${stage.numPartitions})"
      case stage: ResultStage =>
        s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
    }
    logDebug(debugString)
    submitWaitingChildStages(stage)
  }
}
9. DAGScheduler.getPreferredLocs
private[spark]
def getPreferredLocs(rdd: RDD[_], partition: Int): Seq[TaskLocation] = {
  getPreferredLocsInternal(rdd, partition, new HashSet)
}
10. DAGScheduler.getPreferredLocsInternal
This is the method that actually computes locality information; through it the DAGScheduler can decide where a task should preferably run (host, executorId). Locality is looked up in the following order:
(1) The DAGScheduler's in-memory cache of partition locations, cacheLocs
(2) The RDD's own preferred locations, via rdd.preferredLocations
(3) For narrow dependencies, a recursive lookup of the locality of the parent RDD's corresponding partitions (see the sketch after the source below)
private def getPreferredLocsInternal(
    rdd: RDD[_],
    partition: Int,
    visited: HashSet[(RDD[_], Int)]): Seq[TaskLocation] = {
  // skip partitions we have already visited
  if (!visited.add((rdd, partition))) {
    // Nil has already been returned for previously visited partitions.
    return Nil
  }
  // 1. the DAGScheduler's cacheLocs structure caches the storage locations of each RDD
  //    partition; if this partition is cached, return those locations directly
  //    (when nothing is cached yet, getCacheLocs asks the BlockManagerMaster for block locations)
  val cached = getCacheLocs(rdd)(partition)
  if (cached.nonEmpty) {
    return cached
  }
  // 2. ask the RDD itself for its preferred locations
  val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList
  if (rddPrefs.nonEmpty) {
    return rddPrefs.map(TaskLocation(_))
  }
  // 3. for narrow dependencies, recurse into the parent RDD's corresponding partitions
  rdd.dependencies.foreach {
    case n: NarrowDependency[_] =>
      for (inPart <- n.getParents(partition)) {
        val locs = getPreferredLocsInternal(n.rdd, inPart, visited)
        if (locs != Nil) {
          return locs
        }
      }
    case _ =>
  }
  Nil
}
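A small experiment (hypothetical host names, not from the original article) that shows why step (3) is needed: an RDD created with explicit location preferences reports them, but a mapped child RDD has no locality of its own, so the DAGScheduler has to recover it by walking the narrow dependency back to the parent.
// makeRDD lets us attach preferred locations to each element's partition
val parent = sc.makeRDD(Seq(
  (1, Seq("host-a")),
  (2, Seq("host-b"))))
val child = parent.map(_ * 2)   // narrow dependency, no locality of its own

parent.preferredLocations(parent.partitions(0))  // Seq("host-a")
child.preferredLocations(child.partitions(0))    // Nil: getPreferredLocsInternal recovers
                                                 // "host-a" via the narrow dependency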
Summary
The DAGScheduler takes the information of the submitted job, divides it into stages, recursively creates and submits those stages, computes each task's preferred locality, and finally hands the resulting TaskSets to the TaskScheduler for lower-level scheduling.