Once the Master has finished resource allocation, the conditions for running an application are in place, so the next thing to unpack is how an app actually runs. As mentioned earlier, initializing a SparkContext creates both a TaskScheduler and a DAGScheduler. The TaskScheduler's main responsibilities are building the task scheduling pool that performs task-level scheduling and contacting the Master to register the application; the DAGScheduler is the higher-level, stage-oriented scheduler responsible for splitting an application's jobs into stages. Let's walk through the DAGScheduler source.
The path is core\src\main\scala\org\apache\spark\scheduler\DAGScheduler.scala
But we won't jump straight into that class. As just noted, the DAGScheduler is created when the SparkContext is initialized, so we start from the runJob() method in SparkContext.
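Before reading the source, here is a minimal, hypothetical driver program (the app name, master URL, and data are invented for illustration); its count() action is what eventually lands in the runJob() method shown below:

import org.apache.spark.{SparkConf, SparkContext}

object RunJobDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("runJob-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(1 to 100, 4)
    // count() is an action: under the hood it calls sc.runJob(...) over all 4 partitions
    println(rdd.count())
    sc.stop()
  }
}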
/**
 * Run a function on a given set of partitions in an RDD and pass the results to the given
 * handler function. This is the main entry point for all actions in Spark.
 *
 * @param rdd target RDD to run tasks on
 * @param func a function to run on each partition of the RDD
 * @param partitions set of partitions to run on; some jobs may not want to compute on all
 *   partitions of the target RDD, e.g. for operations like `first()`
 * @param resultHandler callback to pass each result to
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  // Delegates straight to dagScheduler.runJob()
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
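Every action funnels into this method. For instance, RDD.count() (in org.apache.spark.rdd.RDD) is essentially a one-liner over a runJob overload that computes every partition; quoting it roughly from memory:

// count() runs getIteratorSize on each partition and sums the per-partition sizes
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum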
Before looking at dagScheduler.runJob(), let's first read the official class comment on DAGScheduler to get an overall picture, and then dive into the individual methods.
/**
 * The high-level scheduling layer that implements stage-oriented scheduling. It computes a DAG of
 * stages for each job, keeps track of which RDDs and stage outputs are materialized, and finds a
 * minimal schedule to run the job. It then submits stages as TaskSets to an underlying
 * TaskScheduler implementation that runs them on the cluster. A TaskSet contains fully independent
 * tasks that can run right away based on the data that's already on the cluster (e.g. map output
 * files from previous stages), though it may fail if this data becomes unavailable.
 *
 * In addition to coming up with a DAG of stages, the DAGScheduler also determines the preferred
 * locations to run each task on, based on the current cache status, and passes these to the
 * low-level TaskScheduler. Furthermore, it handles failures due to shuffle output files being
 * lost, in which case old stages may need to be resubmitted. Failures *within* a stage that are
 * not caused by shuffle file loss are handled by the TaskScheduler, which will retry each task
 * a small number of times before cancelling the whole stage.
 */
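To make the stage concept concrete, here is a small hypothetical word-count job (the input path and transformations are invented for illustration). It contains exactly one shuffle, so the DAGScheduler builds one ShuffleMapStage feeding one ResultStage:

// Everything up to the shuffle boundary becomes the ShuffleMapStage;
// the post-shuffle part plus collect() becomes the ResultStage.
val counts = sc.textFile("hdfs://...")   // narrow lineage, ShuffleMapStage
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                    // ShuffleDependency => stage boundary
  .collect()                             // action: triggers runJob on the ResultStage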
Next, into DAGScheduler's runJob() method:
/**
 * Run an action job on the given RDD and pass all the results to the resultHandler function as
 * they arrive.
 */
def runJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): Unit = {
  val start = System.nanoTime
  // The core is the call to submitJob()
  val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
  ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)
}
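The excerpt above is trimmed. In the actual source, after awaitReady returns, runJob inspects the JobWaiter's outcome and rethrows any failure to the caller; from memory, the tail of the method looks roughly like this:

waiter.completionFuture.value.get match {
  case scala.util.Success(_) =>
    logInfo("Job %d finished: %s, took %f s".format(
      waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
  case scala.util.Failure(exception) =>
    logInfo("Job %d failed: %s, took %f s".format(
      waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
    // attach the caller's stack trace so the user sees where the action was invoked
    val callerStackTrace = Thread.currentThread().getStackTrace.tail
    exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
    throw exception
}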
The submitJob() method:
def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // Make sure we are not launching a task on a partition that does not exist
  val maxPartitions = rdd.partitions.length
  partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
    throw new IllegalArgumentException()
  }
  // Allocate a new job id from the atomic counter
  val jobId = nextJobId.getAndIncrement()
  assert(partitions.size > 0)
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  // Post a JobSubmitted event onto the DAGScheduler's event loop
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    SerializationUtils.clone(properties)))
  waiter
}
JobSubmitted is not a method but an event (a case class). submitJob() posts it onto the DAGScheduler's event loop, and when the loop receives it, it dispatches to the corresponding handler in DAGScheduler, handleJobSubmitted().
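The dispatch happens in DAGSchedulerEventProcessLoop.doOnReceive(); quoting the relevant case roughly from memory:

private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
  // every JobSubmitted event posted by submitJob() ends up here
  case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
    dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
  // ... other DAGSchedulerEvent cases omitted ...
}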
private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  var finalStage: ResultStage = null
  try {
    // First, create a ResultStage from the final RDD that triggered the job
    finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
  } catch {
    // Stage creation can throw, e.g. when the job runs on a HadoopRDD whose underlying
    // HDFS files have since been deleted; in that case mark the job failed and return
    case e: Exception =>
      logWarning("Creating new stage failed due to exception - job: " + jobId, e)
      listener.jobFailed(e)
      return
  }
  // Job submitted, clear internal data.
  barrierJobIdToNumTasksCheckFailures.remove(jobId)
  // Create a job from the finalStage
  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
  /**
   * cacheLocs caches the location info of every cached RDD partition, mapping a cached
   * partition to a sequence of locations. Why a sequence per partition? Because a
   * partition's block may be replicated, so it can live in the BlockManager of more
   * than one node.
   */
  // Clear all cached RDD location information
  clearCacheLocs()
  // Register the job in the DAGScheduler's in-memory bookkeeping
  val jobSubmissionTime = clock.getTimeMillis()
  jobIdToActiveJob(jobId) = job
  activeJobs += job
  finalStage.setActiveJob(job)
  val stageIds = jobIdToStageIds(jobId).toArray
  val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
  listenerBus.post(
    SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
  // Important!!! The stage submission algorithm
  submitStage(finalStage)
}
handleJobSubmitted() is where DAGScheduler scheduling begins. Submitting a job boils down to four steps:
1. Create a finalStage from the last RDD that triggered the job (stages come in two kinds: the job's last stage is a ResultStage, every other stage is a ShuffleMapStage); see the createResultStage() sketch after this list.
2. Create a job from the finalStage (jobs are delimited by action operators).
3. Add the job to the DAGScheduler's in-memory bookkeeping.
4. Recursively submit the stages.
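createResultStage() itself is short. Roughly (quoted from memory, with the barrier-stage sanity checks omitted), it builds the parent stages first and then wraps the final RDD into a ResultStage:

private def createResultStage(
    rdd: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    jobId: Int,
    callSite: CallSite): ResultStage = {
  // build (or reuse) the ShuffleMapStages this RDD depends on
  val parents = getOrCreateParentStages(rdd, jobId)
  val id = nextStageId.getAndIncrement()
  val stage = new ResultStage(id, rdd, func, partitions, parents, jobId, callSite)
  stageIdToStage(id) = stage
  updateJobIdStageIdMaps(jobId, stage)
  stage
}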
Next is Spark's stage division algorithm, submitStage():
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage) // get the jobId this stage belongs to
  if (jobId.isDefined) {
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      // Get the parent stages of the current stage that are still missing
      val missing = getMissingParentStages(stage).sortBy(_.id)
      if (missing.isEmpty) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        // No missing parents: everything this stage depends on is already available,
        // so submit this stage's tasks directly
        submitMissingTasks(stage, jobId.get)
      } else {
        // Core!!! If there are missing parent stages, recursively submit each of them first
        for (parent <- missing) {
          submitStage(parent)
        }
        waitingStages += stage // and park the current stage in the waiting queue
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}
Let's see how the stage split is actually computed, i.e. the getMissingParentStages() method.
/**
 * This is an iterative traversal of the RDD dependency graph using an explicit stack
 * (Spark Streaming divides stages the same way). It walks up through the dependencies
 * until it finds the minimal set of stages that actually still need to be computed.
 */
// Takes a stage and returns a list of stages: all of that stage's missing parent stages
private def getMissingParentStages(stage: Stage): List[Stage] = {
  val missing = new HashSet[Stage]
  val visited = new HashSet[RDD[_]]
  // Maintain the stack manually so a long lineage doesn't overflow the call stack via recursion
  val waitingForVisit = new ArrayStack[RDD[_]]
  def visit(rdd: RDD[_]) { // visit a single RDD
    if (!visited(rdd)) {
      visited += rdd
      val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
      if (rddHasUncachedPartitions) {
        // Only if the RDD has uncached partitions do we need to look at its dependencies:
        // fully cached partitions can be read directly, so their ancestors need not be recomputed
        for (dep <- rdd.dependencies) {
          dep match {
            case shufDep: ShuffleDependency[_, _, _] =>
              // A wide (shuffle) dependency marks a stage boundary, so get or create a
              // ShuffleMapStage for it (every stage except the final ResultStage is a ShuffleMapStage)
              val mapStage = getOrCreateShuffleMapStage(shufDep, stage.firstJobId)
              if (!mapStage.isAvailable) {
                missing += mapStage // collect it as a missing parent stage
              }
            case narrowDep: NarrowDependency[_] =>
              // A narrow dependency stays in the same stage; push its RDD onto the stack
              waitingForVisit.push(narrowDep.rdd)
          }
        }
      }
    }
  }
  waitingForVisit.push(stage.rdd) // start from this stage's own RDD
  // Keep visiting while there are RDDs left on the stack
  while (waitingForVisit.nonEmpty) {
    // Pop and visit one RDD at a time. Note that visiting an RDD may push the RDDs behind its
    // narrow dependencies back onto waitingForVisit, so the whole narrow-dependency chain
    // belonging to this stage gets walked.
    visit(waitingForVisit.pop())
  }
  missing.toList // return the missing parent stages
}
To sum up: submitStage() walks the dependency graph upward. Starting from the ResultStage, it traverses the dependencies of that stage's RDDs; a wide (shuffle) dependency marks the boundary of a parent stage, while narrow dependencies stay within the same stage. submitStage() is then called recursively on each parent stage until stages with no missing parents are reached and submitted.
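For completeness, the getOrCreateShuffleMapStage() helper referenced above roughly looks like this (quoted from memory): it reuses the ShuffleMapStage if one already exists for the shuffle, otherwise it creates stages for any missing ancestor shuffle dependencies and then for this one:

private def getOrCreateShuffleMapStage(
    shuffleDep: ShuffleDependency[_, _, _],
    firstJobId: Int): ShuffleMapStage = {
  shuffleIdToMapStage.get(shuffleDep.shuffleId) match {
    case Some(stage) =>
      stage // a stage already exists for this shuffle; reuse it
    case None =>
      // Create stages for all missing ancestor shuffle dependencies first
      getMissingAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
        if (!shuffleIdToMapStage.contains(dep.shuffleId)) {
          createShuffleMapStage(dep, firstJobId)
        }
      }
      // Finally, create a stage for the given shuffle dependency itself
      createShuffleMapStage(shuffleDep, firstJobId)
  }
}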
Recap of the job-trigger flow and the stage division algorithm (the key methods, executed top to bottom):
-> DAGScheduler.runJob -> submitJob(rdd, func, partitions, callSite, resultHandler, properties)
-> DAGScheduler.eventProcessLoop.post(JobSubmitted(
jobId, rdd, func2, partitions.toArray, callSite, waiter,
SerializationUtils.clone(properties)))
-> DAGSchedulerEventProcessLoop -> case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) ->
-> DAGScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
-> DAGScheduler.createResultStage(finalRDD, func, partitions, jobId, callSite) (creates the ResultStage)
-> DAGScheduler.submitStage(finalStage) (recursively builds and submits the stage DAG)
-> DAGScheduler.getMissingParentStages(stage)
-> DAGScheduler.submitMissingTasks(stage, jobId.get)