Spark Source Code Analysis, Part 5: Task Scheduling (1)


Licensed under CC 4.0 BY-SA

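In the previous four posts we analyzed the first phase of the overall job submission and execution flow, stage division and submission. That phase breaks down into three sub-phases:

1. The job scheduling model and run feedback;

2. Stage division;

3. Stage submission, which corresponds to generating a TaskSet.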

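Stage division and submission is handled mainly by DAGScheduler, which is responsible for the logical scheduling of a job. Its main duty is decomposing the DAG: depending on whether the dependency between two RDDs is a shuffle dependency, it splits the whole job into stages and converts each stage into a set of tasks, a TaskSet.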
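To make the shuffle-boundary rule concrete, here is a minimal sketch in Scala. It is not DAGScheduler's actual code: Node, Narrow, Shuffle, and the three helper functions are hypothetical names that only model a lineage graph and cut it wherever a shuffle dependency appears.

```scala
object StageSplitSketch {
  // Hypothetical stand-ins for an RDD and its dependencies (not Spark classes).
  sealed trait Dep { def parent: Node }
  final case class Narrow(parent: Node) extends Dep
  final case class Shuffle(parent: Node) extends Dep
  final case class Node(name: String, deps: List[Dep] = Nil)

  // Nodes that share a stage with `last`: follow narrow dependencies only.
  def stageMembers(last: Node): Set[String] =
    last.deps.foldLeft(Set(last.name)) {
      case (acc, Narrow(p))  => acc ++ stageMembers(p)
      case (acc, Shuffle(_)) => acc // a shuffle dependency ends the stage
    }

  // Final nodes of the parent stages: the nodes just behind each shuffle boundary.
  def parentStageEnds(last: Node): List[Node] =
    last.deps.flatMap {
      case Narrow(p)  => parentStageEnds(p)
      case Shuffle(p) => List(p)
    }.distinct

  // All stages needed to compute `last`, parent stages first.
  def allStages(last: Node): List[Set[String]] =
    (parentStageEnds(last).flatMap(allStages) :+ stageMembers(last)).distinct
}
```

For a lineage such as textFile -> map -> (shuffle) -> reduceByKey -> mapValues, allStages on the last node yields two groups: one holding textFile and map, the other holding reduceByKey and mapValues.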

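The second phase, which we cover next, is task scheduling and execution. This is the physical scheduling of a job in Spark, and it consists of two main phases:

1. Task scheduling;

2. Task execution.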

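Let's start with task scheduling. At the end of the first phase, once a stage has been submitted, it is converted into a set of tasks, a TaskSet, and taskScheduler.submitTasks() is then called to submit those tasks. TaskScheduler's main responsibility is the physical scheduling phase of a job, that is, task scheduling. TaskScheduler is a Scala trait, which you can loosely think of as a Java interface; at present it has only one implementation class, TaskSchedulerImpl.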

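TaskScheduler handles low-level task scheduling, and each TaskScheduler schedules tasks for one specific SparkContext. These schedulers receive the sets of tasks that DAGScheduler submits to them for each stage, and they are responsible for sending the tasks to the cluster, running them, retrying them on failure, and mitigating stragglers (presumably similar in principle to speculative execution in MapReduce; I will leave that as an open question for now). The schedulers return events to DAGScheduler. The source code is as follows: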

/**
 * Low-level task scheduler interface, currently implemented exclusively by
 * [[org.apache.spark.scheduler.TaskSchedulerImpl]].
 * This interface allows plugging in different task schedulers. Each TaskScheduler schedules tasks
 * for a single SparkContext. These schedulers get sets of tasks submitted to them from the
 * DAGScheduler for each stage, and are responsible for sending the tasks to the cluster, running
 * them, retrying if there are failures, and mitigating stragglers. They return events to the
 * DAGScheduler.
 * 
 */
private[spark] trait TaskScheduler {

  private val appId = "spark-application-" + System.currentTimeMillis

  def rootPool: Pool

  def schedulingMode: SchedulingMode

  def start(): Unit

  // Invoked after system has successfully initialized (typically in spark context).
  // Yarn uses this to bootstrap allocation of resources based on preferred locations,
  // wait for slave registrations, etc.
  def postStartHook() { }

  // Disconnect from the cluster.
  def stop(): Unit

  // Submit a sequence of tasks to run.
  def submitTasks(taskSet: TaskSet): Unit

  // Cancel a stage.
  def cancelTasks(stageId: Int, interruptThread: Boolean)

  // Set the DAG scheduler for upcalls. This is guaranteed to be set before submitTasks is called.
  def setDAGScheduler(dagScheduler: DAGScheduler): Unit

  // Get the default level of parallelism to use in the cluster, as a hint for sizing jobs.
  def defaultParallelism(): Int

  /**
   * Update metrics for in-progress tasks and let the master know that the BlockManager is still
   * alive. Return true if the driver knows about the given block manager. Otherwise, return false,
   * indicating that the block manager should re-register.
   */
  def executorHeartbeatReceived(execId: String, taskMetrics: Array[(Long, TaskMetrics)],
    blockManagerId: BlockManagerId): Boolean

  /**
   * Get an application ID associated with the job.
   *
   * @return An application ID
   */
  def applicationId(): String = appId

  /**
   * Process a lost executor
   */
  def executorLost(executorId: String, reason: ExecutorLossReason): Unit

  /**
   * Get an application's attempt ID associated with the job.
   *
   * @return An application's Attempt ID
   */
  def applicationAttemptId(): Option[String]

}

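From the source we can see that TaskScheduler provides the start() and stop() methods needed at startup and shutdown, along with submitTasks() and cancelTasks() for submitting and cancelling tasks. Through executorHeartbeatReceived() it periodically receives executor heartbeats, updates the metrics of in-progress tasks, and lets the master know that the BlockManager is still alive.

With that in place, let's walk through the source step by step.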
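As a rough illustration of that contract, the sketch below reduces it to the lifecycle, submit/cancel, and heartbeat methods. Everything in it is hypothetical: MiniTaskSet, MiniTaskScheduler, and InMemoryScheduler are made-up names, the heartbeat is keyed by executor id only, and nothing is actually executed. It is a toy model, not how TaskSchedulerImpl works.

```scala
import scala.collection.mutable

object MiniSchedulerSketch {
  // Hypothetical, heavily simplified stand-in for a TaskSet.
  final case class MiniTaskSet(stageId: Int, taskCount: Int)

  // Hypothetical trait keeping only lifecycle, submit/cancel, and heartbeat methods.
  trait MiniTaskScheduler {
    def start(): Unit
    def stop(): Unit
    def submitTasks(taskSet: MiniTaskSet): Unit
    def cancelTasks(stageId: Int): Unit
    // Returns false when the executor is unknown and should re-register.
    def executorHeartbeatReceived(execId: String): Boolean
  }

  // A toy in-memory implementation; it only tracks state and runs nothing.
  final class InMemoryScheduler(knownExecutors: Set[String]) extends MiniTaskScheduler {
    private var started = false
    private val activeByStage = mutable.Map.empty[Int, MiniTaskSet]
    private val heartbeats = mutable.Map.empty[String, Int].withDefaultValue(0)

    def start(): Unit = { started = true }

    def stop(): Unit = { started = false; activeByStage.clear() }

    def submitTasks(taskSet: MiniTaskSet): Unit = {
      require(started, "scheduler must be started before tasks are submitted")
      activeByStage(taskSet.stageId) = taskSet
    }

    def cancelTasks(stageId: Int): Unit = { activeByStage.remove(stageId); () }

    def executorHeartbeatReceived(execId: String): Boolean =
      if (knownExecutors.contains(execId)) { heartbeats(execId) += 1; true }
      else false

    // Inspection helpers for this sketch only.
    def activeStages: Set[Int] = activeByStage.keySet.toSet
    def heartbeatCount(execId: String): Int = heartbeats(execId)
  }
}
```

Keeping the trait separate from its single implementation mirrors the split between TaskScheduler and TaskSchedulerImpl: callers such as a DAG-level scheduler depend only on the trait.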

First, at the end of DAGScheduler's submitMissingTasks() method, once a stage has generated its set of tasks, TaskScheduler's submitTasks() method is called to submit them. The code is as follows:

      // Submit the tasks via taskScheduler.submitTasks()
      taskScheduler.submitTasks(new TaskSet(
        tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
      // Record the submission time
      stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
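The call above bundles a stage's tasks with the stage id, the attempt id, the job id, and the job properties. A minimal sketch of that bundling is shown here. SimpleTask, SimpleTaskSet, and buildTaskSet are hypothetical names; treating the job id as a scheduling priority and deriving the id as "stageId.attemptId" are assumptions of this sketch rather than statements about the real TaskSet class, and the properties argument is left out.

```scala
object TaskSetSketch {
  // Hypothetical stand-in for a task: just a stage id and a partition index.
  final case class SimpleTask(stageId: Int, partition: Int)

  // Hypothetical simplified task set, following the argument order of the call above.
  // Using the job id as a priority and this id format are assumptions of the sketch.
  final case class SimpleTaskSet(
      tasks: Vector[SimpleTask],
      stageId: Int,
      stageAttemptId: Int,
      priority: Int) {
    val id: String = s"$stageId.$stageAttemptId"
  }

  // One task per partition that still needs computing, bundled into a single set.
  def buildTaskSet(stageId: Int, attemptId: Int, jobId: Int, missing: Seq[Int]): SimpleTaskSet =
    SimpleTaskSet(missing.map(p => SimpleTask(stageId, p)).toVector, stageId, attemptId, jobId)
}
```

Grouping by stage and attempt gives the scheduler a single unit to track, retry, or cancel, instead of handling individual tasks one at a time.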

So let's first look at TaskScheduler's submitTasks() method. In its implementation class TaskSchedulerImpl, the code is as follows:

override def submitTasks(taskSet: TaskSet) {

    // Get the tasks from the TaskSet
    val tasks = taskSet.tasks
    logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")

    // Synchronize on this scheduler instance
    this.synchronized {

      // Create the TaskSetManager
      val manager = createTaskSetManager(taskSet, maxTaskFailures)

      // Get the stageId that the taskSet belongs to
      val stage = taskSet.stageId

      // taskSetsByStageIdAndAttempt maps stageId -> [taskSet.stageAttemptId -> TaskSetManager]
      // Update taskSetsByStageIdAndAttempt with the mapping for this taskSet
      val stageTaskSets =
        taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
      stageTaskSets(taskSet.stageAttemptId) = manager

      // Check for a conflicting taskSet; if one exists, throw an IllegalStateException
      val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
        ts.taskSet != taskSet && !ts.isZombie
      }
      if (conflictingTaskSet) {
        throw new IllegalStateException(s"more than one active taskSet for stage $stage:" +
          s" ${stageTaskSets.toSeq.map{_._2.taskSet.id}.mkString(",")}")
      }

      // Add the TaskSetManager to the schedulableBuilder
      schedulableBuilder.addTaskSetManager(manager, manager.taskSet.