When Spark runs a user application, the Driver holds a TaskScheduler (the task scheduler), which is initialized as the Driver starts up. When the DAGScheduler submits the tasks of a stage (in submitMissingTasks), it hands them over to this TaskScheduler.
TaskScheduler: the task scheduler inside the Driver
During Driver execution, the DAGScheduler splits each Stage it has built into tasks, one per partition of the stage's final RDD, packages them into a TaskSet, and submits the TaskSet to the TaskScheduler. The TaskScheduler then wraps the TaskSet in a TaskSetManager and adds that manager to its scheduling queue.
if (tasks.nonEmpty) {
  logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
    s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
  taskScheduler.submitTasks(new TaskSet(
    tasks.toArray, stage.id, stage.latestInfo.attemptNumber, jobId, properties))
} else {
  // Because we posted SparkListenerStageSubmitted earlier, we should mark
  // the stage as completed here in case there are no tasks to run
  markStageAsFinished(stage, None)

  stage match {
    case stage: ShuffleMapStage =>
      logDebug(s"Stage ${stage} is actually done; " +
        s"(available: ${stage.isAvailable}," +
        s"available outputs: ${stage.numAvailableOutputs}," +
        s"partitions: ${stage.numPartitions})")
      markMapStageJobsAsFinished(stage)
    case stage: ResultStage =>
      logDebug(s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})")
  }
  submitWaitingChildStages(stage)
}
Wrapping the TaskSet into a TaskSetManager:
override def submitTasks(taskSet: TaskSet): Unit = {
  val tasks = taskSet.tasks
  logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
  this.synchronized {
    val manager = createTaskSetManager(taskSet, maxTaskFailures)
    val stage = taskSet.stageId
    val stageTaskSets =
      taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
    // Mark all the existing TaskSetManagers of this stage as zombie, as we are adding a new one.
    // This is necessary to handle a corner case. Let's say a stage has 10 partitions and has 2
    // TaskSetManagers: TSM1(zombie) and TSM2(active). TSM1 has a running task for partition 10
    // and it completes. TSM2 finishes tasks for partition 1-9, and thinks he is still active
    // because partition 10 is not completed yet. However, DAGScheduler gets task completion
    // events for all the 10 partitions and thinks the stage is finished. If it's a shuffle stage
    // and somehow it has missing map outputs, then DAGScheduler will resubmit it and create a
    // TSM3 for it. As a stage can't have more than one active task set managers, we must mark
    // TSM2 as zombie (it actually is).
    stageTaskSets.foreach { case (_, ts) =>
      ts.isZombie = true
    }
    // Register the new TaskSetManager and add it to the scheduling queue
    stageTaskSets(taskSet.stageAttemptId) = manager
    schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)

    if (!isLocal && !hasReceivedTask) {
      starvationTimer.scheduleAtFixedRate(new TimerTask() {
        override def run(): Unit = {
          if (!hasLaunchedTask) {
            logWarning("Initial job has not accepted any resources; " +
              "check your cluster UI to ensure that workers are registered " +
              "and have sufficient resources")
          } else {
            this.cancel()
          }
        }
      }, STARVATION_TIMEOUT_MS, STARVATION_TIMEOUT_MS)
    }
    hasReceivedTask = true
  }
  backend.reviveOffers()
}
The TaskScheduler then dispatches tasks to Executors according to its scheduling policy. Each Executor maintains an internal task thread pool; incoming tasks are handed to that pool, and worker threads pick them up and run them, as the sketch below illustrates.
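As a rough illustration of the executor side, here is a minimal sketch of this launch-and-run pattern. MiniExecutor and the simplified TaskDescription are hypothetical stand-ins that mirror the shape of Spark's Executor (which wraps each task in a TaskRunner and submits it to a thread pool), not the real implementation:

import java.util.concurrent.Executors

// Hypothetical, simplified stand-in for a serialized task description.
case class TaskDescription(taskId: Long, name: String)

class MiniExecutor {
  // The executor keeps a cached thread pool; tasks run concurrently on it.
  private val threadPool = Executors.newCachedThreadPool()

  // Each launched task is wrapped in a Runnable and handed to the pool,
  // mirroring Executor.launchTask -> TaskRunner -> threadPool.execute.
  def launchTask(task: TaskDescription): Unit = {
    val runner = new Runnable {
      override def run(): Unit = {
        println(s"running task ${task.taskId} (${task.name})")
        // ... deserialize the task, run it, report the result back ...
      }
    }
    threadPool.execute(runner)
  }
}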
The TaskScheduler supports two scheduling policies: FIFO and FAIR. The default is FIFO.
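To switch to FAIR scheduling, set spark.scheduler.mode when building the configuration. A minimal sketch (pool definitions can additionally be supplied via spark.scheduler.allocation.file, but that is optional):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Minimal sketch: enable FAIR scheduling instead of the default FIFO.
val conf = new SparkConf()
  .setAppName("fair-scheduling-demo")
  .set("spark.scheduler.mode", "FAIR") // FIFO is the default

val spark = SparkSession.builder().config(conf).getOrCreate()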
Failure retry and blacklisting:
When a task fails on an Executor, the failure is reported back to the TaskScheduler, which notifies the owning TaskSetManager. The TaskSetManager resubmits the task to the task pool for a retry and records the Executor on which it failed in a blacklist, so the retry will not be scheduled onto that Executor again. If the task still fails after the maximum number of retries, the whole application fails.
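The retry count and blacklisting behavior are controlled by configuration. A minimal sketch: spark.task.maxFailures and spark.blacklist.enabled apply to Spark 2.x/3.0; from Spark 3.1 on, blacklisting was renamed to "exclusion on failure":

import org.apache.spark.SparkConf

// Minimal sketch: tune failure retries and executor blacklisting.
val conf = new SparkConf()
  .set("spark.task.maxFailures", "4")      // retries per task before the job is failed (default 4)
  .set("spark.blacklist.enabled", "true")  // track failing executors/nodes and avoid them
  // From Spark 3.1 onward the equivalent switch is:
  // .set("spark.excludeOnFailure.enabled", "true")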