在前四篇博文中,我们分析了Job提交运行总流程的第一阶段Stage划分与提交,它又被细化为三个分阶段:
1、Job的调度模型与运行反馈;
2、Stage划分;
3、Stage提交:对应TaskSet的生成。
Stage划分与提交阶段主要是由DAGScheduler完成的,而DAGScheduler负责Job的逻辑调度,主要职责也即DAG图的分解,按照RDD间是否为shuffle dependency,将整个Job划分为一个个stage,并将每个stage转化为tasks的集合--TaskSet。
接下来我们要讲的第二阶段Task调度与执行,则是Spark中Job的物理调度,它实际上分为两个主要阶段:
1、Task调度;
2、Task运行。
下面,我们分析下Task的调度。我们知道,在第一阶段的末尾,stage被提交后,每个stage被转化为一组task的集合--TaskSet,而紧接着,则调用taskScheduler.submitTasks()提交这些tasks,而TaskScheduler的主要职责,则是负责Job物理调度阶段--Task调度。TaskScheduler为scala中的一个trait,你可以简单的把它理解为Java中的接口,目前它仅仅有一个实现类TaskSchedulerImpl。
TaskScheduler负责低层次任务的调度,每个TaskScheduler为一个特定的SparkContext调度tasks。这些调度器获取到由DAGScheduler为每个stage提交至他们的一组Tasks,并负责将这些tasks发送到集群,以执行它们,在它们失败时重试,并减轻掉队情况(类似MapReduce的推测执行原理吧,在这里留个疑问)。这些调度器返回一些事件events给DAGScheduler。其源码如下:
/**
* Low-level task scheduler interface, currently implemented exclusively by
* [[org.apache.spark.scheduler.TaskSchedulerImpl]].
* This interface allows plugging in different task schedulers. Each TaskScheduler schedules tasks
* for a single SparkContext. These schedulers get sets of tasks submitted to them from the
* DAGScheduler for each stage, and are responsible for sending the tasks to the cluster, running
* them, retrying if there are failures, and mitigating stragglers. They return events to the
* DAGScheduler.
*
*/
private[spark] trait TaskScheduler {
private val appId = "spark-application-" + System.currentTimeMillis
def rootPool: Pool
def schedulingMode: SchedulingMode
def start(): Unit
// Invoked after system has successfully initialized (typically in spark context).
// Yarn uses this to bootstrap allocation of resources based on preferred locations,
// wait for slave registrations, etc.
def postStartHook() { }
// Disconnect from the cluster.
def stop(): Unit
// Submit a sequence of tasks to run.
def submitTasks(taskSet: TaskSet): Unit
// Cancel a stage.
def cancelTasks(stageId: Int, interruptThread: Boolean)
// Set the DAG scheduler for upcalls. This is guaranteed to be set before submitTasks is called.
def setDAGScheduler(dagScheduler: DAGScheduler): Unit
// Get the default level of parallelism to use in the cluster, as a hint for sizing jobs.
def defaultParallelism(): Int
/**
* Update metrics for in-progress tasks and let the master know that the BlockManager is still
* alive. Return true if the driver knows about the given block manager. Otherwise, return false,
* indicating that the block manager should re-register.
*/
def executorHeartbeatReceived(execId: String, taskMetrics: Array[(Long, TaskMetrics)],
blockManagerId: BlockManagerId): Boolean
/**
* Get an application ID associated with the job.
*
* @return An application ID
*/
def applicationId(): String = appId
/**
* Process a lost executor
*/
def executorLost(executorId: String, reason: ExecutorLossReason): Unit
/**
* Get an application's attempt ID associated with the job.
*
* @return An application's Attempt ID
*/
def applicationAttemptId(): Option[String]
}
通过源码我们可以知道,TaskScheduler提供了实例化与销毁时必要的start()和stop()方法,并提供了提交Tasks与取消Tasks的submitTasks()和cancelTasks()方法,并且通过executorHeartbeatReceived()周期性的接收executor的心跳,更新运行中tasks的元信息,并让master知晓BlockManager仍然存活。
好了,结合源码,我们一步步来看吧。
首先,在DAGScheduler的submitMissingTasks()方法的最后,每个stage生成一组tasks后,即调用用TaskScheduler的submitTasks()方法提交task,代码如下:
// 利用taskScheduler.submitTasks()提交task
taskScheduler.submitTasks(new TaskSet(
tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
// 记录提交时间
stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
那么我们先来看下TaskScheduler的submitTasks()方法,在其实现类TaskSchedulerImpl中,代码如下:
override def submitTasks(taskSet: TaskSet) {
// 获取TaskSet中的tasks
val tasks = taskSet.tasks
logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
// 使用synchronized进行同步
this.synchronized {
// 创建TaskSetManager
val manager = createTaskSetManager(taskSet, maxTaskFailures)
// 获取taskSet对应的stageId
val stage = taskSet.stageId
// taskSetsByStageIdAndAttempt存储的是stageId->[taskSet.stageAttemptId->TaskSetManager]
// 更新taskSetsByStageIdAndAttempt,将上述对应关系存入
val stageTaskSets =
taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
stageTaskSets(taskSet.stageAttemptId) = manager
// 查看是否存在冲突的taskSet,如果存在,抛出IllegalStateException异常
val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
ts.taskSet != taskSet && !ts.isZombie
}
if (conflictingTaskSet) {
throw new IllegalStateException(s"more than one active taskSet for stage $stage:" +
s" ${stageTaskSets.toSeq.map{_._2.taskSet.id}.mkString(",")}")
}
// 将TaskSetManager添加到schedulableBuilder中
schedulableBuilder.addTaskSetManager(manager, manager.taskSet.