The previous article covered Spark's deploy modes. I wrote this one out of curiosity: what actually happens once an action on an RDD is triggered?
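To make the walk-through concrete, here is a minimal driver program (the input path and job are made up purely for illustration) whose single action, count(), triggers everything described below:
import org.apache.spark.{SparkConf, SparkContext}

object ActionDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("action-demo").setMaster("local[2]"))
    // transformations are lazy; nothing has been scheduled yet
    val counts = sc.textFile("README.md")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _) // wide dependency -> shuffle -> stage boundary
    // count() is an action: it ends up in SparkContext.runJob(), which is where this article starts
    println(counts.count())
    sc.stop()
  }
}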
- Once all the components are in place, SparkContext.scala's runJob(), after passing through several layers of same-named overloads, eventually calls dagScheduler.runJob() (a sketch of how an action reaches this point follows the snippet)
def runJob[T, U: ClassTag](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
resultHandler: (Int, U) => Unit): Unit = {
...
dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
...
}
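For reference, this is roughly how an action such as count() reaches runJob() in the first place (paraphrased from RDD.scala, so treat it as a sketch rather than an exact copy):
// RDD.scala, roughly: count() is just a runJob over every partition, summing the per-partition sizes
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum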
- DAGScheduler.scala's runJob() calls submitJob(), which posts a JobSubmitted event via eventProcessLoop.post(); the JobWaiter it returns is what runJob() blocks on, as sketched after the snippet
def submitJob[T, U](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
callSite: CallSite,
resultHandler: (Int, U) => Unit,
properties: Properties): JobWaiter[U] = {
...
assert(partitions.size > 0)
val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
eventProcessLoop.post(JobSubmitted(
jobId, rdd, func2, partitions.toArray, callSite, waiter,
SerializationUtils.clone(properties)))
waiter
}
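Note that submitJob() returns the JobWaiter immediately; it is DAGScheduler.runJob() that blocks on it until the whole job finishes or fails. A simplified sketch of that blocking logic (not a verbatim copy of the 2.4 source):
def runJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): Unit = {
  val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
  // block the calling thread until every task of this job has completed
  ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)
  waiter.completionFuture.value.get match {
    case scala.util.Success(_) =>
      // the action's results have already been delivered through resultHandler
    case scala.util.Failure(exception) =>
      throw exception // surface the failure to the action's call site
  }
}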
- DAGScheduler.scala's onReceive() -> doOnReceive() then dispatches the event to handleJobSubmitted() (the EventLoop plumbing behind this is sketched after the snippet)
override def onReceive(event: DAGSchedulerEvent): Unit = {
val timerContext = timer.time()
try {
doOnReceive(event)
} finally {
timerContext.stop()
}
}
private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
...
}
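eventProcessLoop here is a DAGSchedulerEventProcessLoop, a subclass of the generic EventLoop in spark-core: post() simply enqueues the event, and a dedicated daemon thread drains the queue and calls the onReceive() shown above. A simplified sketch of EventLoop (not verbatim):
// org.apache.spark.util.EventLoop, simplified
private val eventQueue: BlockingQueue[E] = new LinkedBlockingDeque[E]()

private val eventThread = new Thread(name) {
  setDaemon(true)
  override def run(): Unit = {
    while (!stopped.get) {
      val event = eventQueue.take() // blocks until something is posted
      try {
        onReceive(event) // DAGSchedulerEventProcessLoop dispatches to handleJobSubmitted() etc.
      } catch {
        case NonFatal(e) => onError(e)
      }
    }
  }
}

def post(event: E): Unit = {
  eventQueue.put(event)
}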
- The source of DAGScheduler.scala's handleJobSubmitted() is as follows
private[scheduler] def handleJobSubmitted(jobId: Int,
finalRDD: RDD[_],
func: (TaskContext, Iterator[_]) => _,
partitions: Array[Int],
callSite: CallSite,
listener: JobListener,
properties: Properties) {
var finalStage: ResultStage = null
try {
//split the job into stages
finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
} catch {
//handle the exception
...
}
// Job submitted, clear internal data.
barrierJobIdToNumTasksCheckFailures.remove(jobId)
//create the ActiveJob that will actually run
val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
clearCacheLocs()
logInfo("Got job %s (%s) with %d output partitions".format(
job.jobId, callSite.shortForm, partitions.length))
logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
logInfo("Parents of final stage: " + finalStage.parents)
logInfo("Missing parents: " + getMissingParentStages(finalStage))
val jobSubmissionTime = clock.getTimeMillis()
jobIdToActiveJob(jobId) = job
activeJobs += job
finalStage.setActiveJob(job)
val stageIds = jobIdToStageIds(jobId).toArray
val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
listenerBus.post(
SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
//submit the final stage
submitStage(finalStage)
}
The stage-splitting logic rests on understanding wide and narrow dependencies. In essence, the source starts from the RDD on which the action was called and walks backwards through all its ancestor RDDs, collecting the list of ShuffleDependency objects; that part is implemented in getShuffleDependencies(). Then getOrCreateShuffleMapStage() -> createShuffleMapStage() -> getOrCreateParentStages() keeps recursing to build each earlier stage. I won't repeat that logic in detail here; there are plenty of write-ups online, though most of them describe a version other than Spark 2.4. If you are interested, https://www.jianshu.com/p/9f74e7f5e913 describes the 2.4 version. A small example of where the stage boundary falls is sketched below.
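Taking the word-count job from the top of the article as an illustration, reduceByKey introduces the only ShuffleDependency, so the lineage splits into exactly two stages:
// narrow dependencies (flatMap, map): parent and child partitions line up one-to-one,
// so these RDDs are pipelined into the same stage
//   textFile -> flatMap -> map                => ShuffleMapStage 0
// wide dependency (reduceByKey): a child partition may read from every parent partition,
// so a shuffle is needed and a new stage starts here
//   reduceByKey -> count                      => ResultStage 1
val counts = sc.textFile("README.md")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.count() // getShuffleDependencies() finds one ShuffleDependency, hence two stages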
5. submitStage(finalStage) looks as if it only submits the last stage, but inside the method it keeps walking backwards looking for parent stages
private def submitStage(stage: Stage) {
val jobId = activeJobForStage(stage)
if (jobId.isDefined) {
logDebug("submitStage(" + stage + ")")
if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
val missing = getMissingParentStages(stage).sortBy(_.id)
logDebug("missing: " + missing)
if (missing.isEmpty) {
logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
//no missing parent stages, so submit this stage's tasks
submitMissingTasks(stage, jobId.get)
} else {
//there are missing parent stages, so recursively call submitStage() on them first
for (parent <- missing) {
submitStage(parent)
}
waitingStages += stage
}
}
} else {
abortStage(stage, "No active job for stage " + stage.id, None)
}
}
- Next comes a chain of calls: submitMissingTasks(stage, jobId.get) -> taskScheduler.submitTasks() (the taskScheduler here is a TaskSchedulerImpl.scala) -> backend.reviveOffers(). The backend differs per deploy mode: in local mode it is a LocalSchedulerBackend.scala, while in standalone mode it is a StandaloneSchedulerBackend.scala
- StandaloneSchedulerBackend.scala's reviveOffers() is inherited from its parent class CoarseGrainedSchedulerBackend.scala. All it does is send the driver a ReviveOffers message
override def reviveOffers() {
driverEndpoint.send(ReviveOffers)
}
- When the receive() method sees the ReviveOffers message it calls makeOffers(), and makeOffers() in turn calls launchTasks() (a sketch of makeOffers() follows the snippet)
case ReviveOffers =>
makeOffers()
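For context, makeOffers() (simplified and slightly reorganized here) builds a WorkerOffer for every alive executor, asks TaskSchedulerImpl to assign pending tasks to those offers, and hands the resulting TaskDescriptions to launchTasks():
// CoarseGrainedSchedulerBackend.scala, simplified sketch
private def makeOffers() {
  // one WorkerOffer per alive executor, describing how many cores it has free
  val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
  val workOffers = activeExecutors.map { case (id, executorData) =>
    new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
  }.toIndexedSeq
  // TaskSchedulerImpl decides which pending tasks run on which offers
  val taskDescs = scheduler.resourceOffers(workOffers)
  if (taskDescs.nonEmpty) {
    launchTasks(taskDescs)
  }
}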
- Next, the serialized tasks are sent to the executors
// Launch tasks returned by a set of resource offers
private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
for (task <- tasks.flatten) {
...
//send the LaunchTask message to the executor
executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
}
}
}
- On the executor side, CoarseGrainedExecutorBackend.scala's receive() responds to the LaunchTask message. Why CoarseGrainedExecutorBackend? Look back at the previous article on how standalone mode starts up and you will find the answer
override def receive: PartialFunction[Any, Unit] = {
...
case LaunchTask(data) =>
if (executor == null) {
exitExecutor(1, "Received LaunchTask command but executor was null")
} else {
val taskDesc = TaskDescription.decode(data.value)
logInfo("Got assigned task " + taskDesc.taskId)
//run the task
executor.launchTask(this, taskDesc)
}
- Executor.scala's launchTask() runs a TaskRunner thread on the threadPool. TaskRunner is an inner class of Executor.scala
def launchTask(context: ExecutorBackend, taskDescription: TaskDescription): Unit = {
val tr = new TaskRunner(context, taskDescription)
runningTasks.put(taskDescription.taskId, tr)
threadPool.execute(tr)
}
- The run() method of the inner class TaskRunner is enormously long; by many industry standards the Spark authors' style would not pass review, but that does not stop Spark from being an excellent open source project.
override def run(): Unit = {
//a great deal of preparation work omitted here
...
val value = Utils.tryWithSafeFinally {
val res = task.run(
taskAttemptId = taskId,
attemptNumber = taskDescription.attemptNumber,
metricsSystem = env.metricsSystem)
threwException = false
res
} {
//cleanup block omitted
}
//a great deal of exception handling omitted here
...
}
- Task is an abstract class with two subclasses. For every stage except the last one, the task is a ShuffleMapTask.scala: its output is written to the BlockManager and a MapStatus object is returned. MapStatus only records where in the BlockManager the ShuffleMapTask's output was stored; it is not the computation result itself.
override def runTask(context: TaskContext): MapStatus = {
// deserialize the broadcast task binary to recover the RDD and its ShuffleDependency
...
var writer: ShuffleWriter[Any, Any] = null
try {
//get the ShuffleManager from SparkEnv; Hash, Sort Based, Tungsten-sort Based and custom shuffles are supported (shuffle will get its own article later)
val manager = SparkEnv.get.shuffleManager
//get a ShuffleWriter from the ShuffleManager
writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
//write the RDD's output through the writer to the file system; more detail when we cover shuffle and the BlockManager
writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
writer.stop(success = true).get
} catch {
//handle the exception
...
}
}
rdd.iterator() returns the computed records of the partition. Its doc comment reads: "Internal method to this RDD; will read from cache if applicable, or otherwise compute it." In other words, it is an internal RDD method that returns the cached data when a cache exists and computes the partition otherwise.
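The method itself is tiny; roughly (paraphrased from RDD.scala):
// RDD.scala, simplified
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  if (storageLevel != StorageLevel.NONE) {
    // the RDD was persisted/cached: ask the BlockManager first, compute only on a miss
    getOrCompute(split, context)
  } else {
    // not cached: compute the partition, or read it from a checkpoint if one exists
    computeOrReadCheckpoint(split, context)
  }
}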
14. For the last stage, the task is a ResultTask.scala, and its runTask() returns the computed result itself
override def runTask(context: TaskContext): U = {
// deserialize the broadcast task binary to recover the RDD and the user function
...
// compute the partition and return the result of applying func to it
func(context, rdd.iterator(partition, context))
}
- The result the executor sends back is handled differently depending on its size
15.1 Results larger than 1GB are dropped outright; the limit can be tuned via spark.driver.maxResultSize
15.2 Results between the direct-result threshold (1MB by default) and 1GB are stored in the BlockManager as an IndirectTaskResult keyed by a blockId, and only that blockId is sent to the driver. The 1MB lower bound is the minimum of spark.task.maxDirectResultSize (default 1MB) and spark.rpc.message.maxSize (default 128MB)
15.3 Results below that threshold (1MB by default) are sent straight back to the driver
- The result is returned to the driver via execBackend.statusUpdate(taskId, TaskState.FINISHED, serializedResult). If any exception is caught while the task runs on the executor, a FAILED or KILLED state is reported instead, depending on the situation. A sketch of the size check follows.
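This decision is made at the end of TaskRunner.run() in Executor.scala; a simplified sketch (not verbatim):
// Executor.scala, TaskRunner.run(), simplified sketch
val resultSize = serializedDirectResult.limit()
val serializedResult: ByteBuffer = {
  if (maxResultSize > 0 && resultSize > maxResultSize) {
    // 15.1: larger than spark.driver.maxResultSize -> drop the value, report only its size
    ser.serialize(new IndirectTaskResult[Any](TaskResultBlockId(taskId), resultSize))
  } else if (resultSize > maxDirectResultSize) {
    // 15.2: too big for a direct RPC reply -> park it in the BlockManager, ship only the blockId
    val blockId = TaskResultBlockId(taskId)
    env.blockManager.putBytes(blockId,
      new ChunkedByteBuffer(serializedDirectResult.duplicate()), StorageLevel.MEMORY_AND_DISK_SER)
    ser.serialize(new IndirectTaskResult[Any](blockId, resultSize))
  } else {
    // 15.3: small enough -> send the DirectTaskResult itself
    serializedDirectResult
  }
}
execBackend.statusUpdate(taskId, TaskState.FINISHED, serializedResult)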
- Back on the driver side, the receive() method of CoarseGrainedSchedulerBackend.scala (the parent class of StandaloneSchedulerBackend.scala) picks up the StatusUpdate message
override def receive: PartialFunction[Any, Unit] = {
case StatusUpdate(executorId, taskId, state, data) =>
scheduler.statusUpdate(taskId, state, data.value)
if (TaskState.isFinished(state)) {
executorDataMap.get(executorId) match {
case Some(executorInfo) =>
executorInfo.freeCores += scheduler.CPUS_PER_TASK
makeOffers(executorId)
case None =>
// Ignoring the update since we don't know about the executor.
logWarning(s"Ignored task status update ($taskId state $state) " +
s"from unknown executor with ID $executorId")
}
}
The scheduler.statusUpdate(taskId, state, data.value) here calls TaskSchedulerImpl.scala's statusUpdate(), which in turn calls taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData) (the failure branch is sketched after the snippet)
if (state == TaskState.FINISHED) {
taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData)
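statusUpdate() also has the symmetric branch for the FAILED/KILLED/LOST states mentioned in #15; simplified sketch:
// TaskSchedulerImpl.statusUpdate(), simplified sketch
if (TaskState.isFinished(state)) {
  cleanupTaskState(tid)
  taskSet.removeRunningTask(tid)
  if (state == TaskState.FINISHED) {
    taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData)
  } else if (Set(TaskState.FAILED, TaskState.KILLED, TaskState.LOST).contains(state)) {
    taskResultGetter.enqueueFailedTask(taskSet, tid, state, serializedData)
  }
}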
- The source of TaskResultGetter.scala's enqueueSuccessfulTask() is as follows
def enqueueSuccessfulTask(
taskSetManager: TaskSetManager,
tid: Long,
serializedData: ByteBuffer): Unit = {
//handle the finished task on a thread from the task-result-getter thread pool
getTaskResultExecutor.execute(new Runnable {
override def run(): Unit = Utils.logUncaughtExceptions {
try {
//first deserialize the result. A result is either a DirectTaskResult or an IndirectTaskResult, because the executor picks a different strategy depending on the result size -- see #15 of this article if you don't remember
val (result, size) = serializer.get().deserialize[TaskResult[_]](serializedData) match {
case directResult: DirectTaskResult[_] =>
if (!taskSetManager.canFetchMoreResults(serializedData.limit())) {
return
}
// deserialize "value" without holding any lock so that it won't block other threads.
// We should call it here, so that when it's called again in
// "TaskSetManager.handleSuccessfulTask", it does not need to deserialize the value.
directResult.value(taskResultSerializer.get())
(directResult, serializedData.limit())
case IndirectTaskResult(blockId, size) =>
if (!taskSetManager.canFetchMoreResults(size)) {
// dropped by executor if size is larger than maxResultSize
sparkEnv.blockManager.master.removeBlock(blockId)
return
}
logDebug("Fetching indirect task result for TID %s".format(tid))
scheduler.handleTaskGettingResult(taskSetManager, tid)
val serializedTaskResult = sparkEnv.blockManager.getRemoteBytes(blockId)
if (!serializedTaskResult.isDefined) {
/* We won't be able to get the task result if the machine that ran the task failed
* between when the task ended and when we tried to fetch the result, or if the
* block manager had to flush the result. */
scheduler.handleFailedTask(
taskSetManager, tid, TaskState.FINISHED, TaskResultLost)
return
}
val deserializedResult = serializer.get().deserialize[DirectTaskResult[_]](
serializedTaskResult.get.toByteBuffer)
// force deserialization of referenced value
deserializedResult.value(taskResultSerializer.get())
sparkEnv.blockManager.master.removeBlock(blockId)
(deserializedResult, size)
}
// Set the task result size in the accumulator updates received from the executors.
// We need to do this here on the driver because if we did this on the executors then
// we would have to serialize the result again after updating the size.
result.accumUpdates = result.accumUpdates.map { a =>
if (a.name == Some(InternalAccumulator.RESULT_SIZE)) {
val acc = a.asInstanceOf[LongAccumulator]
assert(acc.sum == 0L, "task result size should not have been set on the executors")
acc.setValue(size.toLong)
acc
} else {
a
}
}
//hand the successful task to the scheduler
scheduler.handleSuccessfulTask(taskSetManager, tid, result)
} catch {
case cnf: ClassNotFoundException =>
val loader = Thread.currentThread.getContextClassLoader
taskSetManager.abort("ClassNotFound with classloader: " + loader)
// Matching NonFatal so we don't catch the ControlThrowable from the "return" above.
case NonFatal(ex) =>
logError("Exception while getting task result", ex)
taskSetManager.abort("Exception while getting task result: %s".format(ex))
}
}
})
}
//handle the successful task (TaskSchedulerImpl.scala)
def handleSuccessfulTask(
taskSetManager: TaskSetManager,
tid: Long,
taskResult: DirectTaskResult[_]): Unit = synchronized {
//delegate to TaskSetManager.scala's handleSuccessfulTask()
taskSetManager.handleSuccessfulTask(tid, taskResult)
}
- Now for TaskSetManager.scala's handleSuccessfulTask()
/**
* Marks a task as successful and notifies the DAGScheduler that the task has ended.
*/
def handleSuccessfulTask(tid: Long, result: DirectTaskResult[_]): Unit = {
//a lot omitted here: mark the task as successfully finished, update failedExecutors, and so on
...
//the task is done; now the higher-level DAGScheduler takes over
sched.dagScheduler.taskEnded(tasks(index), Success, result.value(), result.accumUpdates, info)
//housekeeping for the case where every task in the taskSet has finished successfully
maybeFinishTaskSet()
}
- DAGScheduler.scala's taskEnded() posts a CompletionEvent, which doOnReceive() dispatches to handleTaskCompletion()
def taskEnded(
task: Task[_],
reason: TaskEndReason,
result: Any,
accumUpdates: Seq[AccumulatorV2[_, _]],
taskInfo: TaskInfo): Unit = {
eventProcessLoop.post(
CompletionEvent(task, reason, result, accumUpdates, taskInfo))
}
private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
case completion: CompletionEvent =>
dagScheduler.handleTaskCompletion(completion)
}
- Let's look at DAGScheduler.scala's handleTaskCompletion(). It is far too long, so we'll focus on two cases. First: if the finished task is a ShuffleMapTask, submitWaitingChildStages(shuffleStage) also has to be called to trigger the downstream stages (a sketch of it follows the snippet)
private[scheduler] def handleTaskCompletion(event: CompletionEvent) {
//a lot omitted
...
event.reason match {
case Success =>
task match {
case rt: ResultTask[_, _] =>
//the ResultTask case is covered in the next step
...
case smt: ShuffleMapTask =>
val shuffleStage = stage.asInstanceOf[ShuffleMapStage]
shuffleStage.pendingPartitions -= task.partitionId
val status = event.result.asInstanceOf[MapStatus]
val execId = status.location.executorId
logDebug("ShuffleMapTask finished on " + execId)
if (failedEpoch.contains(execId) && smt.epoch <= failedEpoch(execId)) {
logInfo(s"Ignoring possibly bogus $smt completion from executor $execId")
} else {
// The epoch of the task is acceptable (i.e., the task was launched after the most
// recent failure we're aware of for the executor), so mark the task's output as
// available.
mapOutputTracker.registerMapOutput(
shuffleStage.shuffleDep.shuffleId, smt.partitionId, status)
}
if (runningStages.contains(shuffleStage) && shuffleStage.pendingPartitions.isEmpty) {
markStageAsFinished(shuffleStage)
logInfo("looking for newly runnable stages")
logInfo("running: " + runningStages)
logInfo("waiting: " + waitingStages)
logInfo("failed: " + failedStages)
// This call to increment the epoch may not be strictly necessary, but it is retained
// for now in order to minimize the changes in behavior from an earlier version of the
// code. This existing behavior of always incrementing the epoch following any
// successful shuffle map stage completion may have benefits by causing unneeded
// cached map outputs to be cleaned up earlier on executors. In the future we can
// consider removing this call, but this will require some extra investigation.
// See https://github.com/apache/spark/pull/17955/files#r117385673 for more details.
mapOutputTracker.incrementEpoch()
clearCacheLocs()
if (!shuffleStage.isAvailable) {
// Some tasks had failed; let's resubmit this shuffleStage.
// TODO: Lower-level scheduler should also deal with this
logInfo("Resubmitting " + shuffleStage + " (" + shuffleStage.name +
") because some of its tasks had failed: " +
shuffleStage.findMissingPartitions().mkString(", "))
submitStage(shuffleStage)
} else {
markMapStageJobsAsFinished(shuffleStage)
//important: go on to submit the downstream child stages
submitWaitingChildStages(shuffleStage)
}
}
}
//other cases
...
}
}
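submitWaitingChildStages() itself is short: it picks every waiting stage whose parents include the stage that just finished and pushes it through the same submitStage() we saw earlier. Simplified sketch:
// DAGScheduler.scala, simplified sketch
private def submitWaitingChildStages(parent: Stage) {
  // stages that submitStage() parked in waitingStages because this parent was still missing
  val childStages = waitingStages.filter(_.parents.contains(parent)).toArray
  waitingStages --= childStages
  for (stage <- childStages.sortBy(_.firstJobId)) {
    submitStage(stage)
  }
}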
- If the finished task is a ResultTask, job.listener.taskSucceeded(rt.outputId, event.result) is called
private[scheduler] def handleTaskCompletion(event: CompletionEvent) {
//a lot omitted
...
event.reason match {
case Success =>
task match {
case rt: ResultTask[_, _] =>
// Cast to ResultStage here because it's part of the ResultTask
// TODO Refactor this out to a function that accepts a ResultStage
val resultStage = stage.asInstanceOf[ResultStage]
resultStage.activeJob match {
case Some(job) =>
//a lot omitted
...
try {
job.listener.taskSucceeded(rt.outputId, event.result)
} catch {
case e: Exception =>
// TODO: Perhaps we want to mark the resultStage as failed?
job.listener.jobFailed(new SparkDriverExecutionException(e))
}
}
case None =>
logInfo("Ignoring result from " + rt + " because its job has finished")
}
...
}
- Remember the JobWaiter that submitJob() returned back in step 2? job.listener.taskSucceeded(rt.outputId, event.result) is exactly a call to JobWaiter.scala's taskSucceeded(index: Int, result: Any)
override def taskSucceeded(index: Int, result: Any): Unit = {
// resultHandler call must be synchronized in case resultHandler itself is not thread safe.
synchronized {
//hand the result back to the resultHandler supplied by SparkContext
resultHandler(index, result.asInstanceOf[T])
}
if (finishedTasks.incrementAndGet() == totalTasks) {
jobPromise.success(())
}
}
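resultHandler is the callback SparkContext.runJob() passed down at the very beginning. For an action like collect() it just writes each partition's result into its slot of an array, so by the time JobWaiter has counted totalTasks successes the action already has its complete answer. Roughly (paraphrased, not verbatim):
// SparkContext.scala, simplified: the overload that collects one result per partition
def runJob[T, U: ClassTag](rdd: RDD[T], func: (TaskContext, Iterator[T]) => U, partitions: Seq[Int]): Array[U] = {
  val results = new Array[U](partitions.size)
  // this closure is the resultHandler that JobWaiter.taskSucceeded() ends up invoking, once per partition
  runJob[T, U](rdd, func, partitions, (index, res: U) => results(index) = res)
  results
}

// RDD.collect() then becomes nothing more than:
def collect(): Array[T] = {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}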