The previous article covered Spark's deploy modes. I wrote this one out of curiosity: what actually happens once an action on an RDD is triggered?
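To make the walk-through concrete, here is a minimal driver program (the input path and job are made up purely for illustration) whose single action, count(), triggers everything described below:
import org.apache.spark.{SparkConf, SparkContext}

object ActionDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("action-demo").setMaster("local[2]"))
    // transformations are lazy; nothing has been scheduled yet
    val counts = sc.textFile("README.md")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _) // wide dependency -> shuffle -> stage boundary
    // count() is an action: it ends up in SparkContext.runJob(), which is where this article starts
    println(counts.count())
    sc.stop()
  }
}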
- Once all the components are in place, SparkContext.scala's runJob(), after passing through several layers of same-named overloads, eventually calls dagScheduler.runJob() (a sketch of how an action reaches this point follows the snippet)
def runJob[T, U: ClassTag](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
resultHandler: (Int, U) => Unit): Unit = {
...
dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
...
}
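For reference, this is roughly how an action such as count() reaches runJob() in the first place (paraphrased from RDD.scala, so treat it as a sketch rather than an exact copy):
// RDD.scala, roughly: count() is just a runJob over every partition, summing the per-partition sizes
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum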
- DAGScheduler.scala's runJob() calls submitJob(), which posts a JobSubmitted event via eventProcessLoop.post(); the JobWaiter it returns is what runJob() blocks on, as sketched after the snippet
def submitJob[T, U](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
callSite: CallSite,
resultHandler: (Int, U) => Unit,
properties: Properties): JobWaiter[U] = {
...
assert(partitions.size > 0)
val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
eventProcessLoop.post(JobSubmitted(
jobId, rdd, func2, partitions.toArray, callSite, waiter,
SerializationUtils.clone(properties)))
waiter
}
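Note that submitJob() returns the JobWaiter immediately; it is DAGScheduler.runJob() that blocks on it until the whole job finishes or fails. A simplified sketch of that blocking logic (not a verbatim copy of the 2.4 source):
def runJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): Unit = {
  val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
  // block the calling thread until every task of this job has completed
  ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)
  waiter.completionFuture.value.get match {
    case scala.util.Success(_) =>
      // the action's results have already been delivered through resultHandler
    case scala.util.Failure(exception) =>
      throw exception // surface the failure to the action's call site
  }
}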
- DAGScheduler.scala's onReceive() -> doOnReceive() then dispatches the event to handleJobSubmitted() (the EventLoop plumbing behind this is sketched after the snippet)
override def onReceive(event: DAGSchedulerEvent): Unit = {
val timerContext = timer.time()
try {
doOnReceive(event)
} finally {
timerContext.stop()
}
}
private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
...
}
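eventProcessLoop here is a DAGSchedulerEventProcessLoop, a subclass of the generic EventLoop in spark-core: post() simply enqueues the event, and a dedicated daemon thread drains the queue and calls the onReceive() shown above. A simplified sketch of EventLoop (not verbatim):
// org.apache.spark.util.EventLoop, simplified
private val eventQueue: BlockingQueue[E] = new LinkedBlockingDeque[E]()

private val eventThread = new Thread(name) {
  setDaemon(true)
  override def run(): Unit = {
    while (!stopped.get) {
      val event = eventQueue.take() // blocks until something is posted
      try {
        onReceive(event) // DAGSchedulerEventProcessLoop dispatches to handleJobSubmitted() etc.
      } catch {
        case NonFatal(e) => onError(e)
      }
    }
  }
}

def post(event: E): Unit = {
  eventQueue.put(event)
}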
- The source of DAGScheduler.scala's handleJobSubmitted() is as follows
private[scheduler] def handleJobSubmitted(jobId: Int,
finalRDD: RDD[_],
func: (TaskContext, Iterator[_]) => _,
partitions: Array[Int],
callSite: CallSite,
listener: JobListener,
properties: Properties) {
var finalStage: ResultStage = null
try {
//split the job into stages
finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
} catch {
//handle the exception
...
}
// Job submitted, clear internal data.
barrierJobIdToNumTasksCheckFailures.remove(jobId)
//create the ActiveJob that will actually run
val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
clearCacheLocs()
logInfo("Got job %s (%s) with %d output partitions".format(
job.jobId, callSite.shortForm, partitions.length))
logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
logInfo("Parents of final stage: " + finalStage.parents)
logInfo("Missing parents: " + getMissingParentStages(finalStage))
val jobSubmissionTime = clock.getTimeMillis()
jobIdToActiveJob(jobId) = job
activeJobs += job
finalStage.setActiveJob(job)
val stageIds = jobIdToStageIds(jobId).toArray
val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
listenerBus.post(
SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
//submit the final stage
submitStage(finalStage)
}
The stage-splitting logic rests on understanding wide and narrow dependencies. In essence, the source starts from the RDD on which the action was called and walks backwards through all its ancestor RDDs, collecting the list of ShuffleDependency objects; that part is implemented in getShuffleDependencies(). Then getOrCreateShuffleMapStage() -> createShuffleMapStage() -> getOrCreateParentStages() keeps recursing to build each earlier stage. I won't repeat that logic in detail here; there are plenty of write-ups online, though most of them describe a version other than Spark 2.4. If you are interested, https://www.jianshu.com/p/9f74e7f5e913 describes the 2.4 version. A small example of where the stage boundary falls is sketched below.
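Taking the word-count job from the top of the article as an illustration, reduceByKey introduces the only ShuffleDependency, so the lineage splits into exactly two stages:
// narrow dependencies (flatMap, map): parent and child partitions line up one-to-one,
// so these RDDs are pipelined into the same stage
//   textFile -> flatMap -> map                => ShuffleMapStage 0
// wide dependency (reduceByKey): a child partition may read from every parent partition,
// so a shuffle is needed and a new stage starts here
//   reduceByKey -> count                      => ResultStage 1
val counts = sc.textFile("README.md")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.count() // getShuffleDependencies() finds one ShuffleDependency, hence two stages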
5. submitStage(finalStage) looks as if it only submits the last stage, but inside the method it keeps walking backwards looking for parent stages
private def submitStage(stage: Stage) {
val jobId = activeJobForStage(stage)
if (jobId.isDefined) {
logDebug("submitStage(" + stage + ")")
if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
val missing = getMissingParentStages(stage).sortBy(_.id)
logDebug("missing: " + missing)
if (missing.isEmpty) {
logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
//no missing parent stages, so submit this stage's tasks
submitMissingTasks(stage, jobId.get)
} else {
//there are missing parent stages, so recursively call submitStage() on them first
for (parent <- missing) {
submitStage(parent)
}
waitingStages += stage
}
}
} else {
abortStage(stage, "No active job for stage " + stage.id, None)
}
}
- Next comes a chain of calls: submitMissingTasks(stage, jobId.get) -> taskScheduler.submitTasks() (the taskScheduler here is a TaskSchedulerImpl.scala) -> backend.reviveOffers(). The backend differs per deploy mode: in local mode it is a LocalSchedulerBackend.scala, while in standalone mode it is a StandaloneSchedulerBackend.scala
- StandaloneSchedulerBackend.scala's reviveOffers() is inherited from its parent class CoarseGrainedSchedulerBackend.scala. All it does is send the driver a ReviveOffers message
override def reviveOffers() {
driverEndpoint.send(ReviveOffers)
}
- When the receive() method sees the ReviveOffers message it calls makeOffers(), and makeOffers() in turn calls launchTasks() (a sketch of makeOffers() follows the snippet)
case ReviveOffers =>
makeOffers()
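For context, makeOffers() (simplified and slightly reorganized here) builds a WorkerOffer for every alive executor, asks TaskSchedulerImpl to assign pending tasks to those offers, and hands the resulting TaskDescriptions to launchTasks():
// CoarseGrainedSchedulerBackend.scala, simplified sketch
private def makeOffers() {
  // one WorkerOffer per alive executor, describing how many cores it has free
  val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
  val workOffers = activeExecutors.map { case (id, executorData) =>
    new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
  }.toIndexedSeq
  // TaskSchedulerImpl decides which pending tasks run on which offers
  val taskDescs = scheduler.resourceOffers(workOffers)
  if (taskDescs.nonEmpty) {
    launchTasks(taskDescs)
  }
}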
- Next, the serialized tasks are sent to the executors
// Launch tasks returned by a set of resource offers
private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
for (task <- tasks.flatten) {
...
//send the LaunchTask message to the executor
executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
}
}
}
- On the executor side, CoarseGrainedExecutorBackend.scala's receive() responds to the LaunchTask message. Why CoarseGrainedExecutorBackend? Look back at the previous article on how standalone mode starts up and you will find the answer
override def receive: PartialFunction[Any, Unit] = {
...
case LaunchTask(data) =>
if (executor == null) {
exitExecutor(1, "Received LaunchTask command but executor was null")
} else {
val taskDesc = TaskDescription.decode(data.value)
logInfo("Got assigned task " + taskDesc.taskId)
//run the task
executor.launchTask(this, taskDesc)
}
- Executor.scala's launchTask() runs a TaskRunner thread on the threadPool. TaskRunner is an inner class of Executor.scala
def launchTask(context: ExecutorBackend, taskDescription: TaskDescription): Unit = {
val tr = new TaskRunner(context, taskDescription)
runningTasks.put(taskDescription.taskId, tr)
threadPool.execute(tr)
}
- The run() method of the inner class TaskRunner is enormously long; by many industry standards the Spark authors' style would not pass review, but that does not stop Spark from being an excellent open source project.
override def run(): Unit = {
//a great deal of preparation work omitted here
...
val value = Utils.tryWithSafeFinally {
val res = task.run(
taskAttemptId = taskId,
attemptNumber = taskDescription.attemptNumber,
metricsSystem = env.metricsSystem)
threwException = false
res
} {
//cleanup block omitted
}
//a great deal of exception handling omitted here
...
}
- Task is an abstract class with two subclasses. For every stage except the last one, the task is a ShuffleMapTask.scala: its output is written to the BlockManager and a MapStatus object is returned. MapStatus only records where in the BlockManager the ShuffleMapTask's output was stored; it is not the computation result itself.
override def runTask(context: TaskContext): MapStatus = {
// deserialize the broadcast task binary to recover the RDD and its ShuffleDependency
...
var writer: ShuffleWriter[Any, Any] = null
try {
//get the ShuffleManager from SparkEnv; Hash, Sort Based, Tungsten-sort Based and custom shuffles are supported (shuffle will get its own article later)
val manager = SparkEnv.get.shuffleManager
//get a ShuffleWriter from the ShuffleManager
writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
//write the RDD's output through the writer to the file system; more detail when we cover shuffle and the BlockManager
writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
writer.stop(success = true).get
} catch {
//handle the exception
...
}
}
rdd.iterator() returns the computed records of the partition. Its doc comment reads: "Internal method to this RDD; will read from cache if applicable, or otherwise compute it." In other words, it is an internal RDD method that returns the cached data when a cache exists and computes the partition otherwise.
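The method itself is tiny; roughly (paraphrased from RDD.scala):
// RDD.scala, simplified
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  if (storageLevel != StorageLevel.NONE) {
    // the RDD was persisted/cached: ask the BlockManager first, compute only on a miss
    getOrCompute(split, context)
  } else {
    // not cached: compute the partition, or read it from a checkpoint if one exists
    computeOrReadCheckpoint(split, context)
  }
}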
14. For the last stage, the task is a ResultTask.scala, and its runTask() returns the computed result itself
override def runTask(context: TaskContext): U = {
// deserialize the broadcast task binary to recover the RDD and the user function
...
// compute the partition and return the result of applying func to it
func(context, rdd.iterator(partition, context))
}
- The result the executor sends back is handled differently depending on its size
15.1 Results larger than 1GB are dropped outright; the limit can be tuned via spark.driver.maxResultSize
15.2 Results between the direct-result threshold (1MB by default) and 1GB are stored in the BlockManager as an IndirectTaskResult keyed by a blockId, and only that blockId is sent to the driver. The 1MB lower bound is the minimum of spark.task.maxDirectResultSize (default 1MB) and spark.rpc.message.maxSize (default 128MB)
15.3 Results below that threshold (1MB by default) are sent straight back to the driver
- The result is returned to the driver via execBackend.statusUpdate(taskId, TaskState.FINISHED, serializedResult). If any exception is caught while the task runs on the executor, a FAILED or KILLED state is reported instead, depending on the situation. A sketch of the size check follows.
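This decision is made at the end of TaskRunner.run() in Executor.scala; a simplified sketch (not verbatim):
// Executor.scala, TaskRunner.run(), simplified sketch
val resultSize = serializedDirectResult.limit()
val serializedResult: ByteBuffer = {
  if (maxResultSize > 0 && resultSize > maxResultSize) {
    // 15.1: larger than spark.driver.maxResultSize -> drop the value, report only its size
    ser.serialize(new IndirectTaskResult[Any](TaskResultBlockId(taskId), resultSize))
  } else if (resultSize > maxDirectResultSize) {
    // 15.2: too big for a direct RPC reply -> park it in the BlockManager, ship only the blockId
    val blockId = TaskResultBlockId(taskId)
    env.blockManager.putBytes(blockId,
      new ChunkedByteBuffer(serializedDirectResult.duplicate()), StorageLevel.MEMORY_AND_DISK_SER)
    ser.serialize(new IndirectTaskResult[Any](blockId, resultSize))
  } else {
    // 15.3: small enough -> send the DirectTaskResult itself
    serializedDirectResult
  }
}
execBackend.statusUpdate(taskId, TaskState.FINISHED, serializedResult)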
- Back on the driver side, the receive() method of CoarseGrainedSchedulerBackend.scala (the parent class of StandaloneSchedulerBackend.scala) picks up the StatusUpdate message
override def receive: PartialFunction[Any, Unit] = {
case StatusUpdate(executorId, taskId, state, data) =>
scheduler.statusUpdate(taskId, state, data.value)
if (TaskState.isFinished(state)) {
executorDataMap.get(executorId) match {
case Some(executorInfo) =>
executorInfo.freeCores += scheduler.CPUS_PER_TASK
makeOffers(executorId)
case None =>
// Ignoring the update since we don't know about the executor.
logWarning(s"Ignored task status update ($taskId state $state) " +
s"from unknown executor with ID $executorId")
}
}
The scheduler.statusUpdate(taskId, state, data.value) here calls TaskSchedulerImpl.scala's statusUpdate(), which in turn calls taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData) (the failure branch is sketched after the snippet)
if (state == TaskState.FINISHED) {
taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData)
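statusUpdate() also has the symmetric branch for the FAILED/KILLED/LOST states mentioned in #15; simplified sketch:
// TaskSchedulerImpl.statusUpdate(), simplified sketch
if (TaskState.isFinished(state)) {
  cleanupTaskState(tid)
  taskSet.removeRunningTask(tid)
  if (state == TaskState.FINISHED) {
    taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData)
  } else if (Set(TaskState.FAILED, TaskState.KILLED, TaskState.LOST).contains(state)) {
    taskResultGetter.enqueueFailedTask(taskSet, tid, state, serializedData)
  }
}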
- The source of TaskResultGetter.scala's enqueueSuccessfulTask() is as follows
def enqueueSuccessfulTask(
taskSetManager: TaskSetManager,
tid: Long,
serializedData: ByteBuffer): Unit = {
//handle the finished task on a thread from the task-result-getter thread pool
getTaskResultExecutor.execute(new Runnable {
override def run(): Unit = Utils.logUncaughtExceptions {
try {
//first deserialize the result. A result is either a DirectTaskResult or an IndirectTaskResult, because the executor picks a different strategy depending on the result size -- see #15 of this article if you don't remember
val (result, size) = serializer.get().deserialize[TaskResult[_]](serializedData) match {
case directResult: DirectTaskResult[_] =>
if (!taskSetManager.canFetchMoreResults(serializedData.limit())) {
return
}
// deserialize "value" without holding any lock so that it won't block other threads.
// We should call it here, so that when it's called again in
// "TaskSetManager.handleSuccessfulTask", it does not need to deserialize the value.
directResult.value(taskResultSerializer.get())
(directResult, serializedData.limit())
case IndirectTaskResult(blockId, size) =>
if (!taskSetManager.canFetchMoreResults(size)) {
// dropped by executor if size is larger than maxResultSize
sparkEnv.blockManager.master.removeBlock(blockId)
return
}
logDebug("Fetching indirect task result for TID %s".format(tid))
scheduler.handleTaskGettingResult(taskSetManager, tid)
val serializedTaskResult = sparkEnv.blockManager.getRemoteBytes(blockId)
if (!serializedTaskResult.isDefined) {
/* We won't be able to get the task result if the machine that ran the task failed
* between when the task ended and when we tried to fetch the result, or if the
* block manager had to flush the result. */
scheduler.handleFailedTask(
taskSetManager, tid, TaskState.FINISHED, TaskResultLost)
return
}
val deserializedResult = serializer.get().deserialize[DirectTaskResult[_]](
serializedTaskResult.get.toByteBuffer)
// force deserialization of referenced value
deserializedResult.value(taskResultSerializer.get())
sparkEnv.blockManager.master.removeBlock(blockId)
(deserializedResult, size)
}
// Set the task result size in the accumulator updates received from the executors.
// We need to do this here on the driver because if we did this on the executors then
// we would have to serialize the result again after updating the size.
result.accumUpdates = result.accumUpdates.map { a =>
if (a.name == Some(InternalAccumulator.RESULT_SIZE)) {
val acc = a.asInstanceOf[LongAccumulator]
assert(acc.sum == 0L, "task result size should not have been set on the executors")
acc.setValue(size.toLong)
acc
} else {
a
}
}
//hand the successful task to the scheduler
scheduler.handleSuccessfulTask(taskSetManager, tid, result)
} catch {
case cnf: ClassNotFoundException =>
val loader = Thread.currentThread.getContextClassLoader
taskSetManager.abort("ClassNotFound with classloader: " + loader)
// Matching NonFatal so we don't catch the ControlThrowable from the "return" above.
case NonFatal(ex) =>
logError("Exception while getting task result", ex)
taskSetManager.abort("Exception while getting task result: %s".format(ex))
}
}
})
}
//handle the successful task (TaskSchedulerImpl.scala)
def handleSuccessfulTask(
taskSetManager: TaskSetManager,
tid: Long,
taskResult: DirectTaskResult[_]): Unit = synchronized {
//delegate to TaskSetManager.scala's handleSuccessfulTask()
taskSetManager.handleSuccessfulTask(tid, taskResult)
}
- Now for TaskSetManager.scala's handleSuccessfulTask()
/**
* Marks a task as successful and notifies the DAGScheduler that the task has ended.
*/
def handleSuccessfulTask(tid: Long, result: DirectTaskResult[_]): Unit = {
//a lot omitted here: mark the task as successfully finished, update failedExecutors, and so on
...
//the task is done; now the higher-level DAGScheduler takes over
sched.dagScheduler.taskEnded(tasks(index), Success, result.value(), result.accumUpdates, info)
//housekeeping for the case where every task in the taskSet has finished successfully
maybeFinishTaskSet()
}
- DAGScheduler.scala's taskEnded() posts a CompletionEvent, which doOnReceive() dispatches to handleTaskCompletion()
def taskEnded(
task: Task[_],
reason: TaskEndReason,
result: Any,
accumUpdates: Seq[AccumulatorV2[_, _]],
taskInfo: TaskInfo): Unit = {
eventProcessLoop.post(
CompletionEvent(task, reason, result, accumUpdates, taskInfo))
}
private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
case completion: CompletionEvent =>
dagScheduler.handleTaskCompletion(completion)
}
- Let's look at DAGScheduler.scala's handleTaskCompletion(). It is far too long, so we'll focus on two cases. First: if the finished task is a ShuffleMapTask, submitWaitingChildStages(shuffleStage) also has to be called to trigger the downstream stages (a sketch of it follows the snippet)
private[scheduler] def handleTaskCompletion(event: CompletionEvent) {
//a lot omitted
...
event.reason match {
case Success =>
task match {
case rt: ResultTask[_, _] =>
//the ResultTask case is covered in the next step
...
case smt: ShuffleMapTask =>
val shuffleStage = stage.asInstanceOf[ShuffleMapStage]
shuffleStage.pendingPartitions -= task.partitionId
val status = event.result.asInstanceOf[MapStatus]
val execId = status.location.executorId
logDebug("ShuffleMapTask finished on " + execId)
if (failedEpoch.contains(execId) && smt.epoch <= failedEpoch(execId)) {
logInfo(s"Ignoring possibly bogus $smt completion from executor $execId")
} else {
// The epoch of the task is acceptable (i.e., the task was launched after the most
// recent failure we're aware of for the executor), so mark the task's output as
// available.
mapOutputTracker.registerMapOutput(
shuffleStage.shuffleDep.shuffleId, smt.partitionId, status)
}
if (runningStages.contains(shuffleStage) && shuffleStage.pendingPartitions.isEmpty) {
markStageAsFinished(shuffleStage)
logInfo("looking for newly runnable stages")
logInfo("running: " + runningStages)
logInfo("waiting: " + waitingStages)
logInfo("failed: " + failedStages)
// This call to increment the epoch may not be strictly necessary, but it is retained
// for now in order to minimize the changes in behavior from an earlier version of the
// code. This existing behavior of always incrementing the epoch following any
// successful shuffle map stage completion may have benefits by causing unneeded
// cached map outputs to be cleaned up earlier on executors. In the future we can
// consider removing this call, but this will require some extra investigation.
// See https://github.com/apache/spark/pull/17955/files#r117385673 for more details.
mapOutputTracker.incrementEpoch()
clearCacheLocs()
if (!shuffleStage.isAvailable) {
// Some tasks had failed; let's resubmit this shuffleStage.
// TODO: Lower-level scheduler should also deal with this
logInfo("Resubmitting " + shuffleStage + " (" + shuffleStage.name +
") because some of its tasks had failed: " +
shuffleStage.findMissingPartitions().mkString(", "))
submitStage(shuffleStage)
} else {
markMapStageJobsAsFinished(shuffleStage)
//important: go on to submit the downstream child stages
submitWaitingChildStages(shuffleStage)
}
}
}
//other cases
...
}
}
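submitWaitingChildStages() itself is short: it picks every waiting stage whose parents include the stage that just finished and pushes it through the same submitStage() we saw earlier. Simplified sketch:
// DAGScheduler.scala, simplified sketch
private def submitWaitingChildStages(parent: Stage) {
  // stages that submitStage() parked in waitingStages because this parent was still missing
  val childStages = waitingStages.filter(_.parents.contains(parent)).toArray
  waitingStages --= childStages
  for (stage <- childStages.sortBy(_.firstJobId)) {
    submitStage(stage)
  }
}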
- If the finished task is a ResultTask, job.listener.taskSucceeded(rt.outputId, event.result) is called
private[scheduler] def handleTaskCompletion(event: CompletionEvent) {
//a lot omitted
...
event.reason match {
case Success =>
task match {
case rt: ResultTask[_, _] =>
// Cast to ResultStage here because it's part of the ResultTask
// TODO Refactor this out to a function that accepts a ResultStage
val resultStage = stage.asInstanceOf[ResultStage]
resultStage.activeJob match {
case Some(job) =>
//a lot omitted
...
try {
job.listener.taskSucceeded(rt.outputId, event.result)
} catch {
case e: Exception =>
// TODO: Perhaps we want to mark the resultStage as failed?
job.listener.jobFailed(new SparkDriverExecutionException(e))
}
}
case None =>
logInfo("Ignoring result from " + rt + " because its job has finished")
}
...
}
- Remember the JobWaiter that submitJob() returned back in step 2? job.listener.taskSucceeded(rt.outputId, event.result) is exactly a call to JobWaiter.scala's taskSucceeded(index: Int, result: Any)
override def taskSucceeded(index: Int, result: Any): Unit = {
// resultHandler call must be synchronized in case resultHandler itself is not thread safe.
synchronized {
//hand the result back to the resultHandler supplied by SparkContext
resultHandler(index, result.asInstanceOf[T])
}
if (finishedTasks.incrementAndGet() == totalTasks) {
jobPromise.success(())
}
}
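resultHandler is the callback SparkContext.runJob() passed down at the very beginning. For an action like collect() it just writes each partition's result into its slot of an array, so by the time JobWaiter has counted totalTasks successes the action already has its complete answer. Roughly (paraphrased, not verbatim):
// SparkContext.scala, simplified: the overload that collects one result per partition
def runJob[T, U: ClassTag](rdd: RDD[T], func: (TaskContext, Iterator[T]) => U, partitions: Seq[Int]): Array[U] = {
  val results = new Array[U](partitions.size)
  // this closure is the resultHandler that JobWaiter.taskSucceeded() ends up invoking, once per partition
  runJob[T, U](rdd, func, partitions, (index, res: U) => results(index) = res)
  results
}

// RDD.collect() then becomes nothing more than:
def collect(): Array[T] = {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}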