I. Launching the driver
1. Everything starts from the schedule() method in Master.scala. Inside it, two methods, launchDriver() and launchExecutor(), are responsible for starting drivers and executors respectively. schedule() is invoked on the Master whenever the available resources change or a new application is submitted.
2. All waiting drivers are scheduled first; they take strict precedence over applications. A driver is started on a worker via launchDriver(worker, driver) only when that worker's free memory is at least the memory the driver requires and its free cores are at least the cores the driver requires. Once launched, the driver is removed from the waiting-driver list (an ArrayBuffer) and the local launched flag is set to true.
def launchDriver(worker: WorkerInfo, driver: DriverInfo) {
  logInfo("Launching driver " + driver.id + " on worker " + worker.id)
  worker.addDriver(driver)
  driver.worker = Some(worker)
  // Send the LaunchDriver case class to the Worker node via the Akka actor messaging model
  worker.actor ! LaunchDriver(driver.id, driver.desc)
  driver.state = DriverState.RUNNING
}
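To make the scheduling loop concrete, here is a self-contained, simplified sketch of the FIFO driver-placement logic described above. WorkerSlot and DriverReq are hypothetical stand-ins for Spark's WorkerInfo and DriverInfo; the real schedule() also shuffles the alive workers first, which is omitted here:

import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-ins for Spark's WorkerInfo and DriverInfo
case class WorkerSlot(id: String, var coresFree: Int, var memoryFree: Int)
case class DriverReq(id: String, cores: Int, mem: Int)

object DriverScheduleSketch {
  def main(args: Array[String]): Unit = {
    val workers = Seq(WorkerSlot("w1", 4, 4096), WorkerSlot("w2", 2, 1024))
    val waitingDrivers = ArrayBuffer(DriverReq("d1", 1, 2048), DriverReq("d2", 2, 2048))

    // FIFO over the waiting list: place each driver on the first worker
    // whose free memory and free cores cover the driver's requirements
    for (driver <- waitingDrivers.toList; worker <- workers) {
      if (waitingDrivers.contains(driver) &&
          worker.memoryFree >= driver.mem && worker.coresFree >= driver.cores) {
        println(s"launching ${driver.id} on ${worker.id}")
        worker.coresFree -= driver.cores
        worker.memoryFree -= driver.mem
        waitingDrivers -= driver // mirrors removal from the real waiting-driver ArrayBuffer
      }
    }
  }
}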
3. The Worker receives the LaunchDriver message sent by the Master (Worker.scala):
case LaunchDriver(driverId, driverDesc) => {
  logInfo(s"Asked to launch driver $driverId")
  val driver = new DriverRunner(
    conf,
    driverId,
    workDir,
    sparkHome,
    driverDesc.copy(command = Worker.maybeUpdateSSLSettings(driverDesc.command, conf)),
    self,
    akkaUrl)
  drivers(driverId) = driver
  driver.start()
  coresUsed += driverDesc.cores
  memoryUsed += driverDesc.mem
}
Inside the LaunchDriver handler, the Worker creates a DriverRunner object (which internally wraps a thread), registers it in the Worker's driver map (a HashMap from driverId to DriverRunner), calls the DriverRunner's start() method, and then adds the driver's cores and memory to the Worker's used totals.
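The register-start-account pattern is easy to see in isolation. Below is a minimal, hypothetical sketch (RunnerStub stands in for DriverRunner) of the bookkeeping the handler performs:

import scala.collection.mutable.HashMap

// RunnerStub is a hypothetical stand-in for DriverRunner
class RunnerStub(val driverId: String) {
  def start(): Unit = println(s"runner thread for $driverId started")
}

object WorkerBookkeepingSketch {
  val drivers = new HashMap[String, RunnerStub]() // driverId -> runner, as in Worker.scala
  var coresUsed = 0
  var memoryUsed = 0

  def onLaunchDriver(driverId: String, cores: Int, mem: Int): Unit = {
    val runner = new RunnerStub(driverId)
    drivers(driverId) = runner // register first, then start, same order as the real handler
    runner.start()
    coresUsed += cores // account for the resources the driver now occupies
    memoryUsed += mem
  }

  def main(args: Array[String]): Unit = {
    onLaunchDriver("driver-20150101000000-0001", 1, 1024)
    println(s"coresUsed=$coresUsed, memoryUsed=$memoryUsed MB")
  }
}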
4. Now look at driver.start() (DriverRunner.scala).
This method's main responsibilities: ① create the driver's working directory; ② download the user jar from the master (move the computation to the data, not the data to the computation); ③ call the internal launchDriver(builder, driverDir, supervise) method, which drives a java.lang.ProcessBuilder (the Java API for spawning and controlling an OS process) to actually start the driver. This is where the driver comes to life:

def start() = {
  new Thread("DriverRunner for " + driverId) {
    override def run() {
      try {
        val driverDir = createWorkingDirectory()
        val localJarFilename = downloadUserJar(driverDir)

        def substituteVariables(argument: String): String = argument match {
          case "{{WORKER_URL}}" => workerUrl
          case "{{USER_JAR}}" => localJarFilename
          case other => other
        }

        // TODO: If we add ability to submit multiple jars they should also be added here
        // The launch command assembled here is analogous to:
        //   storm jar jar-path classpath parameters
        val builder = CommandUtils.buildProcessBuilder(driverDesc.command, driverDesc.mem,
          sparkHome.getAbsolutePath, substituteVariables)
        launchDriver(builder, driverDir, driverDesc.supervise)
      } catch {
        case e: Exception => finalException = Some(e)
      }

      // Determine the driver's final state (KILLED / ERROR / FINISHED / FAILED)
      val state =
        if (killed) {
          DriverState.KILLED
        } else if (finalException.isDefined) {
          DriverState.ERROR
        } else {
          finalExitCode match {
            case Some(0) => DriverState.FINISHED
            case _ => DriverState.FAILED
          }
        }

      finalState = Some(state)
      worker ! DriverStateChanged(driverId, state, finalException)
    }
  }.start()
}
Finally, the thread sends a DriverStateChanged(driverId, state, finalException) message to the Worker; on receiving it, the Worker forwards the same DriverStateChanged(driverId, state, finalException) message to the Master.
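Since the heavy lifting here is done by java.lang.ProcessBuilder, a tiny standalone example helps. This sketch (not Spark code) launches a child JVM the same way DriverRunner does, waits for it, and maps the exit code to a final state the way the finalExitCode match above does:

object DriverLaunchSketch {
  def main(args: Array[String]): Unit = {
    // Build and spawn an OS process with the plain JDK API
    val builder = new ProcessBuilder("java", "-version")
    builder.redirectErrorStream(true) // merge stderr into stdout for simplicity
    val process = builder.start()
    scala.io.Source.fromInputStream(process.getInputStream).getLines().foreach(println)
    // Block until the child exits, then map the exit code to a state
    val exitCode = process.waitFor()
    val state = if (exitCode == 0) "FINISHED" else "FAILED"
    println(s"driver process exited with code $exitCode -> $state")
  }
}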
II. Launching the executor
1. This mirrors launching a driver: once schedule() (Master.scala) finds that the necessary conditions hold (e.g. sufficient free resources), it calls launchExecutor().
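For reference, the standalone Master (with the default spark.deploy.spreadOut=true) hands out cores one at a time, round-robin across the usable workers, so an application is spread as widely as possible. A simplified, self-contained sketch of that strategy, using made-up numbers rather than the real data structures:

object SpreadOutSketch {
  def main(args: Array[String]): Unit = {
    val freeCores = Array(4, 3, 2) // free cores on each usable worker (made-up numbers)
    val assigned  = Array(0, 0, 0) // cores this app gets on each worker
    var coresLeft = 6              // cores the application still needs
    var pos = 0
    // One core per step, cycling through the workers, until the demand is met
    while (coresLeft > 0 && assigned.sum < freeCores.sum) {
      if (assigned(pos) < freeCores(pos)) {
        assigned(pos) += 1
        coresLeft -= 1
      }
      pos = (pos + 1) % freeCores.length
    }
    println(assigned.mkString("cores assigned per worker: [", ", ", "]"))
  }
}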
2. In Master.scala, launchExecutor(worker, exec) records the executor on the chosen worker, sends it a LaunchExecutor message, and registers the new executor with the application's driver via an ExecutorAdded message:

def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc) {
  logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
  worker.addExecutor(exec)
  worker.actor ! LaunchExecutor(masterUrl,
    exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory)
  // After launching the executor, register it with the application's driver
  exec.application.driver ! ExecutorAdded(
    exec.id, worker.id, worker.hostPort, exec.cores, exec.memory)
}
3. On the Worker side, the LaunchExecutor handler creates an ExecutorRunner object, which holds an internal Thread and encapsulates all the information needed to start the executor, calls its start() method, and reports an ExecutorStateChanged back to the Master (Worker.scala):

case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
  ......
  val manager = new ExecutorRunner(...)
  manager.start()
  coresUsed += cores_
  memoryUsed += memory_
  master ! ExecutorStateChanged(appId, execId, manager.state, None, None)
  ......
manager.start() (ExecutorRunner.scala) spawns a thread that runs fetchAndRunExecutor() and installs a JVM shutdown hook so that the child process is killed if the Worker itself shuts down:

def start() {
  workerThread = new Thread("ExecutorRunner for " + fullId) {
    override def run() { fetchAndRunExecutor() }
  }
  workerThread.start()
  // Shutdown hook that kills actors on shutdown.
  shutdownHook = new Thread() {
    override def run() { killProcess(Some("Worker shutting down")) }
  }
  Runtime.getRuntime.addShutdownHook(shutdownHook)
}
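The shutdown hook above is the standard JDK mechanism; this minimal sketch shows the same pattern in isolation (killing a child process is replaced by a println):

object ShutdownHookSketch {
  def main(args: Array[String]): Unit = {
    // Registered hooks run as the JVM exits, which is how ExecutorRunner
    // gets a chance to kill its child executor process on Worker shutdown
    Runtime.getRuntime.addShutdownHook(new Thread() {
      override def run(): Unit = println("shutdown hook: kill the child process here")
    })
    println("main finished; the hook fires while the JVM shuts down")
  }
}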
fetchAndRunExecutor() builds the launch command, starts the executor process, redirects its output to files, and waits for it to exit:

def fetchAndRunExecutor() {
  try {
    // Launch the process
    val builder = CommandUtils.buildProcessBuilder(appDesc.command, memory,
      sparkHome.getAbsolutePath, substituteVariables)
    val command = builder.command()
    logInfo("Launch command: " + command.mkString("\"", "\" \"", "\""))

    builder.directory(executorDir)
    builder.environment.put("SPARK_LOCAL_DIRS", appLocalDirs.mkString(","))
    // In case we are running this from within the Spark Shell, avoid creating a "scala"
    // parent process for the executor command
    builder.environment.put("SPARK_LAUNCH_WITH_SCALA", "0")

    // Add webUI log urls
    val baseUrl =
      s"http://$publicAddress:$webUiPort/logPage/?appId=$appId&executorId=$execId&logType="
    builder.environment.put("SPARK_LOG_URL_STDERR", s"${baseUrl}stderr")
    builder.environment.put("SPARK_LOG_URL_STDOUT", s"${baseUrl}stdout")

    process = builder.start()
    val header = "Spark Executor Command: %s\n%s\n\n".format(
      command.mkString("\"", "\" \"", "\""), "=" * 40)

    // Redirect its stdout and stderr to files
    val stdout = new File(executorDir, "stdout")
    stdoutAppender = FileAppender(process.getInputStream, stdout, conf)

    val stderr = new File(executorDir, "stderr")
    Files.write(header, stderr, UTF_8)
    stderrAppender = FileAppender(process.getErrorStream, stderr, conf)

    // Wait for it to exit; executor may exit with code 0 (when driver instructs it to shutdown)
    // or with nonzero exit code
    val exitCode = process.waitFor()
    state = ExecutorState.EXITED
    val message = "Command exited with code " + exitCode
    worker ! ExecutorStateChanged(appId, execId, state, Some(message), Some(exitCode))
  } catch {
    case interrupted: InterruptedException => {
      logInfo("Runner thread for executor " + fullId + " interrupted")
      state = ExecutorState.KILLED
      killProcess(None)
    }
    case e: Exception => {
      logError("Error running executor", e)
      state = ExecutorState.FAILED
      killProcess(Some(e.toString))
    }
  }
}
Its main work mirrors the driver path: create the executor's working directory, download the jar to run, and start the executor process, the process that actually executes the application, from within fetchAndRunExecutor(). When the process exits, an ExecutorStateChanged message is sent to the Worker.
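FileAppender is Spark-internal, but the effect it achieves, persisting the child's stdout/stderr under the executor's working directory, can be reproduced with the plain JDK redirection API. A hedged sketch (the directory name is made up):

import java.io.File

object RedirectSketch {
  def main(args: Array[String]): Unit = {
    val executorDir = new File("executor-demo") // hypothetical working directory
    executorDir.mkdirs()
    val builder = new ProcessBuilder("java", "-version")
    builder.directory(executorDir)
    // Send the child's output streams straight to files named like Spark's
    builder.redirectOutput(new File(executorDir, "stdout"))
    builder.redirectError(new File(executorDir, "stderr"))
    val exitCode = builder.start().waitFor()
    println(s"exited with $exitCode; see ${executorDir.getPath}/stdout and /stderr")
  }
}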
4. The Worker then forwards the received ExecutorStateChanged message to the Master.
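This two-hop state propagation (runner thread to Worker, Worker to Master) is the same for drivers and executors. A minimal sketch of the relay, with a local case class standing in for Spark's ExecutorStateChanged message and plain method calls standing in for actor sends:

// Hypothetical stand-in for Spark's ExecutorStateChanged message
case class StateChanged(appId: String, execId: Int, state: String,
                        message: Option[String], exitStatus: Option[Int])

object StateRelaySketch {
  def masterReceive(msg: StateChanged): Unit =
    println(s"master: ${msg.appId}/${msg.execId} is now ${msg.state} (${msg.message.getOrElse("")})")

  def workerReceive(msg: StateChanged): Unit = {
    // The Worker does its local cleanup, then forwards the same message upstream
    masterReceive(msg)
  }

  def main(args: Array[String]): Unit =
    workerReceive(StateChanged("app-1", 0, "EXITED", Some("Command exited with code 0"), Some(0)))
}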