The Master's Registration Mechanism
Application Registration
The previous post analyzed the SparkContext initialization flow, which ends with a RegisterApplication message being sent to the Master. Let's look at how the Master responds to it.
The Master class extends ThreadSafeRpcEndpoint, so registration messages arrive in its receive() method; below is only the code related to application registration. Note that Spark allows multiple Masters to run at the same time, but only one of them is ALIVE while the rest are STANDBY, and only the ALIVE Master responds to registration requests.
The handler essentially adds the application's information to the Master's in-memory bookkeeping and sends a RegisteredApplication message back to the application's driver; finally it calls schedule(), whose role will be covered in a later post.
override def receive: PartialFunction[Any, Unit] = {
  // Handle a registration message from an application
  case RegisterApplication(description, driver) =>
    // TODO Prevent repeated registrations from some driver
    // The current master is STANDBY rather than ALIVE, so do not respond
    if (state == RecoveryState.STANDBY) {
      // ignore, don't send response
    } else {
      logInfo("Registering app " + description.name)
      // Build an ApplicationInfo object from the registration message.
      // This is where the application ID is generated; its format is
      //   val appId = "app-%s-%04d".format(createDateFormat.format(submitDate), nextAppNumber)
      // where nextAppNumber auto-increments from 0, yielding IDs like app-20170523120000-0000
      val app = createApplication(description, driver)
      registerApplication(app)
      logInfo("Registered app " + description.name + " with ID " + app.id)
      // Persist the ApplicationInfo through the persistence engine
      persistenceEngine.addApplication(app)
      // Send the response back to the driver
      driver.send(RegisteredApplication(app.id, self))
      schedule()
    }
}
/**
 * Handle a registration request from an application:
 * add the ApplicationInfo to the in-memory bookkeeping and
 * append the application to the queue of apps waiting to be scheduled.
 * @param app the ApplicationInfo built from the registration request
 */
private def registerApplication(app: ApplicationInfo): Unit = {
  // Get the address of the application's driver
  val appAddress = app.driver.address
  // If a driver at this address is already registered, treat it as a
  // duplicate registration and return
  if (addressToApp.contains(appAddress)) {
    logInfo("Attempted to re-register application at same address: " + appAddress)
    return
  }
  applicationMetricsSystem.registerSource(app.appSource)
  apps += app
  idToApp(app.id) = app
  endpointToApp(app.driver) = app
  addressToApp(appAddress) = app
  // Append to the queue of applications waiting to be scheduled
  waitingApps += app
}
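For reference, the structures touched here (apps, idToApp, endpointToApp, addressToApp, waitingApps) are plain in-memory collections declared on the Master. Their declarations look roughly like this (field names as used above; exact types may vary slightly across Spark versions):

import scala.collection.mutable.{ArrayBuffer, HashMap, HashSet}
import org.apache.spark.rpc.{RpcAddress, RpcEndpointRef}

// Approximate in-memory bookkeeping on the Master
val apps = new HashSet[ApplicationInfo]                          // all known applications
val idToApp = new HashMap[String, ApplicationInfo]               // lookup by application ID
val endpointToApp = new HashMap[RpcEndpointRef, ApplicationInfo] // lookup by driver endpoint
val addressToApp = new HashMap[RpcAddress, ApplicationInfo]      // lookup by driver address (used for the duplicate check)
val waitingApps = new ArrayBuffer[ApplicationInfo]               // applications waiting to be scheduled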
Now let's see what the driver does when it receives the response. Since we are analyzing standalone mode, the response is received by StandaloneAppClient; this also shows that the AppClient is the component that talks to the cluster:
override def receive: PartialFunction[Any, Unit] = {
  case RegisteredApplication(appId_, masterRef) =>
    // Record the application ID and mark registration as successful
    appId.set(appId_)
    registered.set(true)
    master = Some(masterRef)
    listener.connected(appId.get)
}
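For completeness, the two messages exchanged here are plain case classes from org.apache.spark.deploy.DeployMessages; their shape is roughly:

// Approximate definitions from DeployMessages.scala
case class RegisterApplication(appDescription: ApplicationDescription, driver: RpcEndpointRef)
case class RegisteredApplication(appId: String, master: RpcEndpointRef)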
Driver Registration
When a job is submitted with spark-submit, the driver is registered first: a RequestSubmitDriver message is sent to the Master. Let's see how the Master handles it:
override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  case RequestSubmitDriver(description) =>
    // Only a Master in the ALIVE state can accept driver submissions
    if (state != RecoveryState.ALIVE) {
      val msg = s"${Utils.BACKUP_STANDALONE_MASTER_PREFIX}: $state. " +
        "Can only accept driver submissions in ALIVE state."
      context.reply(SubmitDriverResponse(self, false, None, msg))
    } else {
      logInfo("Driver submitted " + description.command.mainClass)
      // Update the in-memory bookkeeping below
      val driver = createDriver(description)
      persistenceEngine.addDriver(driver)
      // Add the driver to the queue of drivers waiting to be scheduled
      waitingDrivers += driver
      drivers.add(driver)
      // Trigger scheduling
      schedule()
      // Reply to the submitter with the result
      context.reply(SubmitDriverResponse(self, true, Some(driver.id),
        s"Driver successfully submitted as ${driver.id}"))
    }
}
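On the other side of this exchange, the submission client asks the Master and inspects the reply. Below is a minimal hypothetical sketch of that pattern, assuming Spark 2.x's RpcEndpointRef.askSync; submitDriver is a made-up helper name, not the actual spark-submit code path:

import org.apache.spark.deploy.DriverDescription
import org.apache.spark.deploy.DeployMessages.{RequestSubmitDriver, SubmitDriverResponse}
import org.apache.spark.rpc.RpcEndpointRef

// Hypothetical helper: submit a driver description and check the Master's reply
def submitDriver(masterRef: RpcEndpointRef, desc: DriverDescription): Unit = {
  // receiveAndReply above answers this ask with a SubmitDriverResponse
  val resp = masterRef.askSync[SubmitDriverResponse](RequestSubmitDriver(desc))
  if (resp.success) {
    println(s"Driver successfully submitted as ${resp.driverId.get}")
  } else {
    println(s"Driver submission failed: ${resp.message}")
  }
}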
Worker Registration
When a worker starts up, it also sends a registration message to the Master:
case RegisterWorker(
    id, workerHost, workerPort, workerRef, cores, memory, workerWebUiUrl, masterAddress) =>
  // Note that the worker's registration message carries its core count and RAM size
  logInfo("Registering worker %s:%d with %d cores, %s RAM".format(
    workerHost, workerPort, cores, Utils.megabytesToString(memory)))
  // If the current Master is STANDBY, just notify the worker
  if (state == RecoveryState.STANDBY) {
    workerRef.send(MasterInStandby)
  } else if (idToWorker.contains(id)) {
    workerRef.send(RegisterWorkerFailed("Duplicate worker ID"))
  } else {
    val worker = new WorkerInfo(id, workerHost, workerPort, cores, memory,
      workerRef, workerWebUiUrl)
    // Add the worker's information to the in-memory bookkeeping
    if (registerWorker(worker)) {
      persistenceEngine.addWorker(worker)
      workerRef.send(RegisteredWorker(self, masterWebUiUrl, masterAddress))
      // Trigger scheduling
      schedule()
    } else {
      val workerAddress = worker.endpoint.address
      logWarning("Worker registration failed. Attempted to re-register worker at same " +
        "address: " + workerAddress)
      workerRef.send(RegisterWorkerFailed("Attempted to re-register worker at same address: "
        + workerAddress))
    }
  }
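The worker reacts differently to each of the three possible replies. Here is a simplified sketch modeled on Worker.handleRegisterResponse, with the bodies reduced to logging (not verbatim Spark code):

import org.apache.spark.deploy.DeployMessages._

// Simplified sketch of the worker side of the handshake
def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = msg match {
  case RegisteredWorker(masterRef, masterWebUiUrl, masterAddress) =>
    // Success: remember the active master and start sending heartbeats to it
    println(s"Successfully registered with master ${masterRef.address}")
  case MasterInStandby =>
    // A STANDBY master does not register workers; keep trying the other masters
    println("Master is standby, ignoring registration response")
  case RegisterWorkerFailed(message) =>
    // e.g. "Duplicate worker ID" as seen above
    println(s"Worker registration failed: $message")
}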
All three registration flows above use persistenceEngine, so let's look at what PersistenceEngine is for. Its scaladoc reads:
/**
* Allows Master to persist any state that is necessary in order to recover from a failure.
* The following semantics are required:
* - addApplication and addWorker are called before completing registration of a new app/worker.
* - removeApplication and removeWorker are called at any time.
* Given these two requirements, we will have all apps and workers persisted, but
* we might not have yet deleted apps or workers that finished (so their liveness must be verified
* during recovery).
*
* The implementation of this trait defines how name-object pairs are stored or retrieved.
*/
In short, the PersistenceEngine persists whatever state the Master needs in order to recover from a failure, under two guarantees (the trait's shape is sketched just below):
(1) addApplication and addWorker are called before the registration of a new app/worker completes;
(2) removeApplication and removeWorker may be called at any time.
Consequently every app and worker is persisted, but some that have already finished may not have been deleted yet, so their liveness must be verified during recovery.
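The "name-object pairs" mentioned in the scaladoc show up directly in the trait's API. Condensed from PersistenceEngine.scala (signatures approximate):

import scala.reflect.ClassTag

// Condensed shape of PersistenceEngine
abstract class PersistenceEngine {
  def persist(name: String, obj: Object): Unit   // store a name-object pair
  def unpersist(name: String): Unit              // remove it
  def read[T: ClassTag](prefix: String): Seq[T]  // read back everything whose name matches the prefix

  final def addApplication(app: ApplicationInfo): Unit = persist("app_" + app.id, app)
  final def addWorker(worker: WorkerInfo): Unit = persist("worker_" + worker.id, worker)
  // removeApplication / addDriver / removeDriver follow the same pattern
}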
The PersistenceEngine is created in the Master's onStart() method, and three recovery modes are supported: ZOOKEEPER, FILESYSTEM, and CUSTOM (when no mode is configured, a no-op BlackHolePersistenceEngine is used).
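The choice is driven by the spark.deploy.recoveryMode configuration. Condensed from Master.onStart() (roughly; details vary by version):

// Condensed from Master.onStart(): pick the persistence engine by recovery mode
val (persistenceEngine_, leaderElectionAgent_) = RECOVERY_MODE match {
  case "ZOOKEEPER" =>
    val zkFactory = new ZooKeeperRecoveryModeFactory(conf, serializer)
    (zkFactory.createPersistenceEngine(), zkFactory.createLeaderElectionAgent(this))
  case "FILESYSTEM" =>
    val fsFactory = new FileSystemRecoveryModeFactory(conf, serializer)
    (fsFactory.createPersistenceEngine(), fsFactory.createLeaderElectionAgent(this))
  case "CUSTOM" =>
    // Instantiate the user-provided StandaloneRecoveryModeFactory reflectively
    val clazz = Utils.classForName(conf.get("spark.deploy.recoveryMode.factory"))
    val factory = clazz.getConstructor(classOf[SparkConf], classOf[Serializer])
      .newInstance(conf, serializer).asInstanceOf[StandaloneRecoveryModeFactory]
    (factory.createPersistenceEngine(), factory.createLeaderElectionAgent(this))
  case _ =>
    // No recovery mode configured: BlackHolePersistenceEngine persists nothing
    (new BlackHolePersistenceEngine(), new MonarchyLeaderAgent(this))
}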
To summarize, this post walked through the application, driver, and worker registration flows in Spark standalone mode: how the Master responds to each registration request, how the driver handles the registration response, and how a worker registers after startup, as well as the role the PersistenceEngine plays in all three flows.