The Master's Registration Mechanism
Application Registration
The previous post analyzed the SparkContext initialization flow, which ends with a RegisterApplication message being sent to the Master. Let's look at how the Master responds to it.
The Master class extends ThreadSafeRpcEndpoint, so registration messages arrive in its receive() method; below is only the code related to application registration. Note that Spark allows multiple Masters to run at the same time, but only one of them is ALIVE while the rest are STANDBY, and only the ALIVE Master responds to registration requests.
The handler essentially adds the application's information to the Master's in-memory bookkeeping and sends a RegisteredApplication message back to the application's driver; finally it calls schedule(), whose role will be covered in a later post.
override def receive: PartialFunction[Any, Unit] = {
  // Handle a registration message from an application
  case RegisterApplication(description, driver) =>
    // TODO Prevent repeated registrations from some driver
    // The current master is STANDBY rather than ALIVE, so do not respond
    if (state == RecoveryState.STANDBY) {
      // ignore, don't send response
    } else {
      logInfo("Registering app " + description.name)
      // Build an ApplicationInfo object from the registration message.
      // This is where the application ID is generated; its format is
      //   val appId = "app-%s-%04d".format(createDateFormat.format(submitDate), nextAppNumber)
      // where nextAppNumber auto-increments from 0, yielding IDs like app-20170523120000-0000
      val app = createApplication(description, driver)
      registerApplication(app)
      logInfo("Registered app " + description.name + " with ID " + app.id)
      // Persist the ApplicationInfo through the persistence engine
      persistenceEngine.addApplication(app)
      // Send the response back to the driver
      driver.send(RegisteredApplication(app.id, self))
      schedule()
    }
}
/**
 * Handle a registration request from an application:
 * add the ApplicationInfo to the in-memory bookkeeping and
 * append the application to the queue of apps waiting to be scheduled.
 * @param app the ApplicationInfo built from the registration request
 */
private def registerApplication(app: ApplicationInfo): Unit = {
  // Get the address of the application's driver
  val appAddress = app.driver.address
  // If a driver at this address is already registered, treat it as a
  // duplicate registration and return
  if (addressToApp.contains(appAddress)) {
    logInfo("Attempted to re-register application at same address: " + appAddress)
    return
  }
  applicationMetricsSystem.registerSource(app.appSource)
  apps += app
  idToApp(app.id) = app
  endpointToApp(app.driver) = app
  addressToApp(appAddress) = app
  // Append to the queue of applications waiting to be scheduled
  waitingApps += app
}
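For reference, the structures touched here (apps, idToApp, endpointToApp, addressToApp, waitingApps) are plain in-memory collections declared on the Master. Their declarations look roughly like this (field names as used above; exact types may vary slightly across Spark versions):

import scala.collection.mutable.{ArrayBuffer, HashMap, HashSet}
import org.apache.spark.rpc.{RpcAddress, RpcEndpointRef}

// Approximate in-memory bookkeeping on the Master
val apps = new HashSet[ApplicationInfo]                          // all known applications
val idToApp = new HashMap[String, ApplicationInfo]               // lookup by application ID
val endpointToApp = new HashMap[RpcEndpointRef, ApplicationInfo] // lookup by driver endpoint
val addressToApp = new HashMap[RpcAddress, ApplicationInfo]      // lookup by driver address (used for the duplicate check)
val waitingApps = new ArrayBuffer[ApplicationInfo]               // applications waiting to be scheduled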
Now let's see what the driver does when it receives the response. Since we are analyzing standalone mode, the response is received by StandaloneAppClient; this also shows that the AppClient is the component that talks to the cluster:
override def receive: PartialFunction[Any, Unit] = {
  case RegisteredApplication(appId_, masterRef) =>
    // Record the application ID and mark registration as successful
    appId.set(appId_)
    registered.set(true)
    master = Some(masterRef)
    listener.connected(appId.get)
}
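For completeness, the two messages exchanged here are plain case classes from org.apache.spark.deploy.DeployMessages; their shape is roughly:

// Approximate definitions from DeployMessages.scala
case class RegisterApplication(appDescription: ApplicationDescription, driver: RpcEndpointRef)
case class RegisteredApplication(appId: String, master: RpcEndpointRef)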
Driver Registration
When a job is submitted with spark-submit, the driver is registered first: a RequestSubmitDriver message is sent to the Master. Let's see how the Master handles it:
override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  case RequestSubmitDriver(description) =>
    // Only a Master in the ALIVE state can accept driver submissions
    if (state != RecoveryState.ALIVE) {
      val msg = s"${Utils.BACKUP_STANDALONE_MASTER_PREFIX}: $state. " +
        "Can only accept driver submissions in ALIVE state."
      context.reply(SubmitDriverResponse(self, false, None, msg))
    } else {
      logInfo("Driver submitted " + description.command.mainClass)
      // Update the in-memory bookkeeping below
      val driver = createDriver(description)
      persistenceEngine.addDriver(driver)
      // Add the driver to the queue of drivers waiting to be scheduled
      waitingDrivers += driver
      drivers.add(driver)
      // Trigger scheduling
      schedule()
      // Reply to the submitter with the result
      context.reply(SubmitDriverResponse(self, true, Some(driver.id),
        s"Driver successfully submitted as ${driver.id}"))
    }
}
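On the other side of this exchange, the submission client asks the Master and inspects the reply. Below is a minimal hypothetical sketch of that pattern, assuming Spark 2.x's RpcEndpointRef.askSync; submitDriver is a made-up helper name, not the actual spark-submit code path:

import org.apache.spark.deploy.DriverDescription
import org.apache.spark.deploy.DeployMessages.{RequestSubmitDriver, SubmitDriverResponse}
import org.apache.spark.rpc.RpcEndpointRef

// Hypothetical helper: submit a driver description and check the Master's reply
def submitDriver(masterRef: RpcEndpointRef, desc: DriverDescription): Unit = {
  // receiveAndReply above answers this ask with a SubmitDriverResponse
  val resp = masterRef.askSync[SubmitDriverResponse](RequestSubmitDriver(desc))
  if (resp.success) {
    println(s"Driver successfully submitted as ${resp.driverId.get}")
  } else {
    println(s"Driver submission failed: ${resp.message}")
  }
}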
Worker Registration
When a worker starts up, it also sends a registration message to the Master:
case RegisterWorker(
    id, workerHost, workerPort, workerRef, cores, memory, workerWebUiUrl, masterAddress) =>
  // Note that the worker's registration message carries its core count and RAM size
  logInfo("Registering worker %s:%d with %d cores, %s RAM".format(
    workerHost, workerPort, cores, Utils.megabytesToString(memory)))
  // If the current Master is STANDBY, just notify the worker
  if (state == RecoveryState.STANDBY) {
    workerRef.send(MasterInStandby)
  } else if (idToWorker.contains(id)) {
    workerRef.send(RegisterWorkerFailed("Duplicate worker ID"))
  } else {
    val worker = new WorkerInfo(id, workerHost, workerPort, cores, memory,
      workerRef, workerWebUiUrl)
    // Add the worker's information to the in-memory bookkeeping
    if (registerWorker(worker)) {
      persistenceEngine.addWorker(worker)
      workerRef.send(RegisteredWorker(self, masterWebUiUrl, masterAddress))
      // Trigger scheduling
      schedule()
    } else {
      val workerAddress = worker.endpoint.address
      logWarning("Worker registration failed. Attempted to re-register worker at same " +
        "address: " + workerAddress)
      workerRef.send(RegisterWorkerFailed("Attempted to re-register worker at same address: "
        + workerAddress))
    }
  }
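The worker reacts differently to each of the three possible replies. Here is a simplified sketch modeled on Worker.handleRegisterResponse, with the bodies reduced to logging (not verbatim Spark code):

import org.apache.spark.deploy.DeployMessages._

// Simplified sketch of the worker side of the handshake
def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = msg match {
  case RegisteredWorker(masterRef, masterWebUiUrl, masterAddress) =>
    // Success: remember the active master and start sending heartbeats to it
    println(s"Successfully registered with master ${masterRef.address}")
  case MasterInStandby =>
    // A STANDBY master does not register workers; keep trying the other masters
    println("Master is standby, ignoring registration response")
  case RegisterWorkerFailed(message) =>
    // e.g. "Duplicate worker ID" as seen above
    println(s"Worker registration failed: $message")
}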
All three registration flows above use persistenceEngine, so let's look at what PersistenceEngine is for. Its scaladoc reads:
/**
* Allows Master to persist any state that is necessary in order to recover from a failure.
* The following semantics are required:
* - addApplication and addWorker are called before completing registration of a new app/worker.
* - removeApplication and removeWorker are called at any time.
* Given these two requirements, we will have all apps and workers persisted, but
* we might not have yet deleted apps or workers that finished (so their liveness must be verified
* during recovery).
*
* The implementation of this trait defines how name-object pairs are stored or retrieved.
*/
In short, the PersistenceEngine persists whatever state the Master needs in order to recover from a failure, under two guarantees (the trait's shape is sketched just below):
(1) addApplication and addWorker are called before the registration of a new app/worker completes;
(2) removeApplication and removeWorker may be called at any time.
Consequently every app and worker is persisted, but some that have already finished may not have been deleted yet, so their liveness must be verified during recovery.
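The "name-object pairs" mentioned in the scaladoc show up directly in the trait's API. Condensed from PersistenceEngine.scala (signatures approximate):

import scala.reflect.ClassTag

// Condensed shape of PersistenceEngine
abstract class PersistenceEngine {
  def persist(name: String, obj: Object): Unit   // store a name-object pair
  def unpersist(name: String): Unit              // remove it
  def read[T: ClassTag](prefix: String): Seq[T]  // read back everything whose name matches the prefix

  final def addApplication(app: ApplicationInfo): Unit = persist("app_" + app.id, app)
  final def addWorker(worker: WorkerInfo): Unit = persist("worker_" + worker.id, worker)
  // removeApplication / addDriver / removeDriver follow the same pattern
}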
The PersistenceEngine is created in the Master's onStart() method, and three recovery modes are supported: ZOOKEEPER, FILESYSTEM, and CUSTOM (when no mode is configured, a no-op BlackHolePersistenceEngine is used).
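The choice is driven by the spark.deploy.recoveryMode configuration. Condensed from Master.onStart() (roughly; details vary by version):

// Condensed from Master.onStart(): pick the persistence engine by recovery mode
val (persistenceEngine_, leaderElectionAgent_) = RECOVERY_MODE match {
  case "ZOOKEEPER" =>
    val zkFactory = new ZooKeeperRecoveryModeFactory(conf, serializer)
    (zkFactory.createPersistenceEngine(), zkFactory.createLeaderElectionAgent(this))
  case "FILESYSTEM" =>
    val fsFactory = new FileSystemRecoveryModeFactory(conf, serializer)
    (fsFactory.createPersistenceEngine(), fsFactory.createLeaderElectionAgent(this))
  case "CUSTOM" =>
    // Instantiate the user-provided StandaloneRecoveryModeFactory reflectively
    val clazz = Utils.classForName(conf.get("spark.deploy.recoveryMode.factory"))
    val factory = clazz.getConstructor(classOf[SparkConf], classOf[Serializer])
      .newInstance(conf, serializer).asInstanceOf[StandaloneRecoveryModeFactory]
    (factory.createPersistenceEngine(), factory.createLeaderElectionAgent(this))
  case _ =>
    // No recovery mode configured: BlackHolePersistenceEngine persists nothing
    (new BlackHolePersistenceEngine(), new MonarchyLeaderAgent(this))
}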
To summarize, this post walked through the application, driver, and worker registration flows in Spark standalone mode: how the Master responds to each registration request, how the driver handles the registration response, and how a worker registers after startup, as well as the role the PersistenceEngine plays in all three flows.