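Master Failover: How It Works
Two Masters can be configured: Spark's native standalone mode supports failover between an Active Master and a Standby Master.
Failover can be based on one of two mechanisms, the file system or ZooKeeper. With the file-system mechanism, you must switch to the Standby Master manually after the Active Master goes down. With the ZooKeeper mechanism, the switch to the Standby Master happens automatically.
The failover mechanism itself is the sequence of operations performed when the Standby Master takes over after the Active Master has gone down: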
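- The Standby Master uses the persistence engine (FileSystemPersistenceEngine or ZooKeeperPersistenceEngine) to read the persisted storedApps, storedDrivers, and storedWorkers.
- It checks whether any of storedApps, storedDrivers, or storedWorkers is non-empty. If all three are empty there is nothing to recover; otherwise recovery continues with the steps below.
- It re-registers the persisted Application, Driver, and Worker information into the Master's in-memory cache structures.
- It sets the state of every Application and Worker to UNKNOWN, then sends its own address (that of the former Standby Master, which is now taking over) to each Application's Driver and to each Worker.
- Any Driver or Worker that is still running normally replies to the new Master once it receives that address.
- After the responses have come in, the Master calls completeRecovery() to handle the Drivers and Workers that did not respond, and filters out their information.
- It calls its own schedule() method to schedule the Drivers and Applications that are waiting for resources, for example by launching a Driver on some Worker, or by launching the Executors an Application needs on Workers.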
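The steps above can be sketched as a small toy model. This is only an illustration under simplifying assumptions: it tracks Workers alone, and RecoverySketch, Node, StandbyMaster, beginRecovery, and onResponse are hypothetical names, not Spark's API.

```scala
import scala.collection.mutable

// Toy model of Master recovery. All names here are illustrative, not Spark's API.
object RecoverySketch {
  sealed trait NodeState
  case object Alive extends NodeState
  case object Unknown extends NodeState

  final class Node(val id: String, var state: NodeState)

  final class StandbyMaster(persistedWorkers: Seq[String]) {
    val workers = mutable.LinkedHashMap.empty[String, Node]

    // Read the persisted ids, re-register them in memory, and mark each one UNKNOWN.
    def beginRecovery(): Unit =
      persistedWorkers.foreach(id => workers.update(id, new Node(id, Unknown)))

    // A worker that is still running answers the new Master and becomes ALIVE again.
    def onResponse(id: String): Unit =
      workers.get(id).foreach(n => n.state = Alive)

    // Drop every worker that never answered; returns the removed ids.
    def completeRecovery(): Seq[String] = {
      val dead = workers.values.filter(_.state == Unknown).map(_.id).toList
      dead.foreach(id => workers.remove(id))
      dead
    }
  }
}
```

In this model, a Master that restores three workers and hears back from only two of them ends up with exactly those two in its cache after completeRecovery().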
Failover Source Code Analysis
org/apache/spark/deploy/master/Master.scala, the completeRecovery method.
/**
 * Complete the Master failover, i.e. finish the Master's recovery.
 */
def completeRecovery() {
  // Ensure "only-once" recovery semantics using a short synchronization period.
  synchronized {
    if (state != RecoveryState.RECOVERING) { return }
    state = RecoveryState.COMPLETING_RECOVERY
  }
  // Select the Applications and Workers whose state is still UNKNOWN, then call
  // removeWorker and finishApplication on each of them to clean up Applications
  // and Workers that may have failed or even died.
  // The cleanup does three things: 1. remove the entry from the in-memory cache;
  // 2. remove it from the in-memory caches of related components;
  // 3. remove it from persistent storage.
  // Kill off any workers and apps that didn't respond to us.
  workers.filter(_.state == WorkerState.UNKNOWN).foreach(removeWorker)
  apps.filter(_.state == ApplicationState.UNKNOWN).foreach(finishApplication)
  // Reschedule drivers which were not claimed by any workers
  drivers.filter(_.worker.isEmpty).foreach { d =>
    logWarning(s"Driver ${d.id} was not found after master recovery")
    if (d.desc.supervise) {
      logWarning(s"Re-launching ${d.id}")
      relaunchDriver(d)
    } else {
      removeDriver(d.id, DriverState.ERROR, None)
      logWarning(s"Did not re-launch ${d.id} because it was not supervised")
    }
  }
  state = RecoveryState.ALIVE
  schedule()
  logInfo("Recovery complete - resuming operations!")
}
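Registration: How It Works
Worker registration:
- After a Worker starts, it actively registers itself with the Master.
- The Master filters the existing entries: Workers in the DEAD state are filtered out, and for a Worker in the UNKNOWN state the old Worker information is cleaned up and replaced with the new Worker information.
- The Worker is added to the in-memory cache (a HashMap).
- The persistence engine persists the Worker information (file system or ZooKeeper).
- The schedule() method is called.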
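A minimal sketch of the Worker registration steps, assuming workers are keyed by address. WorkerRegistrySketch, Registry, and the `persisted` set (a stand-in for the persistence engine) are hypothetical names used only for illustration.

```scala
import scala.collection.mutable

// Toy model of worker registration. Names are illustrative, not Spark's API.
object WorkerRegistrySketch {
  sealed trait WorkerState
  case object Alive extends WorkerState
  case object Dead extends WorkerState
  case object Unknown extends WorkerState

  final case class Worker(id: String, address: String, state: WorkerState)

  final class Registry {
    val byAddress = mutable.Map.empty[String, Worker] // in-memory cache
    val persisted = mutable.Set.empty[String]         // stand-in for the persistence engine

    // Returns true if the worker was accepted.
    def register(w: Worker): Boolean = {
      val existing = byAddress.get(w.address)
      if (existing.exists(_.state == Alive)) {
        false // a live worker already holds this address: reject the duplicate
      } else {
        // A stale entry (DEAD, or UNKNOWN left over from recovery) is cleaned up first.
        existing.foreach { old => persisted -= old.id; byAddress -= old.address }
        byAddress(w.address) = w
        persisted += w.id
        true // the real Master would call schedule() at this point
      }
    }
  }
}
```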
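Driver registration:
- When a Spark Application is submitted with spark-submit (in standalone cluster deploy mode), the Driver is registered first.
- The Driver information is put into the in-memory cache (a HashMap).
- The Driver is appended to the waiting-to-be-scheduled queue (an ArrayBuffer).
- The persistence engine persists the Driver information.
- schedule() is called to perform scheduling.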
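The Driver registration steps can be sketched as follows. This is a toy model: DriverRegistrationSketch, its Master class, and the `persisted` buffer are hypothetical, and schedule() is only a counter here.

```scala
import scala.collection.mutable

// Toy model of driver registration. Names are illustrative, not Spark's API.
object DriverRegistrationSketch {
  final case class Driver(id: String, cores: Int, memoryMb: Int)

  final class Master {
    val drivers = mutable.Map.empty[String, Driver]        // in-memory cache
    val waitingDrivers = mutable.ArrayBuffer.empty[Driver] // FIFO scheduling queue
    val persisted = mutable.ArrayBuffer.empty[String]      // stand-in for the persistence engine
    var scheduleCalls = 0

    // Placeholder for real scheduling: only counts how often it was invoked.
    def schedule(): Unit = { scheduleCalls += 1 }

    // Cache the driver, queue it, persist it, then trigger scheduling.
    def registerDriver(d: Driver): Unit = {
      drivers(d.id) = d
      waitingDrivers += d
      persisted += d.id
      schedule()
    }
  }
}
```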
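Application registration:
- Once the Driver has started, it runs the Application we wrote and initializes SparkContext. Underneath, SparkDeploySchedulerBackend uses the ClientActor inside AppClient to send RegisterApplication to the Master, which registers the Application.
- The Application information is put into the in-memory cache (a HashMap).
- The Application is appended to the queue of Applications waiting to be scheduled (an ArrayBuffer).
- The persistence engine persists the Application information.
- The schedule() method is called to schedule resources.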
Application Registration Source Code Analysis
org/apache/spark/deploy/master/Master.scala, the handler for the RegisterApplication case class.
/**
 * Handle an Application registration request.
 */
case RegisterApplication(description) => {
  // If this Master is in the STANDBY state, i.e. it is the Standby Master rather
  // than the Active Master, it does nothing when an Application asks to register.
  if (state == RecoveryState.STANDBY) {
    // ignore, don't send response
  } else {
    logInfo("Registering app " + description.name)
    // Create an ApplicationInfo from the ApplicationDescription
    val app = createApplication(description, sender)
    // Register the Application: add the ApplicationInfo to the caches and
    // append the Application to the waitingApps scheduling queue
    registerApplication(app)
    logInfo("Registered app " + description.name + " with ID " + app.id)
    // Persist the Application through the persistence engine
    persistenceEngine.addApplication(app)
    // The AppClient created by SparkDeploySchedulerBackend sends the registration
    // request to the Master actor through its ClientActor. The ClientActor's own
    // reference travels with the message and is available here as `sender`, so the
    // Master knows who sent RegisterApplication and can reply to that ClientActor
    // with RegisteredApplication.
    sender ! RegisteredApplication(app.id, masterUrl)
    schedule()
  }
}

def registerApplication(app: ApplicationInfo): Unit = {
  val appAddress = app.driver.path.address
  if (addressToApp.contains(appAddress)) {
    logInfo("Attempted to re-register application at same address: " + appAddress)
    return
  }
  applicationMetricsSystem.registerSource(app.appSource)
  // Add the app's information to the in-memory caches
  apps += app
  idToApp(app.id) = app
  actorToApp(app.driver) = app
  addressToApp(appAddress) = app
  // Append the app to the waitingApps scheduling queue
  waitingApps += app
}
Master State-Change Handling: Source Code Analysis
Driver state changes: package org.apache.spark.deploy.master, DriverStateChanged.
case DriverStateChanged(driverId, state, exception) => {
  state match {
    // If the Driver's state is ERROR, FINISHED, KILLED, or FAILED,
    // remove the Driver.
    case DriverState.ERROR | DriverState.FINISHED | DriverState.KILLED | DriverState.FAILED =>
      removeDriver(driverId, state, exception)
    case _ =>
      throw new Exception(s"Received unexpected state update for driver $driverId: $state")
  }
}

def removeDriver(driverId: String, finalState: DriverState, exception: Option[Exception]) {
  // Use Scala's higher-order find() to look up the driver with this driverId
  drivers.find(d => d.id == driverId) match {
    // Found: find() returns an Option, so a hit is a Some
    case Some(driver) =>
      logInfo(s"Removing driver: $driverId")
      // Remove the driver from the in-memory cache
      drivers -= driver
      if (completedDrivers.size >= RETAINED_DRIVERS) {
        val toRemove = math.max(RETAINED_DRIVERS / 10, 1)
        completedDrivers.trimStart(toRemove)
      }
      // Add the driver to completedDrivers
      completedDrivers += driver
      // Remove the driver's persisted information through the persistence engine
      persistenceEngine.removeDriver(driver)
      // Set the driver's state and exception
      driver.state = finalState
      driver.exception = exception
      // Remove the driver from the worker it was running on
      driver.worker.foreach(w => w.removeDriver(driver))
      // As before, call schedule()
      schedule()
    case None =>
      logWarning(s"Asked to remove unknown driver: $driverId")
  }
}
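The bounded history kept in completedDrivers can be sketched in isolation. In this illustration the hypothetical `retained` parameter plays the role of RETAINED_DRIVERS, and `remove(0, n)` is used in place of trimStart to drop the oldest entries.

```scala
import scala.collection.mutable.ArrayBuffer

// Sketch of a bounded "completed" history. RetentionSketch is a hypothetical name.
object RetentionSketch {
  def addCompleted[A](completed: ArrayBuffer[A], item: A, retained: Int): Unit = {
    if (completed.size >= retained) {
      // Drop the oldest 10% of the limit (at least one entry) in a single batch.
      val toRemove = math.max(retained / 10, 1)
      completed.remove(0, math.min(toRemove, completed.size))
    }
    completed += item
  }
}
```

Trimming a batch at a time, rather than one entry per insertion, means the buffer is only shifted once every few additions.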
Executor state changes: package org.apache.spark.deploy.master, ExecutorStateChanged.
case ExecutorStateChanged(appId, execId, state, message, exitStatus) => {
  // Find the app the executor belongs to, then look the executor up in the app's
  // executors cache
  val execOption = idToApp.get(appId).flatMap(app => app.executors.get(execId))
  execOption match {
    // The executor was found
    case Some(exec) => {
      val appInfo = idToApp(appId)
      // Record the executor's current state
      exec.state = state
      if (state == ExecutorState.RUNNING) { appInfo.resetRetryCount() }
      // Forward the new state to the driver in an ExecutorUpdated message
      exec.application.driver ! ExecutorUpdated(execId, state, message, exitStatus)
      // If the executor has finished
      if (ExecutorState.isFinished(state)) {
        // Remove this executor from the worker and app
        logInfo(s"Removing executor ${exec.fullId} because it is $state")
        // Remove the executor from the app's cache
        appInfo.removeExecutor(exec)
        // Remove the executor from the cache of the worker that ran it
        exec.worker.removeExecutor(exec)
        // Check whether the executor exited abnormally
        val normalExit = exitStatus == Some(0)
        // Only retry certain number of times so we don't go into an infinite loop.
        if (!normalExit) {
          // Has the application reached the maximum number of retries (10)?
          if (appInfo.incrementRetryCount() < ApplicationState.MAX_NUM_RETRY) {
            // Not yet: schedule again
            schedule()
          } else {
            // Otherwise remove the application: its executors have failed on every
            // attempt, so the application itself is considered failed
            val execs = appInfo.executors.values
            if (!execs.exists(_.state == ExecutorState.RUNNING)) {
              logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " +
                s"${appInfo.retryCount} times; removing it")
              removeApplication(appInfo, ApplicationState.FAILED)
            }
          }
        }
      }
    }
    case None =>
      logWarning(s"Got status update for unknown executor $appId/$execId")
  }
}
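The retry decision in the handler above can be isolated into a small sketch. RetrySketch, Decision, and onExecutorFinished are hypothetical names, and MaxNumRetry is assumed to be 10, as the comment in the source states.

```scala
// Toy model of the executor retry counter. Names are illustrative, not Spark's API.
object RetrySketch {
  val MaxNumRetry = 10 // assumed to mirror ApplicationState.MAX_NUM_RETRY

  sealed trait Decision
  case object Reschedule extends Decision
  case object RemoveApplication extends Decision
  case object NoAction extends Decision

  final class App {
    private var retryCount = 0
    def resetRetryCount(): Unit = { retryCount = 0 }
    def incrementRetryCount(): Int = { retryCount += 1; retryCount }
  }

  // Decide what to do when an executor finishes with the given exit status.
  def onExecutorFinished(app: App, exitStatus: Option[Int], anyRunning: Boolean): Decision =
    if (exitStatus.contains(0)) NoAction                          // normal exit
    else if (app.incrementRetryCount() < MaxNumRetry) Reschedule  // retries remain
    else if (!anyRunning) RemoveApplication                       // out of retries, nothing running
    else NoAction
}
```

In this model nine consecutive abnormal exits each trigger a reschedule, and the tenth removes the application if none of its executors is still running.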
Master Resource Scheduling Algorithm: How It Works and Source Code Analysis
org/apache/spark/deploy/master/Master.scala, the schedule method.
/**
 * Schedule the currently available resources among waiting apps. This method will be called
 * every time a new app joins or resource availability changes.
 */
private def schedule() {
  // If the Master's state is not ALIVE, return immediately:
  // a Standby Master never schedules applications or other resources.
  if (state != RecoveryState.ALIVE) { return }
  // First schedule drivers, they take strict precedence over applications
  // Randomization helps balance drivers
  // Random.shuffle randomly permutes the elements of the collection passed in.
  // Take all previously registered workers, keep only those in the ALIVE state,
  // and shuffle them with Random.shuffle.
  val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
  // Number of alive workers
  val numWorkersAlive = shuffledAliveWorkers.size
  var curPos = 0
  // Drivers are scheduled first.
  // When is a driver registered with the Master and therefore scheduled here?
  // Only when the application is submitted in standalone cluster deploy mode. In
  // client mode the driver is started locally by spark-submit and is never
  // registered, so the Master has nothing to schedule. (YARN modes do not involve
  // this Master at all.)
  // Driver scheduling: iterate over the waitingDrivers ArrayBuffer
  for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
    // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
    // start from the last worker that was assigned a driver, and continue onwards until we have
    // explored all alive workers.
    var launched = false
    var numWorkersVisited = 0
    // Keep iterating while numWorkersVisited < numWorkersAlive, i.e. while some
    // alive worker has not been visited yet, and while this driver has not been
    // launched (launched is false)
    while (numWorkersVisited < numWorkersAlive && !launched) {
      // Take an alive worker
      val worker = shuffledAliveWorkers(curPos)
      // One more worker visited
      numWorkersVisited += 1
      // If this worker's free memory is at least what the driver needs
      // and its free CPU cores are at least what the driver needs
      if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
        // Launch the driver
        launchDriver(worker, driver)
        // Remove the driver from the waitingDrivers queue so that it is not
        // scheduled again, and set launched to true
        waitingDrivers -= driver
        launched = true
      }
      // Move the pointer to the next worker
      curPos = (curPos + 1) % numWorkersAlive
      // The driver cycles through the alive workers; once launched is true, it has
      // been started on some worker and the loop ends.
    }
  }
  // Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
  // in the queue, then the second app, etc.
  // Application scheduling (the core of the core).
  // There are two Application scheduling algorithms: spreadOutApps and non-spreadOutApps.
  // val spreadOutApps = conf.getBoolean("spark.deploy.spreadOut", true)
  if (spreadOutApps) {
    // Try to spread out each app among all the nodes, until it has all its cores
    // Iterate over the ApplicationInfo entries in waitingApps, keeping only the
    // Applications that still have cores left to schedule
    for (app <- waitingApps if app.coresLeft > 0) {
      // From workers, keep those in the ALIVE state, then keep those the
      // Application can use, and sort them by free cores in descending order.
      // canUse: worker.memoryFree >= app.desc.memoryPerSlave && !worker.hasExecutor(app)
      val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
        .filter(canUse(app, _)).sortBy(_.coresFree).reverse
      val numUsable = usableWorkers.length
      // An array, initially all zeros, holding the number of cores to give each worker
      val assigned = new Array[Int](numUsable) // Number of cores to give on each node
      // How many cores to assign in total: the minimum of the cores the app still
      // needs and the total free cores across the usable workers
      var toAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)
      // This algorithm spreads the executors of each application evenly across the
      // workers. For example, with 10 workers and 20 cores to assign, the loop goes
      // around the workers twice, giving each worker 1 core per round, so every
      // worker ends up with 2 cores.
      // The while loop continues as long as there are cores left to assign.
      var pos = 0
      while (toAssign > 0) {
        // If this worker's free cores exceed the cores already assigned to it,
        // it still has cores available
        if (usableWorkers(pos).coresFree - assigned(pos) > 0) {
          // One fewer core left to assign, since one core goes to this worker
          toAssign -= 1
          // One more core assigned to this worker
          assigned(pos) += 1
        }
        // Move the pointer to the next worker
        pos = (pos + 1) % numUsable
      }
      // Now that we've decided how many cores to give on each node, let's actually give them
      // With the per-worker core counts decided, iterate over the workers
      for (pos <- 0 until numUsable) {
        // Only act on workers that were assigned at least one core
        if (assigned(pos) > 0) {
          // First add an executor to the application's internal cache; this creates
          // an ExecutorDesc that records how many cores the executor gets.
          // This is how executor launch works internally as of at least Spark 1.3.0.
          // spark-submit lets you specify the number of executors and the cores and
          // memory per executor, but under this mechanism the actual number of
          // executors and the cores per executor may differ from that configuration,
          // because assignment is driven by the total core count. For example, if
          // 3 executors with 3 cores each are requested and there are 9 workers with
          // 1 core each, 9 cores are to be assigned in total; this algorithm gives
          // each worker 1 core and launches one executor per worker, so 9 executors
          // with 1 core each are started.
          val exec = app.addExecutor(usableWorkers(pos), assigned(pos))
          // Launch the executor on the worker
          launchExecutor(usableWorkers(pos), exec)
          // Set the application's state to RUNNING
          app.state = ApplicationState.RUNNING
        }
      }
    }
  } else {
    // Pack each app into as few nodes as possible until we've assigned all its cores
    // The non-spreadOutApps algorithm is the exact opposite of spreadOutApps:
    // each application is placed on as few workers as possible.
    // For example, with 10 workers of 10 cores each and an app that needs 20 cores,
    // only two workers are used and each has all 10 of its cores taken; any other
    // app can only be placed on the next worker.
    // So if spark-submit asks for 10 executors with 2 cores each (20 cores in
    // total), this algorithm actually launches only 2 executors with 10 cores each.
    // Iterate over the workers that are ALIVE and still have free cores
    for (worker <- workers if worker.coresFree > 0 && worker.state == WorkerState.ALIVE) {
      // Iterate over the applications that still have cores left to assign
      for (app <- waitingApps if app.coresLeft > 0) {
        // If the application can use this worker
        if (canUse(app, worker)) {
          // Take the minimum of the worker's free cores and the cores the app still needs
          val coresToUse = math.min(worker.coresFree, app.coresLeft)
          if (coresToUse > 0) {
            // Add an executor to the app
            val exec = app.addExecutor(worker, coresToUse)
            // Launch the executor on the worker
            launchExecutor(worker, exec)
            // Set the application's state to RUNNING
            app.state = ApplicationState.RUNNING
          }
        }
      }
    }
  }
}
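To compare the two Application scheduling strategies side by side, here is a minimal sketch that works on plain arrays of free core counts. CoreAssignmentSketch, spreadOut, and pack are hypothetical names. The sketch leaves out the memory check in canUse and the sort by free cores, so it is a simplification of what schedule() does.

```scala
// Sketch of the two core-assignment strategies. Names are illustrative, not Spark's API.
object CoreAssignmentSketch {
  // Spread out: hand out one core at a time, round-robin over the workers.
  def spreadOut(coresFree: Array[Int], coresWanted: Int): Array[Int] = {
    val assigned = new Array[Int](coresFree.length)
    var toAssign = math.min(coresWanted, coresFree.sum)
    var pos = 0
    while (toAssign > 0) {
      if (coresFree(pos) - assigned(pos) > 0) {
        toAssign -= 1
        assigned(pos) += 1
      }
      pos = (pos + 1) % coresFree.length
    }
    assigned
  }

  // Pack: take as many cores as possible from each worker before moving on.
  def pack(coresFree: Array[Int], coresWanted: Int): Array[Int] = {
    var left = coresWanted
    coresFree.map { free =>
      val use = math.min(free, left)
      left -= use
      use
    }
  }
}
```

With 10 workers of 10 free cores each and 20 cores requested, spreadOut assigns 2 cores to every worker, while pack assigns 10 cores to each of the first two workers and none to the rest.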
The launchDriver method:
def launchDriver(worker: WorkerInfo, driver: DriverInfo) {
  logInfo("Launching driver " + driver.id + " on worker " + worker.id)
  // Add the driver to the worker's internal cache; this also adds the memory and
  // cores the driver needs to the worker's used memory and used cores
  worker.addDriver(driver)
  // Likewise record the worker in the driver's internal state
  driver.worker = Some(worker)
  // Send LaunchDriver to the worker's actor so that the Worker starts the Driver
  worker.actor ! LaunchDriver(driver.id, driver.desc)
  // Set the driver's state to RUNNING
  driver.state = DriverState.RUNNING
}
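The round-robin driver placement in schedule(), together with the resource accounting that launchDriver triggers through worker.addDriver, can be sketched as below. This is a toy model with hypothetical names (DriverPlacementSketch, place). It skips Random.shuffle so that the result is deterministic.

```scala
// Toy model of round-robin driver placement. Names are illustrative, not Spark's API.
object DriverPlacementSketch {
  final case class Driver(id: String, cores: Int, memoryMb: Int)

  final class Worker(val id: String, val cores: Int, val memoryMb: Int) {
    var coresUsed = 0
    var memoryUsed = 0
    def coresFree: Int = cores - coresUsed
    def memoryFree: Int = memoryMb - memoryUsed
    // Resource accounting performed when a driver is assigned to this worker.
    def addDriver(d: Driver): Unit = { coresUsed += d.cores; memoryUsed += d.memoryMb }
  }

  // Returns driverId -> workerId for every driver that could be placed.
  def place(workers: IndexedSeq[Worker], drivers: Seq[Driver]): Map[String, String] = {
    var curPos = 0
    var placed = Map.empty[String, String]
    for (d <- drivers) {
      var visited = 0
      var launched = false
      // Visit each worker at most once per driver, resuming where the last driver stopped.
      while (visited < workers.size && !launched) {
        val w = workers(curPos)
        visited += 1
        if (w.memoryFree >= d.memoryMb && w.coresFree >= d.cores) {
          w.addDriver(d)
          placed += (d.id -> w.id)
          launched = true
        }
        curPos = (curPos + 1) % workers.size
      }
    }
    placed
  }
}
```

A driver that fits on no worker is simply left unplaced, which corresponds to it staying in waitingDrivers until resources change.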
The launchExecutor method:
def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc) {
  logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
  // Add the executor to the worker's internal cache
  worker.addExecutor(exec)
  // Send a LaunchExecutor message to the worker's actor
  worker.actor ! LaunchExecutor(masterUrl,
    exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory)
  // Send an ExecutorAdded message to the driver of the executor's application
  exec.application.driver ! ExecutorAdded(
    exec.id, worker.id, worker.hostPort, exec.cores, exec.memory)
}
This article was first published on steem. Thanks for reading; please credit the source when reposting.