After a Spark application starts, each Executor periodically sends heartbeat messages to the Driver. On the Driver side they are handled by HeartbeatReceiver, which, when it starts (in onStart), schedules a periodic task that maintains the list of live Executors:
override def onStart(): Unit = {
  // Periodically ask this endpoint to expire executors whose heartbeats have stopped
  timeoutCheckingTask = eventLoopThread.scheduleAtFixedRate(new Runnable {
    override def run(): Unit = Utils.tryLogNonFatalError {
      Option(self).foreach(_.ask[Boolean](ExpireDeadHosts))
    }
  }, 0, checkTimeoutIntervalMs, TimeUnit.MILLISECONDS)
}
checkTimeoutIntervalMs defaults to 60s and can be set with either spark.storage.blockManagerTimeoutIntervalMs or spark.network.timeoutInterval.
So every 60 seconds the Driver runs the check that expires dead Executors.
An Executor counts as dead when (current time - time of its last heartbeat) exceeds the executor timeout, 120s by default: if no heartbeat arrives from an Executor within 120 seconds, the Driver treats it as lost. This timeout can be set with spark.storage.blockManagerSlaveTimeoutMs or spark.network.timeout. A sketch of this expiration rule follows.
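A minimal, illustrative sketch of that rule is shown below. It is not the Spark source: the class and method names are invented for the example, and the real HeartbeatReceiver additionally reports an expired Executor to the scheduler as lost.

import scala.collection.mutable

// Illustrative sketch only: keep one "last seen" timestamp per executor and
// drop every executor that has been silent for longer than the timeout.
class DeadHostTrackerSketch(executorTimeoutMs: Long = 120 * 1000L) {
  // executorId -> timestamp (ms) of the most recent heartbeat
  private val executorLastSeen = mutable.Map.empty[String, Long]

  // Called whenever a heartbeat arrives from an executor.
  def onHeartbeat(executorId: String, nowMs: Long = System.currentTimeMillis()): Unit =
    executorLastSeen(executorId) = nowMs

  // Called every checkTimeoutIntervalMs (60s by default): returns the ids of
  // executors that have not been heard from within executorTimeoutMs and
  // forgets them.
  def expireDeadHosts(nowMs: Long = System.currentTimeMillis()): Seq[String] = {
    val expired = executorLastSeen.collect {
      case (executorId, lastSeenMs) if nowMs - lastSeenMs > executorTimeoutMs => executorId
    }.toSeq
    expired.foreach(executorLastSeen.remove)
    expired
  }
}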
On the Executor side, the heartbeat is sent like this:
/**
 * Schedules a task to report heartbeat and partial metrics for active tasks to driver.
 */
private def startDriverHeartbeater(): Unit = {
  val intervalMs = conf.getTimeAsMs("spark.executor.heartbeatInterval", "10s")

  // Wait a random interval so the heartbeats don't end up in sync
  val initialDelay = intervalMs + (math.random * intervalMs).asInstanceOf[Int]

  val heartbeatTask = new Runnable() {
    override def run(): Unit = Utils.logUncaughtExceptions(reportHeartBeat())
  }
  heartbeater.scheduleAtFixedRate(heartbeatTask, initialDelay, intervalMs, TimeUnit.MILLISECONDS)
}
intervalMs defaults to 10s, so every 10 seconds each Executor sends a heartbeat to the Driver; the interval can be changed with the spark.executor.heartbeatInterval property (a configuration example follows).
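A hedged configuration example, in case these intervals need tuning (the object name is just for illustration; the values shown are the defaults): keep spark.executor.heartbeatInterval well below spark.network.timeout, otherwise a delayed heartbeat is enough to get a healthy Executor expired.

import org.apache.spark.SparkConf

object HeartbeatTuningExample {
  def main(args: Array[String]): Unit = {
    // Example values only (these are the defaults): heartbeat every 10s,
    // an executor counts as lost after 120s without one.
    val conf = new SparkConf()
      .setAppName("heartbeat-tuning-example")
      .set("spark.executor.heartbeatInterval", "10s") // executor -> driver heartbeat period
      .set("spark.network.timeout", "120s")           // silence threshold before the driver expires the executor
    println(conf.toDebugString)                       // show the effective settings
  }
}

The same two properties can also be passed on the command line via spark-submit --conf.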