TaskScheduler is a trait; Spark currently ships only one implementation, TaskSchedulerImpl, which is instantiated inside SparkContext:
private[spark] class TaskSchedulerImpl(
    val sc: SparkContext,
    val maxTaskFailures: Int,
    isLocal: Boolean = false)
  extends TaskScheduler with Logging
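For orientation, the trait itself mainly declares the lifecycle and task-submission hooks that the DAGScheduler drives. An abridged sketch, based on the Spark 1.x source:

private[spark] trait TaskScheduler {
  def rootPool: Pool
  def schedulingMode: SchedulingMode
  def start(): Unit
  def stop(): Unit
  // Called by the DAGScheduler with one TaskSet per stage attempt
  def submitTasks(taskSet: TaskSet): Unit
  def cancelTasks(stageId: Int, interruptThread: Boolean): Unit
  def setDAGScheduler(dagScheduler: DAGScheduler): Unit
  def defaultParallelism(): Int
  ...
}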
TaskSchedulerImpl is in practice a proxy in front of a SchedulerBackend: the backend talks to the cluster manager, while TaskSchedulerImpl handles generic logic such as the scheduling order between jobs and resubmitting slow-running tasks on idle nodes (speculation).
// SparkContext calls TaskSchedulerImpl.initialize, passing in the SchedulerBackend
def initialize(backend: SchedulerBackend) {
  this.backend = backend
  // temporarily set rootPool name to empty
  rootPool = new Pool("", schedulingMode, 0, 0)
  schedulableBuilder = {
    schedulingMode match {
      case SchedulingMode.FIFO =>
        new FIFOSchedulableBuilder(rootPool)
      case SchedulingMode.FAIR =>
        new FairSchedulableBuilder(rootPool, conf)
    }
  }
  schedulableBuilder.buildPools()
}
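For context, the call site in SparkContext.createTaskScheduler looks roughly like this for a standalone master URL (abridged from the Spark 1.x source):

case SPARK_REGEX(sparkUrl) =>
  val scheduler = new TaskSchedulerImpl(sc)
  val masterUrls = sparkUrl.split(",").map("spark://" + _)
  // The backend is created after the scheduler, then injected via initialize
  val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
  scheduler.initialize(backend)
  (backend, scheduler)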
A Pool is used to schedule TaskSetManagers. Both Pool and TaskSetManager extend the Schedulable trait, so a Pool can contain TaskSetManagers or other Pools.
Spark defaults to FIFO (First In, First Out) scheduling; FAIR mode is also available. FIFO mode uses a single pool, while FAIR mode supports multiple pools. Pools themselves also come in FIFO and FAIR flavors, built by FIFOSchedulableBuilder and FairSchedulableBuilder respectively.
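Switching modes is a configuration matter; a minimal sketch using the standard keys (in FAIR mode, the pool definitions themselves would live in a fairscheduler.xml file):

val conf = new SparkConf()
  .setAppName("scheduling-demo")
  .set("spark.scheduler.mode", "FAIR")   // default is FIFO
val sc = new SparkContext(conf)
// Jobs submitted from this thread are placed into the named pool
sc.setLocalProperty("spark.scheduler.pool", "production")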
SparkContext decides which SchedulerBackend to use based on the master parameter. Taking Spark Standalone mode as an example, it uses SparkDeploySchedulerBackend, which extends CoarseGrainedSchedulerBackend:
private[spark] class SparkDeploySchedulerBackend(
    scheduler: TaskSchedulerImpl,
    sc: SparkContext,
    masters: Array[String])
  extends CoarseGrainedSchedulerBackend(scheduler, sc.env.rpcEnv)
  with AppClientListener
  with Logging {
// SparkContext calls TaskSchedulerImpl.start
override def start() {
  backend.start()
  // If speculation is enabled, start a thread that periodically resubmits
  // slow-running tasks on idle resources
  if (!isLocal && conf.getBoolean("spark.speculation", false)) {
    logInfo("Starting speculative execution thread")
    sc.env.actorSystem.scheduler.schedule(SPECULATION_INTERVAL_MS milliseconds,
        SPECULATION_INTERVAL_MS milliseconds) {
      Utils.tryOrStopSparkContext(sc) { checkSpeculatableTasks() }
    }(sc.env.actorSystem.dispatcher)
  }
}
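Speculation is off by default; the check interval and the definition of "slow" are controlled by standard configuration keys, for example:

val conf = new SparkConf()
  .set("spark.speculation", "true")
  .set("spark.speculation.interval", "100ms")  // how often checkSpeculatableTasks runs
  .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before speculating
  .set("spark.speculation.multiplier", "1.5")  // tasks slower than 1.5x the median are candidates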
SparkDeploySchedulerBackend.start initializes an AppClient object, which is mainly used for Akka communication between the Driver and the Master, e.g., registering the Spark application:
// SparkDeploySchedulerBackend.start()
override def start() {
  super.start()
  ...
  val sparkJavaOpts = Utils.sparkJavaOpts(conf, SparkConf.isExecutorStartupConf)
  val javaOpts = sparkJavaOpts ++ extraJavaOpts
  // Wrap CoarseGrainedExecutorBackend in a Command
  val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
    args, sc.executorEnvs, classPathEntries ++ testingClassPath, libraryPathEntries, javaOpts)
  val appUIAddress = sc.ui.map(_.appUIAddress).getOrElse("")
  val coresPerExecutor = conf.getOption("spark.executor.cores").map(_.toInt)
  val appDesc = new ApplicationDescription(sc.appName, maxCores, sc.executorMemory,
    command, appUIAddress, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor)
  // Hand the ApplicationDescription to an AppClient, which registers the app with the Master
  client = new AppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
  client.start()
  ...
}
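The args wrapped into the Command contain placeholders that the Worker substitutes when it actually launches the executor process; in the Spark 1.x source they look roughly like this:

val args = Seq(
  "--driver-url", driverUrl,
  "--executor-id", "{{EXECUTOR_ID}}",
  "--hostname", "{{HOSTNAME}}",
  "--cores", "{{CORES}}",
  "--app-id", "{{APP_ID}}",
  "--worker-url", "{{WORKER_URL}}")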