Spark-Core Source Code Study Notes 3: Initialization of SparkContext, SchedulerBackend, and TaskScheduler, and the Application Registration Flow


御街打码


License: CC 4.0 BY-SA

Category: Spark-Core Source Code Study Notes. Tags: Spark, SchedulerBackend, TaskScheduler, SparkContext, application registration

Article link: https://blog.youkuaiyun.com/u011372108/article/details/89305908

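This article records in detail the initialization of SparkContext, SchedulerBackend, and TaskScheduler in Spark-Core, along with the application registration flow. Starting from the getOrCreate method of SparkContext, it examines how these components interact on the driver and executor sides, in particular the creation and initialization of TaskSchedulerImpl and StandaloneSchedulerBackend and the handling of DAGSchedulerEvent. It works through the source code to help explain how Spark operates.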


Spark-Core Source Code Study Notes

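This series records my review of the Spark source code. The goal is to untangle the mechanism and flow by which Spark distributes and runs programs, tracing the key parts of the source in order to understand why things work the way they do. Peripheral code is only described in words, without drilling down, so that it does not obscure the main line.
Starting with this article we enter the core of Spark, which we will go through step by step.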

SparkContext: The Cornerstone

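SparkContext plays an important role throughout the lifetime of a Spark application and initializes many important components, so it can fairly be called the cornerstone of Spark. Starting from the getOrCreate method covered in the previous article, let's step into SparkContext: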

def getOrCreate(): SparkSession = synchronized {
  val sparkContext = userSuppliedContext.getOrElse {
    // Instantiate SparkConf and copy in the options set earlier on the builder
    val sparkConf = new SparkConf()
    options.foreach { case (k, v) => sparkConf.set(k, v) }
    // set a random app name if not given.
    if (!sparkConf.contains("spark.app.name")) {
      sparkConf.setAppName(java.util.UUID.randomUUID().toString)
    }
    // Instantiate SparkContext, a key component covered in detail below.
    // Note: once SparkContext is instantiated, the SparkConf can no longer be changed.
    SparkContext.getOrCreate(sparkConf)
  }
  // ... the rest of the method, which builds the SparkSession, is omitted here
}

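The first step is initializing SparkConf. Its source is fairly simple, mostly assigning and storing values. One thing worth noting: the no-argument constructor calls SparkConf(true), which means defaults are loaded from JVM system properties (the source comment reads "Load any spark.* system properties"). For this reason, tests often construct it with SparkConf(false).
With SparkConf initialized, we move on to the initialization of SparkContext: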
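To make the effect of the loadDefaults flag concrete, here is a minimal sketch in plain Scala. MiniConf is a hypothetical stand-in written for this illustration, not Spark's SparkConf, and the key spark.demo.key is made up. It only models the one behavior discussed above: copying spark.* JVM system properties when loadDefaults is true.

```scala
import scala.collection.mutable

// Hypothetical stand-in for SparkConf (an assumption for illustration only).
final class MiniConf(loadDefaults: Boolean) {
  private val settings = mutable.Map.empty[String, String]

  if (loadDefaults) {
    // Copy every JVM system property whose key starts with "spark."
    sys.props.foreach { case (k, v) =>
      if (k.startsWith("spark.")) settings(k) = v
    }
  }

  // The no-argument constructor delegates to loadDefaults = true
  def this() = this(true)

  def set(key: String, value: String): MiniConf = { settings(key) = value; this }
  def contains(key: String): Boolean = settings.contains(key)
  def get(key: String): Option[String] = settings.get(key)
}

object MiniConfDemo {
  def main(args: Array[String]): Unit = {
    System.setProperty("spark.demo.key", "from-system-property")
    // Picks up the system property
    println(new MiniConf().get("spark.demo.key"))
    // Ignores system properties, which keeps tests isolated from the environment
    println(new MiniConf(false).get("spark.demo.key"))
  }
}
```

Passing false is what makes a test independent of whatever spark.* properties happen to be set on the JVM running it.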

def getOrCreate(config: SparkConf): SparkContext = {
  // Synchronize to ensure that multiple create requests don't trigger an exception
  // from assertNoOtherContextIsRunning within setActiveContext
  SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized {
    if (activeContext.get() == null) {
      // No active context yet: create one and record it in activeContext
      setActiveContext(new SparkContext(config))
    } else {
      // A context already exists, so the newly passed SparkConf has no effect; log a warning
      if (config.getAll.nonEmpty) {
        logWarning("Using an existing SparkContext; some configuration may not take effect.")
      }
    }
    activeContext.get()
  }
}
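The get-or-create logic above can be reproduced in a few lines of plain Scala. The sketch below is an illustration under assumptions: MiniContext, constructorLock, and the Map-based settings are hypothetical names standing in for SparkContext, SPARK_CONTEXT_CONSTRUCTOR_LOCK, and SparkConf. It shows why a second call with different settings returns the first instance unchanged.

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical stand-in for SparkContext; it only remembers its settings.
final class MiniContext(val settings: Map[String, String])

object MiniContext {
  // Plays the role of SPARK_CONTEXT_CONSTRUCTOR_LOCK
  private val constructorLock = new Object
  private val activeContext = new AtomicReference[MiniContext](null)

  def getOrCreate(settings: Map[String, String]): MiniContext =
    constructorLock.synchronized {
      if (activeContext.get() == null) {
        // First caller wins: build the context and record it
        activeContext.set(new MiniContext(settings))
      } else if (settings.nonEmpty) {
        // Later callers get the existing context; their settings are ignored
        Console.err.println(
          "Using an existing context; some configuration may not take effect.")
      }
      activeContext.get()
    }
}
```

Calling MiniContext.getOrCreate twice with different maps yields the same object both times, and its settings are those of the first call. This mirrors why the real method only logs a warning instead of applying the new configuration.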

Next we step into new SparkContext(config). As usual, read the doc comment first:

/**
 * Main entry point for Spark functionality. A SparkContext represents the connection to a Spark
 * cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
 * @note Only one `SparkContext` should be active per JVM. You must `stop()` the
 *   active `SparkContext` before creating a new one.
 * @param config a Spark Config object describing the application configuration. Any settings in
 *   this config overrides the default configs as well as system properties.
 */

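The code segment below is fairly long, and the important parts are annotated. Always keep in mind that all of the code below runs on the driver side, whether the deploy mode is client or cluster: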

class SparkContext(config: SparkConf) extends Logging {
  private[spark] val stopped: AtomicBoolean = new AtomicBoolean(false)
  // log out Spark Version in Spark driver log
  logInfo(s"Running Spark version $SPARK_VERSION")
  /* ------------------------------------------------------------------------------------- *
   | Private variables. These variables keep the internal state of the context, and are    |
   | not accessible by the outside world. They're mutable since we want to initialize all  |
   | of them to some neutral value ahead of time, so that calling "stop()" while the       |
   | constructor is still running is safe.                                                 |
   * ------------------------------------------------------------------------------------- */
   private var _conf: SparkConf = _
   private var _env: SparkEnv = _
   private var _schedulerBackend: SchedulerBackend = _
   private var _taskScheduler: TaskScheduler = _
   @volatile private var _dagScheduler: DAGScheduler = _
   // ... many more private field declarations, omitted here
   /* ------------------------------------------------------------------------------------- *
   | Accessors and public fields. These provide access to the internal state of the        |
   | context.                                                                              |
   * ------------------------------------------------------------------------------------- */
   // ... Scala-style getters and setters for some of the fields above, omitted here
   // _listenerBus and _statusStore are used to monitor the application's state
   _listenerBus = new LiveListenerBus(_conf)
    // Initialize the app status store and listener before SparkEnv is created so that it gets
    // all events.
    val appStatusSource = AppStatusSource.createSource(conf)
    _statusStore = AppStatusStore.createLiveStore(conf, appStatusSource)
    listenerBus.addToStatusQueue(_statusStore.listener.get)

    // Create the Spark execution environment (cache, map output tracker, etc)
    // createSparkEnv calls SparkEnv.createDriverEnv internally, which eventually calls
    // RpcEnv.create(...) to create the Env for sparkDriver.
    // For details, see the earlier article in this series on RpcEnv.
    _env = createSparkEnv(_conf, isLocal, listenerBus)
    SparkEnv.set(_env)

    // As seen here, executor memory defaults to 1 GB (1024 MB)
    _executorMemory = _conf.getOption(EXECUTOR_MEMORY.key)
      .orElse(Option(System.getenv("SPARK_EXECUTOR_MEMORY")))
      .orElse(Option(System.getenv("SPARK_MEM"))
      .map(warnSparkMem))
      .map(Utils.memoryStringToMb)
      .getOrElse(1024)

    // We need to register "HeartbeatReceiver" before "createTaskScheduler" because Executor will
    // retrieve "HeartbeatReceiver" in the constructor. (SPARK-6640)
    // Instantiate the driver-side heartbeat Endpoint, which receives heartbeat messages from Executors
    _heartbeatReceiver = env.rpcEnv.setupEndpoint(
      HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))

    // The most important part starts here: three scheduling components are initialized,
    // and each one deserves to be traced separately
    // Create and start the scheduler
    val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
    _schedulerBackend = sched
    _taskScheduler = ts
    _dagScheduler = new DAGScheduler(this)

    // Send the TaskSchedulerIsSet message
    _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)
    /* On receiving the message, heartbeatReceiver simply binds the SparkContext's taskScheduler and then replies true
    case TaskSchedulerIsSet =>
      scheduler = sc.taskScheduler
      context.reply(true)
    */

    // create and start the heartbeater for collecting memory metrics
    _heartbeater = new Heartbeater(...)
    _heartbeater.start()

    // start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's constructor
    _taskScheduler.start()
    
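    // What follows initializes some application information, such as obtaining the Spark application ID.
    // When dynamic resource allocation is enabled (for example on YARN), an ExecutorAllocationManager is also
    // constructed; it requests or releases executors dynamically according to cluster resources.
    // Finally come some cleaner setup and runtime-status monitoring. That source is not shown here,
    // to keep the focus on the main line.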
}
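The _executorMemory assignment in the constructor above is a fallback chain: configuration first, then two environment variables, then a default of 1024 MB. The sketch below isolates that chain in plain Scala. It is a simplified illustration: ExecutorMemorySketch and its resolve method are hypothetical, the parser only accepts values ending in m or g (Spark's Utils.memoryStringToMb accepts more forms), and the deprecation warning for SPARK_MEM is left out. The key spark.executor.memory is the value of EXECUTOR_MEMORY.key.

```scala
object ExecutorMemorySketch {
  // Simplified parser (assumption): only "<n>m" and "<n>g" are supported.
  def memoryStringToMb(s: String): Int = {
    val lower = s.trim.toLowerCase
    if (lower.endsWith("g")) lower.dropRight(1).toInt * 1024
    else if (lower.endsWith("m")) lower.dropRight(1).toInt
    else throw new IllegalArgumentException(s"Unsupported memory string: $s")
  }

  // conf and env are passed in as maps so the chain can be exercised without a real environment.
  def resolve(conf: Map[String, String], env: Map[String, String]): Int =
    conf.get("spark.executor.memory")              // 1. explicit configuration
      .orElse(env.get("SPARK_EXECUTOR_MEMORY"))    // 2. environment variable
      .orElse(env.get("SPARK_MEM"))                // 3. legacy environment variable
      .map(memoryStringToMb)
      .getOrElse(1024)                             // 4. default: 1024 MB
}
```

For example, resolve(Map("spark.executor.memory" -> "2g"), Map("SPARK_EXECUTOR_MEMORY" -> "512m")) returns 2048, because the configuration entry takes precedence over the environment variable.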

Now we can focus on how _schedulerBackend, _taskScheduler, and _dagScheduler are instantiated. The first two are created by SparkContext.createTaskScheduler(this, master, deployMode), while _dagScheduler is created by new DAGScheduler(this).

SchedulerBackend and TaskScheduler

First, step into createTaskScheduler:

/**
 * Create a task scheduler based on a given master URL.
 * Return a 2-tuple of the scheduler backend and the task scheduler.
 */
