How Spark's DAGScheduler Converts a Stage into a TaskSet

This article walks through how Spark's DAGScheduler converts a stage into a TaskSet, covering the key steps: finding the partitions that still need to be computed, computing the preferred locations for each partition, creating the tasks, and submitting them to the task scheduler.

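As a concrete trigger for everything that follows, here is a minimal driver program (a hypothetical example, not taken from the original article): the reduceByKey introduces a shuffle, so the job is split into a ShuffleMapStage and a ResultStage, and submitMissingTasks turns each of them into a TaskSet with one task per partition.

import org.apache.spark.{SparkConf, SparkContext}

object StageToTaskSetDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stage-to-taskset-demo").setMaster("local[2]"))
    // 4 input partitions -> the ShuffleMapStage will be turned into 4 ShuffleMapTasks
    val pairs = sc.parallelize(1 to 100, 4).map(i => (i % 10, 1))
    // 2 reduce partitions -> the ResultStage will be turned into 2 ResultTasks
    val counts = pairs.reduceByKey(_ + _, 2)
    // collect() is the action that makes the DAGScheduler build and submit both stages
    counts.collect().foreach(println)
    sc.stop()
  }
}

With INFO logging enabled you should see lines of the form "Submitting N missing tasks from ..." produced by the submitMissingTasks method analysed below.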
DAGScheduler: converting a stage into a TaskSet

  • Get the partitions that still need to be computed (a simplified model of findMissingPartitions follows the snippet)
val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
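findMissingPartitions is implemented differently by the two stage types: a ShuffleMapStage partition is missing until it has registered map output, while a ResultStage partition is missing until the active job has marked it finished. A self-contained, simplified model of the two variants (outputLocs and finished are stand-ins for the real Stage bookkeeping; details vary across Spark versions):

object FindMissingPartitionsSketch {
  // ShuffleMapStage-style: a partition is still missing while it has no registered map output.
  def shuffleMapMissing(outputLocs: Array[List[String]]): Seq[Int] =
    outputLocs.indices.filter(id => outputLocs(id).isEmpty)

  // ResultStage-style: a partition is still missing while the active job has not marked it finished.
  def resultMissing(finished: Array[Boolean]): Seq[Int] =
    finished.indices.filter(id => !finished(id))

  def main(args: Array[String]): Unit = {
    println(shuffleMapMissing(Array(List("host-a"), Nil, List("host-b"))))  // Vector(1)
    println(resultMissing(Array(true, false, false)))                       // Vector(1, 2)
  }
}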
  • Compute the preferred locations for each of those partitions (a small user-side example follows the snippet)
 val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
      stage match {
        case s: ShuffleMapStage =>
          partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
        case s: ResultStage =>
          val job = s.activeJob.get
          partitionsToCompute.map { id =>
            val p = s.partitions(id)
            (id, getPreferredLocs(stage.rdd, p))
          }.toMap
      }
    } catch {
      case NonFatal(e) =>
        stage.makeNewStageAttempt(partitionsToCompute.size)
        listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
        abortStage(stage, s"Task creation failed: $e\n${e.getStackTraceString}", Some(e))
        runningStages -= stage
        return
    }
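getPreferredLocs walks the RDD lineage for each partition: it prefers executors that already cache the partition, then the RDD's own preferredLocations (for example HDFS block hosts), and otherwise reuses the locations of a parent reachable through a narrow dependency. The RDD-level source is easy to observe from user code; a small self-contained example (the host names are made up) using makeRDD's location-preference overload:

import org.apache.spark.{SparkConf, SparkContext}

object PreferredLocationsDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("locs-demo").setMaster("local[2]"))
    // makeRDD's (value, locations) overload attaches explicit preferred hosts to each partition;
    // these are what getPreferredLocs surfaces as TaskLocation entries for this RDD.
    val rdd = sc.makeRDD(Seq(
      (1 to 10,  Seq("host-a")),   // partition 0 prefers host-a
      (11 to 20, Seq("host-b"))    // partition 1 prefers host-b
    ))
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index} -> ${rdd.preferredLocations(p)}")
    }
    sc.stop()
  }
}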
  • Convert the stage into tasks; the stage gets as many tasks as it has partitions to compute (a toy model of the two task types follows the snippet below)
  • A ShuffleMapStage is converted into ShuffleMapTasks
  • A ResultStage is converted into ResultTasks
 val tasks: Seq[Task[_]] = try {
      stage match {
        case stage: ShuffleMapStage =>
          stage.pendingPartitions.clear()
          partitionsToCompute.map { id =>
            val locs = taskIdToLocations(id)
            val part = stage.rdd.partitions(id)
            stage.pendingPartitions += id
            new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
              taskBinary, part, locs, stage.internalAccumulators)
          }

        case stage: ResultStage =>
          val job = stage.activeJob.get
          partitionsToCompute.map { id =>
            val p: Int = stage.partitions(id)
            val part = stage.rdd.partitions(p)
            val locs = taskIdToLocations(id)
            new ResultTask(stage.id, stage.latestInfo.attemptId,
              taskBinary, part, locs, id, stage.internalAccumulators)
          }
      }
    }
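The two task types differ mainly in what they produce when they run: a ShuffleMapTask writes its partition's shuffle output and reports back where that output lives (a MapStatus in real Spark), while a ResultTask applies the job's function to its partition and returns the value to the driver. A toy model of that division of labour (hypothetical names, not Spark's actual Task API):

object TaskTypesSketch {
  // Stand-in for Spark's MapStatus: where the shuffle output of one map partition ended up.
  case class MapOutputAt(executorHost: String)

  // ShuffleMapTask-like: materialise the partition, write its shuffle output, report the location.
  def runShuffleMapTask(partition: Seq[Int], host: String): MapOutputAt = {
    // ... in real Spark: partition/sort the records and write shuffle files on `host` ...
    MapOutputAt(host)
  }

  // ResultTask-like: apply the job's function to the partition and return the value to the driver.
  def runResultTask[T](partition: Seq[Int], func: Seq[Int] => T): T =
    func(partition)

  def main(args: Array[String]): Unit = {
    println(runShuffleMapTask(Seq(1, 2, 3), "host-a"))  // MapOutputAt(host-a)
    println(runResultTask(Seq(1, 2, 3), _.sum))         // 6
  }
}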
  • Wrap the tasks in a TaskSet and submit it to the task scheduler (a sketch of the scheduler-side hand-off follows the snippet)
   taskScheduler.submitTasks(new TaskSet(
        tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
      stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
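On the receiving end, the task scheduler (TaskSchedulerImpl in Spark) wraps the TaskSet in a TaskSetManager, adds it to the FIFO or FAIR scheduling pool, and asks the scheduler backend to offer resources so the tasks can be launched. A toy model of that hand-off (hypothetical class names; the real implementation is far richer):

object SubmitTasksSketch {
  // Minimal stand-ins for TaskSet and TaskSetManager.
  case class TaskSetLike(numTasks: Int, stageId: Int, attemptId: Int)
  class TaskSetManagerLike(val taskSet: TaskSetLike, val maxTaskFailures: Int)

  class TaskSchedulerLike(maxTaskFailures: Int = 4) {
    private var pool = List.empty[TaskSetManagerLike]  // stand-in for the scheduling Pool

    def submitTasks(taskSet: TaskSetLike): Unit = {
      println(s"Adding task set ${taskSet.stageId}.${taskSet.attemptId} with ${taskSet.numTasks} tasks")
      pool = new TaskSetManagerLike(taskSet, maxTaskFailures) :: pool
      // real Spark also asks the scheduler backend for resource offers at this point
    }

    def queuedTaskSets: Int = pool.size
  }

  def main(args: Array[String]): Unit = {
    val scheduler = new TaskSchedulerLike()
    scheduler.submitTasks(TaskSetLike(numTasks = 4, stageId = 0, attemptId = 0))
    println(scheduler.queuedTaskSets)  // 1
  }
}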

The full submitMissingTasks method

 /** Called when stage's parents are available and we can now do its task. */
  private def submitMissingTasks(stage: Stage, jobId: Int) {
    logDebug("submitMissingTasks(" + stage + ")")

    // First figure out the indexes of partition ids to compute.
    val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()

    // Create internal accumulators if the stage has no accumulators initialized.
    // Reset internal accumulators only if this stage is not partially submitted
    // Otherwise, we may override existing accumulator values from some tasks
    if (stage.internalAccumulators.isEmpty || stage.numPartitions == partitionsToCompute.size) {
      stage.resetInternalAccumulators()
    }

    // Use the scheduling pool, job group, description, etc. from an ActiveJob associated
    // with this Stage
    val properties = jobIdToActiveJob(jobId).properties

    runningStages += stage
    // SparkListenerStageSubmitted should be posted before testing whether tasks are
    // serializable. If tasks are not serializable, a SparkListenerStageCompleted event
    // will be posted, which should always come after a corresponding SparkListenerStageSubmitted
    // event.
    stage match {
      case s: ShuffleMapStage =>
        outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
      case s: ResultStage =>
        outputCommitCoordinator.stageStart(
          stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
    }
    val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
      stage match {
        case s: ShuffleMapStage =>
          partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
        case s: ResultStage =>
          val job = s.activeJob.get
          partitionsToCompute.map { id =>
            val p = s.partitions(id)
            (id, getPreferredLocs(stage.rdd, p))
          }.toMap
      }
    } catch {
      case NonFatal(e) =>
        stage.makeNewStageAttempt(partitionsToCompute.size)
        listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
        abortStage(stage, s"Task creation failed: $e\n${e.getStackTraceString}", Some(e))
        runningStages -= stage
        return
    }

    stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
    listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))

    // TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.
    // Broadcasted binary for the task, used to dispatch tasks to executors. Note that we broadcast
    // the serialized copy of the RDD and for each task we will deserialize it, which means each
    // task gets a different copy of the RDD. This provides stronger isolation between tasks that
    // might modify state of objects referenced in their closures. This is necessary in Hadoop
    // where the JobConf/Configuration object is not thread-safe.
    var taskBinary: Broadcast[Array[Byte]] = null
    try {
      // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
      // For ResultTask, serialize and broadcast (rdd, func).
      val taskBinaryBytes: Array[Byte] = stage match {
        case stage: ShuffleMapStage =>
          closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef).array()
        case stage: ResultStage =>
          closureSerializer.serialize((stage.rdd, stage.func): AnyRef).array()
      }

      taskBinary = sc.broadcast(taskBinaryBytes)
    } catch {
      // In the case of a failure during serialization, abort the stage.
      case e: NotSerializableException =>
        abortStage(stage, "Task not serializable: " + e.toString, Some(e))
        runningStages -= stage

        // Abort execution
        return
      case NonFatal(e) =>
        abortStage(stage, s"Task serialization failed: $e\n${e.getStackTraceString}", Some(e))
        runningStages -= stage
        return
    }

    val tasks: Seq[Task[_]] = try {
      stage match {
        case stage: ShuffleMapStage =>
          stage.pendingPartitions.clear()
          partitionsToCompute.map { id =>
            val locs = taskIdToLocations(id)
            val part = stage.rdd.partitions(id)
            stage.pendingPartitions += id
            new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
              taskBinary, part, locs, stage.internalAccumulators)
          }

        case stage: ResultStage =>
          val job = stage.activeJob.get
          partitionsToCompute.map { id =>
            val p: Int = stage.partitions(id)
            val part = stage.rdd.partitions(p)
            val locs = taskIdToLocations(id)
            new ResultTask(stage.id, stage.latestInfo.attemptId,
              taskBinary, part, locs, id, stage.internalAccumulators)
          }
      }
    } catch {
      case NonFatal(e) =>
        abortStage(stage, s"Task creation failed: $e\n${e.getStackTraceString}", Some(e))
        runningStages -= stage
        return
    }

    if (tasks.size > 0) {
      logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
        s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
      taskScheduler.submitTasks(new TaskSet(
        tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
      stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
    } else {
      // Because we posted SparkListenerStageSubmitted earlier, we should mark
      // the stage as completed here in case there are no tasks to run
      markStageAsFinished(stage, None)

      val debugString = stage match {
        case stage: ShuffleMapStage =>
          s"Stage ${stage} is actually done; " +
            s"(available: ${stage.isAvailable}," +
            s"available outputs: ${stage.numAvailableOutputs}," +
            s"partitions: ${stage.numPartitions})"
        case stage : ResultStage =>
          s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
      }
      logDebug(debugString)
    }
  }

