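When Spark runs in YARN cluster mode, the ApplicationMaster is started on one of the NodeManagers. Once all Executors have started and registered with the Driver, Spark begins running the user program. The Driver's work includes preparing the SparkContext and running the DAGScheduler, the TaskScheduler, and other components. The first step is the DAGScheduler's division of the job into stages.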
DAGScheduler
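DAGScheduler: the component that splits a Spark job into stages and manages them. It derives the stages from the directed acyclic graph (DAG) formed by the job's RDD dependencies.
The SparkContext that the Driver initializes holds the DAGScheduler as a field: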
@volatile private var _dagScheduler: DAGScheduler = _
The run method of the YARN Client is executed in the YarnClusterApplication class.
Stage division happens when a stage is submitted:
/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage): Unit = {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug(s"submitStage($stage (name=${stage.name};" +
      s"jobs=${stage.jobIds.toSeq.sorted.mkString(",")}))")
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      if (missing.isEmpty) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        submitMissingTasks(stage, jobId.get)
      } else {
        for (parent <- missing) {
          submitStage(parent)
        }
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}
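The recursion in submitStage can be illustrated with a small toy model. This is only a sketch under stated assumptions: ToyStage, SubmitOrderSketch, and the finished set are hypothetical names invented for this example and are not part of Spark. The sketch mimics one idea only: a stage with no missing parents has its tasks launched right away, while a stage with missing parents first submits those parents and is then parked as waiting.

```scala
import scala.collection.mutable

// Toy model only: ToyStage and SubmitOrderSketch are hypothetical names, not Spark classes.
object SubmitOrderSketch {
  final case class ToyStage(id: Int, parents: List[ToyStage])

  /** Returns (ids of stages launched immediately, ids of stages parked as waiting). */
  def submit(finalStage: ToyStage, finished: Set[Int] = Set.empty): (List[Int], List[Int]) = {
    val launched = mutable.ListBuffer.empty[Int]
    val waiting = mutable.ListBuffer.empty[Int]

    def submitStage(stage: ToyStage): Unit = {
      // Skip stages that were already launched or parked, similar to the guard in submitStage.
      if (!launched.contains(stage.id) && !waiting.contains(stage.id)) {
        // A parent counts as "missing" when its output is not available yet.
        val missing = stage.parents.filterNot(p => finished(p.id)).sortBy(_.id)
        if (missing.isEmpty) {
          launched += stage.id            // no missing parents: tasks can run now
        } else {
          missing.foreach(submitStage)    // submit the missing parents first
          waiting += stage.id             // then park this stage
        }
      }
    }

    submitStage(finalStage)
    (launched.toList, waiting.toList)
  }
}
```

For a chain of three stages 0 <- 1 <- 2, SubmitOrderSketch.submit on stage 2 returns (List(0), List(1, 2)): only stage 0 starts immediately, and stages 1 and 2 wait for their parents.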
The code of getMissingParentStages:
private def getMissingParentStages(stage: Stage): List[Stage] = {
  val missing = new HashSet[Stage]
  val visited = new HashSet[RDD[_]]
  // We are manually maintaining a stack here to prevent StackOverflowError
  // caused by recursively visiting
  val waitingForVisit = new ListBuffer[RDD[_]]
  waitingForVisit += stage.rdd
  def visit(rdd: RDD[_]): Unit = {
    if (!visited(rdd)) {
      visited += rdd
      val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
      if (rddHasUncachedPartitions) {
        for (dep <- rdd.dependencies) {
          dep match {
            case shufDep: ShuffleDependency[_, _, _] =>
              val mapStage = getOrCreateShuffleMapStage(shufDep, stage.firstJobId)
              if (!mapStage.isAvailable) {
                missing += mapStage
              }
            case narrowDep: NarrowDependency[_] =>
              waitingForVisit.prepend(narrowDep.rdd)
          }
        }
      }
    }
  }
  while (waitingForVisit.nonEmpty) {
    visit(waitingForVisit.remove(0))
  }
  missing.toList
}
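This code walks the RDD dependency chain backwards, from the final RDD towards its ancestors. A narrow dependency keeps the parent RDD in the current stage, and the walk continues through it. Each shuffle dependency marks a stage boundary: the RDDs on its parent side belong to an earlier stage. A job therefore ends up with n+1 stages, where n is the number of shuffle dependencies.
Among these stages, the last one is the ResultStage, and all the stages before it are ShuffleMapStages.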
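The n+1 rule can be checked on a toy lineage. The following is a minimal sketch, not Spark code: Node, Narrow, Shuffle, and countStages are hypothetical names made up for this example. It walks the lineage backwards with an explicit work list (as getMissingParentStages does to avoid deep recursion), collects every distinct shuffle dependency, and adds one for the final ResultStage.

```scala
import scala.collection.mutable

// Toy model only: these types are hypothetical stand-ins for RDDs and their dependencies.
object StageCountSketch {
  sealed trait Dep { def parent: Node }
  final case class Narrow(parent: Node) extends Dep
  // shuffleId keeps two shuffles of the same parent distinct.
  final case class Shuffle(shuffleId: Int, parent: Node) extends Dep
  final case class Node(name: String, deps: List[Dep])

  /** Number of stages = number of distinct shuffle dependencies + 1 (the result stage). */
  def countStages(last: Node): Int = {
    val visited = mutable.Set.empty[Node]
    val shuffles = mutable.Set.empty[Shuffle]
    var toVisit: List[Node] = List(last)   // explicit work list instead of recursion
    while (toVisit.nonEmpty) {
      val node = toVisit.head
      toVisit = toVisit.tail
      if (visited.add(node)) {             // add returns false if the node was already seen
        node.deps.foreach {
          case s: Shuffle =>
            shuffles += s                  // stage boundary
            toVisit = s.parent :: toVisit
          case Narrow(p) =>
            toVisit = p :: toVisit         // same stage, keep walking
        }
      }
    }
    shuffles.size + 1
  }
}
```

For a lineage shaped like textFile -> map -> reduceByKey -> sortByKey, where the last two steps each introduce one Shuffle, countStages returns 3: two ShuffleMapStages followed by one ResultStage.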