Spark MapOutputTracker

Latest recommended article published on 2022-08-27 17:52:01


License: CC 4.0 BY-SA


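This article takes a close look at the role of MapOutputTracker in Spark: the different versions used on the driver and on executors, and how MapOutputTrackerMaster and MapOutputTrackerWorker track and cache map outputs to support task scheduling and the shuffle process.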

MapOutputTracker diagram

[Figure: MapOutputTracker diagram (image not available)]

Overview

MapOutputTracker

A class that keeps track of the locations of a stage's map outputs. It is abstract because the driver and the executors use different versions of MapOutputTracker.

MapOutputTrackerMaster

The driver-side class that tracks the locations of a stage's map outputs.

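The DAGScheduler uses this class to register and unregister map output statuses, and to look up statistics for locality-aware scheduling of reduce tasks.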

ShuffleMapStage uses this class to track available and missing outputs in order to decide which tasks need to be run.

MapOutputTrackerWorker

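The executor-side client, which fetches map output information from the driver's MapOutputTrackerMaster. Note that it is not used in local mode, where executors can access the MapOutputTrackerMaster directly.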

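In a MapReduce-style job, when a shuffle occurs, the reduce side depends on the results of the preceding map tasks. These classes record and cache, for a given shuffle, the result of each map partition, so that the reduce side can find it.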
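The division of labour between the two sides can be illustrated with a minimal sketch. It does not use Spark's classes: `ToyMapStatus`, `ToyTrackerMaster`, `ToyTrackerWorker`, and `fetchCount` are hypothetical names invented for this example, and the direct method call from worker to master stands in for the RPC that a real cluster would use.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-in for Spark's MapStatus: it only records
// the host that holds the output of one map task.
final case class ToyMapStatus(host: String)

// Driver-side toy tracker: shuffleId -> one slot per map task (None = not available yet).
class ToyTrackerMaster {
  private val statuses = mutable.Map.empty[Int, Array[Option[ToyMapStatus]]]

  def registerShuffle(shuffleId: Int, numMaps: Int): Unit =
    statuses(shuffleId) = Array.fill[Option[ToyMapStatus]](numMaps)(None)

  def registerMapOutput(shuffleId: Int, mapId: Int, status: ToyMapStatus): Unit =
    statuses(shuffleId)(mapId) = Some(status)

  // In this sketch the worker calls this directly; assume an RPC in a real cluster.
  def getStatuses(shuffleId: Int): Vector[Option[ToyMapStatus]] =
    statuses(shuffleId).toVector
}

// Executor-side toy tracker: asks the master once per shuffle, then serves from its cache.
class ToyTrackerWorker(master: ToyTrackerMaster) {
  private val cache = mutable.Map.empty[Int, Vector[Option[ToyMapStatus]]]
  var fetchCount = 0 // number of round trips to the master, for illustration only

  def getStatuses(shuffleId: Int): Vector[Option[ToyMapStatus]] =
    cache.getOrElseUpdate(shuffleId, {
      fetchCount += 1
      master.getStatuses(shuffleId)
    })
}

object ToyTrackerDemo {
  def main(args: Array[String]): Unit = {
    val master = new ToyTrackerMaster
    master.registerShuffle(shuffleId = 0, numMaps = 2)
    master.registerMapOutput(0, 0, ToyMapStatus("host-a"))
    master.registerMapOutput(0, 1, ToyMapStatus("host-b"))

    val worker = new ToyTrackerWorker(master)
    // The reduce side learns which hosts to fetch from.
    println(worker.getStatuses(0).flatten.map(_.host))
    // A second lookup is answered from the worker's cache.
    worker.getStatuses(0)
    println(worker.fetchCount)
  }
}
```

The sketch never invalidates the worker's cache. In this simplified model, a worker holding locations for a host that has since been lost would keep serving them, which is the problem the epoch counter shown later in the article is meant to address.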

Methods

// Called from DAGScheduler.createShuffleMapStage.
// When the DAGScheduler splits a job into stages, it registers the stage's shuffle with MapOutputTrackerMaster.
 def registerShuffle(shuffleId: Int, numMaps: Int) {
   if (shuffleStatuses.put(shuffleId, new ShuffleStatus(numMaps)).isDefined) {
     throw new IllegalArgumentException("Shuffle ID " + shuffleId + " registered twice")
   }
 }
// Called from DAGScheduler.handleTaskCompletion.
// When the DAGScheduler sees that a task has completed, it calls registerMapOutput to record the status of that map partition's output.
 def registerMapOutput(shuffleId: Int, mapId: Int, status: MapStatus) {
   shuffleStatuses(shuffleId).addMapOutput(mapId, status)
 }


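// Methods whose names start with remove / unregister:
// when an executor dies or a host goes down, the corresponding outputs must be discarded.
// These methods also call incrementEpoch().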
 /**
  * Removes all shuffle outputs associated with this host. Note that this will also remove
  * outputs which are served by an external shuffle server (if one exists).
  */
 def removeOutputsOnHost(host: String): Unit = {
   shuffleStatuses.valuesIterator.foreach { _.removeOutputsOnHost(host) }
   incrementEpoch()
 }
 
// epoch is a version number for the tracked data: whenever the data changes, it should be incremented by 1.
 def incrementEpoch() {
   epochLock.synchronized {
     epoch += 1
     logDebug("Increasing epoch to " + epoch)
   }
 }
 /**
  * Called from ShuffleMapStage.
  * Returns the sequence of partition ids that are missing (i.e. needs to be computed), or None
  * if the MapOutputTrackerMaster doesn't know about this shuffle.
  */
 def findMissingPartitions(shuffleId: Int): Option[Seq[Int]] = {
   shuffleStatuses.get(shuffleId).map(_.findMissingPartitions())
 }
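To see how the methods above fit together, the following is a minimal, self-contained sketch of the same bookkeeping. It is not Spark's implementation: `SketchStatus`, `SketchTracker`, and `currentEpoch` are hypothetical names for this example, locking is omitted, and the real `ShuffleStatus` class holds more state than a plain array of slots.

```scala
import scala.collection.mutable

// Hypothetical stand-in for MapStatus; only the host is modelled here.
final case class SketchStatus(host: String)

// Toy version of the driver-side bookkeeping (assumption: single-threaded use).
class SketchTracker {
  private val shuffles = mutable.Map.empty[Int, Array[Option[SketchStatus]]]
  private var epoch = 0L

  def currentEpoch: Long = epoch

  // Mirrors the "registered twice" check; require throws IllegalArgumentException.
  def registerShuffle(shuffleId: Int, numMaps: Int): Unit = {
    require(!shuffles.contains(shuffleId), s"Shuffle ID $shuffleId registered twice")
    shuffles(shuffleId) = Array.fill[Option[SketchStatus]](numMaps)(None)
  }

  def registerMapOutput(shuffleId: Int, mapId: Int, status: SketchStatus): Unit =
    shuffles(shuffleId)(mapId) = Some(status)

  // Drop every output stored on the given host, then bump the epoch so that
  // cached copies of the old locations can be recognised as stale.
  def removeOutputsOnHost(host: String): Unit = {
    shuffles.values.foreach { slots =>
      for (i <- slots.indices if slots(i).exists(_.host == host)) slots(i) = None
    }
    epoch += 1
  }

  // Map partitions with no registered output; None if the shuffle is unknown.
  def findMissingPartitions(shuffleId: Int): Option[Seq[Int]] =
    shuffles.get(shuffleId).map(slots => slots.indices.filter(i => slots(i).isEmpty))
}

object SketchTrackerDemo {
  def main(args: Array[String]): Unit = {
    val tracker = new SketchTracker
    tracker.registerShuffle(shuffleId = 7, numMaps = 3)
    tracker.registerMapOutput(7, 0, SketchStatus("host-a"))
    tracker.registerMapOutput(7, 1, SketchStatus("host-b"))
    println(tracker.findMissingPartitions(7)) // only map 2 has not reported yet

    tracker.removeOutputsOnHost("host-a")     // simulate losing host-a
    println(tracker.findMissingPartitions(7)) // map 0 is missing again as well
    println(tracker.currentEpoch)
  }
}
```

In this model, a stage that asks for its missing partitions after a host failure gets back exactly the map tasks whose outputs were lost plus those that never ran, which matches the decision the article attributes to ShuffleMapStage.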