Spark Source Code Reading Notes, RDD (5): How checkpoint works in RDDs

This post takes a close look at the purpose and implementation of Spark's checkpoint mechanism, analyzes the differences between checkpoint and persist, explains how checkpoint leverages HDFS to improve data reliability, and walks through the concrete checkpoint flow in the source code.


----------------------------Contents----------------------------

Why do we need checkpoint?

What checkpoint does

Source code analysis

------------------------------------------------------------


Why do we need checkpoint?

As we all know, both checkpoint and persist "save" data. persist can store data on disk, in memory, or in serialized form; the off-heap option (Tachyon, now Alluxio) was not yet functional at the time. The storage level class looks like this:

class StorageLevel private(
    private var _useDisk: Boolean,      // store on disk
    private var _useMemory: Boolean,    // store in memory
    private var _useOffHeap: Boolean,   // store off-heap (Tachyon, now Alluxio)
    private var _deserialized: Boolean, // keep the data as deserialized objects
    private var _replication: Int = 1)  // number of replicas
  extends Externalizable { /* body omitted */ }
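In practice these flags are consumed through the predefined constants in the StorageLevel companion object rather than by calling this constructor directly. A minimal usage sketch, assuming an existing SparkContext sc and an illustrative input path:

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs:///data/input")   // illustrative input

// Each predefined constant is just a combination of the flags above, for example:
//   MEMORY_ONLY         -> _useMemory = true, _deserialized = true
//   MEMORY_AND_DISK_SER -> _useDisk = true, _useMemory = true, serialized form
//   DISK_ONLY_2         -> _useDisk = true, _replication = 2
// An RDD can only be assigned one storage level, so pick a single one:
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
rdd.count()   // the first action materializes and caches the data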

(1) When persist puts data in memory: when processing large volumes of data, persisting new partitions may evict ones that were cached earlier. So persisting to memory is very fast, but it is arguably the least reliable form of storage.

(2) When data is persisted to disk, it is written to the local storage directories, and since the default storage level keeps only a single replica, a failure of the disk holding that copy means the data is lost.

(3) checkpoint exists to solve the problem in (2): it writes the data to HDFS, relying on HDFS's fault tolerance and reliability to achieve more dependable persistence.

Note: in cluster mode, HDFS typically stores each block as 3 replicas on three nodes (a, b, c), with a on one rack and b, c on another rack (one block is replicated three times across three nodes on two racks).


What checkpoint does

When checkpoint marks the current RDD for checkpointing, a binary file will be created for it and stored under the checkpoint directory, which is configured with SparkContext.setCheckpointDir(). During checkpointing, all information about this RDD's dependencies on its parent RDDs is removed. Calling checkpoint on an RDD does not execute anything right away; the work is only triggered once an action runs. When the checkpointed data is needed later, it is read back from the file path via ReliableCheckpointRDD's readCheckpointFile method and then used.


Source code analysis

setCheckpointDir configures the storage location for checkpoints; when running on a cluster, this path must be an HDFS (non-local) path.

/**
   * Set the directory under which RDDs are going to be checkpointed.
   * If running on a cluster, this directory must be an HDFS path.
   */
  def setCheckpointDir(directory: String) {
	// If we are running on a cluster and the directory is on the local filesystem, warn:
	// the driver would try to reconstruct the checkpointed RDD from its own local directory,
	// while the checkpoint files are actually written on the executors.
    if (!isLocal && Utils.nonLocalPaths(directory).isEmpty) {
      logWarning("Spark is not running in local mode, therefore the checkpoint directory " +
        s"must not be on the local filesystem. Directory '$directory' " +
        "appears to be on the local filesystem.")
    }

    checkpointDir = Option(directory).map { dir =>
      val path = new Path(dir, UUID.randomUUID().toString) // append a random UUID to the directory
      val fs = path.getFileSystem(hadoopConfiguration)     // get the Hadoop FileSystem for this path
      fs.mkdirs(path)                                      // create the directory on HDFS
      fs.getFileStatus(path).getPath.toString              // return the fully qualified path as a String
    }
  }
The checkpoint function defined on RDD: it first checks whether a checkpoint directory has been set in the SparkContext (throwing an exception if not), and then, if this RDD's checkpointData has not been created yet, it creates a ReliableRDDCheckpointData for it.

  def checkpoint(): Unit = RDDCheckpointData.synchronized {
    // NOTE: we use a global lock here due to complexities downstream with ensuring
    // children RDD partitions point to the correct parent partitions. In the future
    // we should revisit this consideration.
    if (context.checkpointDir.isEmpty) {
      throw new SparkException("Checkpoint directory has not been set in the SparkContext")
    } else if (checkpointData.isEmpty) {
      checkpointData = Some(new ReliableRDDCheckpointData(this))
    }
  }
 We can see that the checkpoint state is handled through ReliableRDDCheckpointData(this); now let's look at the ReliableRDDCheckpointData class.

 /**
 * An implementation of checkpointing that writes the RDD data to reliable storage.
 * This allows drivers to be restarted on failure with previously computed state.
 */
private[spark] class ReliableRDDCheckpointData[T: ClassTag](@transient private val rdd: RDD[T])
  extends RDDCheckpointData[T](rdd) with Logging {

  // The directory to which the associated RDD has been checkpointed to
  // This is assumed to be a non-local path that points to some reliable storage
  // Resolve the checkpoint directory for this RDD and keep it as the string cpDir
  private val cpDir: String =
    ReliableRDDCheckpointData.checkpointPath(rdd.context, rdd.id)
      .map(_.toString)
      .getOrElse { throw new SparkException("Checkpoint dir must be specified.") }

  /**
   * Return the directory to which this RDD was checkpointed.
   * If the RDD is not checkpointed yet, return None.
   */
   // If this RDD has already been checkpointed, return its checkpoint directory; otherwise return None
  def getCheckpointDir: Option[String] = RDDCheckpointData.synchronized {
    if (isCheckpointed) {
      Some(cpDir.toString)
    } else {
      None
    }
  }

  /**
   * Materialize this RDD and write its content to a reliable DFS.
   * This is called immediately after the first action invoked on this RDD has completed.
   * 
   */
  protected override def doCheckpoint(): CheckpointRDD[T] = {
    val newRDD = ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir)

    // Optionally clean our checkpoint files if the reference is out of scope
    if (rdd.conf.getBoolean("spark.cleaner.referenceTracking.cleanCheckpoints", false)) {
      rdd.context.cleaner.foreach { cleaner =>
        cleaner.registerRDDCheckpointDataForCleanup(newRDD, rdd.id)
      }
    }

    logInfo(s"Done checkpointing RDD ${rdd.id} to $cpDir, new parent is RDD ${newRDD.id}")
    newRDD
  }

}
Its companion object defines two functions, checkpointPath and cleanCheckpoint:

private[spark] object ReliableRDDCheckpointData extends Logging {

  /** Return the path of the directory to which this RDD's checkpoint data is written. */
  def checkpointPath(sc: SparkContext, rddId: Int): Option[Path] = {
    sc.checkpointDir.map { dir => new Path(dir, s"rdd-$rddId") }
  }

  /** Clean up the files associated with the checkpoint data for this RDD. */
  def cleanCheckpoint(sc: SparkContext, rddId: Int): Unit = {
    checkpointPath(sc, rddId).foreach { path =>
      val fs = path.getFileSystem(sc.hadoopConfiguration)
      if (fs.exists(path)) {
        if (!fs.delete(path, true)) {
          logWarning(s"Error deleting ${path.toString()}")
        }
      }
    }
  }
}

Now let's focus on the doCheckpoint function: it produces a new RDD via ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir).

 /**
   * Write RDD to checkpoint files and return a ReliableCheckpointRDD representing the RDD.
   */
  def writeRDDToCheckpointDirectory[T: ClassTag](
      originalRDD: RDD[T],
      checkpointDir: String,
      blockSize: Int = -1): ReliableCheckpointRDD[T] = {

    val sc = originalRDD.sparkContext

    // Create the output path for the checkpoint
	// checkpointDir is the directory we configured for the checkpoint output
    val checkpointDirPath = new Path(checkpointDir)
    val fs = checkpointDirPath.getFileSystem(sc.hadoopConfiguration)
    if (!fs.mkdirs(checkpointDirPath)) {
      throw new SparkException(s"Failed to create checkpoint path $checkpointDirPath")
    }

    // Save to file, and reload it as an RDD
	// i.e. write the partitions out to files, then load them back as a new RDD
    val broadcastedConf = sc.broadcast(
      new SerializableConfiguration(sc.hadoopConfiguration))
    // TODO: This is expensive because it computes the RDD again unnecessarily (SPARK-8582)
	// Note: this is expensive because the RDD being checkpointed is computed again here
    sc.runJob(originalRDD,
      writePartitionToCheckpointFile[T](checkpointDirPath.toString, broadcastedConf) _)

    if (originalRDD.partitioner.nonEmpty) {
      writePartitionerToCheckpointDir(sc, originalRDD.partitioner.get, checkpointDirPath)
    }

    val newRDD = new ReliableCheckpointRDD[T](
      sc, checkpointDirPath.toString, originalRDD.partitioner)
    if (newRDD.partitions.length != originalRDD.partitions.length) {
      throw new SparkException(
        s"Checkpoint RDD $newRDD(${newRDD.partitions.length}) has different " +
          s"number of partitions from original RDD $originalRDD(${originalRDD.partitions.length})")
    }
    newRDD
  }
Analysis: runJob recomputes the RDD we are checkpointing, i.e. the RDD's lineage is evaluated one more time. If we persist the RDD before checkpointing it, we avoid that recomputation and get much better performance, as the sketch below shows.
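A minimal sketch of this persist-then-checkpoint pattern, assuming an existing SparkContext sc and illustrative HDFS paths:

import org.apache.spark.storage.StorageLevel

sc.setCheckpointDir("hdfs:///spark/checkpoints")   // illustrative path

val expensive = sc.textFile("hdfs:///data/raw")    // illustrative input
  .map(_.split(","))                               // stand-in for a costly transformation
  .filter(_.length > 1)

expensive.persist(StorageLevel.MEMORY_AND_DISK)    // keep the computed blocks around
expensive.checkpoint()

// The first action computes and caches the blocks; the checkpoint job that runs
// right after it then reads the cached blocks instead of recomputing the lineage.
expensive.count()

With that in mind, let's look at the writePartitionToCheckpointFile function that runJob invokes for each partition.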

   /**
   * Write a RDD partition's data to a checkpoint file.
   */
  def writePartitionToCheckpointFile[T: ClassTag](
      path: String,
      broadcastedConf: Broadcast[SerializableConfiguration],
      blockSize: Int = -1)(ctx: TaskContext, iterator: Iterator[T]) {
    val env = SparkEnv.get
    val outputDir = new Path(path)
    val fs = outputDir.getFileSystem(broadcastedConf.value.value)

    val finalOutputName = ReliableCheckpointRDD.checkpointFileName(ctx.partitionId())
    val finalOutputPath = new Path(outputDir, finalOutputName)
    val tempOutputPath =
      new Path(outputDir, s".$finalOutputName-attempt-${ctx.attemptNumber()}")

    if (fs.exists(tempOutputPath)) {
      throw new IOException(s"Checkpoint failed: temporary path $tempOutputPath already exists")
    }
    val bufferSize = env.conf.getInt("spark.buffer.size", 65536)

    val fileOutputStream = if (blockSize < 0) {
      fs.create(tempOutputPath, false, bufferSize)
    } else {
      // This is mainly for testing purpose
      fs.create(tempOutputPath, false, bufferSize,
        fs.getDefaultReplication(fs.getWorkingDirectory), blockSize)
    }
    val serializer = env.serializer.newInstance()
    val serializeStream = serializer.serializeStream(fileOutputStream)
    Utils.tryWithSafeFinally {
      serializeStream.writeAll(iterator)
    } {
      serializeStream.close()
    }

    if (!fs.rename(tempOutputPath, finalOutputPath)) {
      if (!fs.exists(finalOutputPath)) {
        logInfo(s"Deleting tempOutputPath $tempOutputPath")
        fs.delete(tempOutputPath, false)
        throw new IOException("Checkpoint failed: failed to save output of task: " +
          s"${ctx.attemptNumber()} and final output path does not exist: $finalOutputPath")
      } else {
        // Some other copy of this task must've finished before us and renamed it
        logInfo(s"Final output path $finalOutputPath already exists; not overwriting it")
        if (!fs.delete(tempOutputPath, false)) {
          logWarning(s"Error deleting ${tempOutputPath}")
        }
      }
    }
  }
  /**
   * Write a partitioner to the given RDD checkpoint directory. This is done on a best-effort
   * basis; any exception while writing the partitioner is caught, logged and ignored.
   */
  private def writePartitionerToCheckpointDir(
    sc: SparkContext, partitioner: Partitioner, checkpointDirPath: Path): Unit = {
    try {
      val partitionerFilePath = new Path(checkpointDirPath, checkpointPartitionerFileName)
      val bufferSize = sc.conf.getInt("spark.buffer.size", 65536)
      val fs = partitionerFilePath.getFileSystem(sc.hadoopConfiguration)
      val fileOutputStream = fs.create(partitionerFilePath, false, bufferSize)
      val serializer = SparkEnv.get.serializer.newInstance()
      val serializeStream = serializer.serializeStream(fileOutputStream)
      Utils.tryWithSafeFinally {
        serializeStream.writeObject(partitioner)
      } {
        serializeStream.close()
      }
      logDebug(s"Written partitioner to $partitionerFilePath")
    } catch {
      case NonFatal(e) =>
        logWarning(s"Error writing partitioner $partitioner to $checkpointDirPath")
    }
  }
writePartitionToCheckpointFile runs on the executors for each partition: it grabs the SparkEnv, serializes the partition's data with the configured serializer, writes it to a temporary file, and then renames the temporary file to the final per-partition checkpoint file (handling the case where another attempt of the same task has already produced it).
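To tie the pieces together, the on-disk layout produced by setCheckpointDir, checkpointPath and writePartitionToCheckpointFile looks roughly like the sketch below (the part-NNNNN and _partitioner file names follow ReliableCheckpointRDD's naming helpers; verify the exact names against your Spark version):

// <directory passed to setCheckpointDir>/
//   <random UUID>/              // appended by SparkContext.setCheckpointDir
//     rdd-<id>/                 // ReliableRDDCheckpointData.checkpointPath
//       part-00000              // one file per partition, written by
//       part-00001              //   writePartitionToCheckpointFile
//       _partitioner            // optional, written by writePartitionerToCheckpointDir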

Finally, let's summarize what the following call does:

val newRDD = ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir)


1. A new ReliableCheckpointRDD (newRDD) is created that represents the same data as originalRDD, with the same number of partitions and the same partitioner.

2. The RDD's data is written out to HDFS, under the directory we configured with setCheckpointDir(directory: String).

3. The hadoopConfiguration needed to write the files is broadcast to the executors as a SerializableConfiguration.

4. runJob computes the RDD once more, which is why caching it beforehand (as in the sketch above) is a worthwhile optimization.

Now let's look at the last step of doCheckpoint.

if (rdd.conf.getBoolean("spark.cleaner.referenceTracking.cleanCheckpoints", false)) {
      rdd.context.cleaner.foreach { cleaner =>
        cleaner.registerRDDCheckpointDataForCleanup(newRDD, rdd.id)
      }
    }


Note that this block is not what clears the lineage: when spark.cleaner.referenceTracking.cleanCheckpoints is enabled, it registers the checkpoint files with the ContextCleaner so they can be deleted once the RDD is no longer referenced. The lineage truncation itself happens as part of the checkpointing process: after doCheckpoint completes, the original RDD's dependencies are cleared, so newRDD effectively becomes the root of the lineage (tracing the dependencies back can go no further than here).
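A small sketch of how this looks from the user side, assuming an existing SparkContext sc: the cleanup behaviour is opt-in through the configuration key shown above, and the lineage truncation can be observed with the RDD's public inspection methods (paths are illustrative):

// Typically enabled when building the SparkConf (or via --conf on spark-submit):
//   spark.cleaner.referenceTracking.cleanCheckpoints=true
// so the checkpoint files are deleted once the RDD is no longer referenced.

sc.setCheckpointDir("hdfs:///spark/checkpoints")   // illustrative path

val rdd = sc.parallelize(1 to 1000).map(_ * 2)
rdd.checkpoint()
rdd.count()                      // triggers the checkpoint job

println(rdd.isCheckpointed)      // true
println(rdd.getCheckpointFile)   // Some(hdfs://.../rdd-<id>)
println(rdd.toDebugString)       // the lineage now starts at a ReliableCheckpointRDD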

How the checkpointed RDD is read back will be covered in a later post, but it is certainly done through ReliableCheckpointRDD's readCheckpointFile method, which reads the already-checkpointed data from the file path.








