----------------------------Contents----------------------------
Why do we need checkpoint?
What checkpoint does
Source code analysis
------------------------------------------------------------
Why do we need checkpoint?
As we all know, both checkpoint and persist "save" data. persist can keep data on disk, in memory, or in serialized form; the off-heap (Tachyon) option was not yet functional at the time. The storage-level source is as follows:
class StorageLevel private(
    private var _useDisk: Boolean,      // keep data on disk
    private var _useMemory: Boolean,    // keep data in memory
    private var _useOffHeap: Boolean,   // off-heap, i.e. Tachyon (now Alluxio)
    private var _deserialized: Boolean, // keep data in deserialized (object) form
    private var _replication: Int = 1)  // number of replicas
  extends Externalizable { /* ... */ }
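To make these flags concrete, here is a minimal sketch of how a user picks one of the predefined levels; the flag mapping in the comments reflects my reading of the Spark 1.x source, so treat it as an assumption and check your version:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Sketch: choosing a predefined storage level. Flag mapping below is
// (useDisk, useMemory, useOffHeap, deserialized), as I read the Spark 1.x source:
//   StorageLevel.DISK_ONLY       -> (true,  false, false, false)
//   StorageLevel.MEMORY_ONLY     -> (false, true,  false, true)
//   StorageLevel.MEMORY_AND_DISK -> (true,  true,  false, true)
object StorageLevelExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("storage-level-example").setMaster("local[*]")) // local master just for illustration
    val rdd = sc.parallelize(1 to 1000)
    rdd.persist(StorageLevel.MEMORY_AND_DISK) // spill partitions to disk when memory runs short
    rdd.count()                               // first action computes and caches the data
    sc.stop()
  }
}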
(1) When persist keeps the data in memory: since we are usually processing large amounts of data, persisting new data may evict previously cached data. So persisting to memory is very fast, but it is arguably the least reliable form of storage.
(2) When the data is kept on disk, it is written to the executors' local storage directories, and since the default storage level keeps only a single copy (replication = 1), the data is lost if the disk that holds it fails.
(3) checkpoint exists to solve the problem in (2): the data is written to HDFS, and HDFS's high fault tolerance and reliability give us much more dependable persistence.
Note: in cluster mode, HDFS normally keeps 3 replicas of a block on three nodes (a, b, c), with a on one rack and b, c on another (one block is copied three times across three nodes and two racks).
What checkpoint does
When checkpoint sets a checkpoint for the current RDD, binary files are created and stored in the checkpoint directory, which is configured with SparkContext.setCheckpointDir(). During checkpointing, all of the RDD's dependencies on its parent RDDs are removed. Calling checkpoint on an RDD does not execute anything by itself; an action must run to trigger it. When the checkpointed data is needed later, it is read back from the file path by ReliableCheckpointRDD's readCheckpointFile method and then used.
Source code analysis
First, the storage location for checkpoints: on a cluster this path must be on HDFS.
/**
 * Set the directory under which RDDs are going to be checkpointed.
 * On a cluster this directory must be an HDFS path.
 */
def setCheckpointDir(directory: String) {
  // If we are running on a cluster but the directory is local, warn the user:
  // the driver may attempt to reconstruct the checkpointed RDD from this directory,
  // while the files were actually written on the executors' local filesystems.
  if (!isLocal && Utils.nonLocalPaths(directory).isEmpty) {
    logWarning("Spark is not running in local mode, therefore the checkpoint directory " +
      s"must not be on the local filesystem. Directory '$directory' " +
      "appears to be on the local filesystem.")
  }
  checkpointDir = Option(directory).map { dir =>
    val path = new Path(dir, UUID.randomUUID().toString) // append a random UUID to the path
    val fs = path.getFileSystem(hadoopConfiguration)     // get the Hadoop FileSystem for this path
    fs.mkdirs(path)                                      // create the directory on that filesystem
    fs.getFileStatus(path).getPath.toString              // return the fully qualified path as a String
  }
}
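A concrete illustration of the UUID behaviour (a spark-shell sketch, where sc is the predefined SparkContext; the path is illustrative):

// spark-shell sketch
sc.setCheckpointDir("hdfs:///user/spark/checkpoints")   // illustrative path
val rdd = sc.parallelize(1 to 100)
rdd.checkpoint()
rdd.count()                    // the action triggers the checkpoint
// Because setCheckpointDir appended a random UUID, the RDD ends up under something like
//   hdfs://<namenode>/user/spark/checkpoints/<random-uuid>/rdd-<id>
println(rdd.getCheckpointFile) // Option[String] with the directory the RDD was written to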
The checkpoint function on RDD: it first checks that a checkpoint directory has been set on the SparkContext (throwing an exception if not), and then, if this RDD has no checkpoint data yet, creates a ReliableRDDCheckpointData for it:
def checkpoint(): Unit = RDDCheckpointData.synchronized {
  // NOTE: we use a global lock here due to complexities downstream with ensuring
  // children RDD partitions point to the correct parent partitions. In the future
  // we should revisit this consideration.
  if (context.checkpointDir.isEmpty) {
    throw new SparkException("Checkpoint directory has not been set in the SparkContext")
  } else if (checkpointData.isEmpty) {
    checkpointData = Some(new ReliableRDDCheckpointData(this))
  }
}
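The order therefore matters: calling checkpoint() before the directory is set fails right away with the SparkException above (spark-shell sketch, illustrative path):

// spark-shell sketch
val rdd = sc.parallelize(1 to 10)
rdd.checkpoint()
// => org.apache.spark.SparkException:
//    Checkpoint directory has not been set in the SparkContext

sc.setCheckpointDir("hdfs:///user/spark/checkpoints")   // illustrative path
rdd.checkpoint()   // now only creates the ReliableRDDCheckpointData; still lazy until an action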
We can see that the data is stored through ReliableRDDCheckpointData(this), so let's look at the ReliableRDDCheckpointData class:
/**
 * An implementation of checkpointing that writes the RDD data to reliable storage.
 * This allows drivers to be restarted on failure with previously computed state.
 */
private[spark] class ReliableRDDCheckpointData[T: ClassTag](@transient private val rdd: RDD[T])
  extends RDDCheckpointData[T](rdd) with Logging {

  // The directory to which the associated RDD has been checkpointed to
  // This is assumed to be a non-local path that points to some reliable storage
  // cpDir is simply the String path of the checkpoint directory for this RDD
  private val cpDir: String =
    ReliableRDDCheckpointData.checkpointPath(rdd.context, rdd.id)
      .map(_.toString)
      .getOrElse { throw new SparkException("Checkpoint dir must be specified.") }

  /**
   * Return the directory to which this RDD was checkpointed.
   * If the RDD is not checkpointed yet, return None.
   */
  // Check whether this RDD has already been checkpointed; if not, return None
  def getCheckpointDir: Option[String] = RDDCheckpointData.synchronized {
    if (isCheckpointed) {
      Some(cpDir.toString)
    } else {
      None
    }
  }

  /**
   * Materialize this RDD and write its content to a reliable DFS.
   * This is called immediately after the first action invoked on this RDD has completed.
   */
  protected override def doCheckpoint(): CheckpointRDD[T] = {
    val newRDD = ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir)

    // Optionally clean our checkpoint files if the reference is out of scope
    if (rdd.conf.getBoolean("spark.cleaner.referenceTracking.cleanCheckpoints", false)) {
      rdd.context.cleaner.foreach { cleaner =>
        cleaner.registerRDDCheckpointDataForCleanup(newRDD, rdd.id)
      }
    }

    logInfo(s"Done checkpointing RDD ${rdd.id} to $cpDir, new parent is RDD ${newRDD.id}")
    newRDD
  }
}
The companion object of the same name provides two functions, checkpointPath and cleanCheckpoint:
private[spark] object ReliableRDDCheckpointData extends Logging {

  /** Return the path of the directory to which this RDD's checkpoint data is written. */
  def checkpointPath(sc: SparkContext, rddId: Int): Option[Path] = {
    sc.checkpointDir.map { dir => new Path(dir, s"rdd-$rddId") }
  }

  /** Clean up the files associated with the checkpoint data for this RDD. */
  def cleanCheckpoint(sc: SparkContext, rddId: Int): Unit = {
    checkpointPath(sc, rddId).foreach { path =>
      val fs = path.getFileSystem(sc.hadoopConfiguration)
      if (fs.exists(path)) {
        if (!fs.delete(path, true)) {
          logWarning(s"Error deleting ${path.toString()}")
        }
      }
    }
  }
}
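Combining checkpointPath with the UUID directory created by setCheckpointDir gives the on-disk layout: every checkpointed RDD gets its own rdd-<id> subdirectory. A small sketch that reproduces the same logic (this is not Spark's API; the names and the example UUID are illustrative):

import org.apache.hadoop.fs.Path

// Sketch reproducing checkpointPath's logic, just to show the directory layout.
def checkpointPathSketch(checkpointDir: String, rddId: Int): Path =
  new Path(checkpointDir, s"rdd-$rddId")

// With an illustrative UUID directory created earlier by setCheckpointDir:
val p = checkpointPathSketch("hdfs:///user/spark/checkpoints/0b2f3c5e-uuid", 42)
// p.toString == "hdfs:///user/spark/checkpoints/0b2f3c5e-uuid/rdd-42"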
Now let's focus on the doCheckpoint function: it produces a new RDD via ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir).
/**
 * Write RDD to checkpoint files and return a ReliableCheckpointRDD representing the RDD.
 */
def writeRDDToCheckpointDirectory[T: ClassTag](
    originalRDD: RDD[T],
    checkpointDir: String,
    blockSize: Int = -1): ReliableCheckpointRDD[T] = {

  val sc = originalRDD.sparkContext

  // Create the output path for the checkpoint
  // checkpointDir is the directory we are checkpointing into
  val checkpointDirPath = new Path(checkpointDir)
  val fs = checkpointDirPath.getFileSystem(sc.hadoopConfiguration)
  if (!fs.mkdirs(checkpointDirPath)) {
    throw new SparkException(s"Failed to create checkpoint path $checkpointDirPath")
  }

  // Save to file, and reload it as an RDD
  val broadcastedConf = sc.broadcast(
    new SerializableConfiguration(sc.hadoopConfiguration))
  // TODO: This is expensive because it computes the RDD again unnecessarily (SPARK-8582)
  sc.runJob(originalRDD,
    writePartitionToCheckpointFile[T](checkpointDirPath.toString, broadcastedConf) _)

  if (originalRDD.partitioner.nonEmpty) {
    writePartitionerToCheckpointDir(sc, originalRDD.partitioner.get, checkpointDirPath)
  }

  val newRDD = new ReliableCheckpointRDD[T](
    sc, checkpointDirPath.toString, originalRDD.partitioner)
  if (newRDD.partitions.length != originalRDD.partitions.length) {
    throw new SparkException(
      s"Checkpoint RDD $newRDD(${newRDD.partitions.length}) has different " +
        s"number of partitions from original RDD $originalRDD(${originalRDD.partitions.length})")
  }
  newRDD
}
Analysis: runJob recomputes the RDD, which means the whole lineage is evaluated a second time just to write the checkpoint. If we persist the RDD before checkpointing it, that recomputation reads from the cache instead, which is much cheaper. Let's now look at the writePartitionToCheckpointFile function passed to runJob.
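A minimal sketch of that recommendation (names and paths are illustrative): cache the RDD first, so the extra job launched by runJob reads the cached blocks instead of recomputing the whole lineage:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistBeforeCheckpoint {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("persist-before-checkpoint").setMaster("local[*]")) // local master just for illustration
    sc.setCheckpointDir("hdfs:///user/spark/checkpoints")   // illustrative path

    val expensive = sc.textFile("hdfs:///data/huge-input")  // illustrative path
      .map(_.trim.toLowerCase)                              // stands in for costly per-record work

    expensive.persist(StorageLevel.MEMORY_AND_DISK) // cache first ...
    expensive.checkpoint()                          // ... then mark for checkpointing
    expensive.count()                               // computes once and caches; the checkpoint
                                                    // job then re-reads the cached blocks
    sc.stop()
  }
}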
/**
 * Write a RDD partition's data to a checkpoint file.
 */
def writePartitionToCheckpointFile[T: ClassTag](
    path: String,
    broadcastedConf: Broadcast[SerializableConfiguration],
    blockSize: Int = -1)(ctx: TaskContext, iterator: Iterator[T]) {

  val env = SparkEnv.get
  val outputDir = new Path(path)
  val fs = outputDir.getFileSystem(broadcastedConf.value.value)

  val finalOutputName = ReliableCheckpointRDD.checkpointFileName(ctx.partitionId())
  val finalOutputPath = new Path(outputDir, finalOutputName)
  val tempOutputPath =
    new Path(outputDir, s".$finalOutputName-attempt-${ctx.attemptNumber()}")

  if (fs.exists(tempOutputPath)) {
    throw new IOException(s"Checkpoint failed: temporary path $tempOutputPath already exists")
  }
  val bufferSize = env.conf.getInt("spark.buffer.size", 65536)

  val fileOutputStream = if (blockSize < 0) {
    fs.create(tempOutputPath, false, bufferSize)
  } else {
    // This is mainly for testing purpose
    fs.create(tempOutputPath, false, bufferSize,
      fs.getDefaultReplication(fs.getWorkingDirectory), blockSize)
  }
  val serializer = env.serializer.newInstance()
  val serializeStream = serializer.serializeStream(fileOutputStream)
  Utils.tryWithSafeFinally {
    serializeStream.writeAll(iterator)
  } {
    serializeStream.close()
  }

  if (!fs.rename(tempOutputPath, finalOutputPath)) {
    if (!fs.exists(finalOutputPath)) {
      logInfo(s"Deleting tempOutputPath $tempOutputPath")
      fs.delete(tempOutputPath, false)
      throw new IOException("Checkpoint failed: failed to save output of task: " +
        s"${ctx.attemptNumber()} and final output path does not exist: $finalOutputPath")
    } else {
      // Some other copy of this task must've finished before us and renamed it
      logInfo(s"Final output path $finalOutputPath already exists; not overwriting it")
      if (!fs.delete(tempOutputPath, false)) {
        logWarning(s"Error deleting ${tempOutputPath}")
      }
    }
  }
}
/**
 * Write a partitioner to the given RDD checkpoint directory. This is done on a best-effort
 * basis; any exception while writing the partitioner is caught, logged and ignored.
 */
private def writePartitionerToCheckpointDir(
    sc: SparkContext, partitioner: Partitioner, checkpointDirPath: Path): Unit = {
  try {
    val partitionerFilePath = new Path(checkpointDirPath, checkpointPartitionerFileName)
    val bufferSize = sc.conf.getInt("spark.buffer.size", 65536)
    val fs = partitionerFilePath.getFileSystem(sc.hadoopConfiguration)
    val fileOutputStream = fs.create(partitionerFilePath, false, bufferSize)
    val serializer = SparkEnv.get.serializer.newInstance()
    val serializeStream = serializer.serializeStream(fileOutputStream)
    Utils.tryWithSafeFinally {
      serializeStream.writeObject(partitioner)
    } {
      serializeStream.close()
    }
    logDebug(s"Written partitioner to $partitionerFilePath")
  } catch {
    case NonFatal(e) =>
      logWarning(s"Error writing partitioner $partitioner to $checkpointDirPath")
  }
}
writePartitionToCheckpointFile runs inside each task: it obtains the SparkEnv, serializes the records of its partition, and writes them out to the checkpoint directory; these files are what the new checkpoint RDD will later read its data from.
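Two details in writePartitionToCheckpointFile are worth spelling out: each task first writes to a hidden temporary file and renames it to the final name on success, so a failed or speculative attempt never corrupts the final file; and each partition gets a stable file name from ReliableCheckpointRDD.checkpointFileName. A sketch of that naming scheme (the part-%05d format is my reading of the source, so treat it as an assumption for your version):

// Sketch (not Spark's API) of the per-partition file naming used above.
def checkpointFileNameSketch(partitionIndex: Int): String =
  "part-%05d".format(partitionIndex)

// checkpointFileNameSketch(0)  == "part-00000"
// checkpointFileNameSketch(12) == "part-00012"
// A task attempt first writes ".part-00012-attempt-<n>" and renames it to "part-00012" on success.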
Finally, let's summarize
val newRDD = ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir):
1. Information from originalRDD (such as its partitioner and partition layout) is carried over to newRDD.
2. The RDD's data is written to HDFS, under the directory passed to setCheckpointDir(directory: String).
3. The SparkContext's hadoopConfiguration is broadcast to the executors.
4. runJob recomputes the RDD, so caching it beforehand (as shown above) gives a better result.
Now let's look at the last step in doCheckpoint:
if (rdd.conf.getBoolean("spark.cleaner.referenceTracking.cleanCheckpoints", false)) {
  rdd.context.cleaner.foreach { cleaner =>
    cleaner.registerRDDCheckpointDataForCleanup(newRDD, rdd.id)
  }
}
When spark.cleaner.referenceTracking.cleanCheckpoints is enabled, doCheckpoint registers the checkpoint files with the ContextCleaner so they can be removed once the RDD reference goes out of scope. More importantly, once the checkpoint is complete the old computation chain is dropped: the RDD's dependencies on its parents are cleared and newRDD becomes its new root parent, so tracing the lineage back now stops here.
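You can observe this lineage truncation directly with toDebugString (a sketch; the path is illustrative and the exact debug output differs between Spark versions):

import org.apache.spark.{SparkConf, SparkContext}

object LineageTruncation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("lineage-truncation").setMaster("local[*]")) // local master just for illustration
    sc.setCheckpointDir("hdfs:///user/spark/checkpoints")   // illustrative path

    val rdd = sc.parallelize(1 to 1000).map(_ * 2).filter(_ % 3 == 0)
    println(rdd.toDebugString) // full lineage down to the ParallelCollectionRDD

    rdd.checkpoint()
    rdd.count()                // triggers the checkpoint job

    println(rdd.toDebugString) // lineage is now rooted at a ReliableCheckpointRDD
    sc.stop()
  }
}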
How the checkpointed RDD is read back will be covered in a later part; what is certain is that the data is read from the file path through ReliableCheckpointRDD's readCheckpointFile method.
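As a preview of that read path, here is a simplified stand-in assuming the same serializer that wrote the files; it is not the actual readCheckpointFile signature, and it omits the task-completion hook that closes the stream:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkEnv

// Simplified stand-in for ReliableCheckpointRDD.readCheckpointFile: open one
// partition's checkpoint file and deserialize it back into an iterator of records.
def readCheckpointedPartition[T](fs: FileSystem, file: Path): Iterator[T] = {
  val env = SparkEnv.get
  val bufferSize = env.conf.getInt("spark.buffer.size", 65536)
  val fileInputStream = fs.open(file, bufferSize)
  val deserializeStream = env.serializer.newInstance().deserializeStream(fileInputStream)
  deserializeStream.asIterator.asInstanceOf[Iterator[T]] // caller is responsible for consuming it
}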