Spark RDD caching: the differences between cache, persist, and checkpoint
I. What are cache, persist, and checkpoint?
cache, persist, and checkpoint all store data produced by an RDD during job execution. Without them, if a partition fails (or the same RDD is needed again), Spark has to re-run the entire chain of RDD logic from the start of the lineage; with them, the already-computed data for that segment of the lineage can be read directly from the cache.
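The effect can be sketched in plain Scala without Spark (a minimal sketch: `compute` and `lineageRuns` are made-up names standing in for an RDD's lineage and its execution count):

```scala
// Without a cache, every "action" re-runs the whole lineage.
var lineageRuns = 0

// stands in for the chain of RDD transformations
def compute(): Seq[Int] = { lineageRuns += 1; Seq(1, 2, 3).map(_ * 2) }

// two uncached "actions": the lineage executes twice
val firstAction  = compute().sum   // lineage run #1
val secondAction = compute().sum   // lineage run #2

// "caching": materialize once, reuse afterwards (what cache/persist do)
lazy val cached: Seq[Int] = compute()
val thirdAction  = cached.sum      // lineage run #3 fills the cache
val fourthAction = cached.sum      // served from the cache, no new run
```

After the two uncached actions the lineage has run twice; the two cached actions together add only one more run.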
II. Differences
1. cache and persist
cache source code:
/**
 * Persist this RDD with the default storage level (`MEMORY_ONLY`).
 */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
persist source code:
/**
 * Set this RDD's storage level to persist its values across operations after the first time
 * it is computed. This can only be used to assign a new storage level if the RDD does not
 * have a storage level set yet. Local checkpointing is an exception.
 */
def persist(newLevel: StorageLevel): this.type = {
  if (isLocallyCheckpointed) {
    // This means the user previously called localCheckpoint(), which should have already
    // marked this RDD for persisting. Here we should override the old storage level with
    // one that is explicitly requested by the user (after adapting it to use disk).
    persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
  } else {
    persist(newLevel, allowOverride = false)
  }
}
From the source we can see that cache simply calls persist with the default storage level StorageLevel.MEMORY_ONLY (data is kept in memory only), whereas persist can be given any of the storage levels defined in StorageLevel (the constructor arguments are useDisk, useMemory, useOffHeap, deserialized, and an optional replication count):

val NONE = new StorageLevel(false, false, false, false)
val DISK_ONLY = new StorageLevel(true, false, false, false)
val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
val MEMORY_ONLY = new StorageLevel(false, true, false, true)
val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
val OFF_HEAP = new StorageLevel(true, true, true, false, 1)

The individual storage levels are beyond the scope of this article.

Data cached with cache or persist is deleted once the application finishes. Because the cached data does not survive the application, cache and persist do not sever the RDD's lineage.
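To make the boolean flags above easier to read, here is a small plain-Scala sketch (the `Level` case class and `describe` helper are hypothetical, not part of Spark; they only mirror the constructor arguments shown in the list):

```scala
// Mirrors StorageLevel's constructor arguments:
// (useDisk, useMemory, useOffHeap, deserialized[, replication])
case class Level(useDisk: Boolean, useMemory: Boolean, useOffHeap: Boolean,
                 deserialized: Boolean, replication: Int = 1)

// renders a level as a readable summary string
def describe(l: Level): String = {
  val media = Seq(
    if (l.useMemory) Some("memory") else None,
    if (l.useDisk) Some("disk") else None,
    if (l.useOffHeap) Some("off-heap") else None
  ).flatten.mkString("+")
  val form = if (l.deserialized) "deserialized" else "serialized"
  s"$media, $form, ${l.replication}x"
}

// MEMORY_ONLY = new StorageLevel(false, true, false, true)
val memoryOnly = describe(Level(false, true, false, true))
// => "memory, deserialized, 1x"

// MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
val memDiskSer2 = describe(Level(true, true, false, false, 2))
// => "memory+disk, serialized, 2x"
```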
2. checkpoint
With checkpoint, the relevant call happens when an action executes:
/**
 * Run a function on a given set of partitions in an RDD and pass the results to the given
 * handler function. This is the main entry point for all actions in Spark.
 *
 * @param rdd target RDD to run tasks on
 * @param func a function to run on each partition of the RDD
 * @param partitions set of partitions to run on; some jobs may not want to compute on all
 *   partitions of the target RDD, e.g. for operations like `first()`
 * @param resultHandler callback to pass each result to
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
The rdd.doCheckpoint() call at the end re-runs the RDD logic, so everything upstream of the checkpoint executes twice: once for the normal computation, and once more to write the checkpoint data.
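The double execution can be sketched in plain Scala without Spark (`compute` and `lineageRuns` are illustrative stand-ins for the lineage and its run count):

```scala
// The action computes the lineage for its own result, then doCheckpoint()
// walks the same lineage again to write the checkpoint files.
var lineageRuns = 0
def compute(): Seq[Int] = { lineageRuns += 1; Seq(1, 2, 3).map(_ + 1) }

val actionResult   = compute().sum  // pass 1: the action itself
val checkpointData = compute()      // pass 2: rdd.doCheckpoint() recomputes
// lineageRuns is now 2: the upstream logic executed twice
```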
Checkpoint data is not deleted when the application finishes.
Because checkpoint durably stores the data, it severs the RDD's lineage: the checkpoint becomes a new data source.
Summary
1. Data persisted with persist is deleted when the application ends; checkpoint data is not.
2. Both persist and checkpoint only take effect once an action executes.
3. With persist, the RDD logic executes once; with checkpoint alone, it executes twice.
4. persist does not sever the RDD lineage; checkpoint does.
5. In production, cache and checkpoint are usually used together: the RDD logic then executes only once, and the cached result is written to the checkpoint.
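In Spark this pattern is typically sc.setCheckpointDir(...), then rdd.cache() and rdd.checkpoint() before the action. Why it saves a pass can be sketched in plain Scala without Spark (again, `compute` and `lineageRuns` are illustrative stand-ins):

```scala
// With cache + checkpoint, the action fills the cache, and the checkpoint
// writer reads from the cache instead of recomputing the lineage.
var lineageRuns = 0
def compute(): Seq[Int] = { lineageRuns += 1; Seq(1, 2, 3).map(_ * 10) }

lazy val cached: Seq[Int] = compute()  // rdd.cache(): materialized on first use

val actionResult   = cached.sum  // action: lineage runs once, result cached
val checkpointData = cached      // doCheckpoint(): served from the cache
// lineageRuns is now 1: the upstream logic executed only once
```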