A Little Progress Every Time: The Difference Between cache and persist in Spark

License: CC 4.0 BY-SA

Tags: #spark #storage-level #cache #persist

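In an interview yesterday I was asked about the difference between cache and persist. At the time I only remembered that one of them calls the other, and I could not say how they actually differ. So afterwards I went back and reread the source code, and I think I now understand the difference.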

Both cache and persist are used to cache an RDD, so that later operations reuse the cached data instead of recomputing it. This can save a great deal of run time.
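To make the "no recomputation" point concrete, here is a minimal sketch that does not use Spark at all. `ToyDataset`, `computeCount`, and `demo` are hypothetical names invented for this illustration; the class only imitates the idea that a lazily evaluated dataset is recomputed by every action unless it has been marked for caching.

```scala
// Toy model, NOT Spark: a lazily evaluated dataset that counts how often
// its source function is run. All names here are made up for illustration.
object CacheSketch {
  final class ToyDataset[A](compute: () => Seq[A]) {
    private var cached: Option[Seq[A]] = None
    private var cacheRequested = false
    var computeCount = 0 // number of times the source was evaluated

    // Only marks the dataset for caching; nothing is computed yet.
    def cache(): this.type = { cacheRequested = true; this }

    // Reuse the cached data if present, otherwise compute (and maybe store) it.
    private def materialize(): Seq[A] = cached.getOrElse {
      computeCount += 1
      val data = compute()
      if (cacheRequested) cached = Some(data)
      data
    }

    // Two "actions" that each need the full data.
    def count(): Int = materialize().size
    def sum(implicit num: Numeric[A]): A = materialize().sum
  }

  // Runs the same two actions on an uncached and a cached dataset and
  // returns how many times each one had to evaluate its source.
  def demo(): (Int, Int) = {
    val uncached = new ToyDataset(() => (1 to 5).map(_ * 2))
    uncached.count()
    uncached.sum

    val cachedDs = new ToyDataset(() => (1 to 5).map(_ * 2)).cache()
    cachedDs.count()
    cachedDs.sum

    (uncached.computeCount, cachedDs.computeCount)
  }
}
```

In this toy model the uncached dataset evaluates its source once per action, while the cached one evaluates it only on the first action. A real RDD behaves analogously across actions, although the actual storage is handled by Spark's block manager rather than a local field.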

The difference between cache and persist

In the Spark 1.4.1 source code, we can see:

/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def cache(): this.type = persist()

This shows that cache() simply calls persist(). To see how the two differ, we also need to look at the persist function:

/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
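The two one-line definitions above form a delegation chain, which can be modeled without a Spark dependency. The sketch below is an assumption-laden stand-in: `ToyRdd` and the `Level` objects are hypothetical names that only mirror the shape of the real API, and the one-argument `persist` is reduced to recording the level, skipping whatever validation the real method performs.

```scala
// Toy model, NOT Spark: mirrors the shape cache() -> persist() -> persist(level).
object PersistChainSketch {
  // Hypothetical stand-in for org.apache.spark.storage.StorageLevel;
  // only a few levels are modeled here.
  sealed trait Level
  case object NONE extends Level
  case object MEMORY_ONLY extends Level
  case object MEMORY_AND_DISK extends Level

  final class ToyRdd {
    private var level: Level = NONE
    def getStorageLevel: Level = level

    // Simplified: just records the requested level (no validation).
    def persist(newLevel: Level): this.type = { level = newLevel; this }

    // No-argument persist picks the default level.
    def persist(): this.type = persist(MEMORY_ONLY)

    // cache() is only a shorthand for the no-argument persist().
    def cache(): this.type = persist()
  }
}
```

With this model, `new PersistChainSketch.ToyRdd().cache().getStorageLevel` yields `MEMORY_ONLY`, while calling `persist(PersistChainSketch.MEMORY_AND_DISK)` on a fresh instance records that level instead. The only thing `cache()` cannot do is choose a level.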

We can see that persist() in turn calls persist(StorageLevel.MEMORY_ONLY). Going one level deeper:

/**
 * Set this RDD's storage level to persist its values across operations after the first time
 * it is computed. This can only be used to assign a new storage level if the RDD does not
 * have a storage level set yet..
 */
def persist(newLevel: StorageLevel): this.type = {
  // TODO: Handle changes of StorageLevel
  if (storageLev