Spark RDD caching: the differences between cache, persist, and checkpoint
I. What are cache, persist, and checkpoint?
cache, persist, and checkpoint all store data produced by an RDD during job execution. Without them, if a partition fails (or the same RDD is needed again), Spark has to re-run the entire chain of RDD logic from the start of the lineage; with them, the already-computed data for that segment of the lineage can be read directly from the cache.
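The effect can be sketched in plain Scala without Spark (a minimal sketch: `compute` and `lineageRuns` are made-up names standing in for an RDD's lineage and its execution count):

```scala
// Without a cache, every "action" re-runs the whole lineage.
var lineageRuns = 0

// stands in for the chain of RDD transformations
def compute(): Seq[Int] = { lineageRuns += 1; Seq(1, 2, 3).map(_ * 2) }

// two uncached "actions": the lineage executes twice
val firstAction  = compute().sum   // lineage run #1
val secondAction = compute().sum   // lineage run #2

// "caching": materialize once, reuse afterwards (what cache/persist do)
lazy val cached: Seq[Int] = compute()
val thirdAction  = cached.sum      // lineage run #3 fills the cache
val fourthAction = cached.sum      // served from the cache, no new run
```

After the two uncached actions the lineage has run twice; the two cached actions together add only one more run.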
II. Differences
1. cache and persist
cache source code:
/**
 * Persist this RDD with the default storage level (`MEMORY_ONLY`).
 */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
persist source code:
/**
 * Set this RDD's storage level to persist its values across operations after the first time
 * it is computed. This can only be used to assign a new storage level if the RDD does not
 * have a storage level set yet. Local checkpointing is an exception.
 */
def persist(newLevel: StorageLevel): this.type = {
  if (isLocallyCheckpointed) {
    // This means the user previously called localCheckpoint(), which should have already
    // marked this RDD for persisting. Here we should override the old storage level with
    // one that is explicitly requested by the user (after adapting it to use disk).
    persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
  } else {
    persist(newLevel, allowOverride = false)
  }
}
From the source we can see that cache simply calls persist with the default storage level StorageLevel.MEMORY_ONLY (data is kept in memory only), whereas persist can be given any of the storage levels defined in StorageLevel (the constructor arguments are useDisk, useMemory, useOffHeap, deserialized, and an optional replication count):

val NONE = new StorageLevel(false, false, false, false)
val DISK_ONLY = new StorageLevel(true, false, false, false)
val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
val MEMORY_ONLY = new StorageLevel(false, true, false, true)
val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
val OFF_HEAP = new StorageLevel(true, true, true, false, 1)

The individual storage levels are beyond the scope of this article.

Data cached with cache or persist is deleted once the application finishes. Because the cached data does not survive the application, cache and persist do not sever the RDD's lineage.
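To make the boolean flags above easier to read, here is a small plain-Scala sketch (the `Level` case class and `describe` helper are hypothetical, not part of Spark; they only mirror the constructor arguments shown in the list):

```scala
// Mirrors StorageLevel's constructor arguments:
// (useDisk, useMemory, useOffHeap, deserialized[, replication])
case class Level(useDisk: Boolean, useMemory: Boolean, useOffHeap: Boolean,
                 deserialized: Boolean, replication: Int = 1)

// renders a level as a readable summary string
def describe(l: Level): String = {
  val media = Seq(
    if (l.useMemory) Some("memory") else None,
    if (l.useDisk) Some("disk") else None,
    if (l.useOffHeap) Some("off-heap") else None
  ).flatten.mkString("+")
  val form = if (l.deserialized) "deserialized" else "serialized"
  s"$media, $form, ${l.replication}x"
}

// MEMORY_ONLY = new StorageLevel(false, true, false, true)
val memoryOnly = describe(Level(false, true, false, true))
// => "memory, deserialized, 1x"

// MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
val memDiskSer2 = describe(Level(true, true, false, false, 2))
// => "memory+disk, serialized, 2x"
```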
2. checkpoint
With checkpoint, the relevant call happens when an action executes:
/**
 * Run a function on a given set of partitions in an RDD and pass the results to the given
 * handler function. This is the main entry point for all actions in Spark.
 *
 * @param rdd target RDD to run tasks on
 * @param func a function to run on each partition of the RDD
 * @param partitions set of partitions to run on; some jobs may not want to compute on all
 *   partitions of the target RDD, e.g. for operations like `first()`
 * @param resultHandler callback to pass each result to
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
The rdd.doCheckpoint() call at the end re-runs the RDD logic, so everything upstream of the checkpoint executes twice: once for the normal computation, and once more to write the checkpoint data.
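The double execution can be sketched in plain Scala without Spark (`compute` and `lineageRuns` are illustrative stand-ins for the lineage and its run count):

```scala
// The action computes the lineage for its own result, then doCheckpoint()
// walks the same lineage again to write the checkpoint files.
var lineageRuns = 0
def compute(): Seq[Int] = { lineageRuns += 1; Seq(1, 2, 3).map(_ + 1) }

val actionResult   = compute().sum  // pass 1: the action itself
val checkpointData = compute()      // pass 2: rdd.doCheckpoint() recomputes
// lineageRuns is now 2: the upstream logic executed twice
```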
Checkpoint data is not deleted when the application finishes.
Because checkpoint durably stores the data, it severs the RDD's lineage: the checkpoint becomes a new data source.
Summary
1. Data persisted with persist is deleted when the application ends; checkpoint data is not.
2. Both persist and checkpoint only take effect once an action executes.
3. With persist, the RDD logic executes once; with checkpoint alone, it executes twice.
4. persist does not sever the RDD lineage; checkpoint does.
5. In production, cache and checkpoint are usually used together: the RDD logic then executes only once, and the cached result is written to the checkpoint.
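In Spark this pattern is typically sc.setCheckpointDir(...), then rdd.cache() and rdd.checkpoint() before the action. Why it saves a pass can be sketched in plain Scala without Spark (again, `compute` and `lineageRuns` are illustrative stand-ins):

```scala
// With cache + checkpoint, the action fills the cache, and the checkpoint
// writer reads from the cache instead of recomputing the lineage.
var lineageRuns = 0
def compute(): Seq[Int] = { lineageRuns += 1; Seq(1, 2, 3).map(_ * 10) }

lazy val cached: Seq[Int] = compute()  // rdd.cache(): materialized on first use

val actionResult   = cached.sum  // action: lineage runs once, result cached
val checkpointData = cached      // doCheckpoint(): served from the cache
// lineageRuns is now 1: the upstream logic executed only once
```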