Spark RDD caching: the differences between cache, persist, and checkpoint

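This article walks through the caching mechanisms for Spark RDDs and how cache, persist, and checkpoint differ. cache is persist with the default MEMORY_ONLY storage level, while persist lets you choose the storage level. Data saved by cache and persist is deleted when the application ends, and the RDD's lineage is left intact. checkpoint is triggered when an action runs: it evaluates the RDD logic twice, writes the data out, and truncates the lineage, which suits data that has to outlive the application. In production, cache and checkpoint are usually combined so that the RDD is computed only once and is still persisted.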

I. What are cache, persist, and checkpoint?

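cache, persist, and checkpoint all save the data an RDD produces while a Spark job is running. If a task on some partition fails, Spark does not have to walk the lineage and re-run the logic of every upstream RDD; it reads the data already produced by that segment of the RDD chain straight from the saved copy.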
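To make the recomputation concrete, here is a minimal sketch, assuming Spark is on the classpath and can run with a local master. The object name `CacheReuseSketch`, the `evals` accumulator, and the sample data are illustrative and do not come from the original article. It runs two actions over the same mapped RDD and counts how often the map function executes, with and without `cache()`.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheReuseSketch {
  /** Runs two actions over one mapped RDD and returns how many times the map function ran. */
  def evaluations(useCache: Boolean): Long = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("cache-reuse-sketch")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical counter: incremented once per element each time the map function runs.
      val evals = sc.longAccumulator("evals")
      val mapped = sc.parallelize(1 to 100, 4).map { x => evals.add(1); x * 2 }
      if (useCache) mapped.cache()   // lazy: nothing is stored until the first action
      mapped.count()                 // first action computes the partitions (and caches them)
      mapped.reduce(_ + _)           // second action recomputes, or reads the cached blocks
      evals.value.longValue
    } finally sc.stop()
  }
}
```

In a local run without task retries, `evaluations(useCache = false)` should report twice as many calls as `evaluations(useCache = true)`. Accumulator updates made inside transformations can be applied again when a task is retried, so treat the counts as indicative rather than exact on a real cluster.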

II. Differences

1. cache and persist

Source behind cache: cache() simply calls the no-argument persist(), shown here

  /**
   * Persist this RDD with the default storage level (`MEMORY_ONLY`).
   */
  def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

Source of persist(newLevel: StorageLevel)

  /**
   * Set this RDD's storage level to persist its values across operations after the first time
   * it is computed. This can only be used to assign a new storage level if the RDD does not
   * have a storage level set yet. Local checkpointing is an exception.
   */
  def persist(newLevel: StorageLevel): this.type = {
    if (isLocallyCheckpointed) {
      // This means the user previously called localCheckpoint(), which should have already
      // marked this RDD for persisting. Here we should override the old storage level with
      // one that is explicitly requested by the user (after adapting it to use disk).
      persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
    } else {
      persist(newLevel, allowOverride = false)
    }
  }

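As the source shows, cache is just persist underneath, with the storage level fixed at StorageLevel.MEMORY_ONLY (data is kept in memory only), whereas persist accepts any of several storage levels.
Data saved by cache and persist is deleted once the application finishes.
Because that data is temporary and is gone at the latest when the application ends, cache and persist do not truncate the RDD's lineage: Spark keeps it so that missing partitions can be recomputed.
The predefined storage levels are listed below. The constructor arguments are (useDisk, useMemory, useOffHeap, deserialized, replication), and replication defaults to 1.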

  val NONE = new StorageLevel(false, false, false, false)
  val DISK_ONLY = new StorageLevel(true, false, false, false)
  val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
  val MEMORY_ONLY = new StorageLevel(false, true, false, true)
  val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
  val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
  val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
  val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
  val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
  val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
  val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
  val OFF_HEAP = new StorageLevel(true, true, true, false, 1)
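As a hedged illustration of the points above, the following sketch persists an RDD with an explicit level and then inspects it. It assumes Spark is on the classpath and runs with a local master; the object name `PersistLevelSketch` and the sample data are made up for this example. It checks three things: the level reported by `getStorageLevel`, that the RDD still has a parent dependency (lineage is kept), and that assigning a second, different level is rejected, as the doc comment on `persist` says.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistLevelSketch {
  /** Returns (assigned level, lineage still present, second persist call rejected). */
  def run(): (StorageLevel, Boolean, Boolean) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("persist-level-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 10, 2).map(_ + 1).persist(StorageLevel.MEMORY_AND_DISK)
      rdd.count()                                   // the action materializes the persisted blocks
      val level = rdd.getStorageLevel
      val keepsLineage = rdd.dependencies.nonEmpty  // the parent RDD is still referenced
      val reassignRejected =
        try { rdd.persist(StorageLevel.DISK_ONLY); false }
        catch { case _: UnsupportedOperationException => true }
      rdd.unpersist()                               // release the blocks explicitly
      (level, keepsLineage, reassignRejected)
    } finally sc.stop()
  }
}
```

Calling `unpersist()` as soon as the data is no longer needed frees the blocks earlier than waiting for the application to end.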

The details of the individual storage levels are outside the scope of this article.

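2. checkpoint

With checkpoint, the work happens when an action runs. Every action goes through SparkContext.runJob, which ends by calling rdd.doCheckpoint():

  /**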
   * Run a function on a given set of partitions in an RDD and pass the results to the given
   * handler function. This is the main entry point for all actions in Spark.
   *
   * @param rdd target RDD to run tasks on
   * @param func a function to run on each partition of the RDD
   * @param partitions set of partitions to run on; some jobs may not want to compute on all
   * partitions of the target RDD, e.g. for operations like `first()`
   * @param resultHandler callback to pass each result to
   */
  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit = {
    if (stopped.get()) {
      throw new IllegalStateException("SparkContext has been shutdown")
    }
    val callSite = getCallSite
    val cleanedFunc = clean(func)
    logInfo("Starting job: " + callSite.shortForm)
    if (conf.getBoolean("spark.logLineage", false)) {
      logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
    }
    dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
    progressBar.foreach(_.finishAll())
    rdd.doCheckpoint()
  }

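For an RDD that was marked with checkpoint() before the action, the rdd.doCheckpoint() call at the end evaluates the RDD logic a second time. Everything upstream of the checkpoint therefore runs twice: once for the action's normal computation and once more to write the checkpoint data.
Checkpoint data is not deleted when the application finishes.
Because the data has been saved, checkpoint truncates the lineage, and the checkpointed data becomes the RDD's new source.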
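The double evaluation and the truncated lineage can be observed with a small sketch. It assumes Spark is on the classpath and runs with a local master; the object name `CheckpointSketch`, the `evals` accumulator, and the temporary checkpoint directory are illustrative choices, not part of the original article.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  /** Returns (map evaluations, isCheckpointed, lineage string after the action). */
  def run(): (Long, Boolean, String) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("checkpoint-sketch")
    val sc = new SparkContext(conf)
    try {
      // A checkpoint directory must be set before checkpoint() is called.
      sc.setCheckpointDir(Files.createTempDirectory("spark-ckpt").toString)
      // Hypothetical counter: incremented once per element each time the map function runs.
      val evals = sc.longAccumulator("evals")
      val rdd = sc.parallelize(1 to 100, 4).map { x => evals.add(1); x * 2 }
      rdd.checkpoint()   // only marks the RDD; nothing is written yet
      rdd.count()        // runs the action, then doCheckpoint() launches a second job
      (evals.value.longValue, rdd.isCheckpointed, rdd.toDebugString)
    } finally sc.stop()
  }
}
```

In a local run without task retries, the counter should end up at twice the number of elements, and the lineage string should start from a checkpoint RDD instead of the original parallelized collection.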

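Summary

1. Data saved by persist is deleted when the application ends; checkpoint data is not.
2. Both persist and checkpoint are lazy: they only take effect once an action runs.
3. With persist the RDD logic runs once; with checkpoint alone it runs twice.
4. persist leaves the RDD's lineage intact; checkpoint truncates it.
5. In production, cache and checkpoint are usually used together: the RDD logic then runs only once, and the checkpoint is written from the cached data.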
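The combination in point 5 might look like the sketch below. It assumes Spark is on the classpath and runs with a local master; the object name `CacheThenCheckpointSketch`, the `evals` accumulator, and the temporary checkpoint directory are illustrative. Calling `cache()` before `checkpoint()` lets the checkpoint job read the cached partitions instead of recomputing them.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object CacheThenCheckpointSketch {
  /** Returns (number of map evaluations, whether the checkpoint was written). */
  def run(): (Long, Boolean) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("cache-then-checkpoint-sketch")
    val sc = new SparkContext(conf)
    try {
      sc.setCheckpointDir(Files.createTempDirectory("spark-ckpt").toString)
      // Hypothetical counter: incremented once per element each time the map function runs.
      val evals = sc.longAccumulator("evals")
      val rdd = sc.parallelize(1 to 100, 4).map { x => evals.add(1); x * 2 }
      rdd.cache()        // keep the computed partitions in memory
      rdd.checkpoint()   // mark for checkpointing; written after the first action
      rdd.count()        // computes once; the checkpoint job then reads the cache
      val result = (evals.value.longValue, rdd.isCheckpointed)
      rdd.unpersist()    // the checkpoint files now back the RDD, so the cache can go
      result
    } finally sc.stop()
  }
}
```

In a local run without task retries, the counter should equal the number of elements, which is half of what checkpoint alone would produce.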
