[spark-src-core] 6. checkpoint in spark

This article walks through how Apache Spark's checkpoint mechanism works and how to use it: setting a checkpoint directory, marking an RDD for checkpointing, and running the actual checkpoint operation so that data can be recovered without recomputation. It also discusses why persisting the RDD in memory is recommended, how a checkpoint is used to restore data, and why the computed results are not saved during the job's first run.


   Like other big-data technologies, Spark uses checkpointing as a well-known way to take a snapshot of data and speed up failover: a job can be restored from the most recent checkpointed state instead of recomputing the RDD from scratch.

  In fact, the checkpoint operation cuts off the relationship to all parent RDDs: the current RDD becomes the last RDD in the data lineage, and it is backed by a CheckpointRDD to achieve this. Moreover, RDDCheckpointData is the wrapper that manages this CheckpointRDD.
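To make the lineage-truncation idea concrete, here is a minimal, self-contained toy sketch in plain Scala (these are illustrative classes, not Spark's actual ones) showing how a checkpointed node replaces the whole parent chain with a single node backed by stable storage:

```scala
// Toy model of RDD lineage truncation -- NOT Spark's real classes.
// A node either computes from its parents or, once checkpointed,
// reads from stable storage and has no parents at all.
sealed trait ToyRDD { def parents: List[ToyRDD] }

case class MappedRDD(parent: ToyRDD) extends ToyRDD {
  def parents: List[ToyRDD] = List(parent)
}
case object SourceRDD extends ToyRDD { def parents: List[ToyRDD] = Nil }
// Stands in for CheckpointRDD: it reads saved files, so its lineage is empty.
case object CheckpointedRDD extends ToyRDD { def parents: List[ToyRDD] = Nil }

def lineageDepth(r: ToyRDD): Int =
  1 + r.parents.map(lineageDepth).foldLeft(0)(math.max)

// "Checkpointing" the head of the chain swaps the whole parent chain
// for a single node backed by stable storage.
val chain: ToyRDD = MappedRDD(MappedRDD(MappedRDD(SourceRDD)))
val afterCheckpoint: ToyRDD = CheckpointedRDD

println(lineageDepth(chain))           // 4
println(lineageDepth(afterCheckpoint)) // 1
```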

 

1. How to

  In Spark (1.4.1), checkpointing is done in the following steps:

  

a. set the checkpoint directory via SparkContext.setCheckpointDir(...)
b. mark a snapshot point in the data lineage: rdd.checkpoint()
c. the real checkpoint operation runs at the end of a job (by default)
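Put together, the three steps look roughly like this in application code (a sketch against the Spark 1.x RDD API; the local master and checkpoint path are placeholders, and a real cluster would use an HDFS path):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("checkpoint-demo").setMaster("local[2]"))

// step a: set the checkpoint directory (an HDFS path on a real cluster)
sc.setCheckpointDir("/tmp/spark-checkpoints")

val rdd = sc.parallelize(1 to 100).map(_ * 2)
rdd.persist()     // recommended: avoids recomputing when the checkpoint job runs
rdd.checkpoint()  // step b: mark the RDD for checkpointing; nothing is written yet

// step c: the first action triggers the job, and the actual checkpoint
// write runs as a separate job immediately after it finishes
val n = rdd.count()
println(n)                  // 100
println(rdd.isCheckpointed) // true once the checkpoint job has completed
sc.stop()
```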

  Now let's go through these steps in more detail.

   In step (b), the implementation in the source is:

 /**
   * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
   * directory set with SparkContext.setCheckpointDir() and all references to its parent
   * RDDs will be removed. This function must be called before any job has been
   * executed on this RDD. It is strongly recommended that this RDD is persisted in
   * memory, otherwise saving it on a file will require recomputation. (cf. RDDCheckpointData#doCheckpoint())
   */
  def checkpoint() {
    if (context.checkpointDir.isEmpty) {
      throw new SparkException("Checkpoint directory has not been set in the SparkContext")
    } else if (checkpointData.isEmpty) {
      checkpointData = Some(new RDDCheckpointData(this))
      checkpointData.get.markForCheckpoint()
    }
  }

   Reading the comment, you may be curious: why is it necessary to persist the RDD, and why in memory?

   Diving into the source, we find that the checkpoint operation is really a second job run over this RDD one more time to save its result to a file; so if the RDD is not persisted, it will be computed one extra time.

  On the other hand, why is persisting this RDD in memory recommended over disk? In fact, there is little difference between data persisted in memory and data persisted as files (perhaps only the storage format), so I think the author's emphasis is not on where to persist but on the act of persisting itself.
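The recomputation cost can be illustrated with a self-contained toy sketch (plain Scala, not the Spark API): a counter stands in for running the RDD's lineage, and a cached variant stands in for persist().

```scala
// Toy illustration (not Spark code): why persisting before checkpoint
// saves a recomputation. compute() stands for evaluating the RDD's lineage.
var computations = 0
def compute(): Seq[Int] = { computations += 1; (1 to 5).map(_ * 2) }

// a trivial stand-in for persist(): compute once, keep the result in memory
var cached: Option[Seq[Int]] = None
def computeOrCached(): Seq[Int] = cached match {
  case Some(data) => data
  case None       => val data = compute(); cached = Some(data); data
}

// without persist: the action computes once, the checkpoint job computes again
computations = 0
val actionResult    = compute()
val checkpointWrite = compute()
println(computations) // 2

// with persist: the checkpoint job reuses the cached data
computations = 0
cached = None
val actionResult2    = computeOrCached()
val checkpointWrite2 = computeOrCached()
println(computations) // 1
```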

  

2. FAQ

 

a. How to use a checkpoint to restore data

  StreamingContext provides a function named getOrCreate(...) that takes the checkpoint directory defined earlier; if snapshot data exists there, it is read back into the RDDs.
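The restore pattern looks like this (a sketch of the driver-restart idiom from Spark Streaming; the path, master, and batch interval are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "/tmp/streaming-checkpoints"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("restore-demo").setMaster("local[2]")
  val ssc  = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir) // metadata and data checkpoints go here
  // ... define the streaming computation here ...
  ssc
}

// On a fresh start this calls createContext(); after a driver crash it
// rebuilds the context (and its RDD lineage) from the checkpoint files.
val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
ssc.start()
ssc.awaitTermination()
```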

 

b. Why not save the computed results the first time the RDD runs

   No doubt, the real meaning of the checkpoint operation is a second, identical job run over this RDD. So why not save the results to files simultaneously during the first run?

  First, only one anonymous function can be defined in any runJob(...), so no extra parameter can be accepted besides the user function.

  Second, keeping the user function separate from the checkpoint save operation makes the code clearer to debug, maintain, etc.

 

 
