Spark Streaming Checkpoint

This article covers the use of checkpointing in Apache Spark Streaming and how it works, including why checkpointing matters, when to enable it, the difference between metadata and data checkpointing, and a comparison of checkpointing with persist.


1. Objective

This document covers Spark Streaming checkpointing. We will start with what a streaming checkpoint is and how it helps achieve fault tolerance. Spark offers two types of checkpointing, reliable checkpointing and local checkpointing, and this tutorial covers both in detail. We will also compare checkpointing with persist() in Spark.

Spark Streaming Checkpoint in Apache Spark

2. What is Spark Streaming Checkpoint

Checkpointing is the process of writing received records to a fault-tolerant store such as HDFS at checkpoint intervals. A streaming application must operate 24/7, so it must be resilient to failures unrelated to the application logic, such as system failures and JVM crashes. Checkpointing makes stream processing pipelines fault tolerant: after a failure, input DStreams can restore the pre-failure streaming state and continue processing.

In Streaming, DStreams checkpoint input data at specified time intervals. To make recovery possible, Spark must checkpoint enough information to a fault-tolerant storage system. Checkpointing is of two types.

2.1. Metadata checkpointing 

We use it to recover from the failure of the node running the driver of the streaming application. Metadata includes:

    • Configuration – The configuration used to create the streaming application.
    • DStream operations – The set of operations that define the streaming application.
    • Incomplete batches – Jobs that are queued but have not completed yet.

2.2. Data checkpointing 

Data checkpointing saves the generated RDDs to reliable storage. It is necessary for stateful transformations that combine data across multiple batches: the RDDs generated by such transformations depend on RDDs of previous batches, so the dependency chain keeps growing with time, and recovery time grows with it. To avoid this, the intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage, which cuts off the dependency chains.

In other words, metadata checkpointing is primarily needed for recovery from driver failures, whereas data (RDD) checkpointing is necessary even for basic functioning when stateful transformations are used.

3. When to enable Checkpointing in Spark Streaming

Checkpointing in Spark Streaming is a must for applications with any of the following requirements:

3.1. While we use stateful transformations

A checkpoint directory must be provided to allow periodic RDD checkpointing whenever the application uses stateful transformations such as updateStateByKey or reduceByKeyAndWindow (with its inverse function).
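As a minimal sketch of this requirement, a stateful word count with updateStateByKey might look like the following. The socket source on localhost:9999, the application name, and the HDFS checkpoint path are illustrative assumptions, not from the original:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StatefulWordCount")
val ssc  = new StreamingContext(conf, Seconds(10))

// Required for stateful transformations: without a checkpoint directory,
// updateStateByKey fails with an IllegalArgumentException at runtime.
ssc.checkpoint("hdfs://namenode:8020/checkpoints/stateful-wordcount")

val lines = ssc.socketTextStream("localhost", 9999)
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

// Running count per word, carried across batches; the state RDDs produced
// here are what data checkpointing periodically writes to reliable storage.
val runningCounts = pairs.updateStateByKey[Int] {
  (newValues: Seq[Int], oldCount: Option[Int]) =>
    Some(newValues.sum + oldCount.getOrElse(0))
}

runningCounts.print()
ssc.start()
ssc.awaitTermination()
```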

3.2. Recovering from failures of the driver running the application

To recover with progress information after a driver failure, we use metadata checkpoints.

Note: Apart from the cases mentioned above, simple streaming applications can run without enabling checkpointing. In that case, recovery from driver failures will be partial, and some received but unprocessed data may be lost. This is often acceptable, and many Spark Streaming applications are run this way.

4. Marking StreamingContext as Checkpointed

To persist checkpoint data we use the StreamingContext.checkpoint method, which sets an HDFS-compatible checkpoint directory.

For example:

  ssc.checkpoint("_checkpoint")
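To actually recover from driver failures, Spark pairs the checkpoint directory with StreamingContext.getOrCreate, which rebuilds the context from checkpoint data on restart. A sketch of the usual pattern, where the application name and HDFS path are assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs://namenode:8020/checkpoints/my-app" // assumed path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("CheckpointedApp")
  val ssc  = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir) // mark the context as checkpointed
  // ... define the DStream operations of the application here ...
  ssc
}

// On first start this calls createContext(); after a driver failure it
// instead rebuilds the context from the data in checkpointDir.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```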

5. Types of Checkpointing in Spark Streaming

Apache Spark checkpointing falls into two categories:

5.1. Reliable Checkpointing –

Reliable checkpointing stores the actual RDD in a reliable distributed file system, e.g. HDFS. To set the checkpoint directory, we call the following method:

  SparkContext.setCheckpointDir(directory: String)

When running on a cluster, the directory must be an HDFS path, not a local one: the checkpoint files are actually written on the executors' machines, so with a local path the driver would try to recover the checkpointed RDD from its own local file system, where the files do not exist.
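A minimal sketch of reliable checkpointing, assuming an illustrative HDFS path and application name. Note that checkpoint() only marks the RDD; the files are written when an action first materializes it:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("ReliableCheckpoint"))

// On a cluster this must be a distributed file system path (e.g. HDFS),
// because the checkpoint files are written by the executors.
sc.setCheckpointDir("hdfs://namenode:8020/checkpoints/rdds")

val rdd = sc.parallelize(1 to 1000).map(_ * 2)
rdd.checkpoint()            // only marks the RDD; nothing is written yet
rdd.count()                 // the first action materializes the checkpoint
println(rdd.isCheckpointed) // true once the files have been written
```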

5.2. Local Checkpointing –

Local checkpointing, used in Spark Streaming and GraphX, also truncates the RDD lineage graph, but persists the RDD to local storage on the executors rather than to a reliable distributed file system. This is faster, but the checkpointed data does not survive executor loss.
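A corresponding sketch using RDD.localCheckpoint (the application name is an assumption):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("LocalCheckpoint"))

// No checkpoint directory is needed: the data is persisted to
// executor-local storage. Lineage is truncated, but the checkpointed
// RDD cannot be recovered if an executor is lost.
val rdd = sc.parallelize(1 to 1000).map(_ * 2)
rdd.localCheckpoint()
rdd.count() // an action materializes the local checkpoint
```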

6. Difference between Spark Checkpointing and Persist

Spark checkpointing and persist differ in many ways. Let's discuss them one by one:

Persist

  • When we persist an RDD with DISK_ONLY storage, it is stored on disk; if that stored copy is lost, later uses of the RDD fall back to recomputing it from its lineage.
  • Spark remembers the lineage of the RDD even after persist() is called, even when it does not need it.
  • As soon as the job run is complete, the cache is cleared and all the files are destroyed.

Checkpointing

  • Through checkpointing, RDDs are stored in HDFS, and the lineage that created them is deleted.
  • Unlike the cache, checkpoint files are not deleted when the job run completes.
  • Checkpointing an RDD results in double computation: the RDD is computed once for the action and once more when it is written to the checkpoint directory.
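The contrast, and the common trick of caching before checkpointing to avoid the double computation, can be sketched as follows (the application name and HDFS path are assumptions):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("PersistVsCheckpoint"))
sc.setCheckpointDir("hdfs://namenode:8020/checkpoints/compare")

val base = sc.parallelize(1 to 1000).map(_ * 2)

// Persist: lineage is kept; the on-disk copy is cleaned up with the job.
val persisted = base.persist(StorageLevel.DISK_ONLY)
persisted.count()

// Checkpoint: caching first lets the checkpoint write read from the
// cache instead of recomputing the RDD a second time.
val checkpointed = base.cache()
checkpointed.checkpoint()
checkpointed.count() // triggers both the cache and the checkpoint write
```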

7. Spark Streaming Checkpoint – Conclusion

This Spark Streaming Checkpoint tutorial showed that checkpointing is how Spark Streaming achieves fault tolerance: whenever needed, it makes the streaming data recoverable. Moreover, unlike with the persist method, the checkpoint files are not removed when the job completes. An RDD in Apache Spark should be checkpointed if its computation takes a long time, its computing chain is too long, or it depends on too many RDDs. Checkpointing avoids unbounded growth in recovery time and thus improves the resilience of the system.

