Understanding Flink's Exactly-Once Guarantee: A Deep Dive (with Source Code)

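1. The Traditional Approach: Aligned Checkpoints

1.1 Why Is Barrier Alignment Needed?

With traditional aligned checkpoints, barrier alignment is mandatory. Here is why.
Problem scenario

Consider an operator with two input streams: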

Input Stream 1: [A1] [A2] [A3] [Barrier-n] [A4] [A5]
                                    ↓
Input Stream 2: [B1] [Barrier-n] [B2] [B3] [B4]
                        ↓
                   Operator State

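What happens without alignment?

Timeline:
T1: Barrier-n arrives from Stream 2
T2: the operator keeps processing [B2] [B3] from Stream 2
T3: Barrier-n arrives from Stream 1 → the snapshot is taken

Problem: B2 and B3 are already reflected in the snapshot, yet they belong to Checkpoint n+1
     If the job later fails and recovers:
     - state is restored from Checkpoint n (which already includes B2 and B3)
     - Stream 2 is replayed from Barrier-n, so B2 and B3 are applied again
     → Duplicate processing! Exactly-once is broken!
     (Snapshotting right at T1 does not help either: records of Stream 1 that sit
      ahead of Barrier-n but are still unprocessed would be missing from the
      snapshot and never replayed.)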
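
The duplicate can be reproduced with a small simulation. The sketch below is not Flink code: the class and method names are made up for illustration, the operator simply counts records, and the snapshot is assumed to be taken when the last barrier arrives. The arrival order follows the two streams drawn above, with Stream 1 as input 0 and Stream 2 as input 1.

```java
import java.util.Arrays;
import java.util.List;

public class AlignmentDemo {

    /** One arrival at the operator: the input it came from and its payload. */
    public static final class Event {
        final int input;
        final String value;

        Event(int input, String value) {
            this.input = input;
            this.value = value;
        }

        boolean isBarrier() {
            return "BARRIER".equals(value);
        }
    }

    /** Arrival order for the example: Stream 2 (input 1) delivers its barrier first. */
    public static List<Event> timeline() {
        return Arrays.asList(
            new Event(0, "A1"), new Event(1, "B1"),
            new Event(1, "BARRIER"),                                     // T1
            new Event(0, "A2"), new Event(1, "B2"), new Event(1, "B3"),  // T2
            new Event(0, "A3"), new Event(0, "BARRIER"),                 // T3
            new Event(0, "A4"), new Event(0, "A5"), new Event(1, "B4"));
    }

    /** Record count stored in the snapshot, taken when the last barrier arrives. */
    public static int snapshotCount(List<Event> arrivals, int numInputs, boolean align) {
        boolean[] barrierSeen = new boolean[numInputs];
        int barriers = 0;
        int count = 0;
        for (Event e : arrivals) {
            if (e.isBarrier()) {
                barrierSeen[e.input] = true;
                if (++barriers == numInputs) {
                    return count;
                }
            } else if (!(align && barrierSeen[e.input])) {
                count++; // record applied to state before the snapshot
            }
            // In aligned mode a record behind its input's barrier is held back.
        }
        throw new IllegalStateException("an input never delivered its barrier");
    }

    /** Count after recovery: restored snapshot plus every replayed post-barrier record. */
    public static int countAfterRecovery(List<Event> arrivals, int numInputs, boolean align) {
        int count = snapshotCount(arrivals, numInputs, align);
        boolean[] barrierSeen = new boolean[numInputs];
        for (Event e : arrivals) {
            if (e.isBarrier()) {
                barrierSeen[e.input] = true;
            } else if (barrierSeen[e.input]) {
                count++; // the source rewinds to the barrier and replays this record
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Event> t = timeline();
        System.out.println("records in the streams:    9");
        System.out.println("recovered, no alignment:   " + countAfterRecovery(t, 2, false)); // 11
        System.out.println("recovered, with alignment: " + countAfterRecovery(t, 2, true));  // 9
    }
}
```

Without alignment the recovered count is 11 for 9 records, because B2 and B3 are in the snapshot and are replayed as well. With alignment they are held back until the snapshot is taken, so the recovered count is 9.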

1.2 How an Aligned Checkpoint Works

┌─────────────────────────────────────────────────────────────┐
│ Step 1: Receive the first barrier                           │
├─────────────────────────────────────────────────────────────┤
Input1: [A1][A2][Barrier-n]  ← barrier received
Input2: [B1][B2][B3][B4]     ← barrier not yet received

Action: block Input1 and hold back its subsequent records [A3][A4]
        keep processing records from Input2
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│ Step 2: Wait for alignment (barrier alignment)              │
├─────────────────────────────────────────────────────────────┤
Input1: [held back: A3, A4]  ← blocked
Input2: [B2][B3][Barrier-n]  ← barrier received

Action: both inputs have now received Barrier-n
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│ Step 3: Take the snapshot                                   │
├─────────────────────────────────────────────────────────────┤
State: the state reflects every record that precedes Barrier-n
       = f([A1,A2][B1,B2,B3])
       
Action: 
1. Persist the state asynchronously
2. Emit Barrier-n downstream
3. Unblock the inputs and resume with the held-back records
└─────────────────────────────────────────────────────────────┘

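1.3 Core Implementation

// Simplified sketch of the alignment logic, modeled on
// org.apache.flink.streaming.runtime.io.checkpointing.CheckpointBarrierAligner.
// It is not the verbatim Flink source; the constructor and the abstract
// method are placeholders added so the sketch is complete.

import java.util.Arrays;
import org.apache.flink.runtime.checkpoint.channel.InputChannelInfo;
import org.apache.flink.runtime.io.network.api.CheckpointBarrier;

public abstract class CheckpointBarrierAligner {

    private final boolean[][] blockedChannels;   // [gate][channel] → blocked?
    private final int totalNumberOfInputChannels;
    private int numBarriersReceived;
    private long startOfAlignmentTimestamp;

    protected CheckpointBarrierAligner(int numGates, int channelsPerGate) {
        this.blockedChannels = new boolean[numGates][channelsPerGate];
        this.totalNumberOfInputChannels = numGates * channelsPerGate;
    }

    // Called whenever a barrier arrives on an input channel
    public void processBarrier(CheckpointBarrier barrier, InputChannelInfo channel) {

        if (numBarriersReceived == 0) {
            // First barrier of this checkpoint: alignment starts now
            startOfAlignmentTimestamp = System.nanoTime();
        }

        // Block the channel the barrier arrived on
        blockChannel(channel);
        numBarriersReceived++;

        // Have all input channels delivered the barrier?
        if (numBarriersReceived == totalNumberOfInputChannels) {
            // ✅ Alignment complete: take the checkpoint
            triggerCheckpoint(barrier);

            // Unblock every channel and reset for the next checkpoint
            releaseBlocksAndResetBarriers();
        }
    }

    // While a channel is blocked, its records are held back instead of processed
    private void blockChannel(InputChannelInfo channel) {
        blockedChannels[channel.getGateIdx()][channel.getInputChannelIdx()] = true;
    }

    private void releaseBlocksAndResetBarriers() {
        for (boolean[] gate : blockedChannels) {
            Arrays.fill(gate, false);
        }
        numBarriersReceived = 0;
    }

    // Placeholder: snapshot the operator state and forward the barrier downstream
    protected abstract void triggerCheckpoint(CheckpointBarrier barrier);
}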

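1.4 Problems with Aligned Checkpoints

Performance under backpressure

Extreme case:

Input1: [fast stream] → Barrier-n arrives quickly
Input2: [slow stream] → Barrier-n takes a long time to arrive

Alignment can then take minutes, or even hours!

Consequences:
├── Checkpoints take longer to complete
├── Large amounts of data are held back (consuming memory)
├── Processing of the fast stream is blocked
└── End-to-end latency of the job suffers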
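
As a rough worked example of the cost, alignment time at an operator is the gap between the first and the last barrier to arrive, and each input stays blocked from its own barrier until the last one. The sketch below only does that arithmetic; the class name and the arrival times are hypothetical.

```java
import java.util.Arrays;

public class AlignmentCost {

    /** Alignment time = arrival of the last barrier minus arrival of the first. */
    public static long alignmentMillis(long[] barrierArrivalMillis) {
        long first = Arrays.stream(barrierArrivalMillis).min().getAsLong();
        long last = Arrays.stream(barrierArrivalMillis).max().getAsLong();
        return last - first;
    }

    /** How long each input stays blocked: from its own barrier until the last one. */
    public static long[] blockedMillis(long[] barrierArrivalMillis) {
        long last = Arrays.stream(barrierArrivalMillis).max().getAsLong();
        return Arrays.stream(barrierArrivalMillis).map(t -> last - t).toArray();
    }

    public static void main(String[] args) {
        // Hypothetical arrivals: the fast input after 0.2 s, the slow one after 3 min.
        long[] arrivals = {200L, 180_000L};
        System.out.println("alignment: " + alignmentMillis(arrivals) + " ms");
        System.out.println("blocked per input: " + Arrays.toString(blockedMillis(arrivals)));
    }
}
```

With these numbers the fast input is blocked for almost the full three minutes while the slow input is never blocked, which is why a single slow input dominates checkpoint duration.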

2. The New Approach: Unaligned Checkpoints

2.1 Core Idea

A major improvement introduced in Flink 1.11

No waiting for barrier alignment, while still guaranteeing exactly-once!
Key innovation: in-flight data is persisted as part of the checkpoint state

2.2 How an Unaligned Checkpoint Works

┌─────────────────────────────────────────────────────────────┐
│ Step 1: Receive the first barrier (no waiting!)             │
├─────────────────────────────────────────────────────────────┤
Input1: [A1][A2][Barrier-n][A3][A4]  ← barrier received
Input2: [B1][B2][B3][B4]             ← barrier not yet received

Action: 
1. Handle the barrier immediately (no blocking)
2. Forward the barrier downstream right away
3. Mark the records the barrier overtook for persistence
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│ Step 2: Snapshot state + in-flight data                     │
├─────────────────────────────────────────────────────────────┤
The checkpoint contains:

1. Operator State
   └── current state (A1, A2 processed)

2. Input Buffer State
   └── records in Input2 not yet processed: [B1][B2][B3][B4]
   
3. Output Buffer State
   └── records already emitted but still in the network

Full snapshot = Operator State + In-flight Data
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│ Step 3: Barriers propagate quickly                          │
├─────────────────────────────────────────────────────────────┤
Barriers are not stuck behind queued data; they can "overtake" it

Source → Barrier → Sink  (completes quickly)
    ↓
Largely unaffected by backpressure!
└─────────────────────────────────────────────────────────────┘

2.3 Why Is Exactly-Once Still Guaranteed?

Key: in-flight data is replayed on recovery

State before the failure:
┌─────────────────────────────────────┐
│ Checkpoint n (Unaligned)            │
├─────────────────────────────────────┤
│ Operator State:                     │
│   count = 5 (processed A1, A2)      │
│                                     │
│ In-flight Data:                     │
│   Input Buffer: [B1, B2]            │
│   Output Buffer: [Result1, Result2] │
└─────────────────────────────────────┘

Recovery:
1. Restore the Operator State (count=5)
2. Replay the in-flight data first: [B1, B2]
   └── in the original order!
3. Then continue with new data

Result: every record still affects the state exactly once!
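
The recovery argument can be checked with a toy model. The sketch below is not Flink code: the names are hypothetical, the operator keeps a running sum so that a lost or duplicated record would change the result, and each input buffer is a plain list in which a marker value stands for Barrier-n.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UnalignedRecoveryDemo {

    /** Marker standing in for Barrier-n inside an input buffer. */
    static final int BARRIER = Integer.MIN_VALUE;

    /** An unaligned snapshot: operator state plus the records the barrier overtook. */
    public static final class Snapshot {
        final long sum;
        final List<Integer> inFlight;

        Snapshot(long sum, List<Integer> inFlight) {
            this.sum = sum;
            this.inFlight = inFlight;
        }
    }

    /** Captures the state and, per input, the unprocessed records ahead of the barrier. */
    public static Snapshot snapshot(long sum, List<List<Integer>> pendingPerInput) {
        List<Integer> inFlight = new ArrayList<>();
        for (List<Integer> pending : pendingPerInput) {
            for (int value : pending) {
                if (value == BARRIER) {
                    break; // records behind the barrier belong to the next checkpoint
                }
                inFlight.add(value);
            }
        }
        return new Snapshot(sum, inFlight);
    }

    /** Restores the state, replays in-flight data first, then the source replay. */
    public static long recover(Snapshot snapshot, List<Integer> replayedBySources) {
        long sum = snapshot.sum;
        for (int value : snapshot.inFlight) {
            sum += value;
        }
        for (int value : replayedBySources) {
            sum += value;
        }
        return sum;
    }

    /** Runs the example and returns the sum after recovery. */
    public static long demo() {
        long stateAtBarrier = 1 + 2; // A1=1 and A2=2 were processed before the barrier
        List<List<Integer>> pending = Arrays.asList(
            Arrays.asList(BARRIER, 3, 4),         // Input1: barrier, then A3=3, A4=4
            Arrays.asList(10, 20, BARRIER, 30));  // Input2: B1=10, B2=20, barrier, B3=30
        Snapshot s = snapshot(stateAtBarrier, pending);
        // After recovery the sources replay everything behind their barrier.
        return recover(s, Arrays.asList(3, 4, 30));
    }

    public static void main(String[] args) {
        long failureFree = 1 + 2 + 3 + 4 + 10 + 20 + 30;
        System.out.println("failure-free sum: " + failureFree);
        System.out.println("recovered sum:    " + demo());
    }
}
```

Both sums come out as 70: B1 and B2 are not in the operator state at snapshot time, but they are in the persisted input buffer, so replaying them once restores their effect without any duplicate.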

2.4 Core Implementation

// Simplified pseudocode modeled on
// org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.
// It is not the verbatim Flink source: handleUnalignedBarrier, handleAlignedBarrier
// and captureInflightData are illustrative names.

public class SingleCheckpointBarrierHandler extends CheckpointBarrierHandler {
    
    @Override
    public void processBarrier(CheckpointBarrier barrier, InputChannelInfo channelInfo) {
        
        if (barrier.getCheckpointOptions().isUnalignedCheckpoint()) {
            // 🔥 Unaligned mode
            handleUnalignedBarrier(barrier, channelInfo);
        } else {
            // Traditional aligned mode
            handleAlignedBarrier(barrier, channelInfo);
        }
    }
    
    private void handleUnalignedBarrier(CheckpointBarrier barrier, InputChannelInfo channel) {
        
        // 1. Forward the barrier immediately (no waiting for alignment)
        output.emitBarrier(barrier);
        
        // 2. Mark the in-flight data that has to be persisted
        for (InputGate inputGate : inputGates) {
            // Capture the records sitting in the input buffers
            channelStateWriter.addInputData(
                barrier.getId(),
                inputGate.getChannelInfo(),
                inputGate.captureInflightData()  // 📦 capture in-flight data
            );
        }
        
        // 3. Capture the records sitting in the output buffers
        for (ResultPartition partition : resultPartitions) {
            channelStateWriter.addOutputData(
                barrier.getId(),
                partition.captureInflightData()
            );
        }
        
        // 4. Trigger the operator state snapshot
        triggerCheckpoint(barrier);
        
        // ✅ Done! No blocking; keep processing records
    }
}

2.5 Persisting Channel State

// org.apache.flink.runtime.checkpoint.channel.ChannelStateWriter
// (simplified: the real methods also take a start sequence number)

public interface ChannelStateWriter {
    
    // Persist the in-flight data of an input channel
    void addInputData(
        long checkpointId,
        InputChannelInfo info,
        Buffer... buffers  // buffered data
    );
    
    // Persist the in-flight data of an output channel
    void addOutputData(
        long checkpointId,
        ResultSubpartitionInfo info,
        Buffer... buffers
    );
}

// Channel state is persisted to the distributed file system
// Illustrative layout: checkpoint_dir/chk-123/channel_state/input/gate-0-channel-1

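3. Comparing the Two Modes

3.1 Feature Comparison

| Feature             | Aligned Checkpoint                  | Unaligned Checkpoint                     |
|---------------------|-------------------------------------|------------------------------------------|
| Barrier alignment   | ✅ Required                         | ❌ Not required                          |
| Blocking            | ✅ Slow inputs block fast inputs    | ❌ No blocking                           |
| Checkpoint latency  | Heavily affected by backpressure    | Largely independent of backpressure      |
| In-flight data      | Not persisted                       | ✅ Persisted                             |
| Checkpoint size     | Smaller (state only)                | Larger (state + in-flight data)          |
| I/O pressure        | Lower                               | Higher                                   |
| Recovery time       | Faster                              | Slower (in-flight data must be replayed) |
| Best suited for     | Normal load / low backpressure      | Heavy backpressure                       |
| Savepoint support   | ✅ Supported                        | ❌ Savepoints are always aligned         |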

3.2 Performance Comparison (illustrative, relative units)

Aligned Checkpoint (under backpressure):
────────────────────────────────────────────
Checkpoint Duration: ████████████████████  (20)
├── Barrier propagation: ███████████████  (15) ← bottleneck!
└── State snapshot:      █████  (5)


Unaligned Checkpoint (under backpressure):
────────────────────────────────────────────
Checkpoint Duration: ███████  (7)
├── Barrier propagation: ██  (2)  ← fast!
├── State snapshot:      ███  (3)
└── Channel snapshot:    ██  (2)

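4. Hybrid Mode: Aligned Timeout

Flink offers a combined strategy: start with an aligned checkpoint and switch to unaligned once a timeout expires

4.1 Configuration

import java.time.Duration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Enable unaligned checkpoints
env.getCheckpointConfig().enableUnalignedCheckpoints();

// Set the alignment timeout (for example 30 seconds)
env.getCheckpointConfig().setAlignedCheckpointTimeout(Duration.ofSeconds(30));

Or in the configuration file:

execution.checkpointing.unaligned: true
execution.checkpointing.aligned-checkpoint-timeout: 30s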

4.2 Workflow

┌─────────────────────────────────────────────────────────────┐
│ Checkpoint starts: try aligned mode first                   │
├─────────────────────────────────────────────────────────────┤
Timer starts: 30-second countdown
  ↓
Normal case: alignment completes within 30 seconds
  └─→ Aligned checkpoint is taken ✅
  
Backpressure case: alignment does not complete within 30 seconds
  └─→ Timeout! Automatic switch to unaligned mode 🔄
      └─→ Proceed immediately, no more waiting
└─────────────────────────────────────────────────────────────┘
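
The decision in the box above can be written as a small function. This is a simplified model that follows the description in this section, not Flink's implementation (whose timing rules are more involved); the class, enum and parameter names are hypothetical. A timeout of 0 is treated as "unaligned from the start".

```java
public class AlignedTimeoutDemo {

    public enum Mode { ALIGNED, UNALIGNED }

    /** Decides how the checkpoint completes, given barrier arrival times in ms. */
    public static Mode decide(long firstBarrierMillis, long lastBarrierMillis, long timeoutMillis) {
        if (timeoutMillis == 0) {
            return Mode.UNALIGNED; // no alignment phase at all
        }
        long alignment = lastBarrierMillis - firstBarrierMillis;
        return alignment <= timeoutMillis ? Mode.ALIGNED : Mode.UNALIGNED;
    }

    /** Time at which the operator stops waiting: last barrier or timeout, whichever is first. */
    public static long proceedsAtMillis(long firstBarrierMillis, long lastBarrierMillis, long timeoutMillis) {
        return Math.min(lastBarrierMillis, firstBarrierMillis + timeoutMillis);
    }

    public static void main(String[] args) {
        long timeout = 30_000L; // 30 s, as in the configuration above

        // Normal case: the last barrier arrives 5 s after the first one.
        System.out.println(decide(0L, 5_000L, timeout) + " at " + proceedsAtMillis(0L, 5_000L, timeout) + " ms");

        // Backpressure case: the last barrier would arrive only after 2 min.
        System.out.println(decide(0L, 120_000L, timeout) + " at " + proceedsAtMillis(0L, 120_000L, timeout) + " ms");
    }
}
```

In the backpressure case the operator waits at most the 30-second timeout instead of the full two minutes, at the price of persisting the in-flight data it has at that point.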

4.3 How It Is Implemented

// Simplified pseudocode modeled on
// org.apache.flink.streaming.runtime.io.checkpointing.CheckpointBarrierHandler.
// It is not the verbatim Flink source; the helper method names are illustrative.

@Override
public void processBarrier(CheckpointBarrier barrier, InputChannelInfo channel) {
    
    if (isAlignedCheckpoint(barrier)) {
        // Start aligning
        startAlignment(barrier);
        
        // Start the timeout timer
        scheduledExecutor.schedule(() -> {
            if (isStillAligning(barrier.getId())) {
                LOG.info("Aligned checkpoint timeout, switching to unaligned");
                
                // 🔄 Switch to unaligned mode
                switchToUnaligned(barrier);
            }
        }, alignedCheckpointTimeout, TimeUnit.MILLISECONDS);
    }
}

private void switchToUnaligned(CheckpointBarrier barrier) {
    // 1. Stop waiting for alignment
    stopAlignment();
    
    // 2. Persist everything currently sitting in the buffers
    persistAllBufferedData(barrier.getId());
    
    // 3. Continue as an unaligned checkpoint
    continueAsUnaligned(barrier);
}

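5. When to Use Which Mode?

5.1 Use Aligned Checkpoints

✅ Good fit:
├── The job runs steadily with no noticeable backpressure
├── Checkpoints complete in a reasonable time (seconds)
├── Checkpoints should stay as small as possible
└── Storage I/O is the bottleneck

❌ Poor fit:
└── Severe backpressure, with alignment taking minutes or longer

5.2 Use Unaligned Checkpoints

✅ Good fit:
├── The job suffers from severe backpressure
├── Checkpoint alignment takes very long (minutes to hours)
├── Checkpoint latency matters
├── There is enough I/O bandwidth
└── Data skew makes some partitions very slow

❌ Poor fit:
├── Storage I/O is already the bottleneck
├── The volume of in-flight data is very large (checkpoints become huge)
└── Savepoints are taken frequently (savepoints are always aligned)

5.3 Use Hybrid Mode (Recommended)

✅ Best practice:
env.getCheckpointConfig().enableUnalignedCheckpoints();
env.getCheckpointConfig().setAlignedCheckpointTimeout(Duration.ofSeconds(30));

Advantages:
├── Aligned under normal conditions (small checkpoints)
├── Automatic switch to unaligned under backpressure (no blocking)
└── Combines the strengths of both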

6. Configuration Recommendations

6.1 Enabling Unaligned Checkpoints

// Java API
import java.time.Duration;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// 1. Enable checkpointing
env.enableCheckpointing(60000);  // every 60 seconds

// 2. Enable unaligned checkpoints
env.getCheckpointConfig().enableUnalignedCheckpoints();

// 3. Set the alignment timeout (hybrid mode)
env.getCheckpointConfig().setAlignedCheckpointTimeout(Duration.ofSeconds(30));

// 4. Other settings
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5000);  // 5 seconds
env.getCheckpointConfig().setCheckpointTimeout(600000);  // 10-minute timeout

6.2 Using the Configuration File

# flink-conf.yaml

# Checkpoint interval
execution.checkpointing.interval: 60s

# Enable unaligned checkpoints
execution.checkpointing.unaligned: true

# Alignment timeout (0 means unaligned from the start)
execution.checkpointing.aligned-checkpoint-timeout: 30s

# Mode
execution.checkpointing.mode: EXACTLY_ONCE

# Checkpoint timeout
execution.checkpointing.timeout: 10min

# Minimum pause between checkpoints
execution.checkpointing.min-pause: 5s

7. Where to Find the Core Source Code

// Checkpoint options
org.apache.flink.runtime.checkpoint.CheckpointOptions
└── enum AlignmentType {
    AT_LEAST_ONCE,     // exactly-once not required
    ALIGNED,           // aligned mode
    UNALIGNED,         // unaligned mode
    FORCED_ALIGNED     // forced alignment (where unaligned is not supported)
}

// Barrier handler (method names as simplified in section 2.4)
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler
├── handleAlignedBarrier()    // aligned handling
└── handleUnalignedBarrier()  // unaligned handling

// Writing channel state
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriter
├── addInputData()   // persist input buffers
└── addOutputData()  // persist output buffers

// Reading channel state (on recovery); class and method names vary by Flink version
org.apache.flink.runtime.checkpoint.channel.ChannelStateReader
├── readInputData()
└── readOutputData()

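8. Summary

8.1 Key Takeaways

  1. Aligned checkpoints require alignment: this is the traditional way to guarantee exactly-once, but it performs poorly under backpressure
  2. Unaligned checkpoints need no alignment: by persisting in-flight data they give the same exactly-once guarantee while avoiding blocking
  3. Both are exactly-once: only the mechanism differs
    • Aligned: align first, then snapshot state only
    • Unaligned: no alignment, snapshot state + in-flight data
  4. Hybrid mode is the best default: try to align first, switch to unaligned on timeout
  5. Savepoints are always aligned: savepoints need to be portable and stable

8.2 Evolution

Flink 1.0 - 1.10:  Aligned checkpoints only
                   └── Checkpoints are slow under backpressure

Flink 1.11+:       Unaligned checkpoints introduced
                   └── Addresses the backpressure problem!

Flink 1.12+:       Hybrid mode (aligned timeout) added and refined
                   └── Chooses the mode automatically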