Flink checkpoint storage strategy: source code analysis

Article link: https://blog.youkuaiyun.com/yuchuanchen/article/details/106668994

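This article takes a close look at Flink's checkpoint storage strategy: what DefaultOperatorStateBackend, HeapKeyedStateBackend and RocksDBKeyedStateBackend do in the synchronous phase, followed by the asynchronous phase, the generation of the checkpoint meta file, and a test that verifies the result. The focus is on how the different StateBackends handle operator state and keyed state, and in particular on how the full and incremental snapshot strategies of RocksDBStateBackend differ.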


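In the article on the Flink checkpoint flow we walked through the checkpoint process in detail. In the article on how Flink stores state data we covered the categories of state, the three state backends and how to use them, and briefly described, at a logical level, how a state backend saves state.
Building on those two articles, this one describes the storage strategy for state in detail.
It is based on flink-1.10.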

The official documentation briefly describes the checkpoint directory structure, which was introduced by FLINK-8531.

/user-defined-checkpoint-dir
    /{job-id}
        |
        + --shared/
        + --taskowned/
        + --chk-1/
        + --chk-2/
        + --chk-3/
        ...

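user-defined-checkpoint-dir is set by state.checkpoints.dir. The shared directory holds files that are shared by multiple checkpoints; at present RocksDBStateBackend uses it to store RocksDB files when incremental checkpoints are enabled. Files under taskowned belong to the TaskManagers only and must never be deleted by the JobManager. Files under chk-xxx belong to exactly one checkpoint.
Note: this directory structure may still change in the future.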
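
To make the layout concrete, here is a minimal sketch that assembles the directories described above with plain java.nio. The class CheckpointLayout and its methods are hypothetical names chosen for illustration and are not Flink APIs; only the directory names (shared, taskowned, chk-N) come from the listing above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Flink: mirrors the directory listing above.
class CheckpointLayout {
    private final Path jobDir;

    CheckpointLayout(String checkpointsDir, String jobId) {
        // corresponds to /user-defined-checkpoint-dir/{job-id}
        this.jobDir = Paths.get(checkpointsDir, jobId);
    }

    // files shared by several checkpoints
    Path sharedDir() {
        return jobDir.resolve("shared");
    }

    // files owned by the TaskManagers
    Path taskOwnedDir() {
        return jobDir.resolve("taskowned");
    }

    // files that belong to exactly one checkpoint
    Path exclusiveDir(long checkpointId) {
        return jobDir.resolve("chk-" + checkpointId);
    }

    public static void main(String[] args) {
        CheckpointLayout layout = new CheckpointLayout("/user-defined-checkpoint-dir", "job-id");
        System.out.println(layout.sharedDir());
        System.out.println(layout.taskOwnedDir());
        System.out.println(layout.exclusiveDir(1));
    }
}
```

The sketch only computes paths; it does not create directories or talk to a file system.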

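Below we describe the checkpoint storage strategy in three parts: the synchronous phase, the asynchronous phase, and the generation of the meta file. A test at the end verifies the result.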

1. Synchronous phase

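By default, only RocksDBKeyedStateBackend performs I/O in the synchronous phase. The synchronous phase produces OperatorSnapshotFutures, and these futures are processed in the asynchronous phase. For MemoryStateBackend and FsStateBackend, the whole checkpoint can be configured to run synchronously.
As mentioned in the checkpoint flow article, when CheckpointingOperation runs executeCheckpointing(), it calls checkpointStreamOperator() for every current operator: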

// StreamTask.java
private void checkpointStreamOperator(StreamOperator<?> op) throws Exception {
	if (null != op) {
		OperatorSnapshotFutures snapshotInProgress = op.snapshotState(
			checkpointMetaData.getCheckpointId(),
			checkpointMetaData.getTimestamp(),
			checkpointOptions,
			storageLocation);
		operatorSnapshotsInProgress.put(op.getOperatorID(), snapshotInProgress);
	}
}

The snapshotState() method of StreamOperator is ultimately given a final implementation by AbstractStreamOperator, the abstract class that implements it:

// AbstractStreamOperator.java
public final OperatorSnapshotFutures snapshotState(long checkpointId, long timestamp, CheckpointOptions checkpointOptions,
		CheckpointStreamFactory factory) throws Exception {

	KeyGroupRange keyGroupRange = null != keyedStateBackend ?
			keyedStateBackend.getKeyGroupRange() : KeyGroupRange.EMPTY_KEY_GROUP_RANGE;

	OperatorSnapshotFutures snapshotInProgress = new OperatorSnapshotFutures();

	StateSnapshotContextSynchronousImpl snapshotContext = new StateSnapshotContextSynchronousImpl(
		checkpointId,
		timestamp,
		factory,
		keyGroupRange,
		getContainingTask().getCancelables());

	try {
		// subclasses each provide their own snapshotState(StateSnapshotContext context)
		snapshotState(snapshotContext);

		snapshotInProgress.setKeyedStateRawFuture(snapshotContext.getKeyedStateStreamFuture());
		snapshotInProgress.setOperatorStateRawFuture(snapshotContext.getOperatorStateStreamFuture());

		if (null != operatorStateBackend) {
			snapshotInProgress.setOperatorStateManagedFuture(
				operatorStateBackend.snapshot(checkpointId, timestamp, factory, checkpointOptions));
		}

		if (null != keyedStateBackend) {
			snapshotInProgress.setKeyedStateManagedFuture(
				keyedStateBackend.snapshot(checkpointId, timestamp, factory, checkpointOptions));
		}
	}
...
	return snapshotInProgress;
}

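This step produces the OperatorSnapshotFutures: four state futures, one for each combination of managed/raw and keyed/operator.
We focus on the managed operator state future and the managed keyed state future, which are produced by operatorStateBackend.snapshot() and keyedStateBackend.snapshot() respectively. operatorStateBackend and keyedStateBackend have different implementation classes depending on the state backend; readers may remember the figure below:
[Figure: flink-state]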
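
The split between the two phases can be illustrated with a small sketch in plain Java. It is not Flink code: FourFutures is a hypothetical stand-in for OperatorSnapshotFutures, and the strings returned by the futures stand in for state handles. What it shows is only the pattern described above, namely that the synchronous phase creates four RunnableFuture objects without running them, and a later asynchronous phase runs them and collects the results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.RunnableFuture;

class SnapshotFuturesSketch {

    // Hypothetical stand-in for OperatorSnapshotFutures: one future per kind of state.
    static class FourFutures {
        RunnableFuture<String> keyedManaged;
        RunnableFuture<String> keyedRaw;
        RunnableFuture<String> operatorManaged;
        RunnableFuture<String> operatorRaw;
    }

    // Synchronous phase: the futures are only created, none of them is executed yet.
    static FourFutures syncPhase(long checkpointId) {
        FourFutures f = new FourFutures();
        f.keyedManaged = new FutureTask<>(() -> "keyed-managed@" + checkpointId);
        f.keyedRaw = new FutureTask<>(() -> "keyed-raw@" + checkpointId);
        f.operatorManaged = new FutureTask<>(() -> "operator-managed@" + checkpointId);
        f.operatorRaw = new FutureTask<>(() -> "operator-raw@" + checkpointId);
        return f;
    }

    // Asynchronous phase: run every future and collect what it produced.
    static List<String> asyncPhase(FourFutures f) {
        List<String> results = new ArrayList<>();
        for (RunnableFuture<String> future : Arrays.asList(
                f.keyedManaged, f.keyedRaw, f.operatorManaged, f.operatorRaw)) {
            future.run(); // the deferred work happens here
            try {
                results.add(future.get()); // does not block, run() has already completed
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException("snapshot future failed", e);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        FourFutures futures = syncPhase(1L);
        System.out.println("done after sync phase: " + futures.keyedManaged.isDone());
        System.out.println(asyncPhase(futures));
    }
}
```

In this sketch the asynchronous phase runs on the calling thread for simplicity; handing the same futures to an executor would not change the structure.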

1.1 DefaultOperatorStateBackend#snapshot()

For operator state, the data is held entirely in memory regardless of which StateBackend is selected.

// DefaultOperatorStateBackend.java
RunnableFuture<SnapshotResult<OperatorStateHandle>> snapshotRunner =
			snapshotStrategy.snapshot(checkpointId, timestamp, streamFactory, checkpointOptions);

snapshotStrategy is in fact a DefaultOperatorStateBackendSnapshotStrategy, so let us look at the snapshot() method of that class:
1. First, it deep-copies the registered operator states (which can only be of ListState type) and broadcast states:

// DefaultOperatorStateBackendSnapshotStrategy.java
final Map<String, PartitionableListState<?>> registeredOperatorStatesDeepCopies =
	new HashMap<>(registeredOperatorStates.size());
final Map<String, BackendWritableBroadcastState<?, ?>> registeredBroadcastStatesDeepCopies =
	new HashMap<>(registeredBroadcastStates.size());
...
if (!registeredOperatorStates.isEmpty()) {
	for (Map.Entry<String, PartitionableListState<?>> entry : registeredOperatorStates.entrySet()) {
		PartitionableListState<?> listState = entry.getValue();
		if (null != listState) {
			listState = listState.deepCopy();
		}
		registeredOperatorStatesDeepCopies.put(entry.getKey(), listState);
	}
}

if (!registeredBroadcastStates.isEmpty()) {
	for (Map.Entry<String, BackendWritableBroadcastState<?, ?>> entry : registeredBroadcastStates.entrySet()) {
		BackendWritableBroadcastState<?, ?> broadcastState = entry.getValue();
		if (null != broadcastState) {
			broadcastState = broadcastState.deepCopy();
		}
		registeredBroadcastStatesDeepCopies.put(entry.getKey(), broadcastState);
	}
}
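
Why the copy matters can be seen in a simplified sketch. It assumes immutable String elements, so copying each list is enough here, and the class DeepCopySketch with its deepCopy helper is hypothetical and only loosely mirrors the loop above. Once the copy exists, later writes to the live state no longer affect what a deferred snapshot will see.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DeepCopySketch {

    // Copies every registered list so the snapshot shares no mutable list with the live state.
    // Assumption: elements are immutable Strings, so they do not need to be copied themselves.
    static Map<String, List<String>> deepCopy(Map<String, List<String>> registered) {
        Map<String, List<String>> copies = new HashMap<>(registered.size());
        for (Map.Entry<String, List<String>> entry : registered.entrySet()) {
            List<String> list = entry.getValue();
            copies.put(entry.getKey(), list == null ? null : new ArrayList<>(list));
        }
        return copies;
    }

    public static void main(String[] args) {
        Map<String, List<String>> live = new HashMap<>();
        live.put("offsets", new ArrayList<>(Arrays.asList("a", "b")));

        // taken while processing is paused, as in the synchronous phase
        Map<String, List<String>> snapshot = deepCopy(live);

        // processing resumes and mutates the live state
        live.get("offsets").add("c");

        // the copy still holds the elements from the time of the snapshot
        System.out.println("snapshot: " + snapshot.get("offsets"));
        System.out.println("live:     " + live.get("offsets"));
    }
}
```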

2. It then creates an AsyncSnapshotCallable object:

// DefaultOperatorStateBackendSnapshotStrategy.java
AsyncSnapshotCallable<SnapshotResult<OperatorStateHandle>> snapshotCallable =
	new AsyncSnapshotCallable<SnapshotResult<OperatorStateHandle>>() {

		@Override
		protected SnapshotResult<OperatorStateHandle> callInternal() throws Exception {

			// create the checkpoint output stream; EXCLUSIVE means the state is written to the chk-xxx directory
			CheckpointStreamFactory.CheckpointStateOutputStream localOut =
				streamFactory.createCheckpointStateOutputStream(CheckpointedStateScope.EXCLUSIVE);
			snapshotCloseableRegistry.registerCloseable(localOut);

			// get the registered operator state infos ...
			List<StateMetaInfoSnapshot> operatorMetaInfoSnapshots =
				new ArrayList<>(registeredOperatorStatesDeepCopies.size());

			for (Map.Entry<String, PartitionableListState<?>> entry :
				registeredOperatorStatesDeepCopies.entrySet()) {
				operatorMetaInfoSnapshots.add(entry.getValue().getStateMetaInfo().snapshot());
			}

			// ... get the registered broadcast operator state infos ...
			List<StateMetaInfoSnapshot> broadcastMetaInfoSnapshots =
				new ArrayList<>(registeredBroadcastStatesDeepCopies.size());

			for (Map.Entry<String, BackendWritableBroadcastState<?, ?>> entry :
				registeredBroadcastStatesDeepCopies.entrySet()) {
				broadcastMetaInfoSnapshots.add(entry.getValue().getStateMetaInfo().snapshot());
			}

			// ... write them all in the checkpoint stream: all meta info
			// (state name, state type, state distribution mode, state serializer, snapshot configuration, etc.)
			DataOutputView dov = new DataOutputViewStreamWrapper(localOut);

			OperatorBackendSerializationProxy backendSerializationProxy =
				new OperatorBackendSerializationProxy(operatorMetaInfoSnapshots, broadcastMetaInfoSnapshots);

			// write the meta info
			backendSerializationProxy.write(dov);

			// ... and then go for the states ...

			// we put BOTH normal and broadcast state metadata here
			int initialMapCapacity =
				registeredOperatorStatesDeepCopies.size() + registeredBroadcastStatesDeepCopies.size();
			// writtenStatesMetaData holds the StateMetaInfo of each state;
			// here the StateMetaInfo mainly stores the state's offset information
			final Map<String, OperatorStateHandle.StateMetaInfo> writtenStatesMetaData =
				new HashMap<>(initialMapCapacity);

			for (Map.Entry<String, PartitionableListState<?>> entry :
				registeredOperatorStatesDeepCopies.entrySet()) {

				// each operator ListState is serialized into the checkpoint stream; the offset of every
				// list element in the stream is recorded and saved in the StateMetaInfo
				PartitionableListState<?> value = entry.getValue();
				long[] partitionOffsets = value.write(localOut);
				OperatorStateHandle.Mode mode = value.getStateMetaInfo().getAssignmentMode();
				writtenStatesMetaData.put(
					entry.getKey(),
					new OperatorStateHandle.StateMetaInfo(partitionOffsets, mode));
			}

			// ... and the broadcast states themselves ...
			for (Map.Entry<String, BackendWritableBroadcastState<?, ?>> entry :
				registeredBroadcastStatesDeepCopies.entrySet()) {

				// each broadcast state (held in memory as a Map) is serialized into the checkpoint stream;
				// the offset of the Map in the stream is recorded and saved in the StateMetaInfo
				BackendWritableBroadcastState<?, ?> value = entry.getValue();
				long[] partitionOffsets = {value.write(localOut)};
				OperatorStateHandle.Mode mode = value.