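Flink checkpoint storage strategy: a source code analysis

This article takes a close look at checkpoint storage strategies in Flink: what DefaultOperatorStateBackend, HeapKeyedStateBackend and RocksDBKeyedStateBackend do in the synchronous phase, followed by the asynchronous phase, the generation of the checkpoint meta file, and verification by testing. The focus is on how the different StateBackends handle Operator State and Keyed State, and in particular on how the full and incremental snapshot strategies of RocksDBStateBackend differ.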


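The article "Flink checkpoint process analysis" described the checkpoint process in detail. The article "How Flink stores state data" covered the categories of state, the three state backends and how to use them, and gave a brief, high-level account of how a state backend saves state.
Building on those two articles, this one describes the state storage strategy in detail.
It is based on flink-1.10.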



The official documentation briefly describes the checkpoint directory structure, which was introduced by FLINK-8531.

/user-defined-checkpoint-dir
    /{job-id}
        |
        + --shared/
        + --taskowned/
        + --chk-1/
        + --chk-2/
        + --chk-3/
        ...

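user-defined-checkpoint-dir is set through the state.checkpoints.dir option. The shared directory holds files that are shared by several checkpoints; at present RocksDBStateBackend uses it to store the rocksdb files when incremental checkpoints are enabled. Files under taskowned belong only to the TaskManager and must never be deleted by the JobManager. Files under chk-xxx belong to exactly one checkpoint.
Note: this directory structure may still change in the future.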
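To make the layout concrete, here is a minimal sketch that derives the three kinds of directories from a base directory, a job id and a checkpoint id. The class `CheckpointLayout` and its method names are hypothetical and exist only for illustration; they are not part of Flink, and the paths in `main` are placeholders.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper (not a Flink class) mirroring the directory layout above. */
final class CheckpointLayout {
    private final Path jobRoot;

    CheckpointLayout(String checkpointsDir, String jobId) {
        // <state.checkpoints.dir>/<job-id>
        this.jobRoot = Paths.get(checkpointsDir, jobId);
    }

    /** Files shared by several checkpoints, e.g. incremental RocksDB files. */
    Path sharedDir() {
        return jobRoot.resolve("shared");
    }

    /** Files owned by the TaskManager. */
    Path taskOwnedDir() {
        return jobRoot.resolve("taskowned");
    }

    /** Files that belong to exactly one checkpoint. */
    Path exclusiveDir(long checkpointId) {
        return jobRoot.resolve("chk-" + checkpointId);
    }

    public static void main(String[] args) {
        // Placeholder values, chosen only for the demonstration.
        CheckpointLayout layout = new CheckpointLayout("/flink/checkpoints", "example-job-id");
        System.out.println(layout.sharedDir());
        System.out.println(layout.taskOwnedDir());
        System.out.println(layout.exclusiveDir(1));
    }
}
```

The sketch only computes paths; which files end up in which directory is decided by the state backend, as the following sections describe.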

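Below we describe the checkpoint storage strategy in three parts: the synchronous phase, the asynchronous phase, and the generation of the meta file. At the end we verify the results with a test.

1. Synchronous phase

By default, only RocksDBKeyedStateBackend performs I/O during the synchronous phase. The synchronous phase produces OperatorSnapshotFutures, and these futures are processed in the asynchronous phase. For MemoryStateBackend and FsStateBackend, the whole checkpoint can be configured to run synchronously.
As mentioned in "Flink checkpoint process analysis", when CheckpointingOperation runs executeCheckpointing(), it calls checkpointStreamOperator() for every operator of the task: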

// StreamTask.java
		private void checkpointStreamOperator(StreamOperator<?> op) throws Exception {
			if (null != op) {
				OperatorSnapshotFutures snapshotInProgress = op.snapshotState(
						checkpointMetaData.getCheckpointId(),
						checkpointMetaData.getTimestamp(),
						checkpointOptions,
						storageLocation);
				operatorSnapshotsInProgress.put(op.getOperatorID(), snapshotInProgress);
			}
		}

The snapshotState() method of StreamOperator ultimately gets a final implementation in its subclass AbstractStreamOperator:

// AbstractStreamOperator.java
	public final OperatorSnapshotFutures snapshotState(long checkpointId, long timestamp, CheckpointOptions checkpointOptions,
			CheckpointStreamFactory factory) throws Exception {

		KeyGroupRange keyGroupRange = null != keyedStateBackend ?
				keyedStateBackend.getKeyGroupRange() : KeyGroupRange.EMPTY_KEY_GROUP_RANGE;

		OperatorSnapshotFutures snapshotInProgress = new OperatorSnapshotFutures();

		StateSnapshotContextSynchronousImpl snapshotContext = new StateSnapshotContextSynchronousImpl(
			checkpointId,
			timestamp,
			factory,
			keyGroupRange,
			getContainingTask().getCancelables());

		try {
			// each subclass provides its own snapshotState(StateSnapshotContext context)
			snapshotState(snapshotContext);

			snapshotInProgress.setKeyedStateRawFuture(snapshotContext.getKeyedStateStreamFuture());
			snapshotInProgress.setOperatorStateRawFuture(snapshotContext.getOperatorStateStreamFuture());

			if (null != operatorStateBackend) {
				snapshotInProgress.setOperatorStateManagedFuture(
					operatorStateBackend.snapshot(checkpointId, timestamp, factory, checkpointOptions));
			}

			if (null != keyedStateBackend) {
				snapshotInProgress.setKeyedStateManagedFuture(
					keyedStateBackend.snapshot(checkpointId, timestamp, factory, checkpointOptions));
			}
		}
...
		return snapshotInProgress;
	}

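This step obtains the OperatorSnapshotFutures: four state futures, one for each combination of managed/raw and keyed/operator.
We focus on the Operator State managed future and the Keyed State managed future, which are produced by operatorStateBackend.snapshot() and keyedStateBackend.snapshot() respectively. Different state backends use different implementation classes for operatorStateBackend and keyedStateBackend; readers may remember the following figure:
[Figure: flink-state]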
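The split between creating the futures and running them can be illustrated with plain JDK classes. The sketch below is not Flink code: `SnapshotFuturesSketch` is a hypothetical stand-in for OperatorSnapshotFutures, and state handles are represented by plain strings. It only shows that the synchronous phase builds four `RunnableFuture` objects, and that nothing is executed until a later phase runs them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;
import java.util.concurrent.RunnableFuture;

/** Hypothetical stand-in for OperatorSnapshotFutures; handles are plain Strings. */
final class SnapshotFuturesSketch {
    final RunnableFuture<String> keyedManaged;
    final RunnableFuture<String> keyedRaw;
    final RunnableFuture<String> operatorManaged;
    final RunnableFuture<String> operatorRaw;

    SnapshotFuturesSketch(long checkpointId) {
        // "Synchronous phase": the futures are only created here, not run.
        keyedManaged = new FutureTask<>(() -> "keyed-managed@" + checkpointId);
        keyedRaw = new FutureTask<>(() -> "keyed-raw@" + checkpointId);
        operatorManaged = new FutureTask<>(() -> "operator-managed@" + checkpointId);
        operatorRaw = new FutureTask<>(() -> "operator-raw@" + checkpointId);
    }

    /** "Asynchronous phase": run every future on another thread and collect the results. */
    List<String> runAsync() {
        List<RunnableFuture<String>> all =
            Arrays.asList(keyedManaged, keyedRaw, operatorManaged, operatorRaw);
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            for (RunnableFuture<String> future : all) {
                pool.execute(future);
            }
            List<String> handles = new ArrayList<>();
            for (RunnableFuture<String> future : all) {
                handles.add(future.get());
            }
            return handles;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("snapshot failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        SnapshotFuturesSketch futures = new SnapshotFuturesSketch(1L);
        System.out.println("done before async phase: " + futures.keyedManaged.isDone());
        System.out.println(futures.runAsync());
    }
}
```

Returning a `RunnableFuture` rather than a result is what lets the task thread continue processing records while the state is written elsewhere.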

1.1 DefaultOperatorStateBackend#snapshot()

For operator state, the data is kept entirely in memory, regardless of which StateBackend is chosen.

// DefaultOperatorStateBackend.java
RunnableFuture<SnapshotResult<OperatorStateHandle>> snapshotRunner =
			snapshotStrategy.snapshot(checkpointId, timestamp, streamFactory, checkpointOptions);

snapshotStrategy is in fact a DefaultOperatorStateBackendSnapshotStrategy. Let us look at the snapshot() method of this class:
1. First, the registered OperatorStates (which can only be of type ListState) and BroadcastStates are deep-copied:

// DefaultOperatorStateBackendSnapshotStrategy.java
			final Map<String, PartitionableListState<?>> registeredOperatorStatesDeepCopies =
				new HashMap<>(registeredOperatorStates.size());
			final Map<String, BackendWritableBroadcastState<?, ?>> registeredBroadcastStatesDeepCopies =
				new HashMap<>(registeredBroadcastStates.size());
			...
			if (!registeredOperatorStates.isEmpty()) {
				for (Map.Entry<String, PartitionableListState<?>> entry : registeredOperatorStates.entrySet()) {
					PartitionableListState<?> listState = entry.getValue();
					if (null != listState) {
						listState = listState.deepCopy();
					}
					registeredOperatorStatesDeepCopies.put(entry.getKey(), listState);
				}
			}

			if (!registeredBroadcastStates.isEmpty()) {
				for (Map.Entry<String, BackendWritableBroadcastState<?, ?>> entry : registeredBroadcastStates.entrySet()) {
					BackendWritableBroadcastState<?, ?> broadcastState = entry.getValue();
					if (null != broadcastState) {
						broadcastState = broadcastState.deepCopy();
					}
					registeredBroadcastStatesDeepCopies.put(entry.getKey(), broadcastState);
				}
			}
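Why the copy happens before the asynchronous part can be seen in a small JDK-only sketch. `DeepCopySketch` is a hypothetical class, and operator states are modelled as a map of string lists; this is an assumption made for brevity. Strings are immutable, so copying each list is enough here, whereas real state with mutable elements would also need element-level copies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration: operator states modelled as name -> list of strings. */
final class DeepCopySketch {

    /** Copies the map and every list in it, so later updates to the live state stay invisible. */
    static Map<String, List<String>> deepCopy(Map<String, List<String>> registered) {
        Map<String, List<String>> copies = new HashMap<>(registered.size());
        for (Map.Entry<String, List<String>> entry : registered.entrySet()) {
            List<String> list = entry.getValue();
            // Keep null entries as they are, copy everything else.
            copies.put(entry.getKey(), list == null ? null : new ArrayList<>(list));
        }
        return copies;
    }

    public static void main(String[] args) {
        Map<String, List<String>> live = new HashMap<>();
        live.put("offsets", new ArrayList<>(Arrays.asList("p0:10", "p1:20")));

        // Taken in the synchronous phase ...
        Map<String, List<String>> snapshot = deepCopy(live);

        // ... so that records processed afterwards do not leak into the checkpoint.
        live.get("offsets").add("p2:30");

        System.out.println("live     = " + live);
        System.out.println("snapshot = " + snapshot);
    }
}
```

The copy is the price paid for writing the state asynchronously: it is made once in memory, and the slower write to the checkpoint stream then works on the frozen copy.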

2. Next, an AsyncSnapshotCallable object is created:

// DefaultOperatorStateBackendSnapshotStrategy.java
		AsyncSnapshotCallable<SnapshotResult<OperatorStateHandle>> snapshotCallable =
			new AsyncSnapshotCallable<SnapshotResult<OperatorStateHandle>>() {

				@Override
				protected SnapshotResult<OperatorStateHandle> callInternal() throws Exception {

					// create the checkpoint output stream; EXCLUSIVE means the state is written to the chk-xxx directory
					CheckpointStreamFactory.CheckpointStateOutputStream localOut =
						streamFactory.createCheckpointStateOutputStream(CheckpointedStateScope.EXCLUSIVE);
					snapshotCloseableRegistry.registerCloseable(localOut);

					// get the registered operator state infos ...
					List<StateMetaInfoSnapshot> operatorMetaInfoSnapshots =
						new ArrayList<>(registeredOperatorStatesDeepCopies.size());

					for (Map.Entry<String, PartitionableListState<?>> entry :
						registeredOperatorStatesDeepCopies.entrySet()) {
						operatorMetaInfoSnapshots.add(entry.getValue().getStateMetaInfo().snapshot());
					}

					// ... get the registered broadcast operator state infos ...
					List<StateMetaInfoSnapshot> broadcastMetaInfoSnapshots =
						new ArrayList<>(registeredBroadcastStatesDeepCopies.size());

					for (Map.Entry<String, BackendWritableBroadcastState<?, ?>> entry :
						registeredBroadcastStatesDeepCopies.entrySet()) {
						broadcastMetaInfoSnapshots.add(entry.getValue().getStateMetaInfo().snapshot());
					}

					// ... write them all in the checkpoint stream ...
					// (the meta info covers state name, state type, assignment mode, serializer, snapshot configuration, etc.)
					DataOutputView dov = new DataOutputViewStreamWrapper(localOut);

					OperatorBackendSerializationProxy backendSerializationProxy =
						new OperatorBackendSerializationProxy(operatorMetaInfoSnapshots, broadcastMetaInfoSnapshots);

					// write the meta info
					backendSerializationProxy.write(dov);

					// ... and then go for the states ...

					// we put BOTH normal and broadcast state metadata here
					int initialMapCapacity =
						registeredOperatorStatesDeepCopies.size() + registeredBroadcastStatesDeepCopies.size();
					// writtenStatesMetaData maps each state name to its StateMetaInfo,
					// which here mainly holds the offsets of the state in the stream
					final Map<String, OperatorStateHandle.StateMetaInfo> writtenStatesMetaData =
						new HashMap<>(initialMapCapacity);

					for (Map.Entry<String, PartitionableListState<?>> entry :
						registeredOperatorStatesDeepCopies.entrySet()) {

						// for each operator ListState: serialize it into the checkpoint stream, record the
						// offset of every list element in the stream, and store the offsets in StateMetaInfo
						PartitionableListState<?> value = entry.getValue();
						long[] partitionOffsets = value.write(localOut);
						OperatorStateHandle.Mode mode = value.getStateMetaInfo().getAssignmentMode();
						writtenStatesMetaData.put(
							entry.getKey(),
							new OperatorStateHandle.StateMetaInfo(partitionOffsets, mode));
					}

					// ... and the broadcast states themselves ...
					for (Map.Entry<String, BackendWritableBroadcastState<?, ?>> entry :
						registeredBroadcastStatesDeepCopies.entrySet()) {

						// for each broadcast state (held in memory as a Map): serialize it into the checkpoint
						// stream, record the offset of the Map in the stream, and store it in StateMetaInfo
						BackendWritableBroadcastState<?, ?> value = entry.getValue();
						long[] partitionOffsets = {value.write(localOut)};
						OperatorStateHandle.Mode mode = value.
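The offset bookkeeping in the listing above can be sketched with JDK streams alone. This is an illustration under stated assumptions, not Flink's implementation: `OffsetRecordingSketch` is a hypothetical class, elements are strings written with `writeUTF`, and a single `int` stands in for the meta info that precedes the state in the stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of recording per-element offsets while writing a list state. */
final class OffsetRecordingSketch {

    /** Writes each element and returns the stream position at which it starts. */
    static long[] write(List<String> elements, DataOutputStream out) throws IOException {
        long[] offsets = new long[elements.size()];
        for (int i = 0; i < elements.size(); i++) {
            // Bytes written so far = start offset of element i.
            offsets[i] = out.size();
            out.writeUTF(elements.get(i));
        }
        return offsets;
    }

    /** Writes a placeholder header first, then the elements, into an in-memory stream. */
    static long[] offsetsOf(List<String> elements) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            // Assumption: a 4-byte int stands in for the serialized meta info.
            out.writeInt(elements.size());
            return write(elements, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] offsets = offsetsOf(Arrays.asList("ab", "cde", "f"));
        System.out.println(Arrays.toString(offsets));
    }
}
```

Because every element has its own offset, a reader can seek directly to any element of a list state, which is what makes it possible to hand out individual elements to different subtasks on restore. A broadcast state, by contrast, records a single offset for the whole Map.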