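In the earlier article Flink Checkpoint Process Analysis we walked through the checkpoint process in detail. In How Flink Stores State Data we covered the categories of state, the three state backends and how to use them, and gave a brief, logical-level description of how a state backend persists state.
Building on those two articles, this one takes a detailed look at how state is stored.
This article is based on Flink 1.10.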
The official documentation briefly describes the checkpoint directory structure, which was introduced by FLINK-8531.
/user-defined-checkpoint-dir
    /{job-id}
        |
        + --shared/
        + --taskowned/
        + --chk-1/
        + --chk-2/
        + --chk-3/
        ...
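user-defined-checkpoint-dir is set through state.checkpoints.dir. The shared directory holds files that are shared by several checkpoints; at present it is used by RocksDBStateBackend to store the RocksDB files when incremental checkpointing is enabled. The taskowned directory holds files that belong only to the TaskManager and must never be deleted by the JobManager. The files in a chk-xxx directory belong to exactly one checkpoint.
Note: this directory structure may still change in later versions.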
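To make the layout concrete, here is a minimal sketch that builds the paths described above from a base directory, a job id, and a checkpoint id. CheckpointLayout and its methods are hypothetical helpers written for this article, not Flink classes, and the sample base path and job id are assumptions.

```java
// Hypothetical helper, not part of Flink: it only mirrors the directory
// layout described in the text.
final class CheckpointLayout {
    private final String jobDir;

    CheckpointLayout(String checkpointsDir, String jobId) {
        // Drop one trailing slash so the parts are joined with a single separator.
        String base = checkpointsDir.endsWith("/")
            ? checkpointsDir.substring(0, checkpointsDir.length() - 1)
            : checkpointsDir;
        this.jobDir = base + "/" + jobId;
    }

    // Files shared by several checkpoints (for example incremental RocksDB files).
    String sharedDir() {
        return jobDir + "/shared";
    }

    // Files owned by the TaskManager, never deleted by the JobManager.
    String taskOwnedDir() {
        return jobDir + "/taskowned";
    }

    // Files that belong to exactly one checkpoint.
    String exclusiveDir(long checkpointId) {
        return jobDir + "/chk-" + checkpointId;
    }

    public static void main(String[] args) {
        // The base path and job id are made-up sample values.
        CheckpointLayout layout = new CheckpointLayout("hdfs:///flink/checkpoints/", "job-a");
        System.out.println(layout.sharedDir());
        System.out.println(layout.taskOwnedDir());
        System.out.println(layout.exclusiveDir(3));
    }
}
```

The base directory is handled as a plain string because checkpoint directories are usually given as URIs such as hdfs:// or file:// paths.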
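Below we examine the checkpoint storage strategy in three parts: the synchronous phase, the asynchronous phase, and the generation of the meta file. At the end we verify the behavior with tests.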
1. Synchronous phase
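By default, only RocksDBKeyedStateBackend performs I/O during the synchronous phase. The synchronous phase produces OperatorSnapshotFutures, and these futures are processed in the asynchronous phase. For MemoryStateBackend and FsStateBackend, the whole checkpoint can be configured to run synchronously.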
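One way to picture the difference between the two modes is sketched below. It assumes that the write step is wrapped in a RunnableFuture and that, in synchronous mode, the future is simply run on the calling thread before it is returned. The class SnapshotModes, the method name, and the string result are illustrative stand-ins, not Flink code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;
import java.util.concurrent.RunnableFuture;

// Illustrative sketch only: a string stands in for the real state handle.
final class SnapshotModes {

    // Wraps the "write state" step in a future. In synchronous mode the
    // future has already run on the caller's thread when it is returned;
    // in asynchronous mode it is returned unexecuted.
    static RunnableFuture<String> snapshot(boolean asynchronous) {
        FutureTask<String> task = new FutureTask<>(() -> "state-handle");
        if (!asynchronous) {
            task.run();
        }
        return task;
    }

    public static void main(String[] args) throws Exception {
        RunnableFuture<String> sync = snapshot(false);
        System.out.println("sync done on return: " + sync.isDone());

        RunnableFuture<String> async = snapshot(true);
        System.out.println("async done on return: " + async.isDone());

        // The asynchronous phase later runs the pending future on another thread.
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.execute(async);
        System.out.println("async result: " + async.get());
        pool.shutdown();
    }
}
```

With this shape the caller always receives a future, so the asynchronous phase can treat both modes the same way: a future that is already done is read directly, and a pending one is run first.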
As mentioned in Flink Checkpoint Process Analysis, when CheckpointingOperation runs executeCheckpointing(), it calls checkpointStreamOperator() on every current operator:
// StreamTask.java
private void checkpointStreamOperator(StreamOperator<?> op) throws Exception {
    if (null != op) {
        OperatorSnapshotFutures snapshotInProgress = op.snapshotState(
            checkpointMetaData.getCheckpointId(),
            checkpointMetaData.getTimestamp(),
            checkpointOptions,
            storageLocation);
        operatorSnapshotsInProgress.put(op.getOperatorID(), snapshotInProgress);
    }
}
The snapshotState() method of StreamOperator is ultimately given a final implementation by AbstractStreamOperator, the abstract class that implements it:
// AbstractStreamOperator.java
public final OperatorSnapshotFutures snapshotState(long checkpointId, long timestamp, CheckpointOptions checkpointOptions,
        CheckpointStreamFactory factory) throws Exception {
    KeyGroupRange keyGroupRange = null != keyedStateBackend ?
        keyedStateBackend.getKeyGroupRange() : KeyGroupRange.EMPTY_KEY_GROUP_RANGE;
    OperatorSnapshotFutures snapshotInProgress = new OperatorSnapshotFutures();
    StateSnapshotContextSynchronousImpl snapshotContext = new StateSnapshotContextSynchronousImpl(
        checkpointId,
        timestamp,
        factory,
        keyGroupRange,
        getContainingTask().getCancelables());
    try {
        // Each subclass provides its own snapshotState(StateSnapshotContext context)
        snapshotState(snapshotContext);
        snapshotInProgress.setKeyedStateRawFuture(snapshotContext.getKeyedStateStreamFuture());
        snapshotInProgress.setOperatorStateRawFuture(snapshotContext.getOperatorStateStreamFuture());
        if (null != operatorStateBackend) {
            snapshotInProgress.setOperatorStateManagedFuture(
                operatorStateBackend.snapshot(checkpointId, timestamp, factory, checkpointOptions));
        }
        if (null != keyedStateBackend) {
            snapshotInProgress.setKeyedStateManagedFuture(
                keyedStateBackend.snapshot(checkpointId, timestamp, factory, checkpointOptions));
        }
    }
    ...
    return snapshotInProgress;
}
This step produces the OperatorSnapshotFutures: four state futures, one for each combination of managed/raw and keyed/operator.
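The shape of that result can be sketched with plain JDK futures. The class below is a simplified stand-in written for illustration, not the real OperatorSnapshotFutures: the field names are assumptions, strings replace state handles, and runAndCollect() is a hypothetical helper showing how a later phase could run whatever is still pending.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.RunnableFuture;

// Simplified stand-in: one future per (managed | raw) x (keyed | operator) combination.
final class SnapshotFuturesSketch {
    RunnableFuture<String> keyedManaged;
    RunnableFuture<String> keyedRaw;
    RunnableFuture<String> operatorManaged;
    RunnableFuture<String> operatorRaw;

    // Creates a future that has not run yet, as the synchronous phase would hand over.
    static RunnableFuture<String> pending(String handle) {
        return new FutureTask<>(() -> handle);
    }

    // Later phase: run every future that is set and not yet done, then collect the results.
    List<String> runAndCollect() {
        List<String> handles = new ArrayList<>();
        for (RunnableFuture<String> future
                : Arrays.asList(keyedManaged, keyedRaw, operatorManaged, operatorRaw)) {
            if (future == null) {
                continue; // this kind of state is not used by the operator
            }
            if (!future.isDone()) {
                future.run();
            }
            try {
                handles.add(future.get());
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException("snapshot future failed", e);
            }
        }
        return handles;
    }

    public static void main(String[] args) {
        SnapshotFuturesSketch futures = new SnapshotFuturesSketch();
        futures.operatorManaged = pending("operator-managed-handle");
        futures.keyedManaged = pending("keyed-managed-handle");
        System.out.println(futures.runAndCollect());
    }
}
```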
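We focus on the operator state managed future and the keyed state managed future, which are produced by operatorStateBackend.snapshot() and keyedStateBackend.snapshot() respectively. Depending on the state backend, operatorStateBackend and keyedStateBackend have different implementation classes; readers may remember the overview figure from the earlier article.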
1.1 DefaultOperatorStateBackend#snapshot()
For operator state, the data is held entirely in memory, no matter which StateBackend is chosen.
// DefaultOperatorStateBackend.java
RunnableFuture<SnapshotResult<OperatorStateHandle>> snapshotRunner =
snapshotStrategy.snapshot(checkpointId, timestamp, streamFactory, checkpointOptions);
snapshotStrategy is in fact a DefaultOperatorStateBackendSnapshotStrategy, so let us look at the snapshot() method of that class:
1. First, the registered operator states (which can only be of type ListState) and the broadcast states are deep-copied:
// DefaultOperatorStateBackendSnapshotStrategy.java
final Map<String, PartitionableListState<?>> registeredOperatorStatesDeepCopies =
    new HashMap<>(registeredOperatorStates.size());
final Map<String, BackendWritableBroadcastState<?, ?>> registeredBroadcastStatesDeepCopies =
    new HashMap<>(registeredBroadcastStates.size());
...
if (!registeredOperatorStates.isEmpty()) {
    for (Map.Entry<String, PartitionableListState<?>> entry : registeredOperatorStates.entrySet()) {
        PartitionableListState<?> listState = entry.getValue();
        if (null != listState) {
            listState = listState.deepCopy();
        }
        registeredOperatorStatesDeepCopies.put(entry.getKey(), listState);
    }
}
if (!registeredBroadcastStates.isEmpty()) {
    for (Map.Entry<String, BackendWritableBroadcastState<?, ?>> entry : registeredBroadcastStates.entrySet()) {
        BackendWritableBroadcastState<?, ?> broadcastState = entry.getValue();
        if (null != broadcastState) {
            broadcastState = broadcastState.deepCopy();
        }
        registeredBroadcastStatesDeepCopies.put(entry.getKey(), broadcastState);
    }
}
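Why copy before doing any I/O? A small stand-alone sketch makes the reason visible: once the copy exists, the operator can keep mutating its live state while the snapshot still sees the values from the moment of the copy. DeepCopySketch is a hypothetical class written for this article, and plain lists of strings stand in for PartitionableListState.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: plain lists of strings stand in for the registered operator states.
final class DeepCopySketch {

    // Copies every registered list so that later changes to the live state
    // cannot leak into the snapshot. Strings are immutable, so copying the
    // list is enough to get a deep copy here.
    static Map<String, List<String>> deepCopy(Map<String, List<String>> registered) {
        Map<String, List<String>> copies = new HashMap<>(registered.size());
        for (Map.Entry<String, List<String>> entry : registered.entrySet()) {
            List<String> state = entry.getValue();
            if (state != null) {
                state = new ArrayList<String>(state);
            }
            copies.put(entry.getKey(), state);
        }
        return copies;
    }

    public static void main(String[] args) {
        Map<String, List<String>> registered = new HashMap<>();
        registered.put("offsets", new ArrayList<String>(Arrays.asList("a", "b")));

        // Synchronous phase: take the copy.
        Map<String, List<String>> snapshotView = deepCopy(registered);

        // The operator keeps processing records and changes its live state.
        registered.get("offsets").add("c");

        // The snapshot view is unaffected by the later update.
        System.out.println(snapshotView); // prints {offsets=[a, b]}
    }
}
```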
2. Next, an AsyncSnapshotCallable object is created:
// DefaultOperatorStateBackendSnapshotStrategy.java
AsyncSnapshotCallable<SnapshotResult<OperatorStateHandle>> snapshotCallable =
    new AsyncSnapshotCallable<SnapshotResult<OperatorStateHandle>>() {
        @Override
        protected SnapshotResult<OperatorStateHandle> callInternal() throws Exception {
            // Create the checkpoint output stream; EXCLUSIVE means the state is written to the chk-xxx directory
            CheckpointStreamFactory.CheckpointStateOutputStream localOut =
                streamFactory.createCheckpointStateOutputStream(CheckpointedStateScope.EXCLUSIVE);
            snapshotCloseableRegistry.registerCloseable(localOut);
            // get the registered operator state infos ...
            List<StateMetaInfoSnapshot> operatorMetaInfoSnapshots =
                new ArrayList<>(registeredOperatorStatesDeepCopies.size());
            for (Map.Entry<String, PartitionableListState<?>> entry :
                    registeredOperatorStatesDeepCopies.entrySet()) {
                operatorMetaInfoSnapshots.add(entry.getValue().getStateMetaInfo().snapshot());
            }
            // ... get the registered broadcast operator state infos ...
            List<StateMetaInfoSnapshot> broadcastMetaInfoSnapshots =
                new ArrayList<>(registeredBroadcastStatesDeepCopies.size());
            for (Map.Entry<String, BackendWritableBroadcastState<?, ?>> entry :
                    registeredBroadcastStatesDeepCopies.entrySet()) {
                broadcastMetaInfoSnapshots.add(entry.getValue().getStateMetaInfo().snapshot());
            }
            // ... write them all in the checkpoint stream ...
            // (the metadata covers state name, state type, distribution mode, serializers, snapshot configuration, etc.)
            DataOutputView dov = new DataOutputViewStreamWrapper(localOut);
            OperatorBackendSerializationProxy backendSerializationProxy =
                new OperatorBackendSerializationProxy(operatorMetaInfoSnapshots, broadcastMetaInfoSnapshots);
            // write the metadata
            backendSerializationProxy.write(dov);
            // ... and then go for the states ...
            // we put BOTH normal and broadcast state metadata here
            int initialMapCapacity =
                registeredOperatorStatesDeepCopies.size() + registeredBroadcastStatesDeepCopies.size();
            // writtenStatesMetaData maps each state name to its StateMetaInfo, which mainly records the state's offsets in the stream
            final Map<String, OperatorStateHandle.StateMetaInfo> writtenStatesMetaData =
                new HashMap<>(initialMapCapacity);
            for (Map.Entry<String, PartitionableListState<?>> entry :
                    registeredOperatorStatesDeepCopies.entrySet()) {
                // Each operator ListState is serialized into the checkpoint stream; the offset of
                // every list element in the stream is recorded and stored in StateMetaInfo
                PartitionableListState<?> value = entry.getValue();
                long[] partitionOffsets = value.write(localOut);
                OperatorStateHandle.Mode mode = value.getStateMetaInfo().getAssignmentMode();
                writtenStatesMetaData.put(
                    entry.getKey(),
                    new OperatorStateHandle.StateMetaInfo(partitionOffsets, mode));
            }
            // ... and the broadcast states themselves ...
            for (Map.Entry<String, BackendWritableBroadcastState<?, ?>> entry :
                    registeredBroadcastStatesDeepCopies.entrySet()) {
                // Each broadcast state (held in memory as a Map) is serialized into the checkpoint stream;
                // the offset of the Map in the stream is recorded and stored in StateMetaInfo
                BackendWritableBroadcastState<?, ?> value = entry.getValue();
                long[] partitionOffsets = {value.write(localOut)};
OperatorStateHandle.Mode mode = value.