一文搞懂 Flink Stream Graph 构建过程
1. Flink Stream Graph构建过程
env.execute将进行任务的提交和执行,在执行之前会对任务进行StreamGraph和JobGraph的构建,然后再提交JobGraph。
那么现在就来分析一下StreamGraph的构建过程,在正式环境下,会调用StreamContextEnvironment.execute()方法:
public JobExecutionResult execute(String jobName) throws Exception {
Preconditions.checkNotNull(jobName, "Streaming Job name should not be null.");
// 获取 StreamGraph
return execute(getStreamGraph(jobName));
}
public StreamGraph getStreamGraph(String jobName, boolean clearTransformations) {
// 构建StreamGraph
StreamGraph streamGraph = getStreamGraphGenerator().setJobName(jobName).generate();
if (clearTransformations) {
// 构建完StreamGraph之后,可以清空transformations列表
this.transformations.clear();
}
return streamGraph;
}
实现在StreamGraphGenerator.generate(this, transformations);这里的transformations列表就是在之前调用map、flatMap、filter算子时添加进去的生成的DataStream的StreamTransformation,是DataStream的描述信息。
接着看:
public StreamGraph generate() {
// 实例化 StreamGraph
streamGraph = new StreamGraph(executionConfig, checkpointConfig, savepointRestoreSettings);
streamGraph.setStateBackend(stateBackend);
streamGraph.setChaining(chaining);
streamGraph.setScheduleMode(scheduleMode);
streamGraph.setUserArtifacts(userArtifacts);
streamGraph.setTimeCharacteristic(timeCharacteristic);
streamGraph.setJobName(jobName);
streamGraph.setGlobalDataExchangeMode(globalDataExchangeMode);
// 记录了已经transformed的StreamTransformation。
alreadyTransformed = new HashMap<>();
for (Transformation<?> transformation: transformations) {
// 根据 transformation 创建StreamGraph
transform(transformation);
}
final StreamGraph builtStreamGraph = streamGraph;
alreadyTransformed.clear();
alreadyTransformed = null;
streamGraph = null;
return builtStreamGraph;
}
1.1 transform(): 构建的核心
private Collection<Integer> transform(Transformation<?> transform) {
if (alreadyTransformed.containsKey(transform)) {
return alreadyTransformed.get(transform);
}
LOG.debug("Transforming " + transform);
if (transform.getMaxParallelism() <= 0) {
// if the max parallelism hasn't been set, then first use the job wide max parallelism
// from the ExecutionConfig.
int globalMaxParallelismFromConfig = executionConfig.getMaxParallelism();
if (globalMaxParallelismFromConfig > 0) {
transform.setMaxParallelism(globalMaxParallelismFromConfig);
}
}
// call at least once to trigger exceptions about MissingTypeInfo
transform.getOutputType();
Collection<Integer> transformedIds;
// 判断 transform 属于哪类 Transformation
if (transform instanceof OneInputTransformation<?, ?>) {
transformedIds = transformOneInputTransform((OneInputTransformation<?, ?>) transform);
} else if (transform instanceof TwoInputTransformation<?, ?, ?>) {
transformedIds = transformTwoInputTransform((TwoInputTransformation<?, ?, ?>) transform);
} else if (transform instanceof AbstractMultipleInputTransformation<?>) {
transformedIds = transformMultipleInputTransform((AbstractMultipleInputTransformation<?>) transform);
} else if (transform instanceof SourceTransformation) {
transformedIds = transformSource((SourceTransformation<?>) transform);
} else if (transform instanceof LegacySourceTransformation<?>) {
transformedIds = transformLegacySource((LegacySourceTransformation<?>) transform);
} else if (transform instanceof SinkTransformation<?>) {
transformedIds = transformSink((SinkTransformation<?>) transform);
} else if (transform instanceof UnionTransformation<?>) {
transformedIds = transformUnion((UnionTransformation<?>) transform);
} else if (transform instanceof SplitTransformation<?>) {
transformedIds = transformSplit((SplitTransformation<?>) transform);
} else if (transform instanceof SelectTransformation<?>) {
transformedIds = transformSelect((SelectTransformation<?>) transform);
} else if (transform instanceof FeedbackTransformation<?>) {
transformedIds = transformFeedback((FeedbackTransformation<?>) transform);
} else if (transform instanceof CoFeedbackTransformation<?>) {
transformedIds = transformCoFeedback((CoFeedbackTransformation<?>) transform);
} else if (transform instanceof PartitionTransformation<?>) {
transformedIds = transformPartition((PartitionTransformation<?>) transform);
} else if (transform instanceof SideOutputTransformation<?>) {
transformedIds = transformSideOutput((SideOutputTransformation<?>) transform);
} else {
throw new IllegalStateException("Unknown transformation: " + transform);
}
// need this check because the iterate transformation adds itself before
// transforming the feedback edges
if (!alreadyTransformed.containsKey(transform)) {
alreadyTransformed.put(transform, transformedIds);
}
if (transform.getBufferTimeout() >= 0) {
streamGraph.setBufferTimeout(transform.getId(), transform.getBufferTimeout());
} else {
streamGraph.setBufferTimeout(transform.getId(), defaultBufferTimeout);
}
if (transform.getUid() != null) {
streamGraph.setTransformationUID(transform.getId(), transform.getUid());
}
if (transform.getUserProvidedNodeHash() != null) {
streamGraph.setTransformationUserHash(transform.getId(), transform.getUserProvidedNodeHash());
}
if (!streamGraph.getExecutionConfig().hasAutoGeneratedUIDsEnabled()) {
if (transform instanceof PhysicalTransformation &&
transform.getUserProvidedNodeHash() == null &&
transform.getUid() == null) {
throw new IllegalStateException("Auto generated UIDs have been disabled " +
"but no UID or hash has been assigned to operator " + transform.getName());
}
}
if (transform.getMinResources() != null && transform.getPreferredResources() != null) {
streamGraph.setResources(transform.getId(), transform.getMinResources(), transform.getPreferredResources());
}
streamGraph.setManagedMemoryWeight(transform.getId(), transform.getManagedMemoryWeight());
return transformedIds;
}
拿WordCount举例,在flatMap、map、reduce、addSink过程中会将生成的DataStream的StreamTransformation添加到transformations列表中。
addSource没有将StreamTransformation添加到transformations,但是flatMap生成的StreamTransformation的input持有SourceTransformation的引用。
keyBy算子会生成KeyedStream,但是它的StreamTransformation并不会添加到transformations列表中,不过reduce生成的DataStream中的StreamTransformation中持有了KeyedStream的StreamTransformation的引用,作为它的input。
所以,WordCount中有4个StreamTransformation,前3个算子均为OneInputTransformation,最后一个为SinkTransformation。
1.2 transformOneInputTransform
由于transformations中第一个为OneInputTransformation,所以代码首先会走到transformOneInputTransform((OneInputTransformation<?, ?>) transform)
private <IN, OUT> Collection<Integer> transformOneInputTransform(OneInputTransformation<IN, OUT> transform) {
//首先递归的transform该OneInputTransformation的input
Collection<Integer> inputIds = transform(transform.getInput());
//如果递归transform input时发现input已经被transform,那么直接获取结果即可
// the recursive call might have already transformed this
if (alreadyTransformed.containsKey(transform)) {
return alreadyTransformed.get(transform);
}
//获取共享slot资源组,默认分组名”default”
String slotSharingGroup = determineSlotSharingGroup(transform.getSlotSharingGroup(), inputIds);
//将该StreamTransformation添加到StreamGraph中,当做一个顶点
streamGraph.addOperator(transform.getId(),
slotSharingGroup,
transform.getCoLocationGroupKey(),
transform.getOperator(),
transform.getInputType(),
transform.getOutputType(),
transform.getName());
if (transform.getStateKeySelector() != null) {
TypeSerializer<?> keySerializer = transform.getStateKeyType().createSerializer(env.getConfig());
streamGraph.setOneInputStateKey(transform.getId(), transform.getStateKeySelector(), keySerializer);
}
//设置顶点的并行度和最大并行度
streamGraph.setParallelism(transform.getId(), transform.getParallelism());
streamGraph.setMaxParallelism(transform.getId(), transform.getMaxParallelism());
//根据该StreamTransformation有多少个input,依次给StreamGraph添加边,即input——>current
for (Integer inputId: inputIds) {
streamGraph.addEdge(inputId, transform.getId(), 0);
}
//返回该StreamTransformation的id,OneInputTransformation只有自身的一个id列表
return Collections.singleton(transform.getId());
}
transformOneInputTransform()方法的实现如下:
1、先递归的transform该OneInputTransformation的input,如果input已经transformed,那么直接从map中取数据即可
2、将该StreamTransformation作为一个图的顶点添加到StreamGraph中,并设置顶点的并行度和共享资源组
3、根据该StreamTransformation的input,构造图的边,有多少个input,就会生成多少边,不过OneInputTransformation顾名思义就是一个input,所以会构造一条边,即input——>currentId
1.3 构造顶点
//StreamGraph类
public <IN, OUT> void addOperator(
Integer vertexID,
String slotSharingGroup,
@Nullable String coLocationGroup,
StreamOperator<OUT> operatorObject,
TypeInformation<IN> inTypeInfo,
TypeInformation<OUT> outTypeInfo,
String operatorName) {
//将StreamTransformation作为一个顶点,添加到streamNodes中
if (operatorObject instanceof StoppableStreamSource) {
addNode(vertexID, slotSharingGroup, coLocationGroup, StoppableSourceStreamTask.class, operatorObject, operatorName);
} else if (operatorObject instanceof StreamSource) {
//如果operator是StreamSource,则Task类型为SourceStreamTask
addNode(vertexID, slotSharingGroup, coLocationGroup, SourceStreamTask.class, operatorObject, operatorName);
} else {
//如果operator不是StreamSource,Task类型为OneInputStreamTask
addNode(vertexID, slotSharingGroup, coLocationGroup, OneInputStreamTask.class, operatorObject, operatorName);
}
TypeSerializer<IN> inSerializer = inTypeInfo != null && !(inTypeInfo instanceof MissingTypeInfo) ? inTypeInfo.createSerializer(executionConfig) : null;
TypeSerializer<OUT> outSerializer = outTypeInfo != null && !(outTypeInfo instanceof MissingTypeInfo) ? outTypeInfo.createSerializer(executionConfig) : null;
setSerializers(vertexID, inSerializer, null, outSerializer);
if (operatorObject instanceof OutputTypeConfigurable && outTypeInfo != null) {
@SuppressWarnings("unchecked")
OutputTypeConfigurable<OUT> outputTypeConfigurable = (OutputTypeConfigurable<OUT>) operatorObject;
// sets the output type which must be know at StreamGraph creation time
outputTypeConfigurable.setOutputType(outTypeInfo, executionConfig);
}
if (operatorObject instanceof InputTypeConfigurable) {
InputTypeConfigurable inputTypeConfigurable = (InputTypeConfigurable) operatorObject;
inputTypeConfigurable.setInputType(inTypeInfo, executionConfig);
}
if (LOG.isDebugEnabled()) {
LOG.debug("Vertex: {}", vertexID);
}
}
protected StreamNode addNode(Integer vertexID,
String slotSharingGroup,
@Nullable String coLocationGroup,
Class<? extends AbstractInvokable> vertexClass,
StreamOperator<?> operatorObject,
String operatorName) {
if (streamNodes.containsKey(vertexID)) {
throw new RuntimeException("Duplicate vertexID " + vertexID);
}
//构造顶点,添加到streamNodes中,streamNodes是一个Map
StreamNode vertex = new StreamNode(environment,
vertexID,
slotSharingGroup,
coLocationGroup,
operatorObject,
operatorName,
new ArrayList<OutputSelector<?>>(),
vertexClass);
// 添加到 streamNodes
streamNodes.put(vertexID, vertex);
return vertex;
}
1.4 构造边
public void addEdge(Integer upStreamVertexID, Integer downStreamVertexID, int typeNumber) {
addEdgeInternal(upStreamVertexID,
downStreamVertexID,
typeNumber,
null, //这里开始传递的partitioner都是null
new ArrayList<String>(),
null //outputTag对侧输出有效);
}
private void addEdgeInternal(Integer upStreamVertexID,
Integer downStreamVertexID,
int typeNumber,
StreamPartitioner<?> partitioner,
List<String> outputNames,
OutputTag outputTag) {
if (virtualSideOutputNodes.containsKey(upStreamVertexID)) {
//针对侧输出
int virtualId = upStreamVertexID;
upStreamVertexID = virtualSideOutputNodes.get(virtualId).f0;
if (outputTag == null) {
outputTag = virtualSideOutputNodes.get(virtualId).f1;
}
addEdgeInternal(upStreamVertexID, downStreamVertexID, typeNumber, partitioner, null, outputTag);
} else if (virtualSelectNodes.containsKey(upStreamVertexID)) {
int virtualId = upStreamVertexID;
upStreamVertexID = virtualSelectNodes.get(virtualId).f0;
if (outputNames.isEmpty()) {
// selections that happen downstream override earlier selections
outputNames = virtualSelectNodes.get(virtualId).f1;
}
addEdgeInternal(upStreamVertexID, downStreamVertexID, typeNumber, partitioner, outputNames, outputTag);
} else if (virtualPartitionNodes.containsKey(upStreamVertexID)) {
//keyBy算子产生的PartitionTransform作为下游的input,下游的StreamTransformation添加边时会走到这
int virtualId = upStreamVertexID;
upStreamVertexID = virtualPartitionNodes.get(virtualId).f0;
if (partitioner == null) {
partitioner = virtualPartitionNodes.get(virtualId).f1;
}
addEdgeInternal(upStreamVertexID, downStreamVertexID, typeNumber, partitioner, outputNames, outputTag);
} else {
//一般的OneInputTransform会走到这里
StreamNode upstreamNode = getStreamNode(upStreamVertexID);
StreamNode downstreamNode = getStreamNode(downStreamVertexID);
// If no partitioner was specified and the parallelism of upstream and downstream
// operator matches use forward partitioning, use rebalance otherwise.
//如果上下顶点的并行度一致,则用ForwardPartitioner,否则用RebalancePartitioner
if (partitioner == null && upstreamNode.getParallelism() == downstreamNode.getParallelism()) {
partitioner = new ForwardPartitioner<Object>();
} else if (partitioner == null) {
partitioner = new RebalancePartitioner<Object>();
}
if (partitioner instanceof ForwardPartitioner) {
if (upstreamNode.getParallelism() != downstreamNode.getParallelism()) {
throw new UnsupportedOperationException("Forward partitioning does not allow " +
"change of parallelism. Upstream operation: " + upstreamNode + " parallelism: " + upstreamNode.getParallelism() +
", downstream operation: " + downstreamNode + " parallelism: " + downstreamNode.getParallelism() +
" You must use another partitioning strategy, such as broadcast, rebalance, shuffle or global.");
}
}
//一条边包括上下顶点,顶点之间的分区器等信息
StreamEdge edge = new StreamEdge(upstreamNode, downstreamNode, typeNumber, outputNames, partitioner, outputTag);
//分别给顶点添加出边和入边
getStreamNode(edge.getSourceId()).addOutEdge(edge);
getStreamNode(edge.getTargetId()).addInEdge(edge);
}
}
边的属性包括上下游的顶点,和顶点之间的partitioner等信息。如果上下游的并行度一致,那么他们之间的partitioner是ForwardPartitioner,如果上下游的并行度不一致是RebalancePartitioner,当然这前提是没有设置partitioner的前提下。如果显示设置了partitioner的情况,例如keyBy算子,在内部就确定了分区器是KeyGroupStreamPartitioner,那么它们之间的分区器就是KeyGroupStreamPartitioner。
1.5 transformSource
上述说道,transformOneInputTransform会先递归的transform该OneInputTransform的input,那么对于WordCount中在transform第一个OneInputTransform时会首先transform它的input,也就是SourceTransformation,方法在transformSource()
private <T> Collection<Integer> transformSource(SourceTransformation<T> source) {
String slotSharingGroup = determineSlotSharingGroup(source.getSlotSharingGroup(), Collections.emptyList());
//将source作为图的顶点和图的source添加StreamGraph中
streamGraph.addSource(source.getId(),
slotSharingGroup,
source.getCoLocationGroupKey(),
source.getOperator(),
null,
source.getOutputType(),
"Source: " + source.getName());
if (source.getOperator().getUserFunction() instanceof InputFormatSourceFunction) {
InputFormatSourceFunction<T> fs = (InputFormatSourceFunction<T>) source.getOperator().getUserFunction();
streamGraph.setInputFormat(source.getId(), fs.getFormat());
}
//设置source的并行度
streamGraph.setParallelism(source.getId(), source.getParallelism());
streamGraph.setMaxParallelism(source.getId(), source.getMaxParallelism());
//返回自身id的单列表
return Collections.singleton(source.getId());
}
//StreamGraph类
public <IN, OUT> void addSource(Integer vertexID,
String slotSharingGroup,
@Nullable String coLocationGroup,
StreamOperator<OUT> operatorObject,
TypeInformation<IN> inTypeInfo,
TypeInformation<OUT> outTypeInfo,
String operatorName) {
addOperator(vertexID, slotSharingGroup, coLocationGroup, operatorObject, inTypeInfo, outTypeInfo, operatorName);
sources.add(vertexID);
}
transformSource()逻辑也比较简单,就是将source添加到StreamGraph中,注意Source是图的根节点,没有input,所以它不需要添加边。
1.6 transformPartition
在WordCount中,由reduce生成的DataStream的StreamTransformation是一个OneInputTransformation,同样,在transform它的时候,会首先transform它的input,而它的input就是KeyedStream中生成的PartitionTransformation。所以代码会执行transformPartition((PartitionTransformation<?>) transform)
private <T> Collection<Integer> transformPartition(PartitionTransformation<T> partition) {
StreamTransformation<T> input = partition.getInput();
List<Integer> resultIds = new ArrayList<>();
//首先会transform PartitionTransformation的input
Collection<Integer> transformedIds = transform(input);
for (Integer transformedId: transformedIds) {
//生成一个新的虚拟节点
int virtualId = StreamTransformation.getNewNodeId();
//将虚拟节点添加到StreamGraph中
streamGraph.addVirtualPartitionNode(transformedId, virtualId, partition.getPartitioner());
resultIds.add(virtualId);
}
return resultIds;
}
//StreamGraph类
public void addVirtualPartitionNode(Integer originalId, Integer virtualId, StreamPartitioner<?> partitioner) {
if (virtualPartitionNodes.containsKey(virtualId)) {
throw new IllegalStateException("Already has virtual partition node with id " + virtualId);
}
// virtualPartitionNodes是一个Map,存储虚拟的Partition节点
virtualPartitionNodes.put(virtualId,
new Tuple2<Integer, StreamPartitioner<?>>(originalId, partitioner));
}
transformPartition会为PartitionTransformation生成一个新的虚拟节点,同时将该虚拟节点保存到StreamGraph的virtualPartitionNodes中,并且会保存该PartitionTransformation的input,将PartitionTransformation的partitioner作为其input的分区器,在WordCount中,也就是作为map算子生成的DataStream的分区器。
在进行transform reduce生成的OneInputTransform时,它的inputIds便是transformPartition时为PartitionTransformation生成新的虚拟节点,在添加边的时候会走到下面的代码。边的形式大致是OneInputTransform(map)—KeyGroupStreamPartitioner—>OneInputTransform(window)
} else if (virtualPartitionNodes.containsKey(upStreamVertexID)) {
//keyBy算子产生的PartitionTransform作为下游的input,下游的StreamTransform添加边时会走到这
int virtualId = upStreamVertexID;
upStreamVertexID = virtualPartitionNodes.get(virtualId).f0;
if (partitioner == null) {
partitioner = virtualPartitionNodes.get(virtualId).f1;
}
addEdgeInternal(upStreamVertexID, downStreamVertexID, typeNumber, partitioner, outputNames, outputTag);
1.7 transformSink
在WordCount中,transformations的最后一个StreamTransformation是SinkTransformation,方法在
transformSink()
private <T> Collection<Integer> transformSink(SinkTransformation<T> sink) {
//首先transform input
Collection<Integer> inputIds = transform(sink.getInput());
String slotSharingGroup = determineSlotSharingGroup(sink.getSlotSharingGroup(), inputIds);
//给图添加sink,并且构造并添加图的顶点
streamGraph.addSink(sink.getId(),
slotSharingGroup,
sink.getCoLocationGroupKey(),
sink.getOperator(),
sink.getInput().getOutputType(),
null,
"Sink: " + sink.getName());
streamGraph.setParallelism(sink.getId(), sink.getParallelism());
streamGraph.setMaxParallelism(sink.getId(), sink.getMaxParallelism());
//构造并添加StreamGraph的边
for (Integer inputId: inputIds) {
streamGraph.addEdge(inputId,
sink.getId(),
0
);
}
if (sink.getStateKeySelector() != null) {
TypeSerializer<?> keySerializer = sink.getStateKeyType().createSerializer(env.getConfig());
streamGraph.setOneInputStateKey(sink.getId(), sink.getStateKeySelector(), keySerializer);
}
//因为sink是图的末尾节点,没有下游的输出,所以返回空了
return Collections.emptyList();
}
//StreamGraph类
public <IN, OUT> void addSink(Integer vertexID,
String slotSharingGroup,
@Nullable String coLocationGroup,
StreamOperator<OUT> operatorObject,
TypeInformation<IN> inTypeInfo,
TypeInformation<OUT> outTypeInfo,
String operatorName) {
addOperator(vertexID, slotSharingGroup, coLocationGroup, operatorObject, inTypeInfo, outTypeInfo, operatorName);
sinks.add(vertexID);
}
transformSink()中的逻辑也比较简单,和transformSource()类似,不同的是sink是有边的,而且sink的下游没有输出了,也就不需要作为下游的input,所以返回空列表。
在transformSink()之后,也就把所有的StreamTransformation的都进行transform了,那么这时候StreamGraph中的顶点、边、partition的虚拟顶点都构建好了,返回StreamGraph即可。下一步就是根据StreamGraph构造JobGraph了。
可以看出StreamGraph的结构还是比较简单的,每个DataStream的StreamTransformation会作为一个图的顶点(PartitionTransform是虚拟顶点),根据StreamTransformation的input来构建图的边。