一文搞懂 Flink Stream Graph 构建过程源码

mn_kw

已于 2024-09-18 11:15:50 修改

阅读量1.1k

点赞数 20

文章标签： flink 数据仓库大数据

于 2024-09-14 14:46:44 首次发布

本文链接：https://blog.youkuaiyun.com/mn_kw/article/details/142257594

版权

一文搞懂 Flink Stream Graph 构建过程

1. Flink Stream Graph构建过程

1. Flink Stream Graph构建过程

链接: 一文搞懂 Flink 其他重要源码点击我

env.execute将进行任务的提交和执行，在执行之前会对任务进行StreamGraph和JobGraph的构建，然后再提交JobGraph。
那么现在就来分析一下StreamGraph的构建过程，在正式环境下，会调用StreamContextEnvironment.execute()方法：

public JobExecutionResult execute(String jobName) throws Exception {
   Preconditions.checkNotNull(jobName, "Streaming Job name should not be null.");
   // 获取 StreamGraph
   return execute(getStreamGraph(jobName));
}

public StreamGraph getStreamGraph(String jobName, boolean clearTransformations) {
   // 构建StreamGraph
   StreamGraph streamGraph = getStreamGraphGenerator().setJobName(jobName).generate();
   if (clearTransformations) {
      // 构建完StreamGraph之后，可以清空transformations列表
      this.transformations.clear();
   }
   return streamGraph;
}

实现在StreamGraphGenerator.generate(this, transformations);这里的transformations列表就是在之前调用map、flatMap、filter算子时添加进去的生成的DataStream的StreamTransformation，是DataStream的描述信息。
接着看：

public StreamGraph generate() {
   // 实例化 StreamGraph
   streamGraph = new StreamGraph(executionConfig, checkpointConfig, savepointRestoreSettings);
   streamGraph.setStateBackend(stateBackend);
   streamGraph.setChaining(chaining);
   streamGraph.setScheduleMode(scheduleMode);
   streamGraph.setUserArtifacts(userArtifacts);
   streamGraph.setTimeCharacteristic(timeCharacteristic);
   streamGraph.setJobName(jobName);
   streamGraph.setGlobalDataExchangeMode(globalDataExchangeMode);

   // 记录了已经transformed的StreamTransformation。
   alreadyTransformed = new HashMap<>();

   for (Transformation<?> transformation: transformations) {
      // 根据 transformation 创建StreamGraph
      transform(transformation);
   }

   final StreamGraph builtStreamGraph = streamGraph;

   alreadyTransformed.clear();
   alreadyTransformed = null;
   streamGraph = null;

   return builtStreamGraph;
}

1.1 transform(): 构建的核心

private Collection<Integer> transform(Transformation<?> transform) {

   if (alreadyTransformed.containsKey(transform)) {
      return alreadyTransformed.get(transform);
   }

   LOG.debug("Transforming " + transform);

   if (transform.getMaxParallelism() <= 0) {

      // if the max parallelism hasn't been set, then first use the job wide max parallelism
      // from the ExecutionConfig.
      int globalMaxParallelismFromConfig = executionConfig.getMaxParallelism();
      if (globalMaxParallelismFromConfig > 0) {
         transform.setMaxParallelism(globalMaxParallelismFromConfig);
      }
   }

   // call at least once to trigger exceptions about MissingTypeInfo
   transform.getOutputType();

   Collection<Integer> transformedIds;
   // 判断 transform 属于哪类 Transformation
   if (transform instanceof OneInputTransformation<?, ?>) {
      transformedIds = transformOneInputTransform((OneInputTransformation<?, ?>) transform);
   } else if (transform instanceof TwoInputTransformation<?, ?, ?>) {
      transformedIds = transformTwoInputTransform((TwoInputTransformation<?, ?, ?>) transform);
   } else if (transform instanceof AbstractMultipleInputTransformation<?>) {
      transformedIds = transformMultipleInputTransform((AbstractMultipleInputTransformation<?>) transform);
   } else if (transform instanceof SourceTransformation) {
      transformedIds = transformSource((SourceTransformation<?>) transform);
   } else if (transform instanceof LegacySourceTransformation<?>) {
      transformedIds = transformLegacySource((LegacySourceTransformation<?>) transform);
   } else if (transform instanceof SinkTransformation<?>) {
      transformedIds = transformSink((SinkTransformation<?>) transform);
   } else if (transform instanceof UnionTransformation<?>) {
      transformedIds = transformUnion((UnionTransformation<?>) transform);
   } else if (transform instanceof SplitTransformation<?>) {
      transformedIds = transformSplit((SplitTransformation<?>) transform);
   } else if (transform instanceof SelectTransformation<?>) {
      transformedIds = transformSelect((SelectTransformation<?>) transform);
   } else if (transform instanceof FeedbackTransformation<?>) {
      transformedIds = transformFeedback((FeedbackTransformation<?>) transform);
   } else if (transform instanceof CoFeedbackTransformation<?>) {
      transformedIds = transformCoFeedback((CoFeedbackTransformation<?>) transform);
   } else if (transform instanceof PartitionTransformation<?>) {
      transformedIds = transformPartition((PartitionTransformation<?>) transform);
   } else if (transform instanceof SideOutputTransformation<?>) {
      transformedIds = transformSideOutput((SideOutputTransformation<?>) transform);
   } else {
      throw new IllegalStateException("Unknown transformation: " + transform);
   }

   // need this check because the iterate transformation adds itself before
   // transforming the feedback edges
   if (!alreadyTransformed.containsKey(transform)) {
      alreadyTransformed.put(transform, transformedIds);
   }

   if (transform.getBufferTimeout() >= 0) {
      streamGraph.setBufferTimeout(transform.getId(), transform.getBufferTimeout());
   } else {
      streamGraph.setBufferTimeout(transform.getId(), defaultBufferTimeout);
   }

   if (transform.getUid() != null) {
      streamGraph.setTransformationUID(transform.getId(), transform.getUid());
   }
   if (transform.getUserProvidedNodeHash() != null) {
      streamGraph.setTransformationUserHash(transform.getId(), transform.getUserProvidedNodeHash());
   }

   if (!streamGraph.getExecutionConfig().hasAutoGeneratedUIDsEnabled()) {
      if (transform instanceof PhysicalTransformation &&
            transform.getUserProvidedNodeHash() == null &&
            transform.getUid() == null) {
         throw new IllegalStateException("Auto generated UIDs have been disabled " +
            "but no UID or hash has been assigned to operator " + transform.getName());
      }
   }

   if (transform.getMinResources() != null && transform.getPreferredResources() != null) {
      streamGraph.setResources(transform.getId(), transform.getMinResources(), transform.getPreferredResources());
   }

   streamGraph.setManagedMemoryWeight(transform.getId(), transform.getManagedMemoryWeight());

   return transformedIds;
}

拿WordCount举例，在flatMap、map、reduce、addSink过程中会将生成的DataStream的StreamTransformation添加到transformations列表中。
addSource没有将StreamTransformation添加到transformations，但是flatMap生成的StreamTransformation的input持有SourceTransformation的引用。
keyBy算子会生成KeyedStream，但是它的StreamTransformation并不会添加到transformations列表中，不过reduce生成的DataStream中的StreamTransformation中持有了KeyedStream的StreamTransformation的引用，作为它的input。
所以，WordCount中有4个StreamTransformation，前3个算子均为OneInputTransformation，最后一个为SinkTransformation。

1.2 transformOneInputTransform

由于transformations中第一个为OneInputTransformation，所以代码首先会走到transformOneInputTransform((OneInputTransformation<?, ?>) transform)

private <IN, OUT> Collection<Integer> transformOneInputTransform(OneInputTransformation<IN, OUT> transform) {
    //首先递归的transform该OneInputTransformation的input
   Collection<Integer> inputIds = transform(transform.getInput());
    
    //如果递归transform input时发现input已经被transform，那么直接获取结果即可
   // the recursive call might have already transformed this
   if (alreadyTransformed.containsKey(transform)) {
      return alreadyTransformed.get(transform);
   }

    //获取共享slot资源组，默认分组名”default”
   String slotSharingGroup = determineSlotSharingGroup(transform.getSlotSharingGroup(), inputIds);
    //将该StreamTransformation添加到StreamGraph中，当做一个顶点
   streamGraph.addOperator(transform.getId(),
         slotSharingGroup,
         transform.getCoLocationGroupKey(),
         transform.getOperator(),
         transform.getInputType(),
         transform.getOutputType(),
         transform.getName());

   if (transform.getStateKeySelector() != null) {
      TypeSerializer<?> keySerializer = transform.getStateKeyType().createSerializer(env.getConfig());
      streamGraph.setOneInputStateKey(transform.getId(), transform.getStateKeySelector(), keySerializer);
   }
    //设置顶点的并行度和最大并行度
   streamGraph.setParallelism(transform.getId(), transform.getParallelism());
   streamGraph.setMaxParallelism(transform.getId(), transform.getMaxParallelism());
    //根据该StreamTransformation有多少个input，依次给StreamGraph添加边，即input——>current
   for (Integer inputId: inputIds) {
      streamGraph.addEdge(inputId, transform.getId(), 0);
   }
    //返回该StreamTransformation的id，OneInputTransformation只有自身的一个id列表
   return Collections.singleton(transform.getId());
}

transformOneInputTransform()方法的实现如下：
1、先递归的transform该OneInputTransformation的input，如果input已经transformed，那么直接从map中取数据即可
2、将该StreamTransformation作为一个图的顶点添加到StreamGraph中，并设置顶点的并行度和共享资源组
3、根据该StreamTransformation的input，构造图的边，有多少个input，就会生成多少边，不过OneInputTransformation顾名思义就是一个input，所以会构造一条边，即input——>currentId

1.3 构造顶点

//StreamGraph类
public <IN, OUT> void addOperator(
      Integer vertexID,
      String slotSharingGroup,
      @Nullable String coLocationGroup,
      StreamOperator<OUT> operatorObject,
      TypeInformation<IN> inTypeInfo,
      TypeInformation<OUT> outTypeInfo,
      String operatorName) {
    //将StreamTransformation作为一个顶点，添加到streamNodes中
   if (operatorObject instanceof StoppableStreamSource) {
      addNode(vertexID, slotSharingGroup, coLocationGroup, StoppableSourceStreamTask.class, operatorObject, operatorName);
   } else if (operatorObject instanceof StreamSource) {
       //如果operator是StreamSource，则Task类型为SourceStreamTask
      addNode(vertexID, slotSharingGroup, coLocationGroup, SourceStreamTask.class, operatorObject, operatorName);
   } else {
       //如果operator不是StreamSource，Task类型为OneInputStreamTask
      addNode(vertexID, slotSharingGroup, coLocationGroup, OneInputStreamTask.class, operatorObject, operatorName);
   }

   TypeSerializer<IN> inSerializer = inTypeInfo != null && !(inTypeInfo instanceof MissingTypeInfo) ? inTypeInfo.createSerializer(executionConfig) : null;

   TypeSerializer<OUT> outSerializer = outTypeInfo != null && !(outTypeInfo instanceof MissingTypeInfo) ? outTypeInfo.createSerializer(executionConfig) : null;

   setSerializers(vertexID, inSerializer, null, outSerializer);

   if (operatorObject instanceof OutputTypeConfigurable && outTypeInfo != null) {
      @SuppressWarnings("unchecked")
      OutputTypeConfigurable<OUT> outputTypeConfigurable = (OutputTypeConfigurable<OUT>) operatorObject;
      // sets the output type which must be know at StreamGraph creation time
      outputTypeConfigurable.setOutputType(outTypeInfo, executionConfig);
   }

   if (operatorObject instanceof InputTypeConfigurable) {
      InputTypeConfigurable inputTypeConfigurable = (InputTypeConfigurable) operatorObject;
      inputTypeConfigurable.setInputType(inTypeInfo, executionConfig);
   }

   if (LOG.isDebugEnabled()) {
      LOG.debug("Vertex: {}", vertexID);
   }
}

protected StreamNode addNode(Integer vertexID,
   String slotSharingGroup,
   @Nullable String coLocationGroup,
   Class<? extends AbstractInvokable> vertexClass,
   StreamOperator<?> operatorObject,
   String operatorName) {

   if (streamNodes.containsKey(vertexID)) {
      throw new RuntimeException("Duplicate vertexID " + vertexID);
   }
    //构造顶点，添加到streamNodes中，streamNodes是一个Map
   StreamNode vertex = new StreamNode(environment,
      vertexID,
      slotSharingGroup,
      coLocationGroup,
      operatorObject,
      operatorName,
      new ArrayList<OutputSelector<?>>(),
      vertexClass);

   // 添加到 streamNodes
   streamNodes.put(vertexID, vertex);

   return vertex;
}

1.4 构造边

 public void addEdge(Integer upStreamVertexID, Integer downStreamVertexID, int typeNumber) {
   addEdgeInternal(upStreamVertexID,
         downStreamVertexID,
         typeNumber,
         null, //这里开始传递的partitioner都是null
         new ArrayList<String>(),
         null //outputTag对侧输出有效);

}

private void addEdgeInternal(Integer upStreamVertexID,
      Integer downStreamVertexID,
      int typeNumber,
      StreamPartitioner<?> partitioner,
      List<String> outputNames,
      OutputTag outputTag) {
   
   if (virtualSideOutputNodes.containsKey(upStreamVertexID)) {
      //针对侧输出
      int virtualId = upStreamVertexID;
      upStreamVertexID = virtualSideOutputNodes.get(virtualId).f0;
      if (outputTag == null) {
         outputTag = virtualSideOutputNodes.get(virtualId).f1;
      }
      addEdgeInternal(upStreamVertexID, downStreamVertexID, typeNumber, partitioner, null, outputTag);
   } else if (virtualSelectNodes.containsKey(upStreamVertexID)) {
      int virtualId = upStreamVertexID;
      upStreamVertexID = virtualSelectNodes.get(virtualId).f0;
      if (outputNames.isEmpty()) {
         // selections that happen downstream override earlier selections
         outputNames = virtualSelectNodes.get(virtualId).f1;
      }
      addEdgeInternal(upStreamVertexID, downStreamVertexID, typeNumber, partitioner, outputNames, outputTag);
   } else if (virtualPartitionNodes.containsKey(upStreamVertexID)) {
       //keyBy算子产生的PartitionTransform作为下游的input，下游的StreamTransformation添加边时会走到这
      int virtualId = upStreamVertexID;
      upStreamVertexID = virtualPartitionNodes.get(virtualId).f0;
      if (partitioner == null) {
         partitioner = virtualPartitionNodes.get(virtualId).f1;
      }
      addEdgeInternal(upStreamVertexID, downStreamVertexID, typeNumber, partitioner, outputNames, outputTag);
   } else {
       //一般的OneInputTransform会走到这里
      StreamNode upstreamNode = getStreamNode(upStreamVertexID);
      StreamNode downstreamNode = getStreamNode(downStreamVertexID);

      // If no partitioner was specified and the parallelism of upstream and downstream
      // operator matches use forward partitioning, use rebalance otherwise.
      //如果上下顶点的并行度一致，则用ForwardPartitioner，否则用RebalancePartitioner
      if (partitioner == null && upstreamNode.getParallelism() == downstreamNode.getParallelism()) {
         partitioner = new ForwardPartitioner<Object>();
      } else if (partitioner == null) {
         partitioner = new RebalancePartitioner<Object>();
      }

      if (partitioner instanceof ForwardPartitioner) {
         if (upstreamNode.getParallelism() != downstreamNode.getParallelism()) {
            throw new UnsupportedOperationException("Forward partitioning does not allow " +
                  "change of parallelism. Upstream operation: " + upstreamNode + " parallelism: " + upstreamNode.getParallelism() +
                  ", downstream operation: " + downstreamNode + " parallelism: " + downstreamNode.getParallelism() +
                  " You must use another partitioning strategy, such as broadcast, rebalance, shuffle or global.");
         }
      }
        //一条边包括上下顶点，顶点之间的分区器等信息
      StreamEdge edge = new StreamEdge(upstreamNode, downstreamNode, typeNumber, outputNames, partitioner, outputTag);
      //分别给顶点添加出边和入边
      getStreamNode(edge.getSourceId()).addOutEdge(edge);
      getStreamNode(edge.getTargetId()).addInEdge(edge);
   }
}

边的属性包括上下游的顶点，和顶点之间的partitioner等信息。如果上下游的并行度一致，那么他们之间的partitioner是ForwardPartitioner，如果上下游的并行度不一致是RebalancePartitioner，当然这前提是没有设置partitioner的前提下。如果显示设置了partitioner的情况，例如keyBy算子，在内部就确定了分区器是KeyGroupStreamPartitioner，那么它们之间的分区器就是KeyGroupStreamPartitioner。

1.5 transformSource

上述说道，transformOneInputTransform会先递归的transform该OneInputTransform的input，那么对于WordCount中在transform第一个OneInputTransform时会首先transform它的input，也就是SourceTransformation，方法在transformSource()

private <T> Collection<Integer> transformSource(SourceTransformation<T> source) {
   String slotSharingGroup = determineSlotSharingGroup(source.getSlotSharingGroup(), Collections.emptyList());
    //将source作为图的顶点和图的source添加StreamGraph中
   streamGraph.addSource(source.getId(),
         slotSharingGroup,
         source.getCoLocationGroupKey(),
         source.getOperator(),
         null,
         source.getOutputType(),
         "Source: " + source.getName());
   if (source.getOperator().getUserFunction() instanceof InputFormatSourceFunction) {
      InputFormatSourceFunction<T> fs = (InputFormatSourceFunction<T>) source.getOperator().getUserFunction();
      streamGraph.setInputFormat(source.getId(), fs.getFormat());
   }
   //设置source的并行度
   streamGraph.setParallelism(source.getId(), source.getParallelism());
   streamGraph.setMaxParallelism(source.getId(), source.getMaxParallelism());
   //返回自身id的单列表
   return Collections.singleton(source.getId());
}

//StreamGraph类
public <IN, OUT> void addSource(Integer vertexID,
   String slotSharingGroup,
   @Nullable String coLocationGroup,
   StreamOperator<OUT> operatorObject,
   TypeInformation<IN> inTypeInfo,
   TypeInformation<OUT> outTypeInfo,
   String operatorName) {
   addOperator(vertexID, slotSharingGroup, coLocationGroup, operatorObject, inTypeInfo, outTypeInfo, operatorName);
   sources.add(vertexID);
}

transformSource()逻辑也比较简单，就是将source添加到StreamGraph中，注意Source是图的根节点，没有input，所以它不需要添加边。

1.6 transformPartition

在WordCount中，由reduce生成的DataStream的StreamTransformation是一个OneInputTransformation，同样，在transform它的时候，会首先transform它的input，而它的input就是KeyedStream中生成的PartitionTransformation。所以代码会执行transformPartition((PartitionTransformation<?>) transform)

private <T> Collection<Integer> transformPartition(PartitionTransformation<T> partition) {
   StreamTransformation<T> input = partition.getInput();
   List<Integer> resultIds = new ArrayList<>();
    //首先会transform PartitionTransformation的input
   Collection<Integer> transformedIds = transform(input);
   for (Integer transformedId: transformedIds) {
       //生成一个新的虚拟节点
      int virtualId = StreamTransformation.getNewNodeId();
      //将虚拟节点添加到StreamGraph中
      streamGraph.addVirtualPartitionNode(transformedId, virtualId, partition.getPartitioner());
      resultIds.add(virtualId);
   }

   return resultIds;
}

//StreamGraph类
public void addVirtualPartitionNode(Integer originalId, Integer virtualId, StreamPartitioner<?> partitioner) {

   if (virtualPartitionNodes.containsKey(virtualId)) {
      throw new IllegalStateException("Already has virtual partition node with id " + virtualId);
   }
   // virtualPartitionNodes是一个Map，存储虚拟的Partition节点
   virtualPartitionNodes.put(virtualId,
         new Tuple2<Integer, StreamPartitioner<?>>(originalId, partitioner));
}

transformPartition会为PartitionTransformation生成一个新的虚拟节点，同时将该虚拟节点保存到StreamGraph的virtualPartitionNodes中，并且会保存该PartitionTransformation的input，将PartitionTransformation的partitioner作为其input的分区器，在WordCount中，也就是作为map算子生成的DataStream的分区器。

在进行transform reduce生成的OneInputTransform时，它的inputIds便是transformPartition时为PartitionTransformation生成新的虚拟节点，在添加边的时候会走到下面的代码。边的形式大致是OneInputTransform(map)—KeyGroupStreamPartitioner—>OneInputTransform(window)

} else if (virtualPartitionNodes.containsKey(upStreamVertexID)) {
       //keyBy算子产生的PartitionTransform作为下游的input，下游的StreamTransform添加边时会走到这
      int virtualId = upStreamVertexID;
      upStreamVertexID = virtualPartitionNodes.get(virtualId).f0;
      if (partitioner == null) {
         partitioner = virtualPartitionNodes.get(virtualId).f1;
      }
      addEdgeInternal(upStreamVertexID, downStreamVertexID, typeNumber, partitioner, outputNames, outputTag);

1.7 transformSink

在WordCount中，transformations的最后一个StreamTransformation是SinkTransformation，方法在
transformSink()

private <T> Collection<Integer> transformSink(SinkTransformation<T> sink) {
    //首先transform input
   Collection<Integer> inputIds = transform(sink.getInput());
   String slotSharingGroup = determineSlotSharingGroup(sink.getSlotSharingGroup(), inputIds);
    //给图添加sink，并且构造并添加图的顶点
   streamGraph.addSink(sink.getId(),
         slotSharingGroup,
         sink.getCoLocationGroupKey(),
         sink.getOperator(),
         sink.getInput().getOutputType(),
         null,
         "Sink: " + sink.getName());
   streamGraph.setParallelism(sink.getId(), sink.getParallelism());
   streamGraph.setMaxParallelism(sink.getId(), sink.getMaxParallelism());
    //构造并添加StreamGraph的边
   for (Integer inputId: inputIds) {
      streamGraph.addEdge(inputId,
            sink.getId(),
            0
      );
   }
   if (sink.getStateKeySelector() != null) {
      TypeSerializer<?> keySerializer = sink.getStateKeyType().createSerializer(env.getConfig());
      streamGraph.setOneInputStateKey(sink.getId(), sink.getStateKeySelector(), keySerializer);
   }
    //因为sink是图的末尾节点，没有下游的输出，所以返回空了
   return Collections.emptyList();
}
//StreamGraph类
public <IN, OUT> void addSink(Integer vertexID,
   String slotSharingGroup,
   @Nullable String coLocationGroup,
   StreamOperator<OUT> operatorObject,
   TypeInformation<IN> inTypeInfo,
   TypeInformation<OUT> outTypeInfo,
   String operatorName) {
   addOperator(vertexID, slotSharingGroup, coLocationGroup, operatorObject, inTypeInfo, outTypeInfo, operatorName);
   sinks.add(vertexID);
}

transformSink()中的逻辑也比较简单，和transformSource()类似，不同的是sink是有边的，而且sink的下游没有输出了，也就不需要作为下游的input，所以返回空列表。

在transformSink()之后，也就把所有的StreamTransformation的都进行transform了，那么这时候StreamGraph中的顶点、边、partition的虚拟顶点都构建好了，返回StreamGraph即可。下一步就是根据StreamGraph构造JobGraph了。
可以看出StreamGraph的结构还是比较简单的，每个DataStream的StreamTransformation会作为一个图的顶点（PartitionTransform是虚拟顶点），根据StreamTransformation的input来构建图的边。