Author: 不正经小新

Flink StreamGraph

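What is StreamGraph

StreamGraph is the class that represents a streaming topology. It contains all the information needed to build the job graph for execution.

Put simply, it is the graph generated directly from the code we write, and it represents the topology of the program.

The StreamGraph class diagram shows that the class carries all the information needed to execute a job, such as the state backend, the JobType (streaming or batch), and the checkpoint configuration. Today we take a high-level view and do not go into every detail: first the big picture, then gradually deeper.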

[Figure: StreamGraph class diagram]

Reading the code

Listing: the getStreamGraph method

public StreamGraph getStreamGraph() {
    return getStreamGraph(true);
}

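As you can see, a default argument of true is passed here. Its effect is to **clear the previously registered transformations**.

Why clear them? To prevent the same operations from being executed again when execute is called more than once.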
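The effect of that flag can be sketched without Flink. The class below is a hypothetical stand-in (TinyEnvironment, register, buildGraph, and the string-based transformations are illustrative names, not Flink APIs). It only shows why clearing the registered list keeps a second run from replaying the operators of the first one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an execution environment; not a Flink class.
final class TinyEnvironment {
    private final List<String> transformations = new ArrayList<>();

    void register(String transformation) {
        transformations.add(transformation);
    }

    // Snapshot the registered transformations as the "graph", optionally clearing them.
    List<String> buildGraph(boolean clearTransformations) {
        List<String> graph = new ArrayList<>(transformations);
        if (clearTransformations) {
            transformations.clear();
        }
        return graph;
    }

    public static void main(String[] args) {
        TinyEnvironment env = new TinyEnvironment();
        env.register("source");
        env.register("map");
        System.out.println(env.buildGraph(true)); // first job: both operators
        env.register("sink");
        System.out.println(env.buildGraph(true)); // second job: only what was added since
    }
}
```

Passing false instead would leave the list intact, so the next graph would contain the earlier operators again.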

Listing: the getStreamGraphGenerator method

private StreamGraphGenerator getStreamGraphGenerator(List<Transformation<?>> transformations) {
    if (transformations.size() <= 0) {
        throw new IllegalStateException(
                "No operators defined in streaming topology. Cannot execute.");
    }

    // Synchronize the cached file to config option PipelineOptions.CACHED_FILES because the
    // field cachedFile haven't been migrated to configuration.
    if (!getCachedFiles().isEmpty()) {
        configuration.set(
                PipelineOptions.CACHED_FILES,
                DistributedCache.parseStringFromCachedFiles(getCachedFiles()));
    }

    // We copy the transformation so that newly added transformations cannot intervene with the
    // stream graph generation.
    return new StreamGraphGenerator(
                    new ArrayList<>(transformations), config, checkpointCfg, configuration)
            .setStateBackend(defaultStateBackend)
            .setTimeCharacteristic(getStreamTimeCharacteristic())
            .setSlotSharingGroupResource(slotSharingGroupResources);
}

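getStreamGraphGenerator creates a StreamGraphGenerator object, which is used to build the stream processing topology graph.

The method does the following:

  • Check whether the transformations list is empty: if it is, an exception is thrown immediately (easy to demonstrate).
  • Synchronize cached files to the configuration: if there are cached files, they are written to the PipelineOptions.CACHED_FILES option, because the cachedFile field has not yet been migrated to the configuration. (You could call this a small optimization.)
  • Create and return the StreamGraphGenerator object.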
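The first and last steps can be sketched in plain Java. GeneratorSketch, create, and size are hypothetical names used only for illustration. What mirrors the listing above is the empty-list guard and the defensive copy of the list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of a graph generator; not the Flink class.
final class GeneratorSketch {
    private final List<String> transformations;

    private GeneratorSketch(List<String> transformations) {
        this.transformations = transformations;
    }

    static GeneratorSketch create(List<String> registered) {
        if (registered.isEmpty()) {
            throw new IllegalStateException(
                    "No operators defined in streaming topology. Cannot execute.");
        }
        // Copy the list so that transformations registered later cannot affect this generator.
        return new GeneratorSketch(new ArrayList<>(registered));
    }

    int size() {
        return transformations.size();
    }

    public static void main(String[] args) {
        List<String> registered = new ArrayList<>(List.of("source", "map"));
        GeneratorSketch generator = create(registered);
        registered.add("sink"); // added after the generator was created
        System.out.println(generator.size());
    }
}
```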

Listing: generate()

public StreamGraph generate() {
    // Pass in the configuration
    streamGraph =
            new StreamGraph(
                    configuration, executionConfig, checkpointConfig, savepointRestoreSettings);
    // Determine whether to execute in batch mode
    shouldExecuteInBatchMode = shouldExecuteInBatchMode();
    // Configure the StreamGraph
    configureStreamGraph(streamGraph);
    // Transformations that have already been transformed
    alreadyTransformed = new IdentityHashMap<>();
    // Iterate over all transformations
    for (Transformation<?> transformation : transformations) {
        transform(transformation);
    }
    // Set the slot sharing group resources
    streamGraph.setSlotSharingGroupResource(slotSharingGroupResources);

    setFineGrainedGlobalStreamExchangeMode(streamGraph);

    // Convert to a LineageGraph
    LineageGraph lineageGraph = LineageGraphUtils.convertToLineageGraph(transformations);
    streamGraph.setLineageGraph(lineageGraph);
    // Iterate over every StreamNode in the streamGraph and check whether any of the
    // node's input edges requires unaligned checkpoints to be disabled. If so, set
    // supportsUnalignedCheckpoints to false on all of that node's input edges.
    for (StreamNode node : streamGraph.getStreamNodes()) {
        if (node.getInEdges().stream().anyMatch(this::shouldDisableUnalignedCheckpointing)) {
            for (StreamEdge edge : node.getInEdges()) {
                edge.setSupportsUnalignedCheckpoints(false);
            }
        }
    }
    // Clean up and return
    final StreamGraph builtStreamGraph = streamGraph;

    alreadyTransformed.clear();
    alreadyTransformed = null;
    streamGraph = null;

    return builtStreamGraph;
}
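
The rule in the final loop is all-or-nothing per node: if any incoming edge requires unaligned checkpoints to be disabled, every incoming edge of that node is switched off. Here is a minimal sketch, where Edge and its mustDisable flag are hypothetical stand-ins for StreamEdge and for whatever shouldDisableUnalignedCheckpointing decides.

```java
import java.util.List;

final class UnalignedSketch {
    // Hypothetical stand-in for StreamEdge.
    static final class Edge {
        final boolean mustDisable; // assumed result of the per-edge check
        boolean supportsUnaligned = true;

        Edge(boolean mustDisable) {
            this.mustDisable = mustDisable;
        }
    }

    // If one incoming edge requires it, disable unaligned checkpoints on all of them.
    static void apply(List<Edge> inEdges) {
        if (inEdges.stream().anyMatch(edge -> edge.mustDisable)) {
            for (Edge edge : inEdges) {
                edge.supportsUnaligned = false;
            }
        }
    }

    public static void main(String[] args) {
        List<Edge> inEdges = List.of(new Edge(false), new Edge(true));
        apply(inEdges);
        for (Edge edge : inEdges) {
            System.out.println(edge.supportsUnaligned);
        }
    }
}
```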

Next, let's focus on the transform method.

Before reading it, let's first see what elements the transformations list contains.

[Figure: elements of the transformations list]

Listing: transform

private Collection<Integer> transform(Transformation<?> transform) {
    // If this transformation has already been transformed, return the cached result
    if (alreadyTransformed.containsKey(transform)) {
        return alreadyTransformed.get(transform);
    }

    LOG.debug("Transforming " + transform);

    if (transform.getMaxParallelism() <= 0) {

        // if the max parallelism hasn't been set, then first use the job wide max parallelism
        // from the ExecutionConfig.
        int globalMaxParallelismFromConfig = executionConfig.getMaxParallelism();
        if (globalMaxParallelismFromConfig > 0) {
            transform.setMaxParallelism(globalMaxParallelismFromConfig);
        }
    }

    // Part of the code omitted

    // call at least once to trigger exceptions about MissingTypeInfo
    transform.getOutputType();

    // Look up the concrete translator for this transformation's class
    @SuppressWarnings("unchecked")
    final TransformationTranslator<?, Transformation<?>> translator =
            (TransformationTranslator<?, Transformation<?>>)
                    translatorMap.get(transform.getClass());

    Collection<Integer> transformedIds;
    if (translator != null) {
        // Handle the transformation according to its concrete type
        transformedIds = translate(translator, transform);
    } else {
        transformedIds = legacyTransform(transform);
    }

    // need this check because the iterate transformation adds itself before
    // transforming the feedback edges
    if (!alreadyTransformed.containsKey(transform)) {
        alreadyTransformed.put(transform, transformedIds);
    }

    return transformedIds;
}
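
The shape of transform (identity-based cache first, then a translator looked up by class, then a fallback) can be sketched as follows. Everything here is hypothetical: SourceT, MapT, and UnknownT are made-up transformation types, the translators are plain functions, and the fallback simply throws, which is an assumption rather than what legacyTransform actually does.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class DispatchSketch {
    // Hypothetical transformation types.
    static final class SourceT {}
    static final class MapT {}
    static final class UnknownT {}

    private final Map<Class<?>, Function<Object, Collection<Integer>>> translatorMap =
            new HashMap<>();
    // Identity semantics: two distinct objects are never treated as the same transformation.
    private final Map<Object, Collection<Integer>> alreadyTransformed = new IdentityHashMap<>();
    int translations = 0;

    DispatchSketch() {
        translatorMap.put(SourceT.class, t -> List.of(1));
        translatorMap.put(MapT.class, t -> List.of(2));
    }

    Collection<Integer> transform(Object transformation) {
        if (alreadyTransformed.containsKey(transformation)) {
            return alreadyTransformed.get(transformation);
        }
        translations++;
        Function<Object, Collection<Integer>> translator =
                translatorMap.get(transformation.getClass());
        Collection<Integer> ids =
                translator != null
                        ? translator.apply(transformation)
                        : fallback(transformation);
        alreadyTransformed.put(transformation, ids);
        return ids;
    }

    // Assumed fallback for illustration: reject anything without a registered translator.
    private Collection<Integer> fallback(Object transformation) {
        throw new IllegalStateException(
                "Unknown transformation: " + transformation.getClass().getSimpleName());
    }

    public static void main(String[] args) {
        DispatchSketch sketch = new DispatchSketch();
        SourceT source = new SourceT();
        sketch.transform(source);
        sketch.transform(source); // served from the cache
        System.out.println(sketch.translations);
    }
}
```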

Next, let's step into the translate method and see what it does.

Listing: the translate method. We first follow the SourceTransformation path.

private Collection<Integer> translate(
        final TransformationTranslator<?, Transformation<?>> translator,
        final Transformation<?> transform) {
    checkNotNull(translator);
    checkNotNull(transform);

    // Get the node ID collection for each parent Transformation in the given inputs.
    // If there are parents, this recursively calls transform on them.
    final List<Collection<Integer>> allInputIds = getParentInputIds(transform.getInputs());

    // the recursive call might have already transformed this
    if (alreadyTransformed.containsKey(transform)) {
        return alreadyTransformed.get(transform);
    }

    final String slotSharingGroup =
            determineSlotSharingGroup(
                    transform.getSlotSharingGroup().isPresent()
                            ? transform.getSlotSharingGroup().get().getName()
                            : null,
                    allInputIds.stream()
                            .flatMap(Collection::stream)
                            .collect(Collectors.toList()));

    final TransformationTranslator.Context context =
            new ContextImpl(this, streamGraph, slotSharingGroup, configuration);

    return shouldExecuteInBatchMode
            ? translator.translateForBatch(transform, context)
            // Translate the given Transformation into its runtime implementation
            // for streaming execution mode and configure it
            : translator.translateForStreaming(transform, context);
}
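
Because the parents are resolved before the transformation itself, the traversal is effectively parent-first, with the cache preventing a shared parent from being translated twice. Here is a minimal sketch of that ordering. Node, order, and the sequential ID assignment are hypothetical simplifications and not how Flink assigns node IDs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

final class ParentFirstSketch {
    // Hypothetical transformation: a name plus its upstream inputs.
    static final class Node {
        final String name;
        final List<Node> inputs;

        Node(String name, Node... inputs) {
            this.name = name;
            this.inputs = Arrays.asList(inputs);
        }
    }

    private final Map<Node, Integer> alreadyTransformed = new IdentityHashMap<>();
    final List<String> order = new ArrayList<>();

    int transform(Node node) {
        Integer known = alreadyTransformed.get(node);
        if (known != null) {
            return known;
        }
        // Resolve all parents first; this is where the recursion happens.
        for (Node input : node.inputs) {
            transform(input);
        }
        int id = order.size() + 1; // illustrative ID: position in translation order
        order.add(node.name);
        alreadyTransformed.put(node, id);
        return id;
    }

    public static void main(String[] args) {
        Node source = new Node("source");
        Node map = new Node("map", source);
        Node filter = new Node("filter", source);
        Node union = new Node("union", map, filter);
        ParentFirstSketch sketch = new ParentFirstSketch();
        sketch.transform(union);
        System.out.println(sketch.order);
    }
}
```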

Following the breakpoint, we arrive at the translateInternal method of the SourceTransformationTranslator class.

Listing: the translateInternal method

private Collection<Integer> translateInternal(
        final SourceTransformation<OUT, SplitT, EnumChkT> transformation,
        final Context context,
        boolean emitProgressiveWatermarks) {
    checkNotNull(transformation);
    checkNotNull(context);

    final StreamGraph streamGraph = context.getStreamGraph();
    final String slotSharingGroup = context.getSlotSharingGroup();
    final int transformationId = transformation.getId();
    final ExecutionConfig executionConfig = streamGraph.getExecutionConfig();
    SourceOperatorFactory<OUT> operatorFactory =
            new SourceOperatorFactory<>(
                    transformation.getSource(),
                    transformation.getWatermarkStrategy(),
                    emitProgressiveWatermarks);

    // Set the chaining strategy, usually ALWAYS
    operatorFactory.setChainingStrategy(transformation.getChainingStrategy());
    operatorFactory.setCoordinatorListeningID(transformation.getCoordinatorListeningID());
    // Add the source
    streamGraph.addSource(
            transformationId,
            slotSharingGroup,
            transformation.getCoLocationGroupKey(),
            operatorFactory,
            null,
            transformation.getOutputType(),
            "Source: " + transformation.getName());
    // Determine and set the parallelism
    final int parallelism =
            transformation.getParallelism() != ExecutionConfig.PARALLELISM_DEFAULT
                    ? transformation.getParallelism()
                    : executionConfig.getParallelism();
    streamGraph.setParallelism(
            transformationId, parallelism, transformation.isParallelismConfigured());
    streamGraph.setMaxParallelism(transformationId, transformation.getMaxParallelism());

    streamGraph.setSupportsConcurrentExecutionAttempts(
            transformationId, transformation.isSupportsConcurrentExecutionAttempts());

    return Collections.singleton(transformationId);
}
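
The parallelism fallback in this listing is a simple "own value, else job-wide value" rule. Here is a minimal sketch, in which the local PARALLELISM_DEFAULT constant is a stand-in for ExecutionConfig.PARALLELISM_DEFAULT (which is -1 in Flink) and resolve is a hypothetical helper name.

```java
final class ParallelismSketch {
    // Stand-in for ExecutionConfig.PARALLELISM_DEFAULT.
    static final int PARALLELISM_DEFAULT = -1;

    // Use the transformation's own parallelism if it was set, otherwise the job-wide one.
    static int resolve(int transformationParallelism, int jobParallelism) {
        return transformationParallelism != PARALLELISM_DEFAULT
                ? transformationParallelism
                : jobParallelism;
    }

    public static void main(String[] args) {
        System.out.println(resolve(PARALLELISM_DEFAULT, 4)); // falls back to the job value
        System.out.println(resolve(2, 4));                   // keeps the explicit value
    }
}
```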

Let's look at the addSource method.

This method does not contain much logic itself; the core is in the addOperator method.

Listing: addOperator

private <IN, OUT> void addOperator(
        Integer vertexID,
        @Nullable String slotSharingGroup,
        @Nullable String coLocationGroup,
        StreamOperatorFactory<OUT> operatorFactory,
        TypeInformation<IN> inTypeInfo,
        TypeInformation<OUT> outTypeInfo,
        String operatorName,
        Class<? extends TaskInvokable> invokableClass) {
    // Explained below
    addNode(
            vertexID,
            slotSharingGroup,
            coLocationGroup,
            invokableClass,
            operatorFactory,
            operatorName);
    // Set the serializers for the input and output types
    setSerializers(vertexID, createSerializer(inTypeInfo), null, createSerializer(outTypeInfo));
    // Set the output type
    if (operatorFactory.isOutputTypeConfigurable() && outTypeInfo != null) {
        // sets the output type which must be known at StreamGraph creation time
        operatorFactory.setOutputType(outTypeInfo, executionConfig);
    }

    if (operatorFactory.isInputTypeConfigurable()) {
        operatorFactory.setInputType(inTypeInfo, executionConfig);
    }

    if (LOG.isDebugEnabled()) {
        LOG.debug("Vertex: {}", vertexID);
    }
}

The first step inside addOperator is addNode, which creates a StreamNode.

A StreamNode holds an operator of the program together with all of its properties.

Listing: the addNode method

protected StreamNode addNode(
        Integer vertexID,
        @Nullable String slotSharingGroup,
        @Nullable String coLocationGroup,
        Class<? extends TaskInvokable> vertexClass,
        @Nullable StreamOperatorFactory<?> operatorFactory,
        String operatorName) {

    if (streamNodes.containsKey(vertexID)) {
        throw new RuntimeException("Duplicate vertexID " + vertexID);
    }

    // Node ID, slot sharing group, co-location group, operator factory,
    // operator name, and task invokable class
    StreamNode vertex =
            new StreamNode(
                    vertexID,
                    slotSharingGroup,
                    coLocationGroup,
                    operatorFactory,
                    operatorName,
                    vertexClass);

    streamNodes.put(vertexID, vertex);

    return vertex;
}
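
Stripped of the Flink types, addNode is a keyed registry that refuses to register the same vertex ID twice. Here is a minimal sketch, where NodeRegistrySketch is a hypothetical name and a plain operator name stands in for the StreamNode.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical miniature of the node registry inside a stream graph.
final class NodeRegistrySketch {
    private final Map<Integer, String> streamNodes = new HashMap<>();

    String addNode(int vertexId, String operatorName) {
        if (streamNodes.containsKey(vertexId)) {
            throw new RuntimeException("Duplicate vertexID " + vertexId);
        }
        streamNodes.put(vertexId, operatorName);
        return operatorName;
    }

    int size() {
        return streamNodes.size();
    }

    public static void main(String[] args) {
        NodeRegistrySketch graph = new NodeRegistrySketch();
        graph.addNode(1, "Source: numbers");
        graph.addNode(2, "Map");
        System.out.println(graph.size());
    }
}
```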

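The overall flow for SourceTransformation is fairly simple.

Now let's look at OneInputTransformation, which is somewhat more complex and more widely used.

The earlier steps are the same; this time we end up in the translateInternal method of the OneInputTransformationTranslator class.

A OneInputTransformation has a single input, which points to the transformation before it.

Inside translateInternal, the first part of the logic is similar to that of SourceTransformationTranslator.

The key part is the following: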

for (Integer inputId : context.getStreamNodeIds(parentTransformations.get(0))) {
    streamGraph.addEdge(inputId, transformationId, 0);
}

if (transformation instanceof PhysicalTransformation) {
    streamGraph.setSupportsConcurrentExecutionAttempts(
            transformationId,
            ((PhysicalTransformation<OUT>) transformation)
                    .isSupportsConcurrentExecutionAttempts());
}

Step into the addEdge method, which calls addEdgeInternal.

On the first pass, the path taken is the one below.

Listing: createActualEdge (reached from addEdgeInternal)

private void createActualEdge(
        Integer upStreamVertexID,
        Integer downStreamVertexID,
        int typeNumber,
        StreamPartitioner<?> partitioner,
        OutputTag outputTag,
        StreamExchangeMode exchangeMode,
        IntermediateDataSetID intermediateDataSetId) {
    // Get the upstream and downstream nodes
    StreamNode upstreamNode = getStreamNode(upStreamVertexID);
    StreamNode downstreamNode = getStreamNode(downStreamVertexID);

    // Partitioner
    // If no partitioner was specified and the parallelism of upstream and downstream
    // operator matches use forward partitioning, use rebalance otherwise.
    if (partitioner == null
            && upstreamNode.getParallelism() == downstreamNode.getParallelism()) {
        partitioner =
                dynamic ? new ForwardForUnspecifiedPartitioner<>() : new ForwardPartitioner<>();
    } else if (partitioner == null) {
        partitioner = new RebalancePartitioner<Object>();
    }
    // Part of the code removed

    if (exchangeMode == null) {
        exchangeMode = StreamExchangeMode.UNDEFINED;
    }

    int uniqueId = getStreamEdges(upstreamNode.getId(), downstreamNode.getId()).size();

    StreamEdge edge =
            new StreamEdge(
                    upstreamNode,
                    downstreamNode,
                    typeNumber,
                    partitioner,
                    outputTag,
                    exchangeMode,
                    uniqueId,
                    intermediateDataSetId);
    // Connect the upstream and downstream nodes
    getStreamNode(edge.getSourceId()).addOutEdge(edge);
    getStreamNode(edge.getTargetId()).addInEdge(edge);
}
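
The default partitioner choice in this listing reduces to a small decision function. In the sketch below, the Kind enum and choose are hypothetical names, a null argument means "no partitioner was specified", and the part of the method that the listing omits is not modeled.

```java
final class PartitionerChoiceSketch {
    // Illustrative labels for the partitioner classes named in the listing.
    enum Kind { FORWARD, FORWARD_FOR_UNSPECIFIED, REBALANCE }

    static Kind choose(
            Kind specified, int upstreamParallelism, int downstreamParallelism, boolean dynamic) {
        if (specified != null) {
            return specified; // an explicit choice is kept as-is in this sketch
        }
        if (upstreamParallelism == downstreamParallelism) {
            return dynamic ? Kind.FORWARD_FOR_UNSPECIFIED : Kind.FORWARD;
        }
        return Kind.REBALANCE;
    }

    public static void main(String[] args) {
        System.out.println(choose(null, 4, 4, false));
        System.out.println(choose(null, 2, 4, false));
    }
}
```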

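Explanation

  • virtualSideOutputNodes and virtualPartitionNodes both hold virtual nodes.
  • Virtual nodes never appear in the StreamGraph itself. When an edge is added and the upstream node is virtual, the upstream is resolved recursively until a non-virtual node is found.
  • partitioner
  • If none was specified and the upstream and downstream parallelism are equal, ForwardPartitioner is used (ForwardForUnspecifiedPartitioner when the graph is dynamic); if none was specified and they differ, RebalancePartitioner is used.
  • If a forward partitioner was specified but the upstream and downstream parallelism differ (older versions simply threw an exception here): when the partitioner is a ForwardForConsecutiveHashPartitioner, it is replaced by its internal hashPartitioner; otherwise an UnsupportedOperationException is thrown.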
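
The recursive resolution of virtual nodes described above can be sketched with a single map. This is a simplification under stated assumptions: VirtualNodeSketch, addVirtualNode, and resolveUpstream are hypothetical names, and one map from virtual ID to wrapped ID stands in for both virtualSideOutputNodes and virtualPartitionNodes, ignoring the partitioner or output tag a real virtual node would carry.

```java
import java.util.HashMap;
import java.util.Map;

final class VirtualNodeSketch {
    // Virtual node ID -> ID of the node it wraps (which may itself be virtual).
    private final Map<Integer, Integer> virtualNodes = new HashMap<>();

    void addVirtualNode(int virtualId, int wrappedId) {
        virtualNodes.put(virtualId, wrappedId);
    }

    // Follow virtual nodes upstream until a non-virtual node is reached.
    int resolveUpstream(int id) {
        Integer wrapped = virtualNodes.get(id);
        return wrapped == null ? id : resolveUpstream(wrapped);
    }

    public static void main(String[] args) {
        VirtualNodeSketch graph = new VirtualNodeSketch();
        graph.addVirtualNode(2, 1); // node 2 is virtual and wraps real node 1
        graph.addVirtualNode(3, 2); // node 3 is virtual and wraps virtual node 2
        System.out.println(graph.resolveUpstream(3));
    }
}
```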