Flink中的数据抽象&交换&Credit&背压问题详解

文章详细阐述了Flink中数据的流转过程,包括Flink如何在操作符间传递数据,使用NetworkBufferPool管理内存,以及数据在Task间的交换。同时,讨论了Flink如何通过Credit机制处理背压问题,确保数据流的稳定和高效。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

ebab82c125e734c19e73fabaaca7f9cf.png全网最全大数据面试提升手册!

一、数据流转——Flink的数据抽象及数据交换过程

本部分讲一下flink底层是如何定义和在操作符之间传递数据的。

129c2fd611bae3dee1c2ebec8dbedcab.png

816a73d85de3c44f98a570765bec1336.png

51f7571113446beb1164b898a7c2bea7.png

5e022409909e640375cf97e2b3442843.png

d375def64fb1c106aa9c5ecf0f7186cf.png

b704df69d73f84e81bab77ee613a0930.png

其中,第一个构造函数的checkBufferAndGetAddress()方法能够得到direct buffer的内存地址,因此可以操作堆外内存。

1.2 ByteBuffer与NetworkBufferPool
12e6cc2b223f4116dbbe2d281535a694.png

c2acedbaf41dffe2a4d931f77f71d0c8.png

public abstract class AbstractReferenceCountedByteBuf extends AbstractByteBuf {

    private volatile int refCnt = 1;
    
    ......
}

994d69b86ccd09fc4f87a785eb3bc5ea.png

public NetworkBufferPool(int numberOfSegmentsToAllocate, int segmentSize) {

        ......
        
        try {
            this.availableMemorySegments = new ArrayBlockingQueue<>(numberOfSegmentsToAllocate);
        }
        catch (OutOfMemoryError err) {
            throw new OutOfMemoryError("Could not allocate buffer queue of length "
                    + numberOfSegmentsToAllocate + " - " + err.getMessage());
        }

        try {
            for (int i = 0; i < numberOfSegmentsToAllocate; i++) {
                ByteBuffer memory = ByteBuffer.allocateDirect(segmentSize);
                availableMemorySegments.add(MemorySegmentFactory.wrapPooledOffHeapMemory(memory, null));
            }
        }

        ......
        
        long allocatedMb = (sizeInLong * availableMemorySegments.size()) >> 20;

        LOG.info("Allocated {} MB for network buffer pool (number of memory segments: {}, bytes per segment: {}).",
                allocatedMb, availableMemorySegments.size(), segmentSize);
    }

4c4c601252a04097bf19242fab0468bc.png

ce6dbf9218449437b3f36a316328c90d.png

private void redistributeBuffers() throws IOException {
        assert Thread.holdsLock(factoryLock);

        // All buffers, which are not among the required ones
        final int numAvailableMemorySegment = totalNumberOfMemorySegments - numTotalRequiredBuffers;

        if (numAvailableMemorySegment == 0) {
            // in this case, we need to redistribute buffers so that every pool gets its minimum
            for (LocalBufferPool bufferPool : allBufferPools) {
                bufferPool.setNumBuffers(bufferPool.getNumberOfRequiredMemorySegments());
            }
            return;
        }

        long totalCapacity = 0; // long to avoid int overflow

        for (LocalBufferPool bufferPool : allBufferPools) {
            int excessMax = bufferPool.getMaxNumberOfMemorySegments() -
                bufferPool.getNumberOfRequiredMemorySegments();
            totalCapacity += Math.min(numAvailableMemorySegment, excessMax);
        }

        // no capacity to receive additional buffers?
        if (totalCapacity == 0) {
            return; // necessary to avoid div by zero when nothing to re-distribute
        }

        final int memorySegmentsToDistribute = MathUtils.checkedDownCast(
                Math.min(numAvailableMemorySegment, totalCapacity));

        long totalPartsUsed = 0; // of totalCapacity
        int numDistributedMemorySegment = 0;
        for (LocalBufferPool bufferPool : allBufferPools) {
            int excessMax = bufferPool.getMaxNumberOfMemorySegments() -
                bufferPool.getNumberOfRequiredMemorySegments();

            // shortcut
            if (excessMax == 0) {
                continue;
            }

            totalPartsUsed += Math.min(numAvailableMemorySegment, excessMax);


            final int mySize = MathUtils.checkedDownCast(
                    memorySegmentsToDistribute * totalPartsUsed / totalCapacity - numDistributedMemorySegment);

            numDistributedMemorySegment += mySize;
            bufferPool.setNumBuffers(bufferPool.getNumberOfRequiredMemorySegments() + mySize);
        }

        assert (totalPartsUsed == totalCapacity);
        assert (numDistributedMemorySegment == memorySegmentsToDistribute);
    }

6b1b1e045a24a6345a27e4877f5f7be8.png

/** 该内存池需要的最少内存片数目*/
    private final int numberOfRequiredMemorySegments;

    /**
     * 当前已经获得的内存片中,还没有写入数据的空白内存片
     */
    private final ArrayDeque<MemorySegment> availableMemorySegments = new ArrayDeque<MemorySegment>();

    /**
     * 注册的所有监控buffer可用性的监听器
     */
    private final ArrayDeque<BufferListener> registeredListeners = new ArrayDeque<>();

    /** 能给内存池分配的最大分片数*/
    private final int maxNumberOfMemorySegments;

    /** 当前内存池大小 */
    private int currentPoolSize;

    /**
     * 所有经由NetworkBufferPool分配的,被本内存池引用到的(非直接获得的)分片数
     */
    private int numberOfRequestedMemorySegments;

21b73fa83cf312a48a38ac77c01e123a.png

public void setNumBuffers(int numBuffers) throws IOException {
        synchronized (availableMemorySegments) {
            checkArgument(numBuffers >= numberOfRequiredMemorySegments,
                    "Buffer pool needs at least %s buffers, but tried to set to %s",
                    numberOfRequiredMemorySegments, numBuffers);

            if (numBuffers > maxNumberOfMemorySegments) {
                currentPoolSize = maxNumberOfMemorySegments;
            } else {
                currentPoolSize = numBuffers;
            }

            returnExcessMemorySegments();

            // If there is a registered owner and we have still requested more buffers than our
            // size, trigger a recycle via the owner.
            if (owner != null && numberOfRequestedMemorySegments > currentPoolSize) {
                owner.releaseMemory(numberOfRequestedMemorySegments - numBuffers);
            }
        }
    }

6a2fd2352462188884a8c4ff10edd614.png

e2b4539c8e81ee97a6afdfe26ac4723d.png

private void sendToTarget(T record, int targetChannel) throws IOException, InterruptedException {
        RecordSerializer<T> serializer = serializers[targetChannel];

        SerializationResult result = serializer.addRecord(record);

        while (result.isFullBuffer()) {
            if (tryFinishCurrentBufferBuilder(targetChannel, serializer)) {
                // If this was a full record, we are done. Not breaking
                // out of the loop at this point will lead to another
                // buffer request before breaking out (that would not be
                // a problem per se, but it can lead to stalls in the
                // pipeline).
                if (result.isFullRecord()) {
                    break;
                }
            }
            BufferBuilder bufferBuilder = requestNewBufferBuilder(targetChannel);

            result = serializer.continueWritingWithNextBufferBuilder(bufferBuilder);
        }
        checkState(!serializer.hasSerializedData(), "All data should be written at once");
        
        
        
        if (flushAlways) {
            targetPartition.flush(targetChannel);
        }
    }

64d30b41d72b895d0c70a1c3d0414d36.png

二、数据流转过程

上段讲了各层数据的抽象,这段讲讲数据在各个task之间exchange的过程。

1. 整体过程

看这张图:

ba17261ef1d40851cf30b7da108fb804.png

471313f40c172ee0ac8f5c9463b73a8f.png

5eda5c17b3c20398f476c782d32a3bab.png

68827a5fcd3cd417352d29a3e38d87b8.png

097dc82a4c86f987324c317aa9399cfe.png

//RecordWriterOutput.java
    @Override
    public void collect(StreamRecord<OUT> record) {
        if (this.outputTag != null) {
            // we are only responsible for emitting to the main input
            return;
        }
        //这里可以看到把记录交给了recordwriter
        pushToRecordWriter(record);
    }

然后recordwriter把数据发送到对应的通道。

//RecordWriter.java
    public void emit(T record) throws IOException, InterruptedException {
        //channelselector登场了
        for (int targetChannel : channelSelector.selectChannels(record, numChannels)) {
            sendToTarget(record, targetChannel);
        }
    }
    
        private void sendToTarget(T record, int targetChannel) throws IOException, InterruptedException {
        
        //选择序列化器并序列化数据
        RecordSerializer<T> serializer = serializers[targetChannel];

        SerializationResult result = serializer.addRecord(record);

        while (result.isFullBuffer()) {
            if (tryFinishCurrentBufferBuilder(targetChannel, serializer)) {
                // If this was a full record, we are done. Not breaking
                // out of the loop at this point will lead to another
                // buffer request before breaking out (that would not be
                // a problem per se, but it can lead to stalls in the
                // pipeline).
                if (result.isFullRecord()) {
                    break;
                }
            }
            BufferBuilder bufferBuilder = requestNewBufferBuilder(targetChannel);

            //写入channel
            result = serializer.continueWritingWithNextBufferBuilder(bufferBuilder);
        }
        checkState(!serializer.hasSerializedData(), "All data should be written at once");

        if (flushAlways) {
            targetPartition.flush(targetChannel);
        }
    }

接下来是把数据推给底层设施(netty)的过程:

//ResultPartition.java
    @Override
    public void flushAll() {
        for (ResultSubpartition subpartition : subpartitions) {
            subpartition.flush();
        }
    }
    
        //PartitionRequestQueue.java
        void notifyReaderNonEmpty(final NetworkSequenceViewReader reader) {
        //这里交给了netty server线程去推
        ctx.executor().execute(new Runnable() {
            @Override
            public void run() {
                ctx.pipeline().fireUserEventTriggered(reader);
            }
        });
    }

netty相关的部分:

//AbstractChannelHandlerContext.java
    public ChannelHandlerContext fireUserEventTriggered(final Object event) {
        if (event == null) {
            throw new NullPointerException("event");
        } else {
            final AbstractChannelHandlerContext next = this.findContextInbound();
            EventExecutor executor = next.executor();
            if (executor.inEventLoop()) {
                next.invokeUserEventTriggered(event);
            } else {
                executor.execute(new OneTimeTask() {
                    public void run() {
                        next.invokeUserEventTriggered(event);
                    }
                });
            }

            return this;
        }
    }

最后真实的写入:

//PartittionRequesetQueue.java
    private void enqueueAvailableReader(final NetworkSequenceViewReader reader) throws Exception {
        if (reader.isRegisteredAsAvailable() || !reader.isAvailable()) {
            return;
        }
        // Queue an available reader for consumption. If the queue is empty,
        // we try trigger the actual write. Otherwise this will be handled by
        // the writeAndFlushNextMessageIfPossible calls.
        boolean triggerWrite = availableReaders.isEmpty();
        registerAvailableReader(reader);

        if (triggerWrite) {
            writeAndFlushNextMessageIfPossible(ctx.channel());
        }
    }
    
    private void writeAndFlushNextMessageIfPossible(final Channel channel) throws IOException {
        
        ......

                next = reader.getNextBuffer();
                if (next == null) {
                    if (!reader.isReleased()) {
                        continue;
                    }
                    markAsReleased(reader.getReceiverId());

                    Throwable cause = reader.getFailureCause();
                    if (cause != null) {
                        ErrorResponse msg = new ErrorResponse(
                            new ProducerFailedException(cause),
                            reader.getReceiverId());

                        ctx.writeAndFlush(msg);
                    }
                } else {
                    // This channel was now removed from the available reader queue.
                    // We re-add it into the queue if it is still available
                    if (next.moreAvailable()) {
                        registerAvailableReader(reader);
                    }

                    BufferResponse msg = new BufferResponse(
                        next.buffer(),
                        reader.getSequenceNumber(),
                        reader.getReceiverId(),
                        next.buffersInBacklog());

                    if (isEndOfPartitionEvent(next.buffer())) {
                        reader.notifySubpartitionConsumed();
                        reader.releaseAllResources();

                        markAsReleased(reader.getReceiverId());
                    }

                    // Write and flush and wait until this is done before
                    // trying to continue with the next buffer.
                    channel.writeAndFlush(msg).addListener(writeListener);

                    return;
                }
        
        ......
        
    }

4f1f126848133924e714131650f91a0d.jpeg

//CreditBasedPartitionRequestClientHandler.java
    private void decodeMsg(Object msg) throws Throwable {
        final Class<?> msgClazz = msg.getClass();

        // ---- Buffer --------------------------------------------------------
        if (msgClazz == NettyMessage.BufferResponse.class) {
            NettyMessage.BufferResponse bufferOrEvent = (NettyMessage.BufferResponse) msg;

            RemoteInputChannel inputChannel = inputChannels.get(bufferOrEvent.receiverId);
            if (inputChannel == null) {
                bufferOrEvent.releaseBuffer();

                cancelRequestFor(bufferOrEvent.receiverId);

                return;
            }

            decodeBufferOrEvent(inputChannel, bufferOrEvent);

        } 
        
        ......
        
    }

4f53cb4cff121f13314a0de8b6f44341.png

//StreamElementSerializer.java
    public StreamElement deserialize(DataInputView source) throws IOException {
        int tag = source.readByte();
        if (tag == TAG_REC_WITH_TIMESTAMP) {
            long timestamp = source.readLong();
            return new StreamRecord<T>(typeSerializer.deserialize(source), timestamp);
        }
        else if (tag == TAG_REC_WITHOUT_TIMESTAMP) {
            return new StreamRecord<T>(typeSerializer.deserialize(source));
        }
        else if (tag == TAG_WATERMARK) {
            return new Watermark(source.readLong());
        }
        else if (tag == TAG_STREAM_STATUS) {
            return new StreamStatus(source.readInt());
        }
        else if (tag == TAG_LATENCY_MARKER) {
            return new LatencyMarker(source.readLong(), new OperatorID(source.readLong(), source.readLong()), source.readInt());
        }
        else {
            throw new IOException("Corrupt stream, found tag: " + tag);
        }
    }

然后再次在StreamInputProcessor.processInput()循环中得到处理。

至此,数据在跨jvm的节点之间的流转过程就讲完了。

三、Credit漫谈

c0cc5d920bb5c610ddbf441f2d46289e.png

1. 背压问题
2dbf595543ca113aab34d34ea0554d9b.png

那么Flink又是如何处理背压的呢?答案也是靠这些缓冲池。

f6c8ede300e21578f5b23299b542150a.png

这张图说明了Flink在生产和消费数据时的大致情况。ResultPartition和InputGate在输出和输入数据时,都要向NetworkBufferPool申请一块MemorySegment作为缓存池。

9510dda79499edb877323f82ea4385c2.png

5a9a7d36db0647bb626c8abdaee8eb06.png
f760b3c862d56fdca6f3baf2406a63d4.png 5acfb7be735627f9dca52bd562584b2b.png

基于Credit的流控就是这样一种建立在信用(消费数据的能力)上的,面向每个虚链路(而非端到端的)流模型,如下图所示:

100d79acdfde0e8ed9bc03f1b108fc3b.png

首先,下游会向上游发送一条credit message,用以通知其目前的信用(可联想信用卡的可用额度),然后上游会根据这个信用消息来决定向下游发送多少数据。当上游把数据发送给下游时,它就从下游的信用卡上划走相应的额度(credit balance):

447cc0c7a09f4669b7b3b75c1e760d9c.png
577411d3fd963c0a91f3c41a8f3de238.png b3c38d7185b76d3d5f96d912bf40502e.png

ec3bb4ec37f06cb048f2940856c95c70.png

d9aafbfa654b6eff0d3377cbecc39c86.png

如上图所示,a是面向连接的流设计,b是端到端的流设计。其中,a的设计使得当下游节点3因某些情况必须缓存数据暂缓处理时,每个上游节点(1和2)都可以利用其缓存保存数据;而端到端的设计b里,只有节点3的缓存才可以用于保存数据(读者可以从如何实现上想想为什么)。

如果这个文章对你有帮助,不要忘记 「在看」 「点赞」 「收藏」 三连啊喂!

615978483f8e472faa4ba34c572fa58d.png

8905c06d47295c20b9692dd547f8b73c.jpeg

2022年全网首发|大数据专家级技能模型与学习指南(胜天半子篇)

互联网最坏的时代可能真的来了

我在B站读大学,大数据专业

我们在学习Flink的时候,到底在学习什么?

193篇文章暴揍Flink,这个合集你需要关注一下

Flink生产环境TOP难题与优化,阿里巴巴藏经阁YYDS

Flink CDC我吃定了耶稣也留不住他!| Flink CDC线上问题小盘点

我们在学习Spark的时候,到底在学习什么?

在所有Spark模块中,我愿称SparkSQL为最强!

硬刚Hive | 4万字基础调优面试小总结

数据治理方法论和实践小百科全书

标签体系下的用户画像建设小指南

4万字长文 | ClickHouse基础&实践&调优全视角解析

【面试&个人成长】2021年过半,社招和校招的经验之谈

大数据方向另一个十年开启 |《硬刚系列》第一版完结

我写过的关于成长/面试/职场进阶的文章

当我们在学习Hive的时候在学习什么?「硬刚Hive续集」

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值