Apache Flink is one of the most popular big data compute engines in China. It natively supports high throughput, low latency, exactly-once semantics, stateful streaming, and more, and reading its source code helps deepen one's understanding of the framework.
In any real-time big data system, the backpressure implementation decides whether the system can withstand a flood of data. Mainstream engines such as Spark and Storm each ship their own backpressure mechanism, and so does Flink: when a downstream operator consumes more slowly than the upstream operator produces, backpressure is triggered and the data flow is halted. This chapter analyzes how Flink implements backpressure "naturally".
In an earlier chapter I analyzed Flink's Netty-based communication for data exchange (see https://blog.youkuaiyun.com/ws0owws0ow/article/details/114634462). On top of that Netty layer, Flink adds a credit-based exchange protocol. Every NettyMessage a ResultSubpartition sends to a downstream InputChannel carries a backlog count (> 0) telling the receiver how many buffers (events excluded) are still unconsumed in the subpartition's deque, while numCreditsAvailable tracks how many credits the downstream side has left. On receiving a message with a backlog, the downstream InputChannel requests as many MemorySegments (credits) from its LocalBufferPool as it can to cover the upstream's unconsumed buffers; if at least one credit is obtained, it notifies the upstream side so that numCreditsAvailable grows by the acquired amount. If no additional credit can be obtained and numCreditsAvailable is 0, Flink stops sending data downstream. In other words, producer and consumer continuously sense each other's processing capacity, and as soon as one side knows the other cannot keep up, the transfer stops and backpressure takes effect.
The way I read "naturally" is that backpressure sits on top of the solid Netty communication layer Flink already built: each side only has to piggyback its own state onto the data exchange and toggle backpressure based on the state reported by the other side.
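To make that handshake concrete before diving into the real classes, here is a minimal toy sketch I wrote (plain Java, not Flink code; all class and method names are invented for illustration). The producer spends one credit per buffer sent and stalls at zero; the consumer grants credits only while its buffer pool has capacity:
import java.util.ArrayDeque;
import java.util.Deque;

public class CreditFlowSketch {
    /** Upstream side: mirrors ResultSubpartition's backlog/credit bookkeeping. */
    static class Producer {
        final Deque<String> backlog = new ArrayDeque<>();
        int creditsAvailable;
        Producer(int initialCredit) { creditsAvailable = initialCredit; }
        void emit(String record) { backlog.add(record); }
        void addCredit(int delta) { creditsAvailable += delta; } // AddCredit message
        /** Returns null when backpressured: no credit means no downstream buffer. */
        String sendIfPossible() {
            if (creditsAvailable == 0 || backlog.isEmpty()) return null;
            creditsAvailable--;                                  // one credit per buffer sent
            return backlog.poll();
        }
    }
    /** Downstream side: grants credits only while its buffer pool has free segments. */
    static class Consumer {
        int freeBuffers;
        Consumer(int poolSize) { freeBuffers = poolSize; }
        /** Called with the announced backlog; returns the credits actually granted. */
        int onSenderBacklog(int backlog) {
            int granted = Math.min(backlog, freeBuffers);
            freeBuffers -= granted;
            return granted;
        }
    }
    public static void main(String[] args) {
        Producer p = new Producer(2);   // initial credit, cf. buffers-per-channel = 2
        Consumer c = new Consumer(3);   // tiny pool so backpressure shows quickly
        for (int i = 0; i < 6; i++) p.emit("record-" + i);
        String buf;
        while ((buf = p.sendIfPossible()) != null) {
            System.out.println("sent " + buf + ", backlog=" + p.backlog.size());
            p.addCredit(c.onSenderBacklog(p.backlog.size()));    // credit announcement
        }
        System.out.println("backpressured with backlog=" + p.backlog.size());
    }
}
The real protocol differs in detail (exclusive vs. floating buffers, batched credit announcements), but the stall condition is the same: numCreditsAvailable == 0 and nothing but events left to send.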
The upstream ResultSubpartition tracks the number of buffers piling up via buffersInBacklog, and senses the downstream credit count through numCreditsAvailable in its reader.
public class PipelinedSubpartition extends ResultSubpartition
implements CheckpointedResultSubpartition, ChannelStateHolder {
private static final Logger LOG = LoggerFactory.getLogger(PipelinedSubpartition.class);
// ------------------------------------------------------------------------
/** All buffers of this subpartition. Access to the buffers is synchronized on this object. */
//checkpoint events are added here too; older versions used a plain ArrayDeque, which had no notion of event priority
private final PrioritizedDeque<BufferConsumerWithPartialRecordLength> buffers = new PrioritizedDeque<>();
/** The number of non-event buffers currently in this subpartition. */
@GuardedBy("buffers")
//number of buffers remaining in the buffers deque
//incremented on add
//decremented on poll
//the backlog is attached to every outgoing BufferResponse, and the downstream side uses it to request matching floating buffers (credits)
private int buffersInBacklog;
...
}
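The PrioritizedDeque contract matters here: a priority element is polled after the existing priority elements but before any ordinary one. Below is a simplified stand-in I wrote purely to illustrate those semantics (the real class lives in org.apache.flink.runtime.io.network.partition and tracks more state):
import java.util.ArrayDeque;
import java.util.Deque;

class SimplePrioritizedDeque<T> {
    private final Deque<T> deque = new ArrayDeque<>();
    private int numPriorityElements;

    /** Insert behind the existing priority elements but ahead of all ordinary ones. */
    void addPriorityElement(T element) {
        Deque<T> priorityPrefix = new ArrayDeque<>();
        for (int i = 0; i < numPriorityElements; i++) {
            priorityPrefix.addLast(deque.pollFirst());
        }
        deque.addFirst(element);
        while (!priorityPrefix.isEmpty()) {
            deque.addFirst(priorityPrefix.pollLast());
        }
        numPriorityElements++;
    }

    /** Ordinary elements are simply appended at the tail. */
    void add(T element) { deque.addLast(element); }

    T poll() {
        T polled = deque.pollFirst();
        // priority elements sit at the front, so any poll drains them first
        if (polled != null && numPriorityElements > 0) numPriorityElements--;
        return polled;
    }

    int getNumPriorityElements() { return numPriorityElements; }
}
So a checkpoint barrier enqueued via addPriorityElement overtakes buffered records, but not barriers enqueued before it.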
class CreditBasedSequenceNumberingViewReader implements BufferAvailabilityListener, NetworkSequenceViewReader {
private final Object requestLock = new Object();
private final InputChannelID receiverId;
private final PartitionRequestQueue requestQueue;
private volatile ResultSubpartitionView subpartitionView;
/**
* The status indicating whether this reader is already enqueued in the pipeline for transferring
* data or not.
*
* <p>It is mainly used to avoid repeated registrations but should be accessed by a single
* thread only since there is no synchronisation.
*/
private boolean isRegisteredAsAvailable = false;
/** The number of available buffers for holding data on the consumer side. */
//number of buffers (credits) still available in the downstream inputChannel
//when it reaches 0 the downstream side has no buffer left, and we stop sending data to it
private int numCreditsAvailable;
CreditBasedSequenceNumberingViewReader(
InputChannelID receiverId,
int initialCredit,
PartitionRequestQueue requestQueue) {
this.receiverId = receiverId;
//initialCredit corresponds to taskmanager.network.memory.buffers-per-channel, default 2
//it is also the minimum capacity the downstream buffer pool guarantees each InputChannel
this.numCreditsAvailable = initialCredit;
this.requestQueue = requestQueue;
}
....
}
How ResultSubpartition maintains buffersInBacklog:
- When the InputChannel emits data into the local ResultSubpartition, the subpartition adds the received buffer to its buffer queue and, if it is a buffer (not an event), increments buffersInBacklog.
- When the ResultSubpartition sends data downstream, it polls the buffer to send off the queue and, if the polled item is buffer-typed, decrements buffersInBacklog.
Note: the emit path from InputChannel into ResultSubpartition will be covered in a later chapter.
private boolean add(BufferConsumer bufferConsumer, int partialRecordLength, boolean finish) {
checkNotNull(bufferConsumer);
final boolean notifyDataAvailable;
int prioritySequenceNumber = -1;
synchronized (buffers) {
if (isFinished || isReleased) {
bufferConsumer.close();
return false;
}
// Add the bufferConsumer and update the stats
//add the buffer to the PrioritizedDeque; high-priority buffers go to the head of the queue
if (addBuffer(bufferConsumer, partialRecordLength)) {
prioritySequenceNumber = sequenceNumber;
}
updateStatistics(bufferConsumer);//total buffer count +1
//buffersInBacklog counts the buffers in this ResultPartition not yet consumed
//when the ResultPartition writes data downstream it polls from buffers, and buffersInBacklog is decremented
increaseBuffersInBacklog(bufferConsumer);//backlog count +1
notifyDataAvailable = finish || shouldNotifyDataAvailable();
isFinished |= finish;
}
if (prioritySequenceNumber != -1) {
notifyPriorityEvent(prioritySequenceNumber);
}
//if data is available (e.g. finished, not in a blocked state), notify the downstream inputChannel to consume
if (notifyDataAvailable) {
notifyDataAvailable();
}
return true;
}
private void increaseBuffersInBacklog(BufferConsumer buffer) {
assert Thread.holdsLock(buffers);
//only buffer-typed data increases the backlog count
if (buffer != null && buffer.isBuffer()) {
buffersInBacklog++;
}
}
How the backlog is decreased (the Netty communication logic below was also covered in the Netty chapter):
BufferAndBacklog pollBuffer() {
//buffers is a PrioritizedDeque<BufferConsumerWithPartialRecordLength>
//holding the buffers and events not yet consumed
synchronized (buffers) {
//true while an aligned checkpoint barrier is blocking this subpartition
if (isBlockedByCheckpoint) {
return null;
}
Buffer buffer = null;
if (buffers.isEmpty()) {
flushRequested = false;
}
while (!buffers.isEmpty()) {
//peek at the highest-priority BufferConsumer in the deque without removing it
BufferConsumer bufferConsumer = buffers.peek().getBufferConsumer();
//build the buffer
buffer = bufferConsumer.build();
checkState(bufferConsumer.isFinished() || buffers.size() == 1,
"When there are multiple buffers, an unfinished bufferConsumer can not be at the head of the buffers queue.");
if (buffers.size() == 1) {
// turn off flushRequested flag if we drained all of the available data
flushRequested = false;
}
//once the bufferConsumer is finished, poll it off the buffers deque and close it
if (bufferConsumer.isFinished()) {
buffers.poll().getBufferConsumer().close();
//buffersInBacklog - 1
decreaseBuffersInBacklogUnsafe(bufferConsumer.isBuffer());
}
//a fully drained bufferConsumer leaves 0 readable bytes
if (buffer.readableBytes() > 0) {
//the normal case: the buffer holds data, so break out of the loop
break;
}
//empty buffer: recycle its memory back to the buffer pool
buffer.recycleBuffer();
buffer = null;
if (!bufferConsumer.isFinished()) {
break;
}
}
if (buffer == null) {
return null;
}
//an ALIGNED_EXACTLY_ONCE_CHECKPOINT_BARRIER buffer blocks this subpartition
if (buffer.getDataType().isBlockingUpstream()) {
isBlockedByCheckpoint = true;
}
//update statistics (total bytes read)
updateStatistics(buffer);
// Do not report last remaining buffer on buffers as available to read (assuming it's unfinished).
// It will be reported for reading either on flush or when the number of buffers in the queue
// will be 2 or more.
return new BufferAndBacklog(
buffer,
//buffersInBacklog: how many buffers are still piled up, unconsumed, in this ResultSubpartition
getBuffersInBacklog(),
isDataAvailableUnsafe() ? getNextBufferTypeUnsafe() : Buffer.DataType.NONE,
sequenceNumber++);
}
}
private void decreaseBuffersInBacklogUnsafe(boolean isBuffer) {
assert Thread.holdsLock(buffers);
//only buffer-typed data decrements the backlog
if (isBuffer) {
buffersInBacklog--;
}
}
numCreditsAvailable holds the downstream inputChannel's credit count. It is initialized to 2 by default, i.e. the minimum capacity the downstream LocalBufferPool guarantees each InputChannel. The accounting works as follows:
- when a NettyMessage is written downstream, numCreditsAvailable is decremented by 1
- when the downstream side successfully obtains credits from its LocalBufferPool, numCreditsAvailable is increased by the number of credits obtained
//polls a BufferAndAvailability from the reader, wraps it in a BufferResponse and sends it to the client, whose channelRead will receive and process the buffer
private void writeAndFlushNextMessageIfPossible(final Channel channel) throws IOException {
if (fatalError || !channel.isWritable()) {
return;
}
// The logic here is very similar to the combined input gate and local
// input channel logic. You can think of this class acting as the input
// gate and the consumed views as the local input channels.
BufferAndAvailability next = null;
try {
while (true) {
//poll the NetworkSequenceViewReader that was just added to availableReaders
NetworkSequenceViewReader reader = pollAvailableReader();
// No queue with available data. We allow this here, because
// of the write callbacks that are executed after each write.
if (reader == null) {
return;
}
//polls a buffer out of the NetworkSequenceViewReader
//decrements the downstream credit count by one
//decrements the backlog count by one
next = reader.getNextBuffer();
if (next == null) {
if (!reader.isReleased()) {
continue;
}
Throwable cause = reader.getFailureCause();
if (cause != null) {
ErrorResponse msg = new ErrorResponse(
new ProducerFailedException(cause),
reader.getReceiverId());
ctx.writeAndFlush(msg);
}
} else {
// This channel was now removed from the available reader queue.
// We re-add it into the queue if it is still available
if (next.moreAvailable()) {
registerAvailableReader(reader);
}
//the NettyMessage wrapper
BufferResponse msg = new BufferResponse(
next.buffer(),
next.getSequenceNumber(),
reader.getReceiverId(),
next.buffersInBacklog());
// Write and flush and wait until this is done before
// trying to continue with the next buffer.
//write the BufferResponse to the channel and attach a future listener
//the listener reacts according to the outcome of the returned future
channel.writeAndFlush(msg).addListener(writeListener);
return;
}
}
} catch (Throwable t) {
if (next != null) {
next.buffer().recycleBuffer();
}
throw new IOException(t.getMessage(), t);
}
}
public BufferAndAvailability getNextBuffer() throws IOException {
//poll the buffer from the deque
BufferAndBacklog next = subpartitionView.getNextBuffer();
if (next != null) {
//one credit of the downstream inputChannel is consumed
if (next.buffer().isBuffer() && --numCreditsAvailable < 0) {
throw new IllegalStateException("no credit available");
}
final Buffer.DataType nextDataType = getNextDataType(next);
return new BufferAndAvailability(
next.buffer(),
nextDataType,
next.buffersInBacklog(),
next.getSequenceNumber());
} else {
return null;
}
}
The InputGate looks up the InputChannel matching the channel id sent from upstream and adds the received buffer to that channel's deque, then checks the backlog that came with the message. If the upstream has buffers piling up, the channel requests as many MemorySegments (credits) from the LocalBufferPool as it can, and if at least one credit is successfully obtained it notifies the upstream to increase its credit count:
//there are two main message types:
//BufferResponse and ErrorResponse
private void decodeMsg(Object msg) throws Throwable {
final Class<?> msgClazz = msg.getClass();
// ---- Buffer --------------------------------------------------------
if (msgClazz == NettyMessage.BufferResponse.class) {
NettyMessage.BufferResponse bufferOrEvent = (NettyMessage.BufferResponse) msg;
//inputChannels holds all inputChannels of this inputGate
RemoteInputChannel inputChannel = inputChannels.get(bufferOrEvent.receiverId);
if (inputChannel == null || inputChannel.isReleased()) {
bufferOrEvent.releaseBuffer();
cancelRequestFor(bufferOrEvent.receiverId);
return;
}
try {
//start processing the buffer
decodeBufferOrEvent(inputChannel, bufferOrEvent);
} catch (Throwable t) {
inputChannel.onError(t);
}
} else if (msgClazz == NettyMessage.ErrorResponse.class) {
// ---- Error ---------------------------------------------------------
.....
}
public void onBuffer(Buffer buffer, int sequenceNumber, int backlog) throws IOException {
boolean recycleBuffer = true;
try {
if (expectedSequenceNumber != sequenceNumber) {
onError(new BufferReorderingException(expectedSequenceNumber, sequenceNumber));
return;
}
final boolean wasEmpty;
boolean firstPriorityEvent = false;
//receivedBuffers is a PrioritizedDeque<SequenceBuffer>, the deque of buffers waiting to be consumed
synchronized (receivedBuffers) {
// Similar to notifyBufferAvailable(), make sure that we never add a buffer
// after releaseAllResources() released all buffers from receivedBuffers
// (see above for details).
if (isReleased.get()) {
return;
}
wasEmpty = receivedBuffers.isEmpty();
//check whether the buffer is flagged as priority
if (buffer.getDataType().hasPriority()) {
//if so, add it to the priority section at the head of receivedBuffers
receivedBuffers.addPriorityElement(new SequenceBuffer(buffer, sequenceNumber));
if (channelStatePersister.checkForBarrier(buffer)) {
// checkpoint was not yet started by task thread,
// so remember the numbers of buffers to spill for the time when it will be started
numBuffersOvertaken = receivedBuffers.getNumUnprioritizedElements();
}
firstPriorityEvent = receivedBuffers.getNumPriorityElements() == 1;
} else {
//otherwise simply append it to receivedBuffers in order
receivedBuffers.add(new SequenceBuffer(buffer, sequenceNumber));
channelStatePersister.maybePersist(buffer);
}
++expectedSequenceNumber;
}
recycleBuffer = false;
if (firstPriorityEvent) {
notifyPriorityEvent(sequenceNumber);
}
if (wasEmpty) {
notifyChannelNonEmpty();
}
//handle the backlog reported by the upstream ResultSubpartition
if (backlog >= 0) {
onSenderBacklog(backlog);
}
} finally {
//finally, recycle the buffer if ownership was never handed over
if (recycleBuffer) {
buffer.recycleBuffer();
}
}
}
The channel then tries to request buffers from the LocalBufferPool, the credit being the number successfully obtained. Two cases arise:
- at least 1 buffer is obtained: an addCredit notification is sent to the upstream ResultSubpartition, whose numCreditsAvailable grows by the granted credits
- no buffer can be obtained: once the upstream's numCreditsAvailable drops to 0 it stops sending data downstream, and backpressure is in effect
void onSenderBacklog(int backlog) throws IOException {
//returns the number of floating buffers obtained
//initialCredit corresponds to taskmanager.network.memory.buffers-per-channel, default 2
//the LocalBufferPool's maximum number of memory segments = number of subpartitions * initialCredit + number of floating buffers
int numRequestedBuffers = bufferManager.requestFloatingBuffers(backlog + initialCredit);
if (numRequestedBuffers > 0 && unannouncedCredit.getAndAdd(numRequestedBuffers) == 0) {
//as soon as any buffer was obtained, announce the unannounced credits to the upstream ResultSubpartition
notifyCreditAvailable();
}
}
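For reference, the two knobs behind the sizing comment above are configurable in flink-conf.yaml; to the best of my knowledge the documented defaults are the values shown, so a gate with 4 channels gets at most 4 * 2 + 8 = 16 segments:
taskmanager.network.memory.buffers-per-channel: 2
taskmanager.network.memory.floating-buffers-per-gate: 8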
int requestFloatingBuffers(int numRequired) {
//number of buffers successfully obtained
int numRequestedBuffers = 0;
synchronized (bufferQueue) {
// Similar to notifyBufferAvailable(), make sure that we never add a buffer after channel
// released all buffers via releaseAllResources().
if (inputChannel.isReleased()) {
return numRequestedBuffers;
}
//number of buffers we need
numRequiredBuffers = numRequired;
//if bufferQueue does not hold enough available buffers yet, request MemorySegments from the LocalBufferPool
while (bufferQueue.getAvailableBufferSize() < numRequiredBuffers && !isWaitingForFloatingBuffers) {
BufferPool bufferPool = inputChannel.inputGate.getBufferPool();
//request a MemorySegment from the LocalBufferPool
Buffer buffer = bufferPool.requestBuffer();
if (buffer != null) {
//on success, add it to the floating buffers
bufferQueue.addFloatingBuffer(buffer);
numRequestedBuffers++;
} else if (bufferPool.addBufferListener(this)) {
isWaitingForFloatingBuffers = true;
break;
}
}
}
return numRequestedBuffers;
}
When the request succeeds, the upstream is notified to add the credits:
protected void channelRead0(ChannelHandlerContext ctx, NettyMessage msg) throws Exception {
try {
Class<?> msgClazz = msg.getClass();
// ----------------------------------------------------------------
// Intermediate result partition requests
// ----------------------------------------------------------------
//a PartitionRequest registering a downstream consumer
if (msgClazz == PartitionRequest.class) {
....
} else if (msgClazz == AddCredit.class) {
AddCredit request = (AddCredit) msg;
//add credits
outboundQueue.addCreditOrResumeConsumption(request.receiverId, reader -> reader.addCredit(request.credit));
} else if (msgClazz == ResumeConsumption.class) {
ResumeConsumption request = (ResumeConsumption) msg;
outboundQueue.addCreditOrResumeConsumption(request.receiverId, NetworkSequenceViewReader::resumeConsumption);
} else {
LOG.warn("Received unexpected client request: {}", msg);
}
} catch (Throwable t) {
respondWithError(ctx, t);
}
}
//increase the credit count
public void addCredit(int creditDeltas) {
numCreditsAvailable += creditDeltas;
}
After the credits have been added, the server goes on to notify the downstream consumer, and this is where the actual backpressure switch is hit: reader.isAvailable()
void addCreditOrResumeConsumption(
InputChannelID receiverId,
Consumer<NetworkSequenceViewReader> operation) throws Exception {
if (fatalError) {
return;
}
//allReaders holds all reader views registered by remote consumers
NetworkSequenceViewReader reader = allReaders.get(receiverId);
if (reader != null) {
//add credits
operation.accept(reader);
//after adding credits, try to resume sending data downstream
enqueueAvailableReader(reader);
} else {
throw new IllegalStateException("No reader for receiverId = " + receiverId + " exists.");
}
}
private void enqueueAvailableReader(final NetworkSequenceViewReader reader) throws Exception {
//checks whether the downstream side still has usable credits (buffers)
//when it does not, isAvailable = false: backpressure kicks in and no more data is sent downstream
if (reader.isRegisteredAsAvailable() || !reader.isAvailable()) {
return;
}
// Queue an available reader for consumption. If the queue is empty,
// we try trigger the actual write. Otherwise this will be handled by
// the writeAndFlushNextMessageIfPossible calls.
//availableReaders holds the NetworkSequenceViewReaders that may write downstream
//registerAvailableReader below adds the current reader
//when data is written downstream later, readers are polled out of availableReaders
boolean triggerWrite = availableReaders.isEmpty();
//add the reader to availableReaders
registerAvailableReader(reader);
//an empty queue means we may trigger the write ourselves
if (triggerWrite) {
//write data downstream
writeAndFlushNextMessageIfPossible(ctx.channel());
}
}
Every message the ResultSubpartition sends downstream passes through reader.isAvailable(), which mainly checks whether the downstream side still has credits; if it does not, the method returns immediately and no data is written downstream.
public boolean isAvailable() {
//returns true when the conditions for writing a buffer downstream are met
return subpartitionView.isAvailable(numCreditsAvailable);
}
public boolean isAvailable(int numCreditsAvailable) {
synchronized (buffers) {
//is there any credit left?
if (numCreditsAvailable > 0) {
return isDataAvailableUnsafe();
}
//with no credit left, only event-typed data may still be sent
final Buffer.DataType dataType = getNextBufferTypeUnsafe();
return dataType.isEvent();
}
}
private boolean isDataAvailableUnsafe() {
assert Thread.holdsLock(buffers);
//isBlockedByCheckpoint: true while blocked by an ALIGNED_EXACTLY_ONCE_CHECKPOINT_BARRIER buffer
//flushRequested: true when a flush has been requested
//getNumberOfFinishedBuffers: > 0 when at least one finished buffer exists
return !isBlockedByCheckpoint && (flushRequested || getNumberOfFinishedBuffers() > 0);
}
That is Flink's backpressure flow end to end: it "naturally" leverages the solid Netty communication layer Flink built, recording backlog and credit as data is exchanged and stored so that upstream and downstream always sense each other's state, much like the handling of ALIGNED_EXACTLY_ONCE_CHECKPOINT_BARRIER checkpoints (a ResultSubpartition that hits such a barrier also blocks its output). In production we should avoid backpressure as much as possible, since the blocked side may keep accumulating data until it runs out of memory. So besides watching the backpressure view in the Web UI, the Metrics-related APIs expose far more detail for keeping the cluster healthy.
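As a concrete starting point for such monitoring, here is a small probe I sketched against the JobManager's REST API; the backpressure endpoint is exposed at /jobs/:jobid/vertices/:vertexid/backpressure, while the host, job id and vertex id below are placeholders to replace:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BackpressureProbe {
    public static void main(String[] args) throws Exception {
        // args[0] = job id, args[1] = vertex id; JobManager assumed at localhost:8081
        String url = "http://localhost:8081/jobs/" + args[0]
                + "/vertices/" + args[1] + "/backpressure";
        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        // the JSON reports a per-subtask backpressure level ("ok" / "low" / "high")
        System.out.println(response.body());
    }
}
Channel-level metrics such as inPoolUsage and outPoolUsage (and isBackPressured in newer versions) tend to give earlier warning than the sampled Web UI view. Thanks.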