1. Waking up the Selector
The previous article covered how a message gets produced: in short, it is appended into the RecordAccumulator and stored there in the form of a ByteBuffer.
So once the message has been written into that ByteBuffer, what happens next?
First, recall from the previous article that the lock Kafka takes while appending is scoped to a single Deque<RecordBatch> dq, which is looked up by TopicPartition (tp): it is the queue of batches headed for that partition. Each element of this queue, a RecordBatch, is the carrier for the messages about to be sent to the broker. One RecordBatch can hold multiple messages, and our message ends up appended into the ByteBuffer it wraps at the very bottom.
Deque<RecordBatch> dq = getOrCreateDeque(tp);
synchronized (dq) {
    if (closed) {
        throw new IllegalStateException("Cannot send after the producer is closed.");
    }
    RecordAppendResult appendResult = tryAppend(timestamp, key, value, callback, dq);
    if (appendResult != null) {
        return appendResult;
    }
}
Here is what happens when the RecordBatch turns out to be full (the moment a new message no longer fits, it is treated as full):
private RecordAppendResult tryAppend(long timestamp, byte[] key, byte[] value, Callback callback, Deque<RecordBatch> deque) {
    // peek at the last (newest) batch in the deque
    RecordBatch last = deque.peekLast();
    if (last != null) {
        FutureRecordMetadata future = last.tryAppend(timestamp, key, value, callback, time.milliseconds());
        if (future == null) {
            // the new record does not fit: close the batch so it becomes read-only and sendable
            last.records.close();
        } else {
            return new RecordAppendResult(future, deque.size() > 1 || last.records.isFull(), false);
        }
    }
    // returning null tells the caller to allocate a new batch
    return null;
}
Once the batch is full, the reference to the underlying buffer is handed to the ByteBuffer of the MemoryRecords inside the RecordBatch, and the buffer is flipped into read mode.
public void close() {
    if (writable) {
        // close the compressor to fill-in wrapper message metadata if necessary
        compressor.close();
        // flip the underlying buffer to be ready for reads
        buffer = compressor.buffer();
        buffer.flip();
        // reset the writable flag
        writable = false;
    }
}
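The flip() here is plain java.nio behavior rather than anything Kafka-specific: it turns a buffer that was being written into one that can be read from the beginning. A tiny self-contained demo (class name is just for illustration):

import java.nio.ByteBuffer;

public class FlipDemo {
    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocate(16);
        buffer.put("hi".getBytes());          // write mode: position = 2, limit = 16
        buffer.flip();                        // read mode:  position = 0, limit = 2
        byte[] out = new byte[buffer.remaining()];
        buffer.get(out);                      // reads exactly the bytes that were written
        System.out.println(new String(out));  // prints "hi"
    }
}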
So when does the send actually happen? When a RecordBatch is full, or a new RecordBatch has just been created, the sender is woken up, and the wakeup propagates along sender => NetworkClient => Selector => nioSelector:
// the batch is full or a new batch was created, so wake up the sender
if (result.batchIsFull || result.newBatchCreated) {
    log.trace("Waking up the sender since topic {} partition {} is either full or getting a new batch", record.topic(), partition);
    this.sender.wakeup();
}
What does waking up the nioSelector achieve? If you are familiar with Java's Selector, you know that it blocks inside select() or select(long timeout).
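To see that blocking-and-waking behavior in isolation, here is a minimal java.nio sketch (the class name, thread layout and the one-second sleep are made up for illustration): one thread blocks in select(), another unblocks it with wakeup(), which is exactly the mechanism sender.wakeup() ultimately leans on.

import java.nio.channels.Selector;

public class WakeupDemo {
    public static void main(String[] args) throws Exception {
        final Selector selector = Selector.open();
        Thread io = new Thread(new Runnable() {
            public void run() {
                try {
                    long start = System.currentTimeMillis();
                    selector.select();   // blocks: nothing registered, no timeout given
                    System.out.println("select() returned after " + (System.currentTimeMillis() - start) + " ms");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
        io.start();
        Thread.sleep(1000);
        selector.wakeup();               // another thread unblocks the pending select()
        io.join();
        selector.close();
    }
}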
When the KafkaProducer is created, it also initializes a Sender thread; this thread loops round and round, pulling data and sending it to the brokers, and one step of each iteration performs exactly this blocking select().
So as soon as we have a ByteBuffer loaded up, we wake the Selector: stop blocking, carry on, start the next iteration. Let's walk through that iteration; it is, in effect, the assembly line that ships our messages out and then disposes of them.
/**
 * Run a single iteration of sending
 *
 * The core method that sends messages
 *
 * @param now The current POSIX time in milliseconds
 */
void run(long now) {
    /** 1. Fetch the cluster metadata */
    Cluster cluster = metadata.fetch();

    /** 2. Ask the RecordAccumulator which nodes have data ready to be sent */
    // get the list of partitions with data ready to send
    // nodes that meet the conditions for sending are returned
    RecordAccumulator.ReadyCheckResult result = this.accumulator.ready(cluster, now);

    /** 3. If the ReadyCheckResult reports nodes with an unknown leader, request a metadata update */
    // if there are any partitions whose leaders are not known yet, force metadata update
    if (result.unknownLeadersExist) {
        this.metadata.requestUpdate();
    }

    /** 4. Call NetworkClient.ready() on each candidate node to check, at the I/O level, whether it can be sent to */
    // remove any nodes we aren't ready to send to
    Iterator<Node> iter = result.readyNodes.iterator();
    long notReadyTimeout = Long.MAX_VALUE;
    while (iter.hasNext()) {
        Node node = iter.next();
        if (!this.client.ready(node, now)) {
            iter.remove();
            notReadyTimeout = Math.min(notReadyTimeout, this.client.connectionDelay(node, now));
        }
    }

    /** 5. Drain the sendable RecordBatches out of the RecordAccumulator, grouped by node */
    // create produce requests
    Map<Integer /* nodeId */, List<RecordBatch>> batches = this.accumulator.drain(cluster,
                                                                                  result.readyNodes,
                                                                                  this.maxRequestSize,
                                                                                  now);
    if (guaranteeMessageOrder) {
        // Mute all the partitions drained
        for (List<RecordBatch> batchList : batches.values()) {
            for (RecordBatch batch : batchList)
                this.accumulator.mutePartition(batch.topicPartition);
        }
    }

    /** 6. Handle batches that have sat in the RecordAccumulator past the request timeout */
    List<RecordBatch> expiredBatches = this.accumulator.abortExpiredBatches(this.requestTimeout, now);
    // update sensors
    for (RecordBatch expiredBatch : expiredBatches)
        this.sensors.recordErrors(expiredBatch.topicPartition.topic(), expiredBatch.recordCount);
    sensors.updateProduceRequestMetrics(batches);

    /** 7. Wrap the batches to be sent into ClientRequests */
    List<ClientRequest> requests = createProduceRequests(batches, now);

    // If we have any nodes that are ready to send + have sendable data, poll with 0 timeout so this can immediately
    // loop and try sending more data. Otherwise, the timeout is determined by nodes that have partitions with data
    // that isn't yet sendable (e.g. lingering, backing off). Note that this specifically does not include nodes
    // with sendable data that aren't ready to send since they would cause busy looping.
    long pollTimeout = Math.min(result.nextReadyCheckDelayMs, notReadyTimeout);
    if (result.readyNodes.size() > 0) {
        log.trace("Nodes with data ready to send: {}", result.readyNodes);
        log.trace("Created {} produce requests: {}", requests.size(), requests);
        pollTimeout = 0;
    }

    /** 8. Stage each ClientRequest in the send field of its node's KafkaChannel */
    for (ClientRequest request : requests)
        client.send(request, now);

    /** 9. Do the actual network I/O: send the staged requests, handle acks from the broker,
     *     expire timed-out requests, invoke user-defined callbacks, and so on. */
    // if some partitions are already ready to be sent, the select time would be 0;
    // otherwise if some partition already has some data accumulated but not ready yet,
    // the select time will be the time difference between now and its linger expiry time;
    // otherwise the select time will be the time difference between now and the metadata expiry time;
    this.client.poll(pollTimeout, now);
}
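For context on how this run(long now) gets driven: the Sender implements Runnable and is started as the producer's I/O thread when the KafkaProducer is constructed. Simplified (this is a sketch, not the verbatim source; error handling and the shutdown drain are omitted), its outer loop just keeps calling the method above:

// simplified sketch of Sender.run(), the outer loop of the producer I/O thread
public void run() {
    while (running) {
        run(time.milliseconds());   // one iteration of the pipeline shown above
    }
    // on close(), the real implementation keeps looping until buffered and in-flight requests are drained
}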
2. Fetching the sendable messages from the RecordAccumulator
Before anything is fetched, a few preliminary checks are run:
fetch the metadata and check, for each TopicPartition, whether a leader has been elected; if any partition still has no leader, the metadata is told to refresh itself shortly.
The nodes that pass the checks (that is, the nodes hosting the leaders) are the ones whose RecordBatches (each wrapping a ByteBuffer) will be pulled out of the RecordAccumulator next.
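What makes a batch "ready" in the first place? Roughly: the first batch of a partition is considered sendable when it is full, or has lingered past linger.ms, or the producer is closing, flushing, or out of buffer memory, provided it is not in a retry back-off. A minimal, self-contained paraphrase (the parameter names are made up for clarity; this is not the verbatim Kafka source):

// Illustrative paraphrase of the "is this batch sendable?" decision inside RecordAccumulator.ready()
public class ReadinessSketch {
    static boolean sendable(boolean batchFull, long waitedMs, long lingerMs,
                            boolean bufferExhausted, boolean closed, boolean flushInProgress) {
        boolean lingerExpired = waitedMs >= lingerMs;
        return batchFull || lingerExpired || bufferExhausted || closed || flushInProgress;
    }

    public static void main(String[] args) {
        // a full batch is sendable even if linger.ms has not elapsed yet
        System.out.println(sendable(true, 0, 5, false, false, false));   // true
        // an unfilled batch becomes sendable once it has lingered long enough
        System.out.println(sendable(false, 6, 5, false, false, false));  // true
    }
}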
The drain step loops over all of these nodes (the leader nodes) and, for every TopicPartition hosted on a node, looks up the corresponding Deque; as we already know, that is simply Deque<RecordBatch> deque = getDeque(tp).
In principle it should grab every sendable message, but Kafka caps the size of a single request: the bigger a request is, the longer one round of I/O takes (granted, from a macro view, sending more per request can be more efficient), yet we can hardly let a single message wait several seconds just to squeeze out a bit of overall throughput. The cap is the Sender's maxRequestSize, described as:
the maximum request size to attempt to send to the server
So once the accumulated size of the drained RecordBatches would exceed this limit, the loop breaks, and the sender finally gets back a Map<Integer /* nodeId */, List<RecordBatch>> batches:
for (Node node : nodes) {
    // ....
    do {
        // Only drain the batch if it is not during backoff period.
        if (!backoff) {
            if (size + first.records.sizeInBytes() > maxSize && !ready.isEmpty()) {
                // there is a rare case that a single batch size is larger than the request size due
                // to compression; in this case we will still eventually send this batch in a single
                // request
                // the request is full, stop draining for this node
                break;
            } else {
                // still room left: pull the first batch off the deque and close its MemoryRecords
                // (which closes the Compressor and makes the records read-only)
                RecordBatch batch = deque.pollFirst();
                batch.records.close();
                size += batch.records.sizeInBytes();
                ready.add(batch);
                batch.drainedMs = now;
            }
        }
    } while ( /* ... until every partition on this node has been visited ... */ );
}
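As an aside, that maxSize comes from the producer's max.request.size configuration (1 MB by default). Tuning it is just a matter of setting a producer property; a minimal sketch, with an assumed local broker address and String serializers:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ProducerConfigDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed local broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        // max.request.size: the cap that drain() honors via maxRequestSize (default 1048576 bytes)
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 1048576);
        KafkaProducer<String, String> producer = new KafkaProducer<String, String>(props);
        producer.close();
    }
}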
3. Wrapping RecordBatches into sendable ClientRequest objects
The Map drained from the RecordAccumulator is then wrapped into sendable objects, a List<ClientRequest>.
Looking at produceRequest below, each node's List<RecordBatch> is looped over and reorganized into two maps:
Map<TopicPartition, ByteBuffer> produceRecordsByPartition
Map<TopicPartition, RecordBatch> recordsByPartition
produceRecordsByPartition is straightforward: a map keyed by TopicPartition whose values are the batches' ByteBuffers.
This map gets serialized into a Struct, which you can think of as a JSON-like data structure; everything Kafka transmits on the wire is expressed as a Struct.
private ClientRequest produceRequest(long now, int destination, short acks, int timeout, List<RecordBatch> batches) {
    // reorganize the batches into two maps
    Map<TopicPartition, ByteBuffer> produceRecordsByPartition = new HashMap<TopicPartition, ByteBuffer>(batches.size());
    final Map<TopicPartition, RecordBatch> recordsByPartition = new HashMap<TopicPartition, RecordBatch>(batches.size());
    for (RecordBatch batch : batches) {
        TopicPartition tp = batch.topicPartition;
        produceRecordsByPartition.put(tp, batch.records.buffer());
        recordsByPartition.put(tp, batch);
    }

    // assemble the request
    ProduceRequest request = new ProduceRequest(acks, timeout, produceRecordsByPartition);
    // build the RequestSend; this is the object that actually gets written to the network
    RequestSend send = new RequestSend(Integer.toString(destination),
                                       this.client.nextRequestHeader(ApiKeys.PRODUCE),
                                       request.toStruct());
    // wrap the completion callback
    RequestCompletionHandler callback = new RequestCompletionHandler() {
        public void onComplete(ClientResponse response) {
            handleProduceResponse(response, recordsByPartition, time.milliseconds());
        }
    };

    return new ClientRequest(now, acks != 0, send, callback);
}
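The request.toStruct() call above is where the ProduceRequest becomes that JSON-like Struct before being framed into the RequestSend. As a purely hypothetical stand-in (this is not Kafka's actual Struct class), the idea boils down to named fields with values, later written into a ByteBuffer according to a schema:

import java.util.LinkedHashMap;
import java.util.Map;

// Toy illustration of the "Struct is roughly a JSON object" analogy; not Kafka's real class.
public class ToyStruct {
    private final Map<String, Object> fields = new LinkedHashMap<String, Object>();

    public ToyStruct set(String name, Object value) {
        fields.put(name, value);
        return this;
    }

    public Object get(String name) {
        return fields.get(name);
    }

    public static void main(String[] args) {
        ToyStruct produce = new ToyStruct()
                .set("acks", (short) 1)
                .set("timeout", 30000)
                .set("topic_data", "per-partition record sets go here");
        System.out.println(produce.get("acks"));
    }
}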
recordsByPartition, on the other hand, is captured by the completion callback: it drives things like retrying after a failure and releasing the ByteBuffers, and it is also how the callback you passed in when sending the message gets invoked.
@Override
public Future<RecordMetadata> send(ProducerRecord<K, V> record, Callback callback) {
    // intercept the record, which can be potentially modified; this method does not throw exceptions
    ProducerRecord<K, V> interceptedRecord = this.interceptors == null ? record : this.interceptors.onSend(record);
    return doSend(interceptedRecord, callback);
}
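From the caller's side, that Callback is simply the second argument to send(). A small usage sketch, assuming an already-configured KafkaProducer<String, String> named producer and a hypothetical topic name:

producer.send(new ProducerRecord<String, String>("demo-topic", "key", "value"), new Callback() {
    @Override
    public void onCompletion(RecordMetadata metadata, Exception exception) {
        if (exception != null) {
            // the send failed (for example, after retries were exhausted)
            exception.printStackTrace();
        } else {
            System.out.printf("acked %s-%d@%d%n", metadata.topic(), metadata.partition(), metadata.offset());
        }
    }
});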
4. Sending the data
The sending itself was covered in an earlier article: the assembled ClientRequest is handed to the KafkaSelector and ultimately written out by the nioSelector, arriving at the broker. See the earlier piece, a brief look at KafkaChannel, NetworkReceive and Send, and the excellent implementation underneath them: the KafkaSelector.
References:
《Kafka技术内幕》, by 郑奇煌
《Apache Kafka源码剖析》, by 徐郡明