When the consumer calls poll to fetch messages, it first checks whether a partition reassignment needs to be performed. If so, poll returns an empty result immediately, so that, as long as the timeout has not expired, KafkaConsumer#pollOnce is invoked again right away and the reassignment of the current topic partitions begins, i.e. ConsumerCoordinator#poll is called.
public boolean poll(Timer timer) {
maybeUpdateSubscriptionMetadata();
// Invoke the registered callbacks for completed offset commits
invokeCompletedOffsetCommitCallbacks();
// Make sure the subscription mode is AUTO_TOPICS or AUTO_PATTERN (USER_ASSIGNED needs no rebalancing),
// and that the target GroupCoordinator node is reachable; if it is not, try to find an available node
if (subscriptions.partitionsAutoAssigned()) {
if (protocol == null) {
throw new IllegalStateException("User configured " + ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG +
" to empty while trying to subscribe for group protocol to auto assign partitions");
}
// Always update the heartbeat last poll time so that the heartbeat thread does not leave the
// group proactively due to application inactivity even if (say) the coordinator cannot be found.
pollHeartbeat(timer.currentTimeMs());
if (coordinatorUnknown() && !ensureCoordinatorReady(timer)) {
return false;
}
// A rebalance is needed
if (rejoinNeededOrPending()) {
// due to a race condition between the initial metadata fetch and the initial rebalance,
// we need to ensure that the metadata is fresh before joining initially. This ensures
// that we have matched the pattern against the cluster's topics at least once before joining.
if (subscriptions.hasPatternSubscription()) {
// For consumer group that uses pattern-based subscription, after a topic is created,
// any consumer that discovers the topic after metadata refresh can trigger rebalance
// across the entire consumer group. Multiple rebalances can be triggered after one topic
// creation if consumers refresh metadata at vastly different times. We can significantly
// reduce the number of rebalances caused by single topic creation by asking consumer to
// refresh metadata before re-joining the group as long as the refresh backoff time has
// passed.
if (this.metadata.timeToAllowUpdate(timer.currentTimeMs()) == 0) {
this.metadata.requestUpdate();
}
if (!client.ensureFreshMetadata(timer)) {
return false;
}
maybeUpdateSubscriptionMetadata();
}
/*
 * 1. Check whether the target GroupCoordinator node is ready to receive requests
 * 2. Start the heartbeat thread
 * 3. Perform the partition reassignment
 */
// ensureActiveGroup carries out the actual partition reassignment
if (!ensureActiveGroup(timer)) {
return false;
}
}
} else {
// For manually assigned partitions, if there are no ready nodes, await metadata.
// If connections to all nodes fail, wakeups triggered while attempting to send fetch
// requests result in polls returning immediately, causing a tight loop of polls. Without
// the wakeup, poll() with no channels would block for the timeout, delaying re-connection.
// awaitMetadataUpdate() initiates new connections with configured backoff and avoids the busy loop.
// When group management is used, metadata wait is already performed for this scenario as
// coordinator is unknown, hence this check is not required.
if (metadata.updateRequested() && !client.hasReadyNodes(timer.currentTimeMs())) {
client.awaitMetadataUpdate(timer);
}
}
// Asynchronously commit offsets if auto-commit is enabled
maybeAutoCommitOffsetsAsync(timer.currentTimeMs());
return true;
}
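For context, the coordinator logic above runs entirely inside the application's poll loop. Below is a minimal sketch of such a loop; the bootstrap servers, group id and topic name are placeholders, not anything defined by the source above:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PollLoopSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");              // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));      // placeholder topic
            while (true) {
                // Each poll() first runs the coordinator logic shown above
                // (heartbeat bookkeeping, rebalance check, async auto-commit)
                // before attempting to fetch records.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.printf("%s-%d@%d: %s%n",
                        r.topic(), r.partition(), r.offset(), r.value()));
            }
        }
    }
}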
When we subscribe to Kafka topics in AUTO_TOPICS or AUTO_PATTERN mode, we do not need to care which specific partitions the consumer ends up consuming; Kafka assigns one or more partitions to the consumer according to the configured partition assignment strategy (a partition is consumed by at most one consumer in the group, and multiple consumers are never allowed to consume the same partition simultaneously). However, consumers may join or leave the group at any time, and the number of partitions of a topic can also change, so the partition reassignment (rebalance) mechanism is needed to redistribute partitions among the registered consumers.
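The three subscription modes map directly onto the public KafkaConsumer API. A minimal sketch, with placeholder topic names; note that subscribe() and assign() are mutually exclusive on a single consumer instance:

import java.util.Arrays;
import java.util.regex.Pattern;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SubscriptionModes {

    // AUTO_TOPICS: subscribe to an explicit topic list; the group assigns partitions
    static void autoTopics(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Arrays.asList("orders", "payments"));
    }

    // AUTO_PATTERN: subscribe to every topic matching a regex; the group assigns partitions
    static void autoPattern(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Pattern.compile("metrics-.*"));
    }

    // USER_ASSIGNED: assign partitions manually; no group management and no rebalance
    static void userAssigned(KafkaConsumer<String, String> consumer) {
        consumer.assign(Arrays.asList(
                new TopicPartition("orders", 0),
                new TopicPartition("orders", 1)));
    }
}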
A partition reassignment proceeds in 3 phases and is a process that interacts with the cluster. Here we take the client's perspective: it is triggered when the consumer detects that partitions need to be reassigned (a sketch of how these phases surface to the application follows the list):
- Send a GroupCoordinatorRequest to obtain the broker node hosting an available GroupCoordinator instance for the target group; if no such node is known, pick the least-loaded node and try to establish a connection to it;
- Send a JoinGroupRequest to the node hosting the GroupCoordinator instance to apply to join the target group. The GroupCoordinator waits for join requests within a fixed time window; once it has received requests from all consumers in the group, or the wait times out, it returns a JoinGroupResponse, whose main purpose is to announce which consumer is the new group leader and which partition assignment strategy was finally chosen;
- The group leader assigns partitions to the consumers in the group according to the chosen assignment strategy, and sends a SyncGroupRequest to the node hosting the GroupCoordinator instance to report the final assignment result.
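To the application these phases are only visible through the ConsumerRebalanceListener callbacks: with the classic (eager) rebalance protocol, onPartitionsRevoked is invoked before the JoinGroupRequest is sent, and onPartitionsAssigned once the assignment produced by the SyncGroup phase has been applied. A minimal sketch with a placeholder topic name:

import java.util.Collection;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RebalanceObserver {
    static void subscribeWithListener(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList("orders"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                // Called before re-joining the group; a typical use is committing
                // offsets for the partitions that are about to be given up.
                System.out.println("Revoked: " + partitions);
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // Called after the new assignment has been received and applied.
                System.out.println("Assigned: " + partitions);
            }
        });
    }
}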
The conditions that determine whether a partition reassignment is needed are implemented in ConsumerCoordinator#rejoinNeededOrPending:
public boolean rejoinNeededOrPending() {
// The USER_ASSIGNED subscription mode never needs a partition reassignment
if (!subscriptions.partitionsAutoAssigned())
return false;
// we need to rejoin if we performed the assignment and metadata has changed;
// also for those owned-but-no-longer-existed partitions we should drop them as lost
// The partition metadata (e.g. the number of partitions) has changed since the last assignment
if (assignmentSnapshot != null && !assignmentSnapshot.matches(metadataSnapshot)) {
requestRejoin();
return true;
}
// we need to join if our subscription has changed since the last join
// The consumer's topic subscription has changed since the last join
if (joinedSubscription != null && !joinedSubscription.equals(subscriptions.subscription())) {
requestRejoin();
return true;
}
// Other conditions that mark a rebalance as needed, e.g. a failed reassignment or a reset of the generation info
return super.rejoinNeededOrPending();
}
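For example, the second check fires as soon as the subscription changes: calling subscribe() again with a different topic set makes joinedSubscription differ from subscriptions.subscription(), so the next poll() re-runs the join/sync cycle. A minimal sketch with placeholder topic names:

import java.time.Duration;
import java.util.Arrays;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SubscriptionChangeSketch {
    static void changeSubscription(KafkaConsumer<String, String> consumer) {
        // First subscription: the first poll() joins the group and receives an assignment
        consumer.subscribe(Arrays.asList("orders"));
        consumer.poll(Duration.ofMillis(500));

        // The subscription now differs from the one recorded at join time, so
        // rejoinNeededOrPending() returns true and the next poll() rebalances
        consumer.subscribe(Arrays.asList("orders", "payments"));
        consumer.poll(Duration.ofMillis(500));
    }
}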
If a partition reassignment is determined to be necessary, the consumer then calls AbstractCoordinator#ensureActiveGroup to make sure the node hosting the GroupCoordinator instance for its group is ready to receive requests. If that node is unavailable, a GroupCoordinatorRequest is sent to find an available, lightly loaded node and a connection is established to it. The consumer then calls AbstractCoordinator#joinGroupIfNeeded to start the partition reassignment, implemented as follows:
boolean joinGroupIfNeeded(final Timer timer) {
// Loop as long as a partition reassignment is needed or still pending
while (rejoinNeededOrPending()) {
// Check again that the target GroupCoordinator node is ready to receive requests
if (!ensureCoordinatorReady(timer)) {
return false;
}
// call onJoinPrepare if needed. We set a flag to make sure that we do not call it a second
// time if the client is woken up before a pending rebalance completes. This must be called
// on each iteration of the loop because an event requiring a rebalance (such as a metadata
// refresh which changes the matched subscription set) can occur while another rebalance is
// still in progress.
// Run the preparation work before (re)joining, if needed
if (needsJoinPrepare) {
// need to set the flag before calling onJoinPrepare since the user callback may throw
// exception, in which case upon retry we should not retry onJoinPrepare either.
/*
 * 1. If offset auto-commit is enabled, commit offsets synchronously
 * 2. Invoke the onPartitionsRevoked callback of the registered ConsumerRebalanceListener
 * 3. Give up this consumer's leader role (if it holds one) and revert to an ordinary member
 */
needsJoinPrepare = false;
onJoinPrepare(generation.generationId, generation.memberId);
}
// Build and send a JoinGroupRequest to apply to join the target group
final RequestFuture<ByteBuffer> future = initiateJoinGroup();
client.poll(future, timer);
if (!future.isDone()) {
// we ran out of time
return false;
}
// Joining the group and receiving the partition assignment succeeded
if (future.succeeded()) {
Generation generationSnapshot;
// Generation data maybe concurrently cleared by Heartbeat thread.
// Can't use synchronized for {@code onJoinComplete}, because it can be long enough
// and shouldn't block heartbeat thread.
// See {@link PlaintextConsumerTest#testMaxPollIntervalMsDelayInAssignment}
synchronized (AbstractCoordinator.this) {
generationSnapshot = this.generation;
}
if (generationSnapshot != Generation.NO_GENERATION) {
// Duplicate the buffer in case `onJoinComplete` does not complete and needs to be retried.
ByteBuffer memberAssignment = future.value().duplicate();
onJoinComplete(generationSnapshot.generationId, generationSnapshot.memberId, generationSnapshot.protocol, memberAssignment);
// Generally speaking we should always resetJoinGroupFuture once the future is done, but here
// we can only reset the join group future after the completion callback returns. This ensures
// that if the callback is woken up, we will retry it on the next joinGroupIfNeeded.
// And because of that we should explicitly trigger resetJoinGroupFuture in other conditions below.
resetJoinGroupFuture();
needsJoinPrepare = true;
} else {
log.info("Generation data was cleared by heartbeat thread. Initiating rejoin.");
resetStateAndRejoin();
resetJoinGroupFuture();
return false;
}
} else {
resetJoinGroupFuture();
// The partition assignment failed; decide whether to retry based on the failure type
final RuntimeException exception = future.exception();
if (exception instanceof UnknownMemberIdException ||
exception instanceof RebalanceInProgressException ||
exception instanceof IllegalGenerationException ||
exception instanceof MemberIdRequiredException)
continue;
else if (!future.isRetriable())
throw exception;
timer.sleep(rebalanceConfig.retryBackoffMs);
}
}
return true;
}
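The retry backoff used at the end of the loop, the heartbeat/session behavior, and the assignment strategy negotiated during the JoinGroup phase are all controlled by consumer configuration. A sketch of the relevant settings (the values shown are illustrative, not recommendations):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.RangeAssignor;

public class RebalanceConfigSketch {
    static Properties rebalanceRelatedConfigs() {
        Properties props = new Properties();
        // Backoff between retries of failed coordinator requests (rebalanceConfig.retryBackoffMs above)
        props.put(ConsumerConfig.RETRY_BACKOFF_MS_CONFIG, "100");
        // Maximum gap between poll() calls before the member is evicted and a rebalance is triggered
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000");
        // Session timeout used by the GroupCoordinator to detect dead members via heartbeats
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "10000");
        // Interval of the background heartbeat thread started by ensureActiveGroup
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "3000");
        // Partition assignment strategy the group leader uses when computing the assignment
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, RangeAssignor.class.getName());
        return props;
    }
}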