Kafka Replica Management: Why the replica.lag.max.messages Parameter Was Removed

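An In-Depth Look at Kafka's Replica Synchronization Mechanism
This post looks at why the replica.lag.max.messages parameter was removed from Kafka, examines where it did and did not work well, and describes how newer Kafka versions handle lagging replicas more effectively with the replica.lag.time.max.ms parameter.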

https://www.cnblogs.com/huxi2b/p/5903354.html

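Today, while reading the official documentation for Kafka 0.10.0, I came across this sentence: Configuration parameter replica.lag.max.messages was removed. Partition leaders will no longer consider the number of lagging messages when deciding which replicas are in sync. In other words, replica.lag.max.messages has been officially removed, and the leader replica of each topic partition no longer uses it to judge whether a follower replica is in sync. That made me curious, so I spent half a day digging into it and finally worked out why the parameter was removed. These are my notes.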

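  First, let's look at what this parameter originally meant: If a replica falls more than this many messages behind the leader, the leader will remove the follower from ISR and treat it as dead. That is, if a follower replica falls behind the leader replica by more than this many messages, the leader removes that follower from the ISR. Kafka 0.8.2.2 uses the parameter like this: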

// keepInSyncTimeMs is the value of replica.lag.time.max.ms
val stuckReplicas = candidateReplicas.filter(r => (time.milliseconds - r.logEndOffsetUpdateTimeMs) > keepInSyncTimeMs)
// keepInSyncMessages is the value of replica.lag.max.messages
val slowReplicas = candidateReplicas.filter(r =>
  r.logEndOffset.messageOffset >= 0 &&
  leaderLogEndOffset.messageOffset - r.logEndOffset.messageOffset > keepInSyncMessages)

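   This code shows that if the difference between the leader replica's log end offset (LEO from here on) and a follower replica's LEO exceeds this threshold, the follower is considered a slow replica and is added to the slowReplicas set. Every replica in that set is considered out of sync with the leader. From Kafka 0.9 onward, however, the code looks like this: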

val laggingReplicas = candidateReplicas.filter(r => (time.milliseconds - r.lastCaughtUpTimeMs) > maxLagMs)

Clearly, newer Kafka versions (0.9 and later) use a single parameter to identify lagging replicas and no longer use replica.lag.max.messages. Why is that?
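
To make the difference between the two snippets concrete, here is a minimal, self-contained sketch of both checks. It is not Kafka's code: `FollowerState`, `LagChecks` and the plain `Long` offsets are hypothetical simplifications, and only the field and parameter names are borrowed from the excerpts above.

```scala
// Hypothetical, simplified view of a follower as tracked by the leader.
final case class FollowerState(
    brokerId: Int,
    logEndOffset: Long,             // follower LEO as known to the leader
    logEndOffsetUpdateTimeMs: Long, // last time the follower's LEO advanced (0.8.x check)
    lastCaughtUpTimeMs: Long        // last time the follower was fully caught up (0.9+ check)
)

object LagChecks {
  // 0.8.x style: out of sync if stuck (time-based) OR slow (message-count-based).
  def outOfSyncOld(followers: Seq[FollowerState], leaderLeo: Long, nowMs: Long,
                   keepInSyncTimeMs: Long, keepInSyncMessages: Long): Set[Int] = {
    val stuck = followers.filter(r => nowMs - r.logEndOffsetUpdateTimeMs > keepInSyncTimeMs)
    val slow = followers.filter(r =>
      r.logEndOffset >= 0 && leaderLeo - r.logEndOffset > keepInSyncMessages)
    (stuck ++ slow).map(_.brokerId).toSet
  }

  // 0.9+ style: a single time-based criterion.
  def outOfSyncNew(followers: Seq[FollowerState], nowMs: Long, maxLagMs: Long): Set[Int] =
    followers.filter(r => nowMs - r.lastCaughtUpTimeMs > maxLagMs).map(_.brokerId).toSet
}
```

With a follower that is 10 messages behind but was caught up 500 ms ago, `outOfSyncOld` with a message threshold of 4 flags it, while `outOfSyncNew` with a 1000 ms limit does not.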

  Before going into the details, let's pin down some common terminology to make the discussion easier:

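  • AR: assigned replicas. Each partition is normally assigned several replicas. The number of replicas is the topic's replication factor, set when the topic is created; the broker parameter default.replication.factor applies to automatically created topics, while offsets.topic.replication.factor only governs the internal offsets topic. A partition's AR data is stored in the ZooKeeper node /brokers/topics/<topic>
  • ISR: in-sync replicas. The set of replicas that are in sync with the leader replica (the leader itself is also in the ISR). ISR data is stored in the ZooKeeper node /brokers/topics/<topic>/partitions/<partitionId>/state
  • High Watermark: the replica high watermark, HW for short. It indicates the offset of the latest committed message in the partition
  • LEO: log end offset. The name suggests the offset at the end of the log, but it is actually the offset of the next message, that is, the offset at which the next message will be appended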

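  It is worth stressing that the HW refers to the latest committed message. Committed means the message has been fully replicated. The LEO may be larger than the HW, because new messages can be appended to the leader replica's log at any time, and those new messages may not yet have been fully replicated to the follower replicas. The original post illustrates the relationship between the two with a figure.

Consumers can only consume messages below the HW line, drawn in green in that figure. The messages drawn in purple are not yet fully replicated and therefore cannot be consumed.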
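
The relationship between HW, LEO and what a consumer can read can be sketched in a few lines. This is a simplified model under stated assumptions, not Kafka's implementation: `Watermarks` and its methods are hypothetical names, offsets are plain `Long` values, and the HW is taken to be the smallest LEO among the ISR members, which matches the "fully replicated" definition above.

```scala
object Watermarks {
  // The HW can only advance as far as every ISR member has replicated,
  // so in this model it is the minimum LEO across the ISR (leader included).
  def highWatermark(isrLeos: Map[Int, Long]): Long = {
    require(isrLeos.nonEmpty, "the ISR always contains at least the leader")
    isrLeos.values.min
  }

  // Consumers may read only offsets strictly below the HW.
  def isConsumable(offset: Long, hw: Long): Boolean = offset >= 0 && offset < hw

  // Messages appended to the leader but not yet committed.
  def uncommittedCount(leaderLeo: Long, hw: Long): Long = leaderLeo - hw
}
```

For example, with LEOs of 4, 4 and 3 across three ISR members, the HW in this model is 3, offset 2 is readable, and offset 3 is not yet visible to consumers.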

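  With the terminology clear, one more question needs examining: what does it mean for a follower replica to be out of sync with the leader replica? Out of sync means the follower cannot catch up with the leader's LEO, so what does that mean in turn? A simple example helps. Suppose we have a topic with a single partition and a replication factor of 3. The three replicas live on broker1, broker2 and broker3. The leader replica is on broker1, the replicas on the other two brokers are followers, and all replicas are currently in the ISR. Now we set replica.lag.max.messages to 4, meaning that a follower is not removed from the ISR as long as it is fewer than 4 messages behind the leader. Suppose a producer sends 3 messages to this topic at a time. In the initial state (shown as a figure in the original post), every replica holds the same messages.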

 

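Clearly, both follower replicas are currently in sync with the leader, that is, both have caught up with the leader's LEO. Now suppose the producer sends 1 new message to the leader while the follower on broker3 goes through a Full GC. In the resulting log state (again a figure in the original post), the follower on broker3 is one message behind the leader.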

 

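In this state the leader's HW and LEO are no longer the same. More importantly, the newly produced message is not considered "committed" until either broker3 is removed from the ISR or the follower on broker3 catches up with the leader's LEO. Since replica.lag.max.messages=4 and the follower on broker3 is only 1 message behind the leader, it does not meet the removal condition and stays in the ISR. For the replica on broker3 things are simple: it only needs to catch up with the leader's LEO. If we assume broker3 pauses for 100 ms because of the Full GC and then catches up with the leader, all three replicas hold the same messages again.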

 At this point everything is fine again: the leader's HW equals its LEO, and both follower replicas are in sync with the leader.
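
The three states in this walkthrough can be replayed with a small model. This is only a sketch: `PartitionState` and its methods are hypothetical, broker ids 1 to 3 stand for broker1 to broker3, and the ISR check uses the "lag must exceed the threshold" rule from the 0.8.2.2 snippet.

```scala
// Hypothetical partition state: one LEO per broker plus the current ISR.
final case class PartitionState(leader: Int, leo: Map[Int, Long], isr: Set[Int]) {
  def leaderLeo: Long = leo(leader)

  // A message is committed once every ISR member has it: HW = min LEO in the ISR.
  def hw: Long = isr.map(b => leo(b)).min

  // The producer appends n messages to the leader.
  def append(n: Long): PartitionState = copy(leo = leo.updated(leader, leaderLeo + n))

  // A follower fetches and fully catches up with the leader.
  def catchUp(broker: Int): PartitionState = copy(leo = leo.updated(broker, leaderLeo))

  // Drop followers whose message lag exceeds the threshold; the leader always stays.
  def shrinkIsr(maxLagMessages: Long): PartitionState =
    copy(isr = isr.filter(b => b == leader || leaderLeo - leo(b) <= maxLagMessages))
}
```

Starting from three replicas at LEO 3, `append(1)` followed by `catchUp(2)` leaves the HW at 3 while broker 3 is paused. `shrinkIsr(4)` keeps broker 3 in the ISR, and the HW reaches 4 only after `catchUp(3)` (or after broker 3 is dropped from the ISR).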

   So what can cause a follower replica to fall out of sync with the leader? There are three main causes:

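  • It cannot keep up: for some period the follower cannot match the rate at which messages are written to the leader, for example because heavy I/O overhead on the follower's broker makes it append replicated messages more slowly than it fetches them from the leader
  • The process is stuck: for some period the follower sends no FetchRequest (the request that fetches message data) to the leader at all, for example because of overly frequent GC or some other failure
  • It is newly created: if the user increases the replication factor, the new follower replicas start out racing to catch up with the leader and are therefore out of sync with it at first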

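  The replica.lag.max.messages parameter exists to detect the first case. Kafka also provides another parameter, replica.lag.time.max.ms, to detect the other two. For example, with replica.lag.time.max.ms=500, as long as the follower sends a FetchRequest to the leader at least once every 500 ms, it is not marked dead and removed from the ISR.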

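  Since this post focuses on replica.lag.max.messages, let's talk about the problem Kafka runs into when detecting the first case. Back to the earlier example: if the producer sends 2 messages per second, with each batch containing 2 messages, then replica.lag.max.messages=4 is clearly a safe and suitable value. Why? Because between the moment the leader receives messages from the producer and the moment the followers start replicating them, a follower will not be more than 3 messages behind the leader. And if a follower does fall more than 3 messages behind, you certainly want the leader to remove such a slow follower from the ISR so that it does not add latency to message production. In this simple example the parameter seems to work well, so why remove it? The root cause is that setting it correctly requires users to estimate a value themselves for their specific workload. That is why newer Kafka versions removed it.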

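  Let me explain this root cause in detail. One thing matters a great deal for any configuration parameter: users should set parameters whose right values they know, not parameters whose values they have to guess. With this parameter we can only guess at what to set it to, rather than set it according to our needs. Why? Suppose that in the topic above the producer suddenly generates a momentary spike in traffic, for example it now sends 4 messages in one batch, which equals the value of replica.lag.max.messages. At that moment both follower replicas are considered out of sync with the leader and are removed from the ISR; the original post shows this log state in a figure. (Strictly speaking, the 0.8.2.2 code quoted earlier uses a `>` comparison, so the lag has to exceed the threshold rather than merely reach it; the example simplifies this, and the argument is unaffected.)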

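   In that state the two follower replicas are no longer in sync with the leader, yet both are alive and have no performance problem at all. On their next FetchRequest they catch up with the leader's LEO and are added back to the ISR. The result is that they are repeatedly removed from the ISR and added back, cycling between out of sync and in sync. It is easy to see how costly that churn is. The crux of the problem is the replica.lag.max.messages parameter: users set it by guessing, guessing the producer's rate and guessing the leader's inbound traffic.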
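
The flapping can be reproduced with a toy simulation. This is a sketch under simplifying assumptions, not Kafka's scheduling: `IsrFlapping` is a hypothetical name, the leader is assumed to check the ISR right after each produce batch and before the follower's next fetch, and the check uses the strict `>` comparison from the 0.8.2.2 snippet, so the bursts here are 5 messages against a threshold of 4.

```scala
object IsrFlapping {
  // Counts how often a perfectly healthy follower is removed from the ISR when lag
  // is judged by message count right after each produce batch.
  def removals(batchSizes: Seq[Long], maxLagMessages: Long): Int = {
    var leaderLeo = 0L
    var followerLeo = 0L
    var removed = 0
    for (batch <- batchSizes) {
      leaderLeo += batch                                 // producer appends a batch to the leader
      if (leaderLeo - followerLeo > maxLagMessages) removed += 1 // leader shrinks the ISR
      followerLeo = leaderLeo                            // next fetch: follower catches up and rejoins
    }
    removed
  }
}
```

With steady batches of 2 and a threshold of 4 the follower is never removed, while every burst of 5 costs one removal followed by a re-addition, even though the follower is healthy the whole time.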

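  Some users might say that the parameter defaults to 4000, which should be enough. But note that this parameter is global! Every topic is affected by it. Suppose the cluster has two topics, t1 and t2, with very different traffic. t1's producer sends 5000 messages at a time, which immediately exceeds the default of 4000, while t2's producer sends 10 messages at a time, so Kafka needs quite a long time to identify the lagging replicas in t2's partitions. Topics with such different traffic can easily end up on the same cluster, so what value should replica.lag.max.messages take in that case? Clearly there is no suitable value.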
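
A little arithmetic shows why one global threshold cannot serve both topics. The helper below is a hypothetical sketch (the names `GlobalThreshold`, `batchesUntilFlagged` and `singleBatchTrips` are made up for illustration) that assumes the strict `>` comparison from the 0.8.2.2 snippet and a follower that has stopped fetching entirely.

```scala
object GlobalThreshold {
  // Number of produce batches after which a follower that has stopped fetching
  // exceeds the message-count threshold: the smallest n with n * batchSize > maxLagMessages.
  def batchesUntilFlagged(maxLagMessages: Long, batchSize: Long): Long = {
    require(batchSize > 0, "batchSize must be positive")
    maxLagMessages / batchSize + 1
  }

  // True if a single batch alone pushes even a fully caught-up follower over the threshold.
  def singleBatchTrips(maxLagMessages: Long, batchSize: Long): Boolean =
    batchSize > maxLagMessages
}
```

With the default of 4000, a single t1 batch of 5000 messages trips the check for healthy followers, while a dead follower on t2 is only flagged after 401 batches of 10 messages.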

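  To sum up, newer Kafka versions dropped this parameter and instead use a single parameter to detect lag caused both by slowness and by a stuck process: the length of time a follower replica has been behind the leader. That single parameter is replica.lag.time.max.ms, which defaulted to 10 seconds at the time (the default was raised to 30 seconds in Kafka 2.5). For the second and third causes of falling out of sync, nothing really changes with this parameter. For the first cause, the detection mechanism was adjusted slightly: if a follower stays behind the leader continuously for longer than this threshold, it is marked dead and removed from the ISR. So even with the momentary producer spike described above, as long as the follower does not stay behind continuously, it will not keep moving in and out of the ISR.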
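
The time-based rule can be sketched as follows. This is a simplified model rather than Kafka's code: `Follower`, `onFetch` and `isLagging` are hypothetical names, and the model assumes the leader refreshes `lastCaughtUpTimeMs` only when a fetch reaches the leader's current LEO, which is how the "continuously behind" condition is approximated here.

```scala
// Hypothetical follower record kept by the leader.
final case class Follower(lastCaughtUpTimeMs: Long) {
  // Record a fetch: the timestamp moves forward only when the fetch reached the leader's LEO.
  def onFetch(nowMs: Long, fetchedUpTo: Long, leaderLeo: Long): Follower =
    if (fetchedUpTo >= leaderLeo) copy(lastCaughtUpTimeMs = nowMs) else this

  // Same shape as the 0.9 check: lagging if not caught up within maxLagMs.
  def isLagging(nowMs: Long, maxLagMs: Long): Boolean =
    nowMs - lastCaughtUpTimeMs > maxLagMs
}
```

A follower that is thousands of messages behind during a burst at 200 ms but catches up at 600 ms is never flagged with a 10000 ms limit; one that has not caught up by 10001 ms is.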

  One last note: this change came out of a Kafka replica tuning requirement. For the details see KIP-16: Automated Replica Lag Tuning

[root@kafka1 hejia]# /usr/local/kafka_2.13-3.7.2/bin/kafka-server-start.sh /usr/local/kafka_2.13-3.7.2/config/server.properties
[2025-09-01 17:10:47,478] INFO Registered kafka:type=kafka.Log4jController MBean (kafka.utils.Log4jControllerRegistration$)
[2025-09-01 17:10:47,911] INFO Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation (org.apache.zookeeper.common.X509Util)
[2025-09-01 17:10:48,125] INFO Registered signal handlers for TERM, INT, HUP (org.apache.kafka.common.utils.LoggingSignalHandler)
[2025-09-01 17:10:48,127] INFO starting (kafka.server.KafkaServer)
[2025-09-01 17:10:48,128] INFO Connecting to zookeeper on 192.168.14.124:2181,192.168.14.125:2181,192.168.14.126:2181 (kafka.server.KafkaServer)
[2025-09-01 17:10:48,145] INFO [ZooKeeperClient Kafka server] Initializing a new session to 192.168.14.124:2181,192.168.14.125:2181,192.168.14.126:2181. (kafka.zookeeper.ZooKeeperClient)
[2025-09-01 17:10:48,151] INFO Client environment:zookeeper.version=3.8.4-9316c2a7a97e1666d8f4593f34dd6fc36ecc436c, built on 2024-02-12 22:16 UTC (org.apache.zookeeper.ZooKeeper)
[2025-09-01 17:10:48,152] INFO Client environment:host.name=kafka1 (org.apache.zookeeper.ZooKeeper)
[2025-09-01 17:10:48,152] INFO Client environment:java.version=1.8.0_171 (org.apache.zookeeper.ZooKeeper)
[2025-09-01 17:10:48,152] INFO Client environment:java.vendor=Oracle Corporation (org.apache.zookeeper.ZooKeeper)
[2025-09-01 17:10:48,152] INFO Client environment:java.home=/usr/local/jdk1.8.0_171/jre (org.apache.zookeeper.ZooKeeper)
[2025-09-01 17:10:48,152] INFO Client environment:java.class.path=/usr/local/jdk1.8.0_171/jre/lib/ext:/usr/local/jdk1.8.0_171/lib/tools.jar:... (org.apache.zookeeper.ZooKeeper)
[2025-09-01 17:10:48,153] INFO Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib (org.apache.zookeeper.ZooKeeper)
[2025-09-01 17:10:48,153] INFO Client environment:java.io.tmpdir=/tmp (org.apache.zookeeper.ZooKeeper)
[2025-09-01 17:10:48,153] INFO Client environment:java.compiler=<NA> (org.apache.zookeeper.ZooKeeper)
[2025-09-01 17:10:48,154] INFO Client environment:os.name=Linux (org.apache.zookeeper.ZooKeeper)
[2025-09-01 17:10:48,154] INFO Client environment:os.arch=amd64 (org.apache.zookeeper.ZooKeeper)
[2025-09-01 17:10:48,154] INFO Client environment:os.version=3.10.0-1160.el7.x86_64 (org.apache.zookeeper.ZooKeeper)
[2025-09-01 17:10:48,154] INFO Client environment:user.name=root (org.apache.zookeeper.ZooKeeper)
[2025-09-01 17:10:48,154] INFO Client environment:user.home=/root (org.apache.zookeeper.ZooKeeper)
[2025-09-01 17:10:48,154] INFO Client environment:user.dir=/home/hejia (org.apache.zookeeper.ZooKeeper)
[2025-09-01 17:10:48,154] INFO Client environment:os.memory.free=1003MB (org.apache.zookeeper.ZooKeeper)
[2025-09-01 17:10:48,154] INFO Client environment:os.memory.max=1024MB (org.apache.zookeeper.ZooKeeper)
[2025-09-01 17:10:48,154] INFO Client environment:os.memory.total=1024MB (org.apache.zookeeper.ZooKeeper)
[2025-09-01 17:10:48,157] INFO Initiating client connection, connectString=192.168.14.124:2181,192.168.14.125:2181,192.168.14.126:2181 sessionTimeout=18000 watcher=kafka.zookeeper.ZooKeeperClient$ZooKeeperClientWatcher$@6150c3ec (org.apache.zookeeper.ZooKeeper)
[2025-09-01 17:10:48,161] INFO jute.maxbuffer value is 4194304 Bytes (org.apache.zookeeper.ClientCnxnSocket)
[2025-09-01 17:10:48,170] INFO zookeeper.request.timeout value is 0. feature enabled=false (org.apache.zookeeper.ClientCnxn)
[2025-09-01 17:10:48,172] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient)
[2025-09-01 17:10:48,174] INFO Opening socket connection to server /192.168.14.126:2181. (org.apache.zookeeper.ClientCnxn)
[2025-09-01 17:10:48,178] INFO Socket connection established, initiating session, client: /192.168.14.124:42150, server: /192.168.14.126:2181 (org.apache.zookeeper.ClientCnxn)
[2025-09-01 17:10:48,184] INFO Session establishment complete on server /192.168.14.126:2181, session id = 0x3000359362800e9, negotiated timeout = 18000 (org.apache.zookeeper.ClientCnxn)
[2025-09-01 17:10:48,187] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient)
[2025-09-01 17:10:48,627] INFO Cluster ID = eQFewTrDRa6Csx8shnSzQQ (kafka.server.KafkaServer)
[2025-09-01 17:10:48,709] INFO KafkaConfig values:
    ...
    broker.id = 0
    ...
    default.replication.factor = 1
    ...
    min.insync.replicas = 1
    ...
    offsets.topic.replication.factor = 1
    ...
    replica.lag.time.max.ms = 30000
    ...
    zookeeper.connect = 192.168.14.124:2181,192.168.14.125:2181,192.168.14.126:2181
    ...
    zookeeper.session.timeout.ms = 18000
    ...
 (kafka.server.KafkaConfig)
[2025-09-01 17:10:48,784] INFO [ThrottledChannelReaper-Fetch]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2025-09-01 17:10:48,785] INFO [ThrottledChannelReaper-Produce]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2025-09-01 17:10:48,789] INFO [ThrottledChannelReaper-Request]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2025-09-01 17:10:48,791] INFO [ThrottledChannelReaper-ControllerMutation]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2025-09-01 17:10:48,953] INFO Loading logs from log dirs ArraySeq(/data/kafka/log) (kafka.log.LogManager)
[2025-09-01 17:10:48,960] INFO No logs found to be loaded in /data/kafka/log (kafka.log.LogManager)
[2025-09-01 17:10:48,982] INFO Loaded 0 logs in 28ms (kafka.log.LogManager)
[2025-09-01 17:10:48,985] INFO Starting log cleanup with a period of 300000 ms. (kafka.log.LogManager)
[2025-09-01 17:10:48,987] INFO Starting log flusher with a default period of 9223372036854775807 ms. (kafka.log.LogManager)
[2025-09-01 17:10:49,700] INFO [kafka-log-cleaner-thread-0]: Starting (kafka.log.LogCleaner$CleanerThread)
[2025-09-01 17:10:49,723] INFO [feature-zk-node-event-process-thread]: Starting (kafka.server.FinalizedFeatureChangeListener$ChangeNotificationProcessorThread)
[2025-09-01 17:10:49,754] INFO [MetadataCache brokerId=0] Updated cache from existing None to latest Features(version=3.7-IV4, finalizedFeatures={}, finalizedFeaturesEpoch=0). (kafka.server.metadata.ZkMetadataCache)
[2025-09-01 17:10:49,807] INFO [zk-broker-0-to-controller-forwarding-channel-manager]: Starting (kafka.server.NodeToControllerRequestThread)
[2025-09-01 17:10:50,663] INFO Updated connection-accept-rate max connection creation rate to 2147483647 (kafka.network.ConnectionQuotas)
[2025-09-01 17:10:50,694] INFO [SocketServer listenerType=ZK_BROKER, nodeId=0] Created data-plane acceptor and processors for endpoint : ListenerName(PLAINTEXT) (kafka.network.SocketServer)
[2025-09-01 17:10:50,719] INFO [zk-broker-0-to-controller-alter-partition-channel-manager]: Starting (kafka.server.NodeToControllerRequestThread)
[2025-09-01 17:10:50,797] INFO [ExpirationReaper-0-Produce]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2025-09-01 17:10:50,928] INFO [ExpirationReaper-0-Fetch]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2025-09-01 17:10:50,929] INFO [ExpirationReaper-0-DeleteRecords]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2025-09-01 17:10:50,931] INFO [ExpirationReaper-0-ElectLeader]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2025-09-01 17:10:50,939] INFO [ExpirationReaper-0-RemoteFetch]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2025-09-01 17:10:50,963] INFO [LogDirFailureHandler]: Starting (kafka.server.ReplicaManager$LogDirFailureHandler)
[2025-09-01 17:10:50,970] INFO [AddPartitionsToTxnSenderThread-0]: Starting (kafka.server.AddPartitionsToTxnManager)
[2025-09-01 17:10:51,183] INFO Creating /brokers/ids/0 (is it secure? false) (kafka.zk.KafkaZkClient)
[2025-09-01 17:10:51,209] ERROR Error while creating ephemeral at /brokers/ids/0, node already exists and owner '0x200038df42600ef' does not match current session '0x3000359362800e9' (kafka.zk.KafkaZkClient$CheckedEphemeral)
[2025-09-01 17:10:51,235] ERROR [KafkaServer id=0] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:126)
    at kafka.zk.KafkaZkClient$CheckedEphemeral.getAfterNodeExists(KafkaZkClient.scala:2238)
    at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:2176)
    at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:2143)
    at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:110)
    at kafka.server.KafkaServer.startup(KafkaServer.scala:404)
    at kafka.Kafka$.main(Kafka.scala:112)
    at kafka.Kafka.main(Kafka.scala)
[2025-09-01 17:10:51,238] INFO [KafkaServer id=0] shutting down (kafka.server.KafkaServer)
[2025-09-01 17:10:51,239] INFO [SocketServer listenerType=ZK_BROKER, nodeId=0] Stopping socket server request processors (kafka.network.SocketServer)
[2025-09-01 17:10:51,245] INFO [SocketServer listenerType=ZK_BROKER, nodeId=0] Stopped socket server request processors (kafka.network.SocketServer)
[2025-09-01 17:10:51,248] INFO [ReplicaManager broker=0] Shutting down (kafka.server.ReplicaManager)
[2025-09-01 17:10:51,250] INFO [LogDirFailureHandler]: Shutting down (kafka.server.ReplicaManager$LogDirFailureHandler)
[2025-09-01 17:10:51,250] INFO [LogDirFailureHandler]: Stopped (kafka.server.ReplicaManager$LogDirFailureHandler)
[2025-09-01 17:10:51,250] INFO [LogDirFailureHandler]: Shutdown completed (kafka.server.ReplicaManager$LogDirFailureHandler)
[2025-09-01 17:10:51,251] INFO [ReplicaFetcherManager on broker 0] shutting down (kafka.server.ReplicaFetcherManager)
[2025-09-01 17:10:51,252] INFO [ReplicaFetcherManager on broker 0] shutdown completed (kafka.server.ReplicaFetcherManager)
[2025-09-01 17:10:51,252] INFO [ReplicaAlterLogDirsManager on broker 0] shutting down (kafka.server.ReplicaAlterLogDirsManager)
[2025-09-01 17:10:51,252] INFO [ReplicaAlterLogDirsManager on broker 0] shutdown completed (kafka.server.ReplicaAlterLogDirsManager)
[2025-09-01 17:10:51,253] INFO [ExpirationReaper-0-Fetch]: Shutting down (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2025-09-01 17:10:51,253] INFO [ExpirationReaper-0-Fetch]: Stopped (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2025-09-01 17:10:51,254] INFO [ExpirationReaper-0-Fetch]: Shutdown completed (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2025-09-01 17:10:51,254] INFO [ExpirationReaper-0-RemoteFetch]: Shutting down (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2025-09-01 17:10:51,255] INFO [ExpirationReaper-0-RemoteFetch]: Stopped (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2025-09-01 17:10:51,255] INFO [ExpirationReaper-0-RemoteFetch]: Shutdown completed (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2025-09-01 17:10:51,255] INFO [ExpirationReaper-0-Produce]: Shutting down (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2025-09-01 17:10:51,255] INFO [ExpirationReaper-0-Produce]: Stopped (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2025-09-01 17:10:51,256] INFO [ExpirationReaper-0-Produce]: Shutdown completed (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2025-09-01 17:10:51,256] INFO [ExpirationReaper-0-DeleteRecords]: Shutting down (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2025-09-01 17:10:51,256] INFO [ExpirationReaper-0-DeleteRecords]: Stopped
(kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper) [2025-09-01 17:10:51,256] INFO [ExpirationReaper-0-DeleteRecords]: Shutdown completed (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper) [2025-09-01 17:10:51,257] INFO [ExpirationReaper-0-ElectLeader]: Shutting down (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper) [2025-09-01 17:10:51,257] INFO [ExpirationReaper-0-ElectLeader]: Stopped (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper) [2025-09-01 17:10:51,257] INFO [ExpirationReaper-0-ElectLeader]: Shutdown completed (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper) [2025-09-01 17:10:51,264] INFO [AddPartitionsToTxnSenderThread-0]: Shutting down (kafka.server.AddPartitionsToTxnManager) [2025-09-01 17:10:51,265] INFO [AddPartitionsToTxnSenderThread-0]: Stopped (kafka.server.AddPartitionsToTxnManager) [2025-09-01 17:10:51,265] INFO [AddPartitionsToTxnSenderThread-0]: Shutdown completed (kafka.server.AddPartitionsToTxnManager) [2025-09-01 17:10:51,267] INFO [ReplicaManager broker=0] Shut down completely (kafka.server.ReplicaManager) [2025-09-01 17:10:51,267] INFO [zk-broker-0-to-controller-alter-partition-channel-manager]: Shutting down (kafka.server.NodeToControllerRequestThread) [2025-09-01 17:10:51,268] INFO [zk-broker-0-to-controller-alter-partition-channel-manager]: Stopped (kafka.server.NodeToControllerRequestThread) [2025-09-01 17:10:51,268] INFO [zk-broker-0-to-controller-alter-partition-channel-manager]: Shutdown completed (kafka.server.NodeToControllerRequestThread) [2025-09-01 17:10:51,269] INFO Node to controller channel manager for alter-partition shutdown (kafka.server.NodeToControllerChannelManagerImpl) [2025-09-01 17:10:51,269] INFO [zk-broker-0-to-controller-forwarding-channel-manager]: Shutting down (kafka.server.NodeToControllerRequestThread) [2025-09-01 17:10:51,270] INFO [zk-broker-0-to-controller-forwarding-channel-manager]: Stopped (kafka.server.NodeToControllerRequestThread) 
[2025-09-01 17:10:51,270] INFO [zk-broker-0-to-controller-forwarding-channel-manager]: Shutdown completed (kafka.server.NodeToControllerRequestThread) [2025-09-01 17:10:51,270] INFO Node to controller channel manager for forwarding shutdown (kafka.server.NodeToControllerChannelManagerImpl) [2025-09-01 17:10:51,271] INFO Shutting down. (kafka.log.LogManager) [2025-09-01 17:10:51,273] INFO [kafka-log-cleaner-thread-0]: Shutting down (kafka.log.LogCleaner$CleanerThread) [2025-09-01 17:10:51,275] INFO [kafka-log-cleaner-thread-0]: Stopped (kafka.log.LogCleaner$CleanerThread) [2025-09-01 17:10:51,275] INFO [kafka-log-cleaner-thread-0]: Shutdown completed (kafka.log.LogCleaner$CleanerThread) [2025-09-01 17:10:51,336] INFO Shutdown complete. (kafka.log.LogManager) [2025-09-01 17:10:51,490] INFO [feature-zk-node-event-process-thread]: Shutting down (kafka.server.FinalizedFeatureChangeListener$ChangeNotificationProcessorThread) [2025-09-01 17:10:51,490] INFO [feature-zk-node-event-process-thread]: Stopped (kafka.server.FinalizedFeatureChangeListener$ChangeNotificationProcessorThread) [2025-09-01 17:10:51,490] INFO [feature-zk-node-event-process-thread]: Shutdown completed (kafka.server.FinalizedFeatureChangeListener$ChangeNotificationProcessorThread) [2025-09-01 17:10:51,491] INFO [ZooKeeperClient Kafka server] Closing. (kafka.zookeeper.ZooKeeperClient) [2025-09-01 17:10:51,623] INFO Session: 0x3000359362800e9 closed (org.apache.zookeeper.ZooKeeper) [2025-09-01 17:10:51,624] INFO EventThread shut down for session: 0x3000359362800e9 (org.apache.zookeeper.ClientCnxn) [2025-09-01 17:10:51,745] INFO [ZooKeeperClient Kafka server] Closed. 
(kafka.zookeeper.ZooKeeperClient) [2025-09-01 17:10:51,748] INFO [ThrottledChannelReaper-Fetch]: Shutting down (kafka.server.ClientQuotaManager$ThrottledChannelReaper) [2025-09-01 17:10:51,758] INFO [ThrottledChannelReaper-Fetch]: Stopped (kafka.server.ClientQuotaManager$ThrottledChannelReaper) [2025-09-01 17:10:51,778] INFO [ThrottledChannelReaper-Fetch]: Shutdown completed (kafka.server.ClientQuotaManager$ThrottledChannelReaper) [2025-09-01 17:10:51,778] INFO [ThrottledChannelReaper-Produce]: Shutting down (kafka.server.ClientQuotaManager$ThrottledChannelReaper) [2025-09-01 17:10:51,792] INFO [ThrottledChannelReaper-Produce]: Stopped (kafka.server.ClientQuotaManager$ThrottledChannelReaper) [2025-09-01 17:10:51,792] INFO [ThrottledChannelReaper-Produce]: Shutdown completed (kafka.server.ClientQuotaManager$ThrottledChannelReaper) [2025-09-01 17:10:51,792] INFO [ThrottledChannelReaper-Request]: Shutting down (kafka.server.ClientQuotaManager$ThrottledChannelReaper) [2025-09-01 17:10:51,792] INFO [ThrottledChannelReaper-Request]: Stopped (kafka.server.ClientQuotaManager$ThrottledChannelReaper) [2025-09-01 17:10:51,792] INFO [ThrottledChannelReaper-Request]: Shutdown completed (kafka.server.ClientQuotaManager$ThrottledChannelReaper) [2025-09-01 17:10:51,806] INFO [ThrottledChannelReaper-ControllerMutation]: Shutting down (kafka.server.ClientQuotaManager$ThrottledChannelReaper) [2025-09-01 17:10:51,806] INFO [ThrottledChannelReaper-ControllerMutation]: Stopped (kafka.server.ClientQuotaManager$ThrottledChannelReaper) [2025-09-01 17:10:51,806] INFO [ThrottledChannelReaper-ControllerMutation]: Shutdown completed (kafka.server.ClientQuotaManager$ThrottledChannelReaper) [2025-09-01 17:10:51,830] INFO [SocketServer listenerType=ZK_BROKER, nodeId=0] Shutting down socket server (kafka.network.SocketServer) [2025-09-01 17:10:52,317] INFO [SocketServer listenerType=ZK_BROKER, nodeId=0] Shutdown completed (kafka.network.SocketServer) [2025-09-01 17:10:52,363] INFO Metrics 
scheduler closed (org.apache.kafka.common.metrics.Metrics) [2025-09-01 17:10:52,363] INFO Metrics reporters closed (org.apache.kafka.common.metrics.Metrics) [2025-09-01 17:10:52,415] INFO Broker and topic stats closed (kafka.server.BrokerTopicStats) [2025-09-01 17:10:52,554] INFO App info kafka.server for 0 unregistered (org.apache.kafka.common.utils.AppInfoParser) [2025-09-01 17:10:52,610] INFO [KafkaServer id=0] shut down completed (kafka.server.KafkaServer) [2025-09-01 17:10:52,610] ERROR Exiting Kafka due to fatal exception during startup. (kafka.Kafka$) org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists at org.apache.zookeeper.KeeperException.create(KeeperException.java:126) at kafka.zk.KafkaZkClient$CheckedEphemeral.getAfterNodeExists(KafkaZkClient.scala:2238) at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:2176) at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:2143) at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:110) at kafka.server.KafkaServer.startup(KafkaServer.scala:404) at kafka.Kafka$.main(Kafka.scala:112) at kafka.Kafka.main(Kafka.scala) [2025-09-01 17:10:52,953] INFO [KafkaServer id=0] shutting down (kafka.server.KafkaServer)
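When a startup log is this long, the two session IDs that matter are easy to miss. The following is a minimal sketch, not part of Kafka, that pulls them out of the log text. The function name `find_ephemeral_conflict` is hypothetical, and the pattern assumes the message is worded exactly like the `KafkaZkClient$CheckedEphemeral` error in the log above; other Kafka versions may phrase it differently.

```python
import re

# Assumed wording: copied from the CheckedEphemeral ERROR line in the log above.
CONFLICT_RE = re.compile(
    r"Error while creating ephemeral at (?P<path>\S+), node already exists "
    r"and owner '(?P<owner>0x[0-9a-fA-F]+)' does not match "
    r"current session '(?P<current>0x[0-9a-fA-F]+)'"
)


def find_ephemeral_conflict(log_text):
    """Return (znode_path, owner_session, current_session), or None if absent.

    Only the first matching message in log_text is reported.
    """
    match = CONFLICT_RE.search(log_text)
    if match is None:
        return None
    return match.group("path"), match.group("owner"), match.group("current")
```

Feeding it the whole log text returns the znode path together with the owning and current session IDs, which are the values needed when inspecting the node in ZooKeeper.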
### ZooKeeper node conflict: analysis and fix

#### **Cause**

Core error in the log:

```java
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
```

**What is happening**: when Kafka registers its broker ID in ZooKeeper (path `/brokers/ids/0`), the node already exists and belongs to **a different ZooKeeper session** (owner session ID `0x200038df42600ef`, while the broker's current session is `0x3000359362800e9`).

Key log evidence:

```
[2025-09-01 17:10:51,209] ERROR ... node already exists and owner '0x200038df42600ef' does not match current session '0x3000359362800e9'
```

`/brokers/ids/<broker.id>` is an ephemeral node, so it exists only as long as the session that created it. There are two likely explanations:

1. A previous broker instance exited abnormally and its ZooKeeper session has not expired yet, so the node it registered has not been removed.
2. Another process with `broker.id=0` is still running and holds the node.

---

### **Resolution steps**

#### 1. **Check the ZooKeeper node**

```bash
# Connect to the ZooKeeper ensemble (any node)
bin/zkCli.sh -server 192.168.14.124:2181

# Inspect the conflicting node
[zk] ls /brokers/ids          # should list 0
[zk] stat /brokers/ids/0      # check ephemeralOwner (the owning session ID)
```

#### 2. **Delete the conflicting node**

Before deleting, confirm that no other broker 0 is still running; removing the registration of a live broker will disrupt it. In production, do this during a low-traffic window. Waiting also works: ZooKeeper removes the node on its own once the owning session expires (`zookeeper.session.timeout.ms` is 18000 in the log above). If the node is still present well after that, the owning session is alive, which means some process is still registered as broker 0.

```bash
# Delete the node (in production, first confirm the original broker is offline)
[zk] delete /brokers/ids/0

# Verify the result
[zk] ls /brokers/ids          # 0 should no longer be listed
```

#### 3. **Restart the Kafka broker**

```bash
/usr/local/kafka_2.13-3.7.2/bin/kafka-server-start.sh /usr/local/kafka_2.13-3.7.2/config/server.properties
```

---

### **Root causes and prevention**

| **Scenario** | **Fix** | **Prevention** |
|------------------------------|------------------------------------------|----------------------------------------|
| **Broker crashed abnormally** | Manually clean up the ZooKeeper node | Stop Kafka with the provided stop script; consider a shorter `zookeeper.session.timeout.ms` (for example 8000 instead of the current 18000) |
| **Session lost during a network partition** | Check the health of the ZooKeeper ensemble | Monitor the ZooKeeper ensemble (for example Prometheus + ZK Exporter) |
| **Duplicate broker.id** | Make sure `broker.id` in each broker's `server.properties` is unique | Automated configuration management (Ansible/Puppet) |
| **ZooKeeper data out of sync** | Restart the ZooKeeper ensemble | Tune `syncLimit` and `tickTime` |

A `broker.id` should be unique and stable within the cluster, so changing it to sidestep the conflict is only reasonable in a test environment.

---

### **Advanced recovery options**

#### Option 1: inspect sessions with four-letter commands

```bash
# Show the connection count (recent ZooKeeper versions only answer commands listed in 4lw.commands.whitelist)
echo stat | nc 192.168.14.124 2181 | grep Connections

# List outstanding sessions and their ephemeral nodes (answered by the ZooKeeper leader only)
echo dump | nc 192.168.14.124 2181
```

ZooKeeper has no four-letter command for ending an individual session. To clear a stale registration, delete the node as in step 2 or wait for the owning session to expire.

#### Option 2: rebuild ZooKeeper data (extreme cases)

```bash
# Back up the existing data
zk/bin/zkServer.sh stop
cp -r zk/data /backup/zk-data-$(date +%s)

# Wipe the data and reinitialize
rm -rf zk/data/version-2/*
zk/bin/zkServer.sh start
```

> **Warning**: wiping ZooKeeper data removes every znode, including Kafka's topic and partition metadata. Do this only in a cluster maintenance window, and only when the data cannot be repaired any other way.

---

### **Recommended configuration tuning**

Add to `config/server.properties`:

```properties
# Shorter session timeout, in milliseconds (the log above shows 18000)
zookeeper.session.timeout.ms=8000

# Maximum time to wait when establishing the ZooKeeper connection, in milliseconds
zookeeper.connection.timeout.ms=30000

# Maximum number of unacknowledged requests the client sends to ZooKeeper before blocking
zookeeper.max.in.flight.requests=20
```

A shorter session timeout releases a crashed broker's registration sooner, but it also makes a healthy broker more likely to lose its session during a long GC pause or a brief network stall.
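The `stat /brokers/ids/0` check above can be folded into a pre-start script. The sketch below makes several assumptions: it takes the text printed by zkCli's `stat` command as `key = value` lines with a hexadecimal `ephemeralOwner` field, and the helper names (`parse_zk_stat`, `owner_session`, `held_by_other_session`) are hypothetical. It only reports who owns the node; deciding whether to delete it still requires confirming that the other broker is really offline.

```python
def parse_zk_stat(stat_output):
    """Parse 'key = value' lines (assumed zkCli stat format) into a dict."""
    fields = {}
    for line in stat_output.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:  # skip lines that are not key/value pairs
            fields[key.strip()] = value.strip()
    return fields


def owner_session(stat_output):
    """Return ephemeralOwner as an int; 0 means the node is not ephemeral."""
    fields = parse_zk_stat(stat_output)
    return int(fields.get("ephemeralOwner", "0x0"), 16)


def held_by_other_session(stat_output, my_session_hex):
    """True if the node is ephemeral and owned by a session other than ours."""
    owner = owner_session(stat_output)
    return owner != 0 and owner != int(my_session_hex, 16)
```

With the session IDs from the log, `held_by_other_session(stat_text, "0x3000359362800e9")` would be true while the node is still owned by `0x200038df42600ef`, which is the situation that makes broker registration fail.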