Kafka CommitFailedException

走向自由

License: CC 4.0 BY-SA

Category: spark

Article link: https://blog.youkuaiyun.com/adorechen/article/details/109045937

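This article covers the Kafka consumer heartbeat mechanism, liveness detection, and how tuning `max.poll.interval.ms` and `max.poll.records` keeps a consumer alive while it processes large batches. When processing takes longer than `max.poll.interval.ms`, the consumer leaves the group so that it does not block other consumers. `max.poll.records` caps the number of records returned by each call to `poll()`, which helps control processing time and the impact of rebalances. A CommitFailedException usually means the consumer did not call `poll()` in time. The fix is to adjust the configuration, for example setting `max.poll.records` to 300 and `max.poll.interval.ms` to 600000.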

1. Exception:

org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.

2. How consumer liveness is determined

Detecting Consumer Failures
After subscribing to a set of topics, the consumer will automatically join the group when poll(long) is invoked. The poll API is designed to ensure consumer liveness. As long as you continue to call poll, the consumer will stay in the group and continue to receive messages from the partitions it was assigned. Underneath the covers, the consumer sends periodic heartbeats to the server. If the consumer crashes or is unable to send heartbeats for a duration of session.timeout.ms, then the consumer will be considered dead and its partitions will be reassigned.
It is also possible that the consumer could encounter a "livelock" situation where it is continuing to send heartbeats, but no progress is being made. To prevent the consumer from holding onto its partitions indefinitely in this case, we provide a liveness detection mechanism using the max.poll.interval.ms setting. Basically if you don't call poll at least as frequently as the configured max interval, then the client will proactively leave the group so that another consumer can take over its partitions. When this happens, you may see an offset commit failure (as indicated by a CommitFailedException thrown from a call to commitSync()). This is a safety mechanism which guarantees that only active members of the group are able to commit offsets. So to stay in the group, you must continue to call poll.
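
The two checks described above can be pictured with a toy model. This is only a sketch for building intuition: `LivenessModel`, its `Status` enum, and the timestamps are hypothetical names chosen for illustration, and the real logic is split between the group coordinator and the client rather than living in one method.

```java
/** Toy model of the two liveness checks; not Kafka's actual implementation. */
final class LivenessModel {
    enum Status { ACTIVE, SESSION_EXPIRED, POLL_INTERVAL_EXCEEDED }

    private final long sessionTimeoutMs;   // stands in for session.timeout.ms
    private final long maxPollIntervalMs;  // stands in for max.poll.interval.ms

    LivenessModel(long sessionTimeoutMs, long maxPollIntervalMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
        this.maxPollIntervalMs = maxPollIntervalMs;
    }

    /** Classifies a consumer from the times of its last heartbeat and last poll. */
    Status check(long lastHeartbeatMs, long lastPollMs, long nowMs) {
        // No heartbeat within the session timeout: the consumer is considered dead.
        if (nowMs - lastHeartbeatMs > sessionTimeoutMs) {
            return Status.SESSION_EXPIRED;
        }
        // Heartbeats still arrive, but poll() was not called in time ("livelock").
        if (nowMs - lastPollMs > maxPollIntervalMs) {
            return Status.POLL_INTERVAL_EXCEEDED;
        }
        return Status.ACTIVE;
    }

    public static void main(String[] args) {
        LivenessModel model = new LivenessModel(10_000, 300_000);
        // Heartbeat 1 s ago, but last poll 400 s ago.
        System.out.println(model.check(399_000, 0, 400_000));
    }
}
```

In the second case the consumer is still heartbeating, which is exactly the situation in which a later commitSync() can fail with CommitFailedException.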

The consumer provides two configuration settings to control the behavior of the poll loop:

max.poll.interval.ms: By increasing the interval between expected polls, you can give the consumer more time to handle a batch of records returned from poll(long). The drawback is that increasing this value may delay a group rebalance since the consumer will only join the rebalance inside the call to poll. You can use this setting to bound the time to finish a rebalance, but you risk slower progress if the consumer cannot actually call poll often enough.
max.poll.records: Use this setting to limit the total records returned from a single call to poll. This can make it easier to predict the maximum that must be handled within each poll interval. By tuning this value, you may be able to reduce the poll interval, which will reduce the impact of group rebalancing.
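
To see how the two settings interact, a rough back-of-the-envelope check can help. The sketch below assumes you have an estimate of the worst-case processing time per record; `PollBudget` and its methods are hypothetical helpers, not part of the Kafka client.

```java
/** Rough sizing helper: does a full batch fit inside one poll interval? */
final class PollBudget {
    /** Worst-case time to process a full batch, in milliseconds. */
    static long worstCaseBatchMs(int maxPollRecords, long worstCasePerRecordMs) {
        return Math.multiplyExact((long) maxPollRecords, worstCasePerRecordMs);
    }

    /** True if a full batch can be processed before the next poll is due. */
    static boolean fits(int maxPollRecords, long worstCasePerRecordMs, long maxPollIntervalMs) {
        return worstCaseBatchMs(maxPollRecords, worstCasePerRecordMs) <= maxPollIntervalMs;
    }

    /** Largest batch size that still fits within the poll interval. */
    static int maxSafeRecords(long maxPollIntervalMs, long worstCasePerRecordMs) {
        if (worstCasePerRecordMs <= 0) {
            throw new IllegalArgumentException("per-record time must be positive");
        }
        return (int) Math.min(Integer.MAX_VALUE, maxPollIntervalMs / worstCasePerRecordMs);
    }

    public static void main(String[] args) {
        // Assumed workload: up to 1 second per record.
        System.out.println(fits(500, 1_000, 300_000)); // 500 s of work, 300 s budget
        System.out.println(fits(300, 1_000, 600_000)); // 300 s of work, 600 s budget
        System.out.println(maxSafeRecords(300_000, 1_000));
    }
}
```

In practice you would leave some headroom below the computed limit, since per-record times are rarely constant.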

For use cases where message processing time varies unpredictably, neither of these options may be sufficient. The recommended way to handle these cases is to move message processing to another thread, which allows the consumer to continue calling poll while the processor is still working. Some care must be taken to ensure that committed offsets do not get ahead of the actual position. Typically, you must disable automatic commits and manually commit processed offsets for records only after the thread has finished handling them (depending on the delivery semantics you need). Note also that you will need to pause the partition so that no new records are received from poll until after thread has finished handling those previously returned.
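
When records are handed to worker threads, they may finish out of order, so the offset that is safe to commit is not simply the largest one processed. One possible way to keep committed offsets from getting ahead is to track completions per partition and commit only up to the first gap. The sketch below uses no Kafka classes; `OffsetTracker` is a hypothetical helper, and it assumes the usual convention that the committed offset is the offset of the next record to read.

```java
import java.util.HashSet;
import java.util.Set;

/** Tracks out-of-order completions for one partition (illustrative only). */
final class OffsetTracker {
    private final Set<Long> completed = new HashSet<>();
    private long nextToCommit; // first offset not yet known to be processed

    OffsetTracker(long startOffset) {
        this.nextToCommit = startOffset;
    }

    /** Called by a worker thread once the record at this offset is fully handled. */
    synchronized void markProcessed(long offset) {
        if (offset >= nextToCommit) {
            completed.add(offset);
        }
        // Advance over the contiguous run of finished offsets, stopping at the first gap.
        while (completed.remove(nextToCommit)) {
            nextToCommit++;
        }
    }

    /** Offset that is safe to commit: everything below it has been processed. */
    synchronized long committableOffset() {
        return nextToCommit;
    }

    public static void main(String[] args) {
        OffsetTracker tracker = new OffsetTracker(100);
        tracker.markProcessed(101);
        tracker.markProcessed(102);
        System.out.println(tracker.committableOffset()); // 100 is still outstanding
        tracker.markProcessed(100);
        System.out.println(tracker.committableOffset()); // gap closed
    }
}
```

The polling thread would read `committableOffset()` and pass it to its manual commit call, while the partition stays paused until the outstanding records are done.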

Two parameters that determine whether a consumer is alive

session.timeout.ms applies to the heartbeat thread. If the coordinator does not receive a heartbeat from a consumer within this interval, it marks the consumer as failed and triggers a new rebalance.

heartbeat.interval.ms controls how quickly the other, healthy consumers learn about a rebalance. When the coordinator triggers a rebalance, the other consumers find out only when they receive a heartbeat response carrying the REBALANCE_IN_PROGRESS error. The more frequently heartbeats are sent, the sooner a consumer learns that it needs to rejoin the group. Set it to a relatively low value, preferably no more than 1/3 of session.timeout.ms.
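
The 1/3 guideline is easy to check before starting a consumer. The helper below is a minimal sketch; `HeartbeatCheck` is a hypothetical name and the values in `main` are examples only, not recommended settings.

```java
/** Sanity check for the heartbeat.interval.ms vs. session.timeout.ms guideline. */
final class HeartbeatCheck {
    /** Upper bound suggested by the 1/3 rule of thumb. */
    static long recommendedMaxHeartbeatMs(long sessionTimeoutMs) {
        return sessionTimeoutMs / 3;
    }

    /** True if the heartbeat interval is positive and at most 1/3 of the session timeout. */
    static boolean isReasonable(long heartbeatIntervalMs, long sessionTimeoutMs) {
        return heartbeatIntervalMs > 0
                && heartbeatIntervalMs <= recommendedMaxHeartbeatMs(sessionTimeoutMs);
    }

    public static void main(String[] args) {
        System.out.println(isReasonable(3_000, 10_000)); // within the guideline
        System.out.println(isReasonable(5_000, 10_000)); // too close to the session timeout
    }
}
```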

Two parameters that determine whether a live consumer is actually making progress

max.poll.interval.ms: the maximum allowed delay between calls to poll(). If it is exceeded, the consumer proactively leaves the group and its partitions are reassigned (see the documentation excerpt in section 2 for the trade-offs).

max.poll.records: the maximum number of records returned by a single call to poll(), which bounds the work that must be finished within each poll interval.

3. Solution code

Adjust the configuration: increase the poll interval, or reduce the maximum number of records handled per poll.

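        // Return fewer records per poll() so each batch finishes sooner.
        props.put("max.poll.records", "300");          // default 500
        // Allow up to 10 minutes between consecutive poll() calls.
        props.put("max.poll.interval.ms", "600000");   // default 300000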
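
For context, the two settings might sit in a consumer configuration like the following. This is a sketch using only `java.util.Properties`; `ConsumerProps`, the broker address, and the group name are placeholders, and disabling auto-commit is an assumption based on the exception coming from a manual commit.

```java
import java.util.Properties;

/** Builds an example consumer configuration; all values are illustrative. */
final class ConsumerProps {
    static Properties build(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Assumption: offsets are committed manually after processing.
        props.setProperty("enable.auto.commit", "false");
        // The two settings from this article.
        props.setProperty("max.poll.records", "300");
        props.setProperty("max.poll.interval.ms", "600000");
        return props;
    }

    public static void main(String[] args) {
        // Placeholder broker address and group name.
        Properties props = build("localhost:9092", "demo-group");
        System.out.println(props.getProperty("max.poll.interval.ms"));
    }
}
```

The resulting `Properties` object can be passed to the consumer constructor in the same way as the `props` variable above.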

References

http://kafka.apache.org/10/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html

https://blog.youkuaiyun.com/shibuwodai_/article/details/80678717

https://www.jianshu.com/p/1120e26244c2