Kafka Offset Management

Tai_Park

Category: Spark and Its Companions. Tags: Kafka, offset, spark streaming

Article link: https://blog.youkuaiyun.com/qq_36329973/article/details/104825902

1. Definition

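Each partition in Kafka consists of an ordered, immutable sequence of messages that are continually appended to the partition. Every message in a partition is assigned a sequential number that uniquely identifies it within that partition.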

The offset records the sequence number of the next message to be delivered to the consumer.
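To make the definition concrete, here is a minimal sketch in plain Python. It models a partition as an append-only list. The `Partition` class and its methods are hypothetical names chosen for illustration, not part of any Kafka client API.

```python
class Partition:
    """Toy append-only log: each message gets the next sequential number."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        offset = len(self._messages)  # numbering starts at 0 and grows by 1
        self._messages.append(message)
        return offset

    def read(self, offset):
        return self._messages[offset]


partition = Partition()
first = partition.append("a")
second = partition.append("b")

# The consumer's offset is the number of the NEXT message it should receive.
consumer_offset = 0
received = []
while consumer_offset < 2:
    received.append(partition.read(consumer_offset))
    consumer_offset += 1
```

After both messages have been read, `consumer_offset` is 2: it points at a message that has not been written yet, which is exactly the "next message to deliver" meaning described above.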

The semantics of stream processing systems are usually described in terms of how many times each record can be processed by the system. There are three types of guarantees that a system can provide under all possible operating conditions (despite failures, etc.):

At most once: Each record will be either processed once or not processed at all.

At least once: Each record will be processed one or more times. This is stronger than at most once because it ensures that no data will be lost, but there may be duplicates.

Exactly once: Each record will be processed exactly once. No data will be lost and no data will be processed multiple times. This is the strongest guarantee of the three.
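Which guarantee you get depends largely on whether the offset is saved before or after a record is processed. The following is a small simulation under stated assumptions, not a real consumer: the function `run` and its parameters `commit_first` and `crash_at` are hypothetical, and a "crash" is modeled as an exception followed by a restart from the last saved offset.

```python
def run(messages, commit_first, crash_at):
    """Consume `messages`, crashing once while handling index `crash_at`.

    commit_first=True  -> save the offset, then process (at most once)
    commit_first=False -> process, then save the offset (at least once)
    """
    committed = 0      # saved offset: next message to read after a restart
    processed = []
    crashed = False
    while committed < len(messages):
        offset = committed
        try:
            if commit_first:
                committed = offset + 1
                if offset == crash_at and not crashed:
                    crashed = True
                    raise RuntimeError("crash before processing")
                processed.append(messages[offset])
            else:
                processed.append(messages[offset])
                if offset == crash_at and not crashed:
                    crashed = True
                    raise RuntimeError("crash before saving the offset")
                committed = offset + 1
        except RuntimeError:
            continue  # "restart": resume from the last saved offset
    return processed


at_most_once = run(["a", "b", "c"], commit_first=True, crash_at=1)
at_least_once = run(["a", "b", "c"], commit_first=False, crash_at=1)
```

In this sketch `at_most_once` ends up as `["a", "c"]` (record `b` is lost) and `at_least_once` as `["a", "b", "b", "c"]` (record `b` is duplicated). Exactly once typically requires something more, such as saving the result and the offset atomically or making the processing idempotent.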

2. Kafka Offset Management with Spark Streaming

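The first recommendation is to store offsets in ZooKeeper. Compared with stores such as HBase, ZooKeeper is more lightweight, and because it is deployed for HA (high availability), the offsets are safer there.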

Offset management typically involves two operations:

Saving offsets
Retrieving offsets
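The two operations can be sketched as a small store interface. This is an in-memory stand-in for ZooKeeper, not the author's implementation: the class `InMemoryOffsetStore`, its method names, and the group name `test_group` are all hypothetical. The path layout `/consumers/<group>/offsets/<topic>/<partition>` follows the convention used by the older ZooKeeper-based Kafka consumer, and defaulting to offset 0 when nothing has been saved is an assumption made for this example.

```python
class InMemoryOffsetStore:
    """Stand-in for ZooKeeper: maps a node path to a stored offset."""

    def __init__(self):
        self._nodes = {}

    @staticmethod
    def _path(group, topic, partition):
        return f"/consumers/{group}/offsets/{topic}/{partition}"

    def save_offsets(self, group, topic, offsets):
        """offsets: dict of partition -> next offset to read."""
        for partition, offset in offsets.items():
            self._nodes[self._path(group, topic, partition)] = offset

    def get_offsets(self, group, topic, partitions):
        """Return saved offsets; partitions with no saved value start at 0."""
        return {
            p: self._nodes.get(self._path(group, topic, p), 0)
            for p in partitions
        }


store = InMemoryOffsetStore()
before = store.get_offsets("test_group", "tp_kafka", [0, 1])
store.save_offsets("test_group", "tp_kafka", {0: 42, 1: 7})
after = store.get_offsets("test_group", "tp_kafka", [0, 1])
```

In a streaming job, `get_offsets` would run once at startup to decide where to begin reading, and `save_offsets` would run after each batch has been processed successfully, which gives at-least-once behaviour.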

3. Environment Setup

Start a Kafka console producer. The test uses the topic tp_kafka:

./kafka-console-producer.sh --broker-list hadoop000:9092 --topic tp_kafka

Start a Kafka console consumer:

./kafka-console-consumer.sh --zookeeper hadoop000:2181 --topic tp_kafka

In IDEA