The official Kafka documentation describes the auto.offset.reset property as follows:
What to do when there is no initial offset in Kafka or if the current
offset does not exist any more on the server (e.g. because that data
has been deleted):
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw exception to the consumer if no previous offset is found for the consumer’s group
anything else: throw exception to the consumer.
The two values you will use most are earliest (reset the offset to the earliest available offset) and latest (reset it to the latest).
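Note that the reset policy only applies when the group has no usable committed offset; a valid committed offset always wins. A minimal Python sketch of that decision logic (the function and variable names are illustrative, not Kafka's actual implementation):

```python
def resolve_start_offset(committed, log_start, log_end, policy):
    """Pick a consumer's starting offset, mimicking auto.offset.reset.

    committed:  the group's committed offset, or None if the group has none.
    log_start:  earliest offset still present in the partition.
    log_end:    offset one past the last written record.
    policy:     the auto.offset.reset value.
    """
    # A valid committed offset is always used; the policy never overrides it.
    if committed is not None and log_start <= committed <= log_end:
        return committed
    # No usable offset: fall back to the configured policy.
    if policy == "earliest":
        return log_start
    if policy == "latest":
        return log_end
    if policy == "none":
        raise LookupError("no previous offset found for this consumer's group")
    raise ValueError(f"invalid auto.offset.reset value: {policy}")

# A brand-new group (committed=None) on a partition holding offsets 0..9:
print(resolve_start_offset(None, 0, 10, "earliest"))  # 0  -> re-reads everything
print(resolve_start_offset(None, 0, 10, "latest"))    # 10 -> only new messages
```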
Flume's documentation explains how offset-related consumer properties are passed to a Kafka source:
Other Kafka Consumer Properties
These properties are used to configure the Kafka Consumer. Any
consumer property supported by Kafka can be used. The only requirement
is to prepend the property name with the prefix kafka.consumer. For
example: kafka.consumer.auto.offset.reset
In other words, Flume supports any Kafka consumer property. This article focuses on configuring the kafka.consumer.auto.offset.reset property and on understanding what groupId means.
First, the meaning of groupId. The Kafka source commits its consumer offsets under the groupId configured in Flume; offsets committed under one groupId are invisible to another, so switching to a new groupId does not carry over the old group's position.
For example: suppose 10 records have been consumed in total, 7 under group1 and 3 under group2. Then group1's committed position reflects only its own 7 records, and group2's only its 3.
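That bookkeeping can be pictured as a map keyed by (group, topic, partition). A toy sketch of the idea (not Kafka's actual storage, which uses the internal __consumer_offsets topic; all names here are illustrative):

```python
# Toy model of per-group offset commits.
commits = {}  # (group, topic, partition) -> next offset to read

def commit(group, topic, partition, offset):
    """Record a consumer group's progress on one partition."""
    commits[(group, topic, partition)] = offset

# group1 has consumed 7 records and group2 has consumed 3 from the same partition:
commit("group1", "oyzm", 0, 7)
commit("group2", "oyzm", 0, 3)

# Each group only sees its own progress:
print(commits[("group1", "oyzm", 0)])     # 7
print(commits[("group2", "oyzm", 0)])     # 3
print(("group3", "oyzm", 0) in commits)   # False -> auto.offset.reset applies
```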
Flume's documentation for the Kafka source also lists a groupId property (its default value is flume), identifying the consumer group the source belongs to.
Now, earliest versus latest in this context. earliest: after switching to a groupId that has no committed offset, consumption starts from the earliest offset still present in the topic, so data already consumed under other groupIds is read again. latest: the new group starts from the latest offset, so it only receives messages produced after it starts consuming.
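The group-switch behavior can be simulated in a few lines of Python (an illustrative model, not Kafka code; the message values match the experiment below):

```python
# Messages in the topic, at offsets 0..3:
log = ["111", "222", "333", "444"]

def read_as_new_group(policy, offsets_before_group_started):
    """What a group with no committed offset reads under each reset policy."""
    if policy == "earliest":
        start = 0                              # earliest available offset
    elif policy == "latest":
        start = offsets_before_group_started   # log end when the group joined
    else:
        raise ValueError(policy)
    return log[start:]

# Offsets 0 and 1 (111, 222) existed before the new group started consuming:
print(read_as_new_group("earliest", 2))  # ['111', '222', '333', '444']
print(read_as_new_group("latest", 2))    # ['333', '444']
```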
The experiment:
Create the Kafka topic:
kafka-topics.sh --create --zookeeper 192.168.26.101:2181,192.168.26.102:2181,192.168.26.103:2181 --replication-factor 1 --partitions 1 --topic oyzm
List topics:
kafka-topics.sh --zookeeper 192.168.26.101:2181,192.168.26.102:2181,192.168.26.103:2181 --list
Send a few messages with the console producer (any values will do, e.g. 111 and 222):
kafka-console-producer.sh --broker-list 192.168.26.101:9092 --topic oyzm
Flume configuration: groupId is flume, offset reset policy is earliest.
agent.sources = kafkaSource
agent.channels = memoryChannel
agent.sinks = hdfsSink
# The channel can be defined as follows.
agent.sources.kafkaSource.channels = memoryChannel
agent.sources.kafkaSource.type=org.apache.flume.source.kafka.KafkaSource
agent.sources.kafkaSource.zookeeperConnect=192.168.26.101:2181,192.168.26.102:2181,192.168.26.103:2181
agent.sources.kafkaSource.topic=oyzm
agent.sources.kafkaSource.groupId=flume
agent.sources.kafkaSource.kafka.consumer.timeout.ms=100
agent.sources.kafkaSource.migrateZookeeperOffsets = false
agent.sources.kafkaSource.kafka.consumer.auto.offset.reset = earliest
agent.channels.memoryChannel.type=memory
agent.channels.memoryChannel.capacity=1000
agent.channels.memoryChannel.transactionCapacity=100
# the sink of hdfs
agent.sinks.hdfsSink.type=hdfs
agent.sinks.hdfsSink.channel = memoryChannel
agent.sinks.hdfsSink.hdfs.path=hdfs://master:9000/user/flume_test
agent.sinks.hdfsSink.hdfs.writeFormat=Text
agent.sinks.hdfsSink.hdfs.fileType=DataStream
Start Flume (the -n argument must match the agent name used in the config file, here agent):
flume-ng agent -f test1.conf -n agent -Dflume.root.logger=INFO,console
Files now appear in HDFS; their content is 111 and 222.
Next, clear the HDFS directory and change groupId in the Flume config to flume1. Send two more messages, 333 and 444, from the producer console.
Start Flume with the modified configuration. Files are written to HDFS again, and their content is 111, 222, 333, 444: the earliest reset took effect, re-reading data the flume group had already consumed.
Had latest been configured instead (with messages 333 and 444 produced after the flume1 consumer started), the files would contain only 333 and 444, because under latest a group with no committed offset starts at the log end and sees only messages produced from then on.
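The only line that changes relative to the earlier configuration is the reset policy:

```properties
agent.sources.kafkaSource.kafka.consumer.auto.offset.reset = latest
```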