Kafka Notes 02

彼山有桥

Modified on 2022-03-11 21:07:27

Article tags: kafka, distributed systems, zookeeper

First published on 2022-03-11 17:10:38

Article link: https://blog.youkuaiyun.com/m0_61276219/article/details/123428800

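This article covers Kafka partition operations and Java API usage, including the four data partitioning strategies, consumer code, and offset commits. It discusses Kafka's data consumption model, its log addressing mechanism, safeguards against data loss, and the application of the CAP theorem. It also touches on Kafka's role in ZooKeeper and the responsibilities of the Controller Broker.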

1 Kafka cluster operations

# 1. Create a topic
# Create a topic named test with three partitions and two replicas
# Run the following on node01 to create the topic
# (the host lists must not contain spaces after the commas)
cd /opt/module/kafka_2.11-1.0.0
bin/kafka-topics.sh --create --partitions 3 --replication-factor 2 --topic test --zookeeper hadoop102:2181,hadoop103:2181,hadoop104:2181

# 2. List topics
# Run the following on node01 to list the topics that exist in Kafka
cd /opt/module/kafka_2.11-1.0.0
bin/kafka-topics.sh --list --zookeeper hadoop102:2181,hadoop103:2181,hadoop104:2181

# 3. Produce data
# Run the following on node01 to simulate a producer
cd /opt/module/kafka_2.11-1.0.0
bin/kafka-console-producer.sh --broker-list hadoop102:9092,hadoop103:9092,hadoop104:9092 --topic test

# 4. Consume data
# Run the following on node02 to simulate a consumer
cd /opt/module/kafka_2.11-1.0.0
bin/kafka-console-consumer.sh --from-beginning --topic test --zookeeper node01:2181,node02:2181,node03:2181

2 Kafka Java API operations

Dependencies to add

<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients -->
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>1.0.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-streams</artifactId>
        <version>1.0.0</version>
    </dependency>
</dependencies>

<build>
    <plugins>
        <!-- Java compiler plugin -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.2</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>
    </plugins>
</build>

2.1 Producer code

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/**
 * Producer for order data.
 */
public class OrderProducer {
    public static void main(String[] args) throws InterruptedException {
        /* 1. Connect to the cluster through configuration properties
         * 2. Send data: topic "order", value
         */
        Properties props = new Properties();
        props.put("bootstrap.servers", "hadoop102:9092");
        props.put("acks", "all");
        props.put("retries", 0);
        props.put("batch.size", 16384);
        props.put("linger.ms", 1);
        props.put("buffer.memory", 33554432);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> kafkaProducer = new KafkaProducer<String, String>(props);
        for (int i = 0; i < 1000; i++) {
            // Sending requires a ProducerRecord; the minimal constructor takes (String topic, V value)
            kafkaProducer.send(new ProducerRecord<String, String>("order", "Order message! " + i));
            Thread.sleep(100);
        }
        // Flush any buffered records and release resources
        kafkaProducer.close();
    }
}

2.2 Data partitioning in Kafka

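First, data in Kafka is ordered, but only within a partition: the records inside each partition are ordered, while there is no ordering across partitions.

To guarantee that data is both produced and consumed in a single global order, the topic must have only one partition (which amounts to a single-node setup). This is not recommended.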

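Kafka's partitioning strategies:

First: neither a key nor a partition number is specified, so records are distributed round-robin (this is the behavior of the 1.0.0 client used here);

// ProducerRecord<String, String> producerRecord1 = new ProducerRecord<>("mypartition", "mymessage" + i);
// kafkaProducer.send(producerRecord1);

Second: a key is specified and the partition is chosen from a hash of the key. Make sure the key varies. The default partitioner hashes the serialized key bytes (with murmur2, not Java's hashCode), so if the key never changes, hash(key) % numPartitions is a fixed value and all data is written to one partition.

// ProducerRecord<String, String> producerRecord2 = new ProducerRecord<>("mypartition", "mykey" + i, "mymessage" + i);
// kafkaProducer.send(producerRecord2);

Third: a partition number is specified explicitly

// ProducerRecord<String, String> producerRecord3 = new ProducerRecord<>("mypartition", 0, "mykey", "mymessage" + i);
// kafkaProducer.send(producerRecord3);

Fourth: a custom partitioning strategy. Do not specify a partition number in this case: if one is specified, the record is still sent to that partition and the custom partitioner is not consulted.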

# Add this configuration in the main code
props.put("partitioner.class", "cn.itcast.kafka.partitioner.KafkaCustomPartitioner");

kafkaProducer.send(new ProducerRecord<String, String>("mypartition", "mymessage" + i));

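# Custom partitioning strategy
package cn.itcast.kafka.partitioner;

import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.PartitionInfo;

public class KafkaCustomPartitioner implements Partitioner {
    @Override
    public void configure(Map<String, ?> configs) {
    }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
        // Pick a random partition among all partitions of the topic
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        int partitionNum = partitions.size();
        return ThreadLocalRandom.current().nextInt(partitionNum);
    }

    @Override
    public void close() {
    }
}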
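
The warning about the second strategy (a key that never changes sends everything to one partition) can be checked without a broker. The sketch below is not Kafka's implementation: it assumes String.hashCode() as a stand-in for the murmur2 hash that the default partitioner applies to the serialized key bytes, and KeyPartitionDemo and partitionFor are hypothetical names used only for illustration.

```java
import java.util.HashSet;
import java.util.Set;

public class KeyPartitionDemo {
    // Stand-in for the default partitioner: non-negative hash of the key modulo the partition count.
    // Assumption: String.hashCode() replaces the murmur2 hash Kafka applies to the key bytes.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        Set<Integer> fixedKey = new HashSet<>();
        Set<Integer> varyingKey = new HashSet<>();
        for (int i = 0; i < 1000; i++) {
            // Same key every time: the partition never changes
            fixedKey.add(partitionFor("mykey", numPartitions));
            // Key changes with i: records spread over the partitions
            varyingKey.add(partitionFor("mykey" + i, numPartitions));
        }
        System.out.println("partitions hit with a fixed key:  " + fixedKey.size());
        System.out.println("partitions hit with varying keys: " + varyingKey.size());
    }
}
```

With the fixed key only one partition is ever chosen, while the varying keys reach all three, which is why the notes insist that the key must change.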

2.3 Consumer code

import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

/**
 * Consumes order data --- javabean.tojson
 */
public class OrderConsumer {
    public static void main(String[] args) {
        // 1. Connect to the cluster
        Properties props = new Properties();
        props.put("bootstrap.servers", "hadoop-01:9092");
        props.put("group.id", "test");

        // The next two lines make the consumer commit offsets automatically
        props.put("enable.auto.commit", "true");
        props.put("auto.commit.interval.ms", "1000");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<String, String>(props);
        // 2. Subscribe to the topic to consume: order
        kafkaConsumer.subscribe(Arrays.asList("order"));
        while (true) {
            // For comparison, in the JDK: Queue inserts with offer and retrieves with poll;
            // BlockingQueue inserts with put and retrieves with take
            ConsumerRecords<String, String> consumerRecords = kafkaConsumer.poll(100);
            for (ConsumerRecord<String, String> record : consumerRecords) {
                System.out.println("Consumed data: " + record.value());
            }
        }
    }
}

2.3 Committing offsets manually in the consumer

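After the consumer finishes processing the data from one partition, it commits the offset once, recording how far consumption has progressed in that partition. Production environments generally commit offsets manually.

To commit the offset manually after each round of consumption:

Asynchronous commit, which does not block the consuming program

consumer.commitAsync();

Synchronous commit: once the data has been consumed, the offset is committed, and the next round of consumption starts only after the commit completes

consumer.commitSync();

// Turn off automatic offset commits
props.put("enable.auto.commit", "false");

The committed offset should always be the offset of the next message the application will read. Therefore, when calling commitSync(offsets), add one to the offset of the last processed message.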

// `consumer` is a configured KafkaConsumer<String, String> and `running` a boolean flag, both defined elsewhere
try {
    while (running) {
        ConsumerRecords<String, String> records = consumer.poll(Long.MAX_VALUE);
        for (TopicPartition partition : records.partitions()) {
            List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
            for (ConsumerRecord<String, String> record : partitionRecords) {
                System.out.println(record.offset() + ": " + record.value());
            }
            // Commit the offset of the next record to read: last processed offset + 1
            long lastOffset = partitionRecords.get(partitionRecords.size() - 1).offset();
            consumer.commitSync(Collections.singletonMap(partition, new OffsetAndMetadata(lastOffset + 1)));
        }
    }
} finally {
    consumer.close();
}
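
Why the committed value is the last offset plus one can be seen on a toy partition with no broker involved. This is a minimal sketch under stated assumptions: a plain list stands in for the partition, the list index plays the role of the offset, and CommitOffsetDemo and resumeFrom are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

public class CommitOffsetDemo {
    // Returns the records a consumer would read next, given the committed offset.
    // Assumption: the list index plays the role of the record offset in this toy partition.
    static List<String> resumeFrom(List<String> partitionLog, long committedOffset) {
        return partitionLog.subList((int) committedOffset, partitionLog.size());
    }

    public static void main(String[] args) {
        List<String> partitionLog = Arrays.asList("m0", "m1", "m2", "m3", "m4");
        long lastProcessed = 2; // m0, m1 and m2 were handled before the restart

        // Committing lastProcessed + 1 resumes at the first unprocessed record.
        System.out.println(resumeFrom(partitionLog, lastProcessed + 1)); // [m3, m4]
        // Committing lastProcessed itself would deliver m2 a second time.
        System.out.println(resumeFrom(partitionLog, lastProcessed));     // [m2, m3, m4]
    }
}
```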

2.4 Consuming data from specified partitions

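To use this mode, call assign(Collection) with the full list of partitions you want to consume, instead of subscribing to the topic with subscribe.

Topic subscription and manual partition assignment are mutually exclusive: a consumer can use only one of the two.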

String topic = "foo";
TopicPartition partition0 = new TopicPartition(topic, 0);
TopicPartition partition1 = new TopicPartition(topic, 1);
consumer.assign(Arrays.asList(partition0, partition1));
// Manual assignment of the partitions to consume --- end
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
    }
}

3 Duplicate consumption and data loss

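Kafka's data consumption models:

exactly once: each message is consumed once and only once

at least once: each message is consumed at least once, so duplicate consumption may occur

at most once: each message is consumed at most once, so data loss may occur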

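The following situations are common during Kafka's push/pull process (here the consumer writes the data to HBase):

Write to HBase succeeds, but the offset commit fails: duplicate consumption

Write to HBase fails, but the offset commit succeeds: data loss

Write to HBase succeeds and the offset commit succeeds: normal consumption

Write to HBase fails and the offset commit fails: the data is consumed again

The cause of duplicate consumption or data loss: offsets are not managed well

Solution: store the offset values in Redis or HBase, so that the application itself controls when an offset is recorded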
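
One way to read this fix: if the result and the offset go into the external store in a single step, the four cases above collapse into two, because the write and the offset update succeed or fail together. The sketch below assumes an in-memory Store class in place of Redis or HBase. OffsetStoreDemo, Store, writeWithOffset, and consume are hypothetical names, and the single-step write is only simulated here; a real store would need a transaction or an idempotent write to provide it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OffsetStoreDemo {
    // Hypothetical external store (standing in for Redis or HBase) that keeps
    // the written rows and the next offset to read side by side.
    static class Store {
        final List<String> rows = new ArrayList<>();
        long nextOffset = 0;

        // Writes the row and advances the offset in one step (simulated atomicity).
        void writeWithOffset(String row, long offset) {
            rows.add(row);
            nextOffset = offset + 1;
        }
    }

    // Consumes the log starting at the stored offset and stops after
    // maxRecords records, which simulates a crash in the middle of the log.
    static void consume(List<String> log, Store store, int maxRecords) {
        int handled = 0;
        for (long offset = store.nextOffset; offset < log.size() && handled < maxRecords; offset++) {
            store.writeWithOffset(log.get((int) offset), offset);
            handled++;
        }
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("a", "b", "c", "d");
        Store store = new Store();
        consume(log, store, 2);                 // first run stops after two records
        consume(log, store, Integer.MAX_VALUE); // the restart resumes from the stored offset
        System.out.println(store.rows);         // [a, b, c, d]: nothing lost, nothing repeated
    }
}
```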

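Default locations for storing offsets:

They can be stored in ZooKeeper,
or in a topic built into Kafka, __consumer_offsets

About data consumption:

High-level API: offsets are stored in ZooKeeper. Early Kafka versions consumed through the high-level API by default.
Low-level API: offsets are stored in Kafka's built-in topic. In these notes this refers to how newer versions consume: the Java KafkaConsumer keeps each consumer group's offsets in the __consumer_offsets topic.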

4 Kafka's log addressing mechanism:

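A topic consists of multiple partitions

A partition contains multiple segments

A segment consists of two files

.log file: stores the log data

.index file: the index file

Whenever the .log file reaches 1 GB (the default log.segment.bytes), a new segment is created

* Each segment file is named after its base offset, the offset of the first message it contains, which is the offset of the previous segment's last message plus one

Binary search over the segment file names can therefore find which segment holds a given offset.

Once the segment holding the offset is known (say it is the first segment), how is the exact record located quickly?

The .index file holds sparse index entries: instead of indexing every record in the .log file, it adds an entry at intervals, which keeps the index file small. A lookup finds the nearest index entry at or before the target offset and then scans the .log file forward from that position.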
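
The two lookup steps can be sketched with sorted maps. This is an illustration under assumptions, not Kafka's code: the base offsets and index entries are made up, a TreeMap floor lookup stands in for the binary search, and SegmentLookupDemo, segmentFor, and scanStart are hypothetical names. Segment files are named with the base offset zero-padded to 20 digits, and index entries map an offset relative to the base offset to a byte position in the .log file.

```java
import java.util.TreeMap;

public class SegmentLookupDemo {
    // Base offset of each segment mapped to its .log file name.
    static final TreeMap<Long, String> SEGMENTS = new TreeMap<>();
    // Sparse index of one segment: relative offset -> byte position in the .log file.
    // Simplification: only the segment with base offset 1000 has an index here.
    static final TreeMap<Long, Long> INDEX = new TreeMap<>();

    static {
        for (long base : new long[] {0L, 1000L, 2000L}) {
            SEGMENTS.put(base, String.format("%020d.log", base));
        }
        // Made-up entries: one index entry every 50 records.
        INDEX.put(0L, 0L);
        INDEX.put(50L, 4096L);
        INDEX.put(100L, 8192L);
    }

    // Step 1: the segment holding an offset has the greatest base offset <= that offset.
    static String segmentFor(long offset) {
        return SEGMENTS.floorEntry(offset).getValue();
    }

    // Step 2: the nearest index entry at or before the target gives the byte
    // position from which the .log file is scanned forward.
    static long scanStart(long offset) {
        long base = SEGMENTS.floorKey(offset);
        return INDEX.floorEntry(offset - base).getValue();
    }

    public static void main(String[] args) {
        System.out.println(segmentFor(1075)); // 00000000000000001000.log
        System.out.println(scanStart(1075));  // 4096
    }
}
```

Offset 1075 falls in the segment whose base offset is 1000. Its relative offset is 75, the nearest index entry at or before it is 50, so the scan starts at byte 4096 and reads forward until it reaches the record.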

Kafka's log addressing mechanism (must master)