Kafka

Latest recommended article published on 2023-08-06 09:39:40

Phantomape

License: CC 4.0 BY-SA

Category: Cloud

Article link: https://blog.youkuaiyun.com/Phantomape/article/details/53167391

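This article introduces Kafka, an efficient messaging system. Compared with traditional enterprise messaging systems, Kafka is distributed and offers high throughput, which makes it suitable for large-scale log processing. The article also explains Kafka's core concepts, such as topics, partitions, producers, and consumers, and discusses how it handles data persistence, transfer efficiency, and state management.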

About Kafka

Overview

Kafka is a distributed messaging system for collecting and delivering high volumes of log data with low latency. On the one hand, Kafka is distributed and scalable, and offers high throughput. On the other hand, it provides an API similar to that of a messaging system and allows applications to consume log events in real time.

Traditional Messaging System

Traditional enterprise messaging systems tend not to be a good fit for log processing. The reasons are as follows:
1. The features they offer do not match what log processing needs. Rich delivery guarantees, such as acknowledging individual messages out of order, are unneeded here and only add complexity.
2. Many systems do not treat throughput as their primary design constraint. For example, they offer no batch operations.
3. They are weak in distributed support.
4. Many messaging systems assume near-immediate consumption of messages, so the queue of unconsumed messages is expected to stay fairly small.

Kafka

A stream of messages of a particular type is defined by a topic. A producer can publish messages to a topic. The published messages are then stored at a set of servers called brokers. To balance load, a topic is divided into multiple partitions, and each broker stores one or more of those partitions. A consumer can subscribe to one or more topics from the brokers and consume the subscribed messages by pulling data from the brokers.
Kafka Architecture
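
To make the relationship between topics, partitions, and brokers more concrete, here is a minimal Python sketch. It is only an illustration: the round-robin placement, the key-based partitioner, and the names `assign_partitions` and `choose_partition` are assumptions of this sketch and are not taken from Kafka's code.

```python
import zlib
from collections import defaultdict


def assign_partitions(num_partitions, brokers):
    """Spread partition ids over brokers round-robin (illustrative placement only)."""
    placement = defaultdict(list)
    for partition in range(num_partitions):
        placement[brokers[partition % len(brokers)]].append(partition)
    return dict(placement)


def choose_partition(key, num_partitions):
    """Map a message key to a partition with a stable hash (hypothetical partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# One topic with 6 partitions stored on 3 brokers: each broker holds 2 partitions.
placement = assign_partitions(6, ["broker-0", "broker-1", "broker-2"])
print(placement)

# Messages with the same key always land in the same partition.
print(choose_partition("user-42", 6))
```

A stable hash is used instead of Python's built-in `hash`, whose result for strings changes between interpreter runs.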

1. Efficiency on Single Partition

Simple Storage:
Every time a producer publishes a message to a partition, the broker simply appends the message to the last segment file. Segment files are flushed to disk only after a configurable number of messages have been published or a certain amount of time has elapsed. A message is exposed to consumers only after it has been flushed. A message stored in Kafka does not have an explicit message id. Instead, each message is addressed by its logical offset in the log.
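
The append-and-flush behaviour described above can be sketched in a few lines of Python. This is a toy model, not Kafka's storage code: `PartitionLog` and `flush_every` are hypothetical names, offsets are modelled as plain sequence numbers, and the time-based flush is left out.

```python
class PartitionLog:
    """Toy append-only partition log (illustrative, in memory only)."""

    def __init__(self, flush_every=3):
        self.flush_every = flush_every  # hypothetical "flush after N messages" setting
        self.messages = []              # list index doubles as the logical offset
        self.flushed_upto = 0           # offsets below this value are visible to consumers

    def append(self, payload):
        """Append a message and return the offset it was stored at."""
        self.messages.append(payload)
        offset = len(self.messages) - 1
        if len(self.messages) - self.flushed_upto >= self.flush_every:
            self.flush()
        return offset

    def flush(self):
        """Make every message appended so far visible to consumers."""
        self.flushed_upto = len(self.messages)

    def read(self, offset, max_messages=10):
        """Return (offset, payload) pairs starting at offset, flushed messages only."""
        end = min(self.flushed_upto, offset + max_messages)
        return list(enumerate(self.messages[offset:end], start=offset))


log = PartitionLog(flush_every=3)
log.append(b"a")
log.append(b"b")
print(log.read(0))  # nothing is visible yet: the flush threshold has not been reached
log.append(b"c")
print(log.read(0))  # all three messages are visible after the flush
```

Because a consumer asks for data by offset, the broker needs no per-message id index to find where to start reading.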
Efficient Transfer:
Kafka avoids explicitly caching messages in memory at the Kafka layer. Instead, it relies on the underlying file system page cache, which avoids double buffering. Since Kafka does not cache messages in process at all, it has very little garbage collection overhead, which makes an efficient implementation in a VM-based language feasible. In addition, Kafka is a multi-subscriber system, so a single message may be consumed multiple times by different consumer applications.
Stateless Broker:
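The information about how much each consumer has consumed is kept by the consumer itself, not by the broker, which reduces the complexity and the overhead on the broker. Because the broker does not know whether every subscriber has consumed a message, it uses a simple time-based policy: a message is automatically deleted once it has been retained in the broker longer than a certain period, typically 7 days. A side benefit of this design is that a consumer can deliberately rewind to an old offset and re-consume data.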
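
A minimal sketch of this division of labour, under stated assumptions: `Broker`, `Consumer`, `expire`, and `rewind` are hypothetical names, timestamps are passed in by hand, and retention is applied per message here, which is a simplification of how a real broker would delete data.

```python
class Broker:
    """Toy broker: stores messages and expires them by age, tracks no consumer state."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.log = []          # entries are (offset, timestamp, payload)
        self.next_offset = 0

    def append(self, payload, now):
        self.log.append((self.next_offset, now, payload))
        self.next_offset += 1

    def expire(self, now):
        """Drop every message older than the retention period."""
        self.log = [e for e in self.log if now - e[1] <= self.retention_seconds]

    def fetch(self, offset):
        return [(o, p) for o, _, p in self.log if o >= offset]


class Consumer:
    """Toy consumer: the offset lives here, not in the broker."""

    def __init__(self, broker):
        self.broker = broker
        self.offset = 0

    def poll(self):
        batch = self.broker.fetch(self.offset)
        if batch:
            self.offset = batch[-1][0] + 1
        return [payload for _, payload in batch]

    def rewind(self, offset):
        self.offset = offset


week = 7 * 24 * 3600
broker = Broker(retention_seconds=week)
broker.append("a", now=0)
broker.append("b", now=10)

consumer = Consumer(broker)
print(consumer.poll())   # reads "a" and "b"
consumer.rewind(0)
print(consumer.poll())   # re-consumes both, the broker still retains them
broker.expire(now=week + 5)
consumer.rewind(0)
print(consumer.poll())   # "a" has aged out, only "b" is left
```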

2. Distributed Coordination

A partition within a topic is the smallest unit of parallelism. Kafka has the concept of consumer groups. Each consumer group consists of one or more consumers that jointly consume a set of subscribed topics, and each message is delivered to only one of the consumers within the group.
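
Since a partition is the smallest unit of parallelism, sharing work inside a group comes down to handing each partition to exactly one consumer. The sketch below shows one possible way to do that, a simple range-style split. The function name `assign` and the splitting rule are assumptions of this sketch, not necessarily what Kafka itself does.

```python
def assign(partitions, consumers):
    """Give every partition to exactly one consumer of the group (range-style split)."""
    partitions = sorted(partitions)
    consumers = sorted(consumers)
    if not consumers:
        return {}
    base, extra = divmod(len(partitions), len(consumers))
    result, start = {}, 0
    for i, consumer in enumerate(consumers):
        # The first `extra` consumers take one partition more than the others.
        count = base + (1 if i < extra else 0)
        result[consumer] = partitions[start:start + count]
        start += count
    return result


print(assign(range(5), ["c1", "c2"]))
print(assign(range(5), ["c1", "c2", "c3"]))  # the split changes when a consumer joins
```

With more consumers than partitions, the extra consumers receive an empty list and stay idle, which is why the partition count bounds the parallelism of a group.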
Kafka has no central "master" node. Instead, consumers coordinate among themselves in a decentralized fashion. To facilitate this coordination, Kafka employs ZooKeeper, a highly available consensus service with a very simple, file-system-like API.

Through the ZooKeeper API, a client can create a path, set the value of a path, read the value of a path, delete a path, and list the children of a path. ZooKeeper also does a few more interesting things: (a) a client can register a watcher on a path and get notified when the children of the path or its value have changed; (b) a path can be created as ephemeral (as opposed to persistent), which means that if the creating client is gone, the path is automatically removed by the ZooKeeper server; (c) ZooKeeper replicates its data to multiple servers, which makes the data highly reliable and available.

Kafka uses ZooKeeper for the following tasks: (1) detecting the addition and the removal of brokers and consumers, (2) triggering a rebalance process in each consumer when the above events happen, and (3) maintaining the consumption relationship and keeping track of the consumed offset of each partition.
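
The interplay of ephemeral paths and watchers can be illustrated with a small in-memory stand-in. This is not the ZooKeeper client API: `ToyRegistry`, its methods, and the path `/consumers/group-1` are hypothetical. One deliberate simplification is that the watchers here stay registered after they fire, whereas classic ZooKeeper watches are one-time triggers that must be set again.

```python
class ToyRegistry:
    """In-memory stand-in for a coordination service (illustrative only)."""

    def __init__(self):
        self.children = {}   # parent path -> {child name: owning session}
        self.watchers = {}   # parent path -> list of callbacks

    def watch_children(self, parent, callback):
        self.watchers.setdefault(parent, []).append(callback)

    def create_ephemeral(self, parent, name, session):
        """Register a child that lives only as long as its session."""
        self.children.setdefault(parent, {})[name] = session
        self._notify(parent)

    def close_session(self, session):
        """Remove every ephemeral child owned by the session, as if the client died."""
        for parent, kids in self.children.items():
            gone = [name for name, owner in kids.items() if owner == session]
            for name in gone:
                del kids[name]
            if gone:
                self._notify(parent)

    def list_children(self, parent):
        return sorted(self.children.get(parent, {}))

    def _notify(self, parent):
        for callback in self.watchers.get(parent, []):
            callback(self.list_children(parent))


events = []  # each entry is the group membership seen by the watcher
registry = ToyRegistry()
registry.watch_children("/consumers/group-1", events.append)
registry.create_ephemeral("/consumers/group-1", "consumer-a", session="s1")
registry.create_ephemeral("/consumers/group-1", "consumer-b", session="s2")
registry.close_session("s1")  # consumer-a disappears without an explicit delete
print(events)
```

In a real deployment, each membership change seen by the watcher would be the point at which a consumer starts a rebalance.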

3. Delivery Guarantees

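In general, Kafka only guarantees at-least-once delivery: a message may be delivered more than once, for example when a consumer crashes before recording the offset of messages it has already processed.
Kafka guarantees that messages from a single partition are delivered to a consumer in order. There is no ordering guarantee across different partitions.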
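
A small simulation shows where the duplicates in at-least-once delivery come from. It assumes a consumer that processes a message first and records its offset afterwards. The function `consume` and its `crash_at` parameter are hypothetical and exist only to inject a failure between those two steps.

```python
def consume(log, committed, process, crash_at=None):
    """Process messages from the committed offset onward, return the new committed offset.

    If crash_at is set, the consumer "crashes" right after processing that
    offset and before recording it, so the progress on that message is lost.
    """
    for offset in range(committed, len(log)):
        process(log[offset])
        if offset == crash_at:
            return committed        # crash: processed but never committed
        committed = offset + 1      # commit only after successful processing
    return committed


log = ["m0", "m1", "m2"]
seen = []

committed = consume(log, 0, seen.append, crash_at=1)   # crashes after processing m1
committed = consume(log, committed, seen.append)       # restart from the last commit
print(seen)       # m1 appears twice, nothing is skipped, order is preserved
print(committed)
```

Committing before processing would flip the trade-off: no duplicates, but a crash could skip a message. An application that cannot tolerate duplicates has to add its own de-duplication logic.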

Install Kafka

Download and extract Kafka from http://kafka.apache.org/downloads.html. I extracted it to C:\Kafka.

Set Up Kafka

Go to your Kafka config directory. In my case it is C:\Kafka\kafka_2.11-0.10.1.0\config
Edit the file "server.properties".
Find the line "log.dirs=/tmp/kafka-logs" and change it to "log.dirs=C:/Kafka/kafka_2.11-0.10.1.0/kafka-logs". Use forward slashes (or doubled backslashes), because a single backslash is an escape character in a Java properties file. (The exact location of the log directory does not matter.)
If ZooKeeper runs on another machine or cluster, change "zookeeper.connect=localhost:2181" to its host and port. This demo runs everything on the same machine, so no change is needed. The Kafka port and broker.id can also be configured in this file. Leave the other settings as they are.
With these settings, Kafka listens on its default port 9092 and connects to ZooKeeper on its default port 2181.
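
For reference, the relevant part of server.properties might look like the excerpt below after these edits. Treat it as a sketch for the single-machine demo: the path assumes the install location used in this article, and `broker.id=0` is assumed to be the value already present in the shipped file.

```
# config/server.properties (excerpt, single-machine demo)

# Unique id of this broker within the cluster
broker.id=0

# Where Kafka stores its log segments; forward slashes avoid
# backslash escaping in the properties file
log.dirs=C:/Kafka/kafka_2.11-0.10.1.0/kafka-logs

# ZooKeeper host and port; change this if ZooKeeper runs elsewhere
zookeeper.connect=localhost:2181
```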

Run Kafka Server

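Go to your Kafka installation directory, C:\Kafka\kafka_2.11-0.10.1.0\
Open a command prompt there: hold Shift, right-click, and choose "Open command window here".
Type .\bin\windows\kafka-server-start.bat .\config\server.properties and press Enter. ZooKeeper must already be running before you start the Kafka server. If you do not have a separate ZooKeeper installation, the Kafka distribution includes a .\bin\windows\zookeeper-server-start.bat script that can be started with .\config\zookeeper.properties.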
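
To check that both servers are actually up, one option is to test whether their ports accept TCP connections. The helper below is a small sketch using only the Python standard library. The name `port_open` is hypothetical, and the host and ports assume the single-machine defaults mentioned above (2181 for ZooKeeper, 9092 for Kafka).

```python
import socket


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in (("ZooKeeper", 2181), ("Kafka", 9092)):
        state = "open" if port_open("localhost", port) else "closed"
        print(f"{name} port {port}: {state}")
```

An open port only shows that something is listening there. It does not prove that the broker has finished starting or has registered with ZooKeeper.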