Kafka Series: Stream Core Principles (Part 2)

Kafka Streams achieves parallel data processing through stream partitions and tasks, and leverages Kafka's partitioning and messaging layer for fault tolerance and scalability. An application's processor topology is broken down into tasks, which are assigned to threads for execution; Kafka's coordination features handle assignment and reassignment automatically. Local state stores support stateful operations, with changelog topics providing fault tolerance and recovery. Kafka Streams' fault-tolerance mechanisms include automatic restart of tasks and backup and restore of state, making failure handling transparent and efficient.

Kafka Streams simplifies application development by building on the Kafka producer and consumer libraries and leveraging the native capabilities of Kafka to offer data parallelism, distributed coordination, fault tolerance, and operational simplicity. In this section, we describe how Kafka Streams works underneath the covers.

The picture below shows the anatomy of an application that uses the Kafka Streams library. Let's walk through some details.

Stream Partitions and Tasks

The messaging layer of Kafka partitions data for storing and transporting it. Kafka Streams partitions data for processing it. In both cases, this partitioning is what enables data locality, elasticity, scalability, high performance, and fault tolerance. Kafka Streams uses the concepts of partitions and tasks as logical units of its parallelism model based on Kafka topic partitions. There are close links between Kafka Streams and Kafka in the context of parallelism:


  • Each stream partition is a totally ordered sequence of data records and maps to a Kafka topic partition.

  • A data record in the stream maps to a Kafka message from that topic.

  • The keys of data records determine the partitioning of data in both Kafka and Kafka Streams, i.e., how data is routed to specific partitions within topics (see the sketch after this list).
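
To make the last point concrete, here is a minimal, self-contained Java sketch of how a record key can be mapped to a partition. It deliberately uses a simple illustrative hash rather than Kafka's actual default partitioner (which applies a murmur2 hash to the serialized key), but it demonstrates the property Kafka Streams relies on: the same key always resolves to the same partition, so all records for that key land in the same stream partition and are handled by the same task.

```java
import java.nio.charset.StandardCharsets;

public class KeyPartitioningSketch {

    // Illustrative only: maps a record key to one of numPartitions by hashing its bytes.
    // Kafka's real default partitioner hashes the serialized key with murmur2, but the
    // principle is the same: the mapping is deterministic, so same key -> same partition.
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        int hash = 0;
        for (byte b : keyBytes) {
            hash = 31 * hash + b;                    // simple deterministic hash
        }
        return (hash & 0x7fffffff) % numPartitions;  // force non-negative, then wrap
    }

    public static void main(String[] args) {
        int partitions = 4;
        for (String key : new String[] {"user-1", "user-2", "user-1"}) {
            System.out.printf("key=%s -> partition %d%n", key, partitionFor(key, partitions));
        }
        // "user-1" maps to the same partition both times, which is what guarantees that
        // all records for a given key are processed in order by a single task.
    }
}
```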

An application's processor topology is scaled by breaking it into multiple tasks. More specifically, Kafka Streams creates a fixed number of tasks based on the input stream partitions for the application, with each task assigned a list of partitions from the input streams (i.e., Kafka topics). The assignment of partitions to tasks never changes, so each task is a fixed unit of parallelism of the application. Tasks can then instantiate their own processor topology based on the assigned partitions; they also maintain a buffer for each of their assigned partitions and process messages one at a time from these record buffers. As a result, stream tasks can be processed independently and in parallel without manual intervention.
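
The following sketch shows how this task model surfaces in application code; the topic names, application id, and broker address are illustrative assumptions, not part of the original text. If the input topic "orders" had, say, four partitions, Kafka Streams would create four tasks for this topology; the num.stream.threads setting only controls how many threads within this one instance those tasks are spread across.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class TaskParallelismExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Application id and broker address are placeholders for this sketch.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-uppercase-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // How many stream threads this instance runs; tasks are distributed over them.
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 2);

        // The number of tasks is fixed by the partition count of the input topic,
        // not by the thread or instance count.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");
        orders.mapValues(value -> value.toUpperCase()).to("orders-uppercased");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Starting a second instance of the same application would not create more tasks; the existing tasks would simply be redistributed across the instances by Kafka's coordination layer.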
