Basic usage of pykafka

License: CC 4.0 BY-SA
Tags: #kafka #pykafka #producer #consumer #python
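This post walks through pykafka's consumer and producer APIs, covering topics, consumer groups, message handling, and partitioning. On the consumer side it covers automatic offset commits, retry logic, and message deserialization; on the producer side it covers synchronous and asynchronous production, writing to a specific partition, and delivery confirmation. Example code shows how to create and use a simple consumer and a balanced consumer.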
# Some concepts (BalancedConsumer parameters)
"""
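topic (pykafka.topic.Topic) – The topic this consumer should consume.
cluster (pykafka.cluster.Cluster) – The cluster this consumer should connect to.
consumer_group (str) – The name of the consumer group this consumer should join. Consumer group names are namespaced at the cluster level, meaning that two consumers using the same group name are treated as part of the same group.
fetch_message_max_bytes (int) – The number of bytes of messages to attempt to fetch with each fetch request.
num_consumer_fetchers (int) – The number of workers used to make FetchRequests.
auto_commit_enable (bool) – If true, periodically commit to Kafka the offsets of messages already returned from consume() calls. Requires that consumer_group is not None.
auto_commit_interval_ms (int) – How often (in milliseconds) the consumer's offsets are committed to Kafka. Ignored if auto_commit_enable is False.
queued_max_messages (int) – The maximum number of messages buffered for consumption in the internal pykafka.simpleconsumer.SimpleConsumer.
fetch_min_bytes (int) – The minimum amount of data (in bytes) the server should return for a fetch request. If insufficient data is available, the request blocks until enough data is available.
fetch_error_backoff_ms (int) – Unused. See pykafka.simpleconsumer.SimpleConsumer.
fetch_wait_max_ms (int) – The maximum time (in milliseconds) the server will block before answering a fetch request if there is not enough data to immediately satisfy fetch_min_bytes.
offsets_channel_backoff_ms (int) – Backoff time before retrying failed offset commits and fetches.
offsets_commit_max_retries (int) – The number of times the offset commit worker should retry before raising an error.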
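auto_offset_reset (pykafka.common.OffsetType) – What to do if an offset is out of range. This setting indicates how to reset the consumer's internal offset counter when an OffsetOutOfRangeError is encountered.
consumer_timeout_ms (int) – The time (in milliseconds) the consumer may spend without messages available for consumption before returning None.
rebalance_max_retries (int) – The number of times a rebalance should retry before raising an error.
rebalance_backoff_ms (int) – Backoff time (in milliseconds) between retries during rebalance.
zookeeper_connection_timeout_ms (int) – The maximum time (in milliseconds) the consumer waits while establishing a connection to ZooKeeper.
zookeeper_connect (str) – Deprecated::2.7,3.6 Comma-separated (ip1:port1,ip2:port2) string indicating the ZooKeeper nodes to connect to.
zookeeper_hosts (str) – KazooClient-formatted string of ZooKeeper hosts to connect to.
zookeeper (kazoo.client.KazooClient) – A KazooClient connected to a ZooKeeper instance. If provided, zookeeper_connect is ignored.
auto_start (bool) – Whether the consumer should begin communicating with ZooKeeper after __init__ completes. If false, communication can be started with start().
reset_offset_on_start (bool) – Whether the consumer should reset its internal offset counter to self._auto_offset_reset and commit that offset immediately upon starting up.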
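post_rebalance_callback (function) – A function to be called when a rebalance takes place. It should accept three arguments: the pykafka.balancedconsumer.BalancedConsumer instance that just completed its rebalance, a dict of the partitions it owned before the rebalance, and a dict of the partitions it owns after the rebalance. These dicts map partition ids to the most recently known offsets for those partitions. The function can optionally return a dict mapping partition ids to offsets; if it does, the consumer resets its offsets to the supplied values before continuing consumption. Note that the BalancedConsumer is in a poorly defined state while this callback runs, so accessing its attributes (such as held_offsets or partitions) may yield confusing results. The callback should instead rely on the provided partition-id dicts, which are well defined.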
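use_rdkafka (bool) – Use the librdkafka-backed consumer if available.
compacted_topic (bool) – Set to read from a compacted topic. Forces the consumer to use less stringent message ordering logic, because compacted topics do not provide offsets in strictly incrementing order.
membership_protocol (pykafka.membershipprotocol.GroupMembershipProtocol) – The group membership protocol this consumer should adhere to.
deserializer (function) – A function defining how to deserialize messages returned from Kafka. It has the signature d(value, partition_key) and returns a tuple of (deserialized_value, deserialized_partition_key). The arguments passed to it are the bytes representations of a message's value and partition key, and the returned data should be these fields transformed according to the client code's serialization logic. See pykafka.utils.__init__ for stock implementations.
reset_offset_on_fetch (bool) – Whether to update offsets during fetch_offsets. Disable this for read-only use cases to prevent side effects.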
"""
# producer.py
import queue
import time

from pykafka import KafkaClient


class KafkaTest(object):
    """
    测试kafka常用api
    """

    def __init__(self, host, topic):
        self._client = KafkaClient(hosts=host)
        self._topic = self._client.topics[topic.encode()]

    def producer_partition(self):
        """
        Inspect the topic's partitions and watch how the offsets change
        when a message is produced.
        """
        partitions = self._topic.partitions
        print("All partitions:", partitions)

        earliest_offset = self._topic.earliest_available_offsets()
        print("Earliest available offsets:", earliest_offset)

        last_offset = self._topic.latest_available_offsets()
        print("Latest available offsets:", last_offset)

        # Produce one message synchronously
        p = self._topic.get_producer(sync=True)
        p.produce(str(time.time()).encode())

        # Check how the offsets changed
        last_offset_new = self._topic.latest_available_offsets()
        print("Latest available offsets after producing:", last_offset_new)

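    def producer_designated_partition(self):
        """
        Write a message to a specific partition. To control which partition
        a message goes to, pass a partitioner function when creating the
        producer and supply a partition_key when producing the message.
        """

        def assign_partition(pid, key):
            """
            Pick a specific partition; this test always writes to the first
            partition in the list (expected to be id=0 here).
            :param pid: list of the topic's partitions
            :param key: the partition_key passed to produce()
            """
            print("Assigning a partition for the message:", pid, key)
            return pid[0]

        # sync (bool): whether produce() should wait for the message to be
        # sent before returning. If True, produce() raises an exception when
        # delivery to Kafka fails.
        # partitioner: the function that selects the target partition
        p = self._topic.get_producer(sync=True, partitioner=assign_partition)
        p.produce(str(time.time()).encode(), partition_key=b"partition_key_0")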

    def async_produce_message(self):
        """
        Produce a message asynchronously. The message is pushed onto a queue,
        and a separate thread sends the queued messages in one batch once the
        number of queued messages reaches a threshold (min_queued_messages)
        or a period of time (linger_ms, 5 s by default) has elapsed.
        :return:
        """
        last_offset = self._topic.latest_available_offsets()
        print("Latest offsets =", last_offset)

        # Record the initial offset of partition 0
        old_offset = last_offset[0].offset[0]
        p = self._topic.get_producer(sync=False, partitioner=lambda pid, key: pid[0])
        p.produce(str(time.time()).encode())
        s_time = time.time()
        while True:
            # Poll once per second until the offset of partition 0 moves
            last_offset = self._topic.latest_available_offsets()
            print("Latest available offsets =", last_offset)
            if last_offset[0].offset[0] != old_offset:
                e_time = time.time()
                print("elapsed time =", e_time - s_time)
                break
            time.sleep(1)

    def get_produce_message_report(self):
        """
        Fetch the delivery report for an asynchronously produced message.
        By default the report only arrives after about 5 seconds (linger_ms).
        :return:
        """
        last_offset = self._topic.latest_available_offsets()
        print("Latest offsets =", last_offset)
        # delivery_reports=True enables delivery confirmations for messages
        p = self._topic.get_producer(sync=False, delivery_reports=True, partitioner=lambda pid, key: pid[0])
        p.produce(str(time.time()).encode())
        s_time = time.time()
        delivery_report = p.get_delivery_report()  # only available when delivery_reports=True
        e_time = time.time()
        print('Waited {}s, delivery report {}'.format(e_time - s_time, delivery_report))
        last_offset = self._topic.latest_available_offsets()  # fixed: the call parentheses were missing
        print("Latest offsets =", last_offset)

    def get_produce_message_report_exc(self):
        message = "test message test message1"
        # In production, use asynchronous mode for high throughput and set
        # delivery_reports=True to enable the delivery-report queue interface.
        last_offset = self._topic.latest_available_offsets()
        print("Latest offsets =", last_offset)
        producer = self._topic.get_producer(sync=False, delivery_reports=True)
        producer.produce(message.encode())
        last_offset = self._topic.latest_available_offsets()
        print("Latest offsets =", last_offset)
        try:
            # Called right after produce(), the report is usually not ready
            # yet, so the non-blocking call raises queue.Empty.
            msg, exc = producer.get_delivery_report(block=False)
            print("msg", msg)
            if exc is not None:
                print('Failed to deliver msg {}: {}'.format(msg.partition_key, repr(exc)))
            else:
                print('Successfully delivered msg {}'.format(msg.partition_key))

        except queue.Empty:
            print("No delivery report available yet")
        except Exception as e:
            print(e)


if __name__ == '__main__':
    host = "192.168.0.120:9092"
    topic = "trademark_info"
    kafka_ins = KafkaTest(host=host, topic=topic)
    # kafka_ins.producer_partition()
    # kafka_ins.producer_designated_partition()
    # kafka_ins.async_produce_message()
    # kafka_ins.get_produce_message_report()
    kafka_ins.get_produce_message_report_exc()
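The last method above asks for a delivery report without blocking right after produce(), so it usually finds the queue empty. One possible alternative is to wait for reports with a bounded timeout. The sketch below assumes the producer was created with sync=False and delivery_reports=True, that get_delivery_report accepts block and timeout arguments, and that it raises queue.Empty when nothing arrives in time; drain_delivery_reports is a hypothetical helper, not a pykafka function.

```python
import queue
import time


def drain_delivery_reports(producer, expected, timeout_s=10.0):
    """Collect up to `expected` delivery reports, waiting at most `timeout_s`.

    Returns (delivered, failed): delivered is a list of messages, failed is a
    list of (message, exception) pairs.
    """
    delivered, failed = [], []
    deadline = time.monotonic() + timeout_s
    while len(delivered) + len(failed) < expected:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # overall time budget used up
        try:
            msg, exc = producer.get_delivery_report(block=True, timeout=remaining)
        except queue.Empty:
            break  # no further report arrived before the deadline
        if exc is None:
            delivered.append(msg)
        else:
            failed.append((msg, exc))
    return delivered, failed
```

With the default linger_ms of about 5 seconds, a timeout_s somewhat above that gives the background thread a chance to flush the batch before the helper gives up.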
# consumer.py
from pykafka import KafkaClient


class KafkaTest(object):
    """
    测试kafka常用api
    """

    def __init__(self, host, topic):
        self._client = KafkaClient(hosts=host)
        self._topic = self._client.topics[topic.encode()]

    def simple_consumer(self, offset=0):
        """
        Consume from a specific partition, starting from a given offset.
        :param offset: the offset to reset the partition to
        :return:
        """
        partitions = self._topic.partitions
        print("All partitions =", partitions)
        last_offset = self._topic.latest_available_offsets()
        print("Latest offsets =", last_offset)
        consumer = self._topic.get_simple_consumer(consumer_group="trademark_info",
                                                   partitions=[partitions[0]])  # consume from a single partition
        offset_list = consumer.held_offsets
        print("Offsets currently held by the consumer: {}".format(offset_list))
        consumer.reset_offsets([(partitions[0], offset)])  # set the partition's offset
        for _ in range(3):
            msg = consumer.consume()
            print("Consumed: {}".format(msg.value.decode()))
        # held_offsets maps each partition id to the offset held for that partition;
        # re-read it so the print below shows the state after consuming
        offset_list = consumer.held_offsets
        print("Offsets currently held by the consumer: {}".format(offset_list))

    def balance_consumer(self, offset=0):
        """
        Consume from Kafka with a balanced consumer.
        :param offset: unused in this method
        :return:
        """
        # managed=True uses Kafka's own group membership API for rebalancing,
        # so ZooKeeper is not needed; managed=False rebalances through
        # ZooKeeper and therefore requires it.
        consumer = self._topic.get_balanced_consumer(consumer_group="trademark_info", managed=True)
        partitions = self._topic.partitions
        print("Partitions {}".format(partitions))
        earliest_offsets = self._topic.earliest_available_offsets()
        print("Earliest available offsets {}".format(earliest_offsets))
        last_offsets = self._topic.latest_available_offsets()
        print("Latest available offsets {}".format(last_offsets))
        held_offsets = consumer.held_offsets
        print("Offsets currently held by the consumer: {}".format(held_offsets))
        while True:
            msg = consumer.consume()
            held_offsets = consumer.held_offsets
            print("{}, offsets currently held by the consumer: {}".format(msg.value.decode(), held_offsets))


if __name__ == '__main__':
    host = "192.168.0.120:9092"
    topic = "trademark_info"
    kafka_ins = KafkaTest(host=host, topic=topic)
    kafka_ins.simple_consumer()
    # kafka_ins.balance_consumer()
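Both consumer methods above call consume() with the default consumer_timeout_ms, so they block until a message arrives, and the balanced loop never exits. As a rough sketch, assuming the consumer was instead created with a positive consumer_timeout_ms (so that consume() returns None once no message arrives in time), a bounded read could look like this; consume_available is a hypothetical helper, not part of pykafka.

```python
def consume_available(consumer, max_messages=100):
    """Read up to `max_messages` decoded values, stopping at the first timeout.

    Assumes consumer.consume() returns None when consumer_timeout_ms elapses
    without a message, and that message values are UTF-8 encoded bytes.
    """
    values = []
    while len(values) < max_messages:
        msg = consumer.consume()
        if msg is None:
            break  # timed out: nothing more to read for now
        values.append(msg.value.decode())
    return values
```

A consumer for this helper might be obtained with something like get_simple_consumer(consumer_group="trademark_info", consumer_timeout_ms=5000), where the 5000 ms value is only an illustration.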