Getting Started with Apache Beam: A Guide to Unified Batch and Stream Processing

License: CC 4.0 BY-SA

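Introduction to Apache Beam: A Unified Batch and Streaming Model

Apache Beam is an open-source, unified programming model for defining and executing both batch and streaming data processing pipelines. It provides high-level abstractions that let developers write a pipeline once and run it on several execution engines, such as Apache Flink, Apache Spark, and Google Cloud Dataflow.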

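Core Concepts

Pipeline: represents the entire data processing job, from reading the input through the transformations to writing the output.

PCollection: represents a dataset, which can be bounded (batch processing) or unbounded (stream processing).

Transform: an operation applied to a PCollection, such as filtering, aggregating, or merging.

Runner: determines which execution engine the Pipeline runs on, for example DirectRunner (local testing), FlinkRunner, or DataflowRunner.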
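
To make the relationship between these four concepts concrete, here is a toy in-memory model in plain Python. It is only a sketch: `ToyTransform`, `ToyPipeline`, and `ToyLocalRunner` are hypothetical names invented for this illustration and are not part of the `apache_beam` package. The point it tries to show is that building a pipeline only records the steps, while a separate runner decides how to execute them.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

# Toy stand-ins for the four concepts. These classes are hypothetical
# illustrations and are NOT part of the apache_beam package.


@dataclass
class ToyTransform:
    """A named step that maps one collection of elements to another."""
    label: str
    fn: Callable[[Iterable], Iterable]


@dataclass
class ToyPipeline:
    """Records the steps to run; building it does not process any data."""
    source: List
    steps: List[ToyTransform] = field(default_factory=list)

    def apply(self, label, fn):
        self.steps.append(ToyTransform(label, fn))
        return self  # returning self allows chained calls


class ToyLocalRunner:
    """Executes a recorded pipeline in memory, one step after another."""

    def run(self, pipeline):
        data = list(pipeline.source)  # plays the role of a bounded PCollection
        for step in pipeline.steps:
            data = list(step.fn(data))  # each step yields a new collection
        return data


pipeline = (
    ToyPipeline(source=["a b", "b c"])
    .apply("Split words", lambda lines: (w for line in lines for w in line.split()))
    .apply("Keep b", lambda words: (w for w in words if w == "b"))
)

# Nothing has been processed yet; the runner does the actual work.
result = ToyLocalRunner().run(pipeline)
print(result)
```

Because the pipeline object is only a description, a different runner class could execute the same steps in another way without any change to the pipeline definition, which is the idea behind swapping DirectRunner for FlinkRunner or DataflowRunner.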

Environment Setup

Install the Apache Beam Python SDK:

pip install apache-beam

Java developers need to add the following Maven dependency:

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-core</artifactId>
  <version>2.40.0</version>
</dependency>

Batch Example

The following is a simple WordCount batch program that counts how often each word occurs in a text file:

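import apache_beam as beam

def run():
    # With no options given, the pipeline runs locally on the DirectRunner.
    with beam.Pipeline() as pipeline:
        (
            pipeline
            | 'Read lines' >> beam.io.ReadFromText('input.txt')
            # split() with no argument splits each line on whitespace.
            | 'Split words' >> beam.FlatMap(lambda line: line.split())
            # Emits one (word, count) tuple per distinct word.
            | 'Count words' >> beam.combiners.Count.PerElement()
            | 'Format results' >> beam.Map(lambda word_count: f"{word_count[0]}: {word_count[1]}")
            # 'output.txt' is a file name prefix; a shard suffix is appended.
            | 'Write results' >> beam.io.WriteToText('output.txt')
        )

if __name__ == '__main__':
    run()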
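
To check what the pipeline above should produce, it can help to compare it with an equivalent written in plain Python. The sketch below is not Beam code: `word_count` and `sample_lines` are hypothetical names used only for this illustration, and the whole input is assumed to fit in memory. Each comment points to the pipeline step it mirrors.

```python
from collections import Counter


def word_count(lines):
    """Return 'word: count' strings for an iterable of text lines."""
    # Mirrors 'Split words': flatten every line into individual words.
    words = [word for line in lines for word in line.split()]
    # Mirrors 'Count words': one count per distinct word.
    counts = Counter(words)
    # Mirrors 'Format results': render each pair as a string.
    return [f"{word}: {count}" for word, count in counts.items()]


# Hypothetical sample input standing in for the contents of input.txt.
sample_lines = ["to be or not", "to be"]
formatted = word_count(sample_lines)

# Sorted before printing because a Beam pipeline does not guarantee
# the order of output elements, so comparisons should ignore order.
print(sorted(formatted))
```

The difference is in how the work is executed: the plain Python version processes everything in one process, while the Beam version describes the same steps in a form that a runner can distribute across workers.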

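Streaming Example

Stream processing requires additional configuration, such as windows and timestamps. The following is a WordCount over simulated streaming data: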

import apache_beam as beam
from apache_beam.options.p