A Complete Guide to the Apache Flink APIs
This article walks through Apache Flink's core API stack, from basic operations to advanced features, to help you build robust unified stream/batch applications.
I. Flink API Layered Architecture
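Flink's APIs form a stack of increasing abstraction. From lowest (most control) to highest (most concise):
- Stateful stream processing (ProcessFunction): direct access to events, state, and timers
- DataStream API: core streaming primitives such as transformations, windows, and connectors
- Table API: a declarative, language-integrated relational API
- SQL: standard SQL over dynamic tables
The higher the layer, the more concise the code; the lower the layer, the finer the control. The sections below work through this stack and the surrounding libraries.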
II. DataStream API (Core Stream Processing)
1. Sources and Sinks
// Kafka source
DataStream<String> kafkaSource = env.addSource(new FlinkKafkaConsumer<>(
"topic", new SimpleStringSchema(), properties));
// File source
DataStream<String> fileSource = env.readTextFile("hdfs:///input");
// Kafka sink
stream.addSink(new FlinkKafkaProducer<>(
"output-topic", new SimpleStringSchema(), properties));
2. Basic Transformations
DataStream<Tuple2<String, Integer>> processed = source
    .filter(value -> value.startsWith("A"))              // filter
    .flatMap((String line, Collector<String> out) -> {   // flatten: one line -> many words
        for (String word : line.split(" ")) {
            out.collect(word);
        }
    })
    .returns(Types.STRING)                               // type hint (lambdas erase generics)
    .map(word -> new Tuple2<>(word, 1))                  // map: word -> (word, 1)
    .returns(Types.TUPLE(Types.STRING, Types.INT))
    .keyBy(t -> t.f0)                                    // partition by key
    .sum(1);                                             // aggregate
3. Window Operations
// Event-time tumbling window
stream.keyBy(0)
.window(TumblingEventTimeWindows.of(Time.minutes(5)))
.reduce(new MyReduceFunction());
// Processing-time sliding window
stream.keyBy(0)
.window(SlidingProcessingTimeWindows.of(Time.minutes(10), Time.minutes(2)))
.aggregate(new MyAggregateFunction());
// Session window
stream.keyBy(0)
.window(EventTimeSessionWindows.withGap(Time.minutes(15)))
.apply(new MyWindowFunction()); // WindowFunction is an interface; supply a concrete implementation
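Event-time windows only fire once watermarks advance, so the stream needs timestamps and a WatermarkStrategy assigned first. A minimal sketch, assuming a hypothetical Event type with a millisecond timestamp getter:
// Tolerate up to 5 seconds of out-of-order events
DataStream<Event> withWatermarks = events.assignTimestampsAndWatermarks(
    WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
        .withTimestampAssigner((event, ts) -> event.getTimestampMillis()));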
III. Table API & SQL (Declarative Processing)
1. Table Environment Setup
// Create the table environment
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
// Register a table
tableEnv.executeSql("CREATE TABLE Orders (" +
"user_id STRING, " +
"product STRING, " +
"amount INT, " +
"order_time TIMESTAMP(3), " +
"WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND" +
") WITH ( ... )");
2. SQL Queries
-- Tumbling window aggregation
SELECT
user_id,
TUMBLE_START(order_time, INTERVAL '1' HOUR) AS window_start,
SUM(amount) AS total_amount
FROM Orders
GROUP BY
user_id,
TUMBLE(order_time, INTERVAL '1' HOUR)
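To run this query from Java, hand it to sqlQuery and print the result; a short sketch (toChangelogStream handles both append-only and updating results):
Table windowed = tableEnv.sqlQuery(
    "SELECT user_id, TUMBLE_START(order_time, INTERVAL '1' HOUR) AS window_start, " +
    "SUM(amount) AS total_amount " +
    "FROM Orders GROUP BY user_id, TUMBLE(order_time, INTERVAL '1' HOUR)");
tableEnv.toChangelogStream(windowed).print();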
3. Table API Operations
Table orders = tableEnv.from("Orders");
Table result = orders
.filter($("amount").gt(100))
.window(Tumble.over(lit(1).hours()).on($("order_time")).as("w"))
.groupBy($("user_id"), $("w"))
.select(
$("user_id"),
$("w").start().as("window_start"),
$("amount").sum().as("total_amount"));
IV. ProcessFunction (Low-Level Control)
1. Time and State Handling
public class FraudDetector extends KeyedProcessFunction<String, Transaction, Alert> {
private ValueState<Boolean> flagState;
private ValueState<Long> timerState;
@Override
public void open(Configuration parameters) {
ValueStateDescriptor<Boolean> flagDescriptor = new ValueStateDescriptor<>(
"flag", Boolean.class);
flagState = getRuntimeContext().getState(flagDescriptor);
ValueStateDescriptor<Long> timerDescriptor = new ValueStateDescriptor<>(
"timer-state", Long.class);
timerState = getRuntimeContext().getState(timerDescriptor);
}
@Override
public void processElement(Transaction tx, Context ctx, Collector<Alert> out) throws Exception {
// Access state (null if never set for this key)
Boolean lastFlag = flagState.value();
if (tx.getAmount() > 10000) {
// Update state
flagState.update(true);
// Register an event-time timer 10 minutes after this event's timestamp
long timer = ctx.timestamp() + Time.minutes(10).toMilliseconds();
ctx.timerService().registerEventTimeTimer(timer);
timerState.update(timer);
}
}
@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<Alert> out) {
// Timer firing logic: clear the per-key state
timerState.clear();
flagState.clear();
}
}
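Wiring the detector into a job is then a matter of keying the input stream; a sketch, where the transactions stream and Transaction::getAccountId are hypothetical:
DataStream<Alert> alerts = transactions
    .keyBy(Transaction::getAccountId) // timers and state are scoped per account
    .process(new FraudDetector());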
2. Side Outputs
final OutputTag<String> sideOutputTag = new OutputTag<String>("side-output"){};
DataStream<Integer> mainStream = stream.process(new ProcessFunction<String, Integer>() {
@Override
public void processElement(String value, Context ctx, Collector<Integer> out) {
if (value.contains("ERROR")) {
ctx.output(sideOutputTag, "Error detected: " + value);
}
out.collect(value.length());
}
});
DataStream<String> sideOutputStream = mainStream.getSideOutput(sideOutputTag);
V. State Management APIs
1. State Types
State Type | Interface | Use Case |
---|---|---|
ValueState | T value() / update(T) | Single-value state (e.g. a counter) |
ListState | add(T) / get() | List state (e.g. buffered window elements) |
MapState | put(K,V) / get(K) | Key-value state (e.g. user profiles) |
ReducingState | add(T) | Aggregating state (automatic reduce) |
AggregatingState | add(IN) | Complex aggregation state |
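Each of these is obtained from the runtime context through a matching descriptor, just like the ValueState examples above. A short sketch for ListState and MapState (names are illustrative):
// In open() of a rich function on a keyed stream:
ListState<Long> clickTimes = getRuntimeContext().getListState(
    new ListStateDescriptor<>("click-times", Long.class));
MapState<String, Long> pageCounts = getRuntimeContext().getMapState(
    new MapStateDescriptor<>("page-counts", String.class, Long.class));
// Later, per element:
clickTimes.add(System.currentTimeMillis()); // append to the list
pageCounts.put("home", 1L);                 // upsert one map entry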
2. State TTL Configuration
StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(Time.days(30))
.setUpdateType(StateTtlConfig.UpdateType.OnReadAndWrite)
.setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
.cleanupInRocksdbCompactFilter(1000)
.build();
ValueStateDescriptor<String> stateDescriptor = new ValueStateDescriptor<>("state", String.class);
stateDescriptor.enableTimeToLive(ttlConfig);
VI. Connector APIs
1. Kafka Connector
// Exactly-once producer (legacy FlinkKafkaProducer API)
stream.addSink(new FlinkKafkaProducer<>(
"output-topic",
new KafkaSerializationSchema<String>() {
@Override
public ProducerRecord<byte[], byte[]> serialize(
String element, @Nullable Long timestamp) {
return new ProducerRecord<>("topic", element.getBytes());
}
},
properties,
FlinkKafkaProducer.Semantic.EXACTLY_ONCE
));
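EXACTLY_ONCE rides on Kafka transactions, which only commit when a checkpoint completes, so checkpointing must be enabled and the producer's transaction timeout must stay within the broker's transaction.max.timeout.ms (15 minutes by default). A sketch of the two settings:
env.enableCheckpointing(60_000); // transactions commit on checkpoint completion
// Flink's default producer transaction timeout (1 hour) exceeds the broker default;
// lower it to at most the broker's transaction.max.timeout.ms
properties.setProperty("transaction.timeout.ms", "900000");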
2. File System Connector
// Streaming row-format write (plain text; Parquet needs a bulk format, see below)
stream.sinkTo(FileSink.forRowFormat(
new Path("hdfs:///output"),
new SimpleStringEncoder<String>("UTF-8"))
.withRollingPolicy(DefaultRollingPolicy.builder()
.withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
.build())
.build());
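The row-format sink above writes plain text; Parquet itself requires a bulk format. A minimal sketch, assuming the flink-parquet dependency and a hypothetical MyRecord POJO (depending on the Flink version the writer factory is ParquetAvroWriters or AvroParquetWriters; bulk formats always roll on checkpoint and ignore time-based rolling policies):
recordStream.sinkTo(FileSink
    .forBulkFormat(
        new Path("hdfs:///output-parquet"),
        ParquetAvroWriters.forReflectRecord(MyRecord.class)) // reflection-based Avro->Parquet writer
    .build());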
3. JDBC Connector
stream.addSink(JdbcSink.sink(
"INSERT INTO orders (user, product, amount) VALUES (?, ?, ?)",
(statement, record) -> {
statement.setString(1, record.user);
statement.setString(2, record.product);
statement.setInt(3, record.amount);
},
JdbcExecutionOptions.builder()
.withBatchSize(1000)
.withBatchIntervalMs(200)
.build(),
new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
.withUrl("jdbc:mysql://localhost:3306/db")
.withDriverName("com.mysql.jdbc.Driver")
.withUsername("user")
.withPassword("pass")
.build()
));
VII. CEP (Complex Event Processing)
1. Pattern Definition
Pattern<LoginEvent, ?> pattern = Pattern.<LoginEvent>begin("first")
.where(SimpleCondition.of(event -> "FAIL".equals(event.getType())))
.next("second")
.where(SimpleCondition.of(event -> "FAIL".equals(event.getType())))
.next("third")
.where(SimpleCondition.of(event -> "FAIL".equals(event.getType())))
.within(Time.minutes(5));
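The same three-failures pattern can be written more compactly with a quantifier instead of chained next() steps; a sketch using times(...).consecutive():
Pattern<LoginEvent, ?> compactPattern = Pattern.<LoginEvent>begin("fails")
    .where(SimpleCondition.of(event -> "FAIL".equals(event.getType())))
    .times(3)       // exactly three matching events
    .consecutive()  // strictly contiguous, like next()
    .within(Time.minutes(5));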
2. Pattern Detection
PatternStream<LoginEvent> patternStream = CEP.pattern(
loginEventStream.keyBy(LoginEvent::getUserId),
pattern
);
DataStream<Alert> alerts = patternStream.select(
(Map<String, List<LoginEvent>> match) -> {
LoginEvent first = match.get("first").get(0);
return new Alert("Three consecutive failed logins: " + first.getUserId());
}
);
VIII. Machine Learning Library (Flink ML)
1. Feature Engineering
// Feature stage: VectorAssembler (Flink ML 2.x) combines columns into one feature vector
VectorAssembler vectorAssembler = new VectorAssembler()
.setInputCols("feature1", "feature2")
.setOutputCol("features");
// KMeans estimator
KMeans kmeans = new KMeans()
.setK(3)
.setSeed(42L);
2. Model Training and Prediction
// Build the pipeline from its stages
Pipeline pipeline = new Pipeline(Arrays.asList(vectorAssembler, kmeans));
// Fit the pipeline on the training table
PipelineModel model = pipeline.fit(trainingData);
// Run prediction; transform() returns an array of result tables
Table predictions = model.transform(testData)[0];
IX. Gelly Graph API
1. Graph Creation
Graph<Long, String, Double> graph = Graph.fromDataSet(
vertices, // DataSet<Vertex<Long, String>>
edges, // DataSet<Edge<Long, Double>>
env
);
2. Graph Algorithms
// PageRank (damping factor 0.85, 10 iterations; the library version expects Double vertex values)
DataSet<Vertex<Long, Double>> pageRanks = graph.run(
new PageRank<Long>(0.85, 10)
);
// Community detection via label propagation
// (labels live in the vertex values, so this assumes Long-valued vertices)
DataSet<Vertex<Long, Long>> communities = graph.run(
new LabelPropagation<>(10)
);
X. PyFlink API (Python Interface)
1. Python DataStream API
from pyflink.datastream import StreamExecutionEnvironment
env = StreamExecutionEnvironment.get_execution_environment()
ds = env.from_collection([1, 2, 3, 4, 5])
ds.filter(lambda x: x % 2 == 0).print()
env.execute("Python DataStream Job")
2. Python Table API
from pyflink.table import EnvironmentSettings, TableEnvironment
settings = EnvironmentSettings.in_streaming_mode()
t_env = TableEnvironment.create(settings)
t_env.execute_sql("""
CREATE TABLE Orders (
user_id STRING,
product STRING,
amount INT
) WITH (...)
""")
result = t_env.sql_query("SELECT product, SUM(amount) FROM Orders GROUP BY product")
result.execute().print()
XI. API Best Practices
1. API Selection Guide
Scenario | Recommended API |
---|---|
Declarative ETL | Table API / SQL |
Complex event processing | CEP |
Fine-grained state control | ProcessFunction |
Unified batch/stream analytics | Table API |
Machine learning pipelines | Flink ML |
Graph analytics | Gelly |
2. Performance Tuning Essentials
- State backend: use RocksDB for large state (see the sketch after this list)
- Checkpointing: tune the interval and timeout
- Parallelism: align with the number of Kafka partitions
- Network buffers: tune taskmanager.memory.network.fraction
- Serialization: prefer efficient serializers (e.g. Flink's POJO/Tuple serializers over the Kryo fallback)
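A minimal sketch for the first point, assuming Flink 1.13+ and the flink-statebackend-rocksdb dependency on the classpath:
// RocksDB keeps state on local disk rather than the JVM heap and supports
// incremental checkpoints, which suits very large keyed state
env.setStateBackend(new EmbeddedRocksDBStateBackend(true)); // true = incremental checkpoints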
3. Fault-Tolerance Design
// Restart strategy configuration
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
3, Time.of(10, TimeUnit.SECONDS)
));
// Checkpoint configuration
env.enableCheckpointing(5000);
env.getCheckpointConfig().setCheckpointStorage("hdfs:///checkpoints");
env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);
Flink's API stack offers a complete spectrum of solutions, from simple to complex and from declarative to imperative, letting developers pick the abstraction level that fits the problem at hand and build efficient, reliable unified stream/batch applications.