A Complete Guide to the Apache Flink APIs
This article walks through Apache Flink's core API stack, from basic operations to advanced features, to help you build robust unified stream/batch applications.
I. Flink API Layered Architecture
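Flink's APIs form a stack of increasing abstraction. From lowest (most control) to highest (most concise):
- Stateful stream processing (ProcessFunction): direct access to events, state, and timers
- DataStream API: core streaming primitives such as transformations, windows, and connectors
- Table API: a declarative, language-integrated relational API
- SQL: standard SQL over dynamic tables
The higher the layer, the more concise the code; the lower the layer, the finer the control. The sections below work through this stack and the surrounding libraries.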
II. DataStream API (Core Stream Processing)
1. Sources and Sinks
// Kafka source
DataStream<String> kafkaSource = env.addSource(new FlinkKafkaConsumer<>(
"topic", new SimpleStringSchema(), properties));
// File source
DataStream<String> fileSource = env.readTextFile("hdfs:///input");
// Kafka sink
stream.addSink(new FlinkKafkaProducer<>(
"output-topic", new SimpleStringSchema(), properties));
2. Basic Transformations
DataStream<Tuple2<String, Integer>> processed = source
    .filter(value -> value.startsWith("A"))              // filter
    .flatMap((String line, Collector<String> out) -> {   // flatten: one line -> many words
        for (String word : line.split(" ")) {
            out.collect(word);
        }
    })
    .returns(Types.STRING)                               // type hint (lambdas erase generics)
    .map(word -> new Tuple2<>(word, 1))                  // map: word -> (word, 1)
    .returns(Types.TUPLE(Types.STRING, Types.INT))
    .keyBy(t -> t.f0)                                    // partition by key
    .sum(1);                                             // aggregate
3. Window Operations
// Event-time tumbling window
stream.keyBy(0)
.window(TumblingEventTimeWindows.of(Time.minutes(5)))
.reduce(new MyReduceFunction());
// Processing-time sliding window
stream.keyBy(0)
.window(SlidingProcessingTimeWindows.of(Time.minutes(10), Time.minutes(2)))
.aggregate(new MyAggregateFunction());
// Session window
stream.keyBy(0)
.window(EventTimeSessionWindows.withGap(Time.minutes(15)))
.apply(new MyWindowFunction()); // WindowFunction is an interface; supply a concrete implementation
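Event-time windows only fire once watermarks advance, so the stream needs timestamps and a WatermarkStrategy assigned first. A minimal sketch, assuming a hypothetical Event type with a millisecond timestamp getter:
// Tolerate up to 5 seconds of out-of-order events
DataStream<Event> withWatermarks = events.assignTimestampsAndWatermarks(
    WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
        .withTimestampAssigner((event, ts) -> event.getTimestampMillis()));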
III. Table API & SQL (Declarative Processing)
1. Table Environment Setup
// Create the table environment
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
// Register a table
tableEnv.executeSql("CREATE TABLE Orders (" +
"user_id STRING, " +
"product STRING, " +
"amount INT, " +
"order_time TIMESTAMP(3), " +
"WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND" +
") WITH ( ... )");
2. SQL Queries
-- Tumbling window aggregation
SELECT
user_id,
TUMBLE_START(order_time, INTERVAL '1' HOUR) AS window_start,
SUM(amount) AS total_amount
FROM Orders
GROUP BY
user_id,
TUMBLE(order_time, INTERVAL '1' HOUR)
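To run this query from Java, hand it to sqlQuery and print the result; a short sketch (toChangelogStream handles both append-only and updating results):
Table windowed = tableEnv.sqlQuery(
    "SELECT user_id, TUMBLE_START(order_time, INTERVAL '1' HOUR) AS window_start, " +
    "SUM(amount) AS total_amount " +
    "FROM Orders GROUP BY user_id, TUMBLE(order_time, INTERVAL '1' HOUR)");
tableEnv.toChangelogStream(windowed).print();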
3. Table API Operations
Table orders = tableEnv.from("Orders");
Table result = orders
.filter($("amount").gt(100))
.window(Tumble.over(lit(1).hours()).on($("order_time")).as("w"))
.groupBy($("user_id"), $("w"))
.select(
$("user_id"),
$("w").start().as("window_start"),
$("amount").sum().as("total_amount"));
IV. ProcessFunction (Low-Level Control)
1. Time and State Handling
public class FraudDetector extends KeyedProcessFunction<String, Transaction, Alert> {
private ValueState<Boolean> flagState;
private ValueState<Long> timerState;
@Override
public void open(Configuration parameters) {
ValueStateDescriptor<Boolean> flagDescriptor = new ValueStateDescriptor<>(
"flag", Boolean.class);
flagState = getRuntimeContext().getState(flagDescriptor);
ValueStateDescriptor<Long> timerDescriptor = new ValueStateDescriptor<>(
"timer-state", Long.class);
timerState = getRuntimeContext().getState(timerDescriptor);
}
@Override
public void processElement(Transaction tx, Context ctx, Collector<Alert> out) throws Exception {
// Access state (null if never set for this key)
Boolean lastFlag = flagState.value();
if (tx.getAmount() > 10000) {
// Update state
flagState.update(true);
// Register an event-time timer 10 minutes after this event's timestamp
long timer = ctx.timestamp() + Time.minutes(10).toMilliseconds();
ctx.timerService().registerEventTimeTimer(timer);
timerState.update(timer);
}
}
@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<Alert> out) {
// Timer firing logic: clear the per-key state
timerState.clear();
flagState.clear();
}
}
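Wiring the detector into a job is then a matter of keying the input stream; a sketch, where the transactions stream and Transaction::getAccountId are hypothetical:
DataStream<Alert> alerts = transactions
    .keyBy(Transaction::getAccountId) // timers and state are scoped per account
    .process(new FraudDetector());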
2. Side Outputs
final OutputTag<String> sideOutputTag = new OutputTag<String>("side-output"){};
DataStream<Integer> mainStream = stream.process(new ProcessFunction<String, Integer>() {
@Override
public void processElement(String value, Context ctx, Collector<Integer> out) {
if (value.contains("ERROR")) {
ctx.output(sideOutputTag, "Error detected: " + value);
}
out.collect(value.length());
}
});
DataStream<String> sideOutputStream = mainStream.getSideOutput(sideOutputTag);
V. State Management APIs
1. State Types
State Type | Interface | Use Case |
---|---|---|
ValueState | T value() / update(T) | Single-value state (e.g. a counter) |
ListState | add(T) / get() | List state (e.g. buffered window elements) |
MapState | put(K,V) / get(K) | Key-value state (e.g. user profiles) |
ReducingState | add(T) | Aggregating state (automatic reduce) |
AggregatingState | add(IN) | Complex aggregation state |
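Each of these is obtained from the runtime context through a matching descriptor, just like the ValueState examples above. A short sketch for ListState and MapState (names are illustrative):
// In open() of a rich function on a keyed stream:
ListState<Long> clickTimes = getRuntimeContext().getListState(
    new ListStateDescriptor<>("click-times", Long.class));
MapState<String, Long> pageCounts = getRuntimeContext().getMapState(
    new MapStateDescriptor<>("page-counts", String.class, Long.class));
// Later, per element:
clickTimes.add(System.currentTimeMillis()); // append to the list
pageCounts.put("home", 1L);                 // upsert one map entry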
2. State TTL Configuration
StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(Time.days(30))
.setUpdateType(StateTtlConfig.UpdateType.OnReadAndWrite)
.setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
.cleanupInRocksdbCompactFilter(1000)
.build();
ValueStateDescriptor<String> stateDescriptor = new ValueStateDescriptor<>("state", String.class);
stateDescriptor.enableTimeToLive(ttlConfig);
VI. Connector APIs
1. Kafka Connector
// Exactly-once producer (legacy FlinkKafkaProducer API)
stream.addSink(new FlinkKafkaProducer<>(
"output-topic",
new KafkaSerializationSchema<String>() {
@Override
public ProducerRecord<byte[], byte[]> serialize(
String element, @Nullable Long timestamp) {
return new ProducerRecord<>("topic", element.getBytes());
}
},
properties,
FlinkKafkaProducer.Semantic.EXACTLY_ONCE
));
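EXACTLY_ONCE rides on Kafka transactions, which only commit when a checkpoint completes, so checkpointing must be enabled and the producer's transaction timeout must stay within the broker's transaction.max.timeout.ms (15 minutes by default). A sketch of the two settings:
env.enableCheckpointing(60_000); // transactions commit on checkpoint completion
// Flink's default producer transaction timeout (1 hour) exceeds the broker default;
// lower it to at most the broker's transaction.max.timeout.ms
properties.setProperty("transaction.timeout.ms", "900000");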
2. File System Connector
// Streaming row-format write (plain text; Parquet needs a bulk format, see below)
stream.sinkTo(FileSink.forRowFormat(
new Path("hdfs:///output"),
new SimpleStringEncoder<String>("UTF-8"))
.withRollingPolicy(DefaultRollingPolicy.builder()
.withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
.build())
.build());
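The row-format sink above writes plain text; Parquet itself requires a bulk format. A minimal sketch, assuming the flink-parquet dependency and a hypothetical MyRecord POJO (depending on the Flink version the writer factory is ParquetAvroWriters or AvroParquetWriters; bulk formats always roll on checkpoint and ignore time-based rolling policies):
recordStream.sinkTo(FileSink
    .forBulkFormat(
        new Path("hdfs:///output-parquet"),
        ParquetAvroWriters.forReflectRecord(MyRecord.class)) // reflection-based Avro->Parquet writer
    .build());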
3. JDBC Connector
stream.addSink(JdbcSink.sink(
"INSERT INTO orders (user, product, amount) VALUES (?, ?, ?)",
(statement, record) -> {
statement.setString(1, record.user);
statement.setString(2, record.product);
statement.setInt(3, record.amount);
},
JdbcExecutionOptions.builder()
.withBatchSize(1000)
.withBatchIntervalMs(200)
.build(),
new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
.withUrl("jdbc:mysql://localhost:3306/db")
.withDriverName("com.mysql.jdbc.Driver")
.withUsername("user")
.withPassword("pass")
.build()
));
VII. CEP (Complex Event Processing)
1. Pattern Definition
Pattern<LoginEvent, ?> pattern = Pattern.<LoginEvent>begin("first")
.where(SimpleCondition.of(event -> "FAIL".equals(event.getType())))
.next("second")
.where(SimpleCondition.of(event -> "FAIL".equals(event.getType())))
.next("third")
.where(SimpleCondition.of(event -> "FAIL".equals(event.getType())))
.within(Time.minutes(5));
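The same three-failures pattern can be written more compactly with a quantifier instead of chained next() steps; a sketch using times(...).consecutive():
Pattern<LoginEvent, ?> compactPattern = Pattern.<LoginEvent>begin("fails")
    .where(SimpleCondition.of(event -> "FAIL".equals(event.getType())))
    .times(3)       // exactly three matching events
    .consecutive()  // strictly contiguous, like next()
    .within(Time.minutes(5));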
2. Pattern Detection
PatternStream<LoginEvent> patternStream = CEP.pattern(
loginEventStream.keyBy(LoginEvent::getUserId),
pattern
);
DataStream<Alert> alerts = patternStream.select(
(Map<String, List<LoginEvent>> match) -> {
LoginEvent first = match.get("first").get(0);
return new Alert("Three consecutive failed logins: " + first.getUserId());
}
);
VIII. Machine Learning Library (Flink ML)
1. Feature Engineering
// Feature stage: VectorAssembler (Flink ML 2.x) combines columns into one feature vector
VectorAssembler vectorAssembler = new VectorAssembler()
.setInputCols("feature1", "feature2")
.setOutputCol("features");
// KMeans estimator
KMeans kmeans = new KMeans()
.setK(3)
.setSeed(42L);
2. Model Training and Prediction
// Build the pipeline from its stages
Pipeline pipeline = new Pipeline(Arrays.asList(vectorAssembler, kmeans));
// Fit the pipeline on the training table
PipelineModel model = pipeline.fit(trainingData);
// Run prediction; transform() returns an array of result tables
Table predictions = model.transform(testData)[0];
IX. Gelly Graph API
1. Graph Creation
Graph<Long, String, Double> graph = Graph.fromDataSet(
vertices, // DataSet<Vertex<Long, String>>
edges, // DataSet<Edge<Long, Double>>
env
);
2. Graph Algorithms
// PageRank (damping factor 0.85, 10 iterations; the library version expects Double vertex values)
DataSet<Vertex<Long, Double>> pageRanks = graph.run(
new PageRank<Long>(0.85, 10)
);
// Community detection via label propagation
// (labels live in the vertex values, so this assumes Long-valued vertices)
DataSet<Vertex<Long, Long>> communities = graph.run(
new LabelPropagation<>(10)
);
X. PyFlink API (Python Interface)
1. Python DataStream API
from pyflink.datastream import StreamExecutionEnvironment
env = StreamExecutionEnvironment.get_execution_environment()
ds = env.from_collection([1, 2, 3, 4, 5])
ds.filter(lambda x: x % 2 == 0).print()
env.execute("Python DataStream Job")
2. Python Table API
from pyflink.table import EnvironmentSettings, TableEnvironment
settings = EnvironmentSettings.in_streaming_mode()
t_env = TableEnvironment.create(settings)
t_env.execute_sql("""
CREATE TABLE Orders (
user_id STRING,
product STRING,
amount INT
) WITH (...)
""")
result = t_env.sql_query("SELECT product, SUM(amount) FROM Orders GROUP BY product")
result.execute().print()
XI. API Best Practices
1. API Selection Guide
Scenario | Recommended API |
---|---|
Declarative ETL | Table API / SQL |
Complex event processing | CEP |
Fine-grained state control | ProcessFunction |
Unified batch/stream analytics | Table API |
Machine learning pipelines | Flink ML |
Graph analytics | Gelly |
2. Performance Tuning Essentials
- State backend: use RocksDB for large state (see the sketch after this list)
- Checkpointing: tune the interval and timeout
- Parallelism: align with the number of Kafka partitions
- Network buffers: tune taskmanager.memory.network.fraction
- Serialization: prefer efficient serializers (e.g. Flink's POJO/Tuple serializers over the Kryo fallback)
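A minimal sketch for the first point, assuming Flink 1.13+ and the flink-statebackend-rocksdb dependency on the classpath:
// RocksDB keeps state on local disk rather than the JVM heap and supports
// incremental checkpoints, which suits very large keyed state
env.setStateBackend(new EmbeddedRocksDBStateBackend(true)); // true = incremental checkpoints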
3. Fault-Tolerance Design
// Restart strategy configuration
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
3, Time.of(10, TimeUnit.SECONDS)
));
// Checkpoint configuration
env.enableCheckpointing(5000);
env.getCheckpointConfig().setCheckpointStorage("hdfs:///checkpoints");
env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);
Flink's API stack offers a complete spectrum of solutions, from simple to complex and from declarative to imperative, letting developers pick the abstraction level that fits the problem at hand and build efficient, reliable unified stream/batch applications.