Apache Flink Java Example: A Real-Time ETL Pipeline (Kafka to Elasticsearch)
This article walks through building a complete real-time ETL pipeline with Apache Flink: raw data is read from Kafka, passed through a series of transformations, and finally written to Elasticsearch. The example applies to a wide range of data-processing scenarios such as log analytics, user-behavior tracking, and IoT data processing.
System Architecture Overview
At a high level: Kafka (raw JSON events) → Flink (parse → cleanse → enrich → normalize → validate) → Elasticsearch (time-partitioned indices), with invalid records routed to a dead-letter Kafka topic.
Complete Implementation
1. Dependency Configuration (pom.xml)
<dependencies>
<!-- Flink core dependencies -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-java</artifactId>
<version>1.17.0</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java</artifactId>
<version>1.17.0</version>
</dependency>
<!-- Kafka connector -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka</artifactId>
<version>1.17.0</version>
</dependency>
<!-- Elasticsearch connector (released separately from Flink core; use the ES7 connector build that matches Flink 1.17, e.g. 3.0.1-1.17) -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-elasticsearch7</artifactId>
<version>3.0.1-1.17</version>
</dependency>
<!-- JSON handling -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-json</artifactId>
<version>1.17.0</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.14.2</version>
</dependency>
<!-- Needed so Jackson can serialize java.time.Instant as ISO-8601 strings -->
<dependency>
<groupId>com.fasterxml.jackson.datatype</groupId>
<artifactId>jackson-datatype-jsr310</artifactId>
<version>2.14.2</version>
</dependency>
<!-- Other utilities -->
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<version>1.18.26</version>
<scope>provided</scope>
</dependency>
</dependencies>
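If you run the job directly from the IDE rather than submitting it to a cluster, you will typically also need the flink-clients artifact inside the same dependencies block; a minimal addition, assuming the same Flink version as above:
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients</artifactId>
<version>1.17.0</version>
</dependency>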
2. Data Model Definitions
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import java.time.Instant;
/**
* Raw event POJO (the format received from Kafka)
*/
@Data
@NoArgsConstructor
@AllArgsConstructor
public class RawEvent {
private String eventId;
private String eventType;
private String userId;
private String ipAddress;
private String deviceType;
private String userAgent;
private Double value;
private String metadata;
private Long timestamp;
}
/**
* Structured event after processing (the format written to Elasticsearch)
*/
@Data
@NoArgsConstructor
@AllArgsConstructor
public class ProcessedEvent {
private String eventId;
private String eventType;
private String userId;
private String location; // Geographic location resolved from the IP address
private String deviceCategory; // Device category (Desktop/Mobile/Tablet)
private String osFamily; // Operating-system family
private Double value;
private Double normalizedValue; // Normalized value
private String status; // Processing status (SUCCESS / FAILED)
private Instant eventTime; // Event time (serialized as ISO-8601)
private Instant processTime; // ETL processing time
}
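For reference, here is a hypothetical raw event as it would arrive on the Kafka topic, together with a tiny Jackson round-trip check. The field values are made up for illustration; the parsing is the same that the Flink job performs later.
import com.fasterxml.jackson.databind.ObjectMapper;

public class RawEventSample {
    public static void main(String[] args) throws Exception {
        // Example payload matching the RawEvent POJO above (illustrative values)
        String json = "{"
                + "\"eventId\":\"e-1001\","
                + "\"eventType\":\"purchase\","
                + "\"userId\":\"u-42\","
                + "\"ipAddress\":\"203.0.113.7\","
                + "\"deviceType\":\"phone\","
                + "\"userAgent\":\"Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X)\","
                + "\"value\":19.9,"
                + "\"metadata\":\"{}\","
                + "\"timestamp\":1688188800000}";

        ObjectMapper mapper = new ObjectMapper();
        RawEvent event = mapper.readValue(json, RawEvent.class); // same parsing step as in the job
        System.out.println(event.getEventType() + " by " + event.getUserId());
    }
}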
3. Flink ETL Job Implementation
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.serialization.SerializationSchema;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.elasticsearch.sink.Elasticsearch7SinkBuilder;
import org.apache.flink.connector.elasticsearch.sink.FlushBackoffType;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Requests;
import org.elasticsearch.common.xcontent.XContentType;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.SerializationFeature;
import com.fasterxml.jackson.databind.node.ObjectNode;
import com.fasterxml.jackson.datatype.jsr310.JavaTimeModule;
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class KafkaToElasticsearchETL {
// Side-output tag for records that fail business validation
private static final OutputTag<ProcessedEvent> DEAD_LETTER_TAG =
new OutputTag<ProcessedEvent>("dead-letter") {};
// Simplified IP-geolocation lookup table (use a proper geo-IP service in production)
private static final List<String> LOCATIONS = Arrays.asList(
"Beijing", "Shanghai", "Guangzhou", "Shenzhen", "Hangzhou", "Chengdu", "New York", "London", "Tokyo", "Sydney"
);
public static void main(String[] args) throws Exception {
// 1. Set up the stream execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Enable checkpointing (exactly-once state consistency; writes to Elasticsearch are at-least-once)
env.enableCheckpointing(30000); // checkpoint every 30 seconds
// Configure the restart strategy
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
3, // at most 3 restart attempts
Time.seconds(10) // 10 seconds between attempts
));
// 2. Create the Kafka source (in production these values should come from configuration)
KafkaSource<String> kafkaSource = KafkaSource.<String>builder()
.setBootstrapServers("kafka-broker:9092")
.setTopics("user-events-raw")
.setGroupId("flink-etl-group")
// Start from the earliest offsets; this supersedes any auto.offset.reset consumer property
.setStartingOffsets(OffsetsInitializer.earliest())
.setValueOnlyDeserializer(new SimpleStringSchema())
.build();
// 3. Read the raw data stream from Kafka
DataStream<String> rawDataStream = env.fromSource(
kafkaSource,
WatermarkStrategy.forMonotonousTimestamps(),
"Kafka Source"
);
// 4. The ETL processing pipeline
SingleOutputStreamOperator<ProcessedEvent> processedStream = rawDataStream
// Step 1: parse JSON strings into POJOs (with error handling)
.flatMap(new JsonToRawEventMapper())
.name("JSON Parse")
.uid("json-parser")
// Step 2: data cleansing
.filter(event ->
event.getUserId() != null &&
event.getEventType() != null &&
event.getTimestamp() != null)
.name("Data Filter")
.uid("data-filter")
// Step 3: enrichment (IP geolocation, device classification, etc.)
.map(new EnrichmentMapper())
.name("Data Enrichment")
.uid("data-enrichment")
// Step 4: value normalization
.map(new ValueNormalizationMapper())
.name("Value Normalization")
.uid("value-normalization")
// Step 5: main business processing
.process(new MainProcessingFunction())
.name("Main Processing")
.uid("main-process");
// 5. Obtain the side-output stream of failed records
DataStream<ProcessedEvent> deadLetterStream = processedStream.getSideOutput(DEAD_LETTER_TAG);
// 6. Write the main stream to Elasticsearch
buildElasticsearchSink(processedStream, "user-events-processed");
// 7. Dead-letter handling (write to a separate Kafka topic)
buildDeadLetterSink(deadLetterStream);
// 8. Execute the job
env.execute("Real-time ETL Pipeline: Kafka to Elasticsearch");
}
// ======================= ETL processing functions =======================
/**
* Parses JSON strings into RawEvent objects (with error handling)
*/
private static class JsonToRawEventMapper
implements FlatMapFunction<String, RawEvent> {
private final ObjectMapper jsonMapper = new ObjectMapper();
@Override
public void flatMap(String json, Collector<RawEvent> out) {
try {
RawEvent event = jsonMapper.readValue(json, RawEvent.class);
out.collect(event);
} catch (Exception e) {
// In a real application, log the failure (routing to a side output would require a ProcessFunction)
System.err.println("Failed to parse JSON: " + json);
}
}
}
/**
* Enrichment mapper: adds geolocation, device classification, etc.
*/
private static class EnrichmentMapper
implements MapFunction<RawEvent, ProcessedEvent> {
private static final Pattern DEVICE_PATTERN = Pattern.compile(
"iPhone|iPad|Android|Windows Phone|Mobile", Pattern.CASE_INSENSITIVE);
private static final Pattern OS_PATTERN = Pattern.compile(
"Windows NT|Mac OS X|Linux|Android|iOS", Pattern.CASE_INSENSITIVE);
@Override
public ProcessedEvent map(RawEvent raw) {
ProcessedEvent processed = new ProcessedEvent();
// Copy basic fields
processed.setEventId(raw.getEventId());
processed.setEventType(raw.getEventType());
processed.setUserId(raw.getUserId());
processed.setValue(raw.getValue());
// Resolve a geographic location from the IP address (simplified logic)
processed.setLocation(resolveLocation(raw.getIpAddress()));
// Parse device information from the User-Agent
String ua = raw.getUserAgent() != null ? raw.getUserAgent() : "";
processed.setDeviceCategory(classifyDevice(ua));
processed.setOsFamily(detectOsFamily(ua));
// Convert the epoch-millisecond timestamp to an Instant
Instant eventTime = Instant.ofEpochMilli(raw.getTimestamp());
processed.setEventTime(eventTime);
processed.setProcessTime(Instant.now());
processed.setStatus("SUCCESS");
return processed;
}
private String resolveLocation(String ip) {
if (ip == null || ip.isEmpty()) return "Unknown";
// Simplified logic: map the IP's hash code to one of the predefined locations
// (floorMod avoids a negative index for negative hash codes)
int index = Math.floorMod(ip.hashCode(), LOCATIONS.size());
return LOCATIONS.get(index);
}
private String classifyDevice(String userAgent) {
if (userAgent == null) return "Unknown";
Matcher matcher = DEVICE_PATTERN.matcher(userAgent);
if (matcher.find()) {
String device = matcher.group().toLowerCase();
if (device.contains("ipad")) return "Tablet";
// iPhone, Android, Windows Phone and generic "Mobile" tokens all count as mobile devices
return "Mobile";
}
return "Desktop";
}
private String detectOsFamily(String userAgent) {
if (userAgent == null) return "Unknown";
Matcher matcher = OS_PATTERN.matcher(userAgent);
if (matcher.find()) {
return matcher.group();
}
return "Other";
}
}
/**
* Value normalization mapper
*/
private static class ValueNormalizationMapper
implements MapFunction<ProcessedEvent, ProcessedEvent> {
@Override
public ProcessedEvent map(ProcessedEvent event) {
if (event.getValue() != null) {
// Simplified normalization (replace with real business rules in production)
event.setNormalizedValue(Math.log1p(event.getValue()));
}
return event;
}
}
/**
* Main processing function (entry point for business logic)
*/
private static class MainProcessingFunction
extends ProcessFunction<ProcessedEvent, ProcessedEvent> {
@Override
public void processElement(ProcessedEvent event,
Context context,
Collector<ProcessedEvent> out) {
try {
// Example business-rule validation (the null check avoids an NPE when unboxing value)
if ("purchase".equals(event.getEventType())
&& event.getValue() != null && event.getValue() < 0) {
throw new IllegalArgumentException("Purchase amount must not be negative");
}
// All checks passed; emit to the main stream
out.collect(event);
} catch (Exception e) {
// Mark the record as failed and route it to the dead-letter side output
event.setStatus("FAILED: " + e.getMessage());
context.output(DEAD_LETTER_TAG, event);
}
}
}
// ======================= Sink builders =======================
/**
* Builds the Elasticsearch sink
*/
private static void buildElasticsearchSink(
DataStream<ProcessedEvent> stream,
String indexPrefix
) {
// Serializable, reusable mapper; JavaTimeModule plus disabled timestamp output makes the
// Instant fields serialize as ISO-8601 strings, matching the "date" fields in the index template
final ObjectMapper mapper = new ObjectMapper()
.registerModule(new JavaTimeModule())
.disable(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS);
Elasticsearch7SinkBuilder<ProcessedEvent> esSinkBuilder = new Elasticsearch7SinkBuilder<>();
// setHosts takes varargs rather than a List
esSinkBuilder.setHosts(
new HttpHost("es-node1", 9200, "http"),
new HttpHost("es-node2", 9200, "http"));
// Time-partitioned index names, e.g. user-events-processed-2023-07
esSinkBuilder.setEmitter((event, context, indexer) -> {
String indexName = indexPrefix + "-" + DateTimeFormatter.ofPattern("yyyy-MM")
.withZone(ZoneId.of("UTC")) // a zone is required when formatting an Instant
.format(Instant.now());
ObjectNode json = mapper.valueToTree(event);
IndexRequest request = Requests.indexRequest()
.index(indexName)
.id(event.getEventId()) // using the event id as document id makes retried writes idempotent
.source(json.toString(), XContentType.JSON);
indexer.add(request);
});
// Production tuning
esSinkBuilder.setBulkFlushMaxActions(1000); // flush after 1000 buffered actions
esSinkBuilder.setBulkFlushInterval(5000); // or every 5 seconds
esSinkBuilder.setBulkFlushBackoffStrategy(
FlushBackoffType.EXPONENTIAL, 5, 1000); // exponential backoff on bulk failures
stream.sinkTo(esSinkBuilder.build())
.name("Elasticsearch Sink")
.uid("elasticsearch-sink");
}
/**
* Builds the dead-letter sink (writes failed records back to Kafka)
*/
private static void buildDeadLetterSink(DataStream<ProcessedEvent> stream) {
// Serializable mapper with java.time support, so Instant fields serialize as ISO-8601 strings
final ObjectMapper mapper = new ObjectMapper()
.registerModule(new JavaTimeModule())
.disable(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS);
// SerializationSchema takes a single element and cannot throw checked exceptions
final SerializationSchema<ProcessedEvent> valueSerializer = element -> {
try {
return mapper.writeValueAsBytes(element);
} catch (JsonProcessingException e) {
throw new RuntimeException("Failed to serialize dead-letter record", e);
}
};
KafkaSink<ProcessedEvent> deadLetterSink = KafkaSink.<ProcessedEvent>builder()
.setBootstrapServers("kafka-broker:9092")
.setRecordSerializer(
KafkaRecordSerializationSchema.builder()
.setTopic("user-events-dead-letter")
.setValueSerializationSchema(valueSerializer)
.build()
)
.setDeliverGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
.build();
stream.sinkTo(deadLetterSink)
.name("Dead Letter Sink")
.uid("dead-letter-sink");
}
}
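Before pointing the job at real Kafka and Elasticsearch clusters, the transformation chain can be smoke-tested locally with a bounded in-memory source and print() standing in for the Elasticsearch sink. A minimal sketch, assuming the mapper classes above are made accessible (for example, package-private or public) so they can be reused, and with made-up sample payloads:
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LocalPipelineSmokeTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for the Kafka source: one valid event and one broken payload
        env.fromElements(
                "{\"eventId\":\"e-1\",\"eventType\":\"click\",\"userId\":\"u-1\","
                        + "\"ipAddress\":\"198.51.100.4\",\"deviceType\":\"phone\","
                        + "\"userAgent\":\"Mozilla/5.0 (iPhone)\",\"value\":1.0,"
                        + "\"metadata\":\"{}\",\"timestamp\":1688188800000}",
                "not-json-at-all")
            // Same transformation chain as the production job
            .flatMap(new JsonToRawEventMapper())
            .filter(e -> e.getUserId() != null && e.getEventType() != null && e.getTimestamp() != null)
            .map(new EnrichmentMapper())
            .map(new ValueNormalizationMapper())
            // print() stands in for the Elasticsearch sink
            .print();

        env.execute("Local smoke test");
    }
}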
4. Elasticsearch Index Template
Create an index template in Elasticsearch so that automatically created indices get the correct field types:
PUT _index_template/user-events-template
{
"index_patterns": ["user-events-processed-*"],
"template": {
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1
},
"mappings": {
"dynamic": "strict",
"properties": {
"eventId": {"type": "keyword"},
"eventType": {"type": "keyword"},
"userId": {"type": "keyword"},
"location": {"type": "keyword"},
"deviceCategory": {"type": "keyword"},
"osFamily": {"type": "keyword"},
"value": {"type": "double"},
"normalizedValue": {"type": "double"},
"status": {"type": "keyword"},
"eventTime": {"type": "date"},
"processTime": {"type": "date"}
}
}
}
}
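Once the job is running, a quick way to verify the output is a search against the time-partitioned indices. The index pattern and field names follow the template above; the aggregation name is arbitrary:
GET user-events-processed-*/_search
{
  "size": 1,
  "aggs": {
    "by_device": {
      "terms": { "field": "deviceCategory" }
    }
  }
}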
5. ETL Pipeline Walkthrough
A. Data Flow
Raw JSON events flow from Kafka through parse → cleanse → enrich → normalize → validate, and are then written to Elasticsearch, with failed records routed to a dead-letter Kafka topic.
B. Key Processing Stages
- Ingestion:
  - Read raw JSON events from Kafka
  - Manage consumption offsets with a Kafka consumer group
- Parsing and cleansing:
  - Convert JSON strings into POJOs
  - Filter out invalid records (missing fields / malformed data)
  - Handle parse failures
- Enrichment:
  - Resolve a geographic location from the IP address
  - Derive device category and operating-system family from the User-Agent
  - Attach a processing timestamp
- Value normalization:
  - Apply business-specific rules (e.g. a log transform)
  - Keep values in a form suitable for downstream analysis
- Business validation:
  - Enforce domain-specific rules
  - Detect and handle bad records
  - Mark the processing status (SUCCESS/FAILED)
- Output:
  - Write successful records to Elasticsearch (time-partitioned indices)
  - Write failed records to the dead-letter Kafka topic
C. Fault Tolerance and Reliability
- Exactly-once state consistency via checkpointing (delivery into Elasticsearch is at-least-once, made effectively idempotent by using the event id as the document id); a checkpoint-tuning sketch follows this list:
  env.enableCheckpointing(30000);
- Dead-letter routing for records that fail validation:
  context.output(DEAD_LETTER_TAG, event);
- Backoff and retry for Elasticsearch bulk writes:
  esSinkBuilder.setBulkFlushBackoffStrategy(FlushBackoffType.EXPONENTIAL, 5, 1000);
- Restart strategy:
  env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));
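Beyond simply enabling checkpointing, production jobs usually tune the checkpoint behaviour as well. A minimal sketch using Flink's CheckpointConfig; the storage path is a placeholder, point it at your own durable filesystem:
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public final class CheckpointTuning {
    // Call this right after StreamExecutionEnvironment.getExecutionEnvironment() in main()
    static void configureCheckpointing(StreamExecutionEnvironment env) {
        env.enableCheckpointing(30_000, CheckpointingMode.EXACTLY_ONCE);
        // Persist checkpoints to durable storage (placeholder path)
        env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints");
        // Leave the job at least 10s of processing time between two checkpoints
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(10_000);
        // Fail checkpoints that take longer than 2 minutes
        env.getCheckpointConfig().setCheckpointTimeout(120_000);
        // Never run more than one checkpoint at a time
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
    }
}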
6. Production Deployment Recommendations
A. Parameterized Configuration
Manage parameters with Flink's ParameterTool or an external configuration store (e.g. ZooKeeper) instead of hard-coding them (a wiring sketch follows below):
// Read parameters from the command line
ParameterTool params = ParameterTool.fromArgs(args);
String kafkaServers = params.get("kafka.brokers", "localhost:9092");
String esHosts = params.get("elasticsearch.hosts", "localhost:9200");
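A minimal sketch of wiring those parameters into the job. ParameterTool and setGlobalJobParameters are standard Flink APIs; the parameter names are this article's own convention:
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class JobConfig {
    public static void main(String[] args) throws Exception {
        ParameterTool params = ParameterTool.fromArgs(args);
        String kafkaServers = params.get("kafka.brokers", "localhost:9092");
        int parallelism = params.getInt("parallelism", 4);

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(parallelism);
        // Makes the parameters visible in the Flink web UI and accessible from rich functions
        env.getConfig().setGlobalJobParameters(params);
        // ... build the pipeline using kafkaServers etc., then call env.execute(...)
    }
}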
B. Monitoring and Alerting
- Flink Web Dashboard: monitor job status in real time
- Prometheus + Grafana: metrics reporters are configured in flink-conf.yaml rather than in code; place the flink-metrics-prometheus jar under Flink's plugins directory and configure, for example:
  metrics.reporter.promgateway.factory.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporterFactory
  metrics.reporter.promgateway.host: pushgateway
  metrics.reporter.promgateway.port: 9091
  metrics.reporter.promgateway.jobName: flink-etl
- Alerting rules (see the parse-error counter sketch after this list):
  - End-to-end latency above a threshold
  - Dead-letter topic growing too fast
  - Checkpoint failures
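A common source of alerts in this pipeline is the parse step. As one possible shape, here is a small variation on JsonToRawEventMapper that uses a RichFlatMapFunction so it can expose a Flink Counter for parse failures; the metric name is arbitrary:
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;
import org.apache.flink.util.Collector;
import com.fasterxml.jackson.databind.ObjectMapper;

public class MeteredJsonParser extends RichFlatMapFunction<String, RawEvent> {
    private transient ObjectMapper mapper;
    private transient Counter parseErrors;

    @Override
    public void open(Configuration parameters) {
        mapper = new ObjectMapper();
        // Exposed through whichever metrics reporter is configured (e.g. Prometheus)
        parseErrors = getRuntimeContext().getMetricGroup().counter("parse-errors");
    }

    @Override
    public void flatMap(String json, Collector<RawEvent> out) {
        try {
            out.collect(mapper.readValue(json, RawEvent.class));
        } catch (Exception e) {
            parseErrors.inc(); // drive the "parse failures / dead letters" alert from this counter
        }
    }
}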
C. Performance Tuning
// Enable the RocksDB state backend (requires the flink-statebackend-rocksdb dependency)
env.setStateBackend(new EmbeddedRocksDBStateBackend());
// Network buffer memory is configured in flink-conf.yaml rather than through the ExecutionConfig,
// e.g. taskmanager.memory.network.fraction: 0.3 (a flink-conf.yaml sketch follows)
// Parallelism
env.setParallelism(4); // adjust to the size of the cluster
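The cluster-side settings expressed as a flink-conf.yaml sketch; the sizes are placeholders to be tuned for your cluster:
# TaskManager memory budget (placeholder values)
taskmanager.memory.process.size: 4096m
taskmanager.numberOfTaskSlots: 4
# Network buffers for shuffles between operators
taskmanager.memory.network.fraction: 0.3
taskmanager.memory.network.max: 2gb
# State backend and checkpoint storage
state.backend: rocksdb
state.checkpoints.dir: hdfs:///flink/checkpoints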
D. Security Configuration (a fuller source-builder sketch follows)
// Kafka SASL authentication: set as properties on the KafkaSource builder
.setProperty("security.protocol", "SASL_SSL")
.setProperty("sasl.mechanism", "PLAIN")
// Elasticsearch basic authentication
esSinkBuilder.setConnectionUsername("elastic");
esSinkBuilder.setConnectionPassword("secure-password");
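A fuller sketch of a SASL/PLAIN-secured KafkaSource using standard Kafka client properties; the broker address, credentials, and truststore path are placeholders:
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;

public final class SecureKafkaSourceFactory {
    static KafkaSource<String> create() {
        return KafkaSource.<String>builder()
                .setBootstrapServers("kafka-broker:9093") // TLS listener (placeholder)
                .setTopics("user-events-raw")
                .setGroupId("flink-etl-group")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                // Standard Kafka client security properties
                .setProperty("security.protocol", "SASL_SSL")
                .setProperty("sasl.mechanism", "PLAIN")
                .setProperty("sasl.jaas.config",
                        "org.apache.kafka.common.security.plain.PlainLoginModule required "
                                + "username=\"etl-user\" password=\"etl-password\";")
                .setProperty("ssl.truststore.location", "/etc/kafka/client.truststore.jks")
                .setProperty("ssl.truststore.password", "truststore-password")
                .build();
    }
}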
Full System Architecture
The complete flow: Kafka (user-events-raw) → Flink ETL job → Elasticsearch (user-events-processed-*) for successful records, and Kafka (user-events-dead-letter) for failed ones.
Extended Application Scenarios
- Real-time user-behavior analysis (a minimal aggregator sketch follows this list):
  // Add user-behavior pattern analysis
  processedStream
      .keyBy(ProcessedEvent::getUserId)
      .window(SlidingEventTimeWindows.of(Time.minutes(30), Time.minutes(5)))
      .aggregate(new UserBehaviorAggregator())
      .addSink(new UserProfileSink());
- Anomaly detection and real-time alerting (failed records arrive on the dead-letter side output):
  deadLetterStream
      .process(new AlertGenerator())
      .addSink(new AlertSystemSink());
- Multiple output targets:
  // Write to a relational database
  processedStream.addSink(JdbcSink.sink(...));
  // Write to a data lake
  processedStream.addSink(new IcebergSink(...));
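UserBehaviorAggregator above is a placeholder; as one possible shape, a minimal AggregateFunction that counts events per user per window could look like this (the Long output type is this sketch's own choice):
import org.apache.flink.api.common.functions.AggregateFunction;

/** Counts ProcessedEvent records per key within each window. */
public class UserBehaviorAggregator
        implements AggregateFunction<ProcessedEvent, Long, Long> {

    @Override
    public Long createAccumulator() {
        return 0L;
    }

    @Override
    public Long add(ProcessedEvent event, Long accumulator) {
        return accumulator + 1;
    }

    @Override
    public Long getResult(Long accumulator) {
        return accumulator;
    }

    @Override
    public Long merge(Long a, Long b) {
        return a + b;
    }
}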
With this complete example you can quickly build a reliable, production-grade ETL pipeline for high-performance real-time processing from Kafka to Elasticsearch, and the architecture extends easily to more complex business scenarios.