Apache Flink Java Example: A Real-Time ETL Pipeline (Kafka to Elasticsearch)

This article walks through building a complete real-time ETL pipeline with Apache Flink: raw data is read from Kafka, passed through a series of transformations, and finally written to Elasticsearch. The example applies to a wide range of data-processing scenarios such as log analytics, user-behavior tracking, and IoT data processing.

System Architecture Overview

Kafka (raw data source) → Flink ETL job (cleaning/transformation → enrichment/aggregation → error handling) → Elasticsearch (structured storage) → Kibana dashboards and other downstream applications.

Complete Implementation

1. Dependency Configuration (pom.xml)

<dependencies>
    <!-- Flink core dependencies -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>1.17.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java</artifactId>
        <version>1.17.0</version>
    </dependency>
    
    <!-- Kafka connector -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-kafka</artifactId>
        <version>1.17.0</version>
    </dependency>
    
    <!-- Elasticsearch connector (released separately from Flink since 1.16;
         3.0.1-1.17 targets Flink 1.17) -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-elasticsearch7</artifactId>
        <version>3.0.1-1.17</version>
    </dependency>
    
    <!-- JSON processing -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-json</artifactId>
        <version>1.17.0</version>
    </dependency>
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-databind</artifactId>
        <version>2.14.2</version>
    </dependency>
    <!-- Needed to serialize the java.time.Instant fields of ProcessedEvent -->
    <dependency>
        <groupId>com.fasterxml.jackson.datatype</groupId>
        <artifactId>jackson-datatype-jsr310</artifactId>
        <version>2.14.2</version>
    </dependency>
    
    <!-- Other utilities -->
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <version>1.18.26</version>
        <scope>provided</scope>
    </dependency>
</dependencies>

2. Data Model Definitions

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

import java.time.Instant;

/**
 * Raw event POJO (the format consumed from Kafka)
 */
@Data
@NoArgsConstructor
@AllArgsConstructor
public class RawEvent {
    private String eventId;
    private String eventType;
    private String userId;
    private String ipAddress;
    private String deviceType;
    private String userAgent;
    private Double value;
    private String metadata;
    private Long timestamp;
}

/**
 * Processed, structured event (the format written to Elasticsearch)
 */
@Data
@NoArgsConstructor
@AllArgsConstructor
public class ProcessedEvent {
    private String eventId;
    private String eventType;
    private String userId;
    private String location;       // geo-location resolved from the IP address
    private String deviceCategory; // device class (Desktop/Mobile/Tablet)
    private String osFamily;       // operating-system family
    private Double value;
    private Double normalizedValue; // normalized value
    private String status;         // processing status (SUCCESS / FAILED)
    private Instant eventTime;     // event time (ISO-8601 when serialized)
    private Instant processTime;   // ETL processing time
}
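
Before wiring up the job, it helps to confirm that the JSON shape on the Kafka topic actually maps onto RawEvent. The following is a minimal sketch with a made-up sample message (the class name and all field values are illustrative, not taken from a real topic; it assumes RawEvent is in the same package):

import com.fasterxml.jackson.databind.ObjectMapper;

public class RawEventParseCheck {
    public static void main(String[] args) throws Exception {
        // A hypothetical message in the shape this pipeline expects on "user-events-raw"
        String sample = "{"
            + "\"eventId\":\"e-1001\",\"eventType\":\"purchase\",\"userId\":\"u-42\","
            + "\"ipAddress\":\"203.0.113.7\",\"deviceType\":\"phone\","
            + "\"userAgent\":\"Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X)\","
            + "\"value\":19.9,\"metadata\":\"{}\",\"timestamp\":1689600000000}";

        RawEvent event = new ObjectMapper().readValue(sample, RawEvent.class);
        System.out.println(event); // Lombok's @Data generates a readable toString()
    }
}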

3. The Flink ETL Job

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.elasticsearch.sink.Elasticsearch7SinkBuilder;
import org.apache.flink.connector.elasticsearch.sink.FlushBackoffType;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Requests;

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.SerializationFeature;
import com.fasterxml.jackson.datatype.jsr310.JavaTimeModule;

import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KafkaToElasticsearchETL {

    // Side-output tag for records that fail business validation
    private static final OutputTag<ProcessedEvent> DEAD_LETTER_TAG = 
        new OutputTag<ProcessedEvent>("dead-letter") {};
    
    // Shared, thread-safe JSON mapper. JavaTimeModule plus ISO-8601 output makes the
    // Instant fields serialize as strings that the Elasticsearch "date" mapping accepts.
    private static final ObjectMapper JSON_MAPPER = new ObjectMapper()
        .registerModule(new JavaTimeModule())
        .disable(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS);

    // Stand-in for an IP geo-location lookup (use a real geo-IP service in production)
    private static final List<String> LOCATIONS = Arrays.asList(
        "Beijing", "Shanghai", "Guangzhou", "Shenzhen", "Hangzhou",
        "Chengdu", "New York", "London", "Tokyo", "Sydney"
    );

    public static void main(String[] args) throws Exception {
        // 1. Set up the stream execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        
        // Enable checkpointing (exactly-once state consistency inside Flink)
        env.enableCheckpointing(30000); // checkpoint every 30 seconds
        
        // Restart strategy
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
            3, // at most 3 restart attempts
            Time.seconds(10) // 10 seconds between attempts
        ));
        
        // 2. Create the Kafka source (in production, read these settings from configuration)
        KafkaSource<String> kafkaSource = KafkaSource.<String>builder()
            .setBootstrapServers("kafka-broker:9092")
            .setTopics("user-events-raw")
            .setGroupId("flink-etl-group")
            // Start from the earliest offset when no checkpointed/committed position exists
            .setStartingOffsets(OffsetsInitializer.earliest())
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();
        
        // 3. Read the raw stream from Kafka
        DataStream<String> rawDataStream = env.fromSource(
            kafkaSource,
            WatermarkStrategy.forMonotonousTimestamps(),
            "Kafka Source"
        );
        
        // 4. The ETL pipeline
        SingleOutputStreamOperator<ProcessedEvent> processedStream = rawDataStream
            // Step 1: parse JSON strings into POJOs (with error handling)
            .flatMap(new JsonToRawEventMapper())
            .name("JSON Parsing")
            .uid("json-parser")
            
            // Step 2: data cleansing
            .filter(event -> 
                event.getUserId() != null && 
                event.getEventType() != null &&
                event.getTimestamp() != null)
            .name("Data Filtering")
            .uid("data-filter")
            
            // Step 3: enrichment (geo-location, device classification, ...)
            .map(new EnrichmentMapper())
            .name("Data Enrichment")
            .uid("data-enrichment")
            
            // Step 4: value normalization
            .map(new ValueNormalizationMapper())
            .name("Value Normalization")
            .uid("value-normalization")
            
            // Step 5: main business processing
            .process(new MainProcessingFunction())
            .name("Main Processing")
            .uid("main-process");
        
        // 5. Extract the error (dead-letter) stream
        DataStream<ProcessedEvent> deadLetterStream = processedStream.getSideOutput(DEAD_LETTER_TAG);
        
        // 6. Write to Elasticsearch
        buildElasticsearchSink(processedStream, "user-events-processed");
        
        // 7. Dead-letter handling (write to a separate Kafka topic)
        buildDeadLetterSink(deadLetterStream);
        
        // 8. Execute the job
        env.execute("Real-time ETL Pipeline: Kafka to Elasticsearch");
    }

    // ======================= ETL Functions =======================

    /**
     * Parses JSON strings into RawEvent objects (with error handling).
     */
    private static class JsonToRawEventMapper 
        implements FlatMapFunction<String, RawEvent> {
        
        private final ObjectMapper jsonMapper = new ObjectMapper();
        
        @Override
        public void flatMap(String json, Collector<RawEvent> out) {
            try {
                RawEvent event = jsonMapper.readValue(json, RawEvent.class);
                out.collect(event);
            } catch (Exception e) {
                // In a real job, log the failure (and ideally count it in a metric)
                System.err.println("Failed to parse JSON: " + json);
            }
        }
    }

    /**
     * Enrichment: adds geo-location, device classification, and OS family.
     */
    private static class EnrichmentMapper 
        implements MapFunction<RawEvent, ProcessedEvent> {
        
        private static final Pattern DEVICE_PATTERN = Pattern.compile(
            "iPhone|iPad|Android|Windows Phone|Mobile", Pattern.CASE_INSENSITIVE);
        private static final Pattern OS_PATTERN = Pattern.compile(
            "Windows NT|Mac OS X|Linux|Android|iOS", Pattern.CASE_INSENSITIVE);
        
        @Override
        public ProcessedEvent map(RawEvent raw) {
            ProcessedEvent processed = new ProcessedEvent();
            
            // Copy basic fields
            processed.setEventId(raw.getEventId());
            processed.setEventType(raw.getEventType());
            processed.setUserId(raw.getUserId());
            processed.setValue(raw.getValue());
            
            // Resolve geo-location from the IP address (simplified logic)
            processed.setLocation(resolveLocation(raw.getIpAddress()));
            
            // Parse device information from the User-Agent
            String ua = raw.getUserAgent() != null ? raw.getUserAgent() : "";
            processed.setDeviceCategory(classifyDevice(ua));
            processed.setOsFamily(detectOsFamily(ua));
            
            // Convert timestamps
            Instant eventTime = Instant.ofEpochMilli(raw.getTimestamp());
            processed.setEventTime(eventTime);
            processed.setProcessTime(Instant.now());
            
            processed.setStatus("SUCCESS");
            
            return processed;
        }
        
        private String resolveLocation(String ip) {
            if (ip == null || ip.isEmpty()) return "Unknown";
            
            // Simplified logic: derive a location from the IP's hash
            int hash = ip.hashCode();
            int index = Math.abs(hash) % LOCATIONS.size();
            return LOCATIONS.get(index);
        }
        
        private String classifyDevice(String userAgent) {
            if (userAgent == null) return "Unknown";
            
            Matcher matcher = DEVICE_PATTERN.matcher(userAgent);
            if (matcher.find()) {
                String device = matcher.group().toLowerCase();
                if (device.contains("iphone") || device.contains("android")) 
                    return "Mobile";
                if (device.contains("ipad")) return "Tablet";
            }
            return "Desktop";
        }
        
        private String detectOsFamily(String userAgent) {
            if (userAgent == null) return "Unknown";
            
            Matcher matcher = OS_PATTERN.matcher(userAgent);
            if (matcher.find()) {
                return matcher.group();
            }
            return "Other";
        }
    }

    /**
     * Value normalization.
     */
    private static class ValueNormalizationMapper 
        implements MapFunction<ProcessedEvent, ProcessedEvent> {
        
        @Override
        public ProcessedEvent map(ProcessedEvent event) {
            if (event.getValue() != null) {
                // Simplified normalization (apply real business rules in practice)
                event.setNormalizedValue(Math.log1p(event.getValue()));
            }
            return event;
        }
    }

    /**
     * Main processing function (business-rule validation).
     */
    private static class MainProcessingFunction 
        extends ProcessFunction<ProcessedEvent, ProcessedEvent> {
        
        @Override
        public void processElement(ProcessedEvent event, 
                                  Context context, 
                                  Collector<ProcessedEvent> out) {
            try {
                // Example business rule: a purchase amount must not be negative
                if ("purchase".equals(event.getEventType())
                        && event.getValue() != null
                        && event.getValue() < 0) {
                    throw new IllegalArgumentException("Purchase amount must not be negative");
                }
                
                // All checks passed; forward the event downstream
                out.collect(event);
            } catch (Exception e) {
                // Mark the event as failed and route it to the dead-letter side output
                event.setStatus("FAILED: " + e.getMessage());
                context.output(DEAD_LETTER_TAG, event);
            }
        }
    }

    // ======================= Sink Builders =======================

    /**
     * Builds the Elasticsearch sink.
     */
    private static void buildElasticsearchSink(
        DataStream<ProcessedEvent> stream, 
        String indexPrefix
    ) {
        Elasticsearch7SinkBuilder<ProcessedEvent> esSinkBuilder = new Elasticsearch7SinkBuilder<>();
        esSinkBuilder.setHosts(
            new HttpHost("es-node1", 9200, "http"),
            new HttpHost("es-node2", 9200, "http")
        );
        
        // Time-partitioned index, e.g. user-events-processed-2023-07
        esSinkBuilder.setEmitter((event, context, indexer) -> {
            String indexName = indexPrefix + "-" + DateTimeFormatter.ofPattern("yyyy-MM")
                .withZone(ZoneId.of("UTC"))
                .format(Instant.now());
            
            // Convert the POJO to a Map so IndexRequest.source() can infer the content type;
            // the Instant fields become ISO-8601 strings via JSON_MAPPER's configuration
            @SuppressWarnings("unchecked")
            Map<String, Object> document = JSON_MAPPER.convertValue(event, Map.class);
            
            IndexRequest request = Requests.indexRequest()
                .index(indexName)
                .id(event.getEventId())
                .source(document);
            
            indexer.add(request);
        });
        
        // Production tuning
        esSinkBuilder.setBulkFlushMaxActions(1000);     // flush every 1000 documents
        esSinkBuilder.setBulkFlushInterval(5000);       // or every 5 seconds
        esSinkBuilder.setBulkFlushBackoffStrategy(
            FlushBackoffType.EXPONENTIAL, 5, 1000);     // retry with exponential backoff
        
        stream.sinkTo(esSinkBuilder.build())
              .name("Elasticsearch Sink")
              .uid("elasticsearch-sink");
    }

    /**
     * Builds the dead-letter sink (writes failed events back to Kafka).
     */
    private static void buildDeadLetterSink(DataStream<ProcessedEvent> stream) {
        KafkaSink<ProcessedEvent> deadLetterSink = KafkaSink.<ProcessedEvent>builder()
            .setBootstrapServers("kafka-broker:9092")
            .setRecordSerializer(
                KafkaRecordSerializationSchema.<ProcessedEvent>builder()
                    .setTopic("user-events-dead-letter")
                    .setValueSerializationSchema((ProcessedEvent element) -> {
                        try {
                            return JSON_MAPPER.writeValueAsBytes(element);
                        } catch (JsonProcessingException e) {
                            throw new RuntimeException("Failed to serialize dead-letter event", e);
                        }
                    })
                    .build()
            )
            .setDeliverGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
            .build();
        
        stream.sinkTo(deadLetterSink)
              .name("Dead Letter Sink")
              .uid("dead-letter-sink");
    }
}
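
For development it is convenient to run the transformation chain without Kafka or Elasticsearch. The following is a hedged, self-contained smoke-test sketch (the class name and sample data are invented; it assumes RawEvent is on the classpath): it feeds a couple of in-memory strings through equivalent parse/filter steps and prints the result. It is not part of the production job.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

import com.fasterxml.jackson.databind.ObjectMapper;

public class EtlLocalSmokeTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment(1);

        env.fromElements(
                "{\"eventId\":\"e-1\",\"eventType\":\"click\",\"userId\":\"u-1\","
                    + "\"ipAddress\":\"198.51.100.3\",\"userAgent\":\"Mozilla/5.0 (Windows NT 10.0)\","
                    + "\"value\":1.0,\"timestamp\":1689600000000}",
                "not-json-at-all")                        // malformed record, should be dropped
            .flatMap((String json, Collector<RawEvent> out) -> {
                try {
                    out.collect(new ObjectMapper().readValue(json, RawEvent.class));
                } catch (Exception e) {
                    System.err.println("Dropped malformed record: " + json);
                }
            })
            .returns(RawEvent.class)                      // lambdas need an explicit return type
            .filter(e -> e.getUserId() != null && e.getTimestamp() != null)
            .map(e -> e.getEventType() + " by " + e.getUserId())
            .print();

        env.execute("ETL local smoke test");
    }
}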

4. Elasticsearch Index Template

Create an index template in Elasticsearch so that automatically created indices get the correct field types:

PUT _index_template/user-events-template
{
  "index_patterns": ["user-events-processed-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1
    },
    "mappings": {
      "dynamic": "strict",
      "properties": {
        "eventId": {"type": "keyword"},
        "eventType": {"type": "keyword"},
        "userId": {"type": "keyword"},
        "location": {"type": "keyword"},
        "deviceCategory": {"type": "keyword"},
        "osFamily": {"type": "keyword"},
        "value": {"type": "double"},
        "normalizedValue": {"type": "double"},
        "status": {"type": "keyword"},
        "eventTime": {"type": "date"},
        "processTime": {"type": "date"}
      }
    }
  }
}

5. The ETL Pipeline in Detail

A. Data flow
Kafka (raw data) → JSON parsing → validity check (invalid records are dropped) → data cleansing → enrichment → value normalization → business validation → valid events are written to Elasticsearch, failed events go to the dead-letter queue.
B. Key processing stages (a worked example follows this list)
  1. Ingestion

    • Read raw JSON events from Kafka
    • Track consumption offsets through a Kafka consumer group
  2. Parsing and cleansing

    • Convert JSON strings into POJOs
    • Filter out invalid records (missing fields / malformed data)
    • Handle parse errors
  3. Enrichment

    • Resolve a geo-location from the IP address
    • Derive device category and operating system from the User-Agent
    • Attach a processing timestamp
  4. Value normalization

    • Apply business-specific rules (here, a log transform)
    • Make the values suitable for downstream analytics
  5. Business validation

    • Apply domain-specific validation rules
    • Identify and handle invalid data
    • Set the processing status (SUCCESS/FAILED)
  6. Output

    • Successful events are written to Elasticsearch (time-partitioned indices)
    • Failed events are written to the dead-letter queue (Kafka)
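
To make the stages concrete, here is a small worked example of what the mappers produce for one record. The input values are invented; the outputs follow directly from the code in section 3.

// Input (after stage 2: JSON already parsed and the record passed the null checks):
RawEvent raw = new RawEvent(
    "e-1001", "purchase", "u-42", "203.0.113.7", "desktop",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)", 19.9, "{}", 1689600000000L);

// Stage 3, EnrichmentMapper:
//   location        -> one of the demo cities, chosen by ipAddress.hashCode() % LOCATIONS.size()
//   deviceCategory  -> "Desktop"    (the User-Agent matches none of the mobile patterns)
//   osFamily        -> "Windows NT" (first match of OS_PATTERN)
//   eventTime       -> 2023-07-17T13:20:00Z (Instant.ofEpochMilli(1689600000000L))

// Stage 4, ValueNormalizationMapper:
//   normalizedValue -> Math.log1p(19.9) ≈ 3.04

// Stage 5, MainProcessingFunction:
//   value is non-negative, so the event is emitted with status "SUCCESS".
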
C. Fault tolerance and reliability
  1. Checkpointing for exactly-once state consistency inside Flink (the Elasticsearch writes
     are made effectively idempotent by using eventId as the document id); a fuller
     configuration sketch follows this list

    env.enableCheckpointing(30000);
    
  2. Dead-letter handling

    context.output(DEAD_LETTER_TAG, event);
    
  3. Elasticsearch write optimization

    .setBulkFlushBackoffStrategy(
         FlushBackoffType.EXPONENTIAL, 5, 1000);
    
  4. Restart strategy

    env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));
    

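A fuller checkpoint configuration might look like the sketch below. The specific values and the externalized-checkpoint setting are illustrative choices, not taken from the job above; `env` is the environment from the main method.

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;

// Checkpoint every 30 s with exactly-once state semantics
env.enableCheckpointing(30_000, CheckpointingMode.EXACTLY_ONCE);

CheckpointConfig checkpointConfig = env.getCheckpointConfig();
checkpointConfig.setMinPauseBetweenCheckpoints(10_000);   // give the job room to make progress
checkpointConfig.setCheckpointTimeout(120_000);           // abort checkpoints that take over 2 minutes
checkpointConfig.setTolerableCheckpointFailureNumber(3);  // tolerate a few failed checkpoints
checkpointConfig.setExternalizedCheckpointCleanup(        // keep the last checkpoint after cancellation
    CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
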
6. Production Deployment Notes

A. Externalized configuration

Manage parameters through Flink's configuration API or an external configuration center (e.g. ZooKeeper-backed) instead of hard-coding them; a sketch of wiring these values into the builders follows the snippet below.

// Read parameters from the command line
ParameterTool params = ParameterTool.fromArgs(args);
String kafkaServers = params.get("kafka.brokers", "localhost:9092");
String esHosts = params.get("elasticsearch.hosts", "localhost:9200");
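
Here is one way to thread those parameters into the source builder from section 3 (the parameter keys and defaults are illustrative, and the snippet reuses the imports from section 3 plus org.apache.flink.api.java.utils.ParameterTool):

ParameterTool params = ParameterTool.fromArgs(args);

KafkaSource<String> kafkaSource = KafkaSource.<String>builder()
    .setBootstrapServers(params.get("kafka.brokers", "localhost:9092"))
    .setTopics(params.get("kafka.topic", "user-events-raw"))
    .setGroupId(params.get("kafka.group.id", "flink-etl-group"))
    .setStartingOffsets(OffsetsInitializer.earliest())
    .setValueOnlyDeserializer(new SimpleStringSchema())
    .build();

// Make the parameters visible in the Flink web UI and to rich functions
env.getConfig().setGlobalJobParameters(params);
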
B. Monitoring and alerting
  1. Flink Web Dashboard: monitor job status in real time
  2. Prometheus + Grafana: make org.apache.flink:flink-metrics-prometheus:1.17.0 available
     to the cluster (e.g. in the plugins directory) and configure the reporter in
     flink-conf.yaml rather than in code, for example:

    metrics.reporter.promgateway.factory.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporterFactory
    metrics.reporter.promgateway.hostUrl: http://pushgateway:9091
    metrics.reporter.promgateway.jobName: flink-etl
    
  3. Alerting rules
    • Processing latency above a threshold
    • Dead-letter queue growing too fast
    • Checkpoint failures
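
To make the "dead-letter queue growing too fast" rule actionable, the job can expose a custom counter through Flink's metric API. A hedged sketch, extending the MainProcessingFunction from section 3 (the metric name is an invented example; the two imports go at the top of the job file):

import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

    private static class MainProcessingFunction
            extends ProcessFunction<ProcessedEvent, ProcessedEvent> {

        private transient Counter deadLetterCounter;

        @Override
        public void open(Configuration parameters) {
            // Registered under the operator's metric group; picked up by the configured reporter
            deadLetterCounter = getRuntimeContext().getMetricGroup().counter("deadLetterEvents");
        }

        @Override
        public void processElement(ProcessedEvent event, Context context, Collector<ProcessedEvent> out) {
            try {
                // ... same validation as in section 3 ...
                out.collect(event);
            } catch (Exception e) {
                deadLetterCounter.inc();
                event.setStatus("FAILED: " + e.getMessage());
                context.output(DEAD_LETTER_TAG, event);
            }
        }
    }
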
C. Performance tuning
// Enable the RocksDB state backend (requires the flink-statebackend-rocksdb dependency)
env.setStateBackend(new EmbeddedRocksDBStateBackend());

// Network buffer memory is configured in flink-conf.yaml rather than through code, e.g.
//   taskmanager.memory.network.fraction: 0.3

// Default parallelism: adjust to the cluster size
env.setParallelism(4);
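
Parallelism can also be tuned per operator rather than only globally, which is often more effective when one stage dominates the cost. A sketch based on the pipeline from section 3 (the cleansing filter is omitted for brevity and the parallelism values are illustrative):

SingleOutputStreamOperator<ProcessedEvent> processedStream = rawDataStream
    .flatMap(new JsonToRawEventMapper())
    .name("JSON Parsing")
    .setParallelism(2)                 // parsing is cheap; keep it small
    .map(new EnrichmentMapper())
    .name("Data Enrichment")
    .setParallelism(8)                 // the heaviest stage gets the most slots
    .process(new MainProcessingFunction())
    .name("Main Processing");
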
D. Security
// Kafka SASL authentication: pass the client properties through the KafkaSource builder
KafkaSource.<String>builder()
    .setProperty("security.protocol", "SASL_SSL")
    .setProperty("sasl.mechanism", "PLAIN")
    .setProperty("sasl.jaas.config",
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        + "username=\"<username>\" password=\"<password>\";")
    // ... remaining source settings as in section 3 ...

// Elasticsearch authentication
esSinkBuilder.setConnectionUsername("elastic");
esSinkBuilder.setConnectionPassword("secure-password");

Overall System Architecture

Kafka cluster (raw data) → Flink cluster running the ETL job → Elasticsearch cluster → Kibana dashboards and business-analytics applications. Failed records go to the Kafka dead-letter topic, which feeds an error-analysis system for alert notifications and data repair.

Extending the Pipeline

  1. Real-time user behavior analysis (a minimal aggregator sketch follows this list)

    // Add user behavior pattern analysis
    processedStream
      .keyBy(ProcessedEvent::getUserId)
      .window(SlidingEventTimeWindows.of(Time.minutes(30), Time.minutes(5)))
      .aggregate(new UserBehaviorAggregator())
      .addSink(new UserProfileSink());   // custom sink, e.g. a user-profile store
    
  2. Anomaly detection and real-time alerting

    processedStream
      .filter(event -> "FAILED".equals(event.getStatus()))
      .process(new AlertGenerator())     // custom ProcessFunction that builds alert messages
      .addSink(new AlertSystemSink());   // custom sink to the alerting system
    
  3. Multiple output targets

    // Write to a relational database
    processedStream.addSink(JdbcSink.sink(...));
    
    // Write to a data lake (placeholder; use e.g. the Iceberg or Hudi Flink connectors)
    processedStream.addSink(new IcebergSink(...));

With this end-to-end example you can assemble a reliable, production-oriented ETL pipeline that moves data from Kafka to Elasticsearch in real time, and the same structure extends naturally to more complex business scenarios.
