4.Flink实时项目之日志数据拆分

1. 摘要

我们前面采集的日志数据已经保存到 Kafka 中,作为日志数据的 ODS 层,从 kafka 的ODS 层读取的日志数据分为 3 类, 页面日志、启动日志和曝光日志。这三类数据虽然都是用户行为数据,但是有着完全不一样的数据结构,所以要拆分处理。将拆分后的不同的日志写回 Kafka 不同主题中,作为日志 DWD 层。

页面日志输出到主流,启动日志输出到启动侧输出流,曝光日志输出到曝光侧输出流

2. 识别新老用户

本身客户端业务有新老用户的标识,但是不够准确,需要用实时计算再次确认(不涉及业务操作,只是单纯的做个状态确认)。

利用侧输出流实现数据拆分

根据日志数据内容,将日志数据分为 3 类:页面日志、启动日志和曝光日志。将不同流的数据推送下游的 kafka 的不同 Topic 中

3. 代码实现

在包app下创建flink任务BaseLogTask.java,

通过flink消费kafka 的数据,然后记录消费的checkpoint存到hdfs中,记得要手动创建路径,然后给权限

checkpoint可选择性使用,测试时可以关掉。

package com.zhangbao.gmall.realtime.app;
import com.alibaba.fastjson.JSONObject;
import com.zhangbao.gmall.realtime.utils.MyKafkaUtil;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
/**
 * @author: zhangbao
 * @date: 2021/6/18 23:29
 * @desc:
 **/
public class BaseLogTask {
   
   
    public static void main(String[] args) {
   
   
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        //设置并行度,即kafka分区数
        env.setParallelism(4);
        //添加checkpoint,每5秒执行一次
        env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().setCheckpointTimeout(60000);
        env.setStateBackend(new FsStateBackend("hdfs://hadoop101:9000/gmall/flink/checkpoint/baseLogAll"));
        //指定哪个用户读取hdfs文件
        System.setProperty("HADOOP_USER_NAME","zhangbao");
        //添加数据源
        String topic = "ods_base_log";
        String group = "base_log_app_group";
        FlinkKafkaConsumer<String> kafkaSource = MyKafkaUtil.getKafkaSource(topic, group);
        DataStreamSource<String> kafkaDs = env.addSource(kafkaSource);
        //对格式进行转换
        SingleOutputStreamOperator<JSONObject> jsonDs = kafkaDs.map(new MapFunction<String, JSONObject>() {
   
   
            @Override
            public JSONObject map(String s) throws Exception {
   
   
                return JSONObject.parseObject(s);
            }
        });
        jsonDs.print("json >>> --- ");

        try {
   
   
            //执行
            env.execute();
        } catch (Exception e) {
   
   
            e.printStackTrace();
        }

    }
}

MyKafkaUtil.java工具类

package com.zhangbao.gmall.realtime.utils;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import java.util.Properties;
/**
 * @author: zhangbao
 * @date: 2021/6/18 23:41
 * @desc:
 **/
public class MyKafkaUtil {
   
   
    private static String kafka_host = "hadoop101:9092,hadoop102:9092,hadoop103:9092";
    /**
     * kafka消费者
     */
    public static FlinkKafkaConsumer<String> getKafkaSource(String topic,String group){
   
   
        Properties props = new Properties();
        props.setProperty(ConsumerConfig.GROUP_ID_CONFIG,group);
        props.setProperty(ConsumerConfig
加载配置: 30 个参数 配置加载完成,检查点时间: Fri Aug 01 08:22:46 CST 2025 java.lang.NullPointerException: No key set. This method should not be called outside of a keyed context. at org.apache.flink.util.Preconditions.checkNotNull(Preconditions.java:76) at org.apache.flink.runtime.state.heap.StateTable.checkKeyNamespacePreconditions(StateTable.java:270) at org.apache.flink.runtime.state.heap.StateTable.get(StateTable.java:260) at org.apache.flink.runtime.state.heap.StateTable.get(StateTable.java:143) at org.apache.flink.runtime.state.heap.HeapMapState.get(HeapMapState.java:86) at org.apache.flink.runtime.state.ttl.TtlMapState.lambda$getWrapped$0(TtlMapState.java:64) at org.apache.flink.runtime.state.ttl.AbstractTtlDecorator.getWrappedWithTtlCheckAndUpdate(AbstractTtlDecorator.java:97) at org.apache.flink.runtime.state.ttl.TtlMapState.getWrapped(TtlMapState.java:63) at org.apache.flink.runtime.state.ttl.TtlMapState.get(TtlMapState.java:57) at org.apache.flink.runtime.state.UserFacingMapState.get(UserFacingMapState.java:47) at com.tongchuang.realtime.mds.ULEDataanomalyanalysis$OptimizedAnomalyDetectionFunction.detectMissingTags(ULEDataanomalyanalysis.java:665) at com.tongchuang.realtime.mds.ULEDataanomalyanalysis$OptimizedAnomalyDetectionFunction.processBroadcastElement(ULEDataanomalyanalysis.java:614) at com.tongchuang.realtime.mds.ULEDataanomalyanalysis$OptimizedAnomalyDetectionFunction.processBroadcastElement(ULEDataanomalyanalysis.java:220) at org.apache.flink.streaming.api.operators.co.CoBroadcastWithKeyedOperator.processElement2(CoBroadcastWithKeyedOperator.java:133) at org.apache.flink.streaming.runtime.io.StreamTwoInputProcessorFactory.processRecord2(StreamTwoInputProcessorFactory.java:221) at org.apache.flink.streaming.runtime.io.StreamTwoInputProcessorFactory.lambda$create$1(StreamTwoInputProcessorFactory.java:190) at org.apache.flink.streaming.runtime.io.StreamTwoInputProcessorFactory$StreamTaskNetworkOutput.emitRecord(StreamTwoInputProcessorFactory.java:291) at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:134) at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:105) at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:66) at org.apache.flink.streaming.runtime.io.StreamTwoInputProcessor.processInput(StreamTwoInputProcessor.java:98) at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:423) at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:204) at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:684) at org.apache.flink.streaming.runtime.tasks.StreamTask.executeInvoke(StreamTask.java:639) at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:650) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:623) at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:779) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566) at java.lang.Thread.run(Thread.java:745)
最新发布
08-02
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值