Flink: Counting Total Site Page Views (PV) Within a Time Range

Counting the site's PV for each hour with Flink

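import java.sql.Timestamp;
import java.util.Random;

import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// UserBehavior and PageViewCount are JavaBeans assumed to live in the same package.
public class Flink03_Practice_PageView_Window2 {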

    public static void main(String[] args) throws Exception {

        // 1. Get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. Read the text data
        DataStreamSource<String> readTextFile = env.readTextFile("input/UserBehavior.csv");

        // 3. Map to JavaBeans, keep only "pv" events, and extract timestamps to generate watermarks
        WatermarkStrategy<UserBehavior> userBehaviorWatermarkStrategy = WatermarkStrategy.<UserBehavior>forMonotonousTimestamps()
                .withTimestampAssigner(new SerializableTimestampAssigner<UserBehavior>() {
                    @Override
                    public long extractTimestamp(UserBehavior element, long recordTimestamp) {
                        return element.getTimestamp() * 1000L;
                    }
                });
        SingleOutputStreamOperator<UserBehavior> userBehaviorDS = readTextFile.map(data -> {
            String[] split = data.split(",");
            return new UserBehavior(Long.parseLong(split[0]),
                    Long.parseLong(split[1]),
                    Integer.parseInt(split[2]),
                    split[3],
                    Long.parseLong(split[4]));
        }).filter(data -> "pv".equals(data.getBehavior()))
                .assignTimestampsAndWatermarks(userBehaviorWatermarkStrategy);

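        // 4. Convert each record to a tuple; the random suffix spreads records
        //    over 8 keys so that a single key does not receive all the data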
        KeyedStream<Tuple2<String, Integer>, String> keyedStream = userBehaviorDS.map(new MapFunction<UserBehavior, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(UserBehavior value) throws Exception {
                return new Tuple2<>("PV_" + new Random().nextInt(8), 1);
            }
        }).keyBy(data -> data.f0);

        // 5. Open one-hour tumbling event-time windows and count per key
        SingleOutputStreamOperator<PageViewCount> aggResult = keyedStream.window(TumblingEventTimeWindows.of(Time.hours(1)))
                .aggregate(new PageViewAggFunc(), new PageViewWindowFunc());

        // 6. Re-key by window end time for the second aggregation
        KeyedStream<PageViewCount, String> pageViewCountKeyedStream = aggResult.keyBy(PageViewCount::getTime);

        SingleOutputStreamOperator<PageViewCount> count = pageViewCountKeyedStream.sum("count");

        // 7. Print the result and run the job
        count.print();
        env.execute();

    }

    public static class PageViewAggFunc implements AggregateFunction<Tuple2<String, Integer>, Integer, Integer> {
        @Override
        public Integer createAccumulator() {
            return 0;
        }

        @Override
        public Integer add(Tuple2<String, Integer> value, Integer accumulator) {
            return accumulator + 1;
        }

        @Override
        public Integer getResult(Integer accumulator) {
            return accumulator;
        }

        @Override
        public Integer merge(Integer a, Integer b) {
            return a + b;
        }
    }

    public static class PageViewWindowFunc implements WindowFunction<Integer, PageViewCount, String, TimeWindow> {

        @Override
        public void apply(String key, TimeWindow window, Iterable<Integer> input, Collector<PageViewCount> out) throws Exception {

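            // Use the window end time as the window identifier
            String timestamp = new Timestamp(window.getEnd()).toString();

            // The aggregate function hands over exactly one pre-aggregated count
            Integer count = input.iterator().next();

            // Emit the partial result for this key and window
            out.collect(new PageViewCount("PV", timestamp, count));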
        }
    }
}
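
The listing above uses two JavaBeans, UserBehavior and PageViewCount, that the article does not show. Below is a minimal sketch of what they might look like, inferred only from how the job calls them. The field names userId, itemId, categoryId and pv are assumptions, as is the wrapper class PageViewBeans; only getBehavior(), getTimestamp(), getTime(), getCount() and the field name "count" appear in the original code. The values in main are made up for illustration.

```java
// Hypothetical sketch of the beans used by the job; not the author's actual classes.
public class PageViewBeans {

    /** One parsed line of UserBehavior.csv. */
    public static class UserBehavior {
        private Long userId;        // assumed name
        private Long itemId;        // assumed name
        private Integer categoryId; // assumed name
        private String behavior;    // e.g. "pv"
        private Long timestamp;     // epoch seconds; the job multiplies by 1000

        public UserBehavior() {}

        public UserBehavior(Long userId, Long itemId, Integer categoryId, String behavior, Long timestamp) {
            this.userId = userId;
            this.itemId = itemId;
            this.categoryId = categoryId;
            this.behavior = behavior;
            this.timestamp = timestamp;
        }

        public Long getUserId() { return userId; }
        public void setUserId(Long userId) { this.userId = userId; }
        public Long getItemId() { return itemId; }
        public void setItemId(Long itemId) { this.itemId = itemId; }
        public Integer getCategoryId() { return categoryId; }
        public void setCategoryId(Integer categoryId) { this.categoryId = categoryId; }
        public String getBehavior() { return behavior; }
        public void setBehavior(String behavior) { this.behavior = behavior; }
        public Long getTimestamp() { return timestamp; }
        public void setTimestamp(Long timestamp) { this.timestamp = timestamp; }
    }

    /** A (label, window end time, count) result record. */
    public static class PageViewCount {
        private String pv;     // label, assumed name
        private String time;   // window end time rendered as text
        private Integer count; // referenced by name in sum("count")

        public PageViewCount() {}

        public PageViewCount(String pv, String time, Integer count) {
            this.pv = pv;
            this.time = time;
            this.count = count;
        }

        public String getPv() { return pv; }
        public void setPv(String pv) { this.pv = pv; }
        public String getTime() { return time; }
        public void setTime(String time) { this.time = time; }
        public Integer getCount() { return count; }
        public void setCount(Integer count) { this.count = count; }

        @Override
        public String toString() {
            return "PageViewCount(" + pv + ", " + time + ", " + count + ")";
        }
    }

    public static void main(String[] args) {
        // Made-up sample values, only to show the accessors the job relies on.
        UserBehavior ub = new UserBehavior(1L, 2L, 3, "pv", 1511658000L);
        PageViewCount pvc = new PageViewCount("PV", "2017-11-26 10:00:00.0", 1);
        System.out.println(ub.getBehavior() + " @ " + (ub.getTimestamp() * 1000L) + " ms");
        System.out.println(pvc);
    }
}
```

The public no-argument constructor plus a getter and setter for every field follow Flink's POJO rules, which is what lets sum("count") address the field by name.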

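Result:

After keyBy, sum is a rolling aggregation: it emits an updated running total for every partial count it receives, so each window prints several intermediate values rather than one final count.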
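
The difference between a rolling sum and a single final total can be sketched in plain Java, without Flink. The class RollingSumDemo and the partial counts 3, 5 and 2 are made up for illustration; they stand in for the per-key window counts that reach the second aggregation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative simulation only; it mimics the output pattern, not Flink's runtime.
public class RollingSumDemo {

    // Rolling aggregation: every incoming partial count produces an updated total.
    static List<Integer> rollingSum(List<Integer> partials) {
        List<Integer> emitted = new ArrayList<>();
        int total = 0;
        for (int p : partials) {
            total += p;
            emitted.add(total);
        }
        return emitted;
    }

    // Buffer-then-emit: collect all partial counts, emit one total at the end.
    static int finalSum(List<Integer> partials) {
        int total = 0;
        for (int p : partials) {
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> partials = Arrays.asList(3, 5, 2); // made-up partial counts
        System.out.println(rollingSum(partials)); // prints [3, 8, 10]
        System.out.println(finalSum(partials));   // prints 10
    }
}
```

Only the last value of the rolling output is the true hourly PV; the earlier ones are intermediate states.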

To emit a single total per window, modify the code as follows:

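import java.sql.Timestamp;
import java.text.SimpleDateFormat;
import java.util.Iterator;
import java.util.Random;

import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// UserBehavior and PageViewCount are JavaBeans assumed to live in the same package.
public class Flink03_Practice_PageView_Window2 {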

    public static void main(String[] args) throws Exception {

        // 1. Get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. Read the text data
        DataStreamSource<String> readTextFile = env.readTextFile("input/UserBehavior.csv");

        // 3. Map to JavaBeans, keep only "pv" events, and extract timestamps to generate watermarks
        WatermarkStrategy<UserBehavior> userBehaviorWatermarkStrategy = WatermarkStrategy.<UserBehavior>forMonotonousTimestamps()
                .withTimestampAssigner(new SerializableTimestampAssigner<UserBehavior>() {
                    @Override
                    public long extractTimestamp(UserBehavior element, long recordTimestamp) {
                        return element.getTimestamp() * 1000L;
                    }
                });
        SingleOutputStreamOperator<UserBehavior> userBehaviorDS = readTextFile.map(data -> {
            String[] split = data.split(",");
            return new UserBehavior(Long.parseLong(split[0]),
                    Long.parseLong(split[1]),
                    Integer.parseInt(split[2]),
                    split[3],
                    Long.parseLong(split[4]));
        }).filter(data -> "pv".equals(data.getBehavior()))
                .assignTimestampsAndWatermarks(userBehaviorWatermarkStrategy);

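        // 4. Convert each record to a tuple; the random suffix spreads records
        //    over 8 keys so that a single key does not receive all the data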
        KeyedStream<Tuple2<String, Integer>, String> keyedStream = userBehaviorDS.map(new MapFunction<UserBehavior, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(UserBehavior value) throws Exception {
                return new Tuple2<>("PV_" + new Random().nextInt(8), 1);
            }
        }).keyBy(data -> data.f0);

        // 5. Open one-hour tumbling event-time windows and count per key
        SingleOutputStreamOperator<PageViewCount> aggResult = keyedStream.window(TumblingEventTimeWindows.of(Time.hours(1)))
                .aggregate(new PageViewAggFunc(), new PageViewWindowFunc());

        // 6. Re-key by window end time for the second aggregation
        KeyedStream<PageViewCount, String> pageViewCountKeyedStream = aggResult.keyBy(PageViewCount::getTime);

        // 7. Accumulate the partial counts and emit one total per window
        SingleOutputStreamOperator<PageViewCount> result = pageViewCountKeyedStream.process(new PageViewProcessFunc());

        // 8. Print the result and run the job
        result.print();
        env.execute();

    }

    public static class PageViewAggFunc implements AggregateFunction<Tuple2<String, Integer>, Integer, Integer> {
        @Override
        public Integer createAccumulator() {
            return 0;
        }

        @Override
        public Integer add(Tuple2<String, Integer> value, Integer accumulator) {
            return accumulator + 1;
        }

        @Override
        public Integer getResult(Integer accumulator) {
            return accumulator;
        }

        @Override
        public Integer merge(Integer a, Integer b) {
            return a + b;
        }
    }

    public static class PageViewWindowFunc implements WindowFunction<Integer, PageViewCount, String, TimeWindow> {

        @Override
        public void apply(String key, TimeWindow window, Iterable<Integer> input, Collector<PageViewCount> out) throws Exception {

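            // Use the window end time as the window identifier
            String timestamp = new Timestamp(window.getEnd()).toString();

            // The aggregate function hands over exactly one pre-aggregated count
            Integer count = input.iterator().next();

            // Emit the partial result for this key and window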
            out.collect(new PageViewCount("PV", timestamp, count));
        }
    }

    public static class PageViewProcessFunc extends KeyedProcessFunction<String, PageViewCount, PageViewCount> {

        // List state buffering the partial counts for the current key (window end time)
        private ListState<PageViewCount> listState;

        @Override
        public void open(Configuration parameters) throws Exception {
            listState = getRuntimeContext().getListState(new ListStateDescriptor<PageViewCount>("list-state", PageViewCount.class));
        }

        @Override
        public void processElement(PageViewCount value, Context ctx, Collector<PageViewCount> out) throws Exception {
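            // Buffer the partial count in state
            listState.add(value);

            // Register an event-time timer 1 ms after the window end; registering the
            // same timestamp again for the same key does not create a second timer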
            String time = value.getTime();
            long ts = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(time).getTime();
            ctx.timerService().registerEventTimeTimer(ts + 1);
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<PageViewCount> out) throws Exception {
            // Read the buffered partial counts from state
            Iterable<PageViewCount> pageViewCounts = listState.get();

            // Add up the partial counts
            Integer count = 0;
            Iterator<PageViewCount> iterator = pageViewCounts.iterator();
            while (iterator.hasNext()) {
                PageViewCount next = iterator.next();
                count += next.getCount();
            }

            // Emit the total, restoring the window end time (timer time minus 1 ms)
            out.collect(new PageViewCount("PV", new Timestamp(timestamp - 1).toString(), count));

            // Clear the state for this window
            listState.clear();
        }
    }
}
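
The two-phase pattern used in the listing above (count per salted key and window, then add up per window) can be simulated in plain Java to check that the random key suffix does not change the final totals. This is a sketch under stated assumptions: the class TwoPhasePvDemo, the "@" key separator and the sample timestamps are made up, and the window alignment shown holds for non-negative timestamps with zero window offset.

```java
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Illustrative simulation only; it does not use Flink.
public class TwoPhasePvDemo {

    static final long HOUR_MS = 3_600_000L;

    // Exclusive end of the one-hour tumbling window that contains tsMs.
    static long windowEnd(long tsMs) {
        return tsMs - (tsMs % HOUR_MS) + HOUR_MS;
    }

    // Phase 1: count events per (salted key, window end).
    static Map<String, Integer> phaseOne(long[] timestampsMs, Random random) {
        Map<String, Integer> partial = new TreeMap<>();
        for (long ts : timestampsMs) {
            String key = "PV_" + random.nextInt(8) + "@" + windowEnd(ts);
            partial.merge(key, 1, Integer::sum);
        }
        return partial;
    }

    // Phase 2: drop the salt and add the partial counts per window end.
    static Map<Long, Integer> phaseTwo(Map<String, Integer> partial) {
        Map<Long, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> e : partial.entrySet()) {
            String key = e.getKey();
            long end = Long.parseLong(key.substring(key.indexOf('@') + 1));
            totals.merge(end, e.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up event times: three in the first hour, two in the second.
        long[] events = {0L, 1_000L, 3_599_999L, 3_600_000L, 3_600_001L};
        Map<String, Integer> partial = phaseOne(events, new Random());
        System.out.println(partial);           // varies with the random salt
        System.out.println(phaseTwo(partial)); // prints {3600000=3, 7200000=2}
    }
}
```

The first phase can run in parallel across the salted keys; the second phase only has to combine at most eight small partial counts per window.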

Result: each window end time now prints a single PageViewCount carrying the total.

 

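### Definitions of PV and UV

PV (Page View) is the number of page views: every time a user opens or refreshes a page, one PV is recorded. Even if the same user visits the same page several times in the same period, every visit counts toward PV[^3].

UV (Unique Visitor) is the number of distinct visitors, identified by the user's cookie. Within a given time range (usually one day), a user counts as one UV no matter how many times they visit the site.

### Calculation methods

#### Using Redis to count PV and UV

To process large data streams efficiently and update PV and UV in real time, an in-memory store such as Redis can serve as an intermediate cache whose contents are periodically written to the primary database. The following Java example shows how the Redis keys for these statistics might be defined:

```java
public abstract class WebsiteRedisKeys {
    // Redis key prefix for the daily PV counter
    public static final String DAILY_PREFIX_KEY_PV = "daily_website:pv";

    // Redis key prefix for the daily UV set (a Set is used for deduplication)
    public static final String DAILY_PREFIX_KEY_UV_SET = "daily_website:uv_set";

    // Redis key prefix for the daily IP address set (also a Set, so the same IP is stored once)
    public static final String DAILY_PREFIX_KEY_IPS_SET = "daily_website:ips_set";
}
```

`DAILY_PREFIX_KEY_UV_SET` is the important one: it maintains the set of identifiers of all unique users for the day, so the final UV is simply the size of that set[^4].

Another approach, in a big-data setting, is to compute PV/UV in a distributed fashion with the Apache Flink framework. A simple model design is given below for reference:

```java
@NoArgsConstructor
@AllArgsConstructor
@Data
@ToString
public class UserActionRecord {
    private String date;
    private Long userId;
    private Integer actionType; // assume 1 means a click
}

// Construction of the source is omitted...
DataStream<UserActionRecord> sourceStream = ... ;

sourceStream
    .keyBy(r -> r.getDate() + "_" + r.getUserId())
    .process(new KeyedProcessFunction<String, UserActionRecord, Tuple2<String, Integer>>() {
        private ValueState<Integer> pvCounter;
        private MapState<Long, Boolean> uvCollector;

        @Override
        public void open(Configuration parameters) {
            // The descriptor names are illustrative
            pvCounter = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("pv-counter", Integer.class));
            uvCollector = getRuntimeContext().getMapState(
                    new MapStateDescriptor<>("uv-collector", Long.class, Boolean.class));
        }

        @Override
        public void processElement(UserActionRecord value, Context ctx,
                                   Collector<Tuple2<String, Integer>> out) throws Exception {
            // Accumulate PV for the current key
            int currentPv = pvCounter.value() == null ? 0 : pvCounter.value();
            pvCounter.update(currentPv + value.getActionType());

            // Emit only the first time this user is seen for the key
            if (!uvCollector.contains(value.getUserId())) {
                uvCollector.put(value.getUserId(), true);
                out.collect(Tuple2.of(value.getDate(), pvCounter.value()));
            }
        }
    });
```

This pseudocode shows how Flink's state management can implement PV accumulation and UV determination for each combination of date and user ID[^5].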
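
To make the PV/UV distinction concrete, here is a minimal plain-Java sketch that computes both from a list of visits. The class PvUvDemo and the visitor IDs are made up for illustration; a `HashSet` plays the role that a Redis Set or Flink state would play in a real deployment.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: one list entry per page view, identified by visitor ID.
public class PvUvDemo {

    // PV counts every visit, including repeats by the same visitor.
    static int pv(List<String> visitorIds) {
        return visitorIds.size();
    }

    // UV counts each distinct visitor once; the set removes duplicates.
    static int uv(List<String> visitorIds) {
        Set<String> unique = new HashSet<>(visitorIds);
        return unique.size();
    }

    public static void main(String[] args) {
        List<String> visits = Arrays.asList("alice", "bob", "alice", "alice"); // made-up IDs
        System.out.println("PV = " + pv(visits)); // prints PV = 4
        System.out.println("UV = " + uv(visits)); // prints UV = 2
    }
}
```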