I. Background
The requirement comes from a real-time metrics scenario: computing, in real time, the number of paired orders over the past hour.
A paired order (in the ride-hailing context) is defined as an order that a driver accepts after the user places it, with no subsequent cancellation.
Computationally, this definition requires two levels of aggregation: the first aggregates orders within the past one-hour window and tags the orders that are later cancelled; the second, built on top of the first, excludes the cancelled orders and counts the paired ones. The requirement can be abstracted one level further:
any scenario that needs a cascading GROUP BY over the past N hours of windowed data fits this pattern.
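The two-level aggregation described above can be sketched in Flink SQL roughly as follows. This is only a minimal illustration of the pattern, not the author's production code: the table name `orders` and the columns `order_id`, `order_status`, `event_time` are assumptions for this example.

```sql
-- Level 1: within each one-hour window, collapse each order's events into one
-- row and tag whether the order was eventually cancelled.
CREATE VIEW tagged_orders AS
SELECT
    order_id,
    window_start,
    window_end,
    window_time AS rowtime,  -- keep window_time so a time attribute propagates
    MAX(CASE WHEN order_status = 'CANCELLED' THEN 1 ELSE 0 END) AS is_cancelled
FROM TABLE(
    TUMBLE(TABLE orders, DESCRIPTOR(event_time), INTERVAL '1' HOURS))
GROUP BY order_id, window_start, window_end, window_time;

-- Level 2: on top of level 1, drop the cancelled orders and count the rest.
SELECT
    window_start,
    window_end,
    COUNT(order_id) AS paired_order_cnt
FROM TABLE(
    TUMBLE(TABLE tagged_orders, DESCRIPTOR(rowtime), INTERVAL '1' HOURS))
WHERE is_cancelled = 0
GROUP BY window_start, window_end;
```

The key point, expanded on below, is that `window_time` must be part of the first GROUP BY so that the second window TVF still has a time attribute (`rowtime`) to window on.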
II. Development
1. The official documentation's explanation of cascading windows
Cascading Window Aggregation
The window_start and window_end columns are regular timestamp columns, not time attributes. Thus they can’t be used as time attributes in subsequent time-based operations. In order to propagate time attributes, you need to additionally add window_time column into GROUP BY clause. The window_time is the third column produced by Windowing TVFs which is a time attribute of the assigned window. Adding window_time into GROUP BY clause makes window_time also to be group key that can be selected. Then following queries can use this column for subsequent time-based operations, such as cascading window aggregations and Window TopN.
The following shows a cascading window aggregation where the first window aggregation propagates the time attribute for the second window aggregation.
-- tumbling 5 minutes for each supplier_id
CREATE VIEW window1 AS
-- Note: The window start and window end fields of inner Window TVF are optional in the select clause. However, if they appear in the clause, they need to be aliased to prevent name conflicting with the window start and window end of the outer Window TVF.
SELECT window_start as window_5mintumble_start, window_end as window_5mintumble_end, window_time as rowtime, SUM(price) as partial_price
FROM TABLE(
TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '5' MINUTES))
GROUP BY supplier_id, window_start, window_end, window_time;
-- tumbling 10 minutes on the first window
SELECT window_start, window_end, SUM(partial_price) as total_price
FROM TABLE(
    TUMBLE(TABLE window1, DESCRIPTOR(rowtime), INTERVAL '10' MINUTES))
GROUP BY window_start, window_end;

This article describes how to use Apache Flink's cascading window operations to compute the paired-order count in real time. It first presents the background and the business definition, then walks through a cascading-window example in Flink SQL that aggregates a 1-minute sliding window into a 10-minute tumbling window. It also discusses why time attributes matter in window operations, and provides a complete example for Flink 1.13.2 with Kafka as the source table and HBase as the sink table, closing with notes on time-semantics handling and the HBase connector.
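Since the full implementation is not reproduced here, the following is a rough sketch of what such a Kafka-to-HBase pipeline could look like in Flink 1.13.x SQL. Every table name, field, topic, and connector option below is an assumption for illustration, not the author's original code.

```sql
-- Assumed Kafka source table; the watermark makes event_time a time attribute.
CREATE TABLE order_events (
    order_id     STRING,
    order_status STRING,
    event_time   TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'order_events',
    'properties.bootstrap.servers' = 'localhost:9092',
    'scan.startup.mode' = 'earliest-offset',
    'format' = 'json'
);

-- Assumed HBase sink; non-key columns must sit inside a row-family ROW type.
CREATE TABLE paired_order_sink (
    rowkey STRING,
    cf ROW<paired_cnt BIGINT>,
    PRIMARY KEY (rowkey) NOT ENFORCED
) WITH (
    'connector' = 'hbase-2.2',
    'table-name' = 'paired_orders',
    'zookeeper.quorum' = 'localhost:2181'
);

-- Level 1: sliding (HOP) window with a 1-minute slide over an assumed 1-hour
-- size; keep window_time so the result still carries a time attribute.
CREATE VIEW hourly_tagged AS
SELECT
    order_id,
    window_time AS rowtime,
    MAX(CASE WHEN order_status = 'CANCELLED' THEN 1 ELSE 0 END) AS is_cancelled
FROM TABLE(
    HOP(TABLE order_events, DESCRIPTOR(event_time),
        INTERVAL '1' MINUTES, INTERVAL '1' HOURS))
GROUP BY order_id, window_start, window_end, window_time;

-- Level 2: 10-minute tumbling window on the first result, cancelled orders
-- removed, written out to HBase keyed by the window start time.
INSERT INTO paired_order_sink
SELECT
    DATE_FORMAT(window_start, 'yyyyMMddHHmm') AS rowkey,
    ROW(COUNT(order_id)) AS cf
FROM TABLE(
    TUMBLE(TABLE hourly_tagged, DESCRIPTOR(rowtime), INTERVAL '10' MINUTES))
WHERE is_cancelled = 0
GROUP BY window_start, window_end;
```

Note the design choice carried through from the official example: each level's GROUP BY includes `window_time`, which is what allows the TUMBLE in level 2 to treat the view's `rowtime` column as a time attribute.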