前言
Flink SQL窗口需要有时间,时间是为窗口服务的。
Flink建表语句
CREATE TABLE source (
id VARCHAR,
price BIGINT,
--获取系统时间
event_time AS CAST(CURRENT_TIMESTAMP AS TIMESTAMP(3)),
--设置watermark
watermark FOR event_time AS event_time - INTERVAL '0' SECOND
) WITH (
--with 关键字,定义连接器相关信息
'connector' = 'datagen',
'rows-per-second' = '5',
'fields.id.length' = '2',
'fields.price.min' = '1',
'fields.price.max' = '100'
);
- 语法有两种
- Group Window Aggregation
- Windowing TVF
- 如果你的业务需求比较简单,只需要进行基本的窗口聚合操作,且希望使用传统的 SQL 语法,那么 Group Window Aggregation 是更合适的选择。
- 如果你的业务需求比较复杂,需要在窗口操作之后进行更复杂的查询和处理,或者需要同时处理多个窗口,那么 Windowing TVF 会更加灵活和强大。
1.滚动窗口(TUMBLE)
滚动窗口:将每条数据指定给具体的窗口。滚动窗口窗口大小固定,且数据没有重复计算和丢失。例如,指定一个大小为 5 秒钟的滚动窗口。在这种情况下,Flink 将每隔 5 秒钟开启一个新的窗口,其中每一条数都会划分到唯一的一个 5 秒钟的窗口中,如下图所示。
语法1:tumble(事件时间列,窗口大小) Group Window Aggregation
select
id,
count(*) as counts,
sum(price) as sum_price,
UNIX_TIMESTAMP(CAST(tumble_start(event_time, interval '5' second ) as STRING))*1000 as window_start,
UNIX_TIMESTAMP(CAST(tumble_end(event_time, interval '5' second ) as STRING))*1000 as window_end
from source
group by
id,
tumble(event_time, interval '5' second );
语法2:table(TUMBLE(table 表名,DESCRIPTOR(事件时间列),窗口大小) Windowing TVF
select
id,
count(*) as counts,
sum(price) as sum_price,
UNIX_TIMESTAMP(CAST (window_start as STRING))*1000 as window_start,
UNIX_TIMESTAMP(CAST (window_end as STRING))*1000 as window_end
from TABLE
(
TUMBLE(
TABLE source,
DESCRIPTOR(event_time),
INTERVAL '5' SECOND
)
)
group by
id,
window_start,
window_end;
- ⭐ 应用场景:例如,电商平台需要实时统计每 15 分钟内的订单数量、订单总金额等。
2滑动窗口(HOP)
滑动窗口:也是将数据指定给固定长度的窗口。与滚动窗口功能一样,也有窗口大小的概念。区别在于,滑动窗口有另一个参数控制窗口计算的频率(滑动窗口滑动的步长)。因此,如果滑动的步长小于窗口大小,则滑动窗口之间每个窗口是可以重叠。在这种情况下,一条数据就会分配到多个窗口当中。举例,有 5秒钟大小的窗口,滑动步长为 2 秒钟。这样,每 2秒钟会产生一个窗口,这个窗口包含的数据是过去 5秒钟的数据,如下图所示。
语法1:hop(事件时间列,窗口滑动步长,窗口大小)
select
id,
count(*) as counts,
sum(price) as sum_price,
UNIX_TIMESTAMP(CAST(hop_start(event_time, interval '2' second,interval '5' second ) as STRING))*1000 as window_start,
UNIX_TIMESTAMP(CAST(hop_end(event_time, interval '2' second,interval '5' second ) as STRING))*1000 as window_end
from source
group by
id,
hop(event_time, interval '2' second,interval '5' second );
语法2:table(hop(table 表名,DESCRIPTOR(事件时间列),窗口滑动步长,窗口大小)
select
id,
count(*) as counts,
sum(price) as sum_price,
UNIX_TIMESTAMP(CAST(window_start as STRING))*1000 as window_start,
UNIX_TIMESTAMP(CAST(window_end as STRING))*1000 as window_end
from table(hop(table source,DESCRIPTOR(event_time),interval '2' second,interval '5' second))
group by
id,
window_start,
window_end;
- ⭐ 应用场景:例如,设置一个 1 小时的滑动窗口,每 10 分钟滑动一次,分析商品在不同时间段的销售情况,帮助商家及时调整营销策略,如调整商品价格、进行促销活动等。
3.会话窗口(SESSION)
Session窗口 :和滚动、滑动窗口不一样,其没有固定的持续时间,窗口大小不固定,如果在定义的间隔期(Session Gap
)内没有新的数据出现,则产生新窗口。
语法:session(事件时间列,窗口时间 ) 暂不支持Window TVF
select
id,
count(*) as counts,
sum(price) as sum_price,
UNIX_TIMESTAMP(CAST(session_start(event_time, interval '2' second,interval '5' second ) as STRING))*1000 as window_start,
UNIX_TIMESTAMP(CAST(session_end(event_time, interval '2' second,interval '5' second ) as STRING))*1000 as window_end
from source
group by
id,
session(event_time,interval '5' second );
⭐ 应用场景:例如,设定会话间隙为 1 小时,当用户在看到广告后的 1 小时内进行了点击和购买操作,这些行为会被归为同一个会话。
4.渐进式窗口(CUMULATE)
渐进式窗口:属于累积窗口。与滚动窗口(TUMBLE)、滑动窗口(HOP)不同,累积窗口会在窗口不断向前滑动的过程中,持续累积之前窗口的数据。也就是说,累积窗口会随着时间的推移,将新到来的数据添加到之前已经处理过的数据集合中进行聚合计算,其窗口的结束时间会不断向后推移,而起始时间保持不变,直到达到最大窗口大小。
语法1: CUMULATE(事件时间列,时间间隔, 窗口大小);
select
id,
count(*) as counts,
sum(price) as sum_price,
window_start,
window_end
from source
group by
id,
CUMULATE(event_time, INTERVAL '1' HOUR, INTERVAL '24' HOUR);
window_start,
window_end;
语法2:table (cumulate(table 表名,DESCRIPTOR(事件时间列),窗口滑动步长,窗口最大大小 ))
select
id,
count(*) as counts,
sum(price) as sum_price,
window_start,
window_end
from table (cumulate(table source,DESCRIPTOR(event_time),interval '5' second ,interval '30' second ))
group by
id,
window_start,
window_end;
⭐ 应用场景:在电商或零售行业,需要实时统计某商品最近24小时的累计销售额、销售量等信息。通过 CUMULATE
窗口,可以不断累积每小时销售数据,从而得到从销售开始到当前时刻的累计销售情况。
5.聚合窗口(OVER)
OVER 窗口:是一种用于在结果集中对数据进行分组和排序,并在每个分组内进行计算的窗口类型。与常规的时间窗口(如滚动窗口、滑动窗口)不同,OVER 窗口不基于时间边界来划分数据,而是基于排序后的行进行计算,通常用于计算每行数据相对于其前后行的聚合值,例如计算移动平均值、累计和等。
语法:sum/max/min/count/avg(聚合字段) /row_number()/rank() (PARTITION by 分组字段 order by 排序字段 range/rows BETWEEN and )
SELECT
id,
price,
event_time,
SUM(price) OVER (PARTITION BY id ORDER BY event_time RANGE BETWEEN INTERVAL '3' SECOND PRECEDING AND CURRENT ROW) AS range_1,
COUNT(price) OVER (PARTITION BY id ORDER BY event_time RANGE BETWEEN INTERVAL '3' SECOND PRECEDING AND CURRENT ROW) AS range_2
FROM source;
SELECT
id,
price,
event_time,
SUM(price) OVER (PARTITION BY id ORDER BY event_time ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) AS rows_1,
COUNT(price) OVER (PARTITION BY id ORDER BY event_time ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) AS rows_2
FROM source;
⭐ 应用场景:移动平均值计算,累计和计算,排名计算