需求解读
文章名听着可能比较难理解嗷,先详细说下具体需求是啥样的。想计算在指定时间段内的有效会员,会员数据可能有月卡、年卡等,指定时间可能是一天或者一个时间范围。所有会员时间范围与指定时间范围有交集的会员订单数据,计算出在指定时间段内每月的有效天数。
说的可能比较抽象,具体看数据。
已知数据
有用户购买会员的会员明细数据如下:
用户id | 会员开始时间 | 会员结束时间 |
---|---|---|
A | 2021-01-01 | 2021-04-01 |
A | 2021-06-01 | 2021-09-01 |
B | 2021-03-01 | 2021-04-01 |
C | 2021-04-01 | 2021-05-01 |
C | 2021-06-01 | 2021-07-01 |
期望结果
现在想计算出指定时间段内是有效会员的所有会员在指定时间段内的每个月的有效天数,期望数据如下:
例如:想计算2021-02-20 到 20221-06-15范围内所有有效会员,在这段时间内的每月有效天数。
用户id | 指定时间范围月份 | 在该月份有效天数 |
---|---|---|
A | 2021-02 | 9 |
A | 2021-03 | 31 |
A | 2021-04 | 1 |
A | 2021-06 | 15 |
B | 2021-03 | 31 |
B | 2021-04 | 1 |
C | 2021-04 | 30 |
C | 2021-05 | 1 |
C | 2021-06 | 15 |
上述结果数据可以看出,A用户会员购买记录有两单,与指定时间(2021-02-20 到 20221-06-15)都有交集,则这两单在指定时间范围都是有效会员。第一单是21年1月1号到4月1号,与指定时间交集范围是2021-02-20 到 2021-04-01。则需计算2021-02-20 到 2021-04-01这段时间区间的每月天数,区间是全闭区间。
基本思路
到这可以思考下大概需要怎么做?
1.首先注意是指定范围内的有效会员。第一步是要筛选出哪些会员记录是有效(即会员时间范围与指定时间范围有交集),并且要计算出交集的开始结束时间。
2.第一步完成后,得到所有有效会员记录,有效开始结束时间,此时要对每一条记录(注意这里是要对每一条记录,而不是每个用户,因为一个用户可能有多条购买会员记录。做一行转多行的拆分有效时间范围月份操作)。
3.第二步完成后,此时应该得到每一条有效记录在指定时间范围的每个月的数据。此时需要判断有效开始时间是否在拆分后的月份内。按照真实开始、结束时间计算个月内的有效天数。
数据实践
按照上述步骤,写sql实践。相对来说,麻烦点其实在第二步,需要将交集时间拆分到每个月。让俺们一步一步看:
首先准备数据:
with table_name as ( -- 测试数据
select 'A' as username,'2021-01-01' as start_time, '2021-04-01' as end_time
union all
select 'A' as username,'2021-06-01' as start_time, '2021-09-01' as end_time
union all
select 'B' as username,'2021-03-01' as start_time, '2021-04-01' as end_time
union all
select 'C' as username,'2021-04-01' as start_time, '2021-05-01' as end_time
union all
select 'C' as username,'2021-06-01' as start_time, '2021-07-01' as end_time
)
- 筛选有效会员记录(即计算会员范围和指定范围的交集)
时间段1:[A,B] ;时间段2:[C,D]。
交集计算:A <= D and B >= C
真实交集时间段:[A <= C则取C否则取A , B >= D则取D否则取B]
irank:注意这里多了一个ROW_NUMBER()开窗,目的是为了会员记录的唯一性,后续用这个唯一id和用户一起分组拆分有效月。因为如果只按照用户分区,会有一个用户购买多次会员的情况,导致拆分月份混乱。
select
username,
start_time as vip_start_time,
end_time as vip_end_time,
case when start_time <= '2021-02-20' then '2021-02-20' else start_time end as real_start_time,
case when end_time >= '2021-06-15' then '2021-06-15' else end_time end as real_end_time
,ROW_NUMBER() OVER(PARTITION BY 1) AS irank
from table_name
where start_time <= '2021-06-15' and end_time >= '2021-02-20'
第一步执行结果:
- 第二步需要拆分真实交集时间范围到每个月上,想到可以使用EXPLODE()方法将月份拆开。但是现在只是一个真实开始日期,真实结束日期。要怎么拆呢?而且拆完后还要计算出月份第增的效果。例如第一步结果中第一条有效开始日期,结束日期是 2021-02-20 到 2021-04-01。拆完之后需要2月、3月、4月三条数据。
此时想到其实可以利用repeat()方法,重复交集有效时间段月份数的字段串"1 "(“1” 和 " ")即可。配合trim()将最后的空格去除。之所以是"1"是因为后续可以利用sum() over开窗这个转成数字的1累加计算月份递增操作。见sql:
with table_name as ( -- 测试数据
select 'A' as username,'2021-01-01' as start_time, '2021-04-01' as end_time
union all
select 'A' as username,'2021-06-01' as start_time, '2021-09-01' as end_time
union all
select 'B' as username,'2021-03-01' as start_time, '2021-04-01' as end_time
union all
select 'C' as username,'2021-04-01' as start_time, '2021-05-01' as end_time
union all
select 'C' as username,'2021-06-01' as start_time, '2021-07-01' as end_time
)
select
username,
vip_start_time,
vip_end_time,
real_start_time,
real_end_time,
irank,
trim(repeat('1 ',DATEDIFF(to_date(real_end_time,'yyyy-mm-dd'), to_date(real_start_time,'yyyy-mm-dd'),'mm') + 1)) as str
from
(select
username,
start_time as vip_start_time,
end_time as vip_end_time,
case when start_time <= '2021-02-20' then '2021-02-20' else start_time end as real_start_time,
case when end_time >= '2021-06-15' then '2021-06-15' else end_time end as real_end_time,
ROW_NUMBER() OVER(PARTITION BY 1) AS irank
from table_name
where start_time <= '2021-06-15' and end_time >= '2021-02-20'
) t
此处的:
trim(repeat('1 ',DATEDIFF(to_date(real_end_time,'yyyy-mm-dd'), to_date(real_start_time,'yyyy-mm-dd'),'mm') + 1)) as str
将真实开始日期与真实结束日期月份差+1后 repeat('1 ')后再trim。可以得到每条数据后跟着真实日期月份差的"1 "数量(即上述步骤2的str字段)。可见该小步骤执行结果:
再将str字段EXPLODE()拆分后。sum() over每个拆分后的1,按照对应真实开始时间去加这个sum()over()开窗后的值,可以得到递增真实月份。sql如下:
with table_name as ( -- 测试数据
select 'A' as username,'2021-01-01' as start_time, '2021-04-01' as end_time
union all
select 'A' as username,'2021-06-01' as start_time, '2021-09-01' as end_time
union all
select 'B' as username,'2021-03-01' as start_time, '2021-04-01' as end_time
union all
select 'C' as username,'2021-04-01' as start_time, '2021-05-01' as end_time
union all
select 'C' as username,'2021-06-01' as start_time, '2021-07-01' as end_time
)
select
username,
vip_start_time,
vip_end_time,
real_start_time,
real_end_time,
dateadd(to_date(real_start_time,'yyyy-mm-dd'),sum(cast(trim(flag) as bigint)) over(partition by username,irank order by real_start_time)-1,'mm') as tmp_date
from(
select
username,
vip_start_time,
vip_end_time,
real_start_time,
real_end_time,
irank,
trim(repeat('1 ',DATEDIFF(to_date(real_end_time,'yyyy-mm-dd'), to_date(real_start_time,'yyyy-mm-dd'),'mm') + 1)) as str
from
(select
username,
start_time as vip_start_time,
end_time as vip_end_time,
case when start_time <= '2021-02-20' then '2021-02-20' else start_time end as real_start_time,
case when end_time >= '2021-06-15' then '2021-06-15' else end_time end as real_end_time,
ROW_NUMBER() OVER(PARTITION BY 1) AS irank
from table_name
where start_time <= '2021-06-15' and end_time >= '2021-02-20'
) t
) t1
LATERAL view EXPLODE(split( str,' ')) ABTable as flag
此处的:
dateadd(to_date(real_start_time,'yyyy-mm-dd'),sum(cast(trim(flag) as bigint)) over(partition by username,irank order by real_start_time)-1,'mm') as tmp_date
将str字段拆分后cast()成数字类型,sum() over() 按照用户名即irank保证数据唯一性,避免用户多单会员记录情况。注意此处计算真实开始时间加和月份数的时候 月份数需要做减1操作。因为同月加0。异邻月加1,依次类推,所以要做减1操作。
执行结果如下:
- 第三步需要判断real_start_time 或 real_end_time是否和tmp_date是相同月份,因为这关乎计算有效天数。例如可以看到第二步的结果第一条数据,real_start_time:2021-02-20 ;
real_end_time:2021-04-01;
tmp_date:2021-02-20 00:00:00
此时real_start_time和tmp_date同属2月份,可知道这条数据应该是用2月份的月末减去real_start_time+1天,从而得到二月份的有效天数。
可以看到第二步的结果第二条数据,
real_start_time:2021-02-20 ;
real_end_time:2021-04-01;
tmp_date:2021-03-20 00:00:00,
此时real_start_time、real_end_time和tmp_date都不在同一个月,可知道这条数据应该是3月份的整个月的天数,从而得到三月份的有效天数。
可以看到第二步的结果第三条数据,
real_start_time:2021-02-20 ;
real_end_time:2021-04-01;
tmp_date:2021-04-20 00:00:00,
此时real_end_time和tmp_date同属4月份,可知道这条数据应该是real_end_time减去4月初一加1的天数,从而得到四月份的有效天数。
所以真实开始时间、真实结束时间是否和拆分的月份同月很关键。可以标识出两个字段 real_start_time是否和tmp_date同月、real_end_time是否和tmp_date同月。sql如下:
with table_name as ( -- 测试数据
select 'A' as username,'2021-01-01' as start_time, '2021-04-01' as end_time
union all
select 'A' as username,'2021-06-01' as start_time, '2021-09-01' as end_time
union all
select 'B' as username,'2021-03-01' as start_time, '2021-04-01' as end_time
union all
select 'C' as username,'2021-04-01' as start_time, '2021-05-01' as end_time
union all
select 'C' as username,'2021-06-01' as start_time, '2021-07-01' as end_time
)
select
username,
vip_start_time,
vip_end_time,
real_start_time,
real_end_time,
substr(tmp_date,1,7) as per_month,
if(substr(real_start_time,1,7) = substr(tmp_date,1,7) ,1,0) as is_s_month,
if(substr(real_end_time,1,7) = substr(tmp_date,1,7) ,1,0) as is_e_month
from
(
select
username,
vip_start_time,
vip_end_time,
real_start_time,
real_end_time,
dateadd(to_date(real_start_time,'yyyy-mm-dd'),sum(cast(trim(flag) as bigint)) over(partition by username,irank order by real_start_time)-1,'mm') as tmp_date
from(
select
username,
vip_start_time,
vip_end_time,
real_start_time,
real_end_time,
irank,
trim(repeat('1 ',DATEDIFF(to_date(real_end_time,'yyyy-mm-dd'), to_date(real_start_time,'yyyy-mm-dd'),'mm') + 1)) as str
from
(select
username,
start_time as vip_start_time,
end_time as vip_end_time,
case when start_time <= '2021-02-20' then '2021-02-20' else start_time end as real_start_time,
case when end_time >= '2021-06-15' then '2021-06-15' else end_time end as real_end_time,
ROW_NUMBER() OVER(PARTITION BY 1) AS irank
from table_name
where start_time <= '2021-06-15' and end_time >= '2021-02-20'
) t
) t1
LATERAL view EXPLODE(split( str,' ')) ABTable as flag
) t2
该小步骤sql执行结果如下:
此时只需要利用is_s_month和 is_e_month判断怎么计算当月有效天数即可,逻辑如下:
- 当is_s_month 和 is_e_month 都是0,表示当天数据是整月,两边都不和拆分月份同月。直接计算tmp_date的整月天数即可。
- 当is_s_month 和 is_e_month 都是1,表示真实开始日期,和真实结束日期同月,只有一个月要计算有效天数,直接用real_end_date减去real_start_date+1天即可。
- 当is_s_month 和 is_e_month 有任意一个是1另一个是0,此时需要判断开始时间还是结束时间与拆分月份是同月。需要再写一个case when判断
a).真实开始日期和tmp_date同月(tmp_date月末减real_start_date+1天)
b)真实结束日期和tmp_date同月(real_end_date减去tmp_date月初减+1天)
sql如下:
case when is_s_month + is_e_month = 0 then
datediff(dateadd(dateadd(to_date(per_month,'yyyy-mm'),1,'mm'),-1,'dd'), to_date(per_month,'yyyy-mm'),'dd')+1
when is_s_month + is_e_month = 2 then
datediff(to_date(real_end_time,'yyyy-mm-dd'),to_date(real_start_time,'yyyy-mm-dd'),'dd')+1
else
case when is_s_month = 1
then datediff(dateadd(dateadd( to_date(concat(substr(real_start_time,1,7),'-01'),'yyyy-mm-dd'),1,'mm' ),-1,'dd'),to_date(real_start_time,'yyyy-mm-dd'),'dd')+1
else datediff( to_date(real_end_time,'yyyy-mm-dd'), to_date(concat(substr(real_end_time,1,7),'-01'),'yyyy-mm-dd'),'dd')+1
end
end as daily_month
完整代码
with table_name as ( -- 测试数据
select 'A' as username,'2021-01-01' as start_time, '2021-04-01' as end_time
union all
select 'A' as username,'2021-06-01' as start_time, '2021-09-01' as end_time
union all
select 'B' as username,'2021-03-01' as start_time, '2021-04-01' as end_time
union all
select 'C' as username,'2021-04-01' as start_time, '2021-05-01' as end_time
union all
select 'C' as username,'2021-06-01' as start_time, '2021-07-01' as end_time
)
select
username,
vip_start_time,
vip_end_time,
real_start_time,
real_end_time,
per_month,
is_s_month,
is_e_month,
case
when is_s_month + is_e_month = 0 then
datediff(dateadd(dateadd(to_date(per_month,'yyyy-mm'),1,'mm'),-1,'dd'), to_date(per_month,'yyyy-mm'),'dd')+1
when is_s_month + is_e_month = 2 then
datediff(to_date(real_end_time,'yyyy-mm-dd'),to_date(real_start_time,'yyyy-mm-dd'),'dd')+1
else
case when is_s_month = 1
then datediff(dateadd(dateadd( to_date(concat(substr(real_start_time,1,7),'-01'),'yyyy-mm-dd'),1,'mm' ),-1,'dd'),to_date(real_start_time,'yyyy-mm-dd'),'dd')+1
else datediff( to_date(real_end_time,'yyyy-mm-dd'), to_date(concat(substr(real_end_time,1,7),'-01'),'yyyy-mm-dd'),'dd')+1
end
end as daily_month
from (
select
username,
vip_start_time,
vip_end_time,
real_start_time,
real_end_time,
substr(tmp_date,1,7) as per_month,
if(substr(real_start_time,1,7) = substr(tmp_date,1,7) ,1,0) as is_s_month,
if(substr(real_end_time,1,7) = substr(tmp_date,1,7) ,1,0) as is_e_month
from
(
select
username,
vip_start_time,
vip_end_time,
real_start_time,
real_end_time,
dateadd(to_date(real_start_time,'yyyy-mm-dd'),sum(cast(trim(flag) as bigint)) over(partition by username,irank order by real_start_time)-1,'mm') as tmp_date
from(
select
username,
vip_start_time,
vip_end_time,
real_start_time,
real_end_time,
irank,
trim(repeat('1 ',DATEDIFF(to_date(real_end_time,'yyyy-mm-dd'), to_date(real_start_time,'yyyy-mm-dd'),'mm') + 1)) as str
from
(select
username,
start_time as vip_start_time,
end_time as vip_end_time,
case when start_time <= '2021-02-20' then '2021-02-20' else start_time end as real_start_time,
case when end_time >= '2021-06-15' then '2021-06-15' else end_time end as real_end_time,
ROW_NUMBER() OVER(PARTITION BY 1) AS irank
from table_name
where start_time <= '2021-06-15' and end_time >= '2021-02-20'
) t
) t1
LATERAL view EXPLODE(split( str,' ')) ABTable as flag
) t2
) t3
sql结果:
总结归纳
主要是运用开窗函数计算递增月份,其他的就是逻辑计算有效天数了。repeat('1 ')再sum() over累积月份相对比较巧妙。如果有其他更好的方式,希望评论区或者私信不吝赐教。