hiveSql 计算两段时间范围交集的连续每月天数

本文详细介绍如何使用Hive SQL解决实际问题,计算指定时间段内会员的有效天数,通过实例分解基本思路、数据实践和代码实现,适用于处理会员服务中的数据统计需求。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

hiveSql 计算两段时间范围交集的连续每月天数

需求解读

文章名听着可能比较难理解嗷,先详细说下具体需求是啥样的。想计算在指定时间段内的有效会员,会员数据可能有月卡、年卡等,指定时间可能是一天或者一个时间范围。所有会员时间范围与指定时间范围有交集的会员订单数据,计算出在指定时间段内每月的有效天数。

说的可能比较抽象,具体看数据。

已知数据

有用户购买会员的会员明细数据如下:

用户id会员开始时间会员结束时间
A2021-01-012021-04-01
A2021-06-012021-09-01
B2021-03-012021-04-01
C2021-04-012021-05-01
C2021-06-012021-07-01

期望结果

现在想计算出指定时间段内是有效会员的所有会员在指定时间段内的每个月的有效天数,期望数据如下:

例如:想计算2021-02-20 到 20221-06-15范围内所有有效会员,在这段时间内的每月有效天数。

用户id指定时间范围月份在该月份有效天数
A2021-029
A2021-0331
A2021-041
A2021-0615
B2021-0331
B2021-041
C2021-0430
C2021-051
C2021-0615

上述结果数据可以看出,A用户会员购买记录有两单,与指定时间(2021-02-20 到 20221-06-15)都有交集,则这两单在指定时间范围都是有效会员。第一单是21年1月1号到4月1号,与指定时间交集范围是2021-02-20 到 2021-04-01。则需计算2021-02-20 到 2021-04-01这段时间区间的每月天数,区间是全闭区间。

基本思路

到这可以思考下大概需要怎么做?
1.首先注意是指定范围内的有效会员。第一步是要筛选出哪些会员记录是有效(即会员时间范围与指定时间范围有交集),并且要计算出交集的开始结束时间。

2.第一步完成后,得到所有有效会员记录,有效开始结束时间,此时要对每一条记录(注意这里是要对每一条记录,而不是每个用户,因为一个用户可能有多条购买会员记录。做一行转多行的拆分有效时间范围月份操作)。

3.第二步完成后,此时应该得到每一条有效记录在指定时间范围的每个月的数据。此时需要判断有效开始时间是否在拆分后的月份内。按照真实开始、结束时间计算个月内的有效天数。

数据实践

按照上述步骤,写sql实践。相对来说,麻烦点其实在第二步,需要将交集时间拆分到每个月。让俺们一步一步看:

首先准备数据:

with table_name as ( -- 测试数据
    select 'A' as username,'2021-01-01' as start_time, '2021-04-01' as end_time
    union all
    select 'A' as username,'2021-06-01' as start_time, '2021-09-01' as end_time
    union all
    select 'B' as username,'2021-03-01' as start_time, '2021-04-01' as end_time
    union all
    select 'C' as username,'2021-04-01' as start_time, '2021-05-01' as end_time
    union all
    select 'C' as username,'2021-06-01' as start_time, '2021-07-01' as end_time
)
  • 筛选有效会员记录(即计算会员范围和指定范围的交集)
    时间段1:[A,B] ;时间段2:[C,D]。
    交集计算:A <= D and B >= C
    真实交集时间段:[A <= C则取C否则取A , B >= D则取D否则取B]
    irank:注意这里多了一个ROW_NUMBER()开窗,目的是为了会员记录的唯一性,后续用这个唯一id和用户一起分组拆分有效月。因为如果只按照用户分区,会有一个用户购买多次会员的情况,导致拆分月份混乱。
select
    username,
    start_time as vip_start_time,
    end_time as vip_end_time,
    case when start_time <= '2021-02-20' then '2021-02-20' else start_time end as real_start_time,
    case when end_time >= '2021-06-15' then '2021-06-15' else end_time end as real_end_time
    ,ROW_NUMBER() OVER(PARTITION BY 1) AS irank 
from table_name
where start_time <= '2021-06-15' and end_time >= '2021-02-20'

第一步执行结果:
在这里插入图片描述

  • 第二步需要拆分真实交集时间范围到每个月上,想到可以使用EXPLODE()方法将月份拆开。但是现在只是一个真实开始日期,真实结束日期。要怎么拆呢?而且拆完后还要计算出月份第增的效果。例如第一步结果中第一条有效开始日期,结束日期是 2021-02-20 到 2021-04-01。拆完之后需要2月、3月、4月三条数据。
    此时想到其实可以利用repeat()方法,重复交集有效时间段月份数的字段串"1 "(“1” 和 " ")即可。配合trim()将最后的空格去除。之所以是"1"是因为后续可以利用sum() over开窗这个转成数字的1累加计算月份递增操作。见sql:
with table_name as ( -- 测试数据
    select 'A' as username,'2021-01-01' as start_time, '2021-04-01' as end_time
    union all
    select 'A' as username,'2021-06-01' as start_time, '2021-09-01' as end_time
    union all
    select 'B' as username,'2021-03-01' as start_time, '2021-04-01' as end_time
    union all
    select 'C' as username,'2021-04-01' as start_time, '2021-05-01' as end_time
    union all
    select 'C' as username,'2021-06-01' as start_time, '2021-07-01' as end_time
)
select
     username,
     vip_start_time,
     vip_end_time,
     real_start_time,
     real_end_time,
     irank,
     trim(repeat('1 ',DATEDIFF(to_date(real_end_time,'yyyy-mm-dd'), to_date(real_start_time,'yyyy-mm-dd'),'mm') + 1)) as str
 from
     (select
         username,
         start_time as vip_start_time,
         end_time as vip_end_time,
         case when start_time <= '2021-02-20' then '2021-02-20' else start_time end as real_start_time,
         case when end_time >= '2021-06-15' then '2021-06-15' else end_time end as real_end_time,
         ROW_NUMBER() OVER(PARTITION BY 1) AS irank 
     from table_name
     where start_time <= '2021-06-15' and end_time >= '2021-02-20'
     ) t

此处的:

trim(repeat('1 ',DATEDIFF(to_date(real_end_time,'yyyy-mm-dd'), to_date(real_start_time,'yyyy-mm-dd'),'mm') + 1)) as str

将真实开始日期与真实结束日期月份差+1后 repeat('1 ')后再trim。可以得到每条数据后跟着真实日期月份差的"1 "数量(即上述步骤2的str字段)。可见该小步骤执行结果:
在这里插入图片描述
再将str字段EXPLODE()拆分后。sum() over每个拆分后的1,按照对应真实开始时间去加这个sum()over()开窗后的值,可以得到递增真实月份。sql如下:

with table_name as ( -- 测试数据
    select 'A' as username,'2021-01-01' as start_time, '2021-04-01' as end_time
    union all
    select 'A' as username,'2021-06-01' as start_time, '2021-09-01' as end_time
    union all
    select 'B' as username,'2021-03-01' as start_time, '2021-04-01' as end_time
    union all
    select 'C' as username,'2021-04-01' as start_time, '2021-05-01' as end_time
    union all
    select 'C' as username,'2021-06-01' as start_time, '2021-07-01' as end_time
)
select
    username,
    vip_start_time,
    vip_end_time,
    real_start_time,
    real_end_time,
    dateadd(to_date(real_start_time,'yyyy-mm-dd'),sum(cast(trim(flag) as bigint)) over(partition by username,irank order by real_start_time)-1,'mm') as tmp_date
from(
    select
        username,
        vip_start_time,
        vip_end_time,
        real_start_time,
        real_end_time,
        irank,
        trim(repeat('1 ',DATEDIFF(to_date(real_end_time,'yyyy-mm-dd'), to_date(real_start_time,'yyyy-mm-dd'),'mm') + 1)) as str
    from
        (select
            username,
            start_time as vip_start_time,
            end_time as vip_end_time,
            case when start_time <= '2021-02-20' then '2021-02-20' else start_time end as real_start_time,
            case when end_time >= '2021-06-15' then '2021-06-15' else end_time end as real_end_time,
            ROW_NUMBER() OVER(PARTITION BY 1) AS irank 
        from table_name
        where start_time <= '2021-06-15' and end_time >= '2021-02-20'
        ) t
    ) t1
LATERAL view EXPLODE(split( str,' ')) ABTable as flag

此处的:

dateadd(to_date(real_start_time,'yyyy-mm-dd'),sum(cast(trim(flag) as bigint)) over(partition by username,irank order by real_start_time)-1,'mm') as tmp_date

将str字段拆分后cast()成数字类型,sum() over() 按照用户名即irank保证数据唯一性,避免用户多单会员记录情况。注意此处计算真实开始时间加和月份数的时候 月份数需要做减1操作。因为同月加0。异邻月加1,依次类推,所以要做减1操作。
执行结果如下:
在这里插入图片描述

  • 第三步需要判断real_start_time 或 real_end_time是否和tmp_date是相同月份,因为这关乎计算有效天数。例如可以看到第二步的结果第一条数据,real_start_time:2021-02-20 ;
    real_end_time:2021-04-01;
    tmp_date:2021-02-20 00:00:00
    此时real_start_time和tmp_date同属2月份,可知道这条数据应该是用2月份的月末减去real_start_time+1天,从而得到二月份的有效天数。
    可以看到第二步的结果第二条数据
    real_start_time:2021-02-20 ;
    real_end_time:2021-04-01;
    tmp_date:2021-03-20 00:00:00,
    此时real_start_time、real_end_time和tmp_date都不在同一个月,可知道这条数据应该是3月份的整个月的天数,从而得到三月份的有效天数。
    可以看到第二步的结果第三条数据
    real_start_time:2021-02-20 ;
    real_end_time:2021-04-01;
    tmp_date:2021-04-20 00:00:00,
    此时real_end_time和tmp_date同属4月份,可知道这条数据应该是real_end_time减去4月初一加1的天数,从而得到四月份的有效天数。
    所以真实开始时间、真实结束时间是否和拆分的月份同月很关键。可以标识出两个字段 real_start_time是否和tmp_date同月、real_end_time是否和tmp_date同月。sql如下:
with table_name as ( -- 测试数据
    select 'A' as username,'2021-01-01' as start_time, '2021-04-01' as end_time
    union all
    select 'A' as username,'2021-06-01' as start_time, '2021-09-01' as end_time
    union all
    select 'B' as username,'2021-03-01' as start_time, '2021-04-01' as end_time
    union all
    select 'C' as username,'2021-04-01' as start_time, '2021-05-01' as end_time
    union all
    select 'C' as username,'2021-06-01' as start_time, '2021-07-01' as end_time
)
select
    username,
    vip_start_time,
    vip_end_time,
    real_start_time,
    real_end_time,
    substr(tmp_date,1,7) as per_month,
    if(substr(real_start_time,1,7) = substr(tmp_date,1,7) ,1,0) as is_s_month,
    if(substr(real_end_time,1,7) = substr(tmp_date,1,7) ,1,0) as is_e_month
from
    (
    select
        username,
        vip_start_time,
        vip_end_time,
        real_start_time,
        real_end_time,
        dateadd(to_date(real_start_time,'yyyy-mm-dd'),sum(cast(trim(flag) as bigint)) over(partition by username,irank order by real_start_time)-1,'mm') as tmp_date
    from(
        select
            username,
            vip_start_time,
            vip_end_time,
            real_start_time,
            real_end_time,
            irank,
            trim(repeat('1 ',DATEDIFF(to_date(real_end_time,'yyyy-mm-dd'), to_date(real_start_time,'yyyy-mm-dd'),'mm') + 1)) as str
        from
            (select
                username,
                start_time as vip_start_time,
                end_time as vip_end_time,
                case when start_time <= '2021-02-20' then '2021-02-20' else start_time end as real_start_time,
                case when end_time >= '2021-06-15' then '2021-06-15' else end_time end as real_end_time,
                ROW_NUMBER() OVER(PARTITION BY 1) AS irank 
            from table_name
            where start_time <= '2021-06-15' and end_time >= '2021-02-20'
            ) t
        ) t1
    LATERAL view EXPLODE(split( str,' ')) ABTable as flag
    ) t2

该小步骤sql执行结果如下:
在这里插入图片描述

此时只需要利用is_s_month和 is_e_month判断怎么计算当月有效天数即可,逻辑如下:

  1. 当is_s_month 和 is_e_month 都是0,表示当天数据是整月,两边都不和拆分月份同月。直接计算tmp_date的整月天数即可。
  2. 当is_s_month 和 is_e_month 都是1,表示真实开始日期,和真实结束日期同月,只有一个月要计算有效天数,直接用real_end_date减去real_start_date+1天即可。
  3. 当is_s_month 和 is_e_month 有任意一个是1另一个是0,此时需要判断开始时间还是结束时间与拆分月份是同月。需要再写一个case when判断
    a).真实开始日期和tmp_date同月(tmp_date月末减real_start_date+1天)
    b)真实结束日期和tmp_date同月(real_end_date减去tmp_date月初减+1天)

sql如下:

case when is_s_month + is_e_month = 0 then 
        datediff(dateadd(dateadd(to_date(per_month,'yyyy-mm'),1,'mm'),-1,'dd'), to_date(per_month,'yyyy-mm'),'dd')+1
    when is_s_month + is_e_month = 2 then 
        datediff(to_date(real_end_time,'yyyy-mm-dd'),to_date(real_start_time,'yyyy-mm-dd'),'dd')+1
    else
        case when is_s_month = 1 
        then datediff(dateadd(dateadd( to_date(concat(substr(real_start_time,1,7),'-01'),'yyyy-mm-dd'),1,'mm' ),-1,'dd'),to_date(real_start_time,'yyyy-mm-dd'),'dd')+1
        else datediff( to_date(real_end_time,'yyyy-mm-dd'), to_date(concat(substr(real_end_time,1,7),'-01'),'yyyy-mm-dd'),'dd')+1
        end 
end as daily_month

完整代码

with table_name as ( -- 测试数据
    select 'A' as username,'2021-01-01' as start_time, '2021-04-01' as end_time
    union all
    select 'A' as username,'2021-06-01' as start_time, '2021-09-01' as end_time
    union all
    select 'B' as username,'2021-03-01' as start_time, '2021-04-01' as end_time
    union all
    select 'C' as username,'2021-04-01' as start_time, '2021-05-01' as end_time
    union all
    select 'C' as username,'2021-06-01' as start_time, '2021-07-01' as end_time
)
select
    username,
    vip_start_time,
    vip_end_time,
    real_start_time,
    real_end_time,
    per_month,
    is_s_month,
    is_e_month,
    case 
        when is_s_month + is_e_month = 0 then 
            datediff(dateadd(dateadd(to_date(per_month,'yyyy-mm'),1,'mm'),-1,'dd'), to_date(per_month,'yyyy-mm'),'dd')+1
        when is_s_month + is_e_month = 2 then 
            datediff(to_date(real_end_time,'yyyy-mm-dd'),to_date(real_start_time,'yyyy-mm-dd'),'dd')+1
        else
            case when is_s_month = 1 
            then datediff(dateadd(dateadd( to_date(concat(substr(real_start_time,1,7),'-01'),'yyyy-mm-dd'),1,'mm' ),-1,'dd'),to_date(real_start_time,'yyyy-mm-dd'),'dd')+1
            else datediff( to_date(real_end_time,'yyyy-mm-dd'), to_date(concat(substr(real_end_time,1,7),'-01'),'yyyy-mm-dd'),'dd')+1
            end 
    end as daily_month 
from (
    select
        username,
        vip_start_time,
        vip_end_time,
        real_start_time,
        real_end_time,
        substr(tmp_date,1,7) as per_month,
        if(substr(real_start_time,1,7) = substr(tmp_date,1,7) ,1,0) as is_s_month,
        if(substr(real_end_time,1,7) = substr(tmp_date,1,7) ,1,0) as is_e_month
    from
        (
        select
            username,
            vip_start_time,
            vip_end_time,
            real_start_time,
            real_end_time,
            dateadd(to_date(real_start_time,'yyyy-mm-dd'),sum(cast(trim(flag) as bigint)) over(partition by username,irank order by real_start_time)-1,'mm') as tmp_date
        from(
            select
                username,
                vip_start_time,
                vip_end_time,
                real_start_time,
                real_end_time,
                irank,
                trim(repeat('1 ',DATEDIFF(to_date(real_end_time,'yyyy-mm-dd'), to_date(real_start_time,'yyyy-mm-dd'),'mm') + 1)) as str
            from
                (select
                    username,
                    start_time as vip_start_time,
                    end_time as vip_end_time,
                    case when start_time <= '2021-02-20' then '2021-02-20' else start_time end as real_start_time,
                    case when end_time >= '2021-06-15' then '2021-06-15' else end_time end as real_end_time,
                    ROW_NUMBER() OVER(PARTITION BY 1) AS irank 
                from table_name
                where start_time <= '2021-06-15' and end_time >= '2021-02-20'
                ) t
            ) t1
        LATERAL view EXPLODE(split( str,' ')) ABTable as flag
        ) t2
    ) t3

sql结果:
在这里插入图片描述

总结归纳

主要是运用开窗函数计算递增月份,其他的就是逻辑计算有效天数了。repeat('1 ')再sum() over累积月份相对比较巧妙。如果有其他更好的方式,希望评论区或者私信不吝赐教。

评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

@nanami

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值