Problem:
Using SQL, compute each user's maximum and minimum interval, in days, between consecutively completed orders.
Note: amt = 0 marks an unfinished order, amt = 1 a completed order.
Columns: user_id, dt, amt.
Output requirement (my own addition): display the result as "user_id, interval start, interval end, min days, max days".
Create-table statement:
drop table if exists user_com;
create table user_com(
user_id string,
dt string,
amt int
)
row format delimited fields terminated by '\t';
--Insert sample data:
insert overwrite table user_com values
('uid_1','20200501',1),('uid_1','20200508',0),('uid_1','20200601',1),('uid_1','20200606',1),
('uid_1','20200705',0),('uid_1','20200709',1),('uid_1','20200725',1),
('uid_2','20200501',0),('uid_2','20200508',1),('uid_2','20200517',0),('uid_2','20200704',1),
('uid_2','20200905',1),('uid_2','20200920',1),('uid_2','20200928',1),
('uid_3','20200501',1),('uid_3','20200508',1),('uid_3','20200517',0),('uid_3','20200704',1),
('uid_3','20200905',0),('uid_3','20200920',1),('uid_3','20200928',1);
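To make the requirement concrete, trace uid_1 by hand (derived from the sample rows above, not program output). The amt = 0 rows on 2020-05-08 and 2020-07-05 break the streaks, and a lone completed order (2020-05-01) pairs with nothing, so it forms no interval:
uid_1 streak 1: 2020-06-01 -> 2020-06-06, datediff = 5 days
uid_1 streak 2: 2020-07-09 -> 2020-07-25, datediff = 16 days
So the expected answer for uid_1 is min_days = 5 and max_days = 16.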
I. Answer the problem as stated (only find each user's maximum and minimum consecutive intervals):
The code is as follows:
--This form only returns the max/min interval lengths per user; it cannot show the actual date ranges
select user_id,
min(diff_date) min_days,
max(diff_date) max_days
from (
select user_id,
       -- partition by (user_id, point1), not point1 alone: point1 is NULL for
       -- every user's leading segment, and a bare "partition by point1" would
       -- lump those NULL rows from different users into one partition
       max(dt) over (partition by user_id, point1) stop_date,
       min(up_date) over (partition by user_id, point1) start_date,
       datediff(max(dt) over (partition by user_id, point1),
                min(up_date) over (partition by user_id, point1)) diff_date,
       point1
from (
select user_id,
dt,
up_date,
point1,
case
    -- NULL point1: the segment has no leading amt = 0 row, so the 2nd
    -- completed order already closes an interval (dk >= 2); otherwise the
    -- amt = 0 row itself sits at dk = 1 and we need dk >= 3
    when point1 is null then dk >= 2
    else dk >= 3
end section_date
from (
select user_id,
dt,
up_date,
point1,
-- rank rows inside each (user, segment); the original used
-- concat(user_id, '-', point1), but concat returns NULL whenever point1 is
-- NULL, which again merges different users' leading segments into one
dense_rank() over (partition by user_id, point1 order by dt) dk
from (
select user_id,
dt,
lag(dt) over (partition by user_id order by dt) up_date,
-- last_value(expr, true) skips NULLs, so point1 carries the marker of the
-- most recent amt = 0 row forward, cutting each user's history into
-- segments of consecutively completed orders
last_value(if(amt = 0, concat(user_id, '|', dt), null), true)
    over (partition by user_id order by dt) point1
from (
select user_id,
from_unixtime(unix_timestamp(dt, 'yyyyMMdd'), 'yyyy-MM-dd') dt,
amt
from user_com) t1
) t2
) t3
) t4
where section_date is true
)t5
group by user_id;
Execution result: (screenshot from the original post, not preserved here)
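Before moving on, it helps to see the segmentation trick in isolation. Reading the query inside out: t1 normalizes dt to yyyy-MM-dd; t2 derives up_date (the previous order date) and point1 (the marker of the most recent unfinished order); t3 ranks rows within each segment; t4 flags the rows that close an interval; the WHERE filter runs before t5's windows, so t5 collapses each segment into its first and last completed order and the day span between them. A quick ad-hoc probe of the t2 step (my own addition, not part of the original solution):
select user_id,
       dt,
       lag(dt) over (partition by user_id order by dt) up_date,
       last_value(if(amt = 0, concat(user_id, '|', dt), null), true)
           over (partition by user_id order by dt) point1
from (select user_id,
             from_unixtime(unix_timestamp(dt, 'yyyyMMdd'), 'yyyy-MM-dd') dt,
             amt
      from user_com) t;
-- For uid_1 this should yield point1 = NULL on 2020-05-01,
-- 'uid_1|2020-05-08' from 2020-05-08 through 2020-06-06, and
-- 'uid_1|2020-07-05' from 2020-07-05 onward: every amt = 0 order
-- starts a new segment.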
II. Taking it further
Requirement: find each user's maximum and minimum consecutive intervals, and report the duration in days of each.
Analysis: the key point is that a user may have several intervals tied for the maximum or the minimum duration, so a plain max()/min() aggregation cannot answer this directly.
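To see why ties matter for the rank bookkeeping, here is a toy check of how rank() behaves on tied durations (my own snippet; the three values are made up, and it assumes a Hive version that allows SELECT without FROM, 0.13+):
select d,
       rank() over (order by d desc) rk_desc,
       rank() over (order by d asc)  rk_asc,
       count(*) over ()              n
from (select 10 d union all select 5 union all select 5) t;
-- d = 10 gets rk_desc = 1; both d = 5 rows get rk_desc = 2 and rk_asc = 1,
-- while n = 3. A check like "rk_desc = n" would therefore miss every tied
-- minimum, which is why the query below marks the minimum with a separate
-- ascending rank instead of comparing against a row count.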
Solution code:
--Refined to the display form (user_id, interval start, interval end, min days, max days):
--Note: several intervals may tie for the minimum or maximum duration, so max()/min() cannot be used
--In some cases the Spark engine fails to produce the correct result even though the logic is
--correct; switch to the MR engine when that happens.
set hive.execution.engine=mr;
select user_id,
       start_date,
       stop_date,
       if(rk_desc = 1, diff_date, 0) max_days,
       if(rk_asc = 1, diff_date, 0) min_days
from (
    select user_id,
           start_date,
           stop_date,
           diff_date,
           -- two ranks instead of the original count-based check
           -- (if(rk = min_c, ...)): rank() leaves gaps on ties, so when two
           -- intervals tie for the minimum neither reaches rank = min_c and
           -- both would be dropped; rk_asc = 1 is tie-safe
           rank() over (partition by user_id order by diff_date desc) rk_desc,
           rank() over (partition by user_id order by diff_date asc) rk_asc
from (
select user_id,
start_date,
stop_date,
diff_date
from (
select user_id,
-- same fix as in Part I: partition by (user_id, point1) so different
-- users' NULL-point1 segments are not merged
max(dt) over (partition by user_id, point1) stop_date,
min(up_date) over (partition by user_id, point1) start_date,
datediff(max(dt) over (partition by user_id, point1),
         min(up_date) over (partition by user_id, point1)) diff_date,
point1
from (
select user_id,
dt,
up_date,
point1,
case
when point1 is null then dk >= 2
else dk >= 3
end section_date
from (
select user_id,
dt,
up_date,
point1,
-- same fix as in Part I: avoid concat(), which is NULL when point1 is NULL
dense_rank()
    over (partition by user_id, point1 order by dt) dk
from (
select user_id,
dt,
lag(dt) over (partition by user_id order by dt ) up_date,
last_value(if(amt = 0, concat(user_id, '|', dt), null),
true)
over (partition by user_id order by dt) point1
from (
select user_id,
from_unixtime(unix_timestamp(dt, 'yyyyMMdd'), 'yyyy-MM-dd') dt,
amt
from user_com) t1
) t2
) t3
) t4
where section_date is true
) t5
group by user_id, start_date, stop_date, diff_date
) t6
)t7
where rk_desc = 1 or rk_asc = 1;
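Reading the output: each interval row carries its duration in exactly one of max_days/min_days, except when a user has a single interval, in which case rk_desc = rk_asc = 1 and both columns are filled on the same row. Worked out by hand from the sample data (my own derivation, not engine output): uid_1 should get (2020-07-09, 2020-07-25, 16, 0) and (2020-06-01, 2020-06-06, 0, 5); uid_2 has a single streak, (2020-07-04, 2020-09-28, 86, 86); uid_3 should get (2020-09-20, 2020-09-28, 8, 0) and (2020-05-01, 2020-05-08, 0, 7).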
(The original post showed screenshots here: the Spark engine result, the MR engine result, a comparison of the two, and a step-by-step walkthrough of how the output is produced. They are not preserved in this text version.)

Takeaway: code should satisfy the general requirement, not just give a specific answer for a specific case; code that cannot be reused across business scenarios is meaningless code!