HIVE面试题原理详解 统计用户连续交易的总额、连续登陆天数、连续登陆开始和结束时间、间隔天数
友情提示
下面遇到不懂的窗口函数可自行学习
https://blog.youkuaiyun.com/qq_44011386/article/details/103493416
创建数据表
create table lianxv(
uid int comment "用户id",
utime string comment "用户时间",
sal decimal(10,2) comment "用户价格"
)row format delimited fields terminated by ','
location "/hivelianxibiao/lianxv";
添加数据
insert into deal_tb values
(1, "2019-02-08", 6214.23),
(1, "2019-02-08", 6247.32),
(1, "2019-02-09", 85.63),
(1, "2019-02-09", 967.36),
(1, "2019-02-10", 85.69),
(1, "2019-02-12", 769.85),
(1, "2019-02-13", 943.86),
(1, "2019-02-14", 538.42),
(1, "2019-02-15", 369.76),
(1, "2019-02-16", 369.76),
(1, "2019-02-18", 795.15),
(1, "2019-02-19", 715.65),
(1, "2019-02-21", 537.71),
(2, "2019-02-08", 6214.23),
(2, "2019-02-08", 6247.32),
(2, "2019-02-09", 85.63),
(2, "2019-02-09", 967.36),
(2, "2019-02-10", 85.69),
(2, "2019-02-12", 769.85),
(2, "2019-02-13", 943.86),
(2, "2019-02-14", 943.18),
(2, "2019-02-15", 369.76),
(2, "2019-02-18", 795.15),
(2, "2019-02-19", 715.65),
(2, "2019-02-21", 537.71),
(3, "2019-02-08", 6214.23),
(3, "2019-02-08", 6247.32),
(3, "2019-02-09", 85.63),
(3, "2019-02-09", 967.36),
(3, "2019-02-10", 85.69),
(3, "2019-02-12", 769.85),
(3, "2019-02-13", 943.86),
(3, "2019-02-14", 276.81),
(3, "2019-02-15", 369.76),
(3, "2019-02-16", 369.76),
(3, "2019-02-18", 795.15),
(3, "2019-02-19", 715.65),
(3, "2019-02-21", 537.71);
流程图
第一步分析(子表a)
首先聚合每月的价格保证每月只有一条数据
select uid,utime,sum(sal) sal from lianxv group by uid,utime order by uid,utime
第二步分析(子表b)
(下文同)为方便阅读此处的子表a等价于(select uid,utime,sum(sal) sal from lianxv group by uid,utime order by uid,utime)a
在a表的基础上使用row_number()窗口函数对每一行标号后 通过date_sub()时间函数将每个utime减去对应的标号得到新的列 别名为sj
这样使得utime时间连续的数据所对应的sj列相同(均等于开始时间减去一天,少的这一天是因为减去了对应标号,sj列后面只做分组用不需要考虑把那一天加回来)
select uid,utime,sal,
date_sub( utime, row_number() over(partition by uid) ) sj
from a
第三步分析(子表c)
在b表的基础上统计
用户id 连续交易总额 开始时间 结束时间 连续天数 间隔天数
(与下面的列对应)
按照uid,sj进行分组后即相同的uid且相同的sj被分到同一个reduce中
通过datediff() 时间函数比较当前行的sj与前一行的sj的差值即间隔天数
ps:
lag和lead函数可以在同一次查询中取出同一字段的前N行数据(lag)和后N行数据(lead)。
函数:
LAG(EXP_STR,OFFSET,DEFVAL)OVER()
LEAD(EXP_STR,OFFSET,DEFVAL)OVER()
参数如下:
EXP_STR:要取的列
OFFSET: 取偏移后的第几行数据
DEFVAL:没有符合条件的默认值
至于over()设置自定义开窗大小可自行学习窗口函数(此处可删除rows between 1 preceding and current row结果无影响)
select
uid,
sum(sal) lianxvzongshu,
min(utime) kaishi,
max(utime) jieshu,
count(sj) lianxvtian,
datediff(sj, lag(sj,1,sj) over(partition by uid order by sj rows between 1 preceding and current row )
) jiange
from b
group by uid,sj
完整的hql语句如下:
-- 用户id 连续交易总额 开始时间 结束时间 连续天数 间隔天数
select uid, sum(sal) lianxvzongshu, min(utime) kaishi,max(utime) jieshu, count(sj) lianxvtian, datediff(sj, lag(sj,1,sj) over(partition by uid order by sj rows between 1 preceding and current row ) ) jiange
from(
select uid,utime,sal, date_sub( utime,row_number() over(partition by uid) ) sj from
(select uid,utime,sum(sal) sal from lianxv group by uid,utime order by uid,utime)a
)b
group by uid,sj
2023年9月更新完整sql逻辑(工作需求中遇到了)
-- 阿里云 odps sql
DROP TABLE deal_tb;
create table deal_tb(
uid int comment "用户id",
utime string comment "用户时间",
sal decimal(10,2) comment "用户价格"
);
insert into deal_tb values
(1, "2019-02-08", 6247.32),
(1, "2019-02-08", 6247.32),
(1, "2019-02-09", 85.63),
(1, "2019-02-09", 967.36),
(1, "2019-02-10", 85.69),
(1, "2019-02-12", 769.85),
(1, "2019-02-13", 943.86),
(1, "2019-02-14", 538.42),
(1, "2019-02-15", 369.76),
(1, "2019-02-16", 369.76),
(1, "2019-02-18", 795.15),
(1, "2019-02-19", 715.65),
(1, "2019-02-21", 537.71),
(2, "2019-02-08", 6214.23),
(2, "2019-02-08", 6247.32),
(2, "2019-02-09", 85.63),
(2, "2019-02-09", 967.36),
(2, "2019-02-10", 85.69),
(2, "2019-02-12", 769.85),
(2, "2019-02-13", 943.86),
(2, "2019-02-14", 943.18),
(2, "2019-02-15", 369.76),
(2, "2019-02-18", 795.15),
(2, "2019-02-19", 715.65),
(2, "2019-02-21", 537.71),
(3, "2019-02-08", 6214.23),
(3, "2019-02-08", 6247.32),
(3, "2019-02-09", 85.63),
(3, "2019-02-09", 967.36),
(3, "2019-02-10", 85.69),
(3, "2019-02-12", 769.85),
(3, "2019-02-13", 943.86),
(3, "2019-02-14", 276.81),
(3, "2019-02-15", 369.76),
(3, "2019-02-16", 369.76),
(3, "2019-02-18", 795.15),
(3, "2019-02-19", 715.65),
(3, "2019-02-21", 537.71);
-- 连续交易的总额、连续登陆天数、连续登陆开始和结束时间、间隔天数
SELECT
uid,sum(sal) as "连续交易的总额" ,min(utime) as "开始时间",max(utime) as "结束时间"
,count(1) as "连续交易天数"
,datediff(min(utime),lag(max(utime),1,min(utime)) OVER(partition by uid order by min(utime))) as "间隔天数"
from
(
select
uid,utime
,sum(sal) sal
,DATE_SUB(utime,ROW_NUMBER() OVER(PARTITION BY uid ORDER BY utime)) datesub_time
from deal_tb
group by uid,utime
)a
group by uid,datesub_time
;