HIVE面试题原理详解 统计用户连续交易的总额、连续登陆天数、连续登陆开始和结束时间、间隔天数

本文详述了如何使用Hive SQL通过窗口函数统计用户的连续交易总额、连续登陆天数、连续登陆开始和结束时间及间隔天数,涉及row_number、lag、datediff等函数的综合应用。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

友情提示

下面遇到不懂的窗口函数可自行学习
https://blog.youkuaiyun.com/qq_44011386/article/details/103493416

创建数据表

create table lianxv(
uid int comment "用户id",
utime string comment "用户时间",
sal decimal(10,2) comment "用户价格"
)row format delimited fields terminated by ','
location "/hivelianxibiao/lianxv";

添加数据

insert into deal_tb values
(1, "2019-02-08", 6214.23), 
(1, "2019-02-08", 6247.32), 
(1, "2019-02-09", 85.63), 
(1, "2019-02-09", 967.36), 
(1, "2019-02-10", 85.69), 
(1, "2019-02-12", 769.85), 
(1, "2019-02-13", 943.86), 
(1, "2019-02-14", 538.42),
(1, "2019-02-15", 369.76),
(1, "2019-02-16", 369.76),
(1, "2019-02-18", 795.15),
(1, "2019-02-19", 715.65),
(1, "2019-02-21", 537.71),
(2, "2019-02-08", 6214.23), 
(2, "2019-02-08", 6247.32), 
(2, "2019-02-09", 85.63), 
(2, "2019-02-09", 967.36), 
(2, "2019-02-10", 85.69), 
(2, "2019-02-12", 769.85), 
(2, "2019-02-13", 943.86), 
(2, "2019-02-14", 943.18),
(2, "2019-02-15", 369.76),
(2, "2019-02-18", 795.15),
(2, "2019-02-19", 715.65),
(2, "2019-02-21", 537.71),
(3, "2019-02-08", 6214.23), 
(3, "2019-02-08", 6247.32), 
(3, "2019-02-09", 85.63), 
(3, "2019-02-09", 967.36), 
(3, "2019-02-10", 85.69), 
(3, "2019-02-12", 769.85), 
(3, "2019-02-13", 943.86), 
(3, "2019-02-14", 276.81),
(3, "2019-02-15", 369.76),
(3, "2019-02-16", 369.76),
(3, "2019-02-18", 795.15),
(3, "2019-02-19", 715.65),
(3, "2019-02-21", 537.71);

流程图

开始
聚合每月的价格保证每月只有一条数据
使用row_number函数标号 每个时间减去对应的标号 别名为sj
按照sj进行分组后即可统计
结束

第一步分析(子表a)

首先聚合每月的价格保证每月只有一条数据

select uid,utime,sum(sal) sal from lianxv group by uid,utime order by uid,utime

第二步分析(子表b)

(下文同)为方便阅读此处的子表a等价于(select uid,utime,sum(sal) sal from lianxv group by uid,utime order by uid,utime)a

在a表的基础上使用row_number()窗口函数对每一行标号后 通过date_sub()时间函数将每个utime减去对应的标号得到新的列 别名为sj
这样使得utime时间连续的数据所对应的sj列相同(均等于开始时间减去一天,少的这一天是因为减去了对应标号,sj列后面只做分组用不需要考虑把那一天加回来)

select uid,utime,sal, 
date_sub( utime, row_number() over(partition by uid) ) sj 
from a              

第三步分析(子表c)

在b表的基础上统计
用户id 连续交易总额 开始时间 结束时间 连续天数 间隔天数
(与下面的列对应)

按照uid,sj进行分组后即相同的uid且相同的sj被分到同一个reduce中
通过datediff() 时间函数比较当前行的sj与前一行的sj的差值即间隔天数

ps:
lag和lead函数可以在同一次查询中取出同一字段的前N行数据(lag)和后N行数据(lead)。
函数:
LAG(EXP_STR,OFFSET,DEFVAL)OVER()
LEAD(EXP_STR,OFFSET,DEFVAL)OVER()
参数如下:
EXP_STR:要取的列
OFFSET: 取偏移后的第几行数据
DEFVAL:没有符合条件的默认值

至于over()设置自定义开窗大小可自行学习窗口函数(此处可删除rows between 1 preceding and current row结果无影响)

select 
uid,
sum(sal) lianxvzongshu, 
min(utime) kaishi,
max(utime) jieshu, 
count(sj) lianxvtian, 
datediff(sj, lag(sj,1,sj) over(partition by uid order by sj  rows between 1 preceding and current row ) 
)  jiange
from b
group by uid,sj

完整的hql语句如下:

     -- 用户id        连续交易总额          开始时间           结束时间             连续天数           间隔天数 
select uid, sum(sal) lianxvzongshu, min(utime) kaishi,max(utime) jieshu, count(sj) lianxvtian, datediff(sj, lag(sj,1,sj) over(partition by uid order by sj  rows between 1 preceding and current row ) )  jiange
 from(
        select uid,utime,sal, date_sub( utime,row_number() over(partition by uid) ) sj from
                (select uid,utime,sum(sal) sal from lianxv group by uid,utime order by uid,utime)a
        )b
 group by uid,sj

2023年9月更新完整sql逻辑(工作需求中遇到了)

-- 阿里云 odps sql
DROP TABLE deal_tb;
create table deal_tb(
uid int comment "用户id",
utime string comment "用户时间",
sal decimal(10,2) comment "用户价格"
);
insert into deal_tb values
(1, "2019-02-08", 6247.32), 
(1, "2019-02-08", 6247.32), 
(1, "2019-02-09", 85.63), 
(1, "2019-02-09", 967.36), 
(1, "2019-02-10", 85.69), 
(1, "2019-02-12", 769.85), 
(1, "2019-02-13", 943.86), 
(1, "2019-02-14", 538.42),
(1, "2019-02-15", 369.76),
(1, "2019-02-16", 369.76),
(1, "2019-02-18", 795.15),
(1, "2019-02-19", 715.65),
(1, "2019-02-21", 537.71),
(2, "2019-02-08", 6214.23), 
(2, "2019-02-08", 6247.32), 
(2, "2019-02-09", 85.63), 
(2, "2019-02-09", 967.36), 
(2, "2019-02-10", 85.69), 
(2, "2019-02-12", 769.85), 
(2, "2019-02-13", 943.86), 
(2, "2019-02-14", 943.18),
(2, "2019-02-15", 369.76),
(2, "2019-02-18", 795.15),
(2, "2019-02-19", 715.65),
(2, "2019-02-21", 537.71),
(3, "2019-02-08", 6214.23), 
(3, "2019-02-08", 6247.32), 
(3, "2019-02-09", 85.63), 
(3, "2019-02-09", 967.36), 
(3, "2019-02-10", 85.69), 
(3, "2019-02-12", 769.85), 
(3, "2019-02-13", 943.86), 
(3, "2019-02-14", 276.81),
(3, "2019-02-15", 369.76),
(3, "2019-02-16", 369.76),
(3, "2019-02-18", 795.15),
(3, "2019-02-19", 715.65),
(3, "2019-02-21", 537.71);

-- 连续交易的总额、连续登陆天数、连续登陆开始和结束时间、间隔天数
SELECT 
    uid,sum(sal) as "连续交易的总额" ,min(utime) as "开始时间",max(utime) as "结束时间"
    ,count(1) as "连续交易天数"
    ,datediff(min(utime),lag(max(utime),1,min(utime)) OVER(partition by uid order by min(utime))) as "间隔天数"
from 
(
    select 
        uid,utime
        ,sum(sal) sal 
        ,DATE_SUB(utime,ROW_NUMBER() OVER(PARTITION BY uid ORDER BY utime)) datesub_time
    from deal_tb 
    group by uid,utime 
)a
group by uid,datesub_time 
;

在这里插入图片描述

评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值