hive分析函数sum() over()的应用

本文介绍两种计算App一天内累计访问量的方法:通过笛卡尔积进行不等值匹配和使用sum()over()分析函数。后者更为高效且易于理解,适用于如员工累计收入、公司累计营业额等累计场景。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

问题:

求每个app下,一天内截止到当前时间点的累计访问量

分析1:
原始方法:

在不知道有sum() over()函数的情况下怎么得出结果?笛卡尔积(不等值匹配)
自己和自己关联,左边的时间点大于等于右边的时间点

案例1:
spark-sql> with test1 as
         > (select 100024 as apptypeid,'00:00' as dateline,10 as pv
         > union all
         > select 100024 as apptypeid,'02:00' as dateline,10 as pv
         > union all
         > select 100024 as apptypeid,'04:00' as dateline,10 as pv
         > union all
         > select 100024 as apptypeid,'06:00' as dateline,15 as pv
         > union all
         > select 100024 as apptypeid,'08:00' as dateline,12 as pv
         > union all
         > select 100024 as apptypeid,'10:00' as dateline,10 as pv
         > union all
         > select 100024 as apptypeid,'12:00' as dateline,15 as pv
         > union all
         > select 100024 as apptypeid,'14:00' as dateline,12 as pv
         > union all
         > select 100024 as apptypeid,'16:00' as dateline,10 as pv
         > union all
         > select 100024 as apptypeid,'18:00' as dateline,15 as pv
         > union all
         > select 100024 as apptypeid,'20:00' as dateline,12 as pv
         > union all
         > select 100024 as apptypeid,'22:00' as dateline,10 as pv
         > union all
         > select 100024 as apptypeid,'24:00' as dateline,15 as pv)
         > 
         > select 
         >   t1.apptypeid,
         >   t1.dateline,
         >   t1.pv,
         >   sum(t2.pv) as total_pv
         > from 
         > (select
         >   apptypeid,
         >   dateline,
         >   pv
         > from test1) t1
         > 
         > left join
         > (select
         >   apptypeid,
         >   dateline,
         >   pv
         > from test1) t2
         > on t1.apptypeid=t2.apptypeid and t1.dateline>=t2.dateline
         > group by t1.apptypeid,t1.dateline,t1.pv
         > order by t1.apptypeid,t1.dateline asc;
apptypeid	dateline	pv	  total_pv         
100024  	00:00   	10      10                                                      
100024		02:00		10		20
100024		04:00		10		30
100024		06:00		15		45
100024		08:00		12		57
100024		10:00		10		67
100024		12:00		15		82
100024		14:00		12		94
100024		16:00		10		104
100024		18:00		15		119
100024		20:00		12		131
100024		22:00		10		141
100024		24:00		15		156
Time taken: 20.975 seconds, Fetched 13 row(s)

注: 如果数据量较大,不等值匹配的方法效率就比较低了,而且不是很好理解

分析2:
分析函数:

语法:
sum() over( partition by apptypeid order by dateline desc)
说明:
partition by:以什么分区(分组)
order by:区内或者组内以什么排序,是asc还是desc

可以多个字段分区、多个字段排序。可以不要分区sum() over(order by dateline desc),但这样的分析函数已经失去了灵魂。

也可以distribute by+sort by:
sum() over( distribute by apptypeid sort by dateline desc)

根据app分组,时间点正序,得到的就是截止到当前时间点的累加访问量,即累计访问量

案例2:
spark-sql> with test1 as
         > (select 100024 as apptypeid,'00:00' as dateline,10 as pv
         > union all
         > select 100024 as apptypeid,'02:00' as dateline,10 as pv
         > union all
         > select 100024 as apptypeid,'04:00' as dateline,10 as pv
         > union all
         > select 100024 as apptypeid,'06:00' as dateline,15 as pv
         > union all
         > select 100024 as apptypeid,'08:00' as dateline,12 as pv
         > union all
         > select 100024 as apptypeid,'10:00' as dateline,10 as pv
         > union all
         > select 100024 as apptypeid,'12:00' as dateline,15 as pv
         > union all
         > select 100024 as apptypeid,'14:00' as dateline,12 as pv
         > union all
         > select 100024 as apptypeid,'16:00' as dateline,10 as pv
         > union all
         > select 100024 as apptypeid,'18:00' as dateline,15 as pv
         > union all
         > select 100024 as apptypeid,'20:00' as dateline,12 as pv
         > union all
         > select 100024 as apptypeid,'22:00' as dateline,10 as pv
         > union all
         > select 100024 as apptypeid,'24:00' as dateline,15 as pv)
         > select 
         >   apptypeid,
         >   dateline,
         >   pv,
         >   sum(pv) over(distribute by apptypeid sort by dateline asc) as total_pv
         > from test1;
apptypeid  dateline     pv    total_pv
100024 	 	00:00   	10      10                                                      
100024		02:00		10		20
100024		04:00		10		30
100024		06:00		15		45
100024		08:00		12		57
100024		10:00		10		67
100024		12:00		15		82
100024		14:00		12		94
100024		16:00		10		104
100024		18:00		15		119
100024		20:00		12		131
100024		22:00		10		141
100024		24:00		15		156
Time taken: 22.881 seconds, Fetched 13 row(s)

总结:
分析函数sum() over()应用于数据累计的应用场景
例如:员工每个月的累计收入情况、公司每个月的累计营业额等等

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值