hive分析函数sum() over()的应用

最新推荐文章于 2025-03-28 20:22:11 发布

不想起的昵称

最新推荐文章于 2025-03-28 20:22:11 发布

阅读量1.7k

点赞数 1

分类专栏： hive 文章标签：大数据 hadoop hive

本文链接：https://blog.youkuaiyun.com/weixin_40267121/article/details/119148555

版权

hive 专栏收录该内容

42 篇文章

订阅专栏

本文介绍两种计算App一天内累计访问量的方法：通过笛卡尔积进行不等值匹配和使用sum()over()分析函数。后者更为高效且易于理解，适用于如员工累计收入、公司累计营业额等累计场景。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

问题：

求每个app下，一天内截止到当前时间点的累计访问量

分析1：

原始方法：

在不知道有sum() over()函数的情况下怎么得出结果？笛卡尔积（不等值匹配）
自己和自己关联，左边的时间点大于等于右边的时间点

案例1：

spark-sql> with test1 as
         > (select 100024 as apptypeid,'00:00' as dateline,10 as pv
         > union all
         > select 100024 as apptypeid,'02:00' as dateline,10 as pv
         > union all
         > select 100024 as apptypeid,'04:00' as dateline,10 as pv
         > union all
         > select 100024 as apptypeid,'06:00' as dateline,15 as pv
         > union all
         > select 100024 as apptypeid,'08:00' as dateline,12 as pv
         > union all
         > select 100024 as apptypeid,'10:00' as dateline,10 as pv
         > union all
         > select 100024 as apptypeid,'12:00' as dateline,15 as pv
         > union all
         > select 100024 as apptypeid,'14:00' as dateline,12 as pv
         > union all
         > select 100024 as apptypeid,'16:00' as dateline,10 as pv
         > union all
         > select 100024 as apptypeid,'18:00' as dateline,15 as pv
         > union all
         > select 100024 as apptypeid,'20:00' as dateline,12 as pv
         > union all
         > select 100024 as apptypeid,'22:00' as dateline,10 as pv
         > union all
         > select 100024 as apptypeid,'24:00' as dateline,15 as pv)
         > 
         > select 
         >   t1.apptypeid,
         >   t1.dateline,
         >   t1.pv,
         >   sum(t2.pv) as total_pv
         > from 
         > (select
         >   apptypeid,
         >   dateline,
         >   pv
         > from test1) t1
         > 
         > left join
         > (select
         >   apptypeid,
         >   dateline,
         >   pv
         > from test1) t2
         > on t1.apptypeid=t2.apptypeid and t1.dateline>=t2.dateline
         > group by t1.apptypeid,t1.dateline,t1.pv
         > order by t1.apptypeid,t1.dateline asc;
apptypeid	dateline	pv	  total_pv         
100024  	00:00   	10      10                                                      
100024		02:00		10		20
100024		04:00		10		30
100024		06:00		15		45
100024		08:00		12		57
100024		10:00		10		67
100024		12:00		15		82
100024		14:00		12		94
100024		16:00		10		104
100024		18:00		15		119
100024		20:00		12		131
100024		22:00		10		141
100024		24:00		15		156
Time taken: 20.975 seconds, Fetched 13 row(s)

注：如果数据量较大，不等值匹配的方法效率就比较低了，而且不是很好理解

分析2：

分析函数：

语法：
sum() over( partition by apptypeid order by dateline desc)
说明：
partition by：以什么分区（分组）
order by：区内或者组内以什么排序，是asc还是desc

可以多个字段分区、多个字段排序。可以不要分区sum() over(order by dateline desc)，但这样的分析函数已经失去了灵魂。

也可以distribute by+sort by：
sum() over( distribute by apptypeid sort by dateline desc)

根据app分组，时间点正序，得到的就是截止到当前时间点的累加访问量，即累计访问量

案例2：

spark-sql> with test1 as
         > (select 100024 as apptypeid,'00:00' as dateline,10 as pv
         > union all
         > select 100024 as apptypeid,'02:00' as dateline,10 as pv
         > union all
         > select 100024 as apptypeid,'04:00' as dateline,10 as pv
         > union all
         > select 100024 as apptypeid,'06:00' as dateline,15 as pv
         > union all
         > select 100024 as apptypeid,'08:00' as dateline,12 as pv
         > union all
         > select 100024 as apptypeid,'10:00' as dateline,10 as pv
         > union all
         > select 100024 as apptypeid,'12:00' as dateline,15 as pv
         > union all
         > select 100024 as apptypeid,'14:00' as dateline,12 as pv
         > union all
         > select 100024 as apptypeid,'16:00' as dateline,10 as pv
         > union all
         > select 100024 as apptypeid,'18:00' as dateline,15 as pv
         > union all
         > select 100024 as apptypeid,'20:00' as dateline,12 as pv
         > union all
         > select 100024 as apptypeid,'22:00' as dateline,10 as pv
         > union all
         > select 100024 as apptypeid,'24:00' as dateline,15 as pv)
         > select 
         >   apptypeid,
         >   dateline,
         >   pv,
         >   sum(pv) over(distribute by apptypeid sort by dateline asc) as total_pv
         > from test1;
apptypeid  dateline     pv    total_pv
100024 	 	00:00   	10      10                                                      
100024		02:00		10		20
100024		04:00		10		30
100024		06:00		15		45
100024		08:00		12		57
100024		10:00		10		67
100024		12:00		15		82
100024		14:00		12		94
100024		16:00		10		104
100024		18:00		15		119
100024		20:00		12		131
100024		22:00		10		141
100024		24:00		15		156
Time taken: 22.881 seconds, Fetched 13 row(s)