问题:
求每个app下,一天内截止到当前时间点的累计访问量
分析1:
原始方法:
在不知道有sum() over()函数的情况下怎么得出结果?笛卡尔积(不等值匹配)
自己和自己关联,左边的时间点大于等于右边的时间点
案例1:
spark-sql> with test1 as
> (select 100024 as apptypeid,'00:00' as dateline,10 as pv
> union all
> select 100024 as apptypeid,'02:00' as dateline,10 as pv
> union all
> select 100024 as apptypeid,'04:00' as dateline,10 as pv
> union all
> select 100024 as apptypeid,'06:00' as dateline,15 as pv
> union all
> select 100024 as apptypeid,'08:00' as dateline,12 as pv
> union all
> select 100024 as apptypeid,'10:00' as dateline,10 as pv
> union all
> select 100024 as apptypeid,'12:00' as dateline,15 as pv
> union all
> select 100024 as apptypeid,'14:00' as dateline,12 as pv
> union all
> select 100024 as apptypeid,'16:00' as dateline,10 as pv
> union all
> select 100024 as apptypeid,'18:00' as dateline,15 as pv
> union all
> select 100024 as apptypeid,'20:00' as dateline,12 as pv
> union all
> select 100024 as apptypeid,'22:00' as dateline,10 as pv
> union all
> select 100024 as apptypeid,'24:00' as dateline,15 as pv)
>
> select
> t1.apptypeid,
> t1.dateline,
> t1.pv,
> sum(t2.pv) as total_pv
> from
> (select
> apptypeid,
> dateline,
> pv
> from test1) t1
>
> left join
> (select
> apptypeid,
> dateline,
> pv
> from test1) t2
> on t1.apptypeid=t2.apptypeid and t1.dateline>=t2.dateline
> group by t1.apptypeid,t1.dateline,t1.pv
> order by t1.apptypeid,t1.dateline asc;
apptypeid dateline pv total_pv
100024 00:00 10 10
100024 02:00 10 20
100024 04:00 10 30
100024 06:00 15 45
100024 08:00 12 57
100024 10:00 10 67
100024 12:00 15 82
100024 14:00 12 94
100024 16:00 10 104
100024 18:00 15 119
100024 20:00 12 131
100024 22:00 10 141
100024 24:00 15 156
Time taken: 20.975 seconds, Fetched 13 row(s)
注: 如果数据量较大,不等值匹配的方法效率就比较低了,而且不是很好理解
分析2:
分析函数:
语法:
sum() over( partition by apptypeid order by dateline desc)
说明:
partition by:以什么分区(分组)
order by:区内或者组内以什么排序,是asc还是desc
可以多个字段分区、多个字段排序。可以不要分区sum() over(order by dateline desc),但这样的分析函数已经失去了灵魂。
也可以distribute by+sort by:
sum() over( distribute by apptypeid sort by dateline desc)
根据app分组,时间点正序,得到的就是截止到当前时间点的累加访问量,即累计访问量
案例2:
spark-sql> with test1 as
> (select 100024 as apptypeid,'00:00' as dateline,10 as pv
> union all
> select 100024 as apptypeid,'02:00' as dateline,10 as pv
> union all
> select 100024 as apptypeid,'04:00' as dateline,10 as pv
> union all
> select 100024 as apptypeid,'06:00' as dateline,15 as pv
> union all
> select 100024 as apptypeid,'08:00' as dateline,12 as pv
> union all
> select 100024 as apptypeid,'10:00' as dateline,10 as pv
> union all
> select 100024 as apptypeid,'12:00' as dateline,15 as pv
> union all
> select 100024 as apptypeid,'14:00' as dateline,12 as pv
> union all
> select 100024 as apptypeid,'16:00' as dateline,10 as pv
> union all
> select 100024 as apptypeid,'18:00' as dateline,15 as pv
> union all
> select 100024 as apptypeid,'20:00' as dateline,12 as pv
> union all
> select 100024 as apptypeid,'22:00' as dateline,10 as pv
> union all
> select 100024 as apptypeid,'24:00' as dateline,15 as pv)
> select
> apptypeid,
> dateline,
> pv,
> sum(pv) over(distribute by apptypeid sort by dateline asc) as total_pv
> from test1;
apptypeid dateline pv total_pv
100024 00:00 10 10
100024 02:00 10 20
100024 04:00 10 30
100024 06:00 15 45
100024 08:00 12 57
100024 10:00 10 67
100024 12:00 15 82
100024 14:00 12 94
100024 16:00 10 104
100024 18:00 15 119
100024 20:00 12 131
100024 22:00 10 141
100024 24:00 15 156
Time taken: 22.881 seconds, Fetched 13 row(s)
总结:
分析函数sum() over()应用于数据累计的应用场景
例如:员工每个月的累计收入情况、公司每个月的累计营业额等等