练习一
我们有如下的用户访问数据
userId | visitDate | visitCount |
u01 | 2017/1/21 | 5 |
u02 | 2017/1/23 | 6 |
u03 | 2017/1/22 | 8 |
u04 | 2017/1/20 | 3 |
u01 | 2017/1/23 | 6 |
u01 | 2017/2/21 | 8 |
u02 | 2017/1/23 | 6 |
u01 | 2017/2/22 | 4 |
要求使用SQL统计出每个用户的累积访问次数,如下表所示:
用户id | 月份 | 小计 | 累积 |
u01 | 2017-01 | 11 | 11 |
u01 | 2017-02 | 12 | 23 |
u02 | 2017-01 | 12 | 12 |
u03 | 2017-01 | 8 | 8 |
u04 | 2017-01 | 3 | 3 |
创建表
create table action
(userId string,
visitDate string,
visitCount int)
row format delimited fields terminated by "\t";
①首先我们将原数据日期改为所需格式(-)
注:date_format()函数默认连接符号是-,所以我们需要用regexp_replace(visitdate,'/','-')将/先替换成-
select
userid,
date_format(regexp_replace(visitdate,'/','-'),'yyyy-MM') visitdate,
visitcount
from action; 我们将这个表定义为t1
②我们把每个人每个月访问次数统计出来
select
userid,
visitdate,
sum(visitcount) small_count
from
(select
userid,
date_format(regexp_replace(visitdate,'/','-'),'yyyy-MM') visitdate,
visitcount
from action)t1 group by userid,visitdate; 我们称这张表为t2
③我们可以看到最后的需求是每个用户每个月的小计进行累加,通过over()函数实现
select
userid,
visitdate,
small_count,
sum(small_count) over(distribute by userid sort by visitdate) lei_count
from
(select
userid,
visitdate,
sum(visitcount) small_count
from
(select
userid,
date_format(regexp_replace(visitdate,'/','-'),'yyyy-MM') visitdate,
visitcount
from action)t1 group by userid,visitdate)t2;
练习二
有50W个京东店铺,每个顾客访问任何一个店铺的任何一个商品时都会产生一条访问日志,访问日志存储的表名为visit,访客的用户id为user_id,被访问的店铺名称为shop,请统计:
u1 a
u2 b
u1 b
u1 a
u3 c
u4 b
u1 a
u2 c
u5 b
u4 b
u6 c
u2 c
u1 b
u2 a
u2 a
u3 a
u5 a
u5 a
u5 a
建表:
create table visit(user_id string,shop string) row format delimited fields terminated by '\t';
需求:
1.每个店铺的UV(访客数)
解法1:使用distinct去重处理,真实环境尽量不要用,同一个ID会去到一个reduce中,造成处理数据太多
select
shop,
count(distinct user_id)
from visit
group by
shop;
解法2:使用子查询
select
shop,
count(*) uv
from
(select
shop,
user_id
from visit
group by shop,user_id)t1
group by shop;
2.每个店铺访问次数top3的访客信息。输出店铺名称、访客id、访问次数
①先求出每个店铺每个人点击次数
select
shop,
user_id,
count(user_id) visitcount
from visit group by shop,user_id;此张表为t1
②根据每个店铺每个人的点击次数进行降序排序(rank,dense_rank,row_number根据需求都可以选择)
select
shop,
user_id,
visitcount,
rank() over(distribute by shop sort by visitcount desc) usercount
from
(select
shop,
user_id,
count(user_id) visitcount
from visit group by shop,user_id)t1; 此表我们称为t2
③根据t2表我们利用where筛选usercount<=3
select
shop,
user_id,
visitcount
from
(select
shop,
user_id,
visitcount,
rank() over(distribute by shop sort by visitcount desc) usercount
from
(select
shop,
user_id,
count(user_id) visitcount
from visit group by shop,user_id)t1)t2
where usercount <= 3;
rank函数选择不同的话,结果也会不同,根据需求来定