连续值问题
TopN问题
行列互换问题
一.连续值问题 求连续7天登录的用户
原始数据
uid dt status
1 2019-07-11 1
1 2019-07-12 1
1 2019-07-13 1
1 2019-07-14 1
1 2019-07-15 1
1 2019-07-16 1
1 2019-07-17 1
1 2019-07-18 1
2 2019-07-11 1
2 2019-07-12 1
2 2019-07-13 0
2 2019-07-14 1
2 2019-07-15 1
2 2019-07-16 0
2 2019-07-17 1
2 2019-07-18 0
3 2019-07-11 1
3 2019-07-12 1
3 2019-07-13 1
3 2019-07-14 0
3 2019-07-15 1
3 2019-07-16 1
3 2019-07-17 1
3 2019-07-18 1
此类问题求解思路如下:
- 使用 row_number 在组内给数据编号(rownum)
- 某个值 - rownum = gid,得到结果(gid)可以作为后面分组计算的依据
- 根据求得的gid,作为分组条件,求得最终结果
第一步 使用 row_number 在组内给数据编号(rownum)
select uid,dt,
row_number() over(partition by uid order by dt)rownum
from ulogin
where status = 1;
第二步 此处通过dt - rownum = gid得到gid值
select uid,dt,
row_number() over(partition by uid order by dt)rownum,
date_sub(dt,row_number() over(partition by uid order by dt)) gid
from ulogin
where status = 1;
查到的结果如下:
uid dt rownum gid
1 2019-07-11 1 2019-07-10
1 2019-07-12 2 2019-07-10
1 2019-07-13 3 2019-07-10
1 2019-07-14 4 2019-07-10
1 2019-07-15 5 2019-07-10
1 2019-07-16 6 2019-07-10
1 2019-07-17 7 2019-07-10
1 2019-07-18 8 2019-07-10
2 2019-07-11 1 2019-07-10
2 2019-07-12 2 2019-07-10
2 2019-07-14 3 2019-07-11
2 2019-07-15 4 2019-07-11
2 2019-07-17 5 2019-07-12
3 2019-07-11 1 2019-07-10
3 2019-07-12 2 2019-07-10
3 2019-07-13 3 2019-07-10
3 2019-07-15 4 2019-07-11
3 2019-07-16 5 2019-07-11
3 2019-07-17 6 2019-07-11
3 2019-07-18 7 2019-07-11
第三步 此处通过uid,gid作为分组条件
select uid,count(*) as logincount
from (select uid,dt,
row_number() over(partition by uid order by dt)rownum,
date_sub(dt,row_number() over(partition by uid order by dt)) gid
from ulogin
where status = 1)tmp
group by uid,gid
having count(*)>=7;
得到的结果如下:
uid logincount
1 8
二.TopN问题
编写sql语句实现每班前三名,分数一样并列,同时求出前三名按名次排序的分差
原始数据:
sid class score
1 1901 90
2 1901 90
3 1901 83
4 1901 60
5 1902 66
6 1902 23
7 1902 99
8 1902 67
9 1902 87
待求结果数据如下:
class score rank lagscore
1901 90 1 0
1901 90 1 0
1901 83 2 -7
1901 60 3 -23
1902 99 1 0
1902 87 2 -12
1902 67 3 -20
解题思路:
- 使用排名函数,分数一样并列,使用dense_rank
- 使用行函数,将数据向下移动,相减求分差
- 处理null
第一步 分数一样并列
select sno,class,score,
dense_rank() over(partition by class order by score desc) as rank
from stu;
结果如下:
第二步 将数据向下移动
select sno,class,score,
dense_rank() over(partition by class order by score desc) as rank,
score - lag(score) over (partition by class order by score desc) as lagscore
from stu;
结果如下:
第三步 处理空值
select class,score,
dense_rank() over(partition by class order by score desc) as rank,
nvl(score - lag(score) over (partition by class order by score desc),0) as lagscore
from stu;
结果如下:
二.行列互换问题
-- 数据:
id course
1 java
1 hadoop
1 hive
1 hbase
2 java
2 hive
2 spark
2 flink
3 java
3 hadoop
3 hive
3 kafka
-- 编写sql,得到结果如下(1表示选修,0表示未选修)
id java hadoop hive hbase spark flink kafka
1 1 1 1 1 0 0 0
2 1 0 1 0 1 1 0
3 1 1 1 0 0 0 1
第一步 使用case when;group by + sum
select id,
sum(case when course="java" then 1 else 0 end) as java,
sum(case when course="hadoop" then 1 else 0 end) as hadoop,
sum(case when course="hive" then 1 else 0 end) as hive,
sum(case when course="hbase" then 1 else 0 end) as hbase,
sum(case when course="spark" then 1 else 0 end) as spark,
sum(case when course="flink" then 1 else 0 end) as flink,
sum(case when course="kafka" then 1 else 0 end) as kafka
from rowline1
group by id;
结果如下:
-- 数据:
a b 2
a b 1
a b 3
c d 6
c d 8
c d 8
-- 编写sql实现如下结果
id1 id2 flag
a b 2|1|3
c d 6|8
第一步 将元素聚拢
//collect_set()去除重复元素
select id1, id2, collect_set(flag) flag from rowline2 group by id1, id2;
运行结果如下:
//collect_list()保留重复元素
select id1, id2, collect_list(flag) flag from rowline2 group by id1, id2;
运行结果如下:
//sort_array()去除重复元素且排序
select id1, id2, sort_array(collect_set(flag)) flag from rowline2 group by id1, id2;
第二步 将元素连接在一起
select id1, id2, concat_ws("|", collect_set(flag)) flag from rowline2 group by id1, id2;
//执行此语句会报错Argument type mismatch 'flag': Argument 2 of function CONCAT_WS must //be "string or array<string>", but "array<int>" was found
//所以根据报错语句,我们需要给collect_set(flag)中的flag加了类型转换,转换成String
select id1, id2, concat_ws("|", collect_set(cast (flag as string))) flag
from rowline2
group by id1, id2;
运行结果如下: