窗口函数:lead lag FIRST_VALUE
分析函数: RANK ROW_NUMBER
Hive中的简单窗口函数应用
TOPN
row number
说明:
row_number() over ([partition col1] [order by col2])
rank() over ([partition col1] [order by col2])
dense_rank() over ([partition col1] [order by col2])
它们都是根据col1字段分组,然后对col2字段进行排序,对排序后的每行生成一个行号,这个行号从1开始递增
col1、col2都可以是多个字段,用‘,‘分隔
区别:
1)row_number:不管col2字段的值是否相等,行号一直递增,比如:有两条记录的值相等,但一个是第一,一个是第二
2)rank:上下两条记录的col2相等时,记录的行号是一样的,但下一个col2值的行号递增N(N是重复的次数),比如:有两条并列第一,下一个是第三,没有第二
3)dense_rank:上下两条记录的col2相等时,下一个col2值的行号递增1,比如:有两条并列第一,下一个是第二
select empname,empjob,salary,deptno,row_number() over(partition by deptno order by salary desc ) rank from emp;
empname empjob salary deptno rank
MILLER CLERK 1300.0 10 1
CLARK MANAGER 2450.0 10 2
KING PRESIDENT 5000.0 10 3
SMITH CLERK 800.0 20 1
ADAMS CLERK 1100.0 20 2
JONES MANAGER 2975.0 20 3
SCOTT ANALYST 3000.0 20 4
FORD ANALYST 3000.0 20 5
JAMES CLERK 950.0 30 1
MARTIN SALESMAN 1250.0 30 2
WARD SALESMAN 1250.0 30 3
TURNER SALESMAN 1500.0 30 4
ALLEN SALESMAN 1600.0 30 5
BLAKE MANAGER 2850.0 30 6
//测试原数据
Hive TopN
a chinese 98
a english 90
a math 90
d chinese 88
c english 82
c math 98
b math 79
b chinese 79
b english 79
z english 90
z math 89
z chinese 80
e math 99
e english 87
d english 90
create table t(name string, sub string, score int) row format delimited fields terminated by '\t';
load data local inpath "/home/user01/grades.txt" into table t;
//为每个学生的各门功课成绩排名
1、row_number
select *,row_number() over (partition by name order by score desc) as rank from t;
t.name t.sub t.score rank
a chinese 98 1
a english 90 2
a math 90 3
b chinese 79 1
b english 79 2
b math 79 3
c math 98 1
c english 82 2
d english 90 1
d chinese 88 2
e math 99 1
e english 87 2
z english 90 1
z math 89 2
z chinese 80 3
2.rank //排序字段相同的记录使用相同的排名,下一个从值的行号递增N(N是重复的次数),比如:有两条并列第一,下一个是第三,没有第二
select *,rank() over (partition by name order by score asc) as rank from t;
t.name t.sub t.score rank
a english 90 1
a math 90 1
a chinese 98 3
b chinese 79 1
b english 79 1
b math 79 1
c english 82 1
c math 98 2
d chinese 88 1
d english 90 2
e english 87 1
e math 99 2
z chinese 80 1
z math 89 2
z english 90 3
3、dense_rank //排序字段相同的记录使用相同的排名,下一个值的行号递增1,如下:a的english和math并列第一,下一个chinese是第三,没有第二,b的三门都一样
select *,dense_rank() over (partition by name order by score asc) as rank from t;
t.name t.sub t.score rank
a english 90 1
a math 90 1
a chinese 98 2
b chinese 79 1
b english 79 1
b math 79 1
c english 82 1
c math 98 2
d chinese 88 1
d english 90 2
e english 87 1
e math 99 2
z chinese 80 1
z math 89 2
z english 90 3
业务实例:
统计每个学科的前二名
select * from (select *, row_number() over(partition by name order by score desc) as rank from t )t where rank <=2;
select *,row_number() over () as rank from t rank <=3;
select area, barnd, yuan, row_number() over (partition by area order by yuan desc) as rank from order where rank <=3;
各地区热门商品统计
按地区 分组 再按各个商品的销量进行降序排名
北京 iphone7 70000 1
北京 xiaomi5 60000 2
北京 mate9 50000 3
北京 手机膜 40000 4
...
上海 xiaomi5 70000 1
上海 iphone7 60000 2
上海 mate9 50000 3
上海 手机膜 40000 4
..
分析函数: RANK ROW_NUMBER
Hive中的简单窗口函数应用
TOPN
row number
说明:
row_number() over ([partition col1] [order by col2])
rank() over ([partition col1] [order by col2])
dense_rank() over ([partition col1] [order by col2])
它们都是根据col1字段分组,然后对col2字段进行排序,对排序后的每行生成一个行号,这个行号从1开始递增
col1、col2都可以是多个字段,用‘,‘分隔
区别:
1)row_number:不管col2字段的值是否相等,行号一直递增,比如:有两条记录的值相等,但一个是第一,一个是第二
2)rank:上下两条记录的col2相等时,记录的行号是一样的,但下一个col2值的行号递增N(N是重复的次数),比如:有两条并列第一,下一个是第三,没有第二
3)dense_rank:上下两条记录的col2相等时,下一个col2值的行号递增1,比如:有两条并列第一,下一个是第二
select empname,empjob,salary,deptno,row_number() over(partition by deptno order by salary desc ) rank from emp;
empname empjob salary deptno rank
MILLER CLERK 1300.0 10 1
CLARK MANAGER 2450.0 10 2
KING PRESIDENT 5000.0 10 3
SMITH CLERK 800.0 20 1
ADAMS CLERK 1100.0 20 2
JONES MANAGER 2975.0 20 3
SCOTT ANALYST 3000.0 20 4
FORD ANALYST 3000.0 20 5
JAMES CLERK 950.0 30 1
MARTIN SALESMAN 1250.0 30 2
WARD SALESMAN 1250.0 30 3
TURNER SALESMAN 1500.0 30 4
ALLEN SALESMAN 1600.0 30 5
BLAKE MANAGER 2850.0 30 6
//测试原数据
Hive TopN
a chinese 98
a english 90
a math 90
d chinese 88
c english 82
c math 98
b math 79
b chinese 79
b english 79
z english 90
z math 89
z chinese 80
e math 99
e english 87
d english 90
create table t(name string, sub string, score int) row format delimited fields terminated by '\t';
load data local inpath "/home/user01/grades.txt" into table t;
//为每个学生的各门功课成绩排名
1、row_number
select *,row_number() over (partition by name order by score desc) as rank from t;
t.name t.sub t.score rank
a chinese 98 1
a english 90 2
a math 90 3
b chinese 79 1
b english 79 2
b math 79 3
c math 98 1
c english 82 2
d english 90 1
d chinese 88 2
e math 99 1
e english 87 2
z english 90 1
z math 89 2
z chinese 80 3
2.rank //排序字段相同的记录使用相同的排名,下一个从值的行号递增N(N是重复的次数),比如:有两条并列第一,下一个是第三,没有第二
select *,rank() over (partition by name order by score asc) as rank from t;
t.name t.sub t.score rank
a english 90 1
a math 90 1
a chinese 98 3
b chinese 79 1
b english 79 1
b math 79 1
c english 82 1
c math 98 2
d chinese 88 1
d english 90 2
e english 87 1
e math 99 2
z chinese 80 1
z math 89 2
z english 90 3
3、dense_rank //排序字段相同的记录使用相同的排名,下一个值的行号递增1,如下:a的english和math并列第一,下一个chinese是第三,没有第二,b的三门都一样
select *,dense_rank() over (partition by name order by score asc) as rank from t;
t.name t.sub t.score rank
a english 90 1
a math 90 1
a chinese 98 2
b chinese 79 1
b english 79 1
b math 79 1
c english 82 1
c math 98 2
d chinese 88 1
d english 90 2
e english 87 1
e math 99 2
z chinese 80 1
z math 89 2
z english 90 3
业务实例:
统计每个学科的前二名
select * from (select *, row_number() over(partition by name order by score desc) as rank from t )t where rank <=2;
select *,row_number() over () as rank from t rank <=3;
select area, barnd, yuan, row_number() over (partition by area order by yuan desc) as rank from order where rank <=3;
各地区热门商品统计
按地区 分组 再按各个商品的销量进行降序排名
北京 iphone7 70000 1
北京 xiaomi5 60000 2
北京 mate9 50000 3
北京 手机膜 40000 4
...
上海 xiaomi5 70000 1
上海 iphone7 60000 2
上海 mate9 50000 3
上海 手机膜 40000 4
..