hive中排名函数(重要)
-- 三者的不同点
① row_number() 排名连续 不重复
② dense_rank() 排名连续 且重复
③ rank() 排名不连续 且重复
- row_number函数
-- 需求:统计各班级中学生分数排名前10的学生,要求展示明细数据
# 创建表和导入数据
CREATE TABLE scores(
id STRING,
object_id STRING,
score int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
load data local inpath '/usr/local/soft/hive-3.1.2/data/score.txt' into table scores;
# 关联score表和student表
Create table student_score as
SELECT
T2.*
,T1.sum_score
FROM(
SELECT
id
,sum(score) as sum_score
FROM scores
GROUP BY id
)T1 JOIN student T2 ON T1.id = T2.id
# 进行排名
SELECT
T1.*
FROM (
SELECT
*
,row_number() over(partition by clazz order by sum_score desc) as pm
FROM student_score
)T1 WHERE T1.pm <= 10
-- Time taken: 19.562 seconds
- rank/dense_rank
over可以开启一个窗口,窗口就相当于是当前表的一个副本
如果over中没有添加任何内容,表示整个表的大小
partition by clazz 可以将窗口中的数据进行限制大小 依据是 当前计算行中的班级数据和 窗口中对应班级列数据相同
order by sum_score 对当前窗口中的数据按照指定列进行排序
row_number 可以根据当前行的数据 在窗口中找到对应下标的位置 就可以得到其排名
SELECT
*
,row_number() over(partition by clazz order by sum_score desc) as row_pm
,dense_rank() over(partition by clazz order by sum_score desc) as dense_pm
,rank() over(partition by clazz order by sum_score desc) rank_pm
FROM student_score
- percent_rank/ cume_dist/NTILE
–percent_rank:(rank的结果-1)/(分区内数据的个数-1) 看排名在这个分区中的占比
– cume_dist:计算某个窗口或分区中某个值的累积分布。
–假定升序排序,则使用以下公式确定累积分布:
– 小于等于当前值x的行数 / 窗口或partition分区内的总行数。其中,x 等于 order by 子句中指定的列的当前行中的值。
–NTILE(n):对分区内数据再分成n组,然后打上组号
SELECT
*
,rank() over(partition by clazz order by sum_score desc) rank_pm
,percent_rank() over(partition by clazz order by sum_score desc) as percent_rank
,cume_dist() over(partition by clazz order by sum_score desc) as cume_dist
,NTILE(3) over(partition by clazz order by sum_score desc) as NTILE
FROM student_score