hive中grouping sets , grouping_id, cube,rollup的区别
grouping sets
–建表
create table tmp.garret_test(
name string,
age int,
score int,
subject string
)
–插入数据
insert into table tmp.garret_test values('lina',18,65,'语文')
insert into table tmp.garret_test values('lina',18,78,'数学')
insert into table tmp.garret_test values('Sheldon',17,60,'语文')
insert into table tmp.garret_test values('Sheldon',17,90,'数学')
insert into table tmp.garret_test values('Andy',17,89,'数学')
insert into table tmp.garret_test values('Andy',17,88,'语文')
在一个GROUP BY查询中,根据不同的维度组合进行聚合,等价于将不同维度的GROUP BY结果集进行UNION ALL
GROUPING__ID,表示结果属于哪一个分组集合。
select
name,
age,
count(distinct score) as cnt,
GROUPING__ID
from tmp.garret_test
group by name,age
grouping sets (name,age)
order by GROUPING__ID;
--结果为
lina null 2 1
Sheldon null 2 1
Andy null 2 1
null 17 4 2
null 18 4 2
–等价于
SELECT name,NULL,COUNT(DISTINCT score) AS total_score,1 AS GROUPING__ID FROM tmp.garret_test GROUP BY name
UNION ALL
SELECT NULL,age,COUNT(DISTINCT score) AS total_score,2 AS GROUPING__ID FROM tmp.garret_test GROUP BY age
cube 根据group By的所有的维度进行汇总
SELECT name, age,
COUNT(DISTINCT score) AS uv,
GROUPING__ID
FROM tmp.garret_test
GROUP BY name,age
WITH CUBE
ORDER BY GROUPING__ID;
–结果
null null 6 0
lina null 2 1
Sheldon null 2 1
Andy null 2 1
null 17 4 2
null 18 4 2
lina 18 2 1
Sheldon 17 2 1
Andy 17 2 1
–等价于
SELECT NULL,NULL,COUNT(DISTINCT score) AS uv,0 AS GROUPING__ID FROM tmp.garret_test
UNION ALL
SELECT name,NULL,COUNT(DISTINCT score) AS uv,1 AS GROUPING__ID FROM tmp.garret_test GROUP BY name
UNION ALL
SELECT NULL,age,COUNT(DISTINCT score) AS uv,2 AS GROUPING__ID FROM tmp.garret_test GROUP BY age
UNION ALL
SELECT name,age,COUNT(DISTINCT score) AS uv,3 AS GROUPING__ID FROM tmp.garret_test GROUP BY name,age
rollup 是CUBE的子集,以最左侧的维度为主,从该维度进行层级聚合
SELECT name, age, COUNT(DISTINCT score) AS uv, GROUPING__ID
FROM tmp.garret_test
GROUP BY name,age WITH ROLLUP ORDER BY GROUPING__ID;
null null 6 0
lina null 2 1
Sheldon null 2 1
Andy null 2 1
lina 18 2 1
Sheldon 17 2 1
Andy 17 2 1
本文详细解析了Hive中的groupingsets、grouping_id、cube和rollup四个分组函数的区别与用法,通过创建表格、插入数据并执行查询语句,展示了不同函数在数据聚合上的特点和应用场景。
438

被折叠的 条评论
为什么被折叠?



