大数据之hive:hive新功能之GROUPING SETS,Cube, Rollup

一、GROUPING SETS

1、概述

GROUPING SETS作为GROUP BY的子句,可以简单理解为多条group by语句通过union all把查询结果聚合起来;

2、实战

查看表test_03字段

hive (default)> desc test_03;
OK
col_name	data_type	comment
user_id             	bigint              	                    
device_id           	int                 	                    
os_id               	int                 	                    
app_id              	int                 	                    
--sql1
select device_id,os_id,app_id,count(user_id) from  test_03 group by device_id,os_id,app_id grouping sets((device_id)) 
--等价于sql1
SELECT device_id,null,null,count(user_id) FROM test_03 group by device_id

--sql2
select device_id,os_id,app_id,count(user_id) from  test_03 group by device_id,os_id,app_id grouping sets((device_id,os_id))	
--等价于sql2
SELECT device_id,os_id,null,count(user_id) FROM test_03 group by device_id,os_id
--sql3
select device_id,os_id,app_id,count(user_id) from  test_03 group by device_id,os_id,app_id grouping sets((device_id,os_id),(device_id))
--等价于sql3
SELECT device_id,os_id,null,count(user_id) FROM test_03 group by device_id,os_id 
UNION ALL 
SELECT device_id,null,null,count(user_id) FROM test_03 group by device_id
--sql4
select device_id,os_id,app_id,count(user_id) from  test_03 group by device_id,os_id,app_id grouping sets((device_id),(os_id),(device_id,os_id),())
--等价于sql4	
SELECT device_id,null,null,count(user_id) FROM test_03 group by device_id 
UNION ALL 
SELECT null,os_id,null,count(user_id) FROM test_03 group by os_id 
UNION ALL 
SELECT device_id,os_id,null,count(user_id) FROM test_03 group by device_id,os_id  
UNION ALL 
SELECT null,null,null,count(user_id) FROM test_03

二、Cube

1、概述

cube简称数据魔方,可以实现hive多个任意维度的查询,cube(a,b,c)则首先会对(a,b,c)进行group by,然后依次是(a,b),(a,c),(a),(b,c),(b),©,最后在对全表进行group by,它会统计所选列中值的所有组合的聚合;

2、实战

select device_id,os_id,app_id count(user_id) 
from test_xinyan_reg 
group by device_id,os_id,app_id with cube;

--共有3^0+3^1+3^2+3^3=8种可能,如果维度增大,代码量显著增加
SELECT device_id,null,null ,count(user_id) FROM test_03 group by device_id
UNION ALL
SELECT null,os_id,null,count(user_id) FROM test_xinyan_reg group by os_id
UNION ALL
SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id
UNION ALL
SELECT null,app_id,null,count(user_id) FROM test_xinyan_reg group by app_id
UNION ALL
SELECT device_id,null,app_id ,count(user_id) FROM test_xinyan_reg group by device_id,app_id
UNION ALL
SELECT null,os_id,app_id,count(user_id) FROM test_xinyan_reg group by os_id,app_id
UNION ALL
SELECT device_id,os_id,app_id,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id
UNION ALL
SELECT null,null,null count(user_id) FROM test_xinyan_reg group by client_version

三、Rollup

1、概述

rollup可以实现从右到做递减多级的统计,显示统计某一层次结构的聚合。

2、实战

--sql1
 select device_id,os_id,app_id,count(user_id) 
from test_xinyan_reg 
group by device_id,os_id,app_id with rollup;
--等价于sql1
select device_id,os_id,app_id,count(user_id) 
from test_xinyan_reg 
group by device_id,os_id,app_id,client_version,from_id 
grouping sets ((device_id,os_id,app_id),(device_id,os_id),(device_id),());

四、Grouping_ID函数

当我们没有统计某一列时,它的值显示为null,这可能与列本身就有null值冲突,这就需要一种方法区分是没有统计还是值本来就是null,grouping_id其实就是所统计各列二进制和;
比如表test_04有两个字段id,type;

idtype
11
1null
21
32
3null
SELECT id, type, GROUPING__ID, count(*) from test_04 GROUP BY id, type  grouping sets((id, type),(id)) order by GROUPING__ID;
--结果如下
id	type	grouping__id	_c3
3	2	0	1
2	1	0	1
1	NULL	0	1
1	1	0	1
3	NULL	0	1
3	NULL	1	2
1	NULL	1	2
2	NULL	1	1

grouping_id计算方法
grouping sets 中的每一种粒度,都对应唯一的 grouping__id 值,其计算公式与 group by 的顺序、当前粒度的字段有关。
序号 grouping set 给倒序排列的字段(sex class)赋值 对应的十进制(grouping__id 的值)

1 id 01 1
2 id,type 00 0
这就是上面 grouping sets 的结果中 grouping__id 值的由来。

### Hive SQL Grouping Sets Usage and Examples In Hive, `GROUPING SETS` allow aggregation at multiple levels simultaneously. This feature extends beyond simple GROUP BY operations to provide more flexible data summarization capabilities. For instance, consider a table named `sales_data`, which contains sales information across different regions and product categories: ```sql CREATE TABLE IF NOT EXISTS sales_data ( region STRING, category STRING, amount DOUBLE ); ``` To calculate total sales per region as well as overall totals without repeating rows manually, one can use `GROUPING SETS`. Here’s an example query demonstrating this functionality[^1]: ```sql SELECT COALESCE(region, 'Total') AS region, SUM(amount) AS total_sales FROM sales_data GROUP BY GROUPING SETS ((region), ()) ORDER BY region; ``` This will produce results showing both individual regional sums along with an additional row representing the grand total of all sales amounts. Another common scenario involves analyzing sales not only by each combination of region and category but also obtaining subtotals for just the regions or categories alone. The following code snippet illustrates how such multi-level aggregations could be achieved using `GROUPING SETS`[^2]: ```sql SELECT COALESCE(region, 'All Regions') AS region, COALESCE(category, 'All Categories') AS category, SUM(amount) AS total_sales FROM sales_data GROUP BY GROUPING SETS ((region, category), (region), (category), ()) ORDER BY region NULLS LAST, category NULLS LAST; ``` The above script generates detailed breakdowns alongside higher-level summaries within a single result set efficiently. Additionally, when working with complex queries involving multiple dimensions like time periods combined with geographical areas or other attributes, leveraging `CUBE` or `ROLLUP` operators—which internally utilize `GROUPING SETS`—can simplify syntax while enhancing readability significantly[^3].
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值