善用Grouping Sets 提升代码效率

最新推荐文章于 2023-08-02 12:22:42 发布

原创最新推荐文章于 2023-08-02 12:22:42 发布 · 1.2k 阅读

1 ·

CC 4.0 BY-SA版权

本文为博主原创文章，转载请注明出处http://blog.youkuaiyun.com/wsdc0521

文章标签：

#hive #sql #优化

Impala/Hive/Kudu 专栏收录该内容

22 篇文章

订阅专栏

使用Grouping Sets可按照多个不同的维度组合进行聚合，减少了繁琐的代码，提升整体计算的效率。

比如，需要对国家，省份，城市各维度进行聚合时候，可能会这么写：

select '国家' as gep_type,country as geo_name,count(*) as cnt from tbl group by country
union all 
select '省份' as gep_type,province as geo_name,count(*) as cnt from tbl group by province
union all 
select '城市' as gep_type,city as geo_name,count(*) as cnt from tbl group by city
;

或者这么写：

select country,province,city,count(*)  as cnt from tbl group by country,province,city
union all
select country,province,'ALL' as city,count(*)  as cnt from tbl group by country,province
union all
select country,'ALL' as province,'ALL' as city,count(*)  as cnt from tbl group by country
;

无论哪种方式都会用到union all，而通过grouping sets，我们可以这么写：

-- 1. 先把空值替换成'-'，不然在聚合的时候会被当成'ALL'
with tmp as (
    select 
        coalesce(country, '-') as country
        ,coalesce(province, '-') as province
        ,coalesce(city, '-') as city
    from tbl
)
-- 2.对于非本维度组合的字段，以'ALL'填充
select 
 coalesce(country, 'ALL') as country
,coalesce(province, 'ALL') as province
,coalesce(city, 'ALL') as city
,count(*)  as cnt 
from 
tmp 
group by country,province,city -- 按country,province,city聚合
grouping sets(
    (country) --只按country聚合，等于country,count(*)
    ,(province) --只按province聚合，等于province,count(*)
    ,(city) --只按city聚合，等于city,count(*)
    
    ,(country,province) --按country,province聚合，等于country,province,count(*)
)

有任何疑问都可以在下方留言，我会尽量进行解答！

希望本文对你有帮助，请点个赞鼓励一下作者吧~ 谢谢！