grouping sets 联合 Grouping__ID 如何使用

最新推荐文章于 2025-10-11 14:12:40 发布

原创最新推荐文章于 2025-10-11 14:12:40 发布 · 置顶 · 1.6k 阅读

2 ·

CC 4.0 BY-SA版权

文章标签：

#hadoop #大数据

grouping sets 联合 Grouping__ID 如何使用

- - - 一、背景

一、背景

(此文的前提是你已经会使用grouping sets 而还不明白Grouping__ID如何使用的情况)在做离线项目开发时，经常会有这种场景：在求某几大类指标时，他们的逻辑其实完全一样，只是其指标所用的去重字段不一样，又或者纬度组合的场景不一样，那么我们一般首先想到的方法是把这几段union all起来。你没有错，union all完全可以解决此场景，但是假如union all的每段代码都很复杂，可能一段代码就有几百甚至达到千行呢？这很正常，对于一个稍微复杂一点的离线项目来说。那么这个时候grouping sets的简化代码的功能就尤为明显了，根据不同维度进行grouping sets的组合搭配完全可以代替union all，而这个时候我们需要区分最终的结果是哪个维度组合而产生呢？这个时候用Grouping__ID就可以用来区分具体的结果是由哪个组合产生的了。具体使用如下：

1.1 union all的使用

假设有如下场景(简化)，
dws表dws_cu_trsf_lead_trans_d结构如下：

字段名	字段类型	备注
region_desc1	string	省
region_desc2	string	市
region_desc3	string	县区
user_name	string	客户
internet_client_type	string	新老客
product_source_type	string	产品类型
product_name	string	产品名称
is_act	string	是否参加活动
is_add_wx	string	是否添加微信
is_buy	string	是否购买
day	string	天分区(全量)

此dws表是已经聚合过的聚合层表，现求省、市、县区的新老客(包括’整体’的类型)、产品类型(包括’整体’的类型)、各产品名称(包括’整体’的类型)对应的参加活动的人数、添加微信人数、购买产品的人数。
假设这里我们使用union all来求：结果如下：


select
    region_desc1
    ,internet_client_type
    ,product_source_type
    ,product_name
    ,count(distinct case when is_act = '1' then user_name end)     --参加活动人数
    ,count(distinct case when is_add_wx = '1' then user_name end)     --添加微信人数
    ,count(distinct case when is_buy = '1' then user_name end)     --购买产品人数
from dws_cu_trsf_lead_trans_d where day = '${IncStartAll}'
group by region_desc1,internet_client_type,product_source_type,product_name

union all

select
    region_desc2
    ,internet_client_type
    ,product_source_type
    ,product_name
    ,count(distinct case when is_act = '1' then user_name end)     --参加活动人数
    ,count(distinct case when is_add_wx = '1' then user_name end)     --添加微信人数
    ,count(distinct case when is_buy = '1' then user_name end)     --购买产品人数
from dws_cu_trsf_lead_trans_d where day = '${IncStartAll}'
group by region_desc1,internet_client_type,product_source_type,product_name

union all

select
    region_desc3
    ,internet_client_type
    ,product_source_type
    ,product_name
    ,count(distinct case when is_act = '1' then user_name end)     --参加活动人数
    ,count(distinct case when is_add_wx = '1' then user_name end)     --添加微信人数
    ,count(distinct case when is_buy = '1' then user_name end)     --购买产品人数
from dws_cu_trsf_lead_trans_d where day = '${IncStartAll}'
group by region_desc1,internet_client_type,product_source_type,product_name

union all

select
    region_desc1
    ,'整体' as internet_client_type
    ,product_source_type
    ,product_name
    ,count(distinct case when is_act = '1' then user_name end)     --参加活动人数
    ,count(distinct case when is_add_wx = '1' then user_name end)     --添加微信人数
    ,count(distinct case when is_buy = '1' then user_name end)     --购买产品人数
from dws_cu_trsf_lead_trans_d where day = '${IncStartAll}'
group by region_desc1,product_source_type,product_name

union all

select
    region_desc2
    ,'整体' as internet_client_type
    ,product_source_type
    ,product_name
    ,count(distinct case when is_act = '1' then user_name end)     --参加活动人数
    ,count(distinct case when is_add_wx = '1' then user_name end)     --添加微信人数
    ,count(distinct case when is_buy = '1' then user_name end)     --购买产品人数
from dws_cu_trsf_lead_trans_d where day = '${IncStartAll}'
group by region_desc1,product_source_type,product_nameunion all

select
    region_desc3
    ,'整体' as internet_client_type
    ,product_source_type
    ,product_name
    ,count(distinct case when is_act = '1' then user_name end)     --参加活动人数
    ,count(distinct case when is_add_wx = '1' then user_name end)     --添加微信人数
    ,count(distinct case when is_buy = '1' then user_name end)     --购买产品人数
from dws_cu_trsf_lead_trans_d where day = '${IncStartAll}'
group by region_desc1,product_source_type,product_name

union all

select
    region_desc1
    ,'整体' as internet_client_type
    ,'整体' as product_source_type
    ,product_name
    ,count(distinct case when is_act = '1' then user_name end)     --参加活动人数
    ,count(distinct case when is_add_wx = '1' then user_name end)     --添加微信人数
    ,count(distinct case when is_buy = '1' then user_name end)     --购买产品人数
from dws_cu_trsf_lead_trans_d where day = '${IncStartAll}'
group by region_desc1,product_name

union all

...

union all

...

太长了，实在不想写，写完会崩溃的。(此处的省、市、县区要求打平设置成同级别观看指标，有的业务场景确实如此)，这个时候Grouping__ID横空出世，完美解决union all的痛点。用Grouping__ID结果如下：

1.2 Grouping__ID的使用

select
    case  when Grouping__ID in ('57','49','41','25','33','17', '9','1') then '省'
          when Grouping__ID in ('58','50','42','26','34','18','10','2') then '市'
          when Grouping__ID in ('60','52','44','28','36','20','12','4') then '县区'
          end as region_type                             --分组字段
    ,case when internet_client_type is null then '整体' else internet_client_type  end as internet_client_type
    ,case when product_source_type is null then '整体' else product_source_type  end as product_source_type
    ,case when product_name is null then '整体' else product_name  end as product_name
    ,count(distinct case when is_act = '1' then user_name end)     --参加活动人数
    ,count(distinct case when is_add_wx = '1' then user_name end)     --添加微信人数
    ,count(distinct case when is_buy = '1' then user_name end)     --购买产品人数
from dws_cu_trsf_lead_trans_d where day = '${IncStartAll}'
group by region_desc1,region_desc2,region_desc3,internet_client_type,product_source_type,product_name
grouping sets(
 --所有维度组合
(region_desc1,internet_client_type,product_source_type,product_name),
(region_desc2,internet_client_type,product_source_type,product_name),
(region_desc3,internet_client_type,product_source_type,product_name),

 --新老客整体维度+其他维度组合
(region_desc1,product_source_type,product_name),
(region_desc2,product_source_type,product_name),
(region_desc3,product_source_type,product_name),

 --产品类型整体维度 + 其他维度组合
(region_desc1,internet_client_type,product_name),
(region_desc2,internet_client_type,product_name),
(region_desc3,internet_client_type,product_name),

 --产品名称整体维度 + 其他维度组合
(region_desc1,internet_client_type,product_source_type),
(region_desc2,internet_client_type,product_source_type),
(region_desc3,internet_client_type,product_source_type),

 --新老客整体维度 + 产品类型整体维度 + 其他维度组合
(region_desc1,product_name),
(region_desc2,product_name),
(region_desc3,product_name),

 --新老客整体维度 + 产品名称整体维度 + 其他维度组合
(region_desc1,product_source_type),
(region_desc2,product_source_type),
(region_desc3,product_source_type),

 --产品类型整体维度 + 产品名称整体维度 + 其他维度组合
(region_desc1,internet_client_type),
(region_desc2,internet_client_type),
(region_desc3,internet_client_type),

--所有维度整体
(region_desc1),
(region_desc2),
(region_desc3)
)

以上在使用grouping sets前需要注意各维度的null值你已经处理好了，因为后续是根据维度的null值判断此维度是整体的。

1.3 Grouping__ID 如何计算

如下图所示，按照维度进行排列，每组grouping set里的维度出现的标记为1，否则为0 (实际上是二进制的0和1)，最后按照反方向把二进制列出来转化成10进制，最后出现的十进制就是Grouping__ID，就可以分的清楚结果到底是哪个组合的了！小伙伴们，是不是很好用呢？纯手打以及整理excel的，看到的点个赞哟！
在这里插入图片描述