5.扩展 GROUP BY

SQL分组高级用法

/*************************************************
主题:扩展 GROUP BY

  CUBE ,ROLLUP,GROUPING SETS,GROUPING_ID,GROUP_ID
*************************************************/

1.ROLLUP ,按列的顺序从右到左递减分组小计(含合计)
        -------------------------------------    
         从右到左递减分组     
         统计级别 : SGNYEAR,AREACODE  --1
                     SGNYEAR           --2
                     NULL(合计)        --3
        -------------------------------------
SELECT SGNYEAR,AREACODE,SUM(CARGOVOLUME) S
  FROM TEST3
 WHERE SGNYEAR BETWEEN '2002' AND '2004'
 GROUP BY ROLLUP(SGNYEAR,AREACODE) ;
 
     SGNYEAR   AREACODE                       S
    --------- --------------------- ----------
    2002      150000                     37239
    2002      230000                     12919
    2002      330000                      2327
    2002      360000                         0  --年的某个AREACODE 小计
    2002                                 52485  --年小计
    2003      130000                      1486
    2003      440000                     73087
    2003      640000                         0
    2003                                 74573
    2004      410000                         0
    2004                                     0
                                        127058  --总计
    --------- --------------------- ----------  

2.ROLLUP部分分组, 分组剔除某些字段来的小计(不含合计)
    
SELECT SGNYEAR,AREACODE,SUM(CARGOVOLUME) S
  FROM TEST3
 WHERE SGNYEAR BETWEEN '2002' AND '2004'
 GROUP BY SGNYEAR ,ROLLUP(AREACODE);--根据年分组,剔除AREARCODE的分组小计 与 整体的合计

    SGNYEAR   AREACODE                       S
    --------- --------------------- ----------
    2002      150000                     37239
    2002      230000                     12919
    2002      330000                      2327
    2002      360000                         0
    2002                                 52485
    2003      130000                      1486
    2003      440000                     73087
    2003      640000                         0
    2003                                 74573
    2004      410000                         0
    2004                                     0
    --------- --------------------- ----------
       
3.CUBE,对不同维度的所有可能分组进行统计(各种可能性分组)
                                
 SELECT SGNYEAR,AREACODE,SUM(CARGOVOLUME) S
   FROM TEST3
  WHERE SGNYEAR BETWEEN '2002' AND '2004'
  GROUP BY CUBE(SGNYEAR,AREACODE) ;  
 
    SGNYEAR   AREACODE                       S
    --------- --------------------- ----------
                                        127058  --总计
              130000                      1486  --AREACODE的小计
              150000                     37239
              230000                     12919
              330000                      2327
              360000                         0
              410000                         0
              440000                     73087
              640000                         0
    2002                                 52485  --SGNYEAR的小计
    2002      150000                     37239  --SGNYEAR,AREACODE的小计
    2002      230000                     12919
    2002      330000                      2327
    2002      360000                         0
    2003                                 74573
    2003      130000                      1486
    2003      440000                     73087
    2003      640000                         0
    2004                                     0
    2004      410000                         0 
    --------- --------------------- ----------

4.CUBE 部分分组
 SELECT SGNYEAR,AREACODE,SUM(CARGOVOLUME) S
   FROM TEST3
  WHERE SGNYEAR BETWEEN '2002' AND '2004'
  GROUP BY SGNYEAR,CUBE(AREACODE) ; --剔除对AREACODE的分组小计 与 整体的合计

    SGNYEAR   AREACODE                       S
    --------- --------------------- ----------
    2002                                 52485
    2002      150000                     37239
    2002      230000                     12919
    2002      330000                      2327
    2002      360000                         0
    2003                                 74573
    2003      130000                      1486
    2003      440000                     73087
    2003      640000                         0
    2004                                     0
    2004      410000                         0
    --------- --------------------- ----------

5.GROUPING SETS 实现小计,没有合计
 SELECT SGNYEAR,AREACODE,SUM(CARGOVOLUME) S
   FROM TEST3
  WHERE SGNYEAR BETWEEN '2002' AND '2004'
  GROUP BY GROUPING SETS(SGNYEAR,AREACODE) ;   --对SGNYEAR,AREACODE分别进行小计统计 
    SGNYEAR   AREACODE                       S
    --------- --------------------- ----------
    2004                                     0
    2002                                 52485
    2003                                 74573
              410000                         0
              330000                      2327
              640000                         0
              360000                         0
              150000                     37239
              230000                     12919
              440000                     73087
              130000                      1486         
    --------- --------------------- ----------   

6.GROUPING SETS 部分统计.对单列进行分组,统计其他维度的小计。
 SELECT SGNYEAR,AREACODE,transtypecode,SUM(CARGOVOLUME) S
   FROM TEST3
  WHERE SGNYEAR BETWEEN '2002' AND '2005'
  GROUP BY SGNYEAR,GROUPING SETS(AREACODE,transtypecode) ;     --统计每年,每个AREACODE 的小计
                                                               --统计每年,每个TRANSTYPECODE 的小计
    SGNYEAR   AREACODE              TRANSTYPECODE                  S
    --------- --------------------- --------------------- ----------
    2002      360000                                               0
    2003      130000                                            1486
    2003      640000                                               0
    2002      330000                                            2327
    2002      230000                                           12919
    2004      410000                                               0
    2003      440000                                           73087
    2005      140000                                           47697
    2005      620000                                            4085
    2002      150000                                           37239
    2002                            3                          15246
    2002                            5                              0
    2002                            6                          37239
    2004                            5                              0
    2005                            3                          47697
    2005                            2                           4085
    2003                            7                              0
    2003                            4                           1486
    2003                            6                          73087                                                                        
    --------- --------------------- --------------------- ----------
   
7.CUBE、ROLLUP 作为 GROUPING SETS 的参数,相当于对每个 CUBE 或 ROLLUP 的操作 UNION ALL
 SELECT AREACODE,transtypecode,SUM(CARGOVOLUME) S
   FROM TEST3
  WHERE SGNYEAR BETWEEN '2002' AND '2005'
  GROUP BY GROUPING SETS(CUBE(AREACODE),ROLLUP(transtypecode));
 
    AREACODE              TRANSTYPECODE                  S
    --------------------- --------------------- ----------
    130000                                            1486
    140000                                           47697
    150000                                           37239
    230000                                           12919
    330000                                            2327
    360000                                               0
    410000                                               0
    440000                                           73087
    620000                                            4085
    640000                                               0
                          6                         110326
                          2                           4085
                          4                           1486
                          3                          62943
                          5                              0
                          7                              0
                                                    178840  --此处两个数据就是ROLLUP的UNION ALL
                                                    178840       
    --------------------- --------------------- ----------     
   
8.组合列分组

  列                    得到结果的描述
 --------------------- --------------------- ----------    
 ROLLUP(a,b,c)         GROUP BY a,b,c   
                       GROUP BY a,b
                       GROUP BY a
                       GROUP BY NULL    
 --------------------- --------------------- ----------
 ROLLUP(a,(b,c))       GROUP BY a,b,c   
                       GROUP BY a            --此处将(b,c)看做一列,所以缩减了
                       GROUP BY NULL  
 --------------------- --------------------- ----------
 --根据年进行分组,得到标准的分组统计
                 --得到 年的小计 ,与标准ROLLUP对比,少了年+编码的小计
 SELECT SGNYEAR,AREACODE,transtypecode,SUM(CARGOVOLUME) S
   FROM TEST3
  WHERE SGNYEAR BETWEEN '2002' AND '2006'
  GROUP BY ROLLUP(SGNYEAR,(AREACODE,transtypecode)) ; 
                                                
    SGNYEAR   AREACODE              TRANSTYPECODE                  S
    --------- --------------------- --------------------- ----------
    2002      150000                6                          37239
    2002      230000                3                          12919  --年份、类型同
    2002      330000                3                           2327  --编码不同
    2002      360000                5                              0
    2002                                                       52485
    2003      130000                4                           1486
    2003      440000                6                          73087
    2003      640000                7                              0
    2003                                                       74573
    2004      410000                5                              0
    2004                                                           0
    2005      140000                3                          47697 
    2005      620000                2                           4085
    2005                                                       51782
    2006      120000                5                              0
    2006      140000                7                             59 
    2006      350000                5                            910
    2006      500000                3                           2000
    2006      530000                5                            171
    2006                                                        3140
    --------- --------------------- --------------------- ----------   
                                     
9.连接分组,允许 GROUP BY 后面有多个 ROLLUP、CUBE、GROUPING SETS,结果是分级别组成的笛卡尔积

 SELECT SGNYEAR,AREACODE,transtypecode,SUM(CARGOVOLUME) S
   FROM TEST3
  WHERE SGNYEAR BETWEEN '2002' AND '2003'
  GROUP BY ROLLUP(AREACODE,transtypecode) ,ROLLUP(SGNYEAR); 

    SGNYEAR   AREACODE              TRANSTYPECODE                  S
    --------- --------------------- --------------------- ----------
              130000                4                           1486
              130000                                            1486
              150000                6                          37239
              150000                                           37239
              230000                3                          12919
              230000                                           12919
              330000                3                           2327
              330000                                            2327
              360000                5                              0
              360000                                               0
              440000                6                          73087
              440000                                           73087
              640000                7                              0
              640000                                               0
                                                              127058
    2002      150000                6                          37239
    2002      150000                                           37239
    2002      230000                3                          12919
    2002      230000                                           12919
    2002      330000                3                           2327
    2002      330000                                            2327
    2002      360000                5                              0
    2002      360000                                               0
    2002                                                       52485
    2003      130000                4                           1486
    2003      130000                                            1486
    2003      440000                6                          73087
    2003      440000                                           73087
    2003      640000                7                              0
    2003      640000                                               0
    2003                                                       74573
    --------- --------------------- --------------------- ----------
   
   
10.重复列分组
 SELECT SGNYEAR,AREACODE,transtypecode,SUM(CARGOVOLUME) S
   FROM TEST3
  WHERE SGNYEAR BETWEEN '2002' AND '2003'
  GROUP BY AREACODE,ROLLUP(AREACODE,transtypecode) ,ROLLUP(SGNYEAR);
  --得到的数据 AREACODE 的小计重复出现
 
  CUBE(A,B,C)==ROLLUP(A),ROLLUP(B),ROLLUP(C)
         
11.GROUPING,对于小计或合计的列返回1,否则返回0. 

 SELECT SGNYEAR,AREACODE,SUM(CARGOVOLUME) S,GROUPING(AREACODE),
        DECODE(GROUPING(AREACODE),1,'小计',NVL(AREACODE,'合计')) ITEM
   FROM TEST3
  WHERE SGNYEAR BETWEEN '2002' AND '2006'
  GROUP BY ROLLUP(SGNYEAR,AREACODE) ;

    SGNYEAR   AREACODE                       S GROUPING(AREACODE) ITEM
    --------- --------------------- ---------- ------------------ ---------------
    2002      150000                     37239                  0 150000
    2002      230000                     12919                  0 230000
    2002      330000                      2327                  0 330000
    2002      360000                         0                  0 360000
    2002                                 52485                  1 小计
    2003      130000                      1486                  0 130000
    2003      440000                     73087                  0 440000
    2003      640000                         0                  0 640000
    2003                                 74573                  1 小计
    2004      410000                         0                  0 410000
    2004                                     0                  1 小计
    2005      140000                     47697                  0 140000
    2005      620000                      4085                  0 620000
    2005                                 51782                  1 小计
    2006      120000                         0                  0 120000
    2006      140000                        59                  0 140000
    2006      350000                       910                  0 350000
    2006      500000                      2000                  0 500000
    2006      530000                       171                  0 530000
    2006                                  3140                  1 小计
                                        181980                  1 小计
    --------- --------------------- ---------- ------------------ ---------------    
  --过滤某些分组结果   
 SELECT SGNYEAR,AREACODE,SUM(CARGOVOLUME) S,DECODE(GROUPING(AREACODE),1,'小计',NVL(AREACODE,'合计')) ITEM
   FROM TEST3
  WHERE SGNYEAR BETWEEN '2002' AND '2006'
  GROUP BY ROLLUP(SGNYEAR,AREACODE)
  HAVING GROUPING(SGNYEAR)=1 OR GROUPING(AREACODE)=0;

12.GROUPING_ID,对多列进行计算,得到分组级别

   分组        位向量(二进制)  GROUPING_ID结果(根据二进制得到的十进制数)
   --------    ---------         --------------
    A,B,C        000                0 
    A,B          001                1
    A            011                3
    NULL         111                7
   
 SELECT SGNYEAR,AREACODE,SUM(CARGOVOLUME) S,GROUPING(AREACODE) YorN,
        GROUPING_ID(SGNYEAR,AREACODE) ORDERS
   FROM TEST3
  WHERE SGNYEAR BETWEEN '2002' AND '2006'
  GROUP BY ROLLUP(SGNYEAR,AREACODE)               --括号内的内容保持一致
 HAVING GROUPING_ID(SGNYEAR,AREACODE) IN (0,3);   --括号内的内容保持一致,但是可以改变顺序来到达某些统计目的
 
    SGNYEAR   AREACODE                       S       YORN     ORDERS
    --------- --------------------- ---------- ---------- ----------
    2002      150000                     37239          0          0
    2002      230000                     12919          0          0
    2002      330000                      2327          0          0
    2002      360000                         0          0          0
    2003      130000                      1486          0          0
    2003      440000                     73087          0          0
    2003      640000                         0          0          0
    2004      410000                         0          0          0
    2005      140000                     47697          0          0
    2005      620000                      4085          0          0
    2006      120000                         0          0          0
    2006      140000                        59          0          0
    2006      350000                       910          0          0
    2006      500000                      2000          0          0
    2006      530000                       171          0          0
                                        181980          1          3
    --------- --------------------- ---------- ---------- ----------
   
 SELECT SGNYEAR,AREACODE,SUM(CARGOVOLUME) S,GROUPING(AREACODE) YorN,
        GROUPING_ID(SGNYEAR,AREACODE) ORDERS
   FROM TEST3
  WHERE SGNYEAR BETWEEN '2002' AND '2006'
  GROUP BY ROLLUP(SGNYEAR,AREACODE)             
 HAVING GROUPING_ID(AREACODE,SGNYEAR) IN (0,1);                                           

13.GROUP_ID , 区分重复分组的结果,第一次出现为 0,后面每次+1.

 SELECT SGNYEAR,AREACODE,transtypecode,SUM(CARGOVOLUME) S,GROUP_ID() GI,
         GROUPING_ID(AREACODE,transtypecode,SGNYEAR) LV
   FROM TEST3
  WHERE SGNYEAR BETWEEN '2002' AND '2003'
  GROUP BY ROLLUP(AREACODE,transtypecode,SGNYEAR) ,ROLLUP(SGNYEAR)
 HAVING GROUP_ID()=0;
 
14.应用
SELECT SYMBOL,ENDDATE,ITEM,SUM(AMOUNT) ,GROUPING_ID(SYMBOL,ENDDATE,ITEM) LV,
       DECODE(GROUPING_ID(SYMBOL,ENDDATE,ITEM) ,3,'公司总计',0  ,'标准小计',7,'总合计',1,'公司年合计') LVNAME,
       GROUP_ID() GI
  FROM INPUT.TBL_CHN_I_Notes_AdminExpense
 GROUP BY ROLLUP(SYMBOL,ENDDATE,ITEM)
--HAVING GROUPING_ID(SYMBOL,ENDDATE,ITEM) IN (0,3,7)
ORDER BY SYMBOL,ENDDATE,ITEM;

--

SELECT  SYMBOL,ENDDATE,ITEM,SUM(AMOUNT)--求每个公司每年,横跨项目计费用
  FROM INPUT.TBL_CHN_I_Notes_AdminExpense
 GROUP BY SYMBOL,ENDDATE,ROLLUP(ITEM)

--

转载于:https://www.cnblogs.com/zihuancc/p/3848287.html

<think>我们正在处理HiveonTez中由于数据倾斜导致的groupby操作性能问题,特别是任务无法完成的情况。结合引用内容,我们将重点讨论如何利用Tez的倾斜优化特性以及相关配置。根据引用[1]和[2]:-引用[1]提到可以通过调整tez.grouping.max-size参数来增加Map任务数量,从而提升并行性。-引用[2]详细介绍了Hive中处理数据倾斜的两种主要策略:将groupby分成两个MR阶段(在Tez中对应两个DAG阶段)以及在Map端进行部分聚合。针对倾斜groupby优化,Tez提供了自动处理倾斜的配置参数。同时,我们也可以手动进行热点key分散。解决方案如下:1.**启用Tez的自动倾斜处理**(推荐首先尝试):```sql--开启groupby倾斜优化SEThive.tez.groupby.skewed=true;--设置倾斜key的阈值(当某个key的记录数超过该值则视为倾斜)SEThive.groupby.skewindata=true;--这个参数在MR引擎中常用,在Tez中通常使用hive.tez.groupby.skewedSEThive.skewjoin.key=100000;--默认值100000,可根据数据量调整```2.**手动分散热点key**(当自动优化效果不足时):引用[2]中提到的方法:将groupby拆分成两个阶段。在Tez中,我们可以通过子查询手动实现:```sqlSELECTkey,SUM(partial_sum)AStotal_sumFROM(SELECTkey,--第一阶段:在map端进行部分聚合SUM(value)ASpartial_sumFROMsource_tableGROUPBYkey)stage1GROUPBYkey;```或者对热点key添加随机后缀分散到多个reduce:```sqlSELECTbase_key,SUM(value)AStotal_sumFROM(SELECT--对热点key添加随机后缀(假设我们已知热点key是'hotkey')CASEWHENkey='hotkey'THENCONCAT(key,'_',CAST(CEIL(RAND()*10)ASSTRING))ELSEkeyENDASbase_key,valueFROMsource_table)tempGROUPBYbase_key;```3.**调整并行度参数**(引用[1]):```sql--增加Map任务数量(减小max-size会使得每个Map处理的数据量变小,从而增加Map数量)SETtez.grouping.max-size=104857600;--100MB(默认是1GB)--调整Reduce任务数量SEThive.exec.reducers.bytes.per.reducer=67108864;--64MB/Reducer(默认256MB)```4.**启用Map端聚合**(引用[2]):```sqlSEThive.map.aggr=true;--默认已开启,确保开启SEThive.groupby.mapaggr.checkinterval=100000;--Map端聚合的行数检查间隔```5.**资源调优**(防止单个Reduce因内存不足而失败):```sql--增加Reduce任务内存SETtez.task.resource.memory.mb=8192;--调整JVM堆大小SEThive.tez.java.opts=-Xmx6144m;```6.**监控与诊断**:-使用TezUI(引用[3]中提到的JobName可通过配置自定义,但默认即可)查看任务进度,特别关注每个Reduce处理的数据量。-检查日志中是否有内存溢出(ContainerkilledbyYARN)或GC过度的信息。如果上述方法仍不能解决问题,可能需要分析数据分布(使用引用[2]中的分布检查查询)并针对性优化。相关问题:1.如何通过TezUI准确定位数据倾斜的Reduce任务?2.HiveonTez中如何平衡Map任务数量和资源利用率?3.除了随机后缀,还有哪些手动处理数据倾斜的技巧?</think>### Hive Tez Group By Skewed Optimization 问题解决方案 针对 Tez 引擎中 `GROUP BY` 数据倾斜导致的任务无法完成问题,结合引用内容的核心优化策略,提供以下解决方案: --- #### 一、核心优化策略(基于引用[1][2]) 1. **启用Tez自动倾斜处理** ```sql -- 开启分组倾斜优化 SET hive.tez.groupby.skewed=true; -- 设置倾斜阈值(默认10万) SET hive.skewjoin.key=500000; ``` **原理**:Tez自动检测倾斜Key,将大Key拆分成多个子任务并行处理[^2]。 2. **两阶段聚合优化** ```sql -- 第一阶段:Map端预聚合 SET hive.map.aggr=true; -- 第二阶段:Reduce端最终聚合(引用[2]策略) SET hive.groupby.skewindata=true; ``` **效果**:通过预聚合减少传输数据量,分散Reduce负载[^2]。 3. **动态调整并行度** ```sql -- 减小分组大小提升并行度(引用[1]) SET tez.grouping.max-size=104857600; -- 100MB -- 控制Reduce数量 SET hive.exec.reducers.bytes.per.reducer=134217728; -- 128MB/Reducer ``` **优势**:将任务从2个并发扩展到15+,显著提升并行性能[^1]。 --- #### 二、高级调优技巧 1. **热点Key手动分散** ```sql SELECT CASE WHEN key = '热点值' THEN CONCAT(key, '_', FLOOR(RAND()*10)) ELSE key END AS skewed_key, SUM(value) FROM table GROUP BY skewed_key; ``` **适用场景**:已知明确热点Key(如NULL或特定枚举值)。 2. **内存资源优化** ```sql -- 防止Reduce OOM(基于倾斜数据量) SET tez.task.resource.memory.mb=12288; -- 12GB SET hive.tez.java.opts=-Xmx8192m; ``` 3. **启用CBO优化器** ```sql SET hive.cbo.enable=true; ANALYZE TABLE table COMPUTE STATISTICS FOR COLUMNS key; -- 收集统计信息 ``` --- #### 三、监控与验证 1. **在Tez UI定位倾斜任务** - 访问 `http://<resourcemanager>:8088/tez-ui` - 检查Vertex中各Task的 `Records/Size` 差异 >50%即存在倾斜 2. **倾斜Key分析** ```sql -- 识别Top热点Key SELECT key, COUNT(*) cnt FROM table GROUP BY key ORDER BY cnt DESC LIMIT 10; ``` 3. **效果验证指标** - Reduce任务完成时间标准差 < 30% - GC时间占比 < 10% - 不再出现单个Task运行超时 > **案例**:某日志分析任务优化后效果: > ```diff > - 原状态:1个Reduce卡住2小时 > + 优化后:15个Reduce在8分钟内完成 > ``` > 关键配置: > ```sql > SET hive.tez.groupby.skewed=true; > SET tez.grouping.max-size=100000000; > ``` --- #### 四、常见失败场景处理 | 现象 | 解决方案 | |---------------------|---------------------------------| | `Container killed` | 增加 `tez.task.resource.memory.mb` | | GC overhead | 提升 `hive.tez.java.opts` 值 | | 数据本地性丢失 | 检查HDFS块分布,启用 `tez.locality.delay` |
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值