很多数据库都存在group by cube效率低于等价多个非cube group by 语句union all的现象,以duckdb为例。
我们基于内存先建立一个一亿行的测试表,包含5个列,分别填充不大于2、100、30、60、5的整数,模拟实际工作中不同规模的分组列。
C:\d>duckdb130
DuckDB v1.3.0 (Ossivalis) 71c5c07cdd
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D .timer on
D create temp table ren as select
((random()*1000)::int%2+1)::int c1,
((random()*1000)::int%100+1)::int c2,
((random()*1000)::int%30+1)::int c3,
((random()*1000)::int%60+1)::int c4,
((random()*1000)::int%5+1)::int c5
from generate_series(1,100000000::int);
Run Time (s): real 10.711 user 10.250000 sys 0.406250
代码1:直接对原始表group by cube, 为精简输出,将结果存入另一个表,用时8秒
D create temp table rc1c2c3c4 as select c1,c2,c3,c4,sum(1)cnt from ren group by cube(c1,c2,c3,c4);
Run Time (s): real 8.233 user 117.093750 sys 1.484375
代码2:直接对原始表group by再union all, 将结果存入另一个表,用时3秒,注意未参与分组的列要用null填充
D create temp table rc1c2c3c4m as
select c1,c2,c3,c4,sum(1)cnt from ren group by c1,c2,c3,c4 union all
select c1,null,null,null,sum(1)cnt from ren group by c1 union all
select null,c2,null,null,sum(1)cnt from ren group by c2 union all
select null,null,c3,null,sum(1)cnt from ren group by c3 union all
select null,null,null,c4,sum(1)cnt from ren group by c4 union all
select c1,c2,null,null,sum(1)cnt from ren group by c1,c2 union all
select c1,null,c3,null,sum(1)cnt from ren group by c1,c3 union all
select c1,null,null,c4,sum(1)cnt from ren group by c1,c4 union all
select null,c2,c3,null,sum(1)cnt from ren group by c2,c3 union all
select null,c2,null,c4,sum(1)cnt from ren group by c2,c4 union all
select null,null,c3,c4,sum(1)cnt from ren group by c3,c4 union all
select c1,c2,c3,null,sum(1)cnt from ren group by c1,c2,c3 union all
select c1,c2,null,c4,sum(1)cnt from ren group by c1,c2,c4 union all
select c1,null,c3,c4,sum(1)cnt from ren group by c1,c3,c4 union all
select null,c2,c3,c4,sum(1)cnt from ren group by c2,c3,c4 union all
select null,null,null,null,sum(1)cnt from ren ;
Run Time (s): real 3.230 user 42.203125 sys 0.515625
代码3:对原始表group by全部分组列后存入实体化的CTE,再对它group by再union all, 将结果存入另一个表,用时1.3秒,需要指出,这种等价改写有局限性,比如只能对计数和求和使用,而且不适用于有distinct的情况。
D create temp table rc1c2c3c4m2 as
with t0 as materialized (select c1,c2,c3,c4,sum(1 )cnt from ren group by c1,c2,c3,c4 )
select c1,c2,c3,c4,cnt from t0 union all
select c1,null,null,null,sum(cnt)cnt from t0 group by c1 union all
select null,c2,null,null,sum(cnt)cnt from t0 group by c2 union all
select null,null,c3,null,sum(cnt)cnt from t0 group by c3 union all
select null,null,null,c4,sum(cnt)cnt from t0 group by c4 union all
select c1,c2,null,null,sum(cnt)cnt from t0 group by c1,c2 union all
select c1,null,c3,null,sum(cnt)cnt from t0 group by c1,c3 union all
select c1,null,null,c4,sum(cnt)cnt from t0 group by c1,c4 union all
select null,c2,c3,null,sum(cnt)cnt from t0 group by c2,c3 union all
select null,c2,null,c4,sum(cnt)cnt from t0 group by c2,c4 union all
select null,null,c3,c4,sum(cnt)cnt from t0 group by c3,c4 union all
select c1,c2,c3,null,sum(cnt)cnt from t0 group by c1,c2,c3 union all
select c1,c2,null,c4,sum(cnt)cnt from t0 group by c1,c2,c4 union all
select c1,null,c3,c4,sum(cnt)cnt from t0 group by c1,c3,c4 union all
select null,c2,c3,c4,sum(cnt)cnt from t0 group by c2,c3,c4 union all
select null,null,null,null,sum(cnt)cnt from t0 ;
Run Time (s): real 1.328 user 13.562500 sys 0.437500
以上两种改写的行数随分组列个数增加呈指数级增加,正确编写很费时费力,我为此还专门让DeepSeek帮我实现了一个将group by cube查询语句转为等价CTE的转换程序,现在看来也没有必要了,因为有下面第4种代码。
代码4:对原始表group by全部分组列后存入实体化的CTE,再对它group by cube, 将结果存入另一个表,用时1.5秒
D create temp table rc1c2c3c4m3 as
with t0 as materialized (select c1,c2,c3,c4,sum(1 )cnt from ren group by c1,c2,c3,c4 )
select c1,c2,c3,c4,sum(cnt )cnt from t0 group by cube(c1,c2,c3,c4);
Run Time (s): real 1.477 user 13.421875 sys 0.265625
仅仅在原始group by cube代码前面加入一行,我们实现了加速5倍的效果,而且保留了cube结果标志列grouping_id(),便于区分不同的汇总层次。
同样的思路也可用于group by rollup, 但是由于这种分组的组合本来就少,提速效果就没那么明显。
D create temp table roc1c2c3c4 as select c1,c2,c3,c4,sum(1)cnt from ren group by rollup(c1,c2,c3,c4);
Run Time (s): real 2.073 user 24.796875 sys 0.375000
D create temp table roc1c2c3c4m3 as
with t0 as materialized (select c1,c2,c3,c4,sum(1 )cnt from ren group by c1,c2,c3,c4 )
select c1,c2,c3,c4,sum(cnt )cnt from t0 group by rollup(c1,c2,c3,c4);
Run Time (s): real 1.315 user 13.406250 sys 0.156250
换用另一种分析型单机数据库glaredb做同样的测试,以验证这种方法的普适性。它没有windows版,所以在WSL中执行。
root@DESKTOP-59T6U68:/mnt/c/d# ./glaredb6
GlareDB Shell
v25.6.1
Enter .help for usage hints.
glaredb> .timer on
glaredb> create temp table ren as select
... ((random()*1000)::int%2+1)::int c1,
... ((random()*1000)::int%100+1)::int c2,
... ((random()*1000)::int%30+1)::int c3,
... ((random()*1000)::int%60+1)::int c4,
... ((random()*1000)::int%5+1)::int c5
... from generate_series(1,100000000::int);
┌───────────────┐
│ rows_inserted │
│ Int64 │
├───────────────┤
│ 100000000 │
└───────────────┘
Execution duration (s): 6.74959
glaredb> create temp table rc1c2c3c4 as select c1,c2,c3,c4,sum(1)cnt from ren group by cube(c1,c2,c3,c4); --代码1
┌───────────────┐
│ rows_inserted │
│ Int64 │
├───────────────┤
│ 572973 │
└───────────────┘
Execution duration (s): 15.61688
glaredb> create temp table rc1c2c3c4m as ...; --代码2
┌───────────────┐
│ rows_inserted │
│ Int64 │
├───────────────┤
│ 572973 │
└───────────────┘
Execution duration (s): 6.58844
glaredb> create temp table rc1c2c3c4m2 as ...; --代码3
┌───────────────┐
│ rows_inserted │
│ Int64 │
├───────────────┤
│ 572973 │
└───────────────┘
Execution duration (s): 1.52624
glaredb> create temp table rc1c2c3c4m3 as
... with t0 as materialized (select c1,c2,c3,c4,sum(1 )cnt from ren group by c1,c2,c3,c4 )
... select c1,c2,c3,c4,sum(cnt )cnt from t0 group by cube(c1,c2,c3,c4); --代码4
┌───────────────┐
│ rows_inserted │
│ Int64 │
├───────────────┤
│ 572973 │
└───────────────┘
Execution duration (s): 1.47236
可见代码4在glaredb中反而比代码3略快,说明这两种写法的效率基本相同。
最后提醒,以上优化都基于一个前提,单次group by 全部分组项的结果行数大大少于原始行数(本例是36万行,只有原始行数的约千分之四),如果不是这样,就不适用。