下面这个SQL,优化前执行时间8.5分钟,优化后执行时间0.8秒,性能提高638倍:
SELECT
SUM(t623064.mtd_quantity) AS c1,
t622947.year_name AS c2
FROM dw.dim_caliber_d t621885 /* AD Dim11 Caliber */,
dw.dim_manage_org_v t621142 /* AD Dim03 Manager Org */,
dw.dim_account_v t621070 /* AD Dim02 Account */,
dw.dim_period_v t622947 /* AD Dim22 Year */,
dw.dm_fin_om_sales_pln_a t623064 /* AD Fct022 Fin Om Sales Year Pln */
WHERE t621070.account_wid = t623064.account_wid
AND t621070.language_code = 'ZHS'
AND t621070.account_category = 'Q'
AND t621142.bd_short_name = '家用'
AND t621142.bg_code = 'M2'
AND t621142.language_code = 'ZHS'
AND t621142.manage_org_wid = t623064.manage_org_wid
AND t621885.caliber_wid = t623064.caliber_wid
AND t622947.language_code = 'ZHS'
AND t622947.period_type = 'Y'
AND t622947.period_wid = t623064.period_wid
AND (t621070.account_code IN ('FIN0201', 'FIN070104', 'FIN070106', 'MFG0101'))
AND (t621885.caliber_code IN ('0', '10', '40', '60'))
AND t621142.bd_code <> 'M100'
GROUP BY t622947.year_name;
这是一个比较典型的BI星形联接进行统计分析的SQL,dw.dm_fin_om_sales_pln_a表是fact表(300万条数据),其他几个以DIM开头的表(视图)是维度表(几百到几万条数据),执行计划如下:
其中,ID=26的索引DM_FIN_OM_SALES_PLN_A_N5是创建在MANAGE_ORG_WID字段上的单字段索引。
从这个执行计划看,应该不需要8分钟这么多的时间(估算时间是7秒)。但是通过sql monitor显示的实际数据来看,是优化器对各维度表使用谓词之后返回的结果估值偏小,导致最后的笛卡尔join返回了一个巨大的值986M:
从sqlhc收集的信息显示,上面用到的多个表的最后统计信息收集时间是在一两年前(而且经过count(*)确认,Num Rows列显示的记录数基本与实际记录数一致):
看到这里,很多人首先想到的是要对这些表重新收集统计信息,让优化器能够正确的评估行源(系统开启了自动收集直方图,各维度表相关字段都有直方图信息)。
但是,sqlhc同时又告诉我们,统计信息并没有问题(系统开启了自动统计信息收集,记录改变超过10%会自动重新收集) Stale Stats=NO:
针对这种典型的星形联接统计分析的BI SQL,我使用了下面两种方法进行优化:
1、SQL改写,将普通的join改为in的semi-join(前提是各维度表的XXX_id字段是唯一的,这样的改写结果才是等价的):
SELECT SUM(t623064.mtd_quantity) AS c1,
t622947.year_name AS c2
FROM dw.dim_period_v t622947 /* AD Dim22 Year */,
dw.dm_fin_om_sales_pln_a t623064 /* AD Fct022 Fin Om Sales Year Pln */
where t622947.period_wid = t623064.period_wid AND t622947.language_code = 'ZHS' AND t622947.period_type = 'Y'
and t623064.account_wid in (select account_wid from dw.dim_account_v t621070 where t621070.language_code = 'ZHS' AND t621070.account_category = 'Q' and t621070.account_code IN ('FIN0201', 'FIN070104', 'FIN070106', 'MFG0101') )
and t623064.manage_org_wid in (select manage_org_wid from dw.dim_manage_org_v t621142 where t621142.bd_short_name = '家用' AND t621142.bg_code = 'M2' AND t621142.language_code = 'ZHS' AND t621142.bd_code <> 'M100')
and t623064.caliber_wid in (select caliber_wid from dw.dim_caliber_d t621885 where t621885.caliber_code IN ('0', '10', '40', '60'))
group by t622947.year_name;
为什么dw.dim_period_v没有使用in,那是因为最后统计结果需要用到这个表除了ID之外的其他字段,不能用IN。
2、使用Hint:
SELECT /*+ STAR_TRANSFORMATION opt_param('star_transformation_enabled' 'true') */
SUM(t623064.mtd_quantity) AS c1,
t622947.year_name AS c2
FROM dw.dim_caliber_d t621885 /* AD Dim11 Caliber */,
dw.dim_manage_org_v t621142 /* AD Dim03 Manager Org */,
dw.dim_account_v t621070 /* AD Dim02 Account */,
dw.dim_period_v t622947 /* AD Dim22 Year */,
dw.dm_fin_om_sales_pln_a t623064 /* AD Fct022 Fin Om Sales Year Pln */
WHERE t621070.account_wid = t623064.account_wid
AND t621070.language_code = 'ZHS'
AND t621070.account_category = 'Q'
AND t621142.bd_short_name = '家用'
AND t621142.bg_code = 'M2'
AND t621142.language_code = 'ZHS'
AND t621142.manage_org_wid = t623064.manage_org_wid
AND t621885.caliber_wid = t623064.caliber_wid
AND t622947.language_code = 'ZHS'
AND t622947.period_type = 'Y'
AND t622947.period_wid = t623064.period_wid
AND (t621070.account_code IN ('FIN0201', 'FIN070104', 'FIN070106', 'MFG0101'))
AND (t621885.caliber_code IN ('0', '10', '40', '60'))
AND t621142.bd_code <> 'M100'
GROUP BY t622947.year_name;
上面两个SQL的执行时间不到1秒(0.8s),执行计划都是一样的:
执行计划显示有Bitmap Index字样,实际上各表都没有创建bitmap index,这个是由这个隐含参数控制的:_b_tree_bitmap_plans,这个参数默认是True,很多情况下这个参数的问题比较多,很多系统都把这个参数改成了False,这星形联接的情况下,这个参数还是比较有用的。
针对这种星形联接的SQL,还有其他多种优化方法,比如创建事实表上的联合索引、创建位图联接索引等,前者组合较多,效果一般;后者改动比较复杂,性价比不高,这里不再赘述。