目前在做hive上表关联的时候,觉得有些以前oracle sql的写法效率特低。尤其是左表的限制条件是写里面,还是写外面的问题,hive和另外两种sql还是有区别的:
oracle:
--写外面
explain plan for select count(*) as cnt
2 from cj_0711_tmp1 a
3 left outer join
4 cj_0711_tmp2 b
5 on (b.stat_date=to_date('20120701')
6 and a.object_id=b.object_id)
7 where a.stat_date=to_date('20120702');
---------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 44 | 61 (2)| 00:00:02 | | |
| 1 | SORT AGGREGATE | | 1 | 44 | | | | |
| 2 | HASH JOIN OUTER | | 10 | 440 | 61 (2)| 00:00:02 | | |
| 3 | PARTITION RANGE SINGLE| | 10 | 220 | 30 (0)| 00:00:01 | KEY | KEY |
| 4 | TABLE ACCESS FULL | CJ_0711_TMP1 | 10 | 220 | 30 (0)| 00:00:01 | KEY | KEY |
| 5 | PARTITION RANGE SINGLE| | 2 | 44 | 30 (0)| 00:00:01 | KEY | KEY |
| 6 | TABLE ACCESS FULL | CJ_0711_TMP2 | 2 | 44 | 30 (0)| 00:00:01 | KEY | KEY |
---------------------------------------------------------------------------------------------------------
--写里面
explain plan for select count(*) as cnt
from cj_0711_tmp1 a
left outer join
cj_0711_tmp2 b
on (b.stat_date=to_date('20120701')
and a.object_id=b.object_id
and a.stat_date=to_date('20120702'))
;
-----------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 22 | 863K (1)| 04:18:56 | | |
| 1 | SORT AGGREGATE | | 1 | 22 | | | | |
| 2 | NESTED LOOPS OUTER | | 28718 | 616K| 863K (1)| 04:18:56 | | |
| 3 | PARTITION RANGE ALL | | 28718 | 616K| 613 (1)| 00:00:12 | 1 | 23 |
| 4 | TABLE ACCESS FULL | CJ_0711_TMP1 | 28718 | 616K| 613 (1)| 00:00:12 | 1 | 23 |
| 5 | VIEW | | 1 | | 30 (0)| 00:00:01 | | |
| 6 | FILTER | | | | | | | |
| 7 | PARTITION RANGE SINGLE| | 1 | 22 | 30 (0)| 00:00:01 | KEY | KEY |
| 8 | TABLE ACCESS FULL | CJ_0711_TMP2 | 1 | 22 | 30 (0)| 00:00:01 | KEY | KEY |
-----------------------------------------------------------------------------------------------------------
--通过这两个执行计划,可以很直接的看到,写在外面是正确的写法,写在里面,直接就是partition range all了。。这个执行计划是很糟糕的。
gp上也是类似的:
--写外面
explain
select count(*) as cnt
from cj_0711_tmp1 a
left outer join
cj_0711_tmp2 b
on (a.email=b.email
and b.stat_date=cast('20120701' as date)
)
where a.stat_date=cast('20120701' as date);
Aggregate (cost=27.39..27.40 rows=1 width=0)
-> Gather Motion 104:1 (slice1; segments: 104) (cost=26.31..27.37 rows=1 width=0)
-> Aggregate (cost=26.31..26.32 rows=1 width=0)
-> Hash Left Join (cost=0.01..26.30 rows=1 width=0)
Hash Cond: a.email::text = b.email::text
-> Append (cost=0.00..26.25 rows=1 width=274)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120701 a (cost=0.00..26.25 rows=1 width=274)
Filter: stat_date = '2012-07-01'::date
-> Hash (cost=0.00..0.00 rows=1 width=274)
-> Append (cost=0.00..0.00 rows=1 width=274)
-> Seq Scan on cj_0711_tmp2_1_prt_p20120701 b (cost=0.00..0.00 rows=1 width=274)
Filter: stat_date = '2012-07-01'::date
--写里面
Aggregate (cost=5212.91..5212.92 rows=1 width=0)
-> Gather Motion 104:1 (slice1; segments: 104) (cost=5211.84..5212.90 rows=1 width=0)
-> Aggregate (cost=5211.84..5211.85 rows=1 width=0)
-> Hash Left Join (cost=0.01..4845.12 rows=1411 width=0)
Hash Cond: a.email::text = b.email::text
Join Filter: a.stat_date = '2012-07-01'::date
-> Append (cost=0.00..3010.80 rows=1411 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120601 a (cost=0.00..23.40 rows=11 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120602 a (cost=0.00..23.40 rows=11 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120603 a (cost=0.00..23.40 rows=11 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120604 a (cost=0.00..173.55 rows=82 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120605 a (cost=0.00..111.15 rows=53 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120606 a (cost=0.00..23.40 rows=11 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120607 a (cost=0.00..173.55 rows=82 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120608 a (cost=0.00..23.40 rows=11 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120609 a (cost=0.00..23.40 rows=11 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120610 a (cost=0.00..23.40 rows=11 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120611 a (cost=0.00..173.55 rows=82 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120612 a (cost=0.00..111.15 rows=53 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120613 a (cost=0.00..23.40 rows=11 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120614 a (cost=0.00..175.50 rows=83 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120615 a (cost=0.00..23.40 rows=11 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120616 a (cost=0.00..23.40 rows=11 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120617 a (cost=0.00..23.40 rows=11 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120618 a (cost=0.00..175.50 rows=83 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120619 a (cost=0.00..111.15 rows=53 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120620 a (cost=0.00..23.40 rows=11 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120621 a (cost=0.00..175.50 rows=83 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120622 a (cost=0.00..23.40 rows=11 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120623 a (cost=0.00..23.40 rows=11 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120624 a (cost=0.00..23.40 rows=11 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120625 a (cost=0.00..173.55 rows=82 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120626 a (cost=0.00..113.10 rows=53 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120627 a (cost=0.00..23.40 rows=11 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120628 a (cost=0.00..173.55 rows=82 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120629 a (cost=0.00..23.40 rows=11 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120630 a (cost=0.00..23.40 rows=11 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120701 a (cost=0.00..23.40 rows=11 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120702 a (cost=0.00..173.55 rows=82 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120703 a (cost=0.00..115.05 rows=54 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120704 a (cost=0.00..23.40 rows=11 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120705 a (cost=0.00..173.55 rows=82 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120706 a (cost=0.00..23.40 rows=11 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120707 a (cost=0.00..21.45 rows=11 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120708 a (cost=0.00..21.45 rows=11 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120709 a (cost=0.00..173.55 rows=82 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120710 a (cost=0.00..0.00 rows=1 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120711 a (cost=0.00..0.00 rows=1 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120712 a (cost=0.00..0.00 rows=1 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120713 a (cost=0.00..0.00 rows=1 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120714 a (cost=0.00..0.00 rows=1 width=282)
-> Seq Scan on cj_0711_tmp1_1_prt_p20120715 a (cost=0.00..0.00 rows=1 width=282)
-> Hash (cost=0.00..0.00 rows=1 width=274)
-> Append (cost=0.00..0.00 rows=1 width=274)
-> Seq Scan on cj_0711_tmp2_1_prt_p20120701 b (cost=0.00..0.00 rows=1 width=274)
Filter: stat_date = '2012-07-01'::date
gp和oracle是一样的,放在外面的是正确写法,放在里面的话,会执行全表扫面,而不会直接走分区。
hive则不同,测试过程如下:
--写外面
explain
select count(*) as cnt
from cj_0711_tmp1 a
left outer join
cj_0711_tmp2 b
on (a.id=b.id
and b.hp_stat_date='2012-06-02')
where a.hp_stat_date='2012-06-02' ;
(TOK_QUERY
(TOK_FROM
(TOK_LEFTOUTERJOIN
(TOK_TABREF cj_0711_tmp1 a)
(TOK_TABREF cj_0711_tmp2 b)
(and
(=
(.
(TOK_TABLE_OR_COL a)
id)
(.
(TOK_TABLE_OR_COL b)
id)
)
(=
(.
(TOK_TABLE_OR_COL b)
hp_stat_date
)
'2012-06-02'
)
)
)
)
(
TOK_INSERT
(TOK_DESTINATION
(TOK_DIR TOK_TMP_FILE)
)
(TOK_SELECT
(TOK_SELEXPR
(TOK_FUNCTIONSTAR count)
cnt)
)
(TOK_WHERE
(=
(.
(TOK_TABLE_OR_COL a)
hp_stat_date)
'2012-06-02')
)
)
)
--写在里面
explain
select count(*) as cnt
from cj_0711_tmp1 a
left outer join
cj_0711_tmp2 b
on (a.id=b.id
and b.hp_stat_date='2012-06-02'
and a.hp_stat_date='2012-06-02')
;
(TOK_QUERY
(TOK_FROM
(TOK_LEFTOUTERJOIN
(TOK_TABREF cj_0711_tmp1 a)
(TOK_TABREF cj_0711_tmp2 b)
(and
(and
(=
(.
(TOK_TABLE_OR_COL a)
id)
(.
(TOK_TABLE_OR_COL b)
id)
)
(=
(.
(TOK_TABLE_OR_COL b)
hp_stat_date)
'2012-06-02')
)
(=
(.
(TOK_TABLE_OR_COL a)
hp_stat_date)
'2012-06-02')
)
)
)
(TOK_INSERT
(TOK_DESTINATION
(TOK_DIR TOK_TMP_FILE)
)
(TOK_SELECT
(TOK_SELEXPR
(TOK_FUNCTIONSTAR count)
cnt)
)
)
)
--很显然,写在里面的是标准写法,语法树中两个and是一起执行;而写在外面的话,b表的分区字段是放在最后面进行insert的时候,才执行的。
so,hive和gp,oracle在写外关联的时候,需要注意下主表的限制条件,前者是需要放里面的,否则会到整个执行计划的最后才开始筛选,浪费系统的性能。
后两者的话,需要放在外面,这样的效率最好,和hive刚好是相反。
具体的hive语法树怎么生成的,还要后续研究代码。。