hive/gp/oracle外关联时写法的区别

本文探讨了在Hive、GP和Oracle数据库中执行外关联查询时,将主表限制条件置于内部或外部的不同影响。通过执行计划分析,发现将条件写在外面对于Oracle和GP而言是更优的选择,避免了全表扫描,而Hive则需要将条件写在内部以提高性能。了解这些差异有助于优化查询性能。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

目前在做hive上表关联的时候,觉得有些以前oracle sql的写法效率特低。尤其是左表的限制条件是写里面,还是写外面的问题,hive和另外两种sql还是有区别的:

oracle:

--写外面

explain plan for select count(*) as cnt
  2  from cj_0711_tmp1 a
  3  left outer join
  4             cj_0711_tmp2 b
  5  on (b.stat_date=to_date('20120701')
  6  and a.object_id=b.object_id)
  7  where a.stat_date=to_date('20120702');

---------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name         | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |              |     1 |    44 |    61   (2)| 00:00:02 |       |       |
|   1 |  SORT AGGREGATE          |              |     1 |    44 |            |          |       |       |
|   2 |   HASH JOIN OUTER        |              |    10 |   440 |    61   (2)| 00:00:02 |       |       |
|   3 |    PARTITION RANGE SINGLE|              |    10 |   220 |    30   (0)| 00:00:01 |   KEY |   KEY |
|   4 |     TABLE ACCESS FULL    | CJ_0711_TMP1 |    10 |   220 |    30   (0)| 00:00:01 |   KEY |   KEY |
|   5 |    PARTITION RANGE SINGLE|              |     2 |    44 |    30   (0)| 00:00:01 |   KEY |   KEY |
|   6 |     TABLE ACCESS FULL    | CJ_0711_TMP2 |     2 |    44 |    30   (0)| 00:00:01 |   KEY |   KEY |
---------------------------------------------------------------------------------------------------------

--写里面

explain plan for select count(*) as cnt
from cj_0711_tmp1 a
left outer join
    cj_0711_tmp2 b
on (b.stat_date=to_date('20120701')
and a.object_id=b.object_id
and a.stat_date=to_date('20120702'))
;

-----------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name         | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |              |     1 |    22 |   863K  (1)| 04:18:56 |       |       |
|   1 |  SORT AGGREGATE            |              |     1 |    22 |            |          |       |       |
|   2 |   NESTED LOOPS OUTER       |              | 28718 |   616K|   863K  (1)| 04:18:56 |       |       |
|   3 |    PARTITION RANGE ALL     |              | 28718 |   616K|   613   (1)| 00:00:12 |     1 |    23 |
|   4 |     TABLE ACCESS FULL      | CJ_0711_TMP1 | 28718 |   616K|   613   (1)| 00:00:12 |     1 |    23 |
|   5 |    VIEW                    |              |     1 |       |    30   (0)| 00:00:01 |       |       |
|   6 |     FILTER                 |              |       |       |            |          |       |       |
|   7 |      PARTITION RANGE SINGLE|              |     1 |    22 |    30   (0)| 00:00:01 |   KEY |   KEY |
|   8 |       TABLE ACCESS FULL    | CJ_0711_TMP2 |     1 |    22 |    30   (0)| 00:00:01 |   KEY |   KEY |
-----------------------------------------------------------------------------------------------------------

--通过这两个执行计划,可以很直接的看到,写在外面是正确的写法,写在里面,直接就是partition range all了。。这个执行计划是很糟糕的。

gp上也是类似的:

--写外面

explain
  select count(*) as cnt
  from cj_0711_tmp1 a
  left outer join
  cj_0711_tmp2 b
  on (a.email=b.email
  and b.stat_date=cast('20120701' as date)
  )
  where a.stat_date=cast('20120701' as date);
 
 Aggregate  (cost=27.39..27.40 rows=1 width=0)
   ->  Gather Motion 104:1  (slice1; segments: 104)  (cost=26.31..27.37 rows=1 width=0)
         ->  Aggregate  (cost=26.31..26.32 rows=1 width=0)
               ->  Hash Left Join  (cost=0.01..26.30 rows=1 width=0)
                     Hash Cond: a.email::text = b.email::text
                     ->  Append  (cost=0.00..26.25 rows=1 width=274)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120701 a  (cost=0.00..26.25 rows=1 width=274)
                                 Filter: stat_date = '2012-07-01'::date
                     ->  Hash  (cost=0.00..0.00 rows=1 width=274)
                           ->  Append  (cost=0.00..0.00 rows=1 width=274)
                                 ->  Seq Scan on cj_0711_tmp2_1_prt_p20120701 b  (cost=0.00..0.00 rows=1 width=274)
                                       Filter: stat_date = '2012-07-01'::date

--写里面

Aggregate  (cost=5212.91..5212.92 rows=1 width=0)
   ->  Gather Motion 104:1  (slice1; segments: 104)  (cost=5211.84..5212.90 rows=1 width=0)
         ->  Aggregate  (cost=5211.84..5211.85 rows=1 width=0)
               ->  Hash Left Join  (cost=0.01..4845.12 rows=1411 width=0)
                     Hash Cond: a.email::text = b.email::text
                     Join Filter: a.stat_date = '2012-07-01'::date
                     ->  Append  (cost=0.00..3010.80 rows=1411 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120601 a  (cost=0.00..23.40 rows=11 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120602 a  (cost=0.00..23.40 rows=11 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120603 a  (cost=0.00..23.40 rows=11 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120604 a  (cost=0.00..173.55 rows=82 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120605 a  (cost=0.00..111.15 rows=53 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120606 a  (cost=0.00..23.40 rows=11 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120607 a  (cost=0.00..173.55 rows=82 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120608 a  (cost=0.00..23.40 rows=11 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120609 a  (cost=0.00..23.40 rows=11 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120610 a  (cost=0.00..23.40 rows=11 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120611 a  (cost=0.00..173.55 rows=82 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120612 a  (cost=0.00..111.15 rows=53 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120613 a  (cost=0.00..23.40 rows=11 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120614 a  (cost=0.00..175.50 rows=83 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120615 a  (cost=0.00..23.40 rows=11 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120616 a  (cost=0.00..23.40 rows=11 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120617 a  (cost=0.00..23.40 rows=11 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120618 a  (cost=0.00..175.50 rows=83 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120619 a  (cost=0.00..111.15 rows=53 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120620 a  (cost=0.00..23.40 rows=11 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120621 a  (cost=0.00..175.50 rows=83 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120622 a  (cost=0.00..23.40 rows=11 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120623 a  (cost=0.00..23.40 rows=11 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120624 a  (cost=0.00..23.40 rows=11 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120625 a  (cost=0.00..173.55 rows=82 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120626 a  (cost=0.00..113.10 rows=53 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120627 a  (cost=0.00..23.40 rows=11 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120628 a  (cost=0.00..173.55 rows=82 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120629 a  (cost=0.00..23.40 rows=11 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120630 a  (cost=0.00..23.40 rows=11 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120701 a  (cost=0.00..23.40 rows=11 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120702 a  (cost=0.00..173.55 rows=82 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120703 a  (cost=0.00..115.05 rows=54 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120704 a  (cost=0.00..23.40 rows=11 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120705 a  (cost=0.00..173.55 rows=82 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120706 a  (cost=0.00..23.40 rows=11 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120707 a  (cost=0.00..21.45 rows=11 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120708 a  (cost=0.00..21.45 rows=11 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120709 a  (cost=0.00..173.55 rows=82 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120710 a  (cost=0.00..0.00 rows=1 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120711 a  (cost=0.00..0.00 rows=1 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120712 a  (cost=0.00..0.00 rows=1 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120713 a  (cost=0.00..0.00 rows=1 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120714 a  (cost=0.00..0.00 rows=1 width=282)
                           ->  Seq Scan on cj_0711_tmp1_1_prt_p20120715 a  (cost=0.00..0.00 rows=1 width=282)
                     ->  Hash  (cost=0.00..0.00 rows=1 width=274)
                           ->  Append  (cost=0.00..0.00 rows=1 width=274)
                                 ->  Seq Scan on cj_0711_tmp2_1_prt_p20120701 b  (cost=0.00..0.00 rows=1 width=274)
                                       Filter: stat_date = '2012-07-01'::date

gp和oracle是一样的,放在外面的是正确写法,放在里面的话,会执行全表扫面,而不会直接走分区。

hive则不同,测试过程如下:

--写外面

explain
select count(*) as cnt
from cj_0711_tmp1 a
left outer join
cj_0711_tmp2 b
on (a.id=b.id
and b.hp_stat_date='2012-06-02')
where a.hp_stat_date='2012-06-02'  ;

 (TOK_QUERY
  (TOK_FROM
   (TOK_LEFTOUTERJOIN
    (TOK_TABREF cj_0711_tmp1 a)
    (TOK_TABREF cj_0711_tmp2 b)
    (and
     (=
      (.
       (TOK_TABLE_OR_COL a)
      id)
      (.
       (TOK_TABLE_OR_COL b)
      id)
     )
     (=
      (.
       (TOK_TABLE_OR_COL b)
       hp_stat_date
      )
      '2012-06-02'
     )
    )
   )
  )
  (
   TOK_INSERT
    (TOK_DESTINATION
     (TOK_DIR TOK_TMP_FILE)
    )
    (TOK_SELECT
     (TOK_SELEXPR
      (TOK_FUNCTIONSTAR count)
     cnt)
    )
    (TOK_WHERE
     (=
      (.
       (TOK_TABLE_OR_COL a)
      hp_stat_date)
     '2012-06-02')
    )
   )
  )

--写在里面

explain
select count(*) as cnt
from cj_0711_tmp1 a
left outer join
cj_0711_tmp2 b
on (a.id=b.id
and b.hp_stat_date='2012-06-02'
and a.hp_stat_date='2012-06-02')
;

(TOK_QUERY
   (TOK_FROM
    (TOK_LEFTOUTERJOIN
     (TOK_TABREF cj_0711_tmp1 a)
     (TOK_TABREF cj_0711_tmp2 b)
     (and
      (and
       (=
        (.
         (TOK_TABLE_OR_COL a)
        id)
        (.
         (TOK_TABLE_OR_COL b)
        id)
       )
       (=
        (.
         (TOK_TABLE_OR_COL b)
        hp_stat_date)
       '2012-06-02')
      )
      (=
       (.
        (TOK_TABLE_OR_COL a)
       hp_stat_date)
      '2012-06-02')
      )
     )
    )
    (TOK_INSERT
     (TOK_DESTINATION
      (TOK_DIR TOK_TMP_FILE)
     )
     (TOK_SELECT
      (TOK_SELEXPR
       (TOK_FUNCTIONSTAR count)
      cnt)
     )
    )
   )

--很显然,写在里面的是标准写法,语法树中两个and是一起执行;而写在外面的话,b表的分区字段是放在最后面进行insert的时候,才执行的。

so,hive和gp,oracle在写外关联的时候,需要注意下主表的限制条件,前者是需要放里面的,否则会到整个执行计划的最后才开始筛选,浪费系统的性能。

后两者的话,需要放在外面,这样的效率最好,和hive刚好是相反。

具体的hive语法树怎么生成的,还要后续研究代码。。

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值