hive/gp/oracle外关联时写法的区别-优快云博客

本文链接：https://blog.youkuaiyun.com/cajeep2001/article/details/7742239

本文探讨了在Hive、GP和Oracle数据库中执行外关联查询时，将主表限制条件置于内部或外部的不同影响。通过执行计划分析，发现将条件写在外面对于Oracle和GP而言是更优的选择，避免了全表扫描，而Hive则需要将条件写在内部以提高性能。了解这些差异有助于优化查询性能。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

目前在做hive上表关联的时候，觉得有些以前oracle sql的写法效率特低。尤其是左表的限制条件是写里面，还是写外面的问题，hive和另外两种sql还是有区别的：

oracle：

--写外面

explain plan for select count(*) as cnt
2 from cj_0711_tmp1 a
3 left outer join
4 cj_0711_tmp2 b
5 on (b.stat_date=to_date('20120701')
6 and a.object_id=b.object_id)
7 where a.stat_date=to_date('20120702');

---------------------------------------------------------------------------------------------------------
| Id | Operation                | Name         | Rows | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |              |     1 |    44 |    61   (2)| 00:00:02 |       |       |
|   1 | SORT AGGREGATE          |              |     1 |    44 |            |          |       |       |
|   2 |   HASH JOIN OUTER        |              |    10 |   440 |    61   (2)| 00:00:02 |       |       |
|   3 |    PARTITION RANGE SINGLE|              |    10 |   220 |    30   (0)| 00:00:01 |   KEY |   KEY |
|   4 |     TABLE ACCESS FULL    | CJ_0711_TMP1 |    10 |   220 |    30   (0)| 00:00:01 |   KEY |   KEY |
|   5 |    PARTITION RANGE SINGLE|              |     2 |    44 |    30   (0)| 00:00:01 |   KEY |   KEY |
|   6 |     TABLE ACCESS FULL    | CJ_0711_TMP2 |     2 |    44 |    30   (0)| 00:00:01 |   KEY |   KEY |
---------------------------------------------------------------------------------------------------------

--写里面

explain plan for select count(*) as cnt
from cj_0711_tmp1 a
left outer join
cj_0711_tmp2 b
on (b.stat_date=to_date('20120701')
and a.object_id=b.object_id
and a.stat_date=to_date('20120702'))
;

-----------------------------------------------------------------------------------------------------------
| Id | Operation                  | Name         | Rows | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |              |     1 |    22 |   863K (1)| 04:18:56 |       |       |
|   1 | SORT AGGREGATE            |              |     1 |    22 |            |          |       |       |
|   2 |   NESTED LOOPS OUTER       |              | 28718 |   616K|   863K (1)| 04:18:56 |       |       |
|   3 |    PARTITION RANGE ALL     |              | 28718 |   616K|   613   (1)| 00:00:12 |     1 |    23 |
|   4 |     TABLE ACCESS FULL      | CJ_0711_TMP1 | 28718 |   616K|   613   (1)| 00:00:12 |     1 |    23 |
|   5 |    VIEW                    |              |     1 |       |    30   (0)| 00:00:01 |       |       |
|   6 |     FILTER                 |              |       |       |            |          |       |       |
|   7 |      PARTITION RANGE SINGLE|              |     1 |    22 |    30   (0)| 00:00:01 |   KEY |   KEY |
|   8 |       TABLE ACCESS FULL    | CJ_0711_TMP2 |     1 |    22 |    30   (0)| 00:00:01 |   KEY |   KEY |
-----------------------------------------------------------------------------------------------------------

--通过这两个执行计划，可以很直接的看到，写在外面是正确的写法，写在里面，直接就是partition range all了。。这个执行计划是很糟糕的。

gp上也是类似的：

--写外面

explain
select count(*) as cnt
from cj_0711_tmp1 a
left outer join
cj_0711_tmp2 b
on (a.email=b.email
and b.stat_date=cast('20120701' as date)
)
where a.stat_date=cast('20120701' as date);

Aggregate (cost=27.39..27.40 rows=1 width=0)
   -> Gather Motion 104:1 (slice1; segments: 104) (cost=26.31..27.37 rows=1 width=0)
         -> Aggregate (cost=26.31..26.32 rows=1 width=0)
               -> Hash Left Join (cost=0.01..26.30 rows=1 width=0)
                     Hash Cond: a.email::text = b.email::text
                     -> Append (cost=0.00..26.25 rows=1 width=274)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120701 a (cost=0.00..26.25 rows=1 width=274)
                                 Filter: stat_date = '2012-07-01'::date
                     -> Hash (cost=0.00..0.00 rows=1 width=274)
                           -> Append (cost=0.00..0.00 rows=1 width=274)
                                 -> Seq Scan on cj_0711_tmp2_1_prt_p20120701 b (cost=0.00..0.00 rows=1 width=274)
                                       Filter: stat_date = '2012-07-01'::date

--写里面

Aggregate (cost=5212.91..5212.92 rows=1 width=0)
   -> Gather Motion 104:1 (slice1; segments: 104) (cost=5211.84..5212.90 rows=1 width=0)
         -> Aggregate (cost=5211.84..5211.85 rows=1 width=0)
               -> Hash Left Join (cost=0.01..4845.12 rows=1411 width=0)
                     Hash Cond: a.email::text = b.email::text
                     Join Filter: a.stat_date = '2012-07-01'::date
                     -> Append (cost=0.00..3010.80 rows=1411 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120601 a (cost=0.00..23.40 rows=11 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120602 a (cost=0.00..23.40 rows=11 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120603 a (cost=0.00..23.40 rows=11 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120604 a (cost=0.00..173.55 rows=82 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120605 a (cost=0.00..111.15 rows=53 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120606 a (cost=0.00..23.40 rows=11 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120607 a (cost=0.00..173.55 rows=82 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120608 a (cost=0.00..23.40 rows=11 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120609 a (cost=0.00..23.40 rows=11 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120610 a (cost=0.00..23.40 rows=11 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120611 a (cost=0.00..173.55 rows=82 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120612 a (cost=0.00..111.15 rows=53 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120613 a (cost=0.00..23.40 rows=11 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120614 a (cost=0.00..175.50 rows=83 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120615 a (cost=0.00..23.40 rows=11 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120616 a (cost=0.00..23.40 rows=11 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120617 a (cost=0.00..23.40 rows=11 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120618 a (cost=0.00..175.50 rows=83 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120619 a (cost=0.00..111.15 rows=53 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120620 a (cost=0.00..23.40 rows=11 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120621 a (cost=0.00..175.50 rows=83 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120622 a (cost=0.00..23.40 rows=11 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120623 a (cost=0.00..23.40 rows=11 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120624 a (cost=0.00..23.40 rows=11 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120625 a (cost=0.00..173.55 rows=82 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120626 a (cost=0.00..113.10 rows=53 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120627 a (cost=0.00..23.40 rows=11 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120628 a (cost=0.00..173.55 rows=82 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120629 a (cost=0.00..23.40 rows=11 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120630 a (cost=0.00..23.40 rows=11 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120701 a (cost=0.00..23.40 rows=11 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120702 a (cost=0.00..173.55 rows=82 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120703 a (cost=0.00..115.05 rows=54 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120704 a (cost=0.00..23.40 rows=11 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120705 a (cost=0.00..173.55 rows=82 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120706 a (cost=0.00..23.40 rows=11 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120707 a (cost=0.00..21.45 rows=11 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120708 a (cost=0.00..21.45 rows=11 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120709 a (cost=0.00..173.55 rows=82 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120710 a (cost=0.00..0.00 rows=1 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120711 a (cost=0.00..0.00 rows=1 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120712 a (cost=0.00..0.00 rows=1 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120713 a (cost=0.00..0.00 rows=1 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120714 a (cost=0.00..0.00 rows=1 width=282)
                           -> Seq Scan on cj_0711_tmp1_1_prt_p20120715 a (cost=0.00..0.00 rows=1 width=282)
                     -> Hash (cost=0.00..0.00 rows=1 width=274)
                           -> Append (cost=0.00..0.00 rows=1 width=274)
                                 -> Seq Scan on cj_0711_tmp2_1_prt_p20120701 b (cost=0.00..0.00 rows=1 width=274)
                                       Filter: stat_date = '2012-07-01'::date

gp和oracle是一样的，放在外面的是正确写法，放在里面的话，会执行全表扫面，而不会直接走分区。

hive则不同，测试过程如下：

--写外面

explain
select count(*) as cnt
from cj_0711_tmp1 a
left outer join
cj_0711_tmp2 b
on (a.id=b.id
and b.hp_stat_date='2012-06-02')
where a.hp_stat_date='2012-06-02' ;

(TOK_QUERY
  (TOK_FROM
   (TOK_LEFTOUTERJOIN
    (TOK_TABREF cj_0711_tmp1 a)
    (TOK_TABREF cj_0711_tmp2 b)
    (and
     (=
      (.
       (TOK_TABLE_OR_COL a)
      id)
      (.
       (TOK_TABLE_OR_COL b)
      id)
     )
     (=
      (.
       (TOK_TABLE_OR_COL b)
       hp_stat_date
      )
      '2012-06-02'
     )
    )
   )
  )
  (
   TOK_INSERT
    (TOK_DESTINATION
     (TOK_DIR TOK_TMP_FILE)
    )
    (TOK_SELECT
     (TOK_SELEXPR
      (TOK_FUNCTIONSTAR count)
     cnt)
    )
    (TOK_WHERE
     (=
      (.
       (TOK_TABLE_OR_COL a)
      hp_stat_date)
     '2012-06-02')
    )
   )
  )

--写在里面

explain
select count(*) as cnt
from cj_0711_tmp1 a
left outer join
cj_0711_tmp2 b
on (a.id=b.id
and b.hp_stat_date='2012-06-02'
and a.hp_stat_date='2012-06-02')
;

(TOK_QUERY
(TOK_FROM
  (TOK_LEFTOUTERJOIN
   (TOK_TABREF cj_0711_tmp1 a)
   (TOK_TABREF cj_0711_tmp2 b)
   (and
    (and
     (=
      (.
       (TOK_TABLE_OR_COL a)
      id)
      (.
       (TOK_TABLE_OR_COL b)
      id)
     )
     (=
      (.
       (TOK_TABLE_OR_COL b)
      hp_stat_date)
     '2012-06-02')
    )
    (=
     (.
      (TOK_TABLE_OR_COL a)
     hp_stat_date)
    '2012-06-02')
    )
   )
  )
  (TOK_INSERT
   (TOK_DESTINATION
    (TOK_DIR TOK_TMP_FILE)
   )
   (TOK_SELECT
    (TOK_SELEXPR
     (TOK_FUNCTIONSTAR count)
    cnt)
   )
  )
)