Hive MapJoin Execution Plans

javastart

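This article compares the execution plans Hive produces with and without MapJoin. Using a concrete SQL example, it shows how adjusting the hive.mapjoin.smalltable.filesize parameter enables the MapJoin optimization, which avoids the shuffle phase and makes processing of large data sets more efficient.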

By changing the value of hive.mapjoin.smalltable.filesize, this article compares the execution plan of the same query with and without mapjoin.

Test SQL:

SELECT id, clienttime
FROM (
  SELECT id, clienttime, key
  FROM log_table
  WHERE day = '20180801'
) a1
LEFT JOIN (SELECT key, field2 FROM key_mapping) a2 ON a1.key = a2.key
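Plans like the ones in the following sections can be printed by prefixing the test query with EXPLAIN after setting the threshold. This is a minimal sketch, not the exact session used for this article: the article does not state which threshold values were used, so the 25000000 shown here (Hive's documented default) is an assumption, as is the explicit hive.auto.convert.join setting.

```sql
-- Assumption: automatic join conversion is enabled (the default in recent Hive releases).
SET hive.auto.convert.join=true;

-- Assumption: 25000000 bytes is Hive's documented default threshold;
-- the article does not say which value it actually used.
SET hive.mapjoin.smalltable.filesize=25000000;

-- EXPLAIN prints the stage dependencies and stage plans without running the query.
EXPLAIN
SELECT id, clienttime
FROM (
  SELECT id, clienttime, key
  FROM log_table
  WHERE day = '20180801'
) a1
LEFT JOIN (SELECT key, field2 FROM key_mapping) a2 ON a1.key = a2.key;
```

Running the same EXPLAIN again after changing only the threshold makes the two plans directly comparable.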

1. Without `mapjoin`

STAGE DEPENDENCIES:
  Stage-4 is a root stage , consists of Stage-1
  Stage-1
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-4
    Conditional Operator

  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: id_app_runtimes
            Statistics: Num rows: 3156646800 Data size: 1666709511947 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: string), clienttime (type: bigint), key (type: string)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 3156646800 Data size: 1666709511947 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col2 (type: string)
                sort order: +
                Map-reduce partition columns: _col2 (type: string)
                Statistics: Num rows: 3156646800 Data size: 1666709511947 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col0 (type: string), _col1 (type: bigint)
          TableScan
            alias: key_mapping
            Statistics: Num rows: 145794 Data size: 29158920 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: key (type: string), field2 (type: string)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 145794 Data size: 29158920 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: string)
                sort order: +
                Map-reduce partition columns: _col0 (type: string)
                Statistics: Num rows: 145794 Data size: 29158920 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col1 (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          keys:
            0 _col2 (type: string)
            1 _col0 (type: string)
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 3472311555 Data size: 1833380502879 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: string), _col1 (type: bigint)
            Statistics: Num rows: 3472311555 Data size: 1833380502879 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 3472311555 Data size: 1833380502879 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
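In this plan key_mapping is reported with a data size of 29158920 bytes, which is above the 25000000-byte default of hive.mapjoin.smalltable.filesize and is consistent with the join staying in the reduce phase. The following is a hedged sketch of how one might confirm the table size and raise the threshold. The value 30000000 is an illustrative assumption, not a value from the article, and the size Hive compares against the threshold may differ from the Data size statistic shown in the plan.

```sql
-- Show table metadata; the Table Parameters section typically lists totalSize (bytes on disk).
DESCRIBE FORMATTED key_mapping;

-- Assumption: 30000000 is an illustrative threshold chosen to exceed the
-- 29158920 bytes reported for key_mapping in the plan above.
SET hive.mapjoin.smalltable.filesize=30000000;
```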

2. With `mapjoin`

STAGE DEPENDENCIES:
  Stage-4 is a root stage , consists of Stage-5, Stage-1
  Stage-5 has a backup stage: Stage-1
  Stage-3 depends on stages: Stage-5
  Stage-1
  Stage-0 depends on stages: Stage-3, Stage-1

STAGE PLANS:
  Stage: Stage-4
    Conditional Operator

  Stage: Stage-5
    Map Reduce Local Work
      Alias -> Map Local Tables:
        a2:key_mapping 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        a2:key_mapping 
          TableScan
            alias: key_mapping
            Statistics: Num rows: 145794 Data size: 29158920 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: key (type: string), field2 (type: string)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 145794 Data size: 29158920 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 _col2 (type: string)
                  1 _col0 (type: string)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: id_app_runtimes
            Statistics: Num rows: 3156646800 Data size: 1666709511947 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: string), clienttime (type: bigint), key (type: string)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 3156646800 Data size: 1666709511947 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 _col2 (type: string)
                  1 _col0 (type: string)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 3472311555 Data size: 1833380502879 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: string), _col1 (type: bigint)
                  Statistics: Num rows: 3472311555 Data size: 1833380502879 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 3472311555 Data size: 1833380502879 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: id_app_runtimes
            Statistics: Num rows: 3156646800 Data size: 1666709511947 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: string), clienttime (type: bigint), key (type: string)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 3156646800 Data size: 1666709511947 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col2 (type: string)
                sort order: +
                Map-reduce partition columns: _col2 (type: string)
                Statistics: Num rows: 3156646800 Data size: 1666709511947 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col0 (type: string), _col1 (type: bigint)
          TableScan
            alias: key_mapping
            Statistics: Num rows: 145794 Data size: 29158920 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: key (type: string), field2 (type: string)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 145794 Data size: 29158920 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: string)
                sort order: +
                Map-reduce partition columns: _col0 (type: string)
                Statistics: Num rows: 145794 Data size: 29158920 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col1 (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          keys:
            0 _col2 (type: string)
            1 _col0 (type: string)
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 3472311555 Data size: 1833380502879 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: string), _col1 (type: bigint)
            Statistics: Num rows: 3472311555 Data size: 1833380502879 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 3472311555 Data size: 1833380502879 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

In the execution plans above: