Big Data Learning 25: Testing distribute by in Hive

This post works through a series of tests of Hive's distribute by and of how it combines with group by, with the goal of spreading data sensibly across reduce tasks to improve processing efficiency.


Goal: test what distribute by does. For the user column, we want each distinct user routed to its own reducer, which makes later per-user computation convenient. (The test queries below actually use cluster by <expr>, which is equivalent to distribute by <expr> sort by <expr>.)

======================
Test 1: 4 reducers, distributed by length(user)
Test 2: 4 reducers, distributed by substr(user,0,2)
Test 3: 4 reducers, distributed by user
Test 4: 5 reducers, distributed by user
Test 5: 5 reducers, distributed by length(user)
Test 6: 5 reducers, distributed by substr(user,0,2)

Finally, analyze the execution plans:
======================

Hive table data:

panda   2017-1-01   5t
gifshow 2017-1-01   3t
yy  2017-1-01   2t
laifeng 2017-1-01   2t
panda   2017-1-02   5t
gifshow 2017-1-02   3t
yy  2017-1-02   2t
laifeng 2017-1-02   2t
panda   2017-2-01   4t
gifshow 2017-2-01   3t
yy  2017-2-01   1t
laifeng 2017-2-01   4t
panda   2017-2-02   4t
gifshow 2017-2-02   3t
yy  2017-2-02   1t
laifeng 2017-2-02   4t
panda   2017-3-01   4t
gifshow 2017-3-01   3t
yy  2017-3-01   1t
laifeng 2017-3-01   4t
panda   2017-3-02   4t
gifshow 2017-3-02   3t
yy  2017-3-02   1t
laifeng 2017-3-02   4t
In Hive, this table's columns are: user, date, access.
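
For reference, a minimal sketch of how such a table could be created and loaded. The column types, the tab delimiter, and the local file path are assumptions, not taken from the original setup; depending on the Hive version, the user and date column names may need backticks because they are keywords.

-- assumed DDL for the test table shipin_origin
create table shipin_origin (
  user   string,
  date   string,
  access string
)
row format delimited fields terminated by '\t';

-- assumed local path holding the sample rows shown above
load data local inpath '/root/data/shipin_origin.txt' into table shipin_origin;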

Test 1:

hive (default)> set mapred.reduce.tasks=4;
hive (default)> insert overwrite local directory '/root/data/dis_shipin/' row format delimited fields terminated by '\t' 
select user,date,access from shipin_origin cluster by length(user) ;
---
Number of reduce tasks not specified. Defaulting to jobconf value of: 4
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
Stage-Stage-1: Map: 1  Reduce: 4   Cumulative CPU: 8.5 sec   HDFS Read: 15459 HDFS Write: 462 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 500 msec
OK
Time taken: 44.638 seconds
---

Result:

[root@hadoop002 data]#  more  dis_shipin/*
::::::::::::::
dis_shipin/000000_0
::::::::::::::
::::::::::::::
dis_shipin/000001_0
::::::::::::::
panda   2017-1-01       5t
panda   2017-3-02       4t
panda   2017-2-01       4t
panda   2017-3-01       4t
panda   2017-1-02       5t
panda   2017-2-02       4t
::::::::::::::
dis_shipin/000002_0
::::::::::::::
yy      2017-2-01       1t
yy      2017-1-02       2t
yy      2017-3-02       1t
yy      2017-2-02       1t
yy      2017-1-01       2t
yy      2017-3-01       1t
::::::::::::::
dis_shipin/000003_0
::::::::::::::
laifeng 2017-3-02       4t
gifshow 2017-3-02       3t
laifeng 2017-3-01       4t
gifshow 2017-3-01       3t
laifeng 2017-2-02       4t
gifshow 2017-2-02       3t
gifshow 2017-2-01       3t
laifeng 2017-1-02       2t
gifshow 2017-1-02       3t
laifeng 2017-1-01       2t
gifshow 2017-1-01       3t
laifeng 2017-2-01       4t

Findings:
With 4 reducers configured manually and rows routed by the number length(user), the mechanism works as intended, even though this key is not the right condition for the goal: gifshow and laifeng (both 7 characters long) land in the same reducer, and 1 reducer is left empty.
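
Before running the job, a quick way to sanity-check where each row should land is to combine Hive's built-in hash() and pmod() functions. This assumes the default partitioner uses the same hash routine as the hash() UDF and assigns reducer = hash(key) mod numReducers; that matched the observed output files on this build, but it is an assumption worth verifying on yours. Replace length(user) with whatever expression you distribute by.

-- expected reducer index per user for the length(user) key with 4 reducers (assumption: default hash partitioner)
select distinct user, length(user) as k,
       pmod(hash(length(user)), 4) as expected_reducer
from shipin_origin;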

Test 2:

hive (default)> set mapred.reduce.tasks=4;
hive (default)> insert overwrite local directory '/root/data/dis_shipin/' row format delimited fields terminated by '\t' 
select user,date,access from shipin_origin cluster by substr(user,0,2) ;

Number of reduce tasks not specified. Defaulting to jobconf value of: 4
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
Stage-Stage-1: Map: 1  Reduce: 4   Cumulative CPU: 10.0 sec   HDFS Read: 15895 HDFS Write: 462 SUCCESS
Total MapReduce CPU Time Spent: 10 seconds 0 msec
OK
Time taken: 49.651 seconds

Result:
[root@hadoop002 data]# more  dis_shipin/*
::::::::::::::
dis_shipin/000000_0
::::::::::::::
yy      2017-2-01       1t
yy      2017-2-02       1t
yy      2017-1-01       2t
yy      2017-1-02       2t
yy      2017-3-02       1t
yy      2017-3-01       1t
::::::::::::::
dis_shipin/000001_0
::::::::::::::
laifeng 2017-3-02       4t
laifeng 2017-3-01       4t
laifeng 2017-2-02       4t
laifeng 2017-2-01       4t
laifeng 2017-1-02       2t
laifeng 2017-1-01       2t
panda   2017-2-01       4t
panda   2017-3-02       4t
panda   2017-1-02       5t
panda   2017-1-01       5t
panda   2017-3-01       4t
panda   2017-2-02       4t
::::::::::::::
dis_shipin/000002_0
::::::::::::::
gifshow 2017-3-02       3t
gifshow 2017-3-01       3t
gifshow 2017-2-01       3t
gifshow 2017-1-01       3t
gifshow 2017-1-02       3t
gifshow 2017-2-02       3t
::::::::::::::
dis_shipin/000003_0
::::::::::::::

Findings:
With 4 reducers configured manually and rows distributed by the string substr(user,0,2), the result misses the goal: laifeng and panda are not separated (their keys 'la' and 'pa' hash to the same reducer), and 1 reducer receives no data.

Test 3:

hive (default)> set mapred.reduce.tasks=4;
hive (default)> insert overwrite local directory '/root/data/dis_shipin/' row format delimited fields terminated by '\t' 
select user,date,access from shipin_origin cluster by user ;

Number of reduce tasks not specified. Defaulting to jobconf value of: 4
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
Stage-Stage-1: Map: 1  Reduce: 4   Cumulative CPU: 8.9 sec   HDFS Read: 14358 HDFS Write: 462 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 900 msec
OK
Time taken: 49.366 seconds

Result:
[root@hadoop002 data]# more  dis_shipin/*
::::::::::::::
dis_shipin/000000_0
::::::::::::::
laifeng 2017-3-02       4t
laifeng 2017-3-01       4t
laifeng 2017-2-02       4t
laifeng 2017-2-01       4t
laifeng 2017-1-02       2t
laifeng 2017-1-01       2t
yy      2017-1-01       2t
yy      2017-2-01       1t
yy      2017-1-02       2t
yy      2017-2-02       1t
yy      2017-3-02       1t
yy      2017-3-01       1t
::::::::::::::
dis_shipin/000001_0
::::::::::::::
gifshow 2017-3-01       3t
gifshow 2017-2-02       3t
gifshow 2017-2-01       3t
gifshow 2017-1-01       3t
gifshow 2017-1-02       3t
gifshow 2017-3-02       3t
::::::::::::::
dis_shipin/000002_0
::::::::::::::
panda   2017-1-02       5t
panda   2017-3-01       4t
panda   2017-2-01       4t
panda   2017-2-02       4t
panda   2017-3-02       4t
panda   2017-1-01       5t
::::::::::::::
dis_shipin/000003_0
::::::::::::::

Findings:
With 4 reducers configured manually and rows distributed by the string user, the result misses the goal: laifeng and yy are not separated, and 1 reducer receives no data.

Test 4:

hive (default)> set mapred.reduce.tasks=5;
hive (default)> insert overwrite local directory '/root/data/dis_shipin/' row format delimited fields terminated by '\t' 
select user,date,access from shipin_origin cluster by user ;

Number of reduce tasks not specified. Defaulting to jobconf value of: 5
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 5
Stage-Stage-1: Map: 1  Reduce: 5   Cumulative CPU: 8.84 sec   HDFS Read: 17060 HDFS Write: 462 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 840 msec
OK
Time taken: 47.521 seconds

Result:
[root@hadoop002 data]# ll dis_shipin/
total 8
-rw-r--r-- 1 root root   0 Oct 23 11:30 000000_0
-rw-r--r-- 1 root root 240 Oct 23 11:30 000001_0
-rw-r--r-- 1 root root 222 Oct 23 11:30 000002_0
-rw-r--r-- 1 root root   0 Oct 23 11:30 000003_0
-rw-r--r-- 1 root root   0 Oct 23 11:30 000004_0

[root@hadoop002 data]# more dis_shipin/*
::::::::::::::
dis_shipin/000000_0
::::::::::::::
::::::::::::::
dis_shipin/000001_0
::::::::::::::
laifeng 2017-3-02       4t
laifeng 2017-3-01       4t
laifeng 2017-2-02       4t
laifeng 2017-2-01       4t
laifeng 2017-1-02       2t
laifeng 2017-1-01       2t
panda   2017-3-02       4t
panda   2017-2-01       4t
panda   2017-1-01       5t
panda   2017-1-02       5t
panda   2017-3-01       4t
panda   2017-2-02       4t
::::::::::::::
dis_shipin/000002_0
::::::::::::::
gifshow 2017-3-01       3t
gifshow 2017-2-02       3t
gifshow 2017-2-01       3t
gifshow 2017-1-01       3t
gifshow 2017-1-02       3t
gifshow 2017-3-02       3t
yy      2017-2-02       1t
yy      2017-3-02       1t
yy      2017-3-01       1t
yy      2017-2-01       1t
yy      2017-1-01       2t
yy      2017-1-02       2t
::::::::::::::
dis_shipin/000003_0
::::::::::::::
::::::::::::::
dis_shipin/000004_0
::::::::::::::

Findings:
With 5 reducers configured manually and rows distributed by the string user, the result misses the goal: gifshow and yy are not separated, and neither are panda and laifeng. The split differs from the 4-reducer run, and 3 reducers receive no data.

Test 5:

hive (default)> set mapred.reduce.tasks=5;
hive (default)> insert overwrite local directory '/root/data/dis_shipin/' row format delimited fields terminated by '\t' 
select user,date,access from shipin_origin cluster by length(user) ;

Number of reduce tasks not specified. Defaulting to jobconf value of: 5
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 5
Stage-Stage-1: Map: 1  Reduce: 5   Cumulative CPU: 9.27 sec   HDFS Read: 18369 HDFS Write: 462 SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 270 msec
OK
Time taken: 48.579 seconds

Result:
[root@hadoop002 data]# more dis_shipin/*
::::::::::::::
dis_shipin/000000_0
::::::::::::::
panda   2017-1-01       5t
panda   2017-3-02       4t
panda   2017-2-01       4t
panda   2017-3-01       4t
panda   2017-1-02       5t
panda   2017-2-02       4t
::::::::::::::
dis_shipin/000001_0
::::::::::::::
::::::::::::::
dis_shipin/000002_0
::::::::::::::
yy      2017-2-01       1t
yy      2017-1-02       2t
yy      2017-3-02       1t
yy      2017-2-02       1t
yy      2017-1-01       2t
yy      2017-3-01       1t
laifeng 2017-3-02       4t
gifshow 2017-3-02       3t
laifeng 2017-3-01       4t
gifshow 2017-3-01       3t
laifeng 2017-2-02       4t
gifshow 2017-2-02       3t
gifshow 2017-2-01       3t
laifeng 2017-1-02       2t
gifshow 2017-1-02       3t
laifeng 2017-1-01       2t
gifshow 2017-1-01       3t
laifeng 2017-2-01       4t
::::::::::::::
dis_shipin/000003_0
::::::::::::::
::::::::::::::
dis_shipin/000004_0
::::::::::::::

Findings:
With 5 reducers configured manually and rows distributed by the number length(user), the result misses the goal: gifshow, yy, and laifeng all land in the same reducer. The split differs from the 4-reducer run, and 3 reducers receive no data.

Test 6:

hive (default)> set mapred.reduce.tasks=5;
hive (default)> insert overwrite local directory '/root/data/dis_shipin/' row format delimited fields terminated by '\t' 
select user,date,access from shipin_origin cluster by substr(user,0,2) ;

Number of reduce tasks not specified. Defaulting to jobconf value of: 5 
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 5
Stage-Stage-1: Map: 1  Reduce: 5   Cumulative CPU: 10.4 sec   HDFS Read: 18901 HDFS Write: 462 SUCCESS
Total MapReduce CPU Time Spent: 10 seconds 400 msec
OK
Time taken: 49.339 seconds

Result:
[root@hadoop002 data]# more dis_shipin/*
::::::::::::::
dis_shipin/000000_0
::::::::::::::
laifeng 2017-3-02       4t
laifeng 2017-3-01       4t
laifeng 2017-2-02       4t
laifeng 2017-2-01       4t
laifeng 2017-1-02       2t
laifeng 2017-1-01       2t
::::::::::::::
dis_shipin/000001_0
::::::::::::::
::::::::::::::
dis_shipin/000002_0
::::::::::::::
yy      2017-1-01       2t
yy      2017-2-01       1t
yy      2017-1-02       2t
yy      2017-2-02       1t
yy      2017-3-02       1t
yy      2017-3-01       1t
::::::::::::::
dis_shipin/000003_0
::::::::::::::
gifshow 2017-3-01       3t
gifshow 2017-2-02       3t
gifshow 2017-2-01       3t
gifshow 2017-1-01       3t
gifshow 2017-1-02       3t
gifshow 2017-3-02       3t
::::::::::::::
dis_shipin/000004_0
::::::::::::::
panda   2017-1-02       5t
panda   2017-3-01       4t
panda   2017-2-01       4t
panda   2017-2-02       4t
panda   2017-3-02       4t
panda   2017-1-01       5t

Findings:
With 5 reducers configured manually and rows distributed by the string substr(user,0,2), the result is correct and meets the goal: each user goes to its own reducer. The split differs from the 4-reducer run, and 1 reducer receives no data.
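
The same pre-check applies when choosing a key and reducer count: inspect the expected assignment first (same caveat as above - this assumes the default partitioner behaves like the hash() UDF). If each of the four users maps to a distinct index, that key/count combination will separate them.

-- expected reducer index per user for the substr(user,0,2) key with 5 reducers (assumption: default hash partitioner)
select distinct user, substr(user,0,2) as k,
       pmod(hash(substr(user,0,2)), 5) as expected_reducer
from shipin_origin;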

Combined usage (distribute by together with group by). Note that HiveQL requires distribute by to appear after group by; writing it before group by produces the parse error shown below.

set mapred.reduce.tasks=5;
hive (default)> select user,substr(date,0,6),sum(cast(substr(access,0,1) as INT)) from shipin_origin distribute by  substr(user,0,2)  group by user,substr(date,0,7) ;
FAILED: ParseException line 1:118 missing EOF at 'group' near ')'
hive (default)> insert overwrite local directory '/root/data/dis_shipin/' row format delimited fields terminated by '\t'
              > select user,substr(date,0,6),sum(cast(substr(access,0,1) as INT)) from shipin_origin  group by user,substr(date,0,7) distribute by  substr(user,0,2);
Query ID = root_20171023091818_928d5dec-5afb-468e-af8c-28750170b31b
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Defaulting to jobconf value of: 5
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1508685632650_0068, Tracking URL = http://hadoop002:8088/proxy/application_1508685632650_0068/
Kill Command = /opt/software/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1508685632650_0068
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 5
2017-10-23 12:32:40,024 Stage-1 map = 0%,  reduce = 0%
2017-10-23 12:32:47,672 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.98 sec
2017-10-23 12:33:11,709 Stage-1 map = 100%,  reduce = 13%, Cumulative CPU 2.98 sec
2017-10-23 12:33:17,787 Stage-1 map = 100%,  reduce = 20%, Cumulative CPU 4.0 sec
2017-10-23 12:33:20,217 Stage-1 map = 100%,  reduce = 47%, Cumulative CPU 6.26 sec
2017-10-23 12:33:22,672 Stage-1 map = 100%,  reduce = 60%, Cumulative CPU 7.42 sec
2017-10-23 12:33:23,931 Stage-1 map = 100%,  reduce = 87%, Cumulative CPU 10.38 sec
2017-10-23 12:33:24,993 Stage-1 map = 100%,  reduce = 93%, Cumulative CPU 11.11 sec
2017-10-23 12:33:26,025 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 11.86 sec
MapReduce Total cumulative CPU time: 11 seconds 860 msec
Ended Job = job_1508685632650_0068
Launching Job 2 out of 2
Number of reduce tasks not specified. Defaulting to jobconf value of: 5
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1508685632650_0069, Tracking URL = http://hadoop002:8088/proxy/application_1508685632650_0069/
Kill Command = /opt/software/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1508685632650_0069
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 5
2017-10-23 12:33:36,609 Stage-2 map = 0%,  reduce = 0%
2017-10-23 12:33:45,209 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 1.78 sec
2017-10-23 12:34:13,386 Stage-2 map = 100%,  reduce = 13%, Cumulative CPU 3.01 sec
2017-10-23 12:34:19,658 Stage-2 map = 100%,  reduce = 33%, Cumulative CPU 5.23 sec
2017-10-23 12:34:20,836 Stage-2 map = 100%,  reduce = 47%, Cumulative CPU 6.38 sec
2017-10-23 12:34:23,210 Stage-2 map = 100%,  reduce = 80%, Cumulative CPU 9.55 sec
2017-10-23 12:34:26,761 Stage-2 map = 100%,  reduce = 87%, Cumulative CPU 10.72 sec
2017-10-23 12:34:27,795 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 12.74 sec
MapReduce Total cumulative CPU time: 12 seconds 740 msec
Ended Job = job_1508685632650_0069
Copying data to local directory /root/data/dis_shipin
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 5   Cumulative CPU: 11.86 sec   HDFS Read: 22734 HDFS Write: 867 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 5   Cumulative CPU: 12.74 sec   HDFS Read: 18045 HDFS Write: 196 SUCCESS
Total MapReduce CPU Time Spent: 24 seconds 600 msec
OK
Time taken: 118.44 seconds

Result:

[root@hadoop002 data]# more dis_shipin/*
::::::::::::::
dis_shipin/000000_0
::::::::::::::
laifeng 2017-1 4
laifeng 2017-3 8
laifeng 2017-2 8
::::::::::::::
dis_shipin/000001_0
::::::::::::::
::::::::::::::
dis_shipin/000002_0
::::::::::::::
yy      2017-3 2
yy      2017-2 2
yy      2017-1 4
::::::::::::::
dis_shipin/000003_0
::::::::::::::
gifshow 2017-3 6
gifshow 2017-2 6
gifshow 2017-1 6
::::::::::::::
dis_shipin/000004_0
::::::::::::::
panda   2017-3 8
panda   2017-2 8
panda   2017-1 10

Findings:
With 5 reducers configured manually, rows distributed by the string substr(user,0,2), and aggregation done with group by, the result is correct and meets the goal.

============================================================================================
One question remains: does the distribute by run first, or the group by? We need to look at Hive's execution plan to find out.
Analysis:

# This statement only distributes rows across reducers (no aggregation)
hive (default)> explain
              > select user,date,cast(substr(access,0,1) as INT) from shipin_origin distribute by  substr(user,0,2);
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: shipin_origin
            Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: user (type: string), date (type: string), UDFToInteger(substr(access, 0, 1)) (type: int)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                sort order: 
                Map-reduce partition columns: substr(_col0, 0, 2) (type: string)
                Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: int)
      Reduce Operator Tree:
        Select Operator
          expressions: VALUE._col0 (type: string), VALUE._col1 (type: string), VALUE._col2 (type: int)
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.098 seconds, Fetched: 39 row(s)
## The combined usage (group by plus distribute by) follows
hive (default)> explain
              > select user,substr(date,0,7),sum(cast(substr(access,0,1) as INT)) from shipin_origin  group by user,substr(date,0,7) distribute by  substr(user,0,2);
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 depends on stages: Stage-2

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: shipin_origin
            Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: user (type: string), date (type: string), access (type: string)
              outputColumnNames: user, date, access
              Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: sum(UDFToInteger(substr(access, 0, 1)))
                keys: user (type: string), substr(date, 0, 7) (type: string)
                mode: hash
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string), _col1 (type: string)
                  sort order: ++
                  Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
                  Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col2 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0)
          keys: KEY._col0 (type: string), KEY._col1 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            Reduce Output Operator
              sort order: 
              Map-reduce partition columns: substr(_col0, 0, 2) (type: string)
              Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
              value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: bigint)
      Reduce Operator Tree:
        Select Operator
          expressions: VALUE._col0 (type: string), VALUE._col1 (type: string), VALUE._col2 (type: bigint)
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.121 seconds, Fetched: 70 row(s)

Findings: the group by runs first (Stage-1), and the distribute by then re-shuffles the aggregated output to different reducers (Stage-2). In terms of efficiency, the distribution does not help parallelize the group by itself; it only adds a second shuffle.

So how can the distribute by happen first? One option is to push it into a subquery and aggregate on top of it.
Although the execution plan below shows the distribute running first, in practice the subsequent group by shuffles the data again across the configured number of reducers, so the final output ends up split by the group-by keys rather than by substr(user,0,2).

hive (default)> explain
              > insert overwrite local directory '/root/data/dis_shipin/' row format delimited fields terminated by '\t'
              > select A.user,A.date,sum(A.acc) from 
              > (select user,substr(date,0,6)as  date  ,cast(substr(access,0,1) as INT) as acc from shipin_origin distribute by  substr(user,0,2)) A group by A.user,A.date;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 depends on stages: Stage-2

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: shipin_origin
            Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: user (type: string), substr(date, 0, 6) (type: string), UDFToInteger(substr(access, 0, 1)) (type: int)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                sort order: 
                Map-reduce partition columns: substr(_col0, 0, 2) (type: string)
                Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: int)
      Reduce Operator Tree:
        Select Operator
          expressions: VALUE._col0 (type: string), VALUE._col1 (type: string), VALUE._col2 (type: int)
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
          Group By Operator
            aggregations: sum(_col2)
            keys: _col0 (type: string), _col1 (type: string)
            mode: hash
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            Reduce Output Operator
              key expressions: _col0 (type: string), _col1 (type: string)
              sort order: ++
              Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
              Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
              value expressions: _col2 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0)
          keys: KEY._col0 (type: string), KEY._col1 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Move Operator
      files:
          hdfs directory: false
          destination: /root/data/dis_shipin

Time taken: 0.221 seconds, Fetched: 70 row(s)
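
Putting the pieces together: since the group by stage always re-shuffles by its own keys, the way to get per-user output files of aggregated rows is to aggregate first and apply distribute by in an outer query, as the combined test above already does. A sketch of that form, not run here, with substr(date,0,6) used consistently as the month key:

set mapred.reduce.tasks=5;
insert overwrite local directory '/root/data/dis_shipin/'
row format delimited fields terminated by '\t'
-- aggregate first (Stage-1), then re-shuffle the small aggregated result by the user prefix (Stage-2)
select B.user, B.mon, B.total
from (
  select user,
         substr(date,0,6) as mon,
         sum(cast(substr(access,0,1) as INT)) as total
  from shipin_origin
  group by user, substr(date,0,6)
) B
distribute by substr(B.user,0,2);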