Goal: test what distribute by does. For the user column, we want different users assigned to different reducers, to make later per-user computation easier.
======================
Test 1: 4 reducers, partitioned by length(user)
Test 2: 4 reducers, partitioned by substr(user,0,2)
Test 3: 4 reducers, partitioned by user
Test 4: 5 reducers, partitioned by user
Test 5: 5 reducers, partitioned by length(user)
Test 6: 5 reducers, partitioned by substr(user,0,2)
Finally, analyze the execution plans.
======================
Hive table data:
panda 2017-1-01 5t
gifshow 2017-1-01 3t
yy 2017-1-01 2t
laifeng 2017-1-01 2t
panda 2017-1-02 5t
gifshow 2017-1-02 3t
yy 2017-1-02 2t
laifeng 2017-1-02 2t
panda 2017-2-01 4t
gifshow 2017-2-01 3t
yy 2017-2-01 1t
laifeng 2017-2-01 4t
panda 2017-2-02 4t
gifshow 2017-2-02 3t
yy 2017-2-02 1t
laifeng 2017-2-02 4t
panda 2017-3-01 4t
gifshow 2017-3-01 3t
yy 2017-3-01 1t
laifeng 2017-3-01 4t
panda 2017-3-02 4t
gifshow 2017-3-02 3t
yy 2017-3-02 1t
laifeng 2017-3-02 4t
In Hive, this table's columns are: user, date, access.
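For reference, a table like this could be created and loaded along the following lines. This is only a sketch: the original DDL is not shown above, and the local file path here is made up for illustration.
-- sketch only: a tab-delimited text table with the three columns described above;
-- the backquotes guard against user/date being treated as keywords in some Hive versions
create table shipin_origin (
  `user`   string,
  `date`   string,
  access   string
)
row format delimited fields terminated by '\t';
-- hypothetical source file path
load data local inpath '/root/data/shipin_origin.txt' into table shipin_origin;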
Test 1:
hive (default)> set mapred.reduce.tasks=4;
hive (default)> insert overwrite local directory '/root/data/dis_shipin/' row format delimited fields terminated by '\t'
select user,date,access from shipin_origin cluster by length(user) ;
---
Number of reduce tasks not specified. Defaulting to jobconf value of: 4
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
Stage-Stage-1: Map: 1 Reduce: 4 Cumulative CPU: 8.5 sec HDFS Read: 15459 HDFS Write: 462 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 500 msec
OK
Time taken: 44.638 seconds
---
Result:
[root@hadoop002 data]# more dis_shipin/*
::::::::::::::
dis_shipin/000000_0
::::::::::::::
::::::::::::::
dis_shipin/000001_0
::::::::::::::
panda 2017-1-01 5t
panda 2017-3-02 4t
panda 2017-2-01 4t
panda 2017-3-01 4t
panda 2017-1-02 5t
panda 2017-2-02 4t
::::::::::::::
dis_shipin/000002_0
::::::::::::::
yy 2017-2-01 1t
yy 2017-1-02 2t
yy 2017-3-02 1t
yy 2017-2-02 1t
yy 2017-1-01 2t
yy 2017-3-01 1t
::::::::::::::
dis_shipin/000003_0
::::::::::::::
laifeng 2017-3-02 4t
gifshow 2017-3-02 3t
laifeng 2017-3-01 4t
gifshow 2017-3-01 3t
laifeng 2017-2-02 4t
gifshow 2017-2-02 3t
gifshow 2017-2-01 3t
laifeng 2017-1-02 2t
gifshow 2017-1-02 3t
laifeng 2017-1-01 2t
gifshow 2017-1-01 3t
laifeng 2017-2-01 4t
Findings:
With 4 reducers configured manually and rows partitioned by the numeric value length(user), the mechanism works as expected; the partition key is just a poor fit for the goal: gifshow and laifeng both have length 7 and end up on the same reducer, and 1 reducer stays empty.
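A rough way to see why: assuming the default partitioner assigns a row to reducer hash(key) % numReducers (an assumption about the default behavior, not verified here), the mapping can be previewed with Hive's built-in hash() and pmod() functions:
-- preview which bucket each length would fall into with 4 reducers (approximation only)
select distinct user, length(user), pmod(hash(length(user)), 4) as bucket
from shipin_origin;
-- the lengths are 5 (panda), 2 (yy) and 7 (gifshow, laifeng); gifshow and laifeng share a length,
-- so they share a reducer, and one of the four buckets receives nothing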
Test 2:
hive (default)> set mapred.reduce.tasks=4;
hive (default)> insert overwrite local directory '/root/data/dis_shipin/' row format delimited fields terminated by '\t'
select user,date,access from shipin_origin cluster by substr(user,0,2) ;
Number of reduce tasks not specified. Defaulting to jobconf value of: 4
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
Stage-Stage-1: Map: 1 Reduce: 4 Cumulative CPU: 10.0 sec HDFS Read: 15895 HDFS Write: 462 SUCCESS
Total MapReduce CPU Time Spent: 10 seconds 0 msec
OK
Time taken: 49.651 seconds
Result:
[root@hadoop002 data]# more dis_shipin/*
::::::::::::::
dis_shipin/000000_0
::::::::::::::
yy 2017-2-01 1t
yy 2017-2-02 1t
yy 2017-1-01 2t
yy 2017-1-02 2t
yy 2017-3-02 1t
yy 2017-3-01 1t
::::::::::::::
dis_shipin/000001_0
::::::::::::::
laifeng 2017-3-02 4t
laifeng 2017-3-01 4t
laifeng 2017-2-02 4t
laifeng 2017-2-01 4t
laifeng 2017-1-02 2t
laifeng 2017-1-01 2t
panda 2017-2-01 4t
panda 2017-3-02 4t
panda 2017-1-02 5t
panda 2017-1-01 5t
panda 2017-3-01 4t
panda 2017-2-02 4t
::::::::::::::
dis_shipin/000002_0
::::::::::::::
gifshow 2017-3-02 3t
gifshow 2017-3-01 3t
gifshow 2017-2-01 3t
gifshow 2017-1-01 3t
gifshow 2017-1-02 3t
gifshow 2017-2-02 3t
::::::::::::::
dis_shipin/000003_0
::::::::::::::
Findings:
With 4 reducers configured manually and rows partitioned by the string substr(user,0,2), the result is not what we want: laifeng and panda are not separated, and 1 reducer receives no data.
Test 3:
hive (default)> set mapred.reduce.tasks=4;
hive (default)> insert overwrite local directory '/root/data/dis_shipin/' row format delimited fields terminated by '\t'
select user,date,access from shipin_origin cluster by user ;
Number of reduce tasks not specified. Defaulting to jobconf value of: 4
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
Stage-Stage-1: Map: 1 Reduce: 4 Cumulative CPU: 8.9 sec HDFS Read: 14358 HDFS Write: 462 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 900 msec
OK
Time taken: 49.366 seconds
Result:
[root@hadoop002 data]# more dis_shipin/*
::::::::::::::
dis_shipin/000000_0
::::::::::::::
laifeng 2017-3-02 4t
laifeng 2017-3-01 4t
laifeng 2017-2-02 4t
laifeng 2017-2-01 4t
laifeng 2017-1-02 2t
laifeng 2017-1-01 2t
yy 2017-1-01 2t
yy 2017-2-01 1t
yy 2017-1-02 2t
yy 2017-2-02 1t
yy 2017-3-02 1t
yy 2017-3-01 1t
::::::::::::::
dis_shipin/000001_0
::::::::::::::
gifshow 2017-3-01 3t
gifshow 2017-2-02 3t
gifshow 2017-2-01 3t
gifshow 2017-1-01 3t
gifshow 2017-1-02 3t
gifshow 2017-3-02 3t
::::::::::::::
dis_shipin/000002_0
::::::::::::::
panda 2017-1-02 5t
panda 2017-3-01 4t
panda 2017-2-01 4t
panda 2017-2-02 4t
panda 2017-3-02 4t
panda 2017-1-01 5t
::::::::::::::
dis_shipin/000003_0
::::::::::::::
Findings:
With 4 reducers configured manually and rows partitioned by the string user, the result is not what we want: laifeng and yy are not separated, and 1 reducer receives no data.
Test 4:
hive (default)> set mapred.reduce.tasks=5;
hive (default)> insert overwrite local directory '/root/data/dis_shipin/' row format delimited fields terminated by '\t'
select user,date,access from shipin_origin cluster by user ;
Number of reduce tasks not specified. Defaulting to jobconf value of: 5
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 5
Stage-Stage-1: Map: 1 Reduce: 5 Cumulative CPU: 8.84 sec HDFS Read: 17060 HDFS Write: 462 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 840 msec
OK
Time taken: 47.521 seconds
Result:
[root@hadoop002 data]# ll dis_shipin/
total 8
-rw-r--r-- 1 root root 0 Oct 23 11:30 000000_0
-rw-r--r-- 1 root root 240 Oct 23 11:30 000001_0
-rw-r--r-- 1 root root 222 Oct 23 11:30 000002_0
-rw-r--r-- 1 root root 0 Oct 23 11:30 000003_0
-rw-r--r-- 1 root root 0 Oct 23 11:30 000004_0
[root@hadoop002 data]# more dis_shipin/*
::::::::::::::
dis_shipin/000000_0
::::::::::::::
::::::::::::::
dis_shipin/000001_0
::::::::::::::
laifeng 2017-3-02 4t
laifeng 2017-3-01 4t
laifeng 2017-2-02 4t
laifeng 2017-2-01 4t
laifeng 2017-1-02 2t
laifeng 2017-1-01 2t
panda 2017-3-02 4t
panda 2017-2-01 4t
panda 2017-1-01 5t
panda 2017-1-02 5t
panda 2017-3-01 4t
panda 2017-2-02 4t
::::::::::::::
dis_shipin/000002_0
::::::::::::::
gifshow 2017-3-01 3t
gifshow 2017-2-02 3t
gifshow 2017-2-01 3t
gifshow 2017-1-01 3t
gifshow 2017-1-02 3t
gifshow 2017-3-02 3t
yy 2017-2-02 1t
yy 2017-3-02 1t
yy 2017-3-01 1t
yy 2017-2-01 1t
yy 2017-1-01 2t
yy 2017-1-02 2t
::::::::::::::
dis_shipin/000003_0
::::::::::::::
::::::::::::::
dis_shipin/000004_0
::::::::::::::
Findings:
With 5 reducers configured manually and rows partitioned by the string user, the result is not what we want: gifshow and yy are not separated, and neither are panda and laifeng; the grouping also differs from the 4-reducer run.
3 reducers receive no data.
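The difference between the 4-reducer and 5-reducer runs is consistent with a hash(key) % numReducers style assignment: the same key lands in a different bucket once the modulus changes. Again only a hedged approximation of the default partitioner, this can be compared side by side:
-- compare the bucket a user would get with 4 versus 5 reducers (approximation only)
select distinct user, pmod(hash(user), 4) as bucket_4, pmod(hash(user), 5) as bucket_5
from shipin_origin;
-- observed above: laifeng and yy shared a reducer with 4 reducers, while with 5 reducers
-- laifeng/panda and gifshow/yy each shared one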
Test 5:
hive (default)> set mapred.reduce.tasks=5;
hive (default)> insert overwrite local directory '/root/data/dis_shipin/' row format delimited fields terminated by '\t'
select user,date,access from shipin_origin cluster by length(user) ;
Number of reduce tasks not specified. Defaulting to jobconf value of: 5
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 5
Stage-Stage-1: Map: 1 Reduce: 5 Cumulative CPU: 9.27 sec HDFS Read: 18369 HDFS Write: 462 SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 270 msec
OK
Time taken: 48.579 seconds
Result:
[root@hadoop002 data]# more dis_shipin/*
::::::::::::::
dis_shipin/000000_0
::::::::::::::
panda 2017-1-01 5t
panda 2017-3-02 4t
panda 2017-2-01 4t
panda 2017-3-01 4t
panda 2017-1-02 5t
panda 2017-2-02 4t
::::::::::::::
dis_shipin/000001_0
::::::::::::::
::::::::::::::
dis_shipin/000002_0
::::::::::::::
yy 2017-2-01 1t
yy 2017-1-02 2t
yy 2017-3-02 1t
yy 2017-2-02 1t
yy 2017-1-01 2t
yy 2017-3-01 1t
laifeng 2017-3-02 4t
gifshow 2017-3-02 3t
laifeng 2017-3-01 4t
gifshow 2017-3-01 3t
laifeng 2017-2-02 4t
gifshow 2017-2-02 3t
gifshow 2017-2-01 3t
laifeng 2017-1-02 2t
gifshow 2017-1-02 3t
laifeng 2017-1-01 2t
gifshow 2017-1-01 3t
laifeng 2017-2-01 4t
::::::::::::::
dis_shipin/000003_0
::::::::::::::
::::::::::::::
dis_shipin/000004_0
::::::::::::::
Findings:
With 5 reducers configured manually and rows partitioned by the numeric value length(user), the result is not what we want: gifshow, yy and laifeng are not separated; the grouping also differs from the 4-reducer run.
3 reducers receive no data.
Test 6:
hive (default)> set mapred.reduce.tasks=5;
hive (default)> insert overwrite local directory '/root/data/dis_shipin/' row format delimited fields terminated by '\t'
select user,date,access from shipin_origin cluster by substr(user,0,2) ;
Number of reduce tasks not specified. Defaulting to jobconf value of: 5
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 5
Stage-Stage-1: Map: 1 Reduce: 5 Cumulative CPU: 10.4 sec HDFS Read: 18901 HDFS Write: 462 SUCCESS
Total MapReduce CPU Time Spent: 10 seconds 400 msec
OK
Time taken: 49.339 seconds
Result:
[root@hadoop002 data]# more dis_shipin/*
::::::::::::::
dis_shipin/000000_0
::::::::::::::
laifeng 2017-3-02 4t
laifeng 2017-3-01 4t
laifeng 2017-2-02 4t
laifeng 2017-2-01 4t
laifeng 2017-1-02 2t
laifeng 2017-1-01 2t
::::::::::::::
dis_shipin/000001_0
::::::::::::::
::::::::::::::
dis_shipin/000002_0
::::::::::::::
yy 2017-1-01 2t
yy 2017-2-01 1t
yy 2017-1-02 2t
yy 2017-2-02 1t
yy 2017-3-02 1t
yy 2017-3-01 1t
::::::::::::::
dis_shipin/000003_0
::::::::::::::
gifshow 2017-3-01 3t
gifshow 2017-2-02 3t
gifshow 2017-2-01 3t
gifshow 2017-1-01 3t
gifshow 2017-1-02 3t
gifshow 2017-3-02 3t
::::::::::::::
dis_shipin/000004_0
::::::::::::::
panda 2017-1-02 5t
panda 2017-3-01 4t
panda 2017-2-01 4t
panda 2017-2-02 4t
panda 2017-3-02 4t
panda 2017-1-01 5t
Findings:
With 5 reducers configured manually and rows partitioned by the string substr(user,0,2), the result is correct and the goal is met: each user lands on its own reducer. The grouping differs from the 4-reducer run,
and 1 reducer receives no data.
Combined use:
set mapred.reduce.tasks=5;
hive (default)> select user,substr(date,0,6),sum(cast(substr(access,0,1) as INT)) from shipin_origin distribute by substr(user,0,2) group by user,substr(date,0,7) ;
FAILED: ParseException line 1:118 missing EOF at 'group' near ')'
(The statement fails to parse because distribute by has to come after group by in HiveQL clause order; the statement below moves it accordingly.)
hive (default)> insert overwrite local directory '/root/data/dis_shipin/' row format delimited fields terminated by '\t'
> select user,substr(date,0,6),sum(cast(substr(access,0,1) as INT)) from shipin_origin group by user,substr(date,0,7) distribute by substr(user,0,2);
Query ID = root_20171023091818_928d5dec-5afb-468e-af8c-28750170b31b
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Defaulting to jobconf value of: 5
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1508685632650_0068, Tracking URL = http://hadoop002:8088/proxy/application_1508685632650_0068/
Kill Command = /opt/software/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1508685632650_0068
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 5
2017-10-23 12:32:40,024 Stage-1 map = 0%, reduce = 0%
2017-10-23 12:32:47,672 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.98 sec
2017-10-23 12:33:11,709 Stage-1 map = 100%, reduce = 13%, Cumulative CPU 2.98 sec
2017-10-23 12:33:17,787 Stage-1 map = 100%, reduce = 20%, Cumulative CPU 4.0 sec
2017-10-23 12:33:20,217 Stage-1 map = 100%, reduce = 47%, Cumulative CPU 6.26 sec
2017-10-23 12:33:22,672 Stage-1 map = 100%, reduce = 60%, Cumulative CPU 7.42 sec
2017-10-23 12:33:23,931 Stage-1 map = 100%, reduce = 87%, Cumulative CPU 10.38 sec
2017-10-23 12:33:24,993 Stage-1 map = 100%, reduce = 93%, Cumulative CPU 11.11 sec
2017-10-23 12:33:26,025 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.86 sec
MapReduce Total cumulative CPU time: 11 seconds 860 msec
Ended Job = job_1508685632650_0068
Launching Job 2 out of 2
Number of reduce tasks not specified. Defaulting to jobconf value of: 5
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1508685632650_0069, Tracking URL = http://hadoop002:8088/proxy/application_1508685632650_0069/
Kill Command = /opt/software/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1508685632650_0069
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 5
2017-10-23 12:33:36,609 Stage-2 map = 0%, reduce = 0%
2017-10-23 12:33:45,209 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 1.78 sec
2017-10-23 12:34:13,386 Stage-2 map = 100%, reduce = 13%, Cumulative CPU 3.01 sec
2017-10-23 12:34:19,658 Stage-2 map = 100%, reduce = 33%, Cumulative CPU 5.23 sec
2017-10-23 12:34:20,836 Stage-2 map = 100%, reduce = 47%, Cumulative CPU 6.38 sec
2017-10-23 12:34:23,210 Stage-2 map = 100%, reduce = 80%, Cumulative CPU 9.55 sec
2017-10-23 12:34:26,761 Stage-2 map = 100%, reduce = 87%, Cumulative CPU 10.72 sec
2017-10-23 12:34:27,795 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 12.74 sec
MapReduce Total cumulative CPU time: 12 seconds 740 msec
Ended Job = job_1508685632650_0069
Copying data to local directory /root/data/dis_shipin
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 5 Cumulative CPU: 11.86 sec HDFS Read: 22734 HDFS Write: 867 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 5 Cumulative CPU: 12.74 sec HDFS Read: 18045 HDFS Write: 196 SUCCESS
Total MapReduce CPU Time Spent: 24 seconds 600 msec
OK
Time taken: 118.44 seconds
Result:
[root@hadoop002 data]# more dis_shipin/*
::::::::::::::
dis_shipin/000000_0
::::::::::::::
laifeng 2017-1 4
laifeng 2017-3 8
laifeng 2017-2 8
::::::::::::::
dis_shipin/000001_0
::::::::::::::
::::::::::::::
dis_shipin/000002_0
::::::::::::::
yy 2017-3 2
yy 2017-2 2
yy 2017-1 4
::::::::::::::
dis_shipin/000003_0
::::::::::::::
gifshow 2017-3 6
gifshow 2017-2 6
gifshow 2017-1 6
::::::::::::::
dis_shipin/000004_0
::::::::::::::
panda 2017-3 8
panda 2017-2 8
panda 2017-1 10
Findings:
With 5 reducers configured manually, rows partitioned by the string substr(user,0,2) and aggregated with group by, the result is correct and the goal is met.
============================================================================================
One question remains: which runs first, the distribute by or the group by? We need to check Hive's execution plan.
Analysis:
# This statement only distributes rows across reducers
hive (default)> explain
> select user,date,cast(substr(access,0,1) as INT) from shipin_origin distribute by substr(user,0,2);
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: shipin_origin
Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: user (type: string), date (type: string), UDFToInteger(substr(access, 0, 1)) (type: int)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
sort order:
Map-reduce partition columns: substr(_col0, 0, 2) (type: string)
Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: int)
Reduce Operator Tree:
Select Operator
expressions: VALUE._col0 (type: string), VALUE._col1 (type: string), VALUE._col2 (type: int)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.098 seconds, Fetched: 39 row(s)
## The combined statement follows
hive (default)> explain
> select user,substr(date,0,7),sum(cast(substr(access,0,1) as INT)) from shipin_origin group by user,substr(date,0,7) distribute by substr(user,0,2);
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 depends on stages: Stage-2
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: shipin_origin
Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: user (type: string), date (type: string), access (type: string)
outputColumnNames: user, date, access
Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: sum(UDFToInteger(substr(access, 0, 1)))
keys: user (type: string), substr(date, 0, 7) (type: string)
mode: hash
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string), _col1 (type: string)
sort order: ++
Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
value expressions: _col2 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: sum(VALUE._col0)
keys: KEY._col0 (type: string), KEY._col1 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
sort order:
Map-reduce partition columns: substr(_col0, 0, 2) (type: string)
Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: bigint)
Reduce Operator Tree:
Select Operator
expressions: VALUE._col0 (type: string), VALUE._col1 (type: string), VALUE._col2 (type: bigint)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.121 seconds, Fetched: 70 row(s)
Findings: the group by runs first, and the distribute by only redistributes its output to different reducers afterwards; in terms of efficiency, the group by itself is not parallelized along the intended user split.
So how can the distribute by be made to run first? One option is to push it into a nested subquery.
Although the execution plan below shows the distribute happening first, the later group by repartitions the data again according to the configured reducer count, so the final output is re-sharded anyway.
hive (default)> explain
> insert overwrite local directory '/root/data/dis_shipin/' row format delimited fields terminated by '\t'
> select A.user,A.date,sum(A.acc) from
> (select user,substr(date,0,6)as date ,cast(substr(access,0,1) as INT) as acc from shipin_origin distribute by substr(user,0,2)) A group by A.user,A.date;
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 depends on stages: Stage-2
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: shipin_origin
Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: user (type: string), substr(date, 0, 6) (type: string), UDFToInteger(substr(access, 0, 1)) (type: int)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
sort order:
Map-reduce partition columns: substr(_col0, 0, 2) (type: string)
Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: int)
Reduce Operator Tree:
Select Operator
expressions: VALUE._col0 (type: string), VALUE._col1 (type: string), VALUE._col2 (type: int)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: sum(_col2)
keys: _col0 (type: string), _col1 (type: string)
mode: hash
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
key expressions: _col0 (type: string), _col1 (type: string)
sort order: ++
Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
value expressions: _col2 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: sum(VALUE._col0)
keys: KEY._col0 (type: string), KEY._col1 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 462 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Move Operator
files:
hdfs directory: false
destination: /root/data/dis_shipin
Time taken: 0.221 seconds, Fetched: 70 row(s)
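Given these plans, if the end goal is aggregated output whose files are split by user prefix, the aggregation can sit in the inner query and the distribute by can be applied to the already-aggregated rows in the outer query. The following is only a sketch and was not run here; the column aliases are invented and the reducer setting is carried over from the earlier tests:
set mapred.reduce.tasks=5;
insert overwrite local directory '/root/data/dis_shipin/' row format delimited fields terminated by '\t'
select B.user, B.ym, B.total
from (
  -- aggregate first: one row per user and year-month prefix
  select user, substr(date,0,6) as ym, sum(cast(substr(access,0,1) as INT)) as total
  from shipin_origin
  group by user, substr(date,0,6)
) B
-- then lay the aggregated rows out by user prefix
distribute by substr(B.user,0,2);
This has the same two-job shape as the combined statement tested above, with the stage order written out explicitly, so the final file layout follows the user prefix rather than the group keys.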