Official documentation:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain
A brief overview of Hive execution plans
An execution plan generally has two parts:
stage dependencies: the dependencies between the stages
stage plans: the execution plan of each stage
A stage is not necessarily an MR job; it can also be a Fetch Operator or a Move Operator (a Fetch-only plan is sketched right after this list).
The execution plan of an MR stage has two parts:
Map Operator Tree: the plan for the map side
Reduce Operator Tree: the plan for the reduce side
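For example, a plain select with a limit can compile into a single Fetch stage with no MR job at all. The following is a minimal sketch of such a plan, assuming fetch-task conversion is enabled (hive.fetch.task.conversion not set to none); the Statistics lines are omitted and the exact layout varies with the Hive version:
hive> explain select * from t_data1 limit 10;
STAGE DEPENDENCIES:
  Stage-0 is a root stage
STAGE PLANS:
  Stage: Stage-0
    Fetch Operator             # the whole query runs as a fetch task, no MR job
      limit: 10                # the LIMIT from the query
      Processor Tree:
        TableScan
          alias: t_data1
          Select Operator
            Limit
              ListSink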
Some common operators:
TableScan: reads the table data; common attribute: alias
Select Operator: column projection
Group By Operator: grouping and aggregation; common attributes: aggregations, mode; when there is no keys attribute, everything falls into a single group
Reduce Output Operator: emits the results to the reduce side; common attribute: sort order
Fetch Operator: the client fetches the result; common attribute: limit
Common attribute values and their meanings:
aggregations, used in Group By Operator:
  count(): row count
mode, used in Group By Operator:
  hash: map-side partial aggregation using an in-memory hash table
  mergepartial: merge the partial aggregation results
  final: produce the final aggregation result
sort order, used in Reduce Output Operator (see the fragment sketched after this list):
  +    sort ascending
  (blank) no sorting
  ++   two key columns, both ascending
  +-   two key columns, first ascending, second descending
  -    sort descending
  and so on
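To make the sort order notation concrete, here is a minimal sketch of the fragment produced by ordering on two columns, reusing the sid and applove_date columns from the examples below; column types, Statistics, and the surrounding operators are omitted:
hive> explain select sid, applove_date from t_data1 order by sid asc, applove_date desc;
...
        Reduce Output Operator
          key expressions: _col0, _col1     # sid and applove_date, as renamed by the Select Operator
          sort order: +-                    # first key ascending, second key descending
...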
Below are the execution plans of some typical operations.
Let's start with a simple one:
hive> explain select count(*) from t_data1 ;
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
# the dependencies between the stages
STAGE PLANS: # the execution plan of each stage
Stage: Stage-1
Map Reduce # this stage is an MR job
Map Operator Tree: # the operator tree for the map phase
TableScan # scan the table to read the data
alias: t_data1 # the alias of the scanned table
Statistics: Num rows: 1 Data size: 43835224 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator # column projection
Statistics: Num rows: 1 Data size: 43835224 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator # aggregation; no keys, so there is only a single group
aggregations: count() # the aggregate function
mode: hash # map-side partial aggregation using an in-memory hash table
outputColumnNames: _col0 # the output column names
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator # emit the results to the reduce side
sort order: # blank, no sorting
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col0 (type: bigint) # the value expressions
Reduce Operator Tree: # the operator tree for the reduce phase
Group By Operator # aggregation
aggregations: count(VALUE._col0) # aggregate over the map-side partial counts
mode: mergepartial # merge the partial results contributed by each map task
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator # write the result to a file
compressed: false # the output is not compressed
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0 # depends on Stage-1
Fetch Operator # the client fetches the result
limit: -1 # no limit
Processor Tree:
ListSink
That is the execution plan of a simple count(*).
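The plan above ends in a Fetch Operator because the result goes back to the client. When the same aggregate is written into a table, the final stage is typically the Move Operator mentioned earlier. Below is a minimal sketch of just that stage for a hypothetical target table t_result; the MR stage, Statistics, and any stats-collection stage are omitted, and the file formats assume a plain text table:
hive> explain insert overwrite table t_result select count(*) from t_data1;
...
  Stage: Stage-0
    Move Operator                  # move the job output into the target table's directory
      tables:
          replace: true            # insert overwrite replaces the existing data
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: default.t_result    # hypothetical target table
...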
Next, the execution plan of a count(distinct):
hive> explain select count(distinct sid) from t_data1 ;
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: t_data1
Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: sid (type: bigint) # project the sid column
outputColumnNames: sid
Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
Group By Operator # aggregation
aggregations: count(DISTINCT sid) # the aggregate function
keys: sid (type: bigint) # the grouping/distinct key
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator # emit to the reduce side
key expressions: _col0 (type: bigint) # the key expressions
sort order: + # sort ascending
Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Group By Operator # aggregation
aggregations: count(DISTINCT KEY._col0:0._col0)
mode: mergepartial # merge the partial aggregation results
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
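Because all of the distinct sid values are funneled to the reduce side of this single MR stage (typically a single reducer), count(distinct) is often rewritten as a group-by subquery so the work is split across two MR stages. A minimal sketch of the rewritten query and the stage dependencies it typically produces, with the detailed stage plans omitted:
hive> explain select count(*) from (select sid from t_data1 group by sid) t;
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 depends on stages: Stage-2
...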
A group-by aggregation example:
explain select applove_date , count(*) from t_data1 group by applove_date ;
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: t_data1
Statistics: Num rows: 1095880 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
Sele