Hive Learning (3) --- HQL Execution Speed on Larger Datasets

Using a table of 20,600,000 rows, this article walks through how Hive executes an HQL statement that combines a multi-table join, filter conditions, sorting, and a limit. The execution spawns 3 MapReduce jobs; from the log, some preliminary patterns in how Hive plans jobs according to query complexity are noted, and the effect of adding a sort on the job count is compared.


Generate a wetherdata4.txt file containing 20,000,000 rows and append it to the weather table, giving 20,600,000 rows in total for the query-performance analysis below.
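As a minimal sketch, appending the file might look like this in Hive (the local path is a hypothetical placeholder, and it assumes weather is a plain text-format table):

-- hypothetical path; 'into table' without 'overwrite' appends to existing data
load data local inpath '/home/hadoop/wetherdata4.txt' into table weather;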

The first HQL statement:

select cy.number, wh.*, pm.pmlevel
from cityinfo cy join weather wh on (cy.name = wh.city)
join pminfo pm on (pm.pmvalue = wh.pmvalue)
where wh.city = 'hangzhou' and wh.weath = 'fine'
  and wh.minTemperat in (-18, 25, 43)
order by maxTemperat DESC limit 20;

A three-table join, with filter conditions, an IN list, and a sort.


Execution log:

hive> select cy.number,wh.*,pm.pmlevel                                                 
    >      from cityinfo cy join weather wh on (cy.name=wh.city) 
    >      join pminfo pm on (pm.pmvalue=wh.pmvalue) 
    >     where wh.city='hangzhou' and wh.weath='fine' and wh.minTemperat in         
    >     ( -18,25,43) order by maxTemperat DESC limit 20;
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks not specified. Estimated from input data size: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1400798880449_0059, Tracking URL = http://Master.Hadoop:8088/proxy/application_1400798880449_0059/
Kill Command = /usr/hadoop/bin/hadoop job  -kill job_1400798880449_0059
Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 2
2014-05-24 09:54:29,453 Stage-1 map = 0%,  reduce = 0%
2014-05-24 09:54:36,505 Stage-1 map = 20%,  reduce = 0%, Cumulative CPU 1.32 sec
2014-05-24 09:54:37,528 Stage-1 map = 20%,  reduce = 0%, Cumulative CPU 1.32 sec
2014-05-24 09:54:38,551 Stage-1 map = 20%,  reduce = 0%, Cumulative CPU 1.32 sec
2014-05-24 09:54:39,948 Stage-1 map = 20%,  reduce = 0%, Cumulative CPU 13.56 sec
2014-05-24 09:54:41,309 Stage-1 map = 21%,  reduce = 0%, Cumulative CPU 17.58 sec
2014-05-24 09:54:42,509 Stage-1 map = 35%,  reduce = 0%, Cumulative CPU 21.57 sec
2014-05-24 09:54:44,675 Stage-1 map = 35%,  reduce = 0%, Cumulative CPU 26.09 sec
2014-05-24 09:54:46,488 Stage-1 map = 43%,  reduce = 0%, Cumulative CPU 33.4 sec
2014-05-24 09:54:47,899 Stage-1 map = 49%,  reduce = 0%, Cumulative CPU 35.77 sec
2014-05-24 09:54:48,957 Stage-1 map = 49%,  reduce = 0%, Cumulative CPU 35.77 sec
2014-05-24 09:54:49,997 Stage-1 map = 87%,  reduce = 0%, Cumulative CPU 46.21 sec
2014-05-24 09:54:51,026 Stage-1 map = 87%,  reduce = 0%, Cumulative CPU 46.21 sec
2014-05-24 09:54:52,049 Stage-1 map = 87%,  reduce = 7%, Cumulative CPU 47.25 sec
2014-05-24 09:54:53,085 Stage-1 map = 100%,  reduce = 7%, Cumulative CPU 49.46 sec
2014-05-24 09:54:54,109 Stage-1 map = 100%,  reduce = 57%, Cumulative CPU 51.61 sec
2014-05-24 09:54:55,141 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 53.67 sec
2014-05-24 09:54:56,177 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 53.67 sec
MapReduce Total cumulative CPU time: 53 seconds 670 msec
Ended Job = job_1400798880449_0059
Launching Job 2 out of 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1400798880449_0060, Tracking URL = http://Master.Hadoop:8088/proxy/application_1400798880449_0060/
Kill Command = /usr/hadoop/bin/hadoop job  -kill job_1400798880449_0060
Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 1
2014-05-24 09:54:59,962 Stage-2 map = 0%,  reduce = 0%
2014-05-24 09:55:04,109 Stage-2 map = 50%,  reduce = 0%, Cumulative CPU 0.94 sec
2014-05-24 09:55:05,130 Stage-2 map = 50%,  reduce = 0%, Cumulative CPU 0.94 sec
2014-05-24 09:55:06,164 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 3.34 sec
2014-05-24 09:55:07,187 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 3.34 sec
2014-05-24 09:55:08,212 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 3.34 sec
2014-05-24 09:55:09,237 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 3.34 sec
2014-05-24 09:55:10,258 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 5.39 sec
2014-05-24 09:55:11,294 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 5.39 sec
MapReduce Total cumulative CPU time: 5 seconds 390 msec
Ended Job = job_1400798880449_0060
Launching Job 3 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1400798880449_0061, Tracking URL = http://Master.Hadoop:8088/proxy/application_1400798880449_0061/
Kill Command = /usr/hadoop/bin/hadoop job  -kill job_1400798880449_0061
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2014-05-24 09:55:15,223 Stage-3 map = 0%,  reduce = 0%
2014-05-24 09:55:19,362 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.27 sec
2014-05-24 09:55:20,404 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.27 sec
2014-05-24 09:55:21,425 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.27 sec
2014-05-24 09:55:22,446 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.27 sec
2014-05-24 09:55:23,534 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 1.83 sec
2014-05-24 09:55:24,571 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 1.83 sec
2014-05-24 09:55:25,603 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 1.83 sec
MapReduce Total cumulative CPU time: 1 seconds 830 msec
Ended Job = job_1400798880449_0061
MapReduce Jobs Launched: 
Job 0: Map: 5  Reduce: 2   Cumulative CPU: 53.67 sec   HDFS Read: 1061735802 HDFS Write: 6918992 SUCCESS
Job 1: Map: 2  Reduce: 1   Cumulative CPU: 5.39 sec   HDFS Read: 6923192 HDFS Write: 7232163 SUCCESS
Job 2: Map: 1  Reduce: 1   Cumulative CPU: 1.83 sec   HDFS Read: 7232532 HDFS Write: 1126 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 0 seconds 890 msec
OK
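The log repeatedly prints the three knobs Hive exposes for controlling reducer counts. As an illustration of how they might be tuned (the values below are assumptions for demonstration, not settings from this run):

-- lower the bytes-per-reducer target to get more reducers for the same input
set hive.exec.reducers.bytes.per.reducer=256000000;
-- cap how many reducers any single job may use
set hive.exec.reducers.max=10;
-- or bypass the estimate entirely and force a fixed reducer count
set mapred.reduce.tasks=4;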


From the log above, 3 jobs were executed in total. Through experimentation I found that the number of jobs Hive plans depends on the complexity of the HQL statement; what the exact rules are remains to be studied. Judging from the stages here, the first two jobs most likely correspond to the two joins and the third to the final ORDER BY ... LIMIT. One example:

select *
from cityinfo join weather on (cityinfo.name = weather.city)
where city = 'hangzhou' and weath = 'fine' and minTemperat = -16;

For the statement above, Hive generates only one job, which may of course consist of N mappers and M reducers.
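To see how many jobs Hive will plan without actually running the query, an explain can be issued (a standard Hive statement; the exact output layout varies by version):

explain
select *
from cityinfo join weather on (cityinfo.name = weather.city)
where city = 'hangzhou' and weath = 'fine' and minTemperat = -16;

The STAGE DEPENDENCIES section of the output lists the planned stages, from which the number of MapReduce jobs can be read off.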

But if a sort is appended at the end, it becomes 2 jobs:

select *
from cityinfo join weather on (cityinfo.name = weather.city)
where city = 'hangzhou' and weath = 'fine' and minTemperat = -16
order by maxTemperat;
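The extra job comes from ORDER BY imposing a total order, which on MapReduce ends in a single-reducer sorting stage. When per-reducer ordering is enough, Hive's SORT BY avoids that bottleneck (standard Hive semantics; the query below is just the one above, rewritten for illustration):

-- sort by orders rows within each reducer only, so no single-reducer
-- total-sort stage is required; add distribute by to control row routing
select *
from cityinfo join weather on (cityinfo.name = weather.city)
where city = 'hangzhou' and weath = 'fine' and minTemperat = -16
sort by maxTemperat;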

Data skew and top-command output observed during execution: [screenshots omitted]