Optimization outline: https://blog.youkuaiyun.com/u010003835/article/details/105334641
Test table and test data
+----------------------------------------------------+
| createtab_stmt |
+----------------------------------------------------+
| CREATE TABLE `datacube_salary_org`( |
| `company_name` string COMMENT 'company name', |
| `dep_name` string COMMENT 'department name', |
| `user_id` bigint COMMENT 'user id', |
| `user_name` string COMMENT 'user name', |
| `salary` decimal(10,2) COMMENT 'salary', |
| `create_time` date COMMENT 'create time', |
| `update_time` date COMMENT 'update time') |
| PARTITIONED BY ( |
| `pt` string COMMENT 'partition') |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
| WITH SERDEPROPERTIES ( |
| 'field.delim'=',', |
| 'serialization.format'=',') |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.mapred.TextInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION |
| 'hdfs://cdh-manager:8020/user/hive/warehouse/data_warehouse_test.db/datacube_salary_org' |
| TBLPROPERTIES ( |
| 'transient_lastDdlTime'='1586310488') |
+----------------------------------------------------+
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
| datacube_salary_org.company_name | datacube_salary_org.dep_name | datacube_salary_org.user_id | datacube_salary_org.user_name | datacube_salary_org.salary | datacube_salary_org.create_time | datacube_salary_org.update_time | datacube_salary_org.pt |
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
| s.zh | engineer | 1 | szh | 28000.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| s.zh | engineer | 2 | zyq | 26000.00 | 2020-04-03 | 2020-04-03 | 20200405 |
| s.zh | tester | 3 | gkm | 20000.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | finance | 4 | pip | 13400.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | finance | 5 | kip | 24500.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | finance | 6 | zxxc | 13000.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | kiccp | 7 | xsz | 8600.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| s.zh | engineer | 1 | szh | 28000.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| s.zh | engineer | 2 | zyq | 26000.00 | 2020-04-03 | 2020-04-03 | 20200406 |
| s.zh | tester | 3 | gkm | 20000.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | finance | 4 | pip | 13400.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | finance | 5 | kip | 24500.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | finance | 6 | zxxc | 13000.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | kiccp | 7 | xsz | 8600.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| s.zh | enginer | 1 | szh | 28000.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| s.zh | enginer | 2 | zyq | 26000.00 | 2020-04-03 | 2020-04-03 | 20200407 |
| s.zh | tester | 3 | gkm | 20000.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | finance | 4 | pip | 13400.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | finance | 5 | kip | 24500.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | finance | 6 | zxxc | 13000.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | kiccp | 7 | xsz | 8600.00 | 2020-04-07 | 2020-04-07 | 20200407 |
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
Scenario 5: Sorting
1) Use ORDER BY and SORT BY appropriately, and choose between them deliberately
2) Use LIMIT to bound the sorted output
1) Use ORDER BY and SORT BY appropriately, and choose between them deliberately
Avoid ORDER BY whenever you can. ORDER BY sorts globally: in a larger query it adds a dedicated sorting step, and during that sort all of the data is funneled into a single Reducer.
SORT BY, by contrast, only sorts locally within each reducer.
If you are not familiar with SORT BY, see my earlier article:
https://blog.youkuaiyun.com/u010003835/article/details/80938339
As you can see, SORT BY and ORDER BY are not the same thing.
Let's run both sorts on the data above and see how they differ:
SORT BY
First, set the number of Reducers:
Because the data volume is small, the Reducer count defaults to 1. Setting it explicitly here mainly lets us control how the data gets distributed.
set mapred.reduce.tasks=5;
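(A side note of mine, not part of the original run: mapred.reduce.tasks is the legacy MapReduce 1 property name. The sketch below shows the newer equivalents on Hadoop 2 / recent Hive; verify names and defaults on your own version.)
set mapreduce.job.reduces=5;                            -- newer name for mapred.reduce.tasks
-- Or let Hive derive the reducer count from input size instead of fixing it:
-- set hive.exec.reducers.bytes.per.reducer=268435456;  -- bytes handled per reducer
-- set hive.exec.reducers.max=1009;                     -- upper bound on the reducer count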
SORT BY SQL
SELECT user_id, user_name, salary
FROM datacube_salary_org
SORT BY user_id;
EXPLAIN output for SORT BY:
INFO : Starting task [Stage-2:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20200414042647_8a05e5e1-5eff-4fe7-b3e1-10e84b19c6c0); Time taken: 0.003 seconds
INFO : OK
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: datacube_salary_org |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: user_id (type: bigint), user_name (type: string), salary (type: decimal(10,2)) |
| outputColumnNames: _col0, _col1, _col2 |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: bigint) |
| sort order: + |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col1 (type: string), _col2 (type: decimal(10,2)) |
| Reduce Operator Tree: |
| Select Operator |
| expressions: KEY.reducesinkkey0 (type: bigint), VALUE._col0 (type: string), VALUE._col1 (type: decimal(10,2)) |
| outputColumnNames: _col0, _col1, _col2 |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
39 rows selected (0.153 seconds)
Execution output:
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 5
INFO : 2020-04-14 04:28:56,154 Stage-1 map = 0%, reduce = 0%
INFO : 2020-04-14 04:29:03,348 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.02 sec
INFO : 2020-04-14 04:29:08,477 Stage-1 map = 100%, reduce = 20%, Cumulative CPU 3.77 sec
INFO : 2020-04-14 04:29:12,587 Stage-1 map = 100%, reduce = 40%, Cumulative CPU 5.53 sec
INFO : 2020-04-14 04:29:16,693 Stage-1 map = 100%, reduce = 60%, Cumulative CPU 7.04 sec
INFO : 2020-04-14 04:29:20,804 Stage-1 map = 100%, reduce = 80%, Cumulative CPU 8.76 sec
INFO : 2020-04-14 04:29:25,005 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 10.99 sec
INFO : MapReduce Total cumulative CPU time: 10 seconds 990 msec
INFO : Ended Job = job_1586835454940_0014
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 5 Cumulative CPU: 10.99 sec HDFS Read: 28509 HDFS Write: 1002 SUCCESS
INFO : Total MapReduce CPU Time Spent: 10 seconds 990 msec
INFO : Completed executing command(queryId=hive_20200414042847_8447505f-581e-4204-804c-f63d97dcd553); Time taken: 39.228 seconds
INFO : OK
+----------+------------+-----------+
| user_id | user_name | salary |
+----------+------------+-----------+
| 6 | zxxc | 13000.00 |
| 7 | xsz | 8600.00 |
| 5 | kip | 24500.00 |
| 6 | zxxc | 13000.00 |
| 7 | xsz | 8600.00 |
| 1 | szh | 28000.00 |
| 1 | szh | 28000.00 |
| 5 | kip | 24500.00 |
| 5 | kip | 24500.00 |
| 2 | zyq | 26000.00 |
| 2 | zyq | 26000.00 |
| 2 | zyq | 26000.00 |
| 3 | gkm | 20000.00 |
| 4 | pip | 13400.00 |
| 4 | pip | 13400.00 |
| 1 | szh | 28000.00 |
| 3 | gkm | 20000.00 |
| 3 | gkm | 20000.00 |
| 4 | pip | 13400.00 |
| 6 | zxxc | 13000.00 |
| 7 | xsz | 8600.00 |
+----------+------------+-----------+
21 rows selected (39.402 seconds)
The output contains five ascending runs, which confirms that the rows were spread across 5 reducers and that each reducer sorted only its own slice. Splitting the output above at every point where user_id decreases gives one run per reducer:
Run 1: user_id 6, 7
Run 2: user_id 5, 6, 7
Run 3: user_id 1, 1, 5, 5
Run 4: user_id 2, 2, 2, 3, 4, 4
Run 5: user_id 1, 3, 3, 4, 6, 7
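A related point (this example is mine, it is not from the original test run): because SORT BY only guarantees order inside each reducer, it is usually paired with DISTRIBUTE BY when you care about which rows land on the same reducer. A minimal sketch:
SELECT user_id, user_name, salary
FROM datacube_salary_org
DISTRIBUTE BY user_id   -- rows with the same user_id go to the same reducer
SORT BY user_id;        -- then sort within each reducer
When the DISTRIBUTE BY and SORT BY columns are the same and the sort is ascending, CLUSTER BY user_id is shorthand for the pair.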
ORDER BY
First, set the number of Reducers:
Because the data volume is small, the Reducer count defaults to 1. Setting it explicitly here mainly lets us control how the data gets distributed.
set mapred.reduce.tasks=5;
ORDER BY SQL
SELECT user_id, user_name, salary
FROM datacube_salary_org
ORDER BY user_id;
EXPLAIN output:
INFO : Starting task [Stage-2:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20200414044711_439184d4-6d74-46d6-8b79-1853b21090f2); Time taken: 0.006 seconds
INFO : OK
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: datacube_salary_org |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: user_id (type: bigint), user_name (type: string), salary (type: decimal(10,2)) |
| outputColumnNames: _col0, _col1, _col2 |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: bigint) |
| sort order: + |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col1 (type: string), _col2 (type: decimal(10,2)) |
| Reduce Operator Tree: |
| Select Operator |
| expressions: KEY.reducesinkkey0 (type: bigint), VALUE._col0 (type: string), VALUE._col1 (type: decimal(10,2)) |
| outputColumnNames: _col0, _col1, _col2 |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
39 rows selected (0.143 seconds)
Execution output:
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO : 2020-04-14 04:56:06,003 Stage-1 map = 0%, reduce = 0%
INFO : 2020-04-14 04:56:13,192 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.31 sec
INFO : 2020-04-14 04:56:18,323 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.99 sec
INFO : MapReduce Total cumulative CPU time: 3 seconds 990 msec
INFO : Ended Job = job_1586835454940_0015
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.99 sec HDFS Read: 11305 HDFS Write: 654 SUCCESS
INFO : Total MapReduce CPU Time Spent: 3 seconds 990 msec
INFO : Completed executing command(queryId=hive_20200414045557_ba114e3d-fe27-4268-89a6-ecb424340503); Time taken: 21.548 seconds
INFO : OK
+----------+------------+-----------+
| user_id | user_name | salary |
+----------+------------+-----------+
| 1 | szh | 28000.00 |
| 1 | szh | 28000.00 |
| 1 | szh | 28000.00 |
| 2 | zyq | 26000.00 |
| 2 | zyq | 26000.00 |
| 2 | zyq | 26000.00 |
| 3 | gkm | 20000.00 |
| 3 | gkm | 20000.00 |
| 3 | gkm | 20000.00 |
| 4 | pip | 13400.00 |
| 4 | pip | 13400.00 |
| 4 | pip | 13400.00 |
| 5 | kip | 24500.00 |
| 5 | kip | 24500.00 |
| 5 | kip | 24500.00 |
| 6 | zxxc | 13000.00 |
| 6 | zxxc | 13000.00 |
| 6 | zxxc | 13000.00 |
| 7 | xsz | 8600.00 |
| 7 | xsz | 8600.00 |
| 7 | xsz | 8600.00 |
+----------+------------+-----------+
21 rows selected (21.682 seconds)
As you can see, the EXPLAIN output is identical for the two queries.
The real difference shows up in the number of Reducers: ORDER BY strictly forces a single Reducer because the global sort requires it, so adjusting the Reducer count makes no difference for an ORDER BY query.
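(An aside of mine, not covered by the logs above, and the property names vary by Hive version: strict mode guards against exactly this single-reducer bottleneck by rejecting an ORDER BY that has no LIMIT.)
-- Older Hive: strict mode refuses ORDER BY without LIMIT
set hive.mapred.mode=strict;
-- Hive 2.x and later split this into a dedicated check:
set hive.strict.checks.orderby.no.limit=true;
-- With either setting on, the following now fails at compile time:
-- SELECT user_id FROM datacube_salary_org ORDER BY user_id;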
2) Use LIMIT to bound the sorted output
We all know that ORDER BY + LIMIT N returns the top N rows.
But SORT BY + LIMIT N also returns the top N rows, and on large datasets it is more efficient.
Let's fetch the first 5 rows by user_id.
ORDER BY version
set mapred.reduce.tasks=5;
SELECT user_id, user_name, salary
FROM datacube_salary_org
ORDER BY user_id
LIMIT 5;
Execution output:
INFO : Starting Job = job_1586835454940_0018, Tracking URL = http://cdh-manager:8088/proxy/application_1586835454940_0018/
INFO : Kill Command = /opt/cloudera/parcels/CDH-6.0.1-1.cdh6.0.1.p0.590678/lib/hadoop/bin/hadoop job -kill job_1586835454940_0018
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO : 2020-04-14 05:21:17,010 Stage-1 map = 0%, reduce = 0%
INFO : 2020-04-14 05:21:24,194 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.48 sec
INFO : 2020-04-14 05:21:30,379 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.22 sec
INFO : MapReduce Total cumulative CPU time: 3 seconds 220 msec
INFO : Ended Job = job_1586835454940_0018
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.22 sec HDFS Read: 11463 HDFS Write: 222 SUCCESS
INFO : Total MapReduce CPU Time Spent: 3 seconds 220 msec
INFO : Completed executing command(queryId=hive_20200414052108_f4b9e94b-ade0-4756-9175-1139a5975b7f); Time taken: 22.635 seconds
INFO : OK
+----------+------------+-----------+
| user_id | user_name | salary |
+----------+------------+-----------+
| 1 | szh | 28000.00 |
| 1 | szh | 28000.00 |
| 1 | szh | 28000.00 |
| 2 | zyq | 26000.00 |
| 2 | zyq | 26000.00 |
+----------+------------+-----------+
5 rows selected (22.751 seconds)
Note that only one Reducer is used.
SORT BY version
set mapred.reduce.tasks=5;
SELECT user_id, user_name, salary
FROM datacube_salary_org
SORT BY user_id
LIMIT 5;
Execution output:
INFO : Starting Job = job_1586835454940_0020, Tracking URL = http://cdh-manager:8088/proxy/application_1586835454940_0020/
INFO : Kill Command = /opt/cloudera/parcels/CDH-6.0.1-1.cdh6.0.1.p0.590678/lib/hadoop/bin/hadoop job -kill job_1586835454940_0020
INFO : Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
INFO : 2020-04-14 05:23:52,360 Stage-2 map = 0%, reduce = 0%
INFO : 2020-04-14 05:23:58,508 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 2.28 sec
INFO : 2020-04-14 05:24:04,679 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 4.09 sec
INFO : MapReduce Total cumulative CPU time: 4 seconds 90 msec
INFO : Ended Job = job_1586835454940_0020
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 5 Cumulative CPU: 10.08 sec HDFS Read: 24849 HDFS Write: 610 SUCCESS
INFO : Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 4.09 sec HDFS Read: 7846 HDFS Write: 222 SUCCESS
INFO : Total MapReduce CPU Time Spent: 14 seconds 170 msec
INFO : Completed executing command(queryId=hive_20200414052303_4228dcc5-a45f-4fe1-93e8-20be71cf84d3); Time taken: 62.358 seconds
INFO : OK
+----------+------------+-----------+
| user_id | user_name | salary |
+----------+------------+-----------+
| 1 | szh | 28000.00 |
| 1 | szh | 28000.00 |
| 1 | szh | 28000.00 |
| 2 | zyq | 26000.00 |
| 2 | zyq | 26000.00 |
+----------+------------+-----------+
You can see that the first job (the SORT BY itself) runs with 5 Reducers, while the second job (the final LIMIT) runs with 1 Reducer.
Let's look at the execution plans.
ORDER BY + LIMIT
INFO : Starting task [Stage-2:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20200414052614_9ded2fc7-ad41-4f6d-9c3d-8a46f529bc7d); Time taken: 0.005 seconds
INFO : OK
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: datacube_salary_org |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: user_id (type: bigint), user_name (type: string), salary (type: decimal(10,2)) |
| outputColumnNames: _col0, _col1, _col2 |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: bigint) |
| sort order: + |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| TopN Hash Memory Usage: 0.1 |
| value expressions: _col1 (type: string), _col2 (type: decimal(10,2)) |
| Reduce Operator Tree: |
| Select Operator |
| expressions: KEY.reducesinkkey0 (type: bigint), VALUE._col0 (type: string), VALUE._col1 (type: decimal(10,2)) |
| outputColumnNames: _col0, _col1, _col2 |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Limit |
| Number of rows: 5 |
| Statistics: Num rows: 5 Data size: 240 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 5 Data size: 240 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: 5 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
43 rows selected (0.142 seconds)
SORT BY + LIMIT
INFO : Starting task [Stage-3:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20200414052914_89ed6d4a-e498-4372-bdc1-6aa8ceb43ddd); Time taken: 0.004 seconds
INFO : OK
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-2 depends on stages: Stage-1 |
| Stage-0 depends on stages: Stage-2 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: datacube_salary_org |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: user_id (type: bigint), user_name (type: string), salary (type: decimal(10,2)) |
| outputColumnNames: _col0, _col1, _col2 |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: bigint) |
| sort order: + |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| TopN Hash Memory Usage: 0.1 |
| value expressions: _col1 (type: string), _col2 (type: decimal(10,2)) |
| Reduce Operator Tree: |
| Select Operator |
| expressions: KEY.reducesinkkey0 (type: bigint), VALUE._col0 (type: string), VALUE._col1 (type: decimal(10,2)) |
| outputColumnNames: _col0, _col1, _col2 |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Limit |
| Number of rows: 5 |
| Statistics: Num rows: 5 Data size: 240 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-2 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Reduce Output Operator |
| key expressions: _col0 (type: bigint) |
| sort order: + |
| Statistics: Num rows: 5 Data size: 240 Basic stats: COMPLETE Column stats: NONE |
| TopN Hash Memory Usage: 0.1 |
| value expressions: _col1 (type: string), _col2 (type: decimal(10,2)) |
| Reduce Operator Tree: |
| Select Operator |
| expressions: KEY.reducesinkkey0 (type: bigint), VALUE._col0 (type: string), VALUE._col1 (type: decimal(10,2)) |
| outputColumnNames: _col0, _col1, _col2 |
| Statistics: Num rows: 5 Data size: 240 Basic stats: COMPLETE Column stats: NONE |
| Limit |
| Number of rows: 5 |
| Statistics: Num rows: 5 Data size: 240 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 5 Data size: 240 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: 5 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
69 rows selected (0.143 seconds)
Stage-1 -> Stage-2 -> Stage-0
In Stage-1, each Reducer keeps only its local top 5; in Stage-2, a single Reducer merges those partial results and takes the global top 5. With 5 Reducers and LIMIT 5, at most 25 candidate rows ever reach the final stage.
Summary
With SORT BY + LIMIT N, each Reducer only has to keep its own top N records; a final job then picks the global top N out of the M Reducers' partial results.
Tips
1) SORT BY + LIMIT N launches one more job than ORDER BY + LIMIT N, so keep the data volume in mind when choosing this approach.
2) It is said that each Reducer keeps its top N records using a bubble-sort-style pass.
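One more hedged note of mine about the "TopN Hash Memory Usage: 0.1" line that shows up in both plans above: it comes from Hive's limit pushdown, in which the Reduce Output Operator keeps only the current top-N candidates in an in-memory hash before the shuffle, so most rows never reach the reducers at all. The knob below controls it; the exact default depends on your Hive version.
set hive.limit.pushdown.memory.usage=0.1;   -- fraction of task memory reserved for the top-N hash
-- A non-positive value disables the pushdown (exact behaviour varies by version).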