Bucket tables in Hive

When the data volume is large, the only way to finish a job faster is to run multiple map and reduce tasks in parallel.
But if the input is a single file, Hive may be able to launch only one map task.
In that case a bucket table is a good choice: by specifying a CLUSTERED BY column, the data is hash-partitioned into multiple smaller files.

create table sunwg_test11(id int, name string)
CLUSTERED BY(id) SORTED BY(name) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
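
For intuition, each row's bucket is chosen by hashing the CLUSTERED BY column and taking it modulo the bucket count. A rough sketch using Hive's built-in hash() and pmod() functions (an approximation for illustration only; Hive's actual placement also masks the hash's sign bit before the modulo):

-- Preview (approximately) which of the 32 buckets each source row lands in
SELECT id, pmod(hash(id), 32) AS bucket_no
FROM test09;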

Before running the INSERT, don't forget to set
set hive.enforce.bucketing = true;
to force the output to be written with multiple reducers, one per bucket. (In Hive 2.0 and later this setting was removed and bucketing is always enforced.)

hive> INSERT OVERWRITE TABLE sunwg_test11 select * from test09;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 32
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapred.reduce.tasks=
Starting Job = job_201103070826_0018, Tracking URL = http://hadoop00:50030/jobdetails.jsp?jobid=job_201103070826_0018
Kill Command = /home/hjl/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=hadoop00:9001 -kill job_201103070826_0018
2011-03-08 11:34:23,055 Stage-1 map = 0%,  reduce = 0%
2011-03-08 11:34:27,084 Stage-1 map = 6%,  reduce = 0%
2011-03-08 11:34:29,100 Stage-1 map = 13%,  reduce = 0%
2011-03-08 11:34:32,124 Stage-1 map = 19%,  reduce = 0%
2011-03-08 11:34:34,142 Stage-1 map = 22%,  reduce = 0%
2011-03-08 11:34:35,151 Stage-1 map = 25%,  reduce = 0%
2011-03-08 11:34:37,167 Stage-1 map = 28%,  reduce = 0%
2011-03-08 11:34:39,182 Stage-1 map = 31%,  reduce = 0%
2011-03-08 11:34:41,199 Stage-1 map = 34%,  reduce = 1%
2011-03-08 11:34:42,211 Stage-1 map = 38%,  reduce = 1%
2011-03-08 11:34:44,233 Stage-1 map = 41%,  reduce = 1%
2011-03-08 11:34:46,250 Stage-1 map = 44%,  reduce = 1%
2011-03-08 11:34:48,270 Stage-1 map = 47%,  reduce = 1%
2011-03-08 11:34:49,280 Stage-1 map = 50%,  reduce = 1%
2011-03-08 11:34:51,300 Stage-1 map = 53%,  reduce = 1%
2011-03-08 11:34:53,316 Stage-1 map = 56%,  reduce = 1%
2011-03-08 11:34:55,330 Stage-1 map = 59%,  reduce = 1%
2011-03-08 11:34:56,340 Stage-1 map = 63%,  reduce = 1%
2011-03-08 11:34:58,357 Stage-1 map = 66%,  reduce = 1%
2011-03-08 11:35:00,378 Stage-1 map = 69%,  reduce = 1%
2011-03-08 11:35:02,393 Stage-1 map = 72%,  reduce = 1%
2011-03-08 11:35:04,409 Stage-1 map = 75%,  reduce = 1%
2011-03-08 11:35:05,419 Stage-1 map = 78%,  reduce = 1%
2011-03-08 11:35:07,435 Stage-1 map = 81%,  reduce = 1%
2011-03-08 11:35:09,451 Stage-1 map = 84%,  reduce = 2%
2011-03-08 11:35:12,475 Stage-1 map = 88%,  reduce = 2%
2011-03-08 11:35:14,496 Stage-1 map = 91%,  reduce = 2%
2011-03-08 11:35:16,513 Stage-1 map = 94%,  reduce = 2%
2011-03-08 11:35:18,528 Stage-1 map = 97%,  reduce = 2%
2011-03-08 11:35:20,552 Stage-1 map = 100%,  reduce = 2%
2011-03-08 11:35:25,589 Stage-1 map = 100%,  reduce = 6%
2011-03-08 11:35:33,645 Stage-1 map = 100%,  reduce = 9%
2011-03-08 11:35:34,654 Stage-1 map = 100%,  reduce = 13%
2011-03-08 11:35:39,693 Stage-1 map = 100%,  reduce = 16%
2011-03-08 11:35:41,710 Stage-1 map = 100%,  reduce = 19%
2011-03-08 11:35:45,740 Stage-1 map = 100%,  reduce = 22%
2011-03-08 11:35:47,757 Stage-1 map = 100%,  reduce = 25%
2011-03-08 11:35:52,793 Stage-1 map = 100%,  reduce = 28%
2011-03-08 11:35:54,808 Stage-1 map = 100%,  reduce = 31%
2011-03-08 11:35:59,844 Stage-1 map = 100%,  reduce = 34%
2011-03-08 11:36:01,861 Stage-1 map = 100%,  reduce = 38%
2011-03-08 11:36:05,891 Stage-1 map = 100%,  reduce = 41%
2011-03-08 11:36:07,911 Stage-1 map = 100%,  reduce = 44%
2011-03-08 11:36:12,947 Stage-1 map = 100%,  reduce = 47%
2011-03-08 11:36:13,958 Stage-1 map = 100%,  reduce = 50%
2011-03-08 11:36:19,002 Stage-1 map = 100%,  reduce = 53%
2011-03-08 11:36:21,017 Stage-1 map = 100%,  reduce = 56%
2011-03-08 11:36:26,053 Stage-1 map = 100%,  reduce = 59%
2011-03-08 11:36:28,068 Stage-1 map = 100%,  reduce = 63%
2011-03-08 11:36:33,106 Stage-1 map = 100%,  reduce = 66%
2011-03-08 11:36:35,122 Stage-1 map = 100%,  reduce = 69%
2011-03-08 11:36:39,152 Stage-1 map = 100%,  reduce = 72%
2011-03-08 11:36:41,169 Stage-1 map = 100%,  reduce = 75%
2011-03-08 11:36:46,208 Stage-1 map = 100%,  reduce = 78%
2011-03-08 11:36:48,227 Stage-1 map = 100%,  reduce = 81%
2011-03-08 11:36:53,262 Stage-1 map = 100%,  reduce = 84%
2011-03-08 11:36:54,271 Stage-1 map = 100%,  reduce = 88%
2011-03-08 11:36:59,309 Stage-1 map = 100%,  reduce = 91%
2011-03-08 11:37:01,328 Stage-1 map = 100%,  reduce = 94%
2011-03-08 11:37:06,365 Stage-1 map = 100%,  reduce = 97%
2011-03-08 11:37:08,382 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201103070826_0018
Loading data to table sunwg_test11
5 Rows loaded to sunwg_test11
OK
Time taken: 175.036 seconds
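
Note in the log above that the number of reduce tasks was fixed at compile time to 32, matching the bucket count. If hive.enforce.bucketing is not set, the Hive wiki suggests the same effect can be achieved manually (a sketch; the reducer count must equal the bucket count):

set mapred.reduce.tasks = 32;
INSERT OVERWRITE TABLE sunwg_test11
SELECT * FROM test09 CLUSTER BY id;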

Under the table's HDFS directory, sunwg_test11, there are now 32 files instead of one:
[hadoop@hadoop00 ~]$ hadoop fs -ls /hjl/sunwg_test11
Found 32 items
-rw-r--r--   3 hjl hadoop          0 2011-03-08 11:20 /hjl/sunwg_test11/attempt_201103070826_0018_r_000000_0
-rw-r--r--   3 hjl hadoop          0 2011-03-08 11:20 /hjl/sunwg_test11/attempt_201103070826_0018_r_000001_0
-rw-r--r--   3 hjl hadoop          0 2011-03-08 11:20 /hjl/sunwg_test11/attempt_201103070826_0018_r_000002_0
-rw-r--r--   3 hjl hadoop          0 2011-03-08 11:20 /hjl/sunwg_test11/attempt_201103070826_0018_r_000003_0
-rw-r--r--   3 hjl hadoop          8 2011-03-08 11:20 /hjl/sunwg_test11/attempt_201103070826_0018_r_000004_0
-rw-r--r--   3 hjl hadoop          9 2011-03-08 11:20 /hjl/sunwg_test11/attempt_201103070826_0018_r_000005_0
-rw-r--r--   3 hjl hadoop          8 2011-03-08 11:20 /hjl/sunwg_test11/attempt_201103070826_0018_r_000006_0
-rw-r--r--   3 hjl hadoop          9 2011-03-08 11:20 /hjl/sunwg_test11/attempt_201103070826_0018_r_000007_0
-rw-r--r--   3 hjl hadoop          9 2011-03-08 11:20 /hjl/sunwg_test11/attempt_201103070826_0018_r_000008_0
-rw-r--r--   3 hjl hadoop          0 2011-03-08 11:20 /hjl/sunwg_test11/attempt_201103070826_0018_r_000009_0
-rw-r--r--   3 hjl hadoop          0 2011-03-08 11:20 /hjl/sunwg_test11/attempt_201103070826_0018_r_000010_0
-rw-r--r--   3 hjl hadoop          0 2011-03-08 11:20 /hjl/sunwg_test11/attempt_201103070826_0018_r_000011_0
-rw-r--r--   3 hjl hadoop          0 2011-03-08 11:20 /hjl/sunwg_test11/attempt_201103070826_0018_r_000012_0
-rw-r--r--   3 hjl hadoop          0 2011-03-08 11:20 /hjl/sunwg_test11/attempt_201103070826_0018_r_000013_0
-rw-r--r--   3 hjl hadoop          0 2011-03-08 11:21 /hjl/sunwg_test11/attempt_201103070826_0018_r_000014_0
-rw-r--r--   3 hjl hadoop          0 2011-03-08 11:21 /hjl/sunwg_test11/attempt_201103070826_0018_r_000015_0
-rw-r--r--   3 hjl hadoop          0 2011-03-08 11:21 /hjl/sunwg_test11/attempt_201103070826_0018_r_000016_0
-rw-r--r--   3 hjl hadoop          0 2011-03-08 11:21 /hjl/sunwg_test11/attempt_201103070826_0018_r_000017_0
-rw-r--r--   3 hjl hadoop          0 2011-03-08 11:21 /hjl/sunwg_test11/attempt_201103070826_0018_r_000018_0
-rw-r--r--   3 hjl hadoop          0 2011-03-08 11:21 /hjl/sunwg_test11/attempt_201103070826_0018_r_000019_0
-rw-r--r--   3 hjl hadoop          0 2011-03-08 11:21 /hjl/sunwg_test11/attempt_201103070826_0018_r_000020_0
-rw-r--r--   3 hjl hadoop          0 2011-03-08 11:21 /hjl/sunwg_test11/attempt_201103070826_0018_r_000021_0
-rw-r--r--   3 hjl hadoop          0 2011-03-08 11:21 /hjl/sunwg_test11/attempt_201103070826_0018_r_000022_0
-rw-r--r--   3 hjl hadoop          0 2011-03-08 11:21 /hjl/sunwg_test11/attempt_201103070826_0018_r_000023_0
-rw-r--r--   3 hjl hadoop          0 2011-03-08 11:21 /hjl/sunwg_test11/attempt_201103070826_0018_r_000024_0
-rw-r--r--   3 hjl hadoop          0 2011-03-08 11:21 /hjl/sunwg_test11/attempt_201103070826_0018_r_000025_0
-rw-r--r--   3 hjl hadoop          0 2011-03-08 11:21 /hjl/sunwg_test11/attempt_201103070826_0018_r_000026_0
-rw-r--r--   3 hjl hadoop          0 2011-03-08 11:21 /hjl/sunwg_test11/attempt_201103070826_0018_r_000027_0
-rw-r--r--   3 hjl hadoop          0 2011-03-08 11:21 /hjl/sunwg_test11/attempt_201103070826_0018_r_000028_0
-rw-r--r--   3 hjl hadoop          0 2011-03-08 11:21 /hjl/sunwg_test11/attempt_201103070826_0018_r_000029_0
-rw-r--r--   3 hjl hadoop          0 2011-03-08 11:21 /hjl/sunwg_test11/attempt_201103070826_0018_r_000030_0
-rw-r--r--   3 hjl hadoop          0 2011-03-08 11:21 /hjl/sunwg_test11/attempt_201103070826_0018_r_000031_0

With the data spread across multiple files, multiple MapReduce tasks can now be launched:
when you run operations against this table, you will see the system start 32 map tasks.
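
Bucketing also pays off for sampling: because each bucket is a separate file, Hive can read a single bucket instead of scanning the whole table. A hedged example using the standard TABLESAMPLE clause on the bucketing column:

-- Scan only bucket 1 of 32 (roughly 1/32 of the rows), pruned to one file
SELECT * FROM sunwg_test11 TABLESAMPLE(BUCKET 1 OUT OF 32 ON id) s;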

Source: http://blog.itpub.net/26613085/viewspace-1130855/
