Disabling Hive's automatic map join

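Hive's map join loads the small table into memory and joins the other table against it, which can greatly speed up query execution. Before Hive 0.11, you had to request this optimization explicitly with the MAPJOIN hint. Since Hive 0.11 it is enabled by default, so the hint is no longer needed: Hive converts an ordinary JOIN into a map join whenever it judges the conversion appropriate. In practice I ran into the following problem: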

Launching Job 2 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Selecting local mode for task: Stage-10
Job running in-process (local Hadoop)
2021-02-02 09:34:02,323 Stage-10 map = 0%,  reduce = 0%
2021-02-02 09:34:04,325 Stage-10 map = 100%,  reduce = 0%
Ended Job = job_local1553976964_0023
Stage-11 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/software_tools/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/software_tools/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2021-02-02 09:34:12	Starting to launch local task to process map join;	maximum memory = 477626368
2021-02-02 09:34:14	Processing rows:	200000	Hashtable size:	199999	Memory usage:	135134360	percentage:	0.283
2021-02-02 09:34:14	Processing rows:	300000	Hashtable size:	299999	Memory usage:	177934256	percentage:	0.373
2021-02-02 09:34:14	Dump the side-table for tag: 1 with group count: 336680 into file: file:/tmp/root/310054d9-60d9-49ca-afd8-39aa8052b6e2/hive_2021-02-02_09-33-49_434_7693544935790785889-1/-local-10005/HashTable-Stage-8/MapJoin-mapfile131--.hashtable
2021-02-02 09:34:16	Uploaded 1 File to: file:/tmp/root/310054d9-60d9-49ca-afd8-39aa8052b6e2/hive_2021-02-02_09-33-49_434_7693544935790785889-1/-local-10005/HashTable-Stage-8/MapJoin-mapfile131--.hashtable (20807331 bytes)
2021-02-02 09:34:16	End of local task; Time Taken: 3.263 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 4 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Selecting local mode for task: Stage-8
Job running in-process (local Hadoop)
2021-02-02 09:34:19,175 Stage-8 map = 0%,  reduce = 0%
2021-02-02 09:35:20,308 Stage-8 map = 0%,  reduce = 0%
2021-02-02 09:36:21,223 Stage-8 map = 0%,  reduce = 0%
2021-02-02 09:37:21,310 Stage-8 map = 0%,  reduce = 0%
2021-02-02 09:38:22,568 Stage-8 map = 0%,  reduce = 0%
2021-02-02 09:39:23,080 Stage-8 map = 0%,  reduce = 0%
2021-02-02 09:40:24,211 Stage-8 map = 0%,  reduce = 0%
2021-02-02 09:41:25,590 Stage-8 map = 0%,  reduce = 0%
2021-02-02 09:42:25,741 Stage-8 map = 0%,  reduce = 0%
2021-02-02 09:43:25,911 Stage-8 map = 0%,  reduce = 0%
2021-02-02 09:44:26,041 Stage-8 map = 0%,  reduce = 0%
2021-02-02 09:45:26,746 Stage-8 map = 0%,  reduce = 0%
2021-02-02 09:46:28,234 Stage-8 map = 0%,  reduce = 0%

The log points to insufficient memory while the hash table is being built:
2021-02-02 09:34:14 Processing rows: 200000 Hashtable size: 199999 Memory usage: 135134360 percentage: 0.283
2021-02-02 09:34:14 Processing rows: 300000 Hashtable size: 299999 Memory usage: 177934256 percentage: 0.373
and it keeps printing
Stage-8 map = 0%, reduce = 0%
Stage-8 map = 0%, reduce = 0%
The job hangs and makes no progress.
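
Before changing anything, it can help to look at the settings that govern this behavior. The sketch below only reads values and changes nothing. In the log above, the `percentage` column matches memory usage divided by maximum memory (135134360 / 477626368 ≈ 0.283). I am assuming that `hive.mapjoin.localtask.max.memory.usage` is the limit it is compared against, and which of these settings matters for your Hive version is also an assumption to verify.

```sql
-- A bare "set <name>" prints the current value without changing it.
-- Is automatic conversion of joins to map joins enabled?
set hive.auto.convert.join;
-- Size threshold (bytes) below which a table counts as "small".
set hive.mapjoin.smalltable.filesize;
-- Fraction of memory the local hash-table task may use before it aborts.
set hive.mapjoin.localtask.max.memory.usage;
```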

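Solution:
Disable map join
set hive.auto.convert.join=false; (turns off automatic conversion to MAPJOIN)
set hive.ignore.mapjoin.hint=false; (do not ignore the MAPJOIN hint; the default is to ignore it, and this line is optional)
"Not ignoring the MAPJOIN hint" concerns hand-written MAPJOIN statements such as
select /*+ MAPJOIN(smallTableTwo) */ …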
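
If only one query is affected, the setting does not have to stay off for the whole session. A minimal sketch, assuming hypothetical tables `big_table` and `small_table` in place of the real ones: turn the conversion off just before the problematic join and restore it afterwards.

```sql
-- Hypothetical tables: big_table and small_table stand in for the real ones.
-- Turn off automatic map join conversion for the problematic query only.
set hive.auto.convert.join=false;

SELECT b.id, s.name
FROM big_table b
JOIN small_table s ON b.id = s.id;

-- Restore automatic conversion for the rest of the session.
set hive.auto.convert.join=true;
```

With the conversion off, the join runs as a regular shuffle join with a reduce phase. This is usually slower for small tables, but it no longer depends on fitting a hash table in memory.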

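Map join in Hive is an optimization that mitigates data skew and improves query performance for joins. It completes the join in the map phase and skips the reduce phase, which speeds up the query.

### How to enable map join

1. **Automatic map join**: Hive can choose a map join automatically based on the query. Enable it with the following parameters:

   ```sql
   set hive.auto.convert.join=true;
   set hive.auto.convert.join.noconditionaltask=true;
   -- Maximum combined size, in bytes, of the small tables
   set hive.auto.convert.join.noconditionaltask.size=10000000;
   ```

2. **Manual map join**: Use a hint in the SQL query to request a map join explicitly. Hive honors the hint only when `hive.ignore.mapjoin.hint` is `false`. The syntax is:

   ```sql
   SELECT /*+ MAPJOIN(table_name) */ columns
   FROM table1
   JOIN table2 ON table1.column = table2.column;
   ```

### Notes

- **Small-table size**: Map join suits joins between a small table and a large table. With `hive.auto.convert.join.noconditionaltask=true`, the combined size of the small tables must be below the value of `hive.auto.convert.join.noconditionaltask.size`.
- **Resource usage**: Map join loads the small table into memory, so make sure the cluster has enough memory for it.

### Example

Suppose we have two tables, `orders` and `customers`, where `customers` is the small table. We can use a map join to optimize the join:

```sql
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;

SELECT /*+ MAPJOIN(customers) */ o.order_id, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
```

In this example `customers` is marked as the small table for the map join, and Hive completes the join in the map phase.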
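
To check whether Hive actually converts a join, look at the query plan with `EXPLAIN`. The sketch below reuses the `orders` and `customers` tables from the example. On the MapReduce engine a converted join typically shows up as a `Map Join Operator` in a map-only stage, and a regular join as a `Join Operator` under a reduce operator tree. The exact plan text varies by Hive version and execution engine, so treat these operator names as a guide rather than a guarantee.

```sql
-- Uses the orders/customers tables from the example.
set hive.auto.convert.join=true;

-- Print the plan instead of running the query.
EXPLAIN
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
```

Running the same `EXPLAIN` after `set hive.auto.convert.join=false;` is a quick way to confirm that the setting took effect before launching the real job.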