Data Skew and Optimization in Hive

qq_39682761

Article link: https://blog.youkuaiyun.com/qq_39682761/article/details/88649933

Hive is not afraid of large data volumes; what it fears is data skew.

1. Data skew

Scenarios in Hive that tend to produce data skew:
1) group by used without an aggregate function
2) reduce join
3) count(distinct)
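
As a rough illustration of the count(distinct) scenario, the sketch below contrasts the plain form with a common two-stage rewrite. It assumes a table `log` with a `userid` column (the names used later in this article); whether the rewrite pays off depends on data volume and Hive version.

```sql
-- Plain form: a global COUNT(DISTINCT ...) is typically handled by a single reducer.
SELECT COUNT(DISTINCT userid) FROM log;

-- Two-stage rewrite: the inner GROUP BY de-duplicates in parallel across reducers,
-- and the outer query only counts the de-duplicated rows.
-- NULLs are filtered out because COUNT(DISTINCT ...) ignores them.
SELECT COUNT(*)
FROM (
  SELECT userid
  FROM log
  WHERE userid IS NOT NULL
  GROUP BY userid
) t;
```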

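2. Hive optimization:

1 Good data model design
2 Resolve data skew
	Join on the map side to reduce the amount of data reaching the reduce tasks
	Use aggregate functions together with group by
	Keep null values out of joins
3 Reduce the number of jobs
4 Set a reasonable number of map and reduce tasks; when specifying it manually, generally do not exceed about 95% of the number of DataNodes
5 Merge small files to reduce the number of map tasks
6 Optimizing the workflow as a whole matters more than optimizing a single job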
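
The "keep null values out of joins" item can be sketched roughly as follows. This is an illustration rather than the author's code: it reuses the `log` and `user` tables that appear later in the article, and the second query assumes `userid` is a STRING and that no real userid starts with `null_`.

```sql
-- Inner join: rows with a NULL key can never match, so drop them before the shuffle.
SELECT a.*, b.*
FROM (SELECT * FROM log WHERE userid IS NOT NULL) a
JOIN `user` b ON a.userid = b.userid;

-- Left outer join that must keep the NULL-key rows: replace each NULL key with a
-- random, non-matching value so those rows spread across reducers instead of
-- piling up on one.
SELECT a.*, b.*
FROM log a
LEFT OUTER JOIN `user` b
  ON CASE WHEN a.userid IS NULL
          THEN concat('null_', cast(rand() AS STRING))
          ELSE a.userid
     END = b.userid;
```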

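1) Choosing a sort:

order by        global sort; relatively poor performance
sort by         local sort (within each reducer)
distribute by   distributes rows across reducers (bucketing | partitioning)
cluster by      distribute + sort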
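
A minimal sketch of the four clauses side by side, using the `stu` table and `age` column that appear elsewhere in this article:

```sql
-- Global order: all rows pass through a single reducer.
SELECT * FROM stu ORDER BY age;

-- Each reducer sorts only its own rows; the overall output is not globally sorted.
SELECT * FROM stu SORT BY age;

-- Rows with the same age go to the same reducer and are sorted (descending) within it.
SELECT * FROM stu DISTRIBUTE BY age SORT BY age DESC;

-- Shorthand for DISTRIBUTE BY age SORT BY age (ascending only).
SELECT * FROM stu CLUSTER BY age;
```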

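2) Avoid Cartesian products where possible

A Cartesian product in Hive is a join without a join key. When it is translated to MapReduce, there is no join key to serve as the map output key.
	select * from a,b; select * from a join b;
	If the requirement really calls for a Cartesian product, it is best to define a join key yourself.
		Table 1: 10 million rows
		Table 2: 200 thousand rows    join key range 1-1000
		1) Add an extra, randomly generated key to the small table to serve as the join key.
		Replicate the large table: make as many copies as the small table has distinct join keys, and give each copy of the large table one of the small table's join keys.
		2) Generate a random join key for the large table.
		Replicate the small table: one copy for each distinct join key of the large table.
		The second approach is more reasonable:
			1) reduce-task parallelism is higher, with more partitions
			2) the amount of data in each reduce task is not too large
		select cast(rand()*1000 as int),* from stu;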
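
The second approach might look roughly like the sketch below. `big_t` and `small_t` are placeholder table names, and the `posexplode(split(space(999), ' '))` expression is just one common way to generate the values 0 to 999 without a helper table.

```sql
SELECT b.*, s.*
FROM (
  -- Large table: one random key in 0..999 per row.
  SELECT t.*, CAST(rand() * 1000 AS INT) AS salt
  FROM big_t t
) b
JOIN (
  -- Small table: replicated once for each key value 0..999.
  SELECT t.*, n.salt
  FROM small_t t
  LATERAL VIEW posexplode(split(space(999), ' ')) n AS salt, unused
) s
ON b.salt = s.salt;
```

Every row of the large table still meets every row of the small table, but the work is spread over up to 1000 join keys instead of one.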

3) Use join instead of in/exists

	Hive 1.2.1 / 1.2.2: not supported
	Hive 2.3.2: supported, but performance is poor  (map key)
	select * from stu where age in(18,20,30,34);
	map key: null   value: all rows of the table

Efficient alternatives:
		left semi join
		inner join
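
A small sketch of both alternatives. `wanted_ages` is a hypothetical one-column table holding the ages to keep; it is not part of the original article.

```sql
-- IN with a subquery.
SELECT s.* FROM stu s WHERE s.age IN (SELECT w.age FROM wanted_ages w);

-- LEFT SEMI JOIN: only columns of the left table may be selected,
-- and each left row is returned at most once.
SELECT s.*
FROM stu s
LEFT SEMI JOIN wanted_ages w ON s.age = w.age;

-- Inner join: de-duplicate the right side first so left rows are not multiplied.
SELECT s.*
FROM stu s
JOIN (SELECT DISTINCT age FROM wanted_ages) w ON s.age = w.age;
```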

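4) Multi-query | multi-insert

Prefer a multi-query over separate single queries where possible.
Data insertion: prefer a multi-insert over separate single inserts where possible.
This reduces the number of scans over the table and improves performance.
	from <table>
	insert into table ... where 
	insert into ......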

Using multi group by:

from area 
 insert overwrite table temp1
  	select provice,city,county,count(rainfall) where data="2018-09-02" group by provice,city,county
 insert overwrite table temp2
  	select provice,count(rainfall) where data="2018-09-02" group by provice;

The following parameter must be configured before using multi group by:

<property>
    <name>hive.multigroupby.singlemr</name>
    <value>true</value>
</property>

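5) JVM reuse: configure how many tasks may run in a single container

map task | reduce task ---> YarnChild ---> container ---> JVM
By default a container runs only one map task or reduce task and is then destroyed.
Example: 4 nodes, 1300 MB, 10 map tasks
Several map tasks are started on a single node, and resources run short.
If containers can be reused, performance improves.

mapred.job.reuse.jvm.num.tasks=1
This parameter sets how many tasks can run one after another in a single container. The default is 1, meaning a container runs only one task.
mapred.job.reuse.jvm.num.tasks=3;
A container can then run 3 map or reduce tasks, which saves startup and teardown time and improves performance.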

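6) Small-file merging: merge multiple files logically

CombineTextInputFormat is an input format that merges files logically: it does not merge them physically, but feeds several small files to a map task as a single split.
Hive 2 enables small-file merging by default.
-- Merge small files at the end of a map-only job; on by default.
set hive.merge.mapfiles = true;
-- When true, merge small files at the end of a MapReduce job; off by default.
set hive.merge.mapredfiles = false;
-- Size of the merged files: at most about 256 MB (256*1000*1000 bytes).
set hive.merge.size.per.task = 256000000;
-- Combine small files before the map phase. This is Hive's default input format, so small files are combined on the map input side by default.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;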
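
Two related settings are often tuned together with the ones above. The values below are illustrative assumptions, not recommendations from the original article.

```sql
-- Upper bound on the size of one combined input split, in bytes (about 256 MB here).
SET mapreduce.input.fileinputformat.split.maxsize=256000000;

-- When the average output file size of a job is below this value (about 16 MB here),
-- Hive launches an extra job to merge the output files, provided the
-- hive.merge.mapfiles / hive.merge.mapredfiles switches allow it.
SET hive.merge.smallfiles.avgsize=16000000;
```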

7) Number of reduce tasks

Rule of thumb: no more than (number of DataNodes) * 0.95
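
As a worked example of this rule of thumb, a cluster with 20 DataNodes (an assumed figure) gives 20 * 0.95 = 19 reduce tasks. A sketch of the two usual ways to apply it:

```sql
-- Option 1: fix the number of reduce tasks explicitly.
SET mapreduce.job.reduces=19;

-- Option 2: let Hive estimate the number from the input size, capped at 19.
-- (-1 hands the decision back to Hive.)
SET mapreduce.job.reduces=-1;
SET hive.exec.reducers.bytes.per.reducer=256000000;
SET hive.exec.reducers.max=19;
```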

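8) Design buckets sensibly

1) Better sampling performance: rows are hashed into buckets, so the data in a single bucket can be used directly as a sample.
2) More efficient query processing.
Joining two tables that are bucketed on the same columns (including the join column) can be implemented efficiently as a map-side join. For a JOIN where the two tables share a column and both are bucketed on it, only the buckets holding the same column values need to be joined with each other, which greatly reduces the amount of data joined.

How to read the data of each bucket
	tablesample(BUCKET x OUT OF y);
	x: which bucket cluster to read
	y: the number of bucket clusters
	bucket cluster: a set made up of half a bucket, one bucket, or several buckets
	y=1   1 bucket cluster; the data of all 3 buckets of the table is in that one cluster
	y=2   2 bucket clusters; each cluster holds 1.5 buckets
	y=3   3 bucket clusters; each cluster holds the data of 1 bucket
	x=2   read the 2nd bucket cluster
	select * from stu_buk tablesample(bucket 1 out of 3);
	y=6   6 bucket clusters; each cluster holds 0.5 bucket
	x=2   read the 2nd bucket cluster
		Is that the second half of the first bucket?   age%3=0
	select * from stu_buk tablesample(bucket 2 out of 6);
	Bucket clusters are assigned round-robin: the buckets are taken in order, one piece from each bucket per round.
	
	select * from stu_buk tablesample(bucket 4 out of 6);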
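
A minimal end-to-end sketch of a 3-bucket table and two of the samples discussed above. The `id` and `name` columns are assumptions made for illustration; only `age` and the table names `stu` and `stu_buk` come from the article.

```sql
-- Table bucketed on age into 3 buckets.
CREATE TABLE stu_buk (id INT, name STRING, age INT)
CLUSTERED BY (age) INTO 3 BUCKETS;

-- On Hive 1.x, run "SET hive.enforce.bucketing=true;" before the insert
-- so that rows are actually written into 3 bucket files.
INSERT OVERWRITE TABLE stu_buk
SELECT id, name, age FROM stu;

-- One whole bucket: rows whose bucketing hash of age, modulo 3, is 0.
SELECT * FROM stu_buk TABLESAMPLE(BUCKET 1 OUT OF 3 ON age);

-- Half a bucket: rows whose bucketing hash of age, modulo 6, is 1.
SELECT * FROM stu_buk TABLESAMPLE(BUCKET 2 OUT OF 6 ON age);
```

Naming the bucketing column in the `ON` clause lets Hive read only the matching bucket files instead of scanning the whole table.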

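9) Design partitions sensibly

If a table holds a lot of data and queries usually filter on one particular column, create it as a partitioned table: this narrows the scan range and improves query performance.
select * from test where date=""; (a full table scan on a non-partitioned table)
Create a partitioned table:
With date as the partition column, data is stored separately by partition value, and each partition corresponds to one directory.
A table corresponds to a directory in HDFS; the table acts as the manager of that directory.
A query that filters on the partition column effectively queries the small table of a single partition, which improves query performance.
Common partition columns: time | region
Time: multi-level partitions /year/month/day/hour
Raw tables of log data are usually partitioned by time.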
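
A small sketch of a time-partitioned log table. The table name `access_log`, its columns, and the partition column `dt` are all hypothetical; `dt` is used instead of `date` to avoid clashing with the `date` keyword.

```sql
-- Each distinct dt value becomes its own directory under the table directory.
CREATE TABLE access_log (userid STRING, msg STRING)
PARTITIONED BY (dt STRING);

-- Register one day's partition.
ALTER TABLE access_log ADD IF NOT EXISTS PARTITION (dt = '2018-09-02');

-- Filtering on the partition column reads only that partition's directory.
SELECT * FROM access_log WHERE dt = '2018-09-02';

-- List the partitions that exist.
SHOW PARTITIONS access_log;
```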

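10) join: use mapjoin whenever possible

Table joins:
1) large * small (small * small): when the small table does not exceed 23.8 MB, Hive performs a map-side join by default
Parameters: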

	<property>
		<name>hive.auto.convert.join</name>
		<value>true</value>
		<description>Whether Hive enables the optimization about converting common join into mapjoin based on the input file size</description>
	</property>
Specifies whether mapjoin is enabled
	  <property>
			<name>hive.mapjoin.smalltable.filesize</name>
			<value>25000000</value>
			<description>
			  The threshold for the input file size of the small tables; if the file size is smaller 
			  than this threshold, it will try to convert the common join into map join
			</description>
	  </property>

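By default, the small table in a mapjoin must not exceed about 25 MB (25000000 bytes, roughly 23.8 MiB).
When the smaller table does not exceed 23.8 MB, a mapjoin is executed.

2) large * medium
The medium table exceeds 23.8 MB but still fits in memory; by default, however, a reduce join is executed (which is prone to data skew).
Force a mapjoin with the hint /*+mapjoin(medium table to be cached)*/
user: 1 TB, log: 300 MB

select  /*+mapjoin(a)*/  * from log a join user b on a.userid=b.userid;
What runs is still a mapjoin.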
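
In many Hive versions `hive.ignore.mapjoin.hint` defaults to true, in which case the hint above is silently ignored. A hedged sketch of two ways to get the map join anyway; the 400000000 threshold is an illustrative value, and either way the medium table has to fit in the memory of each map task.

```sql
-- Option 1: make Hive honor the MAPJOIN hint.
SET hive.ignore.mapjoin.hint=false;
SELECT /*+ MAPJOIN(a) */ a.*, b.*
FROM log a JOIN `user` b ON a.userid = b.userid;

-- Option 2: no hint; raise the automatic conversion threshold (in bytes)
-- above the size of the 300 MB medium table.
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=400000000;
SELECT a.*, b.*
FROM log a JOIN `user` b ON a.userid = b.userid;
```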

3) large * large
user: 1 TB, all users    10    effectively matching data: 2G 200M
log: one day of data, 10 GB
1) Slim down one of the tables
user: 1 TB, all users, userid
log: one day of data, 10 GB; userid has many repeated values (one user has 10 log records)
log table 50
The table to slim down is user: trim it using the valid userid values found in log.

Slimming down the user table:
1) Get the de-duplicated userid values from log
	select distinct userid from log where userid is not null;
2) Join user against that result to slim down user
	select 
	/*+mapjoin(a)*/b.* 
	from (select distinct userid from log where userid is not null) a join user b on a.userid=b.userid;
3) Do the actual join
	select 
	/*+mapjoin(c)*/* 
	from (
	select 
	/*+mapjoin(a)*/b.* 
	from (select distinct userid from log where userid is not null) a join user b on a.userid=b.userid
	) c join log d on c.userid=d.userid;

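Notes:
1) Hive supports joins on multiple conditions combined with and; conditions combined with or are not supported. For example, this is not supported:
select * from a join b on a.id=b.id or a.name=b.name;
2) Prefer equi-joins in Hive
Non-equi joins are not supported.
3) Rule of thumb for joins in Hive: put the small table on the left and the large table on the right, because Hive buffers the earlier tables and streams the last one
a join b on … join c on …
a:1000
b:10000
c:100000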

11) Use group by together with aggregate functions

Aggregation is performed on the map side by default.
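
The settings behind this behavior, sketched with the `area` table from the multi group by example above. Whether `hive.groupby.skewindata` helps depends on how skewed the grouping keys actually are.

```sql
-- Partial aggregation on the map side (on by default).
SET hive.map.aggr=true;

-- For skewed grouping keys: plan two jobs. The first spreads rows randomly
-- across reducers and pre-aggregates; the second aggregates by the real key.
SET hive.groupby.skewindata=true;

SELECT provice, count(rainfall)
FROM area
GROUP BY provice;
```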

12) Use file storage formats appropriately

When creating a table:  stored as textfile|sequencefile|rcfile
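
A brief sketch of creating a table in one of these formats and filling it from an existing text table. The table name `area_rc` and the column types are assumptions; the column names follow the `area` example above.

```sql
-- Columnar RCFile copy of the area table.
CREATE TABLE area_rc (
  provice  STRING,
  city     STRING,
  county   STRING,
  rainfall DOUBLE
)
STORED AS RCFILE;

-- Populate it with INSERT ... SELECT so Hive rewrites the rows in the new format;
-- LOAD DATA only moves files and does not convert them.
INSERT OVERWRITE TABLE area_rc
SELECT provice, city, county, rainfall FROM area;
```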