20 Enterprise-Level Tuning 2: Table Optimization

License: CC 4.0 BY-SA

Tags: #hadoop #hive #enterprise-level tuning #table optimization

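I. Joining a Small Table with a Large Table

1. Definition
	Put the table whose keys are relatively scattered and whose data volume is small on the left side of the join; this effectively reduces the chance of out-of-memory errors.
	Going a step further, a map join can load a small dimension table (fewer than 1,000 records) into memory first, so that the join completes on the map side with no reduce phase.
	Testing in practice shows that newer versions of Hive already optimize both small-table JOIN large-table and large-table JOIN small-table, so placing the small table on the left or on the right no longer makes a noticeable difference.

2. Example
	(1) Requirement
		Compare the efficiency of large-table JOIN small-table with that of small-table JOIN large-table.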
		
	(2) Statements to create the large table, the small table, and the table that holds the join result
			-- create the large table
		create table bigtable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string)
		row format delimited fields terminated by '\t';
			-- create the small table
		create table smalltable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string)
		row format delimited fields terminated by '\t';
			-- create the table that holds the join result
		create table jointable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string)
		row format delimited fields terminated by '\t';
		
	(3) Load data into the large table and the small table
		load data local inpath '/root/hivedata/bigtable' into table bigtable;
		load data local inpath '/root/hivedata/smalltable' into table smalltable;
		
	(4) Disable automatic map join conversion (enabled by default)
		set hive.auto.convert.join = false;
	
	(5) Run the small-table JOIN large-table statement
		insert overwrite table jointable
		select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
		from smalltable s
		left join bigtable  b
		on b.id = s.id;
		
	(6) Run the large-table JOIN small-table statement
		insert overwrite table jointable
		select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
		from bigtable  b
		left join smalltable  s
		on s.id = b.id;
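
One way to check the claim that table order no longer matters is to compare the query plans as well as the run times. The sketch below assumes the tables above exist and that automatic map join conversion is still disabled from step (4). EXPLAIN only prints the plan; it does not run the job.

```sql
-- Assumes hive.auto.convert.join = false is still in effect.

-- Plan for small-table JOIN large-table
explain
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from smalltable s
left join bigtable b
on b.id = s.id;

-- Plan for large-table JOIN small-table
explain
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable b
left join smalltable s
on s.id = b.id;
```

With automatic conversion off, both plans would be expected to place the join in a reduce stage. The exact plan text depends on the Hive version and execution engine.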

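II. Joining a Large Table with a Large Table

1. Filtering null keys
	A join sometimes times out because certain keys carry too many rows. All rows with the same key are sent to the same reducer, which then runs out of memory.
	When the rows behind such a key are abnormal data, filter them out in the SQL statement.
	For example, when the key column is null, proceed as follows:
		(1) Configure the history server (mapred-site.xml)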
			<property>
				<name>mapreduce.jobhistory.address</name>
				<value>node01:10020</value>
			</property>
			<property>
				<name>mapreduce.jobhistory.webapp.address</name>
				<value>node01:19888</value>
			</property>
		
		(2) Start the history server
			sbin/mr-jobhistory-daemon.sh start historyserver
				View the job history at "http://node01:19888/jobhistory"
			
		(3) Create the original data table, the null-id table, and the table that holds the join result
				-- create the original table
			create table ori(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string)
			row format delimited fields terminated by '\t';
				-- create the null-id table
			create table nullidtable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string)
			row format delimited fields terminated by '\t';
				-- create the table that holds the join result (skipped if it already exists from section I)
			create table if not exists jointable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string)
			row format delimited fields terminated by '\t';
			
		(4) Load the original data and the null-id data into their tables
			load data local inpath '/root/hivedata/ori' into table ori;
			load data local inpath '/root/hivedata/nullid' into table nullidtable;
		
		(5) Test without filtering null ids
			insert overwrite table jointable
				select n.* from nullidtable n left join ori o on n.id = o.id;
		
		(6) Test with null ids filtered out
			insert overwrite table jointable
				select n.* from (select * from nullidtable where id is not null )
					n  left join ori o on n.id = o.id;
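
Before filtering, it can help to confirm that null ids really are what dominates the join. The following is a minimal sketch against the nullidtable created above; the column aliases are illustrative only.

```sql
-- How many rows carry a null join key?
select count(*) as null_id_rows
from nullidtable
where id is null;

-- Which keys carry the most rows? (null shows up as its own group)
select id, count(*) as cnt
from nullidtable
group by id
order by cnt desc
limit 10;
```

If a single key accounts for a large share of the rows, that key is the likely source of the overloaded reducer.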
				
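2. Converting null keys
	Sometimes a null key accounts for many rows, but those rows are not abnormal data and must appear in the join result.
	In that case, assign a random value to the null key column of the table (nullidtable in the examples below), so that the rows are spread randomly and evenly across different reducers.
	
	2.1 Example 1: null values not randomly distributed
		(1) Set the number of reducers to 5
			set mapreduce.job.reduces=5;
		(2) JOIN the two tables
			insert overwrite table jointable
				select n.* from nullidtable n left join ori b on n.id = b.id;
		(3) Result
			Data skew appears: some reducers consume far more resources than the others.
			
	2.2 Example 2: null values randomly distributed
		(1) Set the number of reducers to 5
			set mapreduce.job.reduces = 5;
		(2) JOIN the two tables
			insert overwrite table jointable
			select n.* from nullidtable n full join ori o on
			case when n.id is null then concat('hive', rand()) else cast(n.id as string) end = cast(o.id as string);
		(3) Result
			The data skew is gone and resource consumption is balanced across the reducers.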
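
Example 1 uses a left join while Example 2 uses a full join, so the two runs do not produce the same rows. For a like-for-like comparison, the same random-key idea can be applied to a left join. This is a sketch under the assumption that ids are numeric, so a value starting with 'hive' can never match a real id; both sides are cast to string so the comparison is between strings.

```sql
set mapreduce.job.reduces = 5;

insert overwrite table jointable
select n.*
from nullidtable n
left join ori o
on case when n.id is null then concat('hive', rand())  -- random key for null ids
        else cast(n.id as string)
   end = cast(o.id as string);
```

Rows with a null id still appear in jointable with id left as null; only the key used to route them to reducers changes.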

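III. MapJoin

1. Definition
	If MapJoin is not specified, or the conditions for MapJoin are not met, the Hive parser converts the join into a Common Join,
	that is, the join is completed in the Reduce phase, where data skew occurs easily.
	With MapJoin, the small table is loaded entirely into memory and the join is done on the map side, which avoids reducer processing.
			
2. Parameters for enabling MapJoin
	(1) Enable automatic MapJoin selection
		set hive.auto.convert.join = true; -- default is true
	
	(2) Threshold that separates small tables from large ones (tables under 25 MB are treated as small by default)
		set hive.mapjoin.smalltable.filesize=25000000;
	
3. Example
	(1) Enable the mapjoin feature
		set hive.auto.convert.join = true; -- default is true

	(2) Run the small-table JOIN large-table statement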
		insert overwrite table jointable
		select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
		from smalltable s
		join bigtable  b
		on s.id = b.id;
	
	(3) Run the large-table JOIN small-table statement
		insert overwrite table jointable
		select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
		from bigtable  b
		join smalltable  s
		on s.id = b.id;
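
Besides automatic conversion, Hive also accepts an explicit MAPJOIN hint naming the table to load into memory. The sketch below is an assumption-laden illustration: many Hive versions ignore the hint by default (hive.ignore.mapjoin.hint = true), so that setting is switched off first. Whether this is needed depends on the version in use.

```sql
-- Assumption: the hint is ignored unless this is set to false.
set hive.ignore.mapjoin.hint = false;

insert overwrite table jointable
select /*+ MAPJOIN(s) */
       b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable b
join smalltable s
on s.id = b.id;
```

The hint names the alias of the small table (s), which is the side that must fit in memory.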

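IV. GroupBy

1. Definition
	By default, all rows with the same key are sent from the Map phase to a single reducer, so a key with too many rows causes skew.
	Not every aggregation has to be completed on the Reduce side. Many aggregations can be partially computed on the Map side first, with the Reduce side producing the final result.
	
2. Parameters for enabling map-side aggregation
	(1) Whether to aggregate on the Map side (default: true)
		set hive.map.aggr = true;
		
	(2) Number of rows after which the map-side aggregation is checked
		set hive.groupby.mapaggr.checkinterval = 100000;
		
	(3) Load balancing when the data is skewed (default: false)
		set hive.groupby.skewindata = true;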
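
When hive.groupby.skewindata is true, Hive plans the aggregation as two MR jobs. The first spreads map output randomly across reducers for partial aggregation, and the second groups the partial results by key to produce the final values. A minimal sketch, assuming the bigtable from section I and that a few keywords are far more frequent than the rest:

```sql
set hive.map.aggr = true;
set hive.groupby.skewindata = true;

-- Row count per keyword; very frequent keywords are where skew would come from
select keyword, count(*) as cnt
from bigtable
group by keyword;
```

The extra job adds overhead, so this setting is mainly worth trying when a plain GROUP BY shows one reducer running much longer than the others.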

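V. Count(Distinct) Deduplicated Counting

1. Definition
	With a small data volume this does not matter. With a large data volume, a COUNT DISTINCT has to be completed by a single Reduce Task,
	and the amount of data that one reducer must process is so large that the whole job struggles to finish.
	COUNT DISTINCT is therefore usually replaced by a GROUP BY followed by a COUNT.
	
2. Example
	(1) Create a large table
		create table if not exists bigtable(id bigint, time bigint, uid string, keyword
		string, url_rank int, click_num int, click_url string) row format delimited
		fields terminated by '\t';
	
	(2) Load the data
		load data local inpath '/root/hivedata/bigtable' into table bigtable;
	
	(3) Set the number of reducers to 5
		set mapreduce.job.reduces = 5;
	
	(4) Run the distinct-id count
		select count(distinct id) from bigtable;

	(5) Deduplicate ids with GROUP BY
		select count(id) from (select id from bigtable group by id) a;
	
	(6) Summary
		This approach takes one extra job, but with a large data volume it is well worth it.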
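
The same rewrite carries over to distinct counts per group. The sketch below counts distinct uid values per keyword in bigtable, first directly and then with the deduplication moved into a subquery. The alias uv is illustrative, and whether the rewritten form is faster depends on the data and the Hive version.

```sql
-- Direct form: distinct users per keyword
select keyword, count(distinct uid) as uv
from bigtable
group by keyword;

-- Rewritten form: deduplicate (keyword, uid) pairs first, then count
select keyword, count(uid) as uv
from (
    select keyword, uid
    from bigtable
    group by keyword, uid
) t
group by keyword;
```

Both forms ignore null uid values, because count(uid) skips nulls just as count(distinct uid) does.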

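VI. Cartesian Product

Avoid Cartesian products wherever possible. When a join has no on condition, or an invalid one,
Hive can only use a single reducer to compute the Cartesian product.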
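
To illustrate the difference, the sketch below contrasts a join with no condition against an equi-join on id, using the bigtable and smalltable from section I. The strict-check setting is an assumption about the Hive version in use: newer releases provide hive.strict.checks.cartesian.product, while older ones rely on hive.mapred.mode = strict.

```sql
-- Assumption: available in newer Hive releases; rejects joins that
-- would produce a Cartesian product.
set hive.strict.checks.cartesian.product = true;

-- No join condition: every bigtable row pairs with every smalltable row.
-- Left commented out on purpose.
-- select b.id, s.id from bigtable b join smalltable s;

-- Equi-join on the key: rows are matched by id and can be spread over reducers
select b.id, s.keyword
from bigtable b
join smalltable s
on b.id = s.id;
```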

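VII. Row and Column Filtering

1. Definition
	Column handling: in SELECT, take only the columns you need and avoid SELECT *; if the table is partitioned, use partition filters wherever possible.
	Row handling: in partition pruning, when an outer join is used and the filter on the secondary table is written after WHERE, the tables are joined in full first and filtered afterwards.

2. Row handling example
	(1) Join the two tables first, then filter with a where condition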
		select o.id from bigtable b
		join ori o on o.id = b.id
		where o.id <= 10;

	(2) Filter in a subquery first, then join the tables
		select b.id from bigtable b
		join (select id from ori where id <= 10 ) o on b.id = o.id;
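
The definition above speaks of outer joins, while the two statements use an inner join. A sketch of the outer-join case with the same tables follows. Note that the two forms do not return the same rows, so the choice also depends on which result is wanted.

```sql
-- Filter written after WHERE: applied to the joined rows. Because o.id is
-- null for unmatched rows, the filter also drops them.
select b.id
from bigtable b
left join ori o on o.id = b.id
where o.id <= 10;

-- Filter applied inside a subquery: ori is reduced before the join, and
-- unmatched bigtable rows are kept with null on the ori side.
select b.id
from bigtable b
left join (select id from ori where id <= 10) o on b.id = o.id;
```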

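VIII. Dynamic Partition Tuning

1. Parameters for enabling dynamic partitioning
	(1) Enable dynamic partitioning (default: true, enabled)
		set hive.exec.dynamic.partition=true;
	
	(2) Set non-strict mode (the dynamic partition mode defaults to strict, which requires at least one partition column to be static; nonstrict allows every partition column to be dynamic)
		set hive.exec.dynamic.partition.mode=nonstrict;
	
	(3) Maximum total number of dynamic partitions that can be created across all nodes running MR
		set hive.exec.max.dynamic.partitions=1000;
	
	(4) Maximum number of dynamic partitions that can be created on each node running MR. Set this according to the actual data. For example, if the source data covers one year, the day column has 365 distinct values, so this parameter must be set above 365; the default of 100 would cause an error.
		set hive.exec.max.dynamic.partitions.pernode=100;
	
	(5) Maximum number of HDFS files that can be created in the whole MR job
		set hive.exec.max.created.files=100000;
	
	(6) Whether to throw an exception when an empty partition is generated. This usually does not need to be set.
		set hive.error.on.empty.partition=false;

2. Example
	(1) Requirement:
		Insert the ori data (held here in the partitioned table ori_partitioned) into the matching partitions of the target table ori_partitioned_target according to time (for example 20111230000008).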
		
	(2) Create the partitioned table
		create table ori_partitioned(id bigint, time bigint, uid string, keyword string,
		url_rank int, click_num int, click_url string)
		partitioned by (p_time bigint)
		row format delimited fields terminated by '\t';
	
	(3) Load data into the partitioned table
		load data local inpath '/home/atguigu/ds1' into table ori_partitioned partition(p_time='20111230000010') ;
		load data local inpath '/home/atguigu/ds2' into table ori_partitioned partition(p_time='20111230000011') ;
	
	(4) Create the target partitioned table
		create table ori_partitioned_target(id bigint, time bigint, uid string,
		keyword string, url_rank int, click_num int, click_url string)
		PARTITIONED BY (p_time STRING) row format delimited fields terminated by '\t';
	
	(5) Configure dynamic partitioning and run the insert
		set hive.exec.dynamic.partition = true;
		set hive.exec.dynamic.partition.mode = nonstrict;
		set hive.exec.max.dynamic.partitions = 1000;
		set hive.exec.max.dynamic.partitions.pernode = 100;
		set hive.exec.max.created.files = 100000;
		set hive.error.on.empty.partition = false;
			insert overwrite table ori_partitioned_target partition (p_time)
				select id, time, uid, keyword, url_rank, click_num, click_url, p_time from ori_partitioned;
		
	(6) Check the partitions of the target table
		show partitions ori_partitioned_target;
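
The per-node limit discussed above matters most when the partition value is derived from the data, for example one partition per day. The sketch below is hypothetical: the table ori_by_day and the column p_day are names introduced here for illustration, and it assumes the time column in ori holds values shaped like 20111230000008, so that the first eight digits are the date.

```sql
-- Hypothetical target table, partitioned by day (yyyyMMdd)
create table if not exists ori_by_day(id bigint, time bigint, uid string, keyword string,
url_rank int, click_num int, click_url string)
partitioned by (p_day string)
row format delimited fields terminated by '\t';

set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
-- A full year of days would exceed the per-node default of 100
set hive.exec.max.dynamic.partitions.pernode = 400;

-- The dynamic partition column must come last in the select list
insert overwrite table ori_by_day partition (p_day)
select id, time, uid, keyword, url_rank, click_num, click_url,
       substr(cast(time as string), 1, 8) as p_day
from ori;

show partitions ori_by_day;
```

The value 400 is only an example chosen to sit above 365; the total limit hive.exec.max.dynamic.partitions must also stay at or above the number of partitions the job creates.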