Hive Job Optimization Basics

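1. Use count(distinct) sparingly;

  select count(distinct cookie_id) from lxw_t1;
  Why it is slow: the whole distinct count is handled by a single reducer;
  Optimized form: select count(1) from (select cookie_id from lxw_t1 where cookie_id is not null group by cookie_id) x;
  The "is not null" filter keeps the result identical to count(distinct), which ignores NULL, whereas group by would emit an extra NULL group;
  Setting mapred.reduce.tasks so that the inner group by runs on several reducers gives an even better result;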
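  As a minimal sketch of the rewrite combined with the reducer setting: the value 20 below is an arbitrary assumption, not a recommendation, and on newer Hadoop versions the same property is named mapreduce.job.reduces.

```sql
-- Assumed value; tune it to the cluster and the data volume.
SET mapred.reduce.tasks=20;

-- The inner GROUP BY deduplicates cookie_id across many reducers;
-- the outer COUNT only has to count the already-distinct rows.
SELECT COUNT(1)
FROM (
  SELECT cookie_id
  FROM lxw_t1
  WHERE cookie_id IS NOT NULL
  GROUP BY cookie_id
) x;
```

  The outer count still runs on one reducer, but by then it only receives one row per distinct cookie_id.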

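2. When joining tables, put filter conditions in the right place;

  select a.id,b.name
  from lxw_t1 a
  left outer join lxw_t2 b
  on a.id = b.id
  where a.pt = '2012-11-22';
  Why it is slow: written this way, the join can run first and the filter is only applied afterwards (unless the optimizer pushes the predicate down)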

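  Better form: filter lxw_t1 in a subquery, so that only the needed partition reaches the join:
  select a.id,b.name
  from (select id from lxw_t1 where pt = '2012-11-22') a
  left outer join lxw_t2 b
  on a.id = b.id;
  Note that simply moving a.pt = '2012-11-22' into the on clause is not equivalent for a left outer join: rows of lxw_t1 from other partitions would still be returned, with b.name set to NULL.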
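  To see whether the filter on pt is applied before or after the join on a given cluster, the query plan can be inspected. This is only a sketch using the standard EXPLAIN statement; the plan text differs between Hive versions, so no output is shown here.

```sql
-- Prints the execution plan without running the query.
-- In the plan, check whether the pt predicate (or partition pruning)
-- appears on the lxw_t1 table scan, before the join operator.
EXPLAIN
SELECT a.id, b.name
FROM lxw_t1 a
LEFT OUTER JOIN lxw_t2 b
  ON a.id = b.id
WHERE a.pt = '2012-11-22';
```

  If the predicate already sits on the table scan, the optimizer is pushing it down and both ways of writing the query should cost about the same.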

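3. Guard against data skew;

  A great deal has been written about data skew and it is easy to find online;
  Symptom: the job's reduce progress stays at 99% or 100% for a long time
  Cause: because of special values or the nature of the business data, a single reduce node receives a very large volume of rows for its keys.
  Remedies:
   a) Parameter tuning: SET hive.groupby.skewindata=TRUE;
   b) When there are many empty strings, NULLs, or other special values, filter them out first and add 1 to the final result to account for the filtered value;
         SELECT CAST(COUNT(DISTINCT imei)+1 AS bigint)
         FROM woa_all_user_info_his where pt = '2012-05-28'
         AND imei <> '' AND imei IS NOT NULL;
   c) Multiple count(DISTINCT) expressions in one query;
         union all + row_number + sum group by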
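  The hint in c) is terse, so here is one possible reading of it, not necessarily what the author had in mind. It assumes a Hive version with windowing functions (0.11 or later), and mobile is a hypothetical second column of woa_all_user_info_his used only for illustration.

```sql
-- Stack each column to be counted into (tag, val) rows with UNION ALL,
-- number the duplicates of every (tag, val) pair with ROW_NUMBER,
-- then SUM the first occurrences per tag with GROUP BY.
SELECT tag,
       SUM(CASE WHEN rn = 1 THEN 1 ELSE 0 END) AS distinct_cnt
FROM (
  SELECT tag,
         val,
         ROW_NUMBER() OVER (PARTITION BY tag, val ORDER BY val) AS rn
  FROM (
    SELECT 'imei' AS tag, imei AS val
    FROM woa_all_user_info_his
    WHERE pt = '2012-05-28' AND imei <> '' AND imei IS NOT NULL
    UNION ALL
    -- "mobile" is a hypothetical column, not taken from the original post.
    SELECT 'mobile' AS tag, mobile AS val
    FROM woa_all_user_info_his
    WHERE pt = '2012-05-28' AND mobile <> '' AND mobile IS NOT NULL
  ) u
) r
GROUP BY tag;
```

  Because rows are distributed by (tag, val) rather than funnelled to one reducer per distinct expression, the deduplication work is spread across reducers.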

4. Avoid Cartesian products;
  Most readers will already be familiar with this one;
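  For completeness, a small sketch of how a Cartesian product usually sneaks in and how to avoid it. It reuses lxw_t1 and lxw_t2 from tip 2; the strict-mode setting applies to older Hive releases and is an optional guard rather than part of the original advice.

```sql
-- Accidental Cartesian product: with no join condition, every row of
-- lxw_t1 is paired with every row of lxw_t2.
--   SELECT a.id, b.name FROM lxw_t1 a JOIN lxw_t2 b;

-- Optional guard on older Hive releases: strict mode rejects Cartesian
-- products (it also requires partition filters on partitioned tables).
SET hive.mapred.mode=strict;

-- Intended query: an explicit equality condition in the ON clause.
SELECT a.id, b.name
FROM lxw_t1 a
JOIN lxw_t2 b
  ON a.id = b.id
WHERE a.pt = '2012-11-22';
```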

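5. Use mapjoin where it fits

It is mainly intended for joining a very small table with a very large one
With select /*+ mapjoin(tablelist) */, the tables in tablelist are loaded into memory and distributed to every map task, so the join is completed on the map side
Related parameters: hive.mapjoin.cache.numrows=25000
hive.mapjoin.smalltable.filesize=25000000
In addition: for map joins on larger data, see Bucket Map Join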
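A minimal sketch of the hint in use, assuming lxw_t2 (from tip 2) is the small table and fits in memory. Later Hive versions can ignore the hint by default and rely on automatic conversion instead, so the second statement shows that alternative.

```sql
-- Assumption: lxw_t2 is small enough to be held in memory by each map task.
SELECT /*+ mapjoin(b) */ a.id, b.name
FROM lxw_t1 a
JOIN lxw_t2 b
  ON a.id = b.id;

-- Alternative: let Hive convert eligible joins to map joins on its own,
-- using hive.mapjoin.smalltable.filesize as the size threshold.
SET hive.auto.convert.join=true;
```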

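6. Use union all and multi insert where each fits

  A union all over the same table is much faster than a multi insert,
  because Hive optimizes this kind of union all so that the source table is scanned only once;

  select type,popt_id,login_date
  from (
    select 'm3_login' as type,popt_id,login_date
    from lxw_test3
    where login_date>='2012-02-01' and login_date<'2012-05-01'
    union all
    select 'mn_login' as type,popt_id,login_date
    from lxw_test3
    where login_date>='2012-05-01' and login_date<='2012-05-09'
    union all
    ......
  ) x;

  A multi insert also scans the source table only once, but because it has to insert into several partitions it does a lot of additional work, which makes it take much longer;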

from lxw_test3
insert overwrite table lxw_test6 partition (flag = '1')
select 'm3_login' as type,popt_id,login_date
where login_date>='2012-02-01' and login_date<'2012-05-01'
insert overwrite table lxw_test6 partition (flag = '2')
select 'mn_login' as type,popt_id,login_date
where login_date>='2012-05-01' and login_date<='2012-05-09'
insert overwrite table lxw_test6 partition (flag = '3')
select 'm3_g_login' as type,popt_id,login_date
where login_date>='2012-02-01' and login_date<'2012-05-01' and apptypeid='1';

7. Use dynamic partitions where they fit

  SET hive.exec.dynamic.partition=true;
  SET hive.exec.dynamic.partition.mode=nonstrict;

  create table lxw_test1 (
    sndaid string,
    mobile string
  ) partitioned by (pt STRING)
  stored as rcfile;

  insert overwrite table lxw_test1 partition (pt)
  select sndaid,mobile,pt from woa_user_info_mes_tmp1 limit 10;

  Note: the dynamic partition column pt must come last in the select list
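  Two related things are worth checking around a dynamic-partition insert: the limits on how many partitions one statement may create, and which partitions were actually written. The sketch below uses illustrative values only; they are assumptions, not tuning advice.

```sql
-- Upper bounds on the number of partitions a single statement may create,
-- in total and per mapper/reducer node. Values here are illustrative.
SET hive.exec.max.dynamic.partitions=1000;
SET hive.exec.max.dynamic.partitions.pernode=100;

-- After the insert has run, list the partitions that now exist.
SHOW PARTITIONS lxw_test1;
```

  If the insert would exceed either limit the job fails, so these settings also act as a safeguard against a select that produces far more distinct pt values than expected.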