How Hive determines the number of reducers for a MapReduce job

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/zhoudetiankong/article/details/51121857

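This article explains how Hive computes the number of reduce tasks, covering the key parameters: total input size, the amount of data each reducer handles, and the maximum number of reducers.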

Version: Hive 1.2.1

Look at the source: the estimateReducers method in the org.apache.hadoop.hive.ql.exec.Utilities class.

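   Parameter 1: totalInputFileSize    total number of bytes across all inputs of the job
   Parameter 2: bytesPerReducer       amount of data per reducer, set by hive.exec.reducers.bytes.per.reducer; the default in this version is 256 MB
   Parameter 3: maxReducers           maximum number of reducers allowed for one MapReduce job, set by hive.exec.reducers.max; the default is 1009
   Parameter 4: powersOfTwo           bucket-related flag; false by default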


  public static int estimateReducers(long totalInputFileSize, long bytesPerReducer,
      int maxReducers, boolean powersOfTwo) {
    double bytes = Math.max(totalInputFileSize, bytesPerReducer);
    int reducers = (int) Math.ceil(bytes / bytesPerReducer);
    reducers = Math.max(1, reducers);
    reducers = Math.min(maxReducers, reducers);

    int reducersLog = (int)(Math.log(reducers) / Math.log(2)) + 1;
    int reducersPowerTwo = (int)Math.pow(2, reducersLog);

    if (powersOfTwo) {
      // If the original number of reducers was a power of two, use that
      if (reducersPowerTwo / 2 == reducers) {
        // nothing to do
      } else if (reducersPowerTwo > maxReducers) {
        // If the next power of two greater than the original number of reducers is greater
        // than the max number of reducers, use the preceding power of two, which is strictly
        // less than the original number of reducers and hence the max
        reducers = reducersPowerTwo / 2;
      } else {
        // Otherwise use the smallest power of two greater than the original number of reducers
        reducers = reducersPowerTwo;
      }
    }
    return reducers;
  }
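
The powersOfTwo branch is easier to follow with concrete numbers. The following is a minimal sketch that copies only that rounding logic into a standalone class. The class name PowerOfTwoDemo and the method roundToPowerOfTwo are hypothetical names used for illustration and are not part of Hive.

```java
public class PowerOfTwoDemo {

  // Hypothetical helper: mirrors only the powersOfTwo branch of estimateReducers.
  // "reducers" is the count already clamped to the range [1, maxReducers].
  public static int roundToPowerOfTwo(int reducers, int maxReducers) {
    int reducersLog = (int) (Math.log(reducers) / Math.log(2)) + 1;
    int reducersPowerTwo = (int) Math.pow(2, reducersLog);

    if (reducersPowerTwo / 2 == reducers) {
      // Already a power of two: keep it.
      return reducers;
    } else if (reducersPowerTwo > maxReducers) {
      // The next power of two would exceed the cap: round down instead.
      return reducersPowerTwo / 2;
    }
    // Otherwise round up to the next power of two.
    return reducersPowerTwo;
  }

  public static void main(String[] args) {
    System.out.println(roundToPowerOfTwo(5, 1009)); // rounds up: 8
    System.out.println(roundToPowerOfTwo(5, 7));    // 8 exceeds the cap of 7, rounds down: 4
    System.out.println(roundToPowerOfTwo(4, 1009)); // already a power of two: 4
  }
}
```

Note that rounding down can yield fewer reducers than the size-based estimate, so each reducer may then receive more than bytesPerReducer of input.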

From this code, when powersOfTwo is false, the number of reducers is min(max(ceil(totalInputFileSize / bytesPerReducer), 1), maxReducers).
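
As a worked example of this formula, the sketch below reproduces the non-power-of-two path in a standalone class and evaluates it for three input sizes. The class name ReducerFormulaDemo, the method estimate, and the sample values (256 MiB per reducer, a cap of 1000 reducers) are assumptions chosen for illustration; they are not Hive's API or necessarily the settings of a given cluster.

```java
public class ReducerFormulaDemo {

  // Hypothetical helper: the estimateReducers logic with powersOfTwo == false.
  public static int estimate(long totalInputFileSize, long bytesPerReducer, int maxReducers) {
    // Treat inputs smaller than one reducer's share as exactly one share.
    double bytes = Math.max(totalInputFileSize, bytesPerReducer);
    int reducers = (int) Math.ceil(bytes / bytesPerReducer);
    reducers = Math.max(1, reducers);
    return Math.min(maxReducers, reducers);
  }

  public static void main(String[] args) {
    long mib = 1024L * 1024L;
    long bytesPerReducer = 256 * mib; // illustrative value, passed in explicitly
    int maxReducers = 1000;           // illustrative cap, passed in explicitly

    // 100 MiB is below one reducer's share, so one reducer is used: 1
    System.out.println(estimate(100 * mib, bytesPerReducer, maxReducers));
    // 1000 MiB / 256 MiB = 3.90625, rounded up: 4
    System.out.println(estimate(1000 * mib, bytesPerReducer, maxReducers));
    // 1 TiB / 256 MiB = 4096, capped by maxReducers: 1000
    System.out.println(estimate(1024 * 1024 * mib, bytesPerReducer, maxReducers));
  }
}
```

The three cases correspond to the three parts of the formula: the lower bound of 1, the ceiling of the division, and the upper bound maxReducers.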

Of course, not every MapReduce job goes through this reducer estimate. Some SQL statements, such as those with an ORDER BY, force the number of reducers to 1.