Setting the Number of Map and Reduce Tasks in MapReduce

Latest recommended article published 2023-10-08 10:05:22


License: CC 4.0 BY-SA


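In MapReduce, the number of Map tasks is determined by the input file sizes, the split-size parameters, and the HDFS block size, while the number of Reduce tasks can be set with setNumReduceTasks. FileInputFormat divides files logically into splits, and TextInputFormat parses them into key-value pairs. The default number of Reduce tasks is 1, and using too many of them hurts performance.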

I. The Number of Map Tasks

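Before data is read in the map phase, FileInputFormat divides the input files into splits, and the number of splits determines the number of map tasks (one split corresponds to one map task). The main factors that affect the number of map tasks are: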

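1) File size. For example, a file larger than 128 MB (the default block size) but smaller than 256 MB is normally divided into two splits. FileInputFormat lets the last split run up to 10% over the split size, so a file only slightly larger than 128 MB can still end up as a single split.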

2) Number of files. FileInputFormat splits each file separately: a single file larger than 128 MB is divided into multiple splits, whereas a file smaller than 128 MB becomes one split of its own.

3) The split size. Files are cut according to the split size, which by default equals the HDFS block size but can be adjusted through parameters:

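splitSize = Math.max(minSize, Math.min(maxSize, blockSize))

where:

minSize = mapred.min.split.size (named mapreduce.input.fileinputformat.split.minsize in newer Hadoop versions)

maxSize = mapred.max.split.size (named mapreduce.input.fileinputformat.split.maxsize in newer Hadoop versions)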

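We can set these values by adding the following code to the driver of the MapReduce program:

TextInputFormat.setMinInputSplitSize(job, 1024L * 1024 * 64); // set the minimum split size (64 MB)

TextInputFormat.setMaxInputSplitSize(job, 1024L * 1024 * 64); // set the maximum split size (64 MB)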

To summarize:

When mapreduce.input.fileinputformat.split.maxsize > mapreduce.input.fileinputformat.split.minsize > dfs.blocksize, the split size is determined by mapreduce.input.fileinputformat.split.minsize.

When mapreduce.input.fileinputformat.split.maxsize > dfs.blocksize > mapreduce.input.fileinputformat.split.minsize, the split size is determined by dfs.blocksize.

When dfs.blocksize > mapreduce.input.fileinputformat.split.maxsize > mapreduce.input.fileinputformat.split.minsize, the split size is determined by mapreduce.input.fileinputformat.split.maxsize.
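The three cases can be checked with a few lines of plain Java. This is only a sketch of the formula quoted in this article, not Hadoop's own code; the class name SplitSizeSketch and the sample sizes are made up for illustration.

```java
// Sketch of the split-size rule; plain Java, no Hadoop dependency.
// SplitSizeSketch and the sample values are illustrative assumptions.
final class SplitSizeSketch {
    static final long MB = 1024L * 1024L;

    // Same expression as the formula in the article:
    // max(minSize, min(maxSize, blockSize)).
    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long block = 128 * MB;
        // Case 1: maxsize > minsize > block size, so minsize wins.
        System.out.println(computeSplitSize(256 * MB, 512 * MB, block) / MB); // 256
        // Case 2: maxsize > block size > minsize, so the block size wins.
        System.out.println(computeSplitSize(1, 512 * MB, block) / MB);        // 128
        // Case 3: block size > maxsize > minsize, so maxsize wins.
        System.out.println(computeSplitSize(1, 64 * MB, block) / MB);         // 64
    }
}
```

In other words, raising minsize above the block size makes splits larger (fewer map tasks), and lowering maxsize below the block size makes them smaller (more map tasks).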

Here TextInputFormat extends FileInputFormat, and FileInputFormat extends InputFormat.

InputFormat divides the input files logically into splits; each split corresponds to one map task that runs in the MR job.

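FileInputFormat first divides the input files logically, in units of 128 MB (the default HDFS block size), into a number of splits, and each split corresponds to one Mapper task. Note that FileInputFormat only divides files that are larger than an HDFS block; a file smaller than an HDFS block is not divided, but is treated as a single split and assigned its own Mapper task. This is one reason why Hadoop processes a few large files more efficiently than many small files.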
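The effect of small files on the number of map tasks can be illustrated with a small counting sketch. It assumes the per-file loop used by FileInputFormat, including its 10% tolerance for the last split (the SPLIT_SLOP constant of 1.1), and it only handles non-empty, splittable files; the class and method names are hypothetical.

```java
// Sketch: count the splits (and therefore map tasks) for a set of files.
// Plain Java, no Hadoop dependency; names are illustrative assumptions.
final class SplitCountSketch {
    static final long MB = 1024L * 1024L;
    // The last split may be up to 10% larger than the split size.
    static final double SPLIT_SLOP = 1.1;

    // Number of splits for one non-empty, splittable file.
    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining > 0) {
            splits++; // the tail becomes the final split
        }
        return splits;
    }

    // Files are split one by one, so the totals simply add up.
    static int countSplitsForFiles(long[] fileLengths, long splitSize) {
        int total = 0;
        for (long length : fileLengths) {
            total += countSplits(length, splitSize);
        }
        return total;
    }

    public static void main(String[] args) {
        long splitSize = 128 * MB;
        // One 1000 MB file.
        System.out.println(countSplits(1000 * MB, splitSize)); // 8
        // The same 1000 MB stored as 100 files of 10 MB each.
        long[] small = new long[100];
        java.util.Arrays.fill(small, 10 * MB);
        System.out.println(countSplitsForFiles(small, splitSize)); // 100
    }
}
```

The same amount of data needs 8 map tasks as one large file but 100 map tasks as small files, and each extra task carries its own startup and scheduling overhead.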

After FileInputFormat has divided the files into splits, TextInputFormat parses each line of each split into a key-value pair, <k1,v1>.

In short, FileInputFormat divides the files into splits, while TextInputFormat is responsible for parsing each line into a key-value pair <k1,v1>.
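To make <k1,v1> concrete: with TextInputFormat, the key is the byte offset at which a line starts in the file and the value is the text of that line. The sketch below imitates this on an in-memory string. It is not Hadoop's record reader: it assumes UTF-8 text with "\n" line endings, and the class name LineRecordSketch is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: turn text into <byte offset, line> pairs, the shape of the
// records a Mapper<LongWritable, Text, ...> receives from TextInputFormat.
final class LineRecordSketch {
    static List<Map.Entry<Long, String>> toRecords(String content) {
        byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        int lineStart = 0;
        for (int i = 0; i <= bytes.length; i++) {
            boolean atEnd = (i == bytes.length);
            // Emit a record at each newline, and at end of input if a
            // final line without a trailing newline is still pending.
            if (atEnd ? lineStart < bytes.length : bytes[i] == '\n') {
                String line = new String(bytes, lineStart, i - lineStart,
                        StandardCharsets.UTF_8);
                records.add(new AbstractMap.SimpleEntry<>((long) lineStart, line));
                lineStart = i + 1;
            }
        }
        return records;
    }

    public static void main(String[] args) {
        for (Map.Entry<Long, String> r : toRecords("a,b\nc,d,e\nf")) {
            System.out.println("<" + r.getKey() + ", " + r.getValue() + ">");
        }
        // prints <0, a,b> then <4, c,d,e> then <10, f>
    }
}
```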

II. The Number of Reduce Tasks

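Reduce tasks perform data aggregation, and their number defaults to 1. Using too many Reduce tasks means a more complex shuffle and a sharp increase in the number of output files, because each Reduce task writes its own output file. Setting the number of reduce tasks is much simpler than setting the number of map tasks: just call setNumReduceTasks on the job.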
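Once there is more than one Reduce task, each map output key has to be routed to exactly one of them. The sketch below shows the rule used by Hadoop's default HashPartitioner, (hashCode & Integer.MAX_VALUE) % numReduceTasks. As an assumption for illustration it uses String.hashCode() rather than Text.hashCode(), so the actual assignment in a real job will differ; the class name is hypothetical.

```java
// Sketch of hash partitioning across reduce tasks; plain Java, no Hadoop.
// Uses String.hashCode() as a stand-in for the key's real hash (assumption).
final class ReducePartitionSketch {
    // Masking with Integer.MAX_VALUE clears the sign bit, so the
    // result is always in the range [0, numReduceTasks).
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 2; // same value the WordCount driver passes to setNumReduceTasks
        String[] words = {"hadoop", "map", "reduce", "split"};
        for (String w : words) {
            int p = partitionFor(w, numReduceTasks);
            // Each reduce task writes one output file, named part-r-NNNNN.
            System.out.println(w + " -> " + String.format("part-r-%05d", p));
        }
    }
}
```

All occurrences of the same word land in the same partition, which is what lets a single Reduce task compute that word's complete count.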

Below is a simple demonstration that uses WordCount as an example.

package hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * WordCount example that sets the input split size and the number of reduce tasks.
 */
public class WordCountAPP {
    public static class MyMapper extends Mapper<LongWritable,Text,Text,LongWritable>{
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            System.out.println("setUp");
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            System.out.println("cleanUp");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Each input line is a comma-separated list of words
            String[] values = value.toString().split(",");

            // Emit <word, 1> for every word on the line
            for (String s : values){
                Text k = new Text(s);
                context.write(k, new LongWritable(1));
            }
        }
    }
    public static class MyReducer extends Reducer<Text,LongWritable,Text,LongWritable>{
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            // Add up the counts emitted for this word
            long sum = 0;
            for (LongWritable v : values){
                sum += v.get();
            }
            context.write(key,new LongWritable(sum));
        }
    }


    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf,"wordcount");
        job.setJarByClass(WordCountAPP.class);

        if (args.length < 2){
            System.err.println("Please enter <inputPath> and <outputPath>");
            System.exit(2); // stop here instead of failing on args[1] below
        }
        Path outputPath = new Path(args[1]);
        FileSystem fs = FileSystem.get(conf);
        // Delete an existing output directory so the job does not fail on startup
        if (fs.exists(outputPath)){
            fs.delete(outputPath,true);
            System.out.println("outputpath exists,delete!");
        }
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Set both the minimum and the maximum split size to 64 MB
        FileInputFormat.setMinInputSplitSize(job, 1024L * 1024 * 64);
        FileInputFormat.setMaxInputSplitSize(job, 1024L * 1024 * 64);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // Run two reduce tasks, so the job writes two output files
        job.setNumReduceTasks(2);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileOutputFormat.setOutputPath(job, outputPath);

        System.exit(job.waitForCompletion(true) ? 0: 1);
    }



}

The split size set in the code is 64 MB and the input file is 98 MB. The results of the run are as follows: