MapReduce 浅析 (2)

最新推荐文章于 2026-01-08 15:41:03 发布

原创最新推荐文章于 2026-01-08 15:41:03 发布 · 518 阅读

0 ·

CC 4.0 BY-SA版权

文章标签：

#Cloud #Hadoop #MapReduce #大数据 #云计算

Cloud 专栏收录该内容

2 篇文章

订阅专栏

本文介绍MapReduce中的优化技术，包括In-Mapper Combining、Combiner、Partitioner及Comparator的作用和实现方法，帮助读者理解如何有效减少数据传输量并提高处理效率。

本文目录

1 示例代码

2 In-Mapper Combining

1 示例代码

示例代码出自《MapReduce 浅析 (1)》，本文将基于该示例代码进行分析。

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NumberCount {

    /**
     * Mapper     : 学生A、学生B、学生C
     * Object     : 输入数据 Key 类型
     * Text       : 输入数据 Value 类型
     * Text       : 输出数据 Key 类型（输出至 Reducer 的 Key 类型）
     * IntWritable: 输出数据 Value 类型（输出至 Reducer 的 Value 类型）
     */
    public static class StudentMapper extends Mapper<Object, Text, Text, IntWritable> {

        /**
         * Mapper Task: 统计各自内容中，每个数字出现的次数。
         * key        : 输入数据 Key （源自 Hadoop 分派的数据）
         * value      : 输入数据 Value （源自 Hadoop 分派的数据）
         * context    : 输出对象 Context （用于输出数据至 Reducer 的对象）
         */
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

            // 用数组记录数字[0-9]个数
            int[] numberCount = new int[] { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };

            // 将输入字符串中的非数字内容替换为空
            String strNum = value.toString().replaceAll("[^0-9]", "");

            // 遍历替换后的字符串中的每个字符
            for (int index = 0; index < strNum.length(); index++) {
                int arrayIndex = strNum.charAt(index) - 48;
                numberCount[arrayIndex]++;
            }

            // 将统计结果以键值对<数字,个数>的形式提交
            for (int index = 0; index < numberCount.length; index++) {
                context.write(new Text(String.valueOf(index)), new IntWritable(numberCount[index]));
            }
        }
    }

    /**
     * Reducer    : 学生甲、学生乙
     * Text       : 输入数据 Key 类型（对应 Mapper 输出 Key 类型）
     * IntWritable: 输入数据 Value 类型（对应 Mapper 输出 Value 类型）
     * Text       : 输出数据 Key 类型 （输出最终数据的 Key 类型）
     * IntWritable: 输出数据 Value 类型 （输出最终数据的 Value 类型）
     */
    public static class StudentReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        /**
         * Reducer Task: 汇总来自Mapper的数据。
         * key         : 输入数据 Key （源自 Mapper 输出的 Key）
         * value       : 输入数据 Value （源自 Mapper 输出的一组 Value）
         * context     : 输出对象 Context （用于输出最终数据）
         */
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

            IntWritable count = new IntWritable();

            // 汇总每个 Key(数字) 对应的 Value(个数) 值之和
            int sum = 0;
            for (IntWritable val : values) {
   			    sum += val.get();
            }
            count.set(sum);

            // 将汇总结果提交
            context.write(key, count);
        }
    }

    /**
     * 入口函数
     */
    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();

        // 创建 Job
        Job job = new Job(conf, "Number Count");

        // 设置 Mapper
        job.setMapperClass(StudentMapper.class);
        // 设置 Reducer
        job.setReducerClass(StudentReducer.class);

        // 设置 Mapper 的输出 Key 类型
        job.setOutputKeyClass(Text.class);
        // 设置 Mapper 的输出 Value 类型
        job.setOutputValueClass(IntWritable.class);

        // 设置 Data 输入路径
        FileInputFormat.addInputPath(job, new Path("/user/hadoop/inputDemo"));
        // 设置 Result 输出路径
        FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

2 In-Mapper Combining

在 Job 开始执行时，每个 Mapper 节点会创建一个 StudentMapper 类对象，用于处理各自的 InputSplit 数据。

每次读取一行数据，然后调用一次 map 方法，最后，调用 context.write() 方法，将处理结果提交至 Reducer 节点。

假设：某 Mapper 节点的 InputSplit 共10行字符串，每次调用 map 方法提交10个键值对。

最终，将有100个键值对需要从 Mapper 节点传输至 Reducer 节点。

为了减少键值对传输的数量，我们将 numberCount 改为全局变量，并在 Mapper 节点中使用 cleanup() 方法。

public static class StudentMapper extends Mapper<Object, Text, Text, IntWritable> {

    // 全局变量：用数组记录数字[0-9]个数
    private int[] numberCount = new int[] { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

        // 将输入字符串中的非数字内容替换为空
        String strNum = value.toString().replaceAll("[^0-9]", "");

        // 遍历替换后的字符串中的每个字符
        for (int index = 0; index < strNum.length(); index++) {
            int arrayIndex = strNum.charAt(index) - 48;
            numberCount[arrayIndex]++;
        }
    }

    /**
     * 介绍：当前节点的所有 map 任务完成后，最终调用一次 cleanup() 方法。
     * 当处理完所有 InputSplit 之后，最终统一提交 numberCount 中的数据至 Reducer 节点。
     */
    public void cleanup(Context context) throws IOException, InterruptedException {

        // 将统计结果以键值对<数字,个数>的形式提交
        for (int index = 0; index < numberCount.length; index++) {
            context.write(new Text(String.valueOf(index)), new IntWritable(numberCount[index]));
        }
    }
}

使用 In-Mapper Combining 设计后，最终每个 Mapper 节点只会传输10个键值对至 Reducer 节点。

提示：该设计需要注意 Mapper 节点的内存使用情况，避免由于本地缓存数据不断增多而导致的内存溢出情况。

3 Combiner

在 MapReduce 中，Mapper 节点将处理后的数据提交至 Reducer 进行归纳汇总。

为了减少二者之间的键值对传输数量，我们使用了 In-Mapper Combining 设计，除此之外，我们还可以通过 Combiner 来达到这一目的。

Combiner 与 Mapper 处于同一节点（同一主机），可以将 Combiner 理解为 Mapper 节点内部的 Reducer。

当 Mapper 节点处理完数据后，Combiner 先进行一次归纳汇总，将可以合并的键值对进行合并，然后再提交至 Reducer 节点，从而减少键值对的传输量。

若 Combiner 的处理规则与 Reducer 的处理规则完全一致，则可以直接将 Reducer 类作为 Combiner 添加至 Job 中。

若二者的处理规则不同，则需要重新定义一个 Combiner 类（同样继承Reducer），然后将其作为 Combiner 添加至 Job 中。

public class NumberCount {

    /**
     * 入口函数
     */
    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();

        // 创建 Job
        Job job = new Job(conf, "Number Count");

        // 此处省略其它代码

        // 设置 Combiner （规则相同，直接将 Reducer 类作为 Combiner 使用）
        job.setCombinerClass(StudentReducer.class);

        // 此处省略其它代码
    }
}

4 Partitioner

Partitioner 是 "Shuffle and Sort" 层的重要部分，用于分割中间数据（Mapper提交的数据），将中间数据映射至相应的 Reducer 节点，即决定每个中间数据应该由哪个 Reducer 来处理。

public static class NumberCount {

    /**
     * Partitioner类
     */
    public static class StudentPartitioner extends Partitioner<Text, Text> {

        /**
         * key            : Mapper 输出的 Key
         * value          : Mapper 输出的 Value
         * numReduceTasks : Reducer 的数量
         */
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {

            // 将 Key 值转为 int 类型
            int number = Integer.parseInt(key.toString().trim());

            // 未使用 Reducer 情况
            if(numReduceTasks == 0)
            {
                return 0;
            }

            // 根据 number 值，决定对应的 Reducer 编号。
            if (number <= 1) {
                return 0;                   // 数字0-1的数据交由0号 Reducer 处理
            } else if (number <= 3) {
                return 1 % numReduceTasks;  // 数字2-3的数据交由(1 % numReduceTasks)号 Reducer 处理
            } else if (number <= 5) {
                return 2 % numReduceTasks;  // 数字4-5的数据交由(2 % numReduceTasks)号 Reducer 处理
            } else if (number <= 7) {
                return 3 % numReduceTasks;  // 数字6-7的数据交由(3 % numReduceTasks)号 Reducer 处理
            } else {
                return 4 % numReduceTasks;  // 数字8-9的数据交由(4 % numReduceTasks)号 Reducer 处理
            }
        }
    }

    /**
     * 入口函数
     */
    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();

        // 创建 Job
        Job job = new Job(conf, "Number Count");

        // 此处省略其它代码

        // 设置 Partitioner
        job.setPartitionerClass(StudentPartitioner.class);

        // 此处省略其它代码
    }
}

说明：百分号（%）表示取余操作。

若 Reducer 数量为1个，则 1 % numReduceTasks = 1 % 1 = 0，即交由0号 Reducer 处理；

若 Reducer 数量为2个，则 1 % numReduceTasks = 1 % 2 = 1，即交由1号 Reducer 处理；

若 Reducer 数量为3个，则 1 % numReduceTasks = 1 % 3 = 1，即交由1号 Reducer 处理；

若 Reducer 数量为3个，则 2 % numReduceTasks = 2 % 3 = 2，即交由2号 Reducer 处理；

若 Reducer 数量为3个，则 3 % numReduceTasks = 3 % 3 = 0，即交由0号 Reducer 处理；

5 Comparator

Comparator 也是 "Shuffle and Sort" 层的重要部分，用于对 Mapper 的输出数据，按照 Key 值进行排序，即决定到达 Reducer 的顺序。

在 Mapper 节点输出的数据，会根据 Comparator 规则，进行一次本地排序；

然后，在 Reducer 节点接收数据时，也会根据 Comparator 规则，对来自不同 Mapper 节点的数据，再进行一次整体排序。

public static class NumberCount {

    /**
     * Comparator类
     */
    public static class StudentComparator extends WritableComparator {

        @Override
        public DistrictComparator() {
            // Mapper 输出的 Key 为 Text 类型，即对 Text 类型数据进行排序
            super(Text.class, true);
        }

        /*
         * 实现倒序排序
         */
        public int compare(WritableComparable a, WritableComparable b) {
            int key1 = Integer.parseInt(((Text) a).toString());
            int key2 = Integer.parseInt(((Text) b).toString());
			
            if (key1 < key2)
                return 1;
            else if (key1 > key2)
                return -1;
            else
                return 0;
        }
    }

    /**
     * 入口函数
     */
    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();

        // 创建 Job
        Job job = new Job(conf, "Number Count");

        // 此处省略其它代码

        // 设置 Comparator
        job.setSortComparatorClass(StudentComparator.class);

        // 此处省略其它代码
    }
}