MapReduce Summarization Patterns

Article link: https://blog.youkuaiyun.com/wcandy001/article/details/49647507

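This post looks at MapReduce design patterns, focusing on numerical summarization. It uses a worked example to show how MapReduce solves a data aggregation problem: computing the maximum, minimum, and count for each ID. The demo runs on Hadoop 2.2.0 on CentOS inside a VM hosted on Windows, with Java 1.7. Each input line holds an ID and a number separated by a space.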


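This column covers MapReduce design patterns and is updated daily.
Numerical Summarization Pattern
Purpose: group records by key and aggregate them. This is the most basic design pattern.
It is mainly used for numeric aggregation, counting, and grouping, for example GROUP BY in SQL.
Environment: a VM on Windows, CentOS, Hadoop 2.2.0, three nodes, Java 1.7
The data to process: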
ID number
1 1
2 2
34 6
54 34
2 56
65 12
1 78
65 65
45 45
54 99
34 56
2 76
1 34
54 26
45 34
65 73
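Task: for each ID, find the maximum number, the minimum number, and how many times the ID appears.
Note: in each line above, the ID and the number are separated by a single space (" ").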
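
Before moving to Hadoop, it can help to see what the job is expected to compute. The following is a minimal single-machine sketch in plain Java, not part of the original program: the class name `LocalSummary` and the `long[] {max, min, count}` layout are assumptions made only for illustration. It applies the same group-by-ID logic to the sample data above.

```java
import java.util.Map;
import java.util.TreeMap;

class LocalSummary {
    // The sample "ID number" lines from the article (header line left out)
    static final String[] SAMPLE = {
        "1 1", "2 2", "34 6", "54 34", "2 56", "65 12", "1 78", "65 65",
        "45 45", "54 99", "34 56", "2 76", "1 34", "54 26", "45 34", "65 73"
    };

    // Returns ID -> {max, min, count}, sorted by ID as a string
    static Map<String, long[]> summarize(String[] lines) {
        Map<String, long[]> result = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            long number = Long.parseLong(parts[1]);
            long[] summary = result.get(parts[0]);
            if (summary == null) {
                // First record for this ID: it is both the max and the min
                result.put(parts[0], new long[] {number, number, 1});
            } else {
                summary[0] = Math.max(summary[0], number);
                summary[1] = Math.min(summary[1], number);
                summary[2]++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, long[]> e : summarize(SAMPLE).entrySet()) {
            long[] s = e.getValue();
            System.out.println(e.getKey() + "\tMax=" + s[0] + " Min=" + s[1] + " Count=" + s[2]);
        }
    }
}
```

The MapReduce version produces the same per-ID values, but spreads the grouping across mappers and reducers.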

1. First, define a custom Writable object
(it must implement the readFields and write methods of the Writable interface)

package boke;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class MaxMinCountWritable implements Writable{
    // Three fields: the maximum, the minimum, and the number of occurrences
    private long Max=0;
    private long Min=0;
    private long Count=0;
    // Setters and getters for each field
    public void setMax(long Max)
    {
        this.Max=Max;
    }
    public void setMin(long Min)
    {
        this.Min=Min;
    }
    public void setCount(long Count)
    {
        this.Count=Count;
    }
    public long getMin()
    {
        return this.Min;
    }
    public long getCount()
    {
        return this.Count;
    }
    public long getMax()
    {
        return this.Max;
    }
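    // Methods required by the Writable interface.
    // readFields must read the fields in the same order that write emits them.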
    public void readFields(DataInput in)throws IOException
    {
        Max=in.readLong();
        Min=in.readLong();
        Count=in.readLong();
    }
    public void write(DataOutput out)throws IOException
    {
        out.writeLong(Max);
        out.writeLong(Min);
        out.writeLong(Count);
    }
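    // Note: when MR writes the result to a text file it calls toString() on the value,
    // so override it here to control the output format
    @Override
    public String toString()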
    {
        return "Max="+Max+"Min="+Min+"Count="+Count;
    }
}
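
To check that `write` and `readFields` agree on field order, a round trip through a byte array is a quick test. The sketch below is an illustration rather than part of the original program: `Summary` is a hypothetical stand-in that mirrors the three long fields of `MaxMinCountWritable` without depending on Hadoop, so it can be run with the JDK alone. The same `serialize`/`deserialize` idea should apply to the real class when Hadoop is on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

class RoundTripDemo {
    // Hypothetical stand-in for MaxMinCountWritable (same fields, same order)
    static class Summary {
        long max;
        long min;
        long count;

        void write(DataOutput out) throws IOException {
            out.writeLong(max);
            out.writeLong(min);
            out.writeLong(count);
        }

        void readFields(DataInput in) throws IOException {
            max = in.readLong();
            min = in.readLong();
            count = in.readLong();
        }
    }

    // Serializes a summary into a byte array
    static byte[] serialize(Summary s) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            s.write(new DataOutputStream(bytes));
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Rebuilds a summary from the bytes produced by serialize
    static Summary deserialize(byte[] data) {
        try {
            Summary s = new Summary();
            s.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            return s;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Summary original = new Summary();
        original.max = 78;
        original.min = 1;
        original.count = 3;
        byte[] data = serialize(original);
        Summary copy = deserialize(data);
        // Three longs take 3 * 8 bytes
        System.out.println("bytes=" + data.length);
        System.out.println("Max=" + copy.max + " Min=" + copy.min + " Count=" + copy.count);
    }
}
```

If the read order and the write order ever diverge, the copy comes back with its fields swapped, which is the kind of bug this check is meant to catch.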

Below is the source code of the MR program:
package boke;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
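// Extending Configured provides getConf(), which returns the Hadoop configuration
// (loaded from files such as mapred-site.xml and hdfs-site.xml).
// GenericOptionsParser interprets the generic Hadoop command-line options. The usual
// approach is to implement the Tool interface and call ToolRunner.run(), which invokes
// GenericOptionsParser, so options such as -D and -conf can be set on the command line
// without changing the code.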
public class MaxMinCount extends Configured implements Tool{
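    // The mapper splits each line and emits the number wrapped in the custom Writable
    public static class Map extends Mapper<LongWritable,Text,Text,MaxMinCountWritable>
    {
        @Override
        public void map(LongWritable key,Text value,Context context)throws InterruptedException,IOException
        {
            // Split on whitespace so that stray extra spaces do not break parsing
            String[] lineSplit=value.toString().trim().split("\\s+");
            String ID=lineSplit[0];
            // The Writable stores long values, so parse the number as a long
            long number=Long.parseLong(lineSplit[1]);
            // A single record is its own max and min, with a count of 1
            MaxMinCountWritable MMC =new MaxMinCountWritable();
            MMC.setMax(number);
            MMC.setMin(number);
            MMC.setCount(1);
            context.write(new Text(ID), MMC);
        }
    }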
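    // Reduce merges all values that share a key; the values arrive through an Iterable.
    // For each ID (key) we walk its values, keep the largest max and the smallest min,
    // and add up how many records the key has.
    public static class Reduce extends Reducer<Text,MaxMinCountWritable,Text,MaxMinCountWritable>
    {
        @Override
        public void reduce (Text key,Iterable<MaxMinCountWritable> values,Context context)throws InterruptedException,IOException
        {
            MaxMinCountWritable outputMMCWritable =new MaxMinCountWritable();
            // Start from the extremes instead of treating 0 as "unset",
            // so that inputs containing 0 or negative numbers are handled correctly
            outputMMCWritable.setMax(Long.MIN_VALUE);
            outputMMCWritable.setMin(Long.MAX_VALUE);
            for(MaxMinCountWritable mmc : values)
            {
                if(outputMMCWritable.getMax()<mmc.getMax())
                {
                    outputMMCWritable.setMax(mmc.getMax());
                }
                if(outputMMCWritable.getMin() >mmc.getMin())
                {
                    outputMMCWritable.setMin(mmc.getMin());
                }
                // Add the incoming count rather than 1: when this class also runs as the
                // combiner, each value may already summarize several records
                outputMMCWritable.setCount(outputMMCWritable.getCount()+mmc.getCount());
            }
            context.write(key, outputMMCWritable);
        }
    }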
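    // The MR driver, which is responsible for configuring the job
    public int run(String[] args)throws Exception
    {
        // Get the Hadoop configuration
        Configuration conf=getConf();
        // Create the job from the configuration and give it a name
        // (Job.getInstance replaces the deprecated new Job(conf, name) constructor)
        Job job=Job.getInstance(conf,"MaxMinCount");
        // Set the class used to locate the job jar (the class that contains map and reduce)
        job.setJarByClass(MaxMinCount.class);
        // Set the Mapper and Reducer classes defined in this class
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);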
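        // A combiner is used here. When the map phase finishes, its output
        // (key: value1, value2, ...) is kept locally until the reduce nodes fetch it.
        // The combiner pre-aggregates the values of each key on the map side before they
        // are sent to the reducers, which cuts network traffic (fetch speed is bounded by
        // bandwidth, so the less data is shipped, the faster the reducers receive it).
        // Here the combiner does the same work as the reducer: within one map's output it
        // picks the max and min among the values of each key. For example, from
        //   (key=1: values=23,12,45,67)
        // the combiner produces
        //   (key=1: max=67, min=12, count=4)
        // The reducer then fetches the combiner output and merges the entries that
        // different maps produced for the same key, for example
        //   (key=1: max=67, min=12, count=4)
        //   (key=1: max=78, min=23, count=10)
        // so the final reduce result is (key=1: max=78, min=12, count=14).
        // Hadoop may run the combiner zero or more times, so the result must not depend on it.
        job.setCombinerClass(Reduce.class);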
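        // Note: if the map output types differ from the reduce output types,
        // the map output key and value classes must be set separately
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(MaxMinCountWritable.class);
        // These are the job's final output types. If the map output types are not set above,
        // Hadoop assumes the map uses these as well (they are identical here; both are
        // written out for illustration)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(MaxMinCountWritable.class);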
        // Set the input format; TextInputFormat is the default when none is given.
        // Simplified excerpt from the TextInputFormat source:
        //   public class TextInputFormat extends FileInputFormat<LongWritable, Text> {
        //
        //     public RecordReader<LongWritable, Text>
        //       createRecordReader(InputSplit split,
        //                          TaskAttemptContext context) {
        //       return new LineRecordReader();
        //       // The RecordReader reads line by line; a custom input format can
        //       // return a different RecordReader to change how input is read.
        //     }
        //   }
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
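        // The two arguments are the input path and the output path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submit the job to the cluster and wait for it to finish
        boolean success=job.waitForCompletion(true);
        // Exit code convention: 0 means success, non-zero means failure
        return success?0:1;
    }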
    public static void main(String[] args)throws Exception
    {
        // ToolRunner.run() uses GenericOptionsParser internally to parse the generic Hadoop options
        int rsa=ToolRunner.run(new Configuration(), new MaxMinCount(), args);
        System.exit(rsa);
    }
}
Output:
1   Max=78Min=1Count=3
2   Max=76Min=2Count=3
34  Max=56Min=6Count=2
45  Max=45Min=34Count=2
54  Max=99Min=26Count=3
65  Max=73Min=12Count=3

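Summary: numerical summarization is the most common MR design pattern. A combiner helps efficiency a great deal, so use one wherever the aggregation can be applied to partial results. Note that the combiner here reuses the reducer code, so the reducer's input and output types must be the same. In this example the reducer is
Reducer<Text,MaxMinCountWritable,Text,MaxMinCountWritable>. The combiner acts as a pre-processing pass for the reduce, so it must accept the same types the reducer accepts, and because its output is fed to the reducer, the reducer's input and output types must match (see the chain below). Alternatively, a separate custom combiner can be written.
(input) <LongWritable,Text> Map (output) <Text,MMC> ---> (input) <Text,MMC> combiner (output) <Text,MMC> ---> (input) <Text,MMC> reduce (output) <Text,MMC>
The output type of each stage must match the input type of the next stage.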
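
The reason one class can serve as both combiner and reducer is that merging two partial summaries yields another summary of the same type, and the result does not depend on how the records were split. Here is a minimal sketch of that property in plain Java. The class name `CombinerMergeDemo`, the `long[] {max, min, count}` layout, and the second split (78, 30) are assumptions made for illustration; the first split (23, 12, 45, 67) is the one from the combiner example in the code comments.

```java
import java.util.Arrays;

class CombinerMergeDemo {
    // Merges two partial summaries laid out as {max, min, count}
    static long[] merge(long[] a, long[] b) {
        return new long[] {
            Math.max(a[0], b[0]),
            Math.min(a[1], b[1]),
            a[2] + b[2]   // counts are summed, not incremented by one
        };
    }

    // Summarizes raw numbers by merging one single-record summary at a time
    static long[] summarize(long... numbers) {
        long[] summary = {Long.MIN_VALUE, Long.MAX_VALUE, 0};
        for (long n : numbers) {
            summary = merge(summary, new long[] {n, n, 1});
        }
        return summary;
    }

    public static void main(String[] args) {
        // Two map-side partial results for the same key, as a combiner would emit them
        long[] split1 = summarize(23, 12, 45, 67);
        long[] split2 = summarize(78, 30);
        // Reduce over the combined partials
        long[] combined = merge(split1, split2);
        // Reduce over the raw records, as if no combiner had run
        long[] direct = summarize(23, 12, 45, 67, 78, 30);
        System.out.println("combined=" + Arrays.toString(combined));
        System.out.println("direct=" + Arrays.toString(direct));
        System.out.println("same=" + Arrays.equals(combined, direct));
    }
}
```

Note that the count is added from each partial summary. Incrementing by one per incoming value would count combiner outputs instead of original records, which is why a reducer reused as a combiner has to sum the counts it receives.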