MapReduce Secondary Sort

learningcoder

CC 4.0 BY-SA license

Categories: Big Data, hadoop. Tags: hadoop, mapreduce, secondary sort

Article link: https://blog.youkuaiyun.com/learningcoder/article/details/82891822

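By default, MapReduce sorts the map output by key only. Some jobs need the values sorted as well as the keys, and that is where secondary sort comes in.

This article uses the example from Hadoop: The Definitive Guide of finding the maximum temperature for each year. The raw data is in no particular order: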

2008 33
2008 23
2008 43
2008 24
2008 25
2008 33
2008 13
2008 22
2008 33
2008 33
2009 23
2009 43
2009 24
2009 25
2009 33
2007 15
2007 22
2007 30
2007 100

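1. Define a composite key
Combine the key and the value into a new key (NewKey). NewKey must implement the WritableComparable interface.

I defined a new class, PairTemp, with two fields: year and temp (temperature).

It must implement the following three methods:
compareTo: sorting (year ascending, temperature descending)
write: serialization
readFields: deserialization
Serialization and deserialization must handle the fields in the same order.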

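import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Composite key: year plus temperature
public class PairTemp implements WritableComparable<PairTemp> {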
    private int year;
    private int temp;

    // Hadoop needs a no-arg constructor to instantiate the key via reflection
    public PairTemp() {
    }

    public PairTemp(int year, int temp) {
        this.year = year;
        this.temp = temp;
    }

    public int getYear() {
        return year;
    }

    public void setYear(int year) {
        this.year = year;
    }

    public int getTemp() {
        return temp;
    }

    public void setTemp(int temp) {
        this.temp = temp;
    }

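    @Override
    public int compareTo(PairTemp o) {
        if (this.year == o.getYear()) {
            // Temperature descending; Integer.compare avoids the overflow risk of subtraction
            return Integer.compare(o.getTemp(), this.temp);
        } else {
            // Year ascending
            return Integer.compare(this.year, o.getYear());
        }
    }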

    // Serialize: year first, then temp
    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeInt(year);
        dataOutput.writeInt(temp);
    }

    // Deserialize in the same order as write
    @Override
    public void readFields(DataInput dataInput) throws IOException {
        year = dataInput.readInt();
        temp = dataInput.readInt();
    }

    @Override
    public String toString() {
        return "PairTemp{" +
                "year=" + year +
                ", temp=" + temp +
                '}';
    }
}
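
To illustrate why the order in write and readFields has to match, here is a minimal round-trip sketch in plain Java. It does not depend on Hadoop: YearTempRoundTrip is a hypothetical stand-in for PairTemp, and it uses java.io streams directly instead of the framework's serialization path.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical plain-Java stand-in for PairTemp (no Hadoop dependency)
class YearTempRoundTrip {
    int year;
    int temp;

    // Same field order as PairTemp.write: year, then temp
    void write(DataOutputStream out) throws IOException {
        out.writeInt(year);
        out.writeInt(temp);
    }

    // Reads the fields back in the order they were written
    void readFields(DataInputStream in) throws IOException {
        year = in.readInt();
        temp = in.readInt();
    }

    // Serialize to a byte array, then deserialize into a fresh object
    static YearTempRoundTrip roundTrip(int year, int temp) {
        try {
            YearTempRoundTrip original = new YearTempRoundTrip();
            original.year = year;
            original.temp = temp;
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));

            YearTempRoundTrip copy = new YearTempRoundTrip();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        YearTempRoundTrip copy = roundTrip(2008, 43);
        System.out.println(copy.year + " " + copy.temp);
    }
}
```

If readFields read temp before year, the two values would come back swapped, because the byte stream carries no field names, only the values in write order.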

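2. The map side needs a custom partitioner that extends Partitioner
Override the getPartition method
so that the map output is routed to partitions according to your own rule.

For example: year % number of reduce tasks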

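import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class myPartitioner extends Partitioner<PairTemp, NullWritable> {
    @Override
    public int getPartition(PairTemp pairTemp, NullWritable nullWritable, int numPartitions) {
        // Same year -> same partition (years are assumed to be non-negative)
        return pairTemp.getYear() % numPartitions;
    }
}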
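
As a quick check of the year % reduce-count rule, the arithmetic can be sketched in plain Java. The class name PartitionDemo and the choice of three reduce tasks are assumptions made for illustration only; the original does not state how many reducers it ran.

```java
class PartitionDemo {
    // Same rule as myPartitioner.getPartition: year modulo the number of reduce tasks
    static int partition(int year, int numReduceTasks) {
        return year % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3; // assumed value for illustration
        for (int year : new int[] {2007, 2008, 2009}) {
            System.out.println(year + " -> partition " + partition(year, numReduceTasks));
        }
    }
}
```

With three reduce tasks this prints partition 0 for 2007, partition 1 for 2008 and partition 2 for 2009, so each year gets its own reducer. With fewer reducers several years would share a partition, which is still correct because the rule only guarantees that one year never spans two partitions.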

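This way all records for the same year land in the same partition, sorted by year ascending and temperature descending. The result is: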

2007   100
2007   30
2007   22
2007   15
2008   43
2008   33
2008   33
2008   33
2008   33
2008   25
2008   24
2008   23
2008   22
2008   13
2009   43
2009   33
2009   25
2009   24
2009   23
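
The ordering above can be reproduced outside Hadoop with an ordinary comparator. The following is a minimal sketch, assuming each record is held as an int[] {year, temp}; SortDemo and sample are hypothetical names, and the sample is only a subset of the input data.

```java
import java.util.ArrayList;
import java.util.List;

class SortDemo {
    // A small unordered subset of the input, as {year, temp} pairs
    static List<int[]> sample() {
        List<int[]> records = new ArrayList<>();
        records.add(new int[] {2008, 33});
        records.add(new int[] {2008, 43});
        records.add(new int[] {2009, 23});
        records.add(new int[] {2007, 15});
        records.add(new int[] {2007, 100});
        records.add(new int[] {2009, 43});
        return records;
    }

    // Same ordering as PairTemp.compareTo: year ascending, then temperature descending
    static List<int[]> sort(List<int[]> records) {
        List<int[]> sorted = new ArrayList<>(records);
        sorted.sort((a, b) -> a[0] != b[0]
                ? Integer.compare(a[0], b[0])
                : Integer.compare(b[1], a[1]));
        return sorted;
    }

    public static void main(String[] args) {
        for (int[] record : sort(sample())) {
            System.out.println(record[0] + "   " + record[1]);
        }
    }
}
```

After sorting, the first record of each year is that year's maximum, which is the property the rest of the job relies on.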

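We are now one step closer to getting each year's maximum: it looks as if we only need the first record for each year.

However, each record above is a NewKey as a whole, so records from the same year are still distinct keys. This is where grouping comes in: I want the records split into three groups by year, one each for 2007, 2008 and 2009.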

3. Custom grouping: extend WritableComparator
and override the compare method

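import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class myGroupComparator extends WritableComparator {

    protected myGroupComparator() {
        // true: create PairTemp instances so that compare() receives deserialized keys
        super(PairTemp.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        PairTemp p1 = (PairTemp) a;
        PairTemp p2 = (PairTemp) b;
        // Compare by year only, so all keys of one year fall into the same group
        return Integer.compare(p1.getYear(), p2.getYear());
    }
}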

The grouped data is then sent to reduce for processing.
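
The effect of grouping on already sorted keys can be sketched in plain Java as follows. This only imitates the idea: GroupDemo and its methods are hypothetical names, and Hadoop itself applies the grouping comparator to adjacent keys during the reduce-side merge rather than working on an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;

class GroupDemo {
    // Records already sorted by year ascending, temperature descending
    static List<int[]> sortedSample() {
        List<int[]> records = new ArrayList<>();
        records.add(new int[] {2007, 100});
        records.add(new int[] {2007, 15});
        records.add(new int[] {2008, 43});
        records.add(new int[] {2008, 33});
        records.add(new int[] {2009, 43});
        records.add(new int[] {2009, 23});
        return records;
    }

    // Adjacent records with equal years form one group; keep the first of each group
    static List<int[]> firstPerGroup(List<int[]> sorted) {
        List<int[]> firsts = new ArrayList<>();
        int[] previous = null;
        for (int[] record : sorted) {
            // Same test as the grouping comparator: compare the year only
            if (previous == null || Integer.compare(previous[0], record[0]) != 0) {
                firsts.add(record);
            }
            previous = record;
        }
        return firsts;
    }

    public static void main(String[] args) {
        for (int[] record : firstPerGroup(sortedSample())) {
            System.out.println(record[0] + " " + record[1]);
        }
    }
}
```

Because the input is sorted with temperature descending inside each year, the first record of every group is that year's maximum.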

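4. Reduce-side processing

This code writes out the first key of each group, that is, the first record of each group after sorting by year ascending and temperature descending.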

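import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class maxReducer extends Reducer<PairTemp, NullWritable, IntWritable, IntWritable> {
    @Override
    protected void reduce(PairTemp key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {

        // Before iterating, key holds the first record of the group: the year's maximum
        context.write(new IntWritable(key.getYear()), new IntWritable(key.getTemp()));
        System.out.println("--------reduce---------------");
        System.out.println("First key: " + key);

        // Debug output only: key is updated in place as the values are iterated
        for (NullWritable ignored : values) {
            System.out.println(key.getYear() + ":" + key.getTemp());
        }
        System.out.println("Last key: " + key);

    }
}

The printed output makes this clear. Hadoop reuses the key object and updates it as the values are iterated, so the key seen before the loop is the first record of the group and the key seen after the loop is the last one.

--------reduce---------------
First key: PairTemp{year=2007, temp=100}
2007:100
2007:30
2007:22
2007:15
Last key: PairTemp{year=2007, temp=15}
--------reduce---------------
First key: PairTemp{year=2008, temp=43}
2008:43
2008:33
2008:33
2008:33
2008:33
2008:25
2008:24
2008:23
2008:22
2008:13
Last key: PairTemp{year=2008, temp=13}
--------reduce---------------
First key: PairTemp{year=2009, temp=43}
2009:43
2009:33
2009:25
2009:24
2009:23
Last key: PairTemp{year=2009, temp=23}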

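5. Additional settings in the driver class

In the driver, additionally call setPartitionerClass and setGroupingComparatorClass on the job to register the custom partitioner and the grouping comparator.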