5 MapReduce: Combining, Sorting, and Grouping

License: CC 4.0 BY-SA

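This article describes how MapReduce uses a combiner to merge map output, Comparable to sort data, and a comparator to group it. A combiner reduces network transfer overhead, Comparable defines a custom sort order, and a comparator groups records by key, which makes processing more efficient.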

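Processing data with MapReduce always feels a bit "simple and crude", like working with a big, clumsy oaf.

To analyze data more flexibly, and to handle this big oaf more comfortably, today I'll summarize some of the built-in components it uses when processing data.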

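1. combiner (combining)

You may already know it well, so I won't show off here; I'll just briefly explain its advantages to get you familiar with the idea.

A combiner merges the data produced in the map phase. That sounds just like reduce; the difference is that a combiner only merges the output of a single map task. Here is a diagram borrowed from elsewhere.

[Figure: image not available]

Clear at a glance, isn't it?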

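Here is a more concrete example. Remember our word count demo? Suppose that in the map phase MapReduce reads the following line:

hello hello hello hello world

With the approach used in that demo, the map phase produces the following data:

<hello,1> <hello,1> <hello,1> <hello,1> <world,1>

In the reduce phase, MapReduce collects this data from the individual nodes, and every record collected has to travel over the network, which is rather wasteful.

This is where our combiner comes in. It adds a merge step between the map and reduce phases. Applied to the example above, the combiner condenses the map output into:

<hello,4> <world,1>

Much cleaner, isn't it? This can greatly reduce the amount of data sent over the network. To understand the combiner, the one thing to remember is that its merging happens within a single map task.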
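To make the word count example concrete, here is a minimal sketch in plain Java that simulates what a combiner does to the output of one map task. It deliberately uses no Hadoop classes; the class name `CombinerSketch` and its `map` and `combine` methods are hypothetical names chosen for illustration only.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of a word-count combiner; no Hadoop classes are used.
class CombinerSketch {

    // Map step: emit one <word,1> pair per token of the input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Combine step: sum the counts per word within ONE map task's output.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = map("hello hello hello hello world");
        System.out.println(mapOutput.size() + " pairs before combining"); // prints 5 pairs before combining
        System.out.println(combine(mapOutput));                           // prints {hello=4, world=1}
    }
}
```

In a real job the combiner is registered on the driver with `job.setCombinerClass(...)`. For word count the reducer class can usually double as the combiner, because summing is associative and commutative. Hadoop may run the combiner zero, one, or several times, so a combiner should never change the final result, only the amount of data shipped.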

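2. comparable (comparison)

Comparable is used to compare the data passed through MapReduce, which lets us sort the data being processed. In practice we usually implement the WritableComparable interface to define our own comparison rules. It is used the same way as the JDK's Comparable interface, which WritableComparable extends, so I won't belabor it and will go straight to the code.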

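import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Date;

import org.apache.hadoop.io.WritableComparable;

public class Channel implements WritableComparable<Channel> {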

    private int id;
    private String channel;
    private String url;
    private Date addTime;
    private String ip;
    private int glance;
    private int regist;
    private String mac;

    // ... getters and setters omitted ...

    @Override
    public void readFields(DataInput in) throws IOException {
        this.id=in.readInt();
        this.channel=in.readUTF();
        this.url=in.readUTF();
        this.addTime=new Date(in.readLong());
        this.ip=in.readUTF();
        this.glance=in.readInt();
        this.regist=in.readInt();
        this.mac=in.readUTF();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeUTF(channel);
        out.writeUTF(url);
        out.writeLong(addTime.getTime());
        out.writeUTF(ip);
        out.writeInt(glance);
        out.writeInt(regist);
        out.writeUTF(mac);
    }

    @Override
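    public int compareTo(Channel other) {
        // Sort by id first; Integer.compare avoids the overflow that subtraction can cause.
        int ret = Integer.compare(this.getId(), other.getId());
        if (ret != 0) {
            return ret;
        }
        // Then sort by channel.
        return this.getChannel().compareTo(other.getChannel());
    }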
}

This way our data is sorted first by id and then by channel.
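The ordering can be tried out without a cluster. The sketch below is plain Java with a stripped-down stand-in for the key (`ChannelKey` and `SortSketch` are hypothetical names, and only the two fields that take part in the ordering are kept). It uses `Integer.compare` rather than subtracting the ids, since a subtraction can overflow `int` and return the wrong sign.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Stripped-down stand-in for the Channel key: only the two fields used for ordering.
class ChannelKey implements Comparable<ChannelKey> {
    final int id;
    final String channel;

    ChannelKey(int id, String channel) {
        this.id = id;
        this.channel = channel;
    }

    @Override
    public int compareTo(ChannelKey other) {
        int ret = Integer.compare(this.id, other.id);   // primary order: id
        if (ret != 0) {
            return ret;
        }
        return this.channel.compareTo(other.channel);   // tie-breaker: channel
    }

    @Override
    public String toString() {
        return id + ":" + channel;
    }
}

class SortSketch {
    // Returns a sorted copy, ordered by ChannelKey.compareTo.
    static List<ChannelKey> sorted(List<ChannelKey> keys) {
        List<ChannelKey> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<ChannelKey> keys = new ArrayList<>();
        keys.add(new ChannelKey(2, "baidu"));
        keys.add(new ChannelKey(1, "weibo"));
        keys.add(new ChannelKey(1, "baidu"));
        System.out.println(sorted(keys)); // prints [1:baidu, 1:weibo, 2:baidu]

        // Why not subtraction: the difference overflows int and flips sign.
        System.out.println(Integer.MIN_VALUE - 1 > 0);              // prints true (wrong sign)
        System.out.println(Integer.compare(Integer.MIN_VALUE, 1));  // prints -1
    }
}
```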

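3. comparator (comparison and grouping)

In the reduce phase, the data produced by the map phase is grouped by key, and the values that share a key are passed to the same reduce call. MapReduce uses a comparator to decide which keys belong to the same group. A custom key object has to implement the interfaces required by the MapReduce API; for this kind of custom key I usually extend WritableComparator. WritableComparator implements the RawComparator interface, whose main method is: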

public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2)

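When I first ran into this method it was quite painful, because, to be honest, I'm not good at this kind of byte-level processing, and tracing the source code didn't yield much either. I would hit code fragments like this: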

isNegative ? i ^ 0xFFFFFFFFFFFFFFFF : i

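Great, I thought, do I have to go study binary encodings all over again? Fortunately, I could relax once I found the following method: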

public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2){
    try{
       this.buffer.reset(b1, s1, l1);
       this.key1.readFields(this.buffer);

       this.buffer.reset(b2, s2, l2);
       this.key2.readFields(this.buffer);
    }catch (IOException e) {
       throw new RuntimeException(e);
    }

    return compare(this.key1, this.key2);
}

The method above deserializes the binary data back into objects, which makes it much easier to work with.
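The "deserialize, then compare" idea can be illustrated without Hadoop. The following is a rough sketch using only `java.io`; `RawCompareSketch`, `serialize`, and the two-field key layout (an `int` id followed by a UTF string) are assumptions made for the example, not Hadoop's actual classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Plain-Java illustration of "deserialize, then compare"; not Hadoop's actual classes.
class RawCompareSketch {

    // Serialize an (id, channel) key with writeInt followed by writeUTF.
    static byte[] serialize(int id, String channel) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(id);
            out.writeUTF(channel);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Compare two serialized keys stored in byte ranges [s, s + l) of their buffers.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        try {
            // Rebuild the first key's fields from its byte range.
            DataInputStream in1 = new DataInputStream(new ByteArrayInputStream(b1, s1, l1));
            int id1 = in1.readInt();
            String channel1 = in1.readUTF();

            // Rebuild the second key's fields from its byte range.
            DataInputStream in2 = new DataInputStream(new ByteArrayInputStream(b2, s2, l2));
            int id2 = in2.readInt();
            String channel2 = in2.readUTF();

            // From here on it is an ordinary object-level comparison.
            int ret = Integer.compare(id1, id2);
            return ret != 0 ? ret : channel1.compareTo(channel2);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        byte[] a = serialize(1, "baidu");
        byte[] b = serialize(1, "weibo");
        System.out.println(compare(a, 0, a.length, b, 0, b.length) < 0); // prints true
    }
}
```

Deserializing on every comparison is convenient but costs object churn during the sort; comparing the bytes directly is the faster and harder route.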

Here is a simple comparator example:

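import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Date;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class Channel implements WritableComparable<Channel> {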


    private int id;
    private String channel;
    private String url;
    private Date addTime;
    private String ip;
    private int glance;
    private int regist;
    private String mac;



    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    public String getChannel() {
        return channel;
    }

    public void setChannel(String channel) {
        this.channel = channel;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public Date getAddTime() {
        return addTime;
    }

    public void setAddTime(Date addTime) {
        this.addTime = addTime;
    }

    public String getIp() {
        return ip;
    }

    public void setIp(String ip) {
        this.ip = ip;
    }

    public int getGlance() {
        return glance;
    }

    public void setGlance(int glance) {
        this.glance = glance;
    }

    public int getRegist() {
        return regist;
    }

    public void setRegist(int regist) {
        this.regist = regist;
    }

    public String getMac() {
        return mac;
    }

    public void setMac(String mac) {
        this.mac = mac;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.id=in.readInt();
        this.channel=in.readUTF();
        this.url=in.readUTF();
        this.addTime=new Date(in.readLong());
        this.ip=in.readUTF();
        this.glance=in.readInt();
        this.regist=in.readInt();
        this.mac=in.readUTF();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeUTF(channel);
        out.writeUTF(url);
        out.writeLong(addTime.getTime());
        out.writeUTF(ip);
        out.writeInt(glance);
        out.writeInt(regist);
        out.writeUTF(mac);
    }

    @Override
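    public int compareTo(Channel other) {
        // Sort by id first; Integer.compare avoids the overflow that subtraction can cause.
        int ret = Integer.compare(this.getId(), other.getId());
        if (ret != 0) {
            return ret;
        }
        // Then sort by channel.
        return this.getChannel().compareTo(other.getChannel());
    }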

    public static class Comparator extends WritableComparator{

        static {
            WritableComparator.define(Channel.class, new Comparator());
        }

        public Comparator(){
            super(Channel.class,true);
        }

        @Override
        @SuppressWarnings("rawtypes")
        public int compare(WritableComparable a, WritableComparable b) {
            if (a instanceof Channel && b instanceof Channel) {
                // Keys with the same channel compare as equal, so they go to the same reduce call.
                return ((Channel) a).getChannel().compareTo(((Channel) b).getChannel());
            }
            return super.compare(a, b);
        }
    }

}

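Now, when a Channel object is used as the map output key, records are grouped by channel and handed to the same reduce call. Of course, we also have to register the comparator in the job: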

    job.setGroupingComparatorClass(Channel.Comparator.class);
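To show what the grouping comparator actually decides, here is a small plain-Java simulation (no Hadoop classes; `GroupingSketch`, `Key`, `SORT`, and `GROUP` are hypothetical names). As far as I understand Hadoop's behavior, the grouping comparator is applied to neighbouring keys in a reducer's already sorted input, so it can only merge keys that end up adjacent. The sketch therefore assumes a sort order that puts channel first, which is an assumption of the example and not something the code above guarantees.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation of how a grouping comparator splits sorted keys into reduce calls.
class GroupingSketch {

    // Hypothetical stand-in for the Channel key with just the fields that matter here.
    static final class Key {
        final String channel;
        final int id;

        Key(String channel, int id) {
            this.channel = channel;
            this.id = id;
        }

        @Override
        public String toString() {
            return channel + "#" + id;
        }
    }

    // Sort order (assumed): channel first, then id, so equal channels end up adjacent.
    static final Comparator<Key> SORT =
            Comparator.comparing((Key k) -> k.channel).thenComparingInt(k -> k.id);

    // Grouping order: channel only, mirroring the idea of Channel.Comparator.
    static final Comparator<Key> GROUP = Comparator.comparing((Key k) -> k.channel);

    // Walk the sorted keys; start a new group whenever GROUP says the neighbour differs.
    static List<List<Key>> group(List<Key> keys) {
        List<Key> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        List<List<Key>> groups = new ArrayList<>();
        Key previous = null;
        for (Key key : sorted) {
            if (previous == null || GROUP.compare(previous, key) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(key);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Key> keys = new ArrayList<>();
        keys.add(new Key("weibo", 2));
        keys.add(new Key("baidu", 3));
        keys.add(new Key("weibo", 1));
        keys.add(new Key("baidu", 1));
        System.out.println(group(keys)); // prints [[baidu#1, baidu#3], [weibo#1, weibo#2]]
    }
}
```

Each inner list corresponds to one reduce call. In a real job with more than one reducer, keys with the same channel also have to be routed to the same reducer, which is the partitioner's job rather than the grouping comparator's.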
