Hadoop Secondary Sort

ZhouSanduo18

License: CC 4.0 BY-SA

Category: Hadoop. Tags: hadoop, secondary sort, partitioning, grouping

Article link: https://blog.youkuaiyun.com/ZhouSanduo18/article/details/50479550

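This article explains how to implement secondary sort in Hadoop MapReduce: the principle, a simple implementation, a way to speed up comparison, and how to customize the comparator and the partitioner. A TextPair class provides the custom key comparison, so that records with the same Name are ordered by Date. The article also discusses the roles that grouping and partitioning play in secondary sort, and how to write a Partitioner to suit your needs.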

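Principles of Secondary Sort

Before We Begin

In the MapReduce programming framework, suppose we need to sort data such as the rows below, first by Name and then, where Name is equal, by Date. This is what is called secondary sort.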

Name    Date    Site    Count
harry   w6d3    v10     1
harry   w6d7    v7      1
harry   w6d1    v1      2
jerry   w6d3    v10     1
jack    w6d3    v1      2
jerry   w6d6    v4      1
jack    w6d6    v9      2

Secondary sorting the data above gives:

Name    Date    Site    Count
harry   w6d1    v1      2
harry   w6d3    v10     1
harry   w6d7    v7      1
jack    w6d3    v1      2
jack    w6d6    v9      2
jerry   w6d3    v10     1
jerry   w6d6    v4      1
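
To make the target ordering concrete before bringing in MapReduce, here is a minimal plain-Java sketch that sorts rows by Name and then by Date. The class name `SecondarySortOrderDemo` and the representation of a row as a `String[]` are assumptions made for illustration only; nothing here uses Hadoop.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** JDK-only illustration of the ordering that secondary sort produces. */
final class SecondarySortOrderDemo {

    /** Each row is {Name, Date, Site, Count}; compare column 0, then column 1. */
    static final Comparator<String[]> NAME_THEN_DATE =
            Comparator.<String[], String>comparing(row -> row[0])
                    .thenComparing(row -> row[1]);

    /** Returns a sorted copy and leaves the input list untouched. */
    static List<String[]> sortByNameThenDate(List<String[]> rows) {
        List<String[]> copy = new ArrayList<>(rows);
        copy.sort(NAME_THEN_DATE);
        return copy;
    }
}
```

Passing the seven sample rows to `sortByNameThenDate` yields them in the order of the second table. The rest of the article is about getting the MapReduce shuffle to apply the same two-level comparison.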

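How Secondary Sort Works

A Simple Implementation

MapReduce processes key-value pairs. After the map phase reads its input, its output is sorted by the output key. All we need, then, is a custom Writable type, TextPair, that contains both sort fields: Name and Date.
The custom TextPair is implemented as follows: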

public static class TextPair implements WritableComparable<TextPair> {
        private Text Name;
        private Text Date;
        // Constructor
        public TextPair() {
            set(new Text(),new Text());
        }
        // Setter
        public void set(Text left, Text right) {
            this.Name = left;
            this.Date = right;
        }
        // Getters
        public Text getName() {
            return this.Name;
        }

        public Text getDate() {
            return this.Date;
        }
        // Deserialization
        @Override
        public void readFields(DataInput in) throws IOException {
            Name.readFields(in);
            Date.readFields(in);
        }
        // Serialization
        @Override
        public void write(DataOutput out) throws IOException {
            Name.write(out);
            Date.write(out);
        }
        // Override hashCode
        @Override
        public int hashCode() {
            return this.Name.hashCode() * 157 + this.Date.hashCode();
        }
        // Override equals
        @Override
        public boolean equals(Object right) {
            if ((right instanceof TextPair)) {
                TextPair r = (TextPair) right;
                return (r.Name.equals(this.Name) && r.Date.equals(this.Date));// note: equals(), not ==
            }
            return false;
        }
        // Override compareTo: Name first, then Date
        @Override
        public int compareTo(TextPair o) {
            int cmp =Name.compareTo(o.getName());
            if(cmp!=0){
                return cmp;
            }
            return Date.compareTo(o.getDate());
        }
    }

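The first part of this custom TextPair is straightforward: two Text instance variables (Name and Date), a constructor, and the setter and getter methods. The write() method serializes each Text object in turn to the output stream. readFields() mirrors it, delegating to each Text object to deserialize the bytes from the input stream and populate the fields.
Because the default partitioner in MapReduce (HashPartitioner) uses hashCode() to choose the reduce partition, a good hash function is needed so that the reduce partitions end up roughly equal in size.
TextPair implements WritableComparable, so it provides a compareTo() method, which imposes the sort order: first by the first field (Name) and, if the first fields are equal, by the second field (Date). The program above fully implements secondary sort. It is not the optimal approach, however. When TextPair is used as a MapReduce key, the serialized byte stream has to be deserialized into objects before compareTo() can be called. If two TextPair objects could be compared directly in their serialized form, the deserialization step would be unnecessary and the comparison would be faster.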
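
The cost described above is easier to see in code. The following is a JDK-only sketch, not Hadoop code: `PairSketch` is a hypothetical stand-in for TextPair, and its byte layout (one length byte followed by UTF-8 bytes, valid only for fields shorter than 128 bytes) is an assumption chosen to keep the example short. It shows the slow path, where both keys are rebuilt from bytes before compareTo() runs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for TextPair that needs only the JDK. */
final class PairSketch implements Comparable<PairSketch> {
    String name = "";
    String date = "";

    PairSketch() {}

    PairSketch(String name, String date) {
        this.name = name;
        this.date = date;
    }

    // Assumed layout per field: one length byte, then the UTF-8 bytes.
    private static void writeField(DataOutput out, String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > 127) {
            throw new IllegalArgumentException("sketch handles short fields only");
        }
        out.writeByte(utf8.length);
        out.write(utf8);
    }

    private static String readField(DataInput in) throws IOException {
        byte[] utf8 = new byte[in.readByte()];
        in.readFully(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    void write(DataOutput out) throws IOException {
        writeField(out, name);
        writeField(out, date);
    }

    void readFields(DataInput in) throws IOException {
        name = readField(in);
        date = readField(in);
    }

    @Override
    public int compareTo(PairSketch o) {
        int cmp = name.compareTo(o.name);
        return cmp != 0 ? cmp : date.compareTo(o.date);
    }

    static byte[] toBytes(PairSketch p) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            p.write(new DataOutputStream(buf));
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** The slow path: allocate and fill two objects, then call compareTo(). */
    static int compareSerialized(byte[] a, byte[] b) {
        try {
            PairSketch left = new PairSketch();
            PairSketch right = new PairSketch();
            left.readFields(new DataInputStream(new ByteArrayInputStream(a)));
            right.readFields(new DataInputStream(new ByteArrayInputStream(b)));
            return left.compareTo(right);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Every call to `compareSerialized` allocates streams, byte arrays and strings. During a sort that happens once per comparison, which is the overhead a raw byte comparator avoids.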

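Speeding Up Comparison

A TextPair is the concatenation of two Text objects, and the binary representation of a Text object is a variable-length integer holding the number of bytes in the string's UTF-8 encoding, followed by the UTF-8 bytes themselves. The trick is to read that leading length to learn how long the byte representation of the first Text object is, pass that length to the compare() method of Text's RawComparator, and then compute the right offsets for the first and second strings. The objects can then be compared without being deserialized. The details are shown below (note that this code is nested inside TextPair):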
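
The offset arithmetic is the part that is easy to get wrong, so here is a JDK-only sketch of the same idea. It is an illustration under a stated assumption, not Hadoop's implementation: every field is shorter than 128 bytes, so the length prefix occupies exactly one byte (Hadoop's variable-length integers also use a single byte for such values). The names `RawPairCompareSketch`, `encode` and `compareBytes` are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** JDK-only sketch: compare two serialized (name, date) pairs without decoding them. */
final class RawPairCompareSketch {

    /** Encodes a pair as [len][name bytes][len][date bytes]; fields must be under 128 bytes. */
    static byte[] encode(String name, String date) {
        byte[] n = name.getBytes(StandardCharsets.UTF_8);
        byte[] d = date.getBytes(StandardCharsets.UTF_8);
        if (n.length > 127 || d.length > 127) {
            throw new IllegalArgumentException("sketch handles short fields only");
        }
        byte[] out = new byte[2 + n.length + d.length];
        out[0] = (byte) n.length;
        System.arraycopy(n, 0, out, 1, n.length);
        out[1 + n.length] = (byte) d.length;
        System.arraycopy(d, 0, out, 2 + n.length, d.length);
        return out;
    }

    /** Unsigned lexicographic comparison of two byte ranges. */
    static int compareBytes(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int x = b1[s1 + i] & 0xff;
            int y = b2[s2 + i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return l1 - l2;
    }

    /** Compares name first, then date, reading only the length bytes to find offsets. */
    static int compare(byte[] b1, byte[] b2) {
        int nameLen1 = b1[0];
        int nameLen2 = b2[0];
        int cmp = compareBytes(b1, 1, nameLen1, b2, 1, nameLen2);
        if (cmp != 0) {
            return cmp;
        }
        int dateAt1 = 1 + nameLen1; // index of the date's length byte
        int dateAt2 = 1 + nameLen2;
        return compareBytes(b1, dateAt1 + 1, b1[dateAt1], b2, dateAt2 + 1, b2[dateAt2]);
    }
}
```

No strings or objects are created during `compare`; only two length bytes are read to locate the fields. The Hadoop version has the same shape, with `WritableUtils.decodeVIntSize` and `readVInt` handling length prefixes that may span several bytes.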

public static class Comparator extends WritableComparator {
    private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();

    public Comparator() {
        super(TextPair.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        try {
            /**
             * Name is a Text, serialized as a variable-length integer holding the
             * byte length of the text, followed by the UTF-8 bytes themselves
             * (length prefix + encoded text).
             * decodeVIntSize returns the size of that integer and readVInt returns
             * the byte length of the text; their sum is the serialized length of Name.
             */
            int nameL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
            int nameL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
            // As in compareTo(), compare Name first
            int cmp = TEXT_COMPARATOR.compare(b1, s1, nameL1, b2, s2, nameL2);
            if (cmp != 0) {
                return cmp;
            }
            // Then compare Date, which starts right after Name
            return TEXT_COMPARATOR.compare(b1, s1 + nameL1, l1 - nameL1, b2, s2 + nameL2, l2 - nameL2);
        } catch (IOException e) {
            throw new IllegalArgumentException(e);
        }
    }
}

static {
    // Register this comparator as the default for TextPair
    WritableComparator.define(TextPair.class, new Comparator());
}

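Custom Comparators

As TextPair shows, writing a raw comparator takes care, because byte-level details have to be handled. If you really need to write your own comparator, consult the Writable implementations in the org.apache.hadoop.io package. The helper methods in WritableUtils are useful too. Note that TextPair has two fields, and both of them (name and date) must be compared.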

    public static class myComparator extends WritableComparator {
        private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
        public myComparator() {
            super(TextPair.class);
        }

        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            try {
                int nameL1=WritableUtils.decodeVIntSize(b1[s1])+readVInt(b1,s1);
                int nameL2=WritableUtils.decodeVIntSize(b2[s2])+readVInt(b2,s2);
                // As in compareTo(), compare Name first
                int cmp = TEXT_COMPARATOR.compare(b1,s1,nameL1,b2,s2,nameL2);
                if(cmp!=0){
                    return cmp;
                }
                // Then compare Date
                return TEXT_COMPARATOR.compare(b1,s1+nameL1,l1-nameL1,b2,s2+nameL2,l2-nameL2);
            } catch (IOException e) {
                throw new IllegalArgumentException(e);
            }
        }
    }

    static{
        /* Register the default comparator: keys are then compared with myComparator.compare()
           rather than compareTo(). This block belongs in TextPair itself (not inside
           myComparator) so that it runs when TextPair is loaded. */
        WritableComparator.define(TextPair.class,new myComparator());
    }

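A custom comparator extends WritableComparator. It can define a sort order that differs from the natural order given by the default comparator. The code above shows a comparator for the TextPair type, called myComparator, which takes both strings of a TextPair object into account.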
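
As an illustration of a sort order that differs from the natural one, the sketch below keeps Name ascending but puts the newest Date first. It is plain Java with no Hadoop dependency; `CustomOrderSketch`, `NATURAL` and `NEWEST_FIRST` are hypothetical names, and a key is represented as a two-element `String[]` of {name, date}. In a real job the same decision would be made inside the comparator's `compare` method.

```java
import java.util.Arrays;
import java.util.Comparator;

/** JDK-only sketch of a natural order and one alternative order for (name, date) keys. */
final class CustomOrderSketch {

    /** The order used throughout the article: name ascending, then date ascending. */
    static final Comparator<String[]> NATURAL =
            Comparator.<String[], String>comparing(p -> p[0])
                    .thenComparing(p -> p[1]);

    /** Hypothetical alternative: name ascending, but date descending. */
    static final Comparator<String[]> NEWEST_FIRST =
            Comparator.<String[], String>comparing(p -> p[0])
                    .thenComparing(p -> p[1], Comparator.<String>reverseOrder());

    /** Returns a sorted copy of the keys under the given order. */
    static String[][] sorted(String[][] pairs, Comparator<String[]> order) {
        String[][] copy = pairs.clone();
        Arrays.sort(copy, order);
        return copy;
    }
}
```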

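Grouping & Partitioning

1. Data flow in MapReduce
(1) The simplest flow: map -> reduce
(2) With a custom partitioner that sends map output to a chosen reducer: map -> partition -> reduce
(3) With a local reduce added first as an optimization: map -> combine (local reduce) -> partition -> reduce
2. The concept and use of partitioning in MapReduce
(1) How partitioning works and what it is for
When the map function starts producing output, the output is not written straight to disk. It is first buffered in memory and, for efficiency, pre-sorted there. Each map task has a circular memory buffer for storing its output (100 MB by default). Once the buffer reaches a threshold (80% by default), a background thread starts spilling its contents to disk. Map output keeps being written to the buffer during the spill; if the buffer fills up in the meantime, the map blocks until the spill completes.
Once records come out of the map, which reducers should process them? By default Hadoop dispatches records by hash value, but in practice this is not always efficient, nor does it always do what we want. For example, after partitioning, the reducer on one node may receive 20 records while another receives 100,000, which drags down the whole job, since the job is only as fast as its slowest reducer. Or we may want the output files to follow a certain pattern: with two reducers, say, part-00000 should hold the results for records starting with "h" and part-00001 the results for everything else. The default partitioner cannot do this, so we need a custom partitioner that chooses the reducer for each record according to our own rules. Writing one is simple: define a class that extends Partitioner, override its getPartition method, and register it by calling setPartitionerClass on the Job.
Map output is distributed to the reducers by the partitioner. It may first pass through a Combiner for merging. The Combiner has no base class of its own in the framework; it uses Reducer as its base class. The two offer the same functionality and differ only in where, and in what context, they are used. The key-value pairs that the Mapper finally emits are assigned to reducers by the partitioner, for example: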

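// Partitioning: partition on the first field (name) of TextPair.
// getFirst() is the accessor used in the complete listing at the end of the article.
    public static class myFirstPartitioner extends Partitioner<TextPair, Text> {
        @Override
        public int getPartition(TextPair key, Text value, int numPartitions) {
            // Mask the sign bit rather than calling Math.abs(), which stays negative for Integer.MIN_VALUE
            return ((key.getFirst().hashCode() * 127) & Integer.MAX_VALUE) % numPartitions;
        }
    }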
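
Two details of partitioning on the first field are worth seeing in isolation: why the partition index should come from the name alone, and why masking the sign bit is safer than `Math.abs`. The following JDK-only sketch illustrates both. The class and method names are hypothetical, and the composite hash simply reuses the `name * 157 + date` formula from the TextPair shown earlier.

```java
/** JDK-only sketch of partition-index arithmetic for (name, date) keys. */
final class NamePartitionSketch {

    /** Index from the name only: every record of one name lands on the same reducer. */
    static int partitionByName(String name, int numPartitions) {
        return indexWithMask(name.hashCode(), numPartitions);
    }

    /** Index from the whole key: records of one name may spread over several reducers. */
    static int partitionByPair(String name, String date, int numPartitions) {
        int hash = name.hashCode() * 157 + date.hashCode();
        return indexWithMask(hash, numPartitions);
    }

    /** Clearing the sign bit guarantees a result in [0, numPartitions). */
    static int indexWithMask(int hash, int numPartitions) {
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    /** Math.abs(Integer.MIN_VALUE) is still negative, so this can return a negative index. */
    static int indexWithAbs(int hash, int numPartitions) {
        return Math.abs(hash) % numPartitions;
    }
}
```

If the whole key were hashed, the records for `harry` could reach different reducers, and no single reducer would see all of that name's dates in order. Partitioning on the name alone keeps them together.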

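The input to getPartition is a key-value pair from the map output, together with the number of partitions; the return value is the index of the reducer that receives the pair.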

// Grouping
    public static class FirstGroupingComparator extends WritableComparator {

        private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();

        public FirstGroupingComparator() {
            super();
        }

        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            try {
                /**
                 * name is serialized in Text format: a variable-length integer holding
                 * the byte length of the text, followed by the UTF-8 bytes themselves.
                 * decodeVIntSize returns the size of that integer and readVInt returns
                 * the byte length of the text; their sum is the serialized length of name.
                 */
                int nameL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
                int nameL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
                // Compare name only, so that keys are grouped by the first field
                int cmp = TEXT_COMPARATOR.compare(b1, s1, nameL1, b2, s2, nameL2);
                return cmp;
//              To group on both fields, continue with the second field:
//              if (cmp != 0) {
//                  return cmp;
//              }
//              return TEXT_COMPARATOR.compare(b1, s1 + nameL1, l1 - nameL1, b2, s2 + nameL2, l2 - nameL2);
            } catch (IOException e) {
                throw new IllegalArgumentException(e);
            }
        }
    }

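*Note: in the code above, grouping is by the name field of TextPair. The reducer's iterator therefore sees the records that share a name, with their dates in sorted order. If grouping used both fields instead, the iterator would see only records that share both the name and the date.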
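
The effect of grouping on the name alone can be simulated without Hadoop. The sketch below walks a list of keys that is already sorted by (name, date) and starts a new group whenever the name changes. `GroupingSketch` and its method are hypothetical names, and a key is modelled as a two-element `String[]` of {name, date}.

```java
import java.util.ArrayList;
import java.util.List;

/** JDK-only simulation of what a name-only grouping comparator does to sorted keys. */
final class GroupingSketch {

    /**
     * Takes keys sorted by (name, date) and returns, for each run of equal names,
     * the dates in the order a reducer's iterator would see them.
     */
    static List<List<String>> groupDatesByName(List<String[]> sortedKeys) {
        List<List<String>> groups = new ArrayList<>();
        String currentName = null;
        for (String[] key : sortedKeys) {
            if (!key[0].equals(currentName)) {
                // The name changed, so this key opens a new reduce() call.
                groups.add(new ArrayList<>());
                currentName = key[0];
            }
            groups.get(groups.size() - 1).add(key[1]);
        }
        return groups;
    }
}
```

With the sample data this produces three groups (harry, jack, jerry), each with its dates in ascending order. Comparing both fields in the group test would instead produce one group per distinct (name, date) pair.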

Complete Secondary Sort Code

The complete code consists of two source files: TextPair.java and mrMain.java.

package secondarySort;
// File: TextPair.java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.WritableUtils;
import org.apache.hadoop.mapreduce.Partitioner;

public class TextPair implements WritableComparable<TextPair> {
    private String first;
    private String second;

    public TextPair() {
    }

    public TextPair(String first, String second) {
        this.first = first;
        this.second = second;
    }

    // Getters and setter
    public String getFirst() {
        return first;
    }

    public String getSecond() {
        return second;
    }

    public void set(String first, String second) {
        this.first = first;
        this.second = second;
    }

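    // Serialize with Text.writeString/readString (vint length + UTF-8 bytes).
    // This is the same layout as Text, which the raw comparators below expect;
    // writeUTF/readUTF use a 2-byte length prefix that they would misread.
    @Override
    public void readFields(DataInput in) throws IOException {
        this.first = Text.readString(in);
        this.second = Text.readString(in);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        Text.writeString(out, first);
        Text.writeString(out, second);
    }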

    @Override
    public int compareTo(TextPair o) {
        if (!(this.first.equals(o.getFirst()))) {
            return this.first.compareTo(o.getFirst());
        }
        return this.second.compareTo(o.getSecond());
    }

    @Override
    public int hashCode() {
        return this.first.hashCode() * 163 + this.second.hashCode();
    }

    @Override
    public boolean equals(Object obj) {
        if (obj instanceof TextPair) {
            TextPair tmp = (TextPair) obj;
            return this.first.equals(tmp.getFirst()) && this.second.equals(tmp.getSecond());
        }
        return false;
    }

    @Override
    public String toString() {
        return this.first + "\t" + this.second;
    }

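    // ----- The code above is enough for a basic secondary sort, but comparing two keys
    // ----- requires deserializing them first. The raw comparator below compares the serialized bytes directly.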
    public static class myComparator extends WritableComparator {
        static final Text.Comparator TEXT_COMPARE = new Text.Comparator();

        public myComparator() {
            super(TextPair.class);
        }

        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            try {
                int segment1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
                int segment2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
                // Compare first
                int tmp1 = TEXT_COMPARE.compare(b1, s1, segment1, b2, s2, segment2);
                if (tmp1 != 0) {
                    return tmp1;
                }
                // Then compare second
                int tmp2 = TEXT_COMPARE.compare(b1, s1 + segment1, l1 - segment1, b2, s2 + segment2, l2 - segment2);
                return tmp2;
            } catch (IOException e) {
                throw new IllegalArgumentException(e);
            }
        }
    }

    static {// Register the default comparator for TextPair
        WritableComparator.define(TextPair.class, new myComparator());
    }

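    // Partitioning: by the first field only
    public static class myFirstPartitioner extends Partitioner<TextPair, Text> {

        @Override
        public int getPartition(TextPair key, Text value, int numPartitions) {
            // Mask the sign bit rather than calling Math.abs(), which stays negative for Integer.MIN_VALUE
            return ((key.getFirst().hashCode() * 127) & Integer.MAX_VALUE) % numPartitions;
        }
    }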

    // Grouping: by the first field only
    public static class FirstGroupingComparator extends WritableComparator {

        private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();

        public FirstGroupingComparator() {
            super();
        }

        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            try {
                /**
                 * first is serialized in Text format: a variable-length integer holding
                 * the byte length of the text, followed by the UTF-8 bytes themselves.
                 * decodeVIntSize returns the size of that integer and readVInt returns
                 * the byte length of the text; their sum is the serialized length of first.
                 */
                int nameL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
                int nameL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
                // Compare the first field only, so that keys are grouped by it
                int cmp = TEXT_COMPARATOR.compare(b1, s1, nameL1, b2, s2, nameL2);
                return cmp;
                // To group on both fields, continue with the second field:
                // if (cmp != 0) {
                // return cmp;
                // }
                // return TEXT_COMPARATOR.compare(b1, s1 + nameL1, l1 - nameL1,
                // b2, s2 + nameL2, l2 - nameL2);
            } catch (IOException e) {
                throw new IllegalArgumentException(e);
            }
        }
    }
}

The second file: mrMain.java

package secondarySort;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
/**
 * @description Secondary sort
 * @author JerryZhou
 */
public class mrMain {
    public static class myMapper extends Mapper<LongWritable,Text,TextPair,Text>{

        @Override
        protected void map(LongWritable key, Text value, Context context)throws IOException, InterruptedException {
            String line = value.toString();
            String[] fields = line.split("\t");//len =  6  0:time 1:time 2:lac 3:IMEI 4:lng 5:lat
            String outValue="";
            TextPair tp = new TextPair(fields[3],fields[0]);//IMEI time
            outValue = fields[2]+"\t"+fields[4]+"\t"+fields[5];//lac lng lat
            context.write(tp, new Text(outValue));
        }
    }
    public static class myReducer extends Reducer<TextPair,Text,Text,Text>{
        @Override
        protected void reduce(TextPair key, Iterable<Text> value,Context context)throws IOException, InterruptedException {
            // Hadoop reuses the key object while iterating, so key.getSecond()
            // always returns the time of the record currently being read.
            String startTime ="";
            String endTime = "";
            boolean flag = true;
            String lng_lat = ""; // lng lat of the current run
            String lac = null;   // lac of the current run
            for(Text t:value){
                String[] valueTmp = t.toString().split("\t");// lac lng lat
                String tmp = valueTmp[1]+"\t"+valueTmp[2];// lng lat
                if(flag){ // first record of the group
                    startTime = key.getSecond(); // time
                    flag = false;
                }else if(!lng_lat.equals(tmp)){
                    // The location changed: emit the run that just ended.
                    // IMEI LAC LNG LAT ST ET
                    context.write(new Text(key.getFirst()), new Text(lac+"\t"+lng_lat+"\t"+startTime+"\t"+endTime));
                    startTime = key.getSecond(); // must come after context.write
                }
                // These updates must come after context.write, which needs the previous values
                endTime = key.getSecond();
                lng_lat = tmp;
                lac = valueTmp[0];
            }
            if(!flag){ // emit the last run of the group
                context.write(new Text(key.getFirst()), new Text(lac+"\t"+lng_lat+"\t"+startTime+"\t"+endTime));
            }
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf,"SecondarySort");
        if(args.length!=2){
            System.err.println("Usage: <inPath> <outPath>");
            System.exit(2);
        }
        job.setJarByClass(mrMain.class);
        job.setMapperClass(myMapper.class);
        job.setReducerClass(myReducer.class);

        job.setPartitionerClass(TextPair.myFirstPartitioner.class);
        job.setGroupingComparatorClass(TextPair.FirstGroupingComparator.class);

        job.setMapOutputKeyClass(TextPair.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class); // the reducer emits Text keys
        job.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

}
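
The reducer relies on records arriving sorted by time within each IMEI, which is exactly what the secondary sort provides. The JDK-only sketch below shows the kind of run merging this enables, under the assumption that the goal is one output line per run of consecutive records at the same lng/lat. `StayMergeSketch` and the record layout {time, lac, lng, lat} are hypothetical choices for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** JDK-only sketch: merge time-ordered records of one device into runs per location. */
final class StayMergeSketch {

    /**
     * Input: records already sorted by time, each {time, lac, lng, lat}.
     * Output: one tab-separated line per run: lac, lng, lat, startTime, endTime.
     */
    static List<String> mergeStays(List<String[]> records) {
        List<String> out = new ArrayList<>();
        String start = null;
        String end = null;
        String lac = null;
        String place = null; // "lng\tlat" of the run in progress, or null if none
        for (String[] r : records) {
            String here = r[2] + "\t" + r[3];
            if (place != null && !place.equals(here)) {
                // The location changed: close the run that just ended.
                out.add(lac + "\t" + place + "\t" + start + "\t" + end);
                place = null;
            }
            if (place == null) {
                place = here;
                start = r[0];
            }
            lac = r[1];
            end = r[0];
        }
        if (place != null) {
            // Flush the final run, which no later record will close.
            out.add(lac + "\t" + place + "\t" + start + "\t" + end);
        }
        return out;
    }
}
```

If the input were not sorted by time, the runs would be fragmented and the start and end times meaningless. That is why the job sorts on (IMEI, time) while grouping on IMEI alone.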