A Simple Total-Order Sort Implementation in Hadoop

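This article describes how to implement a total-order sort over large data sets with Hadoop, covering the idea behind TeraSort, the sampling method, and the partitioning strategy. It applies to sorting integer and string keys.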

http://blog.youkuaiyun.com/yeruby/article/details/21233661

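My graduation project uses Hadoop's total-order sort to process large data. I have been working with Hadoop for two months and progress has been slow, which made me realize how much it matters to join a good team and do research together. This post commemorates the graduation project I did alone in my senior year. Without further ado: I implemented total-order sorting for integer and string keys.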

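Background:

1. The idea behind TeraSort:

There are many articles about TeraSort, but I could not find the classic original. For the general idea, see: http://hi.baidu.com/dt_zhangwei/item/c2a80032c7dbc5ff96f88dbf

My understanding: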

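(1) If the number of reducers is 1, the output is a single file (part-r-00000), and Hadoop guarantees that the output is already sorted.

In that case: if the key type is Text, the output is in lexicographic order;

if the key type is IntWritable, the output is in numeric order.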

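(2) If the number of reducers is greater than 1, each reducer's output is guaranteed to be sorted, but nothing is guaranteed about the order across different reducers. To get a total order we only need to ensure that the last item sent to reducer 0 is less than the first item sent to reducer 1, and so on, up to the last item sent to reducer n-2 being less than the first item sent to reducer n-1 (assuming job.setNumReduceTasks(n), i.e. n reduce tasks numbered 0 to n-1, and ascending order).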
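The condition in (2) can be checked mechanically. The following is a minimal sketch in plain Java, not part of the original project: the class name TotalOrderCheck and the sample arrays are hypothetical, and each int[] stands in for one reducer's already-sorted output file. It accepts equal keys at a boundary, which is slightly weaker than the strict "less than" used in the text.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: checks the cross-reducer ordering condition.
class TotalOrderCheck {

    // Returns true when, for every pair of adjacent non-empty partitions,
    // the last key of the earlier one is <= the first key of the later one.
    static boolean isTotalOrder(List<int[]> partitions) {
        Integer previousLast = null;
        for (int[] part : partitions) {
            if (part.length == 0) {
                continue; // an empty reducer output cannot break the order
            }
            if (previousLast != null && previousLast > part[0]) {
                return false;
            }
            previousLast = part[part.length - 1];
        }
        return true;
    }

    public static void main(String[] args) {
        // Keys routed by range: concatenating the outputs gives a sorted whole.
        List<int[]> ranged = Arrays.asList(
                new int[] {1, 3, 4}, new int[] {5, 8}, new int[] {9, 12});
        // Keys routed arbitrarily: each output is sorted, the whole is not.
        List<int[]> scattered = Arrays.asList(
                new int[] {3, 9, 12}, new int[] {1, 4}, new int[] {5, 8});
        System.out.println(isTotalOrder(ranged));    // true
        System.out.println(isTotalOrder(scattered)); // false
    }
}
```

The second case is what a default hash-style partitioner tends to produce: every file is sorted internally, yet the files cannot simply be concatenated.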

So how do we achieve this?

It takes two steps: sampling, and a Partitioner that tags each record (i.e. decides which reducer it is sent to).

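2. Sampling

Principle: sampling runs on the JobClient side. Its goal is to produce n-1 sorted sample keys (split points), which divide the key space into n ranges, one per reducer. During partitioning, comparing the key of the current key-value pair against these split points tells us which reducer the pair should be sent to.

So we need to write our own sampler class: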
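To make the principle concrete, here is a small worked sketch in plain Java with no Hadoop dependency. The class name SplitPointSketch, the method names, and the sample values are all hypothetical; plain int keys stand in for IntWritable. It picks n-1 evenly spaced split points from a sorted sample and routes a key by comparing it with them.

```java
import java.util.Arrays;

// Hypothetical illustration of split-point selection and key routing.
class SplitPointSketch {

    // Sorts a copy of the sample and picks numPartitions - 1 evenly spaced keys.
    static int[] choosePoints(int[] sample, int numPartitions) {
        if (numPartitions > sample.length) {
            throw new IllegalArgumentException("more partitions than sample keys");
        }
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        float step = sorted.length / (float) numPartitions; // spacing between picks
        int[] points = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            points[i - 1] = sorted[Math.round(step * i)];
        }
        return points;
    }

    // Keys <= points[i] go to reducer i; keys above every point go to the last reducer.
    static int route(int key, int[] points) {
        for (int i = 0; i < points.length; i++) {
            if (key <= points[i]) {
                return i;
            }
        }
        return points.length;
    }

    public static void main(String[] args) {
        int[] sample = {42, 7, 19, 88, 3, 61, 25, 70};
        int[] points = choosePoints(sample, 3);      // 3 reducers need 2 split points
        System.out.println(Arrays.toString(points)); // [25, 61]
        System.out.println(route(10, points));       // 0
        System.out.println(route(26, points));       // 1
        System.out.println(route(100, points));      // 2
    }
}
```

With 8 sample keys and 3 reducers the step is 8/3, so the picks land on sorted positions 3 and 5, giving the split points 25 and 61.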

    // Requires: java.util.ArrayList, org.apache.hadoop.io.IntWritable,
    // org.apache.hadoop.util.IndexedSortable, org.apache.hadoop.util.QuickSort
    static class TextSampler implements IndexedSortable {

        // All sampled keys
        public ArrayList<IntWritable> records = new ArrayList<IntWritable>();

        @Override
        public int compare(int arg0, int arg1) {
            IntWritable left = records.get(arg0);
            IntWritable right = records.get(arg1);
            return left.compareTo(right);
        }

        @Override
        public void swap(int arg0, int arg1) {
            IntWritable left = records.get(arg0);
            IntWritable right = records.get(arg1);
            records.set(arg0, right);
            records.set(arg1, left);
        }

        public void addKey(IntWritable key) {
            records.add(key);
        }

        public IntWritable[] createPartitions(int numPartitions) {
            int numRecords = records.size();
            if (numPartitions > numRecords) {
                throw new IllegalArgumentException("Requested more partitions than input keys ("
                        + numPartitions + " > " + numRecords + ")");
            }
            new QuickSort().sort(this, 0, records.size());
            float stepSize = numRecords / (float) numPartitions; // spacing between picks
            IntWritable[] result = new IntWritable[numPartitions - 1];
            for (int i = 1; i < numPartitions; ++i) {
                // Pick n-1 split points out of all the sampled keys
                result[i - 1] = records.get(Math.round(stepSize * i));
            }
            return result;
        }
    }

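Note: the class implements IndexedSortable, the abstract interface Hadoop defines for sortable data sets. Any data set that can be sorted must implement two methods: one that compares any two items in the data set, and one that swaps the positions of any two items. Once the interface is implemented, we can sort with Hadoop's predefined quicksort, as above: new QuickSort().sort(this, 0, records.size());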
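The point of the two-method contract is that a sorter never needs to see the elements themselves. The sketch below shows this in plain Java. The Sortable interface here is a hypothetical stand-in declared locally, not Hadoop's IndexedSortable, and a simple insertion sort replaces Hadoop's QuickSort to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of sorting through compare/swap only.
class IndexSortSketch {

    // Local stand-in for the two operations described in the text.
    interface Sortable {
        int compare(int i, int j);
        void swap(int i, int j);
    }

    // Insertion sort over [from, to) that touches the data only via compare and swap.
    static void sort(Sortable s, int from, int to) {
        for (int i = from + 1; i < to; i++) {
            for (int j = i; j > from && s.compare(j - 1, j) > 0; j--) {
                s.swap(j - 1, j);
            }
        }
    }

    // Sorts a copy of the input by wrapping it in a Sortable.
    static List<Integer> sortedCopy(List<Integer> input) {
        final List<Integer> data = new ArrayList<>(input);
        sort(new Sortable() {
            public int compare(int i, int j) {
                return data.get(i).compareTo(data.get(j));
            }
            public void swap(int i, int j) {
                Integer tmp = data.get(i);
                data.set(i, data.get(j));
                data.set(j, tmp);
            }
        }, 0, data.size());
        return data;
    }

    public static void main(String[] args) {
        System.out.println(sortedCopy(Arrays.asList(5, 2, 9, 1))); // [1, 2, 5, 9]
    }
}
```

Because the sorter works on indices, the same routine could sort any backing store for which the two operations can be written.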

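So where do the samples come from?

We need to get them from the input splits. The n-1 sample keys must be available before the Job starts, so we need control over how the input is read, which means writing a custom class that implements the InputFormat interface. InputFormat does two things:

(1) InputSplit[] getSplits(JobConf job, int numSplits) throws IOException; computes the splits

(2) RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException; processes each split, producing key-value pairs from its data

Splitting does not need to be overridden. What we need is a custom class that implements the RecordReader interface.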

    // Requires: java.io.IOException, org.apache.hadoop.conf.Configuration,
    // org.apache.hadoop.io.{IntWritable, LongWritable, Text},
    // org.apache.hadoop.mapred.{FileSplit, LineRecordReader, RecordReader}
    static class TeraRecordReader implements RecordReader<IntWritable, Text> {

        private LineRecordReader in;
        private LongWritable junk = new LongWritable(); // line offset, discarded
        private Text line = new Text();                 // raw line content

        public TeraRecordReader(Configuration job, FileSplit split) throws IOException {
            in = new LineRecordReader(job, split);
        }

        @Override
        public void close() throws IOException {
            in.close();
        }

        @Override
        public IntWritable createKey() {
            return new IntWritable();
        }

        @Override
        public Text createValue() {
            return new Text();
        }

        @Override
        public long getPos() throws IOException {
            return in.getPos();
        }

        @Override
        public float getProgress() throws IOException {
            return in.getProgress();
        }

        @Override
        public boolean next(IntWritable key, Text value) throws IOException {
            if (in.next(junk, line)) {
                // The line content becomes the key; the value is left empty.
                key.set(Integer.parseInt(line.toString().trim()));
                value.clear();
                return true;
            }
            return false;
        }
    } // end RecordReader

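By default, each line of each split yields a key-value pair of the form <Key = starting offset of the line: LongWritable, Value = content of the line: Text>. We need to turn it into the form we want, <Key = content of the line: IntWritable, Value = empty string: Text>, which is why next is overridden as above.

At this point we can read key-value pairs in the desired format from the RecordReader. Next we pick, from the data read, the records to use as samples: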

    public static void writePartitionFile(JobConf conf, Path partFile) throws IOException {
        SamplerInputFormat inputFormat = new SamplerInputFormat();
        TextSampler sampler = new TextSampler();
        int partitions = conf.getNumReduceTasks();        // number of reduce tasks
        long sampleSize = conf.getLong(SAMPLE_SIZE, 100); // total number of keys to sample
        InputSplit[] splits = inputFormat.getSplits(conf, conf.getNumMapTasks()); // input splits
        int samples = Math.min(10, splits.length);        // number of splits to sample from
        long recordsPerSample = sampleSize / samples;     // keys to take from each sampled split
        int sampleStep = splits.length / samples;         // distance between sampled splits
        long records = 0;
        IntWritable key = new IntWritable();
        Text value = new Text();
        for (int i = 0; i < samples; i++) {
            // Build a record reader for this particular split
            RecordReader<IntWritable, Text> reader =
                    inputFormat.getRecordReader(splits[sampleStep * i], conf, null);
            while (reader.next(key, value)) {
                sampler.addKey(key);
                // The sampler keeps a reference, so hand the reader fresh objects
                key = new IntWritable();
                value = new Text();
                records += 1;
                if ((i + 1) * recordsPerSample <= records) {
                    break;
                }
            }
            reader.close();
        }
        FileSystem outFs = partFile.getFileSystem(conf);
        if (outFs.exists(partFile)) {
            outFs.delete(partFile, false);
        }
        SequenceFile.Writer writer = SequenceFile.createWriter(outFs, conf, partFile,
                IntWritable.class, NullWritable.class);
        NullWritable nullValue = NullWritable.get();
        for (IntWritable split : sampler.createPartitions(partitions)) {
            writer.append(split, nullValue);
        }
        writer.close();
    }

As shown above, the writer stores the n-1 split points in a temporary sample file. The Job can now be started.
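The split-selection arithmetic in writePartitionFile is easier to follow with numbers plugged in. The sketch below reproduces only that arithmetic in plain Java. The class name SamplePlanSketch, its method names, and the figures of 37 splits and 100 sample keys are hypothetical choices for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of which splits get sampled and how many keys each contributes.
class SamplePlanSketch {

    // Indices of the splits that would be read: at most 10, evenly spaced.
    static List<Integer> sampledSplits(int numSplits) {
        int samples = Math.min(10, numSplits);
        List<Integer> indices = new ArrayList<>();
        if (samples == 0) {
            return indices; // nothing to sample from
        }
        int sampleStep = numSplits / samples;
        for (int i = 0; i < samples; i++) {
            indices.add(sampleStep * i);
        }
        return indices;
    }

    // Cumulative number of keys after which reading of the i-th sampled split stops.
    static long quotaAfterSplit(long sampleSize, int samples, int i) {
        long recordsPerSample = sampleSize / samples;
        return (i + 1) * recordsPerSample;
    }

    public static void main(String[] args) {
        // 37 splits: 10 of them are sampled, every 3rd one starting at 0.
        System.out.println(sampledSplits(37)); // [0, 3, 6, 9, 12, 15, 18, 21, 24, 27]
        // With 100 keys wanted, each sampled split contributes up to 10.
        System.out.println(quotaAfterSplit(100, 10, 0)); // 10
        System.out.println(quotaAfterSplit(100, 10, 9)); // 100
    }
}
```

Because the quota is cumulative, a split that runs out of records early leaves its unused share to the splits sampled after it.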

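3. Partitioning: tagging each record (i.e. deciding which reducer processes it)

In the map-reduce flow, the partitioner is responsible for telling each record which reducer it belongs to. Here the destination of each record is decided from the temporary sample file written above, so we need a custom class that implements the Partitioner interface: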

    // Custom Partitioner
    // Requires: java.io.IOException, java.net.URI, java.util.ArrayList,
    // org.apache.hadoop.filecache.DistributedCache, org.apache.hadoop.fs.{FileSystem, Path},
    // org.apache.hadoop.io.{IntWritable, NullWritable, SequenceFile},
    // org.apache.hadoop.mapred.{JobConf, Partitioner}
    public static class TotalOrderPartitioner implements Partitioner<IntWritable, NullWritable> {

        private IntWritable[] splitPoints;

        public TotalOrderPartitioner() {
        }

        @Override
        public int getPartition(IntWritable key, NullWritable value, int numReduceTasks) {
            return findPartition(key);
        }

        @Override
        public void configure(JobConf conf) {
            try {
                FileSystem fs = FileSystem.get(conf);
                splitPoints = readPartitions(fs, conf); // read the sampled split points
            } catch (IOException ie) {
                throw new IllegalArgumentException("can't read partitions file", ie);
            }
        }

        // Locate the partition by interval: the index of the first split point
        // that is >= key, or the last partition if the key exceeds all of them.
        public int findPartition(IntWritable key) {
            int len = splitPoints.length;
            for (int i = 0; i < len; i++) {
                if (key.compareTo(splitPoints[i]) <= 0) {
                    return i;
                }
            }
            return len;
        }

        // The sample file is read from the first DistributedCache entry.
        private static IntWritable[] readPartitions(FileSystem fs, JobConf job) throws IOException {
            URI[] uris = DistributedCache.getCacheFiles(job);
            if (uris == null || uris.length == 0) {
                throw new IOException("partition file not found in DistributedCache");
            }
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(uris[0]), job);
            ArrayList<IntWritable> parts = new ArrayList<IntWritable>();
            IntWritable key = new IntWritable();
            NullWritable value = NullWritable.get();
            while (reader.next(key, value)) {
                parts.add(key);
                key = new IntWritable(); // fresh object; the reader fills the one passed in
            }
            reader.close();
            return parts.toArray(new IntWritable[parts.size()]);
        }
    }
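The partitioner above scans the split points linearly, which is fine for a handful of reducers. Since the split points are sorted, a binary search gives the same answer in logarithmic time. The sketch below is a possible alternative, not the author's code: the class name PartitionLookupSketch and both method names are hypothetical, and plain int keys stand in for IntWritable.

```java
// Hypothetical illustration: linear scan versus binary search over sorted split points.
class PartitionLookupSketch {

    // Reference behaviour: index of the first split point >= key, else the last partition.
    static int linear(int key, int[] splitPoints) {
        for (int i = 0; i < splitPoints.length; i++) {
            if (key <= splitPoints[i]) {
                return i;
            }
        }
        return splitPoints.length;
    }

    // Same result via a lower-bound binary search; also correct with repeated split points.
    static int binary(int key, int[] splitPoints) {
        int lo = 0;
        int hi = splitPoints.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (splitPoints[mid] < key) {
                lo = mid + 1; // the answer lies to the right of mid
            } else {
                hi = mid;     // mid is a candidate; keep looking left
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        int[] splitPoints = {25, 61};
        System.out.println(binary(10, splitPoints));  // 0
        System.out.println(binary(25, splitPoints));  // 0
        System.out.println(binary(26, splitPoints));  // 1
        System.out.println(binary(100, splitPoints)); // 2
    }
}
```

A hand-written lower bound is used here instead of Arrays.binarySearch because the latter does not specify which index it returns when the array contains equal elements, and a skewed sample can produce repeated split points.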