MapReduce Programming: Secondary Sort

碣石观海

License: CC 4.0 BY-SA

Category: MapReduce

Article link: https://blog.youkuaiyun.com/weixin_39469127/article/details/89436524

------------These notes are compiled from 《Hadoop海量数据处理：技术详解与项目实战》 by 范东来

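I. Secondary Sort

Secondary sort means sorting by one column first and then, among rows that share the same value in that column, sorting by another column.

(See the data in the table below.)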

--Input data:      --Result 1:        --Result 2:        --Result 3:         --Result 4:
4 2                0 5                0 5                55 5                55 5 (partition 3)
4 3                1 5                1 5                11 2                11 2
4 1                1 7                1 7                10 4                10 4
3 4                2 7                2 3                7 6                 7 6
2 7                2 3                2 7                6 8                 6 8
2 3                3 4                3 4                6 4                 6 4
1 7                3 9                3 5                6 3                 6 3
3 9                3 5                3 9                4 3                 4 3 (partition 2)
3 5                4 2                4 1                4 2                 4 2
1 5                4 1                4 2                4 1                 4 1
0 5                4 3                4 3                3 9                 3 9
6 4                6 3                6 3                3 5                 3 5
6 8                6 4                6 4                3 4                 3 4
6 3                6 8                6 8                2 7                 2 7 (partition 1)
7 6                7 6                7 6                2 3                 2 3
10 4               10 4               10 4               1 7                 1 7
11 2               11 2               11 2               1 5                 1 5
55 5               55 5               55 5               0 5                 0 5

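1. Result 1: only the first column is sorted. The purpose is to test whether the MapReduce framework sorts by key or by value by default. The mapper emits the first column as the key and the second column as the value, and the reducer writes them out unchanged. The result shows that MapReduce sorts only by key by default.

2. Result 2: both the first and the second column are sorted (ascending). The mapper combines the two columns of each line into the key and sets the value to null, and the reducer writes the keys out unchanged. The result shows that, by default, MapReduce sorts keys from small to large. For Text keys the comparison is by character (ASCII) code, not by numeric value, so with this implementation a multi-digit key such as "10 4" is ordered before "2 3"; the numeric order shown in the table for the multi-digit rows requires comparing the fields as integers.

3. Result 3: both the first and the second column are sorted (descending). The mapper combines the two columns of each line into the key and sets the value to null; the keys then go through a custom descending sort rule (first column descending, then second column descending), and the reducer writes them out unchanged. (Up to this point, all three implementations use a single Reducer task.)

4. Result 4: both the first and the second column are sorted (descending within each range), which in effect is a global sort across several ranges. The mapper combines the two columns of each line into the key and sets the value to null; the keys then go through a custom partitioner (three ranges, 0~2, 3~5 and 6 and above, each sorted separately; the partition must not be computed as a remainder modulo numReduceTasks) and the custom descending sort rule (first column descending, then second column descending). Each partition is handled by its own reducer, which writes the keys out unchanged (three Reducer tasks, three sorted output files).

5. Multi-range sorting is generally more efficient than single-range sorting, because each reducer sorts only its own range and the reducers run in parallel.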
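
The difference between character-code order and numeric order mentioned in item 2 can be illustrated without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: the class name KeyOrderDemo and its helpers are hypothetical, and String's natural order is used as a stand-in for the byte-wise comparison of Text keys, which is assumed to agree with it for ASCII input like the sample data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class KeyOrderDemo {

    // Compares "first second" keys numerically: first field ascending, then second.
    static final Comparator<String> NUMERIC_ASC = (a, b) -> {
        String[] x = a.split(" ");
        String[] y = b.split(" ");
        int first = Integer.compare(Integer.parseInt(x[0]), Integer.parseInt(y[0]));
        return first != 0 ? first
                : Integer.compare(Integer.parseInt(x[1]), Integer.parseInt(y[1]));
    };

    // Returns a sorted copy so the input list is left untouched.
    static List<String> sorted(List<String> rows, Comparator<String> order) {
        List<String> copy = new ArrayList<>(rows);
        copy.sort(order);
        return copy;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("2 3", "10 4", "4 1", "55 5", "6 3");
        // Character-code order: "10 4" comes before "2 3" because '1' < '2'.
        System.out.println(sorted(rows, Comparator.naturalOrder()));
        // Numeric order on both fields.
        System.out.println(sorted(rows, NUMERIC_ASC));
    }
}
```

With single-digit first fields the two orders coincide, which is why the distinction only becomes visible for rows such as "10 4", "11 2" and "55 5".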

II. Code for the First Sort

1. SortMapper class

package com.hadoop.mr.sort;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SortMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable>{

	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		
		String[] infos = value.toString().split(" ");
		// Use the first column of each line as the key and the second column as the value
		context.write(new IntWritable(Integer.parseInt(infos[0])), new IntWritable(Integer.parseInt(infos[1])));
		
	}
}

2. SortReducer class

package com.hadoop.mr.sort;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class SortReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable>{

	@Override
	protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
			throws IOException, InterruptedException {
		
		for (IntWritable value : values) {
			context.write(key, value);
		}
		
	}
}

3. Driver class

package com.hadoop.mr.sort;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/*
 * MR_Sort driver class
 */
public class MR_Sort {

	public static void main(String[] args) throws IOException,
		InterruptedException, ClassNotFoundException {
	
		// Load the Hadoop configuration
		Configuration conf = new Configuration();
		
		// Initialize the job
		Job job = Job.getInstance(conf, "MR_Sort");
		job.setJarByClass(MR_Sort.class);
		// Set the Mapper and Reducer classes
		job.setMapperClass(SortMapper.class);
		job.setReducerClass(SortReducer.class);
		// Set the output key and value types
		job.setOutputKeyClass(IntWritable.class);
		job.setOutputValueClass(IntWritable.class);
		// Set the number of reduce tasks (which is also the number of partitions)
		job.setNumReduceTasks(1);
		// Set the input and output paths
		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		
		// Submit the job and wait for it to finish
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}
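
Why this job produces Result 1, with keys ordered but values not, can be sketched locally. The following is a plain-Java illustration, not Hadoop code: the class KeyOnlySortDemo is hypothetical, and a TreeMap stands in for the shuffle phase. Here values keep their arrival order; in a real job the order of values within a key is not guaranteed at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class KeyOnlySortDemo {

    // Groups values by key. The TreeMap keeps keys in ascending order,
    // while the values of each key stay in the order they were added.
    static List<String> shuffleAndReduce(int[][] pairs) {
        Map<Integer, List<Integer>> grouped = new TreeMap<>();
        for (int[] p : pairs) {
            grouped.computeIfAbsent(p[0], k -> new ArrayList<>()).add(p[1]);
        }
        // Like SortReducer: write every (key, value) pair out unchanged.
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, List<Integer>> e : grouped.entrySet()) {
            for (int v : e.getValue()) {
                out.add(e.getKey() + " " + v);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] pairs = { {4, 2}, {4, 3}, {4, 1}, {3, 4}, {2, 7}, {2, 3} };
        System.out.println(shuffleAndReduce(pairs));
    }
}
```

The keys come out as 2, 3, 4, but within key 4 the values remain 2, 3, 1, which is the motivation for moving the second column into the key in the following sections.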

III. Code for the Second Sort

1. SecondarySortMapper class

package com.hadoop.mr.secondsort;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/*
 * Mapper class for the secondary sort
 */
public class SecondarySortMapper extends Mapper<LongWritable, Text, Text, NullWritable>{

	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		
		// MapReduce sorts only by key by default, so the fields to sort go into the key and the value is set to null
		context.write(value, NullWritable.get());
		
	}
	
}

2. SecondarySortReducer class

package com.hadoop.mr.secondsort;

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/*
 * Reducer class for the secondary sort
 */
public class SecondarySortReducer extends Reducer<Text, NullWritable, Text, NullWritable>{

	@Override
	protected void reduce(Text key, Iterable<NullWritable> values, Context context) 
			throws IOException, InterruptedException {
		
		context.write(key, NullWritable.get());
		
	}
	
}

3. Driver class

package com.hadoop.mr.secondsort;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;


/*
 * MR_SecondarySort driver class for the secondary sort
 */
public class MR_SecondarySort {

	public static void main(String[] args) throws IOException,
		InterruptedException, ClassNotFoundException {
	
		// Load the Hadoop configuration
		Configuration conf = new Configuration();
		
		// Initialize the job
		Job job = Job.getInstance(conf, "MR_SecondarySort");
		job.setJarByClass(MR_SecondarySort.class);
		// Set the Mapper and Reducer classes
		job.setMapperClass(SecondarySortMapper.class);
		job.setReducerClass(SecondarySortReducer.class);
		// Set the output key and value types
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(NullWritable.class);
			
		// Use a single reduce task (one partition) so that the output is globally ordered
		job.setNumReduceTasks(1);
		
		// Set the input and output paths
		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		
		// Submit the job and wait for it to finish
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
	
}

IV. Code for the Third Sort

1. Starting from the code for the second sort, add a custom sort rule, the SortComparator class (descending order)

package com.hadoop.mr.secondsort;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/*
 * MapReduce sort rule: descending order
 */
public class SortComparator extends WritableComparator{
	
	protected SortComparator() {
		super(Text.class, true);
	}
	
	@Override
	public int compare(WritableComparable key1, WritableComparable key2) {
		
		// Parse the first and second fields of key1 and key2
		String[] key1_arr = key1.toString().split(" ");
		int key1_1 = Integer.parseInt(key1_arr[0]);
		int key1_2 = Integer.parseInt(key1_arr[1]);
		
		String[] key2_arr = key2.toString().split(" ");
		int key2_1 = Integer.parseInt(key2_arr[0]);
		int key2_2 = Integer.parseInt(key2_arr[1]);
		
		// If the first fields differ, order by the first field (larger first)
		if (key1_1 != key2_1) {
			return key1_1 > key2_1 ? -1 : 1;
		}
		// If the first fields are equal, order by the second field (larger first)
		else {
			return key1_2 > key2_2 ? -1 : (key1_2 == key2_2 ? 0 : 1);
		}
		
	}
	
}

2. Also register the sort rule in the driver class

// Set the sort rule (descending); the default is ascending, which needs no custom comparator
job.setSortComparatorClass(SortComparator.class);
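
The descending rule can be tried out locally before running it in a job. The following is a minimal plain-Java sketch of the same comparison logic applied to strings; the class DescendingOrderDemo and its methods are hypothetical and are not part of the Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DescendingOrderDemo {

    // Same rule as SortComparator: first field descending,
    // then second field descending. Keys are "first second" strings.
    static int compareDesc(String key1, String key2) {
        String[] a = key1.split(" ");
        String[] b = key2.split(" ");
        // Swapping the arguments of Integer.compare reverses the order.
        int first = Integer.compare(Integer.parseInt(b[0]), Integer.parseInt(a[0]));
        if (first != 0) {
            return first;
        }
        return Integer.compare(Integer.parseInt(b[1]), Integer.parseInt(a[1]));
    }

    // Returns a copy of the rows sorted with the descending rule.
    static List<String> sortDesc(List<String> rows) {
        List<String> copy = new ArrayList<>(rows);
        copy.sort(DescendingOrderDemo::compareDesc);
        return copy;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("4 2", "4 3", "4 1", "55 5", "10 4", "6 8");
        System.out.println(sortDesc(rows));
    }
}
```

Because the fields are parsed as integers, "55 5" and "10 4" are placed ahead of "6 8", which a character-by-character comparison would not do.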

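V. Code for the Fourth Sort

1. Starting from the code for the third sort, add a custom partitioning rule, the KeyPartitioner class (three ranges, 0~2, 3~5 and 6 and above, each sorted separately)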

package com.hadoop.mr.secondsort;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/*
 * Partitions the keys into three ranges
 */
public class KeyPartitioner extends Partitioner<Text, NullWritable>{
	
	@Override
	public int getPartition(Text key, NullWritable value, int numReduceTasks) {

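		// Note: do not compute the partition as a remainder modulo numReduceTasks here;
		// that would cycle keys through the partitions and break the order between them.
		// Get the first sort field of the key
		int key_1 = Integer.parseInt(key.toString().split(" ")[0]);
		if (key_1 / 3 < numReduceTasks)
			return key_1 / 3;
		else 
			return numReduceTasks - 1;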
		
	}
}

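2. Also set the number of Reducer tasks and the custom partitioner in the driver class

// Use 3 reduce tasks so that each range is sorted separately (the keys must then be split into three partitions)
job.setNumReduceTasks(3);
		
// Set the partitioner
job.setPartitionerClass(KeyPartitioner.class);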
job.setPartitionerClass(KeyPartitioner.class);
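
The warning against the modulo operation can be made concrete with a small comparison. The following is a plain-Java sketch under the assumption that the first field is non-negative, as in the sample data; the class RangePartitionDemo and its two methods are hypothetical, with rangePartition mirroring the arithmetic of KeyPartitioner.

```java
class RangePartitionDemo {

    // Range partitioning as in KeyPartitioner: 0~2 -> 0, 3~5 -> 1, 6 and above -> last partition.
    // Assumes firstField >= 0.
    static int rangePartition(int firstField, int numReduceTasks) {
        return Math.min(firstField / 3, numReduceTasks - 1);
    }

    // Modulo partitioning, shown only for contrast: neighbouring keys land in different partitions.
    static int moduloPartition(int firstField, int numReduceTasks) {
        return firstField % numReduceTasks;
    }

    public static void main(String[] args) {
        int[] firstFields = {0, 1, 2, 3, 4, 6, 7, 10, 11, 55};
        for (int f : firstFields) {
            System.out.println(f + " -> range partition " + rangePartition(f, 3)
                    + ", modulo partition " + moduloPartition(f, 3));
        }
    }
}
```

With range partitioning every key in partition 0 is smaller than every key in partition 1, and so on, so the output files can simply be read in partition order. With the modulo variant, keys 3, 4 and 5 end up in three different partitions, and no ordering between the files remains.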

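VI. Package the JAR and Run the Program

1. Right-click the com.hadoop.mr.secondsort package and export it as a JAR file named "MR_SecondarySort.jar".
2. Use the Windows cmd or PowerShell (recommended) to upload the JAR file to the Linux server
  Command (run in the directory that contains the JAR file):
  > scp MR_SecondarySort.jar root@remoteIP:~/myJars/mapreduce/
 (where remoteIP is the IP address of the remote server)
3. Start Hadoop
  --Create the input file to be sorted
  > cd ~/myJars/mapreduce/
  > touch secondarysortdata
  > vi secondarysortdata
  Press "i" to enter insert mode and type the following content into the secondarysortdata file: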
  4 2
  4 3
  4 1
  3 4
  2 7
  2 3
  1 7
  3 9
  3 5
  1 5
  0 5
  6 4
  6 8
  6 3
  7 6
  10 4
  11 2
  55 5
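  Press "ESC", then "Shift+q", type "wq!" and press Enter to save
  --View the file
  > cat secondarysortdata
  --Create the input directory in HDFS
  > hadoop fs -mkdir /user/hadoop/secondinput
  --List the input directory in HDFS
  > hadoop fs -ls /user/hadoop/secondinput
  --Copy the local file to the HDFS input directory (run in "~/myJars/mapreduce/")
  --Several files can be transferred in one command
  > hadoop fs -copyFromLocal secondarysortdata /user/hadoop/secondinput/
4. Execute the JAR to run the program
  Command (run in the JAR file directory "~/myJars/mapreduce/"):
  > hadoop jar MR_SecondarySort.jar com.hadoop.mr.secondsort.MR_SecondarySort /user/hadoop/secondinput /user/hadoop/secondoutput
  While the job runs, its progress is printed to the screen until it completes
5. View the secondary sort results
  --5.1. With 1 Reducer task:
  After the job completes successfully, the directory "/user/hadoop/secondoutput/" contains two files
  /user/hadoop/secondoutput/_SUCCESS    --empty marker file indicating successful completion
  /user/hadoop/secondoutput/part-r-00000 --job output file
  --View the output file
  > hadoop fs -cat /user/hadoop/secondoutput/part-r-00000

  --5.2. With 3 Reducer tasks:
  After the job completes successfully, the directory "/user/hadoop/secondoutput/" contains four files
  /user/hadoop/secondoutput/_SUCCESS    --empty marker file indicating successful completion
  /user/hadoop/secondoutput/part-r-00000 --job output file (partition 1)
  /user/hadoop/secondoutput/part-r-00001 --job output file (partition 2)
  /user/hadoop/secondoutput/part-r-00002 --job output file (partition 3)
  --View the output files (the ranges are also ordered relative to each other)
  > hadoop fs -cat /user/hadoop/secondoutput/part-r-00002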
  55 5
  11 2
  10 4
  7 6
  6 8
  6 4
  6 3
  > hadoop fs -cat /user/hadoop/secondoutput/part-r-00001
  4 3
  4 2
  4 1
  3 9
  3 5
  3 4
  > hadoop fs -cat /user/hadoop/secondoutput/part-r-00000
  2 7
  2 3
  1 7
  1 5
  0 5
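
That the three files together form one globally sorted result can be checked with a few lines of plain Java. The following is a sketch only: the class GlobalOrderCheck and its helpers are hypothetical, and the three lists in main are copied by hand from the outputs shown above rather than read from HDFS.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GlobalOrderCheck {

    // True if the rows are in descending order by first field, then by second field.
    static boolean isDescending(List<String> rows) {
        for (int i = 1; i < rows.size(); i++) {
            String[] prev = rows.get(i - 1).split(" ");
            String[] cur = rows.get(i).split(" ");
            int p1 = Integer.parseInt(prev[0]);
            int c1 = Integer.parseInt(cur[0]);
            int p2 = Integer.parseInt(prev[1]);
            int c2 = Integer.parseInt(cur[1]);
            if (p1 < c1 || (p1 == c1 && p2 < c2)) {
                return false;
            }
        }
        return true;
    }

    // Concatenates the partition outputs from the highest partition index to the lowest.
    static List<String> concatHighToLow(List<List<String>> partitions) {
        List<String> merged = new ArrayList<>();
        for (int i = partitions.size() - 1; i >= 0; i--) {
            merged.addAll(partitions.get(i));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> part0 = Arrays.asList("2 7", "2 3", "1 7", "1 5", "0 5");
        List<String> part1 = Arrays.asList("4 3", "4 2", "4 1", "3 9", "3 5", "3 4");
        List<String> part2 = Arrays.asList("55 5", "11 2", "10 4", "7 6", "6 8", "6 4", "6 3");
        List<String> merged = concatHighToLow(Arrays.asList(part0, part1, part2));
        System.out.println(isDescending(merged));
    }
}
```

Reading part-r-00002, then part-r-00001, then part-r-00000 reproduces the single-reducer order of Result 3, which is what the range partitioner is designed to guarantee.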