Example: sort the temperatures recorded in the 1901 and 1902 files, with reduceNum=2
part-r-00000
1901 ...
1901 ...
part-r-00001
1902 ...
1902 ...
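import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AirMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Convert the value to a String so the year and temperature can be sliced out
        String line = value.toString();
        // Extract the year
        String year = line.substring(15, 19);
        // Temperature; skip a leading '+' because older Integer.parseInt versions reject it
        int temp;
        if (line.charAt(87) == '+') {
            temp = Integer.parseInt(line.substring(88, 92));
        } else {
            temp = Integer.parseInt(line.substring(87, 92));
        }
        // Quality code of the reading
        String code = line.substring(92, 93);
        // 9999 marks a missing reading; keep only readings with an accepted quality code
        if (temp != 9999 && code.matches("[01459]")) {
            // Wrap the year in the output value type
            Text yearT = new Text(year);
            // Wrap the temperature in the output key type
            IntWritable temperature = new IntWritable(temp);
            // Emit (temperature, year): the shuffle sorts by key, i.e. by temperature
            context.write(temperature, yearT);
        }
    }
}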
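import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AirReducer extends Reducer<IntWritable, Text, Text, IntWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<Text> vs,
            Context context) throws IOException, InterruptedException {
        // Keys arrive sorted by temperature; swap back to (year, temperature) on output
        for (Text n : vs) {
            context.write(n, key);
        }
    }
}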
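// Custom partitioner
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class AirPartitioner extends Partitioner<IntWritable, Text> {
    @Override
    public int getPartition(IntWritable key, Text value, int numPartitions) {
        // Partition by the year (the value), not by the temperature key,
        // so that all records of one year go to the same reducer
        return (value.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}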
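The expression in getPartition() masks the sign bit before taking the remainder. A minimal sketch of why, using plain int hashes instead of Hadoop's Text (the class name PartitionSketch and the sample hash -7 are made up for illustration):

```java
class PartitionSketch {
    // Same arithmetic as AirPartitioner.getPartition, applied to a plain int hash.
    static int partitionFor(int hash, int numPartitions) {
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int negativeHash = -7; // hypothetical hash value
        // A plain remainder keeps the sign of the dividend, so it can be negative,
        // which is not a valid partition index.
        System.out.println("plain remainder: " + (negativeHash % 2));
        // Clearing the sign bit first always gives an index in [0, numPartitions).
        System.out.println("masked remainder: " + partitionFor(negativeHash, 2));
    }
}
```

Note that Text.hashCode() is computed over the encoded bytes, so its values differ from String.hashCode(); only the masking arithmetic is shown here.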
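// Driver class
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AirDriver {
    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "air sort");
        /* If the output directory air1 already exists, delete it (recursively) */
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("D:/air1");
        if (fs.exists(path)) {
            fs.delete(path, true);
        }
        job.setJarByClass(AirDriver.class);
        job.setMapperClass(AirMapper.class);
        job.setReducerClass(AirReducer.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(2); // run 2 reduce tasks
        job.setPartitionerClass(AirPartitioner.class);
        FileInputFormat.addInputPath(job, new Path("file:///D:/data/19*"));
        FileOutputFormat.setOutputPath(job, new Path("file:///D:/air1"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}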
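[Combiner function]: an optimization, usable only when it does not change the final result.
It can be used for sums, maximums, minimums, and similar aggregations.
It cannot be used as-is for operations such as averages, because the average of partial averages is generally not the overall average.
Purpose:
1. Makes the map output more compact
2. Reduces disk I/O
3. Reduces network I/O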
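The difference between max and average can be checked with a few numbers. The sketch below is plain Java with no Hadoop dependency; the two lists stand for the outputs of two hypothetical map tasks for the same key, and all names and values are made up for illustration. It also shows one common workaround for averages: combine (sum, count) pairs and divide only at the end.

```java
import java.util.Arrays;
import java.util.List;

class CombinerSafetySketch {
    static int max(List<Integer> xs) {
        int m = Integer.MIN_VALUE;
        for (int x : xs) m = Math.max(m, x);
        return m;
    }

    static double mean(List<Integer> xs) {
        long sum = 0;
        for (int x : xs) sum += x;
        return (double) sum / xs.size();
    }

    // Partial result a combiner could safely emit for an average: {sum, count}.
    static long[] sumCount(List<Integer> xs) {
        long sum = 0;
        for (int x : xs) sum += x;
        return new long[] {sum, xs.size()};
    }

    // Reducer side: add up the partial sums and counts, then divide once.
    static double meanFromPartials(List<long[]> partials) {
        long sum = 0, count = 0;
        for (long[] p : partials) { sum += p[0]; count += p[1]; }
        return (double) sum / count;
    }

    public static void main(String[] args) {
        List<Integer> mapA = Arrays.asList(0, 20, 10); // hypothetical output of map task A
        List<Integer> mapB = Arrays.asList(25, 15);    // hypothetical output of map task B
        List<Integer> all = Arrays.asList(0, 20, 10, 25, 15);

        // max of partial maxes equals the overall max, so max is combiner-safe
        System.out.println(max(Arrays.asList(max(mapA), max(mapB))) == max(all));
        // mean of partial means differs from the overall mean
        System.out.println((mean(mapA) + mean(mapB)) / 2 + " vs " + mean(all));
        // combining (sum, count) pairs recovers the overall mean
        System.out.println(meanFromPartials(Arrays.asList(sumCount(mapA), sumCount(mapB))));
    }
}
```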
1) Shuffle flow without a combiner:
map output --> circular buffer --> partition --> sort --> spill
--> merge --> sort --> fetch (copy) --> merge --> sort --> reduce
2) Shuffle flow with a combiner:
map output --> circular buffer --> partition --> sort --> combiner
--> spill --> merge --> sort --> combiner
--> fetch (copy) --> merge --> sort --> combiner
--> reduce
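The first steps of flow 2 (partition, sort, combiner, spill) can be imitated on a small in-memory list. This is only a rough sketch in plain Java, not Hadoop's implementation: the real buffer holds serialized bytes and sorts an index, and the framework may run the combiner zero, one, or several times. The class SpillSketch, the Rec type, and the sample records are all hypothetical; the combiner here keeps the maximum value per key.

```java
import java.util.ArrayList;
import java.util.List;

class SpillSketch {
    static final class Rec {
        final int partition;
        final String key;
        final int value;
        Rec(int partition, String key, int value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }
    }

    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // partition -> sort -> combine (max per key); returns what would be spilled
    static List<Rec> sortAndCombine(String[] keys, int[] values, int numPartitions) {
        List<Rec> buffer = new ArrayList<>();
        for (int i = 0; i < keys.length; i++) {
            buffer.add(new Rec(partitionFor(keys[i], numPartitions), keys[i], values[i]));
        }
        // Sort by partition first, then by key within each partition.
        buffer.sort((a, b) -> a.partition != b.partition
                ? Integer.compare(a.partition, b.partition)
                : a.key.compareTo(b.key));
        // Equal keys are now adjacent, so one pass can merge them.
        List<Rec> spill = new ArrayList<>();
        for (Rec r : buffer) {
            Rec last = spill.isEmpty() ? null : spill.get(spill.size() - 1);
            if (last != null && last.partition == r.partition && last.key.equals(r.key)) {
                spill.set(spill.size() - 1,
                        new Rec(r.partition, r.key, Math.max(last.value, r.value)));
            } else {
                spill.add(r);
            }
        }
        return spill;
    }

    public static void main(String[] args) {
        String[] years = {"1901", "1902", "1901", "1902", "1901"}; // sample keys
        int[] temps = {10, 5, 30, 7, 20};                          // sample values
        for (Rec r : sortAndCombine(years, temps, 2)) {
            System.out.println("partition " + r.partition + ": " + r.key + " " + r.value);
        }
    }
}
```

Five input records shrink to one record per key, which is the sense in which a combiner makes the map output more compact before it is written to disk and copied over the network.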
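In essence: a combiner is just a Reducer (its input and output key/value types must both match the map output types).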
1. Map-side combiner:
In the sortAndSpill() method --->
if (combinerRunner == null) {
    ..............
} else {
    int spstart = spindex;
    while (spindex < mend &&
            kvmeta.get(offsetFor(spindex % maxRec)
                    + PARTITION) == i) {
        ++spindex;
    }
    // Note: we would like to avoid the combiner if we have fewer
    // than some threshold of records for a partition
    if (spstart != spindex) {
        combineCollector.setWriter(writer);
        RawKeyValueIterator kvIter = new MRResultIterator(spstart, spindex);
        combinerRunner.combine(kvIter, combineCollector);
    }
}
In the mergeParts() method --->
if (combinerRunner == null || numSpills < minSpillsForCombine) {
    Merger.writeFile(kvIter, writer, reporter, job);
} else {
    combineCollector.setWriter(writer);
    combinerRunner.combine(kvIter, combineCollector);
}
*** Note: during the merge, the combiner is invoked only when a combiner is set and numSpills >= minSpillsForCombine (default 3).
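The merge-time condition can be restated as a small predicate. The sketch below is plain Java that mirrors the negation of the if-condition in mergeParts(); the class and method names are hypothetical, and the constant 3 is the default mentioned above.

```java
class MergeCombineSketch {
    static final int DEFAULT_MIN_SPILLS_FOR_COMBINE = 3;

    // True when the merge step would go through the combiner, i.e. the negation of
    // (combinerRunner == null || numSpills < minSpillsForCombine).
    static boolean combineDuringMerge(boolean hasCombiner, int numSpills,
                                      int minSpillsForCombine) {
        return hasCombiner && numSpills >= minSpillsForCombine;
    }

    public static void main(String[] args) {
        // 2 spill files: merged without the combiner even though one is configured
        System.out.println(combineDuringMerge(true, 2, DEFAULT_MIN_SPILLS_FOR_COMBINE));
        // 3 spill files: the combiner runs again during the merge
        System.out.println(combineDuringMerge(true, 3, DEFAULT_MIN_SPILLS_FOR_COMBINE));
        // no combiner configured: never combined
        System.out.println(combineDuringMerge(false, 5, DEFAULT_MIN_SPILLS_FOR_COMBINE));
    }
}
```

This is why a small job that spills only once or twice shows the combiner running in sortAndSpill() but not in the merge.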
2. Reduce-side combiner:
In the InMemoryMerger.merge() method
............
if (null == combinerClass) {
    Merger.writeFile(rIter, writer, reporter, jobConf);
} else {
    combineCollector.setWriter(writer);
    combineAndSpill(rIter, reduceCombineInputCounter);
}
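Exercise: using a combiner, write a job that finds the highest temperature for 1901 and 1902, then inspect the log output (compare it with a run without the combiner)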
-------------------------------------------------------------------------------------------------------------------------------------------------
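Exercise 1: for each student, find the subject with the highest score (compare runs with and without a combiner by watching the console log)
Exercise 2: compute each student's average score (compare runs with and without a combiner by watching the console log)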
-----------------------------------------------------------------------------------------------------------------------
score1.txt
StudentID MathScore
1001 98
1002 100
1003 97
1004 63
1005 59
score2.txt
StudentID ChineseScore
1001 95
1002 91
1003 99
1004 50
1005 70
score3.txt
StudentID EnglishScore
1001 80
1002 30
1003 78
1004 89
1005 34