MapReduce Partitioning and Grouping
The MapReduce Execution Flow
The MapReduce workflow can be roughly divided into five steps, as follows:
Splitting and Formatting the Data Source
After MR's InputFormat reads the input data source, the data is split and formatted. InputFormat is the abstract parent class for reading data sources in MR; the classes that read the various kinds of data sources, such as FileInputFormat and DBInputFormat, all extend InputFormat.
Splitting is implemented by the getSplits method of the InputFormat abstract class. This method logically divides the input file set of the MR Job object; each resulting input split is assigned to a separate Mapper for processing.
Note: splits are logical slices of the input files; the input files are not physically divided into blocks. For example, a split might be the tuple <input file path, start position, offset>.
Formatting is implemented by the createRecordReader method of the InputFormat abstract class. Its return value is a RecordReader<K,V>, whose key-value pairs are exactly the Mapper's input key-value pairs. The splits returned by getSplits are handled by createRecordReader, which creates a record reader (RecordReader) for a given split; before the split is used, the MR framework calls RecordReader.initialize(InputSplit, TaskAttemptContext) on the returned RecordReader<K,V>.
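For reference, a simplified excerpt of what these two methods look like on the org.apache.hadoop.mapreduce.InputFormat abstract class (signatures as in the Hadoop 2.x/3.x API, Javadoc omitted):
public abstract class InputFormat<K, V> {
  // Logically split the job's input; one Mapper is later created per InputSplit.
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;
  // Create the RecordReader that turns one split into the Mapper's input <K, V> pairs.
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;
}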
Executing the MapTask
After splitting and formatting, one Mapper runs a MapTask for each split. Every Map task has an in-memory buffer (100 MB by default); the result key-value pairs produced by the Map task from the input split are written into this buffer.
When the data written reaches the buffer's threshold of 80% (80 MB), a thread is started to spill the buffered data to a small file on disk, without stopping the Map from writing further result key-value pairs into the buffer.
During a spill, the MapReduce framework sorts the data by key. If the Map output is large, several spill files are produced, and the data left in the buffer at the end is also spilled to disk as a final spill file. If there are multiple spill files, they are merged into a single file at the end.
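Both the buffer size and the spill threshold can be configured per job. A minimal sketch of overriding the defaults (100 MB / 0.80) through the job Configuration, assuming the Hadoop 2.x/3.x property names; the values here are only examples:
Configuration conf = new Configuration();
conf.setInt("mapreduce.task.io.sort.mb", 200);            // in-memory sort buffer size in MB (default 100)
conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f); // spill threshold as a fraction (default 0.80)
Job job = Job.getInstance(conf);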
Executing the Shuffle Phase
After the MapTasks have finished, the data processed in the Map phase is reorganized and handed over to the Reducers; this process is called the Shuffle.
The Shuffle distributes the MapTask output to the ReduceTasks and, while doing so, partitions and sorts the data by key. Data in different partitions is processed in parallel by different ReduceTasks.
Executing the ReduceTask
When a ReduceTask runs, the key-value pairs <Rk,Rv> received by the Reducer are built from the Mapper's output key-value pairs <Mk,Mv>: Rk is the Mk from the Map output, and Rv is the collection of all values Mv that share that same Mk, i.e. Rv = List<Mv>. The Reducer input records with the same Mk form one group, and reduce() is called once for that group.
It follows that within each Reduce partition, the ReduceTask processes the data group by group: the Reducer's reduce() method is called once for every group.
The ReduceTask's input stream has the form <key, {value list}>. Users can implement their own logic in reduce(), and the final result is emitted in the form <key, value>.
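As a minimal, self-contained illustration of the <key, {value list}> form (a word-count style reducer that sums counts per word, unrelated to the secondary-sort example later in this article; Text and IntWritable come from org.apache.hadoop.io):
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) { // values holds every Map output value sharing this key
      sum += v.get();
    }
    context.write(key, new IntWritable(sum)); // one <key, value> result per group
  }
}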
Writing to Files
The MapReduce framework automatically passes the <key, value> pairs produced by the ReduceTasks to OutputFormat's write method, which writes the data to files or some other target (determined by the concrete class implementing the abstract OutputFormat).
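For example, with the commonly used TextOutputFormat each <key, value> pair is written as one line of text. Which OutputFormat a job uses, and where the result files go, is set on the Job; a brief sketch (the "output" path is just an example):
job.setOutputFormatClass(TextOutputFormat.class);        // org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
FileOutputFormat.setOutputPath(job, new Path("output")); // directory the result files are written to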
MR Partitioning
MR partitioning happens during the Shuffle. After the Map phase has processed its data, and before the data is sorted and merged into a single temporary file, the configured partitioner (Partitioner) assigns a partition to every key-value pair about to be written to disk (that is, the Mapper's <outkey, outvalue>). Afterwards, the ReduceTask of each partition pulls that partition's data from the temporary files of all MapTasks.
By default MR uses only one ReduceTask; the number of ReduceTasks can be customized in the MR driver class, like this:
Job job = Job.getInstance(); // create the MR job object
...
job.setNumReduceTasks(4); // set the number of reduce tasks to 4
In general, the number of partitions should equal the number of ReduceTasks, so that each partition is handled by one ReduceTask.
The Data Partitioner: Partitioner
The data partitioner partitions the data by the key of the Mapper's output. The default partitioner in MR (MapReduce) is HashPartitioner.
/** Partition keys by their {@link Object#hashCode()}. */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class HashPartitioner<K, V> extends Partitioner<K, V> {

  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
Parameters:
K key: the key to be partitioned
V value: the value of the key-value pair (entry)
numReduceTasks: the total number of partitions
Return value:
the partition number (index) of the partition the key belongs to
Description:
HashPartitioner applies a hash function to the key of each Mapper output key-value pair, "scattering" the Mapper output, and then takes the key's hash value modulo the numReduceTasks argument to obtain the partition number the pair belongs to. The number of partitions equals the number of ReduceTasks of the job, so the partitioner (HashPartitioner) controls which ReduceTask a Map output key (and its values) is sent to for aggregation (from the original documentation: "The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent for reduction.").
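A small standalone sketch of that computation, using made-up string keys, makes the mapping concrete (the & Integer.MAX_VALUE mask clears the sign bit, so a negative hashCode still yields a non-negative partition index):
int numReduceTasks = 4;
for (String key : new String[]{"org_001", "org_002", "userId_001"}) { // hypothetical keys
  int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  System.out.println(key + " -> partition " + partition);
}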
Custom partitioners:
If an MR program needs custom partitioning, you can define your own partitioner class by extending Partitioner<KEY, VALUE> and implementing its abstract method getPartition(KEY key, VALUE value, int numPartitions).
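A minimal sketch of such a class, assuming Text keys and a hypothetical rule that routes keys beginning with "org_001" to partition 0 (the complete partitioner actually used by the example later in this article appears below):
public class OrgPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    if (key.toString().startsWith("org_001")) { // hypothetical routing rule
      return 0;
    }
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions; // fall back to hash partitioning
  }
}
The class is then registered on the Job with job.setPartitionerClass(OrgPartitioner.class).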
MR Grouping
MR Sorting
Before introducing grouping in MR (MapReduce), it is worth explaining how MR sorts its data. Sorting happens in both the MapTask stage and the ReduceTask stage.
MapTask stage:
- Whenever a MapTask spills the data in its in-memory buffer to a small file, the data is sorted by key, so the data in each spill file is ordered. (Note: sorting by key requires either that the key type implements the comparison method of the WritableComparable<ClassName> interface, or that a key comparator is set via Job.setSortComparatorClass(Class<? extends org.apache.hadoop.io.RawComparator> cls).)
- When a MapTask merges its spill files into one complete data file, files are merged and sorted repeatedly until the merge is complete.
- In the end, each MapTask produces a single complete, sorted data file.
A custom key comparator class, used as the argument to Job.setSortComparatorClass:
public class FirstSort extends WritableComparator {
  public FirstSort() {
    super(StudentExamRecord.class, true);
  }
  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    StudentExamRecord left = (StudentExamRecord) a;
    StudentExamRecord right = (StudentExamRecord) b;
    return left.compareTo(right);
  }
}
ReduceTask stage:
The Reducer's input is the Mapper's output. The ReduceTask of each partition has to pull (copy) its partition's data from the file produced by every MapTask and merge-sort it: although the data inside each MapTask's output file is ordered, there is no guarantee that the data a ReduceTask pulls from the different MapTask files is ordered as a whole, because the order within each MapTask file is only local compared with the overall order the ReduceTask's partition needs.
Grouping
Given the sorting described above, the ReduceTask's data is now ordered (here, and throughout this article, "ordered" refers only to the keys, not the values). To make this easier to follow, a conceptual diagram of the ReduceTask's data is shown below:
As the diagram shows, the ReduceTask groups its data by key: the key-value pairs are compared one by one, and whenever two adjacent keys differ, the current group is closed and the next one begins. In the diagram, when key_n and key_n+1 are compared and found to be different, the previous group ends; key_n becomes the key, the collection of value_1, value_2, ..., value_n becomes the value, and this pair is passed to the Reducer as its input <keyIn, valueIn>. This continues until all data in the partition has been grouped and handed to the Reducer for aggregation.
Grouping depends entirely on the keys of the Reducer's input data, so deciding group membership comes down to deciding how keys are compared. By default, grouping relies on the compareTo() method defined on the key type.
Custom Grouping
Custom grouping is configured through the MR Job object, as follows:
Job job = Job.getInstance();
...
// Define a comparator that controls which keys are grouped together for a single call to
// Reducer.reduce(); the values of the keys in one group are collected into a single iterable.
job.setGroupingComparatorClass(CustomerGroupingComparator.class);
Here CustomerGroupingComparator is a custom comparator that decides which keys belong to the same group. When the comparator returns 0, the two keys are considered equal, i.e. their data belongs to one group.
CustomerGroupingComparator must extend WritableComparator and call its parent constructor.
Code Example
Input data
/input/input1.data
org_001,userId_001,30
org_002,userId_002,12
org_002,userId_002,54
org_001,userId_002,67
org_001,userId_001,83
org_002,userId_001,54
org_002,userId_001,4
......
/input/input2.data
org_001,userId_001,16
org_002,userId_002,36
org_002,userId_002,16
org_001,userId_002,73
org_001,userId_001,83
......
/input/input3.data
org_001,userId_001,70
org_002,userId_002,8
org_002,userId_002,69
org_001,userId_002,79
org_001,userId_001,93
org_002,userId_001,10
org_002,userId_001,48
org_001,userId_002,19
org_001,userId_002,78
......
Entity class
An entity class with the attributes orgId, userId, and examScore:
package com.secondarysort;
import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Objects;
/**
* { @author }: Cui
* <p>
* { @DateTime }: December 27, 2023
* <p>
* { @ClassName }: com.secondarysort.StudentExamRecord
* <p>
* { @Description }: Entity class for student exam records
* <p>
* All rights Reserved, Designed By Cui
* { @Copyright }: 2023-2023
*/
public class StudentExamRecord implements WritableComparable<StudentExamRecord> {
private String orgId;
private String userId;
private double examScore;
public StudentExamRecord() {
}
public StudentExamRecord(String orgId, String userId, double examScore) {
this.orgId = orgId;
this.userId = userId;
this.examScore = examScore;
}
public String getOrgId() {
return orgId;
}
public void setOrgId(String orgId) {
this.orgId = orgId;
}
public String getUserId() {
return userId;
}
public void setUserId(String userId) {
this.userId = userId;
}
public double getExamScore() {
return examScore;
}
public void setExamScore(double examScore) {
this.examScore = examScore;
}
// hashCode is tweaked per (orgId, userId) combination so that, with 4 reduce tasks,
// each of the four combinations in the sample data lands in its own partition.
@Override
public int hashCode() {
if(this.orgId.equals("org_001") && this.userId.equals("userId_002")){
return Objects.hash(orgId, userId)+1;
}else if(this.orgId.equals("org_002") && this.userId.equals("userId_001")){
return Objects.hash(orgId, userId)%3;
}else if(this.orgId.equals("org_002") && this.userId.equals("userId_002")){
return Objects.hash(orgId, userId)+1;
}
return Objects.hash(orgId, userId);
}
@Override
public String toString() {
return "StudentExamRecord{" +
"ordId='" + orgId + '\'' +
", userId='" + userId + '\'' +
", examScore=" + examScore +
'}';
}
@Override
public int compareTo(StudentExamRecord o) {
if(this.orgId.equals(o.getOrgId())){
if(this.userId.equals(o.getUserId())){
if(this.examScore == o.getExamScore()){
return 0;
}else{
return this.examScore>o.getExamScore()?1:-1;
}
}else{
return this.userId.compareTo(o.getUserId());
}
}else{
return this.orgId.compareTo(o.getOrgId());
}
}
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(this.orgId);
out.writeUTF(this.userId);
out.writeDouble(this.examScore);
}
@Override
public void readFields(DataInput in) throws IOException {
this.orgId = in.readUTF();
this.userId = in.readUTF();
this.examScore = in.readDouble();
}
}
Custom partitioner
package com.secondarysort;
import org.apache.hadoop.mapreduce.Partitioner;
/**
* { @author }: Cui
* <p>
* { @DateTime }: December 27, 2023
* <p>
* { @ClassName }: com.secondarysort.CustomerPartition
* <p>
* { @Description }: Custom partitioner
* <p>
* All rights Reserved, Designed By Cui
* { @Copyright }: 2023-2023
*/
public class CustomerPartition extends Partitioner<StudentExamRecord,StudentExamRecord> {
@Override
public int getPartition(StudentExamRecord key, StudentExamRecord value, int numPartitions) {
return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
}
}
Custom grouping comparator
package com.secondarysort;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
/**
* { @author }: Cui
* <p>
* { @DateTime }: December 27, 2023
* <p>
* { @ClassName }: com.secondarysort.CustomerGroupingComparator
* <p>
* { @Description }: Custom grouping comparator
* <p>
* All rights Reserved, Designed By Cui
* { @Copyright }: 2023-2023
*/
public class CustomerGroupingComparator extends WritableComparator {
public CustomerGroupingComparator() {
// Two arguments are required: the first is the Class of the key type; the second decides whether
// key instances are created. For grouping to work, it must be set to true.
super(StudentExamRecord.class,true);
}
@Override
public int compare(WritableComparable a, WritableComparable b) {
StudentExamRecord left = (StudentExamRecord) a;
StudentExamRecord right = (StudentExamRecord) b;
if(left.getOrgId().equals(right.getOrgId())){
return left.getUserId().compareTo(right.getUserId());
}else{
return left.getOrgId().compareTo(right.getOrgId());
}
}
}
The MR program's Mapper
package com.secondarysort;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/**
* { @author }: Cui
* <p>
* { @DateTime }: December 27, 2023
* <p>
* { @ClassName }: com.secondarysort.SecondarySortMapper
* <p>
* { @Description }: MR Mapper
* <p>
* All rights Reserved, Designed By Cui
* { @Copyright }: 2023-2023
*/
public class SecondarySortMapper extends Mapper<LongWritable, Text,StudentExamRecord,StudentExamRecord> {
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, StudentExamRecord, StudentExamRecord>.Context context) throws IOException, InterruptedException {
String[] split = value.toString().split(",");
StudentExamRecord out = new StudentExamRecord(split[0], split[1], Double.parseDouble(split[2]));
context.write(out,out);
}
}
The MR program's Reducer
package com.secondarysort;
import org.apache.commons.beanutils.BeanUtils;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
import java.lang.reflect.InvocationTargetException;
import java.util.ArrayList;
import java.util.List;
/**
* { @author }: Cui
* <p>
* { @DateTime }: December 27, 2023
* <p>
* { @ClassName }: com.secondarysort.SecondarySortReducer
* <p>
* { @Description }: MR Reducer
* <p>
* All rights Reserved, Designed By Cui
* { @Copyright }: 2023-2023
*/
public class SecondarySortReducer extends Reducer<StudentExamRecord,StudentExamRecord,StudentExamRecord,StudentExamRecord> {
@Override
protected void reduce(StudentExamRecord key, Iterable<StudentExamRecord> values, Reducer<StudentExamRecord, StudentExamRecord, StudentExamRecord, StudentExamRecord>.Context context) throws IOException, InterruptedException {
List<StudentExamRecord> records = new ArrayList<>();
// Hadoop reuses the same value object while iterating over values, so each record is copied
// into a new object before being stored in the list.
for (StudentExamRecord value : values) {
StudentExamRecord record = new StudentExamRecord();
try {
BeanUtils.copyProperties(record,value);
} catch (IllegalAccessException e) {
throw new RuntimeException(e);
} catch (InvocationTargetException e) {
throw new RuntimeException(e);
}
records.add(record);
}
System.out.println("***********key="+key+",************values="+records);
for (StudentExamRecord record : records) {
context.write(key,record);
}
}
}
The MR program's driver class
package com.secondarysort;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.log4j.BasicConfigurator;
import java.io.IOException;
/**
* { @author }: Cui
* <p>
* { @DateTime }: December 27, 2023
* <p>
* { @ClassName }: com.secondarysort.SecondarySortLaunch
* <p>
* { @Description }: MR driver class
* <p>
* All rights Reserved, Designed By Cui
* { @Copyright }: 2023-2023
*/
public class SecondarySortLaunch {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
BasicConfigurator.configure();
Job job = Job.getInstance();
job.setJobName("secondarySort");
job.setJarByClass(SecondarySortLaunch.class);
job.setMapperClass(SecondarySortMapper.class);
job.setReducerClass(SecondarySortReducer.class);
job.setMapOutputKeyClass(StudentExamRecord.class);
job.setMapOutputValueClass(StudentExamRecord.class);
job.setOutputKeyClass(StudentExamRecord.class);
job.setOutputValueClass(StudentExamRecord.class);
job.setPartitionerClass(CustomerPartition.class);
job.setNumReduceTasks(4);
// Define a comparator that controls which keys are grouped together for a single call to
// Reducer.reduce(); the values of the keys in one group are collected into a single iterable.
job.setGroupingComparatorClass(CustomerGroupingComparator.class);
FileInputFormat.addInputPath(job,new Path("input"));
FileSystem fs = FileSystem.get(job.getConfiguration());
Path out = new Path("output");
if (fs.exists(out)) {
fs.delete(out,true);
}
FileOutputFormat.setOutputPath(job,out);
boolean b = job.waitForCompletion(true);
System.exit(b?0:1);
}
}
Output data
/output/part-r-00000
StudentExamRecord{ordId='org_002', userId='userId_002', examScore=100.0} StudentExamRecord{ordId='org_002', userId='userId_002', examScore=3.0}
StudentExamRecord{ordId='org_002', userId='userId_002', examScore=100.0} StudentExamRecord{ordId='org_002', userId='userId_002', examScore=8.0}
StudentExamRecord{ordId='org_002', userId='userId_002', examScore=100.0} StudentExamRecord{ordId='org_002', userId='userId_002', examScore=11.0}
StudentExamRecord{ordId='org_002', userId='userId_002', examScore=100.0} StudentExamRecord{ordId='org_002', userId='userId_002', examScore=12.0}
StudentExamRecord{ordId='org_002', userId='userId_002', examScore=100.0} StudentExamRecord{ordId='org_002', userId='userId_002', examScore=14.0}
StudentExamRecord{ordId='org_002', userId='userId_002', examScore=100.0} StudentExamRecord{ordId='org_002', userId='userId_002', examScore=16.0}
......
/output/part-r-00001
StudentExamRecord{ordId='org_001', userId='userId_002', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_002', examScore=5.0}
StudentExamRecord{ordId='org_001', userId='userId_002', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_002', examScore=9.0}
StudentExamRecord{ordId='org_001', userId='userId_002', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_002', examScore=9.0}
StudentExamRecord{ordId='org_001', userId='userId_002', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_002', examScore=11.0}
StudentExamRecord{ordId='org_001', userId='userId_002', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_002', examScore=14.0}
StudentExamRecord{ordId='org_001', userId='userId_002', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_002', examScore=15.0}
......
/output/part-r-00002
StudentExamRecord{ordId='org_002', userId='userId_001', examScore=95.0} StudentExamRecord{ordId='org_002', userId='userId_001', examScore=3.0}
StudentExamRecord{ordId='org_002', userId='userId_001', examScore=95.0} StudentExamRecord{ordId='org_002', userId='userId_001', examScore=4.0}
StudentExamRecord{ordId='org_002', userId='userId_001', examScore=95.0} StudentExamRecord{ordId='org_002', userId='userId_001', examScore=9.0}
StudentExamRecord{ordId='org_002', userId='userId_001', examScore=95.0} StudentExamRecord{ordId='org_002', userId='userId_001', examScore=10.0}
StudentExamRecord{ordId='org_002', userId='userId_001', examScore=95.0} StudentExamRecord{ordId='org_002', userId='userId_001', examScore=12.0}
......
/output/part-r-00003
StudentExamRecord{ordId='org_001', userId='userId_001', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_001', examScore=2.0}
StudentExamRecord{ordId='org_001', userId='userId_001', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_001', examScore=8.0}
StudentExamRecord{ordId='org_001', userId='userId_001', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_001', examScore=16.0}
StudentExamRecord{ordId='org_001', userId='userId_001', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_001', examScore=28.0}
StudentExamRecord{ordId='org_001', userId='userId_001', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_001', examScore=30.0}
StudentExamRecord{ordId='org_001', userId='userId_001', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_001', examScore=30.0}
......
Closing note: there is plenty of room for improvement here; comments and discussion for mutual learning are welcome.