Partitioning and Grouping in MapReduce

The MapReduce Execution Flow

The MapReduce workflow can be divided into roughly five steps, as follows:

Splitting and formatting the data source

After the input data source is read by MapReduce's InputFormat, the data is split and formatted. InputFormat is the abstract parent class for reading data sources in MapReduce; the concrete input classes such as FileInputFormat and DBInputFormat all extend InputFormat.

Splitting is implemented by the getSplits method of the abstract InputFormat class. This method logically partitions the job's set of input files; each resulting input split is assigned to a separate Mapper for processing.

Note: a split is a logical slice of the input; the input file is not physically divided into chunks. For example, a split might be a tuple of <input file path, start position, offset>.

Formatting is handled by the createRecordReader method of the abstract InputFormat class. It returns a RecordReader<K,V>, whose key/value pairs become the Mapper's input key/value pairs. The splits returned by getSplits are handed to createRecordReader, which creates a record reader (RecordReader) for a given split; before the split is used, the framework calls RecordReader.initialize(InputSplit, TaskAttemptContext) on the returned RecordReader<K,V>.
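
These two methods form the core contract of org.apache.hadoop.mapreduce.InputFormat. The sketch below is a minimal custom input format that keeps the split logic inherited from FileInputFormat and reuses Hadoop's LineRecordReader for formatting; it mirrors what the built-in TextInputFormat does, and the class name WholeLineInputFormat is made up purely for illustration:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class WholeLineInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        // getSplits() is inherited from FileInputFormat and performs the logical
        // splitting of the input files; this method only creates a RecordReader
        // for one split. The framework calls the reader's
        // initialize(InputSplit, TaskAttemptContext) before the split is consumed.
        return new LineRecordReader();
    }
}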

Running the MapTask

Once splitting and formatting are done, each split is processed by one Mapper running a MapTask. Every map task has an in-memory buffer (100 MB by default); the key/value pairs produced by the map task from its input split are written into this buffer.

When the data written to the buffer reaches the spill threshold of 80% (80 MB), a background thread starts spilling the buffered data to a small file on disk, while the map task keeps writing its result key/value pairs into the buffer.

During a spill, the framework sorts the records by key. If the map output is large, multiple spill files are produced, and whatever remains in the buffer at the end is also spilled to disk; if there are multiple spill files, they are finally merged into a single file.
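
The 100 MB buffer size and the 80% spill threshold are only defaults. On Hadoop 2.x/3.x they can be tuned through job configuration, roughly as follows (a sketch; the property names are the mapreduce.* names used by current Hadoop releases):

Configuration conf = new Configuration();
conf.setInt("mapreduce.task.io.sort.mb", 200);            // in-memory sort buffer size in MB (default 100)
conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f); // spill threshold as a fraction of the buffer (default 0.80)
Job job = Job.getInstance(conf);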

Running the Shuffle

After the MapTasks finish, the data produced in the map phase is regrouped and handed over to the reduce phase for processing; this process is called the Shuffle.

The Shuffle distributes the MapTask output to the ReduceTasks, partitioning and sorting the data by key along the way. Data belonging to different partitions is processed in parallel by different ReduceTasks.

Running the ReduceTask

When a ReduceTask runs, the key/value pairs <Rk,Rv> the Reducer receives are derived from the Mapper's output pairs <Mk,Mv>: Rk is a map output key Mk, and Rv is the collection of all values Mv that share that same Mk, i.e. Rv = List<Mv>. All Reducer input records with the same Mk form one group, and reduce() is called once per group.

In other words, within each reduce partition the ReduceTask processes the data group by group: however many groups there are, reduce() in the Reducer is called that many times.

The input to a ReduceTask therefore has the form <key, {value list}>. Users implement their own logic in reduce() and emit the result as <key, value> pairs.
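
As a minimal illustration of the <key, {value list}> form, here is a word-count style reducer that sums all values sharing one key (the class name SumReducer is hypothetical and not part of the example program later in this article):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // values contains every Mv that shares this key, i.e. Rv = List<Mv>
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum)); // emit <key, value>
    }
}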

Writing the output

The framework automatically passes the <key, value> pairs produced by the ReduceTask to the write method of the record writer obtained from the job's OutputFormat, which writes the data to a file or another storage target (determined by the concrete class that implements the abstract OutputFormat).
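
The output format can be chosen in the driver; for example, TextOutputFormat (the default) writes each pair as key TAB value, one record per line:

// org.apache.hadoop.mapreduce.lib.output.TextOutputFormat is the default;
// other implementations, such as DBOutputFormat, write to other targets.
job.setOutputFormatClass(TextOutputFormat.class);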

Partitioning in MapReduce

Partitioning happens during the Shuffle. After the map-side data has been processed, and before it is sorted and merged into a single temporary file, the configured partitioner (Partitioner) assigns a partition to every key/value pair about to be written to disk (i.e. the Mapper's <outkey, outvalue> pairs). Afterwards, the ReduceTask responsible for each partition pulls its own partition's data from every MapTask's temporary file.

By default MapReduce uses a single ReduceTask; the number of ReduceTasks can be customized in the program's driver class, as follows:

Job job = Job.getInstance(); // create the MapReduce job object
...
job.setNumReduceTasks(4); // set the number of reduce tasks to 4

In general, the number of partitions should equal the number of ReduceTasks, so that each partition is handled by one ReduceTask.

The partitioner: Partitioner

The partitioner assigns partitions based on the key of the Mapper's output. The default partitioner in MapReduce is HashPartitioner.

/** Partition keys by their {@link Object#hashCode()}. */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class HashPartitioner<K, V> extends Partitioner<K, V> {

  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

}
Parameters:

K key: the key to be partitioned

V value: the value of the key/value pair (entry)

numReduceTasks: the total number of partitions

Return value:

the partition number (index) of the partition the key is assigned to

Description:

HashPartitioner hashes the key of each Mapper output pair in order to spread the map output, then takes the key's hash value modulo numReduceTasks to obtain the partition number for that pair. The bitwise AND with Integer.MAX_VALUE clears the sign bit so that the modulo result is never negative; for example, with numReduceTasks = 4, a key whose masked hash is 13 goes to partition 13 % 4 = 1. The number of partitions equals the number of reduce tasks of the job, so the partitioner controls which ReduceTask a given map output key (and its records) is sent to for reduction (from the original javadoc: "The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent for reduction.").

Custom partitioners:

If a job needs custom partitioning, you can write your own partitioner by extending Partitioner<KEY, VALUE> and implementing its abstract method getPartition(KEY key, VALUE value, int numPartitions), as sketched below.
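
A rough sketch of such a partitioner (the class name, type parameters, and routing rule are purely illustrative; the partitioner actually used by the example program appears later in this article):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class OrgPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Illustrative rule: keys starting with "org_001" always go to partition 0,
        // all other keys are spread by hash.
        if (key.toString().startsWith("org_001")) {
            return 0;
        }
        // mask the sign bit so the result stays within [0, numPartitions)
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

It is registered on the job with job.setPartitionerClass(OrgPartitioner.class).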

Grouping in MapReduce

Sorting in MapReduce

Before discussing grouping, it is worth covering how MapReduce sorts data. Sorting happens in both the MapTask phase and the ReduceTask phase.

MapTask phase:
  1. Whenever a MapTask spills the contents of its in-memory buffer to a small file, the records are sorted by key, so each spill file is internally ordered. (Note: sorting by key requires either that the key class implements the comparison method of the WritableComparable<ClassName> interface, or that a key comparator is configured via Job.setSortComparatorClass(Class<? extends org.apache.hadoop.io.RawComparator> cls).)
  2. When a MapTask merges its spill files into one complete data file, the files are repeatedly merged and re-sorted until the merge finishes.
  3. In the end, each MapTask produces a single, fully sorted data file.

A custom key comparator class, to be passed as the Job.setSortComparatorClass argument:

public class FirstSort extends WritableComparator {
    public FirstSort() {
        super(StudentExamRecord.class,true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        StudentExamRecord left = (StudentExamRecord) a;
        StudentExamRecord right = (StudentExamRecord) b;
        return left.compareTo(right);
    }
}
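
The comparator is then registered on the job so that it is used for the map-side key sort:

job.setSortComparatorClass(FirstSort.class);
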
ReduceTask phase:

The Reducer's input is the Mappers' output. The ReduceTask for each partition has to pull (Copy) its own partition's data from the file produced by every MapTask and then merge-sort it (Sort). Although each MapTask output file is sorted, the data a ReduceTask pulls from several MapTask files is not guaranteed to remain ordered, because the sorted data in each MapTask file is only locally ordered with respect to the fully ordered data the ReduceTask's partition needs.

Grouping

After the sorting described above, the data a ReduceTask works on is ordered (here, and throughout this article, "ordered" refers only to the keys, not the values). To make this easier to follow, the ReduceTask data can be pictured as follows:

(Figure: conceptual view of ReduceTask data being grouped by key)

As the figure suggests, the ReduceTask groups its data by key: the key/value pairs are compared one after another, and whenever a key differs from the previous one, the current group is closed and a new one begins. For example, when key_n and key_n+1 are compared and found to be different, the previous group ends; key_n becomes the group's key, the collection value_1, value_2, ..., value_n becomes its value, and the pair is fed to the Reducer as its input <keyIn, valueIn>. This continues until all data in the partition has been grouped and passed to the Reducer for aggregation.

Grouping depends entirely on the keys of the Reducer's input, so deciding which records form a group comes down to deciding how keys are compared. By default, grouping relies on the compareTo() method defined on the key class.

Custom grouping

Custom grouping is configured through the MapReduce Job object, as follows:

Job job = Job.getInstance();
...
// Define a comparator that controls which keys are grouped together for a single call to
// Reducer.reduce; the values of all keys in one group are passed to reduce() in one Iterable.
job.setGroupingComparatorClass(CustomerGroupingComparator.class);

Here CustomerGroupingComparator is a custom comparator used to decide which keys belong to the same group: when it returns 0, the two keys are considered equal, i.e. their records belong to one group.

CustomerGroupingComparator must extend WritableComparator and invoke its parent constructor.

Complete code example:

Input data

/input/input1.data

org_001,userId_001,30
org_002,userId_002,12
org_002,userId_002,54
org_001,userId_002,67
org_001,userId_001,83
org_002,userId_001,54
org_002,userId_001,4
......

/input/input2.data

org_001,userId_001,16
org_002,userId_002,36
org_002,userId_002,16
org_001,userId_002,73
org_001,userId_001,83
......

/input/input3.data

org_001,userId_001,70
org_002,userId_002,8
org_002,userId_002,69
org_001,userId_002,79
org_001,userId_001,93
org_002,userId_001,10
org_002,userId_001,48
org_001,userId_002,19
org_001,userId_002,78
......

The entity class

An entity class with orgId, userId, and examScore as its properties.

package com.secondarysort;

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Objects;

/**
 * { @author }: Cui
 * <p>
 * { @DateTime }:  December 27, 2023
 * <p>
 * { @ClassName }:  com.secondarysort.StudentExamRecord
 * <p>
 * { @Description }: Entity class for a student's exam record
 * <p>
 * All rights Reserved, Designed By Cui
 * { @Copyright }:  2023-2023
 */

public class StudentExamRecord implements WritableComparable<StudentExamRecord> {
    private String orgId;
    private String userId;
    private double examScore;

    public StudentExamRecord() {
    }

    public StudentExamRecord(String orgId, String userId, double examScore) {
        this.orgId = orgId;
        this.userId = userId;
        this.examScore = examScore;
    }

    public String getOrgId() {
        return orgId;
    }

    public void setOrgId(String orgId) {
        this.orgId = orgId;
    }

    public String getUserId() {
        return userId;
    }

    public void setUserId(String userId) {
        this.userId = userId;
    }

    public double getExamScore() {
        return examScore;
    }

    public void setExamScore(double examScore) {
        this.examScore = examScore;
    }


    @Override
    public int hashCode() {
        // Ad-hoc tweaks to the hash so that the four (orgId, userId) combinations
        // in the sample data end up in four different partitions when the job
        // runs with 4 reduce tasks (see the driver class below).
        if (this.orgId.equals("org_001") && this.userId.equals("userId_002")) {
            return Objects.hash(orgId, userId) + 1;
        } else if (this.orgId.equals("org_002") && this.userId.equals("userId_001")) {
            return Objects.hash(orgId, userId) % 3;
        } else if (this.orgId.equals("org_002") && this.userId.equals("userId_002")) {
            return Objects.hash(orgId, userId) + 1;
        }
        return Objects.hash(orgId, userId);
    }

    @Override
    public String toString() {
        return "StudentExamRecord{" +
                "ordId='" + orgId + '\'' +
                ", userId='" + userId + '\'' +
                ", examScore=" + examScore +
                '}';
    }

    @Override
    public int compareTo(StudentExamRecord o) {
        if(this.orgId.equals(o.getOrgId())){
            if(this.userId.equals(o.getUserId())){
                if(this.examScore == o.getExamScore()){
                    return 0;
                }else{
                    return this.examScore>o.getExamScore()?1:-1;
                }
            }else{
                return this.userId.compareTo(o.getUserId());
            }
        }else{
            return this.orgId.compareTo(o.getOrgId());
        }
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(this.orgId);
        out.writeUTF(this.userId);
        out.writeDouble(this.examScore);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.orgId = in.readUTF();
        this.userId = in.readUTF();
        this.examScore = in.readDouble();
    }
}

The custom partitioner

package com.secondarysort;

import org.apache.hadoop.mapreduce.Partitioner;

/**
 * { @author }: Cui
 * <p>
 * { @DateTime }:  December 27, 2023
 * <p>
 * { @ClassName }:  com.secondarysort.CustomerPartition
 * <p>
 * { @Description }: Custom partitioner
 * <p>
 * All rights Reserved, Designed By Cui
 * { @Copyright }:  2023-2023
 */

public class CustomerPartition extends Partitioner<StudentExamRecord,StudentExamRecord> {
    @Override
    public int getPartition(StudentExamRecord key, StudentExamRecord value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

The custom grouping comparator

package com.secondarysort;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 * { @author }: Cui
 * <p>
 * { @DateTime }:  December 27, 2023
 * <p>
 * { @ClassName }:  com.secondarysort.CustomerGroupingComparator
 * <p>
 * { @Description }: Custom grouping comparator
 * <p>
 * All rights Reserved, Designed By Cui
 * { @Copyright }:  2023-2023
 */

public class CustomerGroupingComparator extends WritableComparator {
    public CustomerGroupingComparator() {
        // Two arguments: the Class of the key type, and whether to create instances of it;
        // the second argument must be true for grouping to work.
        super(StudentExamRecord.class,true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        StudentExamRecord left = (StudentExamRecord) a;
        StudentExamRecord right = (StudentExamRecord) b;
        if(left.getOrgId().equals(right.getOrgId())){
            return left.getUserId().compareTo(right.getUserId());
        }else{
            return left.getOrgId().compareTo(right.getOrgId());
        }
    }
}

The Mapper

package com.secondarysort;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * { @author }: Cui
 * <p>
 * { @DateTime }:  December 27, 2023
 * <p>
 * { @ClassName }:  com.secondarysort.SecondarySortMapper
 * <p>
 * { @Description }: The MapReduce Mapper
 * <p>
 * All rights Reserved, Designed By Cui
 * { @Copyright }:  2023-2023
 */

public class SecondarySortMapper extends Mapper<LongWritable, Text,StudentExamRecord,StudentExamRecord> {
    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, StudentExamRecord, StudentExamRecord>.Context context) throws IOException, InterruptedException {
        // each input line has the form: orgId,userId,examScore
        String[] split = value.toString().split(",");
        StudentExamRecord out = new StudentExamRecord(split[0], split[1], Double.parseDouble(split[2]));
        // the record serves both as key (for partitioning/sorting/grouping) and as value
        context.write(out, out);
    }
}

The Reducer

package com.secondarysort;

import org.apache.commons.beanutils.BeanUtils;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.lang.reflect.InvocationTargetException;
import java.util.ArrayList;
import java.util.List;

/**
 * { @author }: Cui
 * <p>
 * { @DateTime }:  December 27, 2023
 * <p>
 * { @ClassName }:  com.secondarysort.SecondarySortReducer
 * <p>
 * { @Description }: The MapReduce Reducer
 * <p>
 * All rights Reserved, Designed By Cui
 * { @Copyright }:  2023-2023
 */

public class SecondarySortReducer extends Reducer<StudentExamRecord,StudentExamRecord,StudentExamRecord,StudentExamRecord> {
    @Override
    protected void reduce(StudentExamRecord key, Iterable<StudentExamRecord> values, Reducer<StudentExamRecord, StudentExamRecord, StudentExamRecord, StudentExamRecord>.Context context) throws IOException, InterruptedException {
        List<StudentExamRecord> records = new ArrayList<>();
        for (StudentExamRecord value : values) {
            // Hadoop reuses the same value object on every iteration of this loop,
            // so each record is deep-copied before being kept in the list.
            StudentExamRecord record = new StudentExamRecord();
            try {
                BeanUtils.copyProperties(record, value);
            } catch (IllegalAccessException | InvocationTargetException e) {
                throw new RuntimeException(e);
            }
            records.add(record);
        }
        System.out.println("***********key="+key+",************values="+records);
        for (StudentExamRecord record : records) {
            context.write(key,record);
        }
    }
}

The driver class

package com.secondarysort;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.log4j.BasicConfigurator;

import java.io.IOException;

/**
 * { @author }: Cui
 * <p>
 * { @DateTime }:  December 27, 2023
 * <p>
 * { @ClassName }:  com.secondarysort.SecondarySortLaunch
 * <p>
 * { @Description }: The MapReduce driver class
 * <p>
 * All rights Reserved, Designed By Cui
 * { @Copyright }:  2023-2023
 */

public class SecondarySortLaunch {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        BasicConfigurator.configure();
        Job job = Job.getInstance();
        job.setJobName("secondarySort");
        job.setJarByClass(SecondarySortLaunch.class);
        job.setMapperClass(SecondarySortMapper.class);
        job.setReducerClass(SecondarySortReducer.class);

        job.setMapOutputKeyClass(StudentExamRecord.class);
        job.setMapOutputValueClass(StudentExamRecord.class);

        job.setOutputKeyClass(StudentExamRecord.class);
        job.setOutputValueClass(StudentExamRecord.class);

        job.setPartitionerClass(CustomerPartition.class);
        job.setNumReduceTasks(4);


        // Define a comparator that controls which keys are grouped together for a single call to
        // Reducer.reduce; the values of all keys in one group are passed to reduce() in one Iterable.
        job.setGroupingComparatorClass(CustomerGroupingComparator.class);

        FileInputFormat.addInputPath(job,new Path("input"));
        FileSystem fs = FileSystem.get(job.getConfiguration());
        Path out = new Path("output");
        if (fs.exists(out)) {
            fs.delete(out,true);
        }
        FileOutputFormat.setOutputPath(job,out);

        boolean b = job.waitForCompletion(true);
        System.exit(b?0:1);
    }
}

Output data

/output/part-r-00000

StudentExamRecord{ordId='org_002', userId='userId_002', examScore=100.0}    StudentExamRecord{ordId='org_002', userId='userId_002', examScore=3.0}
StudentExamRecord{ordId='org_002', userId='userId_002', examScore=100.0}    StudentExamRecord{ordId='org_002', userId='userId_002', examScore=8.0}
StudentExamRecord{ordId='org_002', userId='userId_002', examScore=100.0}    StudentExamRecord{ordId='org_002', userId='userId_002', examScore=11.0}
StudentExamRecord{ordId='org_002', userId='userId_002', examScore=100.0}    StudentExamRecord{ordId='org_002', userId='userId_002', examScore=12.0}
StudentExamRecord{ordId='org_002', userId='userId_002', examScore=100.0}    StudentExamRecord{ordId='org_002', userId='userId_002', examScore=14.0}
StudentExamRecord{ordId='org_002', userId='userId_002', examScore=100.0}    StudentExamRecord{ordId='org_002', userId='userId_002', examScore=16.0}
......

/output/part-r-00001

StudentExamRecord{ordId='org_001', userId='userId_002', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_002', examScore=5.0}
StudentExamRecord{ordId='org_001', userId='userId_002', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_002', examScore=9.0}
StudentExamRecord{ordId='org_001', userId='userId_002', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_002', examScore=9.0}
StudentExamRecord{ordId='org_001', userId='userId_002', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_002', examScore=11.0}
StudentExamRecord{ordId='org_001', userId='userId_002', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_002', examScore=14.0}
StudentExamRecord{ordId='org_001', userId='userId_002', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_002', examScore=15.0}
......

/output/part-r-00002

StudentExamRecord{ordId='org_002', userId='userId_001', examScore=95.0} StudentExamRecord{ordId='org_002', userId='userId_001', examScore=3.0}
StudentExamRecord{ordId='org_002', userId='userId_001', examScore=95.0} StudentExamRecord{ordId='org_002', userId='userId_001', examScore=4.0}
StudentExamRecord{ordId='org_002', userId='userId_001', examScore=95.0} StudentExamRecord{ordId='org_002', userId='userId_001', examScore=9.0}
StudentExamRecord{ordId='org_002', userId='userId_001', examScore=95.0} StudentExamRecord{ordId='org_002', userId='userId_001', examScore=10.0}
StudentExamRecord{ordId='org_002', userId='userId_001', examScore=95.0} StudentExamRecord{ordId='org_002', userId='userId_001', examScore=12.0}
......

/output/part-r-00003

StudentExamRecord{ordId='org_001', userId='userId_001', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_001', examScore=2.0}
StudentExamRecord{ordId='org_001', userId='userId_001', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_001', examScore=8.0}
StudentExamRecord{ordId='org_001', userId='userId_001', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_001', examScore=16.0}
StudentExamRecord{ordId='org_001', userId='userId_001', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_001', examScore=28.0}
StudentExamRecord{ordId='org_001', userId='userId_001', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_001', examScore=30.0}
StudentExamRecord{ordId='org_001', userId='userId_001', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_001', examScore=30.0}
......

Closing note: there is still plenty of room for improvement here; comments and discussion are welcome.
