MapReduce Partitioning and Grouping
The MapReduce Execution Flow
The MapReduce workflow can be roughly divided into five steps, as follows:
Splitting and Formatting the Data Source
After MR's InputFormat reads the input data source, the data is split and formatted. InputFormat is the abstract parent class for reading data sources in MR; the classes that read the various kinds of data sources, such as FileInputFormat and DBInputFormat, all extend InputFormat.
Splitting is implemented by the getSplits method of the InputFormat abstract class. This method logically divides the input file set of the MR Job object; each resulting input split is assigned to a separate Mapper for processing.
Note: splits are logical slices of the input files; the input files are not physically divided into blocks. For example, a split might be the tuple <input file path, start position, offset>.
Formatting is implemented by the createRecordReader method of the InputFormat abstract class. Its return value is a RecordReader<K,V>, whose key-value pairs are exactly the Mapper's input key-value pairs. The splits returned by getSplits are handled by createRecordReader, which creates a record reader (RecordReader) for a given split; before the split is used, the MR framework calls RecordReader.initialize(InputSplit, TaskAttemptContext) on the returned RecordReader<K,V>.
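For reference, a simplified excerpt of what these two methods look like on the org.apache.hadoop.mapreduce.InputFormat abstract class (signatures as in the Hadoop 2.x/3.x API, Javadoc omitted):
public abstract class InputFormat<K, V> {
  // Logically split the job's input; one Mapper is later created per InputSplit.
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;
  // Create the RecordReader that turns one split into the Mapper's input <K, V> pairs.
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;
}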
Executing the MapTask
After splitting and formatting, one Mapper runs a MapTask for each split. Every Map task has an in-memory buffer (100 MB by default); the result key-value pairs produced by the Map task from the input split are written into this buffer.
When the data written reaches the buffer's threshold of 80% (80 MB), a thread is started to spill the buffered data to a small file on disk, without stopping the Map from writing further result key-value pairs into the buffer.
During a spill, the MapReduce framework sorts the data by key. If the Map output is large, several spill files are produced, and the data left in the buffer at the end is also spilled to disk as a final spill file. If there are multiple spill files, they are merged into a single file at the end.
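Both the buffer size and the spill threshold can be configured per job. A minimal sketch of overriding the defaults (100 MB / 0.80) through the job Configuration, assuming the Hadoop 2.x/3.x property names; the values here are only examples:
Configuration conf = new Configuration();
conf.setInt("mapreduce.task.io.sort.mb", 200);            // in-memory sort buffer size in MB (default 100)
conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f); // spill threshold as a fraction (default 0.80)
Job job = Job.getInstance(conf);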
Executing the Shuffle Phase
After the MapTasks have finished, the data processed in the Map phase is reorganized and handed over to the Reducers; this process is called the Shuffle.
The Shuffle distributes the MapTask output to the ReduceTasks and, while doing so, partitions and sorts the data by key. Data in different partitions is processed in parallel by different ReduceTasks.
Executing the ReduceTask
When a ReduceTask runs, the key-value pairs <Rk,Rv> received by the Reducer are built from the Mapper's output key-value pairs <Mk,Mv>: Rk is the Mk from the Map output, and Rv is the collection of all values Mv that share that same Mk, i.e. Rv = List<Mv>. The Reducer input records with the same Mk form one group, and reduce() is called once for that group.
It follows that within each Reduce partition, the ReduceTask processes the data group by group: the Reducer's reduce() method is called once for every group.
The ReduceTask's input stream has the form <key, {value list}>. Users can implement their own logic in reduce(), and the final result is emitted in the form <key, value>.
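As a minimal, self-contained illustration of the <key, {value list}> form (a word-count style reducer that sums counts per word, unrelated to the secondary-sort example later in this article; Text and IntWritable come from org.apache.hadoop.io):
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) { // values holds every Map output value sharing this key
      sum += v.get();
    }
    context.write(key, new IntWritable(sum)); // one <key, value> result per group
  }
}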
Writing to Files
The MapReduce framework automatically passes the <key, value> pairs produced by the ReduceTasks to OutputFormat's write method, which writes the data to files or some other target (determined by the concrete class implementing the abstract OutputFormat).
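For example, with the commonly used TextOutputFormat each <key, value> pair is written as one line of text. Which OutputFormat a job uses, and where the result files go, is set on the Job; a brief sketch (the "output" path is just an example):
job.setOutputFormatClass(TextOutputFormat.class);        // org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
FileOutputFormat.setOutputPath(job, new Path("output")); // directory the result files are written to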
MR Partitioning
MR partitioning happens during the Shuffle. After the Map phase has processed its data, and before the data is sorted and merged into a single temporary file, the configured partitioner (Partitioner) assigns a partition to every key-value pair about to be written to disk (that is, the Mapper's <outkey, outvalue>). Afterwards, the ReduceTask of each partition pulls that partition's data from the temporary files of all MapTasks.
By default MR uses only one ReduceTask; the number of ReduceTasks can be customized in the MR driver class, like this:
Job job = Job.getInstance(); // create the MR job object
...
job.setNumReduceTasks(4); // set the number of reduce tasks to 4
In general, the number of partitions should equal the number of ReduceTasks, so that each partition is handled by one ReduceTask.
The Data Partitioner: Partitioner
The data partitioner partitions the data by the key of the Mapper's output. The default partitioner in MR (MapReduce) is HashPartitioner.
/** Partition keys by their {@link Object#hashCode()}. */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class HashPartitioner<K, V> extends Partitioner<K, V> {

  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
Parameters:
K key: the key to be partitioned
V value: the value of the key-value pair (entry)
numReduceTasks: the total number of partitions
Return value:
the partition number (index) of the partition the key belongs to
Description:
HashPartitioner applies a hash function to the key of each Mapper output key-value pair, "scattering" the Mapper output, and then takes the key's hash value modulo the numReduceTasks argument to obtain the partition number the pair belongs to. The number of partitions equals the number of ReduceTasks of the job, so the partitioner (HashPartitioner) controls which ReduceTask a Map output key (and its values) is sent to for aggregation (from the original documentation: "The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent for reduction.").
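A small standalone sketch of that computation, using made-up string keys, makes the mapping concrete (the & Integer.MAX_VALUE mask clears the sign bit, so a negative hashCode still yields a non-negative partition index):
int numReduceTasks = 4;
for (String key : new String[]{"org_001", "org_002", "userId_001"}) { // hypothetical keys
  int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  System.out.println(key + " -> partition " + partition);
}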
Custom partitioners:
If an MR program needs custom partitioning, you can define your own partitioner class by extending Partitioner<KEY, VALUE> and implementing its abstract method getPartition(KEY key, VALUE value, int numPartitions).
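A minimal sketch of such a class, assuming Text keys and a hypothetical rule that routes keys beginning with "org_001" to partition 0 (the complete partitioner actually used by the example later in this article appears below):
public class OrgPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    if (key.toString().startsWith("org_001")) { // hypothetical routing rule
      return 0;
    }
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions; // fall back to hash partitioning
  }
}
The class is then registered on the Job with job.setPartitionerClass(OrgPartitioner.class).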
MR Grouping
MR Sorting
Before introducing grouping in MR (MapReduce), it is worth explaining how MR sorts its data. Sorting happens in both the MapTask stage and the ReduceTask stage.
MapTask stage:
- Whenever a MapTask spills the data in its in-memory buffer to a small file, the data is sorted by key, so the data in each spill file is ordered. (Note: sorting by key requires either that the key type implements the comparison method of the WritableComparable<ClassName> interface, or that a key comparator is set via Job.setSortComparatorClass(Class<? extends org.apache.hadoop.io.RawComparator> cls).)
- When a MapTask merges its spill files into one complete data file, files are merged and sorted repeatedly until the merge is complete.
- In the end, each MapTask produces a single complete, sorted data file.
A custom key comparator class, used as the argument to Job.setSortComparatorClass:
public class FirstSort extends WritableComparator {
  public FirstSort() {
    super(StudentExamRecord.class, true);
  }
  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    StudentExamRecord left = (StudentExamRecord) a;
    StudentExamRecord right = (StudentExamRecord) b;
    return left.compareTo(right);
  }
}
ReduceTask stage:
The Reducer's input is the Mapper's output. The ReduceTask of each partition has to pull (copy) its partition's data from the file produced by every MapTask and merge-sort it: although the data inside each MapTask's output file is ordered, there is no guarantee that the data a ReduceTask pulls from the different MapTask files is ordered as a whole, because the order within each MapTask file is only local compared with the overall order the ReduceTask's partition needs.
Grouping
Given the sorting described above, the ReduceTask's data is now ordered (here, and throughout this article, "ordered" refers only to the keys, not the values). To make this easier to follow, a conceptual diagram of the ReduceTask's data is shown below:
As the diagram shows, the ReduceTask groups its data by key: the key-value pairs are compared one by one, and whenever two adjacent keys differ, the current group is closed and the next one begins. In the diagram, when key_n and key_n+1 are compared and found to be different, the previous group ends; key_n becomes the key, the collection of value_1, value_2, ..., value_n becomes the value, and this pair is passed to the Reducer as its input <keyIn, valueIn>. This continues until all data in the partition has been grouped and handed to the Reducer for aggregation.
Grouping depends entirely on the keys of the Reducer's input data, so deciding group membership comes down to deciding how keys are compared. By default, grouping relies on the compareTo() method defined on the key type.
Custom Grouping
Custom grouping is configured through the MR Job object, as follows:
Job job = Job.getInstance();
...
// Define a comparator that controls which keys are grouped together for a single call to
// Reducer.reduce(); the values of the keys in one group are collected into a single iterable.
job.setGroupingComparatorClass(CustomerGroupingComparator.class);
Here CustomerGroupingComparator is a custom comparator that decides which keys belong to the same group. When the comparator returns 0, the two keys are considered equal, i.e. their data belongs to one group.
CustomerGroupingComparator must extend WritableComparator and call its parent constructor.
Code Example
Input data
/input/input1.data
org_001,userId_001,30
org_002,userId_002,12
org_002,userId_002,54
org_001,userId_002,67
org_001,userId_001,83
org_002,userId_001,54
org_002,userId_001,4
......
/input/input2.data
org_001,userId_001,16
org_002,userId_002,36
org_002,userId_002,16
org_001,userId_002,73
org_001,userId_001,83
......
/input/input3.data
org_001,userId_001,70
org_002,userId_002,8
org_002,userId_002,69
org_001,userId_002,79
org_001,userId_001,93
org_002,userId_001,10
org_002,userId_001,48
org_001,userId_002,19
org_001,userId_002,78
......
Entity class
An entity class with the attributes orgId, userId, and examScore:
package com.secondarysort;
import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Objects;
/**
* { @author }: Cui
* <p>
* { @DateTime }: December 27, 2023
* <p>
* { @ClassName }: com.secondarysort.StudentExamRecord
* <p>
* { @Description }: Entity class for student exam records
* <p>
* All rights Reserved, Designed By Cui
* { @Copyright }: 2023-2023
*/
public class StudentExamRecord implements WritableComparable<StudentExamRecord> {
private String orgId;
private String userId;
private double examScore;
public StudentExamRecord() {
}
public StudentExamRecord(String orgId, String userId, double examScore) {
this.orgId = orgId;
this.userId = userId;
this.examScore = examScore;
}
public String getOrgId() {
return orgId;
}
public void setOrgId(String orgId) {
this.orgId = orgId;
}
public String getUserId() {
return userId;
}
public void setUserId(String userId) {
this.userId = userId;
}
public double getExamScore() {
return examScore;
}
public void setExamScore(double examScore) {
this.examScore = examScore;
}
// hashCode is tweaked per (orgId, userId) combination so that, with 4 reduce tasks,
// each of the four combinations in the sample data lands in its own partition.
@Override
public int hashCode() {
if(this.orgId.equals("org_001") && this.userId.equals("userId_002")){
return Objects.hash(orgId, userId)+1;
}else if(this.orgId.equals("org_002") && this.userId.equals("userId_001")){
return Objects.hash(orgId, userId)%3;
}else if(this.orgId.equals("org_002") && this.userId.equals("userId_002")){
return Objects.hash(orgId, userId)+1;
}
return Objects.hash(orgId, userId);
}
@Override
public String toString() {
return "StudentExamRecord{" +
"ordId='" + orgId + '\'' +
", userId='" + userId + '\'' +
", examScore=" + examScore +
'}';
}
@Override
public int compareTo(StudentExamRecord o) {
if(this.orgId.equals(o.getOrgId())){
if(this.userId.equals(o.getUserId())){
if(this.examScore == o.getExamScore()){
return 0;
}else{
return this.examScore>o.getExamScore()?1:-1;
}
}else{
return this.userId.compareTo(o.getUserId());
}
}else{
return this.orgId.compareTo(o.getOrgId());
}
}
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(this.orgId);
out.writeUTF(this.userId);
out.writeDouble(this.examScore);
}
@Override
public void readFields(DataInput in) throws IOException {
this.orgId = in.readUTF();
this.userId = in.readUTF();
this.examScore = in.readDouble();
}
}
Custom partitioner
package com.secondarysort;
import org.apache.hadoop.mapreduce.Partitioner;
/**
* { @author }: Cui
* <p>
* { @DateTime }: December 27, 2023
* <p>
* { @ClassName }: com.secondarysort.CustomerPartition
* <p>
* { @Description }: Custom partitioner
* <p>
* All rights Reserved, Designed By Cui
* { @Copyright }: 2023-2023
*/
public class CustomerPartition extends Partitioner<StudentExamRecord,StudentExamRecord> {
@Override
public int getPartition(StudentExamRecord key, StudentExamRecord value, int numPartitions) {
return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
}
}
Custom grouping comparator
package com.secondarysort;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
/**
* { @author }: Cui
* <p>
* { @DateTime }: December 27, 2023
* <p>
* { @ClassName }: com.secondarysort.CustomerGroupingComparator
* <p>
* { @Description }: Custom grouping comparator
* <p>
* All rights Reserved, Designed By Cui
* { @Copyright }: 2023-2023
*/
public class CustomerGroupingComparator extends WritableComparator {
public CustomerGroupingComparator() {
// Two arguments are required: the first is the Class of the key type; the second decides whether
// key instances are created. For grouping to work, it must be set to true.
super(StudentExamRecord.class,true);
}
@Override
public int compare(WritableComparable a, WritableComparable b) {
StudentExamRecord left = (StudentExamRecord) a;
StudentExamRecord right = (StudentExamRecord) b;
if(left.getOrgId().equals(right.getOrgId())){
return left.getUserId().compareTo(right.getUserId());
}else{
return left.getOrgId().compareTo(right.getOrgId());
}
}
}
The MR program's Mapper
package com.secondarysort;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/**
* { @author }: Cui
* <p>
* { @DateTime }: December 27, 2023
* <p>
* { @ClassName }: com.secondarysort.SecondarySortMapper
* <p>
* { @Description }: MR Mapper
* <p>
* All rights Reserved, Designed By Cui
* { @Copyright }: 2023-2023
*/
public class SecondarySortMapper extends Mapper<LongWritable, Text,StudentExamRecord,StudentExamRecord> {
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, StudentExamRecord, StudentExamRecord>.Context context) throws IOException, InterruptedException {
String[] split = value.toString().split(",");
StudentExamRecord out = new StudentExamRecord(split[0], split[1], Double.parseDouble(split[2]));
context.write(out,out);
}
}
The MR program's Reducer
package com.secondarysort;
import org.apache.commons.beanutils.BeanUtils;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
import java.lang.reflect.InvocationTargetException;
import java.util.ArrayList;
import java.util.List;
/**
* { @author }: Cui
* <p>
* { @DateTime }: December 27, 2023
* <p>
* { @ClassName }: com.secondarysort.SecondarySortReducer
* <p>
* { @Description }: MR Reducer
* <p>
* All rights Reserved, Designed By Cui
* { @Copyright }: 2023-2023
*/
public class SecondarySortReducer extends Reducer<StudentExamRecord,StudentExamRecord,StudentExamRecord,StudentExamRecord> {
@Override
protected void reduce(StudentExamRecord key, Iterable<StudentExamRecord> values, Reducer<StudentExamRecord, StudentExamRecord, StudentExamRecord, StudentExamRecord>.Context context) throws IOException, InterruptedException {
List<StudentExamRecord> records = new ArrayList<>();
// Hadoop reuses the same value object while iterating over values, so each record is copied
// into a new object before being stored in the list.
for (StudentExamRecord value : values) {
StudentExamRecord record = new StudentExamRecord();
try {
BeanUtils.copyProperties(record,value);
} catch (IllegalAccessException e) {
throw new RuntimeException(e);
} catch (InvocationTargetException e) {
throw new RuntimeException(e);
}
records.add(record);
}
System.out.println("***********key="+key+",************values="+records);
for (StudentExamRecord record : records) {
context.write(key,record);
}
}
}
The MR program's driver class
package com.secondarysort;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.log4j.BasicConfigurator;
import java.io.IOException;
/**
* { @author }: Cui
* <p>
* { @DateTime }: December 27, 2023
* <p>
* { @ClassName }: com.secondarysort.SecondarySortLaunch
* <p>
* { @Description }: MR driver class
* <p>
* All rights Reserved, Designed By Cui
* { @Copyright }: 2023-2023
*/
public class SecondarySortLaunch {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
BasicConfigurator.configure();
Job job = Job.getInstance();
job.setJobName("secondarySort");
job.setJarByClass(SecondarySortLaunch.class);
job.setMapperClass(SecondarySortMapper.class);
job.setReducerClass(SecondarySortReducer.class);
job.setMapOutputKeyClass(StudentExamRecord.class);
job.setMapOutputValueClass(StudentExamRecord.class);
job.setOutputKeyClass(StudentExamRecord.class);
job.setOutputValueClass(StudentExamRecord.class);
job.setPartitionerClass(CustomerPartition.class);
job.setNumReduceTasks(4);
// Define a comparator that controls which keys are grouped together for a single call to
// Reducer.reduce(); the values of the keys in one group are collected into a single iterable.
job.setGroupingComparatorClass(CustomerGroupingComparator.class);
FileInputFormat.addInputPath(job,new Path("input"));
FileSystem fs = FileSystem.get(job.getConfiguration());
Path out = new Path("output");
if (fs.exists(out)) {
fs.delete(out,true);
}
FileOutputFormat.setOutputPath(job,out);
boolean b = job.waitForCompletion(true);
System.exit(b?0:1);
}
}
Output data
/output/part-r-00000
StudentExamRecord{ordId='org_002', userId='userId_002', examScore=100.0} StudentExamRecord{ordId='org_002', userId='userId_002', examScore=3.0}
StudentExamRecord{ordId='org_002', userId='userId_002', examScore=100.0} StudentExamRecord{ordId='org_002', userId='userId_002', examScore=8.0}
StudentExamRecord{ordId='org_002', userId='userId_002', examScore=100.0} StudentExamRecord{ordId='org_002', userId='userId_002', examScore=11.0}
StudentExamRecord{ordId='org_002', userId='userId_002', examScore=100.0} StudentExamRecord{ordId='org_002', userId='userId_002', examScore=12.0}
StudentExamRecord{ordId='org_002', userId='userId_002', examScore=100.0} StudentExamRecord{ordId='org_002', userId='userId_002', examScore=14.0}
StudentExamRecord{ordId='org_002', userId='userId_002', examScore=100.0} StudentExamRecord{ordId='org_002', userId='userId_002', examScore=16.0}
......
/output/part-r-00001
StudentExamRecord{ordId='org_001', userId='userId_002', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_002', examScore=5.0}
StudentExamRecord{ordId='org_001', userId='userId_002', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_002', examScore=9.0}
StudentExamRecord{ordId='org_001', userId='userId_002', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_002', examScore=9.0}
StudentExamRecord{ordId='org_001', userId='userId_002', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_002', examScore=11.0}
StudentExamRecord{ordId='org_001', userId='userId_002', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_002', examScore=14.0}
StudentExamRecord{ordId='org_001', userId='userId_002', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_002', examScore=15.0}
......
/output/part-r-00002
StudentExamRecord{ordId='org_002', userId='userId_001', examScore=95.0} StudentExamRecord{ordId='org_002', userId='userId_001', examScore=3.0}
StudentExamRecord{ordId='org_002', userId='userId_001', examScore=95.0} StudentExamRecord{ordId='org_002', userId='userId_001', examScore=4.0}
StudentExamRecord{ordId='org_002', userId='userId_001', examScore=95.0} StudentExamRecord{ordId='org_002', userId='userId_001', examScore=9.0}
StudentExamRecord{ordId='org_002', userId='userId_001', examScore=95.0} StudentExamRecord{ordId='org_002', userId='userId_001', examScore=10.0}
StudentExamRecord{ordId='org_002', userId='userId_001', examScore=95.0} StudentExamRecord{ordId='org_002', userId='userId_001', examScore=12.0}
......
/output/part-r-00003
StudentExamRecord{ordId='org_001', userId='userId_001', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_001', examScore=2.0}
StudentExamRecord{ordId='org_001', userId='userId_001', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_001', examScore=8.0}
StudentExamRecord{ordId='org_001', userId='userId_001', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_001', examScore=16.0}
StudentExamRecord{ordId='org_001', userId='userId_001', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_001', examScore=28.0}
StudentExamRecord{ordId='org_001', userId='userId_001', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_001', examScore=30.0}
StudentExamRecord{ordId='org_001', userId='userId_001', examScore=99.0} StudentExamRecord{ordId='org_001', userId='userId_001', examScore=30.0}
......
Closing note: there is plenty of room for improvement here; comments and discussion for mutual learning are welcome.