23. Source Code Analysis of MapReduce Job Submission

This article takes an in-depth look at how a MapReduce Job is submitted, covering the macro flow, the two run modes, and the source code itself. At the source level it walks through the pre-submission stage using the WordCount example (ensuring the job state, selecting the API, connecting to the cluster, and so on), and then follows the execution details of the Map and Reduce phases: how a MapTask runs, how data is partitioned, sorted and serialized, and how a ReduceTask executes.

1. The overall flow of MapReduce

  1. Two phases: a Map phase and a Reduce phase

  2. A MapReduce task is a Job; as a Job moves through its phases it starts a number of Tasks
    The processes started in the Map phase are called MapTasks
    The number of MapTasks is determined by the number of input splits: N splits start N MapTasks
    The processes started in the Reduce phase are called ReduceTasks
    The number of ReduceTasks is set by the developer:

job.setNumReduceTasks(n);
  3. Between Map and Reduce there is a process that partitions, sorts and transfers the data emitted by Map.
    This process is so important that it gets its own name: shuffle!

    On top of the two-phase division, the flow is further broken down into Map----------shuffle------------Reduce

    Shuffle does not start a separate process; it spans both the MapTask and the ReduceTask!

4. The official phase breakdown
Map phase: map, sort
Reduce phase: copy, sort, reduce

	map (map phase) --- sort | copy | sort (shuffle phase) --- reduce (reduce phase)

2. MapReduce has two run modes

  1. local (local mode): when the job is submitted through LocalJobRunner, it runs locally!
    MapTasks and ReduceTasks are simulated with threads on the local machine.
    (This is the mode used for the walkthrough below.)

  2. YARN (running on YARN): when the job is submitted through YARNRunner, an MRAppMaster process is
    initialized before the Job runs; this process asks the RM for the resources needed to run all the Tasks!

    The RM hands the request to NMs; the resources an NM provides are packaged into a Container, and the Task process is started inside it!
    (This is why you never see an MRAppMaster in local mode; a configuration sketch for choosing the mode follows below.)
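Which runner gets used is driven purely by configuration. As a rough sketch (the property names below are the standard Hadoop 2.x keys; the host names are placeholders, not part of the original example), the driver could pin the mode explicitly before creating the Job:

Configuration conf = new Configuration();
// Local mode (also the default when nothing is configured): LocalJobRunner simulates tasks with threads
conf.set("mapreduce.framework.name", "local");

// YARN mode: YARNRunner submits the job to the cluster and an MRAppMaster is started for it
// conf.set("mapreduce.framework.name", "yarn");
// conf.set("yarn.resourcemanager.hostname", "rm-host");          // placeholder RM host
// conf.set("fs.defaultFS", "hdfs://namenode-host:8020");         // placeholder HDFS address

Job job = Job.getInstance(conf);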

3. Source Code Analysis

We will use the WordCount example to walk through the source.

3.1 The WordCount example code

The code is as follows.
Mapper class:

package andy.mywc;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * LongWritable: the byte offset of the line (input key)
 * Text: the input value, here one line of text
 * Text: the key emitted by the map phase
 * IntWritable: the value emitted by the map phase
 */
public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable>  {
    Text k = new Text();
    IntWritable v = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1. Get the line as a String (convert Text to String)
        String line = value.toString();

        // 2. Split the line into an array of words
        String [] worlds = line.split(" ");

        // 3. Iterate over the word array
        for (String world : worlds) {

            // 4. Set the output key to the current word (the value is the constant 1)
            k.set(world);

            // 5. Emit the pair
            context.write(k,v);
        }

    }
}

Reducer class:

package andy.mywc;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * The reducer's input types match the mapper's output types:
 * Text: the reduce input key
 * IntWritable: the reduce input value
 * Text: the reduce output key
 * IntWritable: the reduce output value
 */
public class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {


    /**
     * reduce() is invoked once per distinct key
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;

        // 1. Sum up the values
        for (IntWritable value : values) {
            sum += value.get();
        }

        IntWritable v = new IntWritable();
        v.set(sum);

        // 2. Write out the total count for this key
        context.write(key, v);
    }
}

Driver class:

package andy.mywc;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WCDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        if(args == null ||  args.length < 2)
            args = new String[] {"E:\\temp\\input","E:\\temp\\output3"};
        //1. Create the configuration and the job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        //2. Set the jar by class
        job.setJarByClass(WCDriver.class);

        //3. Wire up the mapper and the reducer
        job.setMapperClass(WCMapper.class);
        job.setReducerClass(WCReducer.class);

        //4. Set the key/value types emitted by the Mapper and by the Reducer
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        //5. Set the input and output paths
        FileInputFormat.setInputPaths(job,args[0]);
        FileOutputFormat.setOutputPath(job,new Path(args[1]));

        //6. Submit the job and wait for it to finish
        boolean b = job.waitForCompletion(true);

        System.exit( b ? 0 : 1);
    }
}

3.2 Pre-processing before the Job is submitted

(1) Starting point

The program starts in the driver's main function. Everything before the last call just configures the job; the actual submission happens in

boolean b = job.waitForCompletion(true);

so that is where we start.

(2) waitForCompletion

public boolean waitForCompletion(boolean verbose
                                   ) throws IOException, InterruptedException,
                                            ClassNotFoundException {
    if (state == JobState.DEFINE) {
      submit(); 
    }
    if (verbose) {
      monitorAndPrintJob();
    } else {
      // get the completion poll interval from the client.
      int completionPollIntervalMillis = 
        Job.getCompletionPollInterval(cluster.getConf());
      while (!isComplete()) {
        try {
          Thread.sleep(completionPollIntervalMillis);
        } catch (InterruptedException ie) {
        }
      }
    }
    return isSuccessful();
  }
  

The code below uses the verbose flag to decide whether to print the job's progress, and it polls periodically until the job finishes. Reaching this part means the job has already been submitted:

if (verbose) {
      monitorAndPrintJob();
    } else {
      // get the completion poll interval from the client.
      int completionPollIntervalMillis = 
        Job.getCompletionPollInterval(cluster.getConf());
      while (!isComplete()) {
        try {
          Thread.sleep(completionPollIntervalMillis);
        } catch (InterruptedException ie) {
        }
      }
    }
The check at the top, if (state == JobState.DEFINE), verifies that the job has not been submitted yet; if it is still in the DEFINE state, submit() is called.

(3) submit

public void submit() 
         throws IOException, InterruptedException, ClassNotFoundException {
    ensureState(JobState.DEFINE);
    setUseNewAPI();
    connect();
    final JobSubmitter submitter = 
        getJobSubmitter(cluster.getFileSystem(), cluster.getClient());
    status = ugi.doAs(new PrivilegedExceptionAction<JobStatus>() {
      public JobStatus run() throws IOException, InterruptedException, 
      ClassNotFoundException {
        return submitter.submitJobInternal(Job.this, cluster);
      }
    });
    state = JobState.RUNNING;
    LOG.info("The url to track the job: " + getTrackingURL());
   }
1) ensureState(JobState.DEFINE);
This line checks the job's state once more.

2) setUseNewAPI();
Hadoop ships two MapReduce APIs, the old 1.x (mapred) API and the new 2.x (mapreduce) API; this call configures the job to use the new one.
Next, let's look at the connect function.

(4) connect

In connect(), a Cluster object is created from the configuration; this is where local mode versus YARN mode is decided (in YARN mode a YARN cluster object is returned). The Cluster object represents the cluster and its resources.
Once the cluster object has been created we return to submit().

(5) getJobSubmitter

After connect() returns, submit() calls getJobSubmitter().
This produces a JobSubmitter, the object that actually submits the job; it carries a great deal of information.

(6) submitJobInternal

Next, submitter.submitJobInternal() is executed; the core of job submission lives in this function, and it is quite long.

1) checkSpecs(job);

checkSpecs() first looks at the number of reducers and whether it is 0 (which would mean there is effectively no reduce phase).
It then checks whether the new API is being used; if so, the else branch runs the following:

ReflectionUtils.newInstance(job.getOutputFormatClass(),
          job.getConfiguration());

which obtains the OutputFormat class:

 public Class<? extends OutputFormat<?,?>> getOutputFormatClass() 
     throws ClassNotFoundException {
    return (Class<? extends OutputFormat<?,?>>) 
      conf.getClass(OUTPUT_FORMAT_CLASS_ATTR, TextOutputFormat.class);
  }

If nothing is configured, TextOutputFormat.class is used, so the default output format is TextOutputFormat. After obtaining the output format class, the output specification is checked: whether an output path is set, whether the directory already exists, whether there is enough space, and so on (an exception is thrown if no output path is set or if the output directory already exists). A small workaround used during development follows below.
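Because an existing output directory makes the job fail right at this check, a common convenience is to delete it from the driver before submitting. A minimal sketch for WCDriver (it uses org.apache.hadoop.fs.FileSystem, which needs an extra import; deleting data is destructive, so this is only meant for test runs):

// inside WCDriver.main, before job.waitForCompletion(true)
FileSystem fs = FileSystem.get(conf);
Path output = new Path(args[1]);
if (fs.exists(output)) {
    fs.delete(output, true);   // recursively remove the old output directory
}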

2) Back in submitJobInternal

addMRFrameworkToDistributedCache(conf); — adds the MapReduce framework to the distributed cache.

Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster, conf);
Path submitJobDir = new Path(jobStagingArea, jobId.toString()); // create the job's temporary submission directory
// The staging directory is, by default:
//   on the local file system: under the tmp directory of the drive the IDE project runs on
//   on HDFS: under /tmp

3) writeSplits
line 196 : int maps = writeSplits(job, submitJobDir);

This is the split-computing function: the input splits are produced here, and maps is the number of splits.

Usually, when splits are derived from files, FileSplit is used as the split object!
This will come up again later in the flow.
FileSplit has three key attributes:

  Path p: which file the split belongs to

  long start: the offset in that file where the split begins

  long length: the length of the split

Reading length bytes from the file, starting at start, yields exactly the data of this split!

A split has no inherent relationship with HDFS blocks! The split's data is physically stored as blocks on HDFS, and reading the split simply touches whichever blocks hold those bytes; which blocks a given split maps to depends entirely on how the splits were cut, there is no fixed correspondence!

Note: by default, FileInputFormat uses the block size as the split size,

so a split happens to cover exactly one block! (The sketch below shows how this can be tuned.)
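The split size can be influenced from the driver. A small sketch, assuming the new-API FileInputFormat/CombineTextInputFormat from org.apache.hadoop.mapreduce.lib.input (both already imported in WCDriver); the byte values are placeholders:

// Cap splits at 64 MB; the effective split size is max(minSize, min(maxSize, blockSize))
FileInputFormat.setMinInputSplitSize(job, 1);
FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

// Alternative for many small files: CombineTextInputFormat packs several files into one split
// job.setInputFormatClass(CombineTextInputFormat.class);
// CombineTextInputFormat.setMaxInputSplitSize(job, 4L * 1024 * 1024);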

Back to submitJobInternal().

4) writeConf

line 234 : writeConf(conf, submitJobFile);
This writes the job configuration into the staging directory described above (the split files were written there by writeSplits). After this step the staging directory contains:

job.xml: all of the Job's configuration, merged from the xxx-default.xml and xxx-site.xml files
job.split: the serialized split objects
job.splitmetainfo: metadata describing those split objects

5) submitClient.submitJob

line 240: submitClient.submitJob(...) — this is where the job is actually submitted.
In local mode this creates a new Job object (LocalJobRunner.Job).

  1. Lines 150-151 create the necessary configuration objects from the xml files in the staging directory
  2. Lines 152 to 188 apply various pieces of job configuration
  3. Line 190, this.start(); kicks off the job.
    A note:
    since we are running in local mode there is no MRAppMaster, so LocalJobRunner.Job stands in for it.
    LocalJobRunner.Job: functions much like MRAppMaster, responsible for requesting resources for and driving the run of the entire Job!

a) new Job()

b) calling Job.start() is the equivalent of starting the MRAppMaster

c) entering Job.run() is the equivalent of the MRAppMaster starting to work
So after start(), the Job's run() method executes. We will set run() aside for now and come back to it.

Once start() has been called, control returns to waitForCompletion(), which monitors the job until it finishes and prints its progress. The submission work is now done; what follows is the MapReduce execution itself.

3.3 The Map phase

(1) We said that in local mode LocalJobRunner.Job stands in for MRAppMaster, so let's look at Job's run() method.

public void run() {
      
      
      try {
          //Build TaskSplitMetaInfo[] from the previously written job.split and job.splitmetainfo files
          // one split metadata object per split that was computed earlier
        TaskSplitMetaInfo[] taskSplitMetaInfos = 
          SplitMetaInfoReader.readSplitMetaInfo(jobId, localFs, conf, systemJobDir);

       
		// Map that records, for every MapTask, information about the output files it produces
        Map<TaskAttemptID, MapOutputFile> mapOutputFiles =
            Collections.synchronizedMap(new HashMap<TaskAttemptID, MapOutputFile>());
        
          // Create one MapTask runnable per split
        List<RunnableWithThrowable> mapRunnables = getMapTaskRunnables(
            taskSplitMetaInfos, jobId, mapOutputFiles);
              
        initCounters(mapRunnables.size(), numReduceTasks);
        ExecutorService mapService = createMapExecutor();
         // run the MapTasks
        runTasks(mapRunnables, mapService, "map");

          // If there is a reduce phase, run the ReduceTasks
        try {
          if (numReduceTasks > 0) {
            List<RunnableWithThrowable> reduceRunnables = getReduceTaskRunnables(
                jobId, mapOutputFiles);
            ExecutorService reduceService = createReduceExecutor();
              //run the ReduceTasks
            runTasks(reduceRunnables, reduceService, "reduce");
          }
        } finally {
          for (MapOutputFile output : mapOutputFiles.values()) {
            output.removeAll();
          }
        }
        
    }

(2) runTasks

Lines 439-441 iterate over the map runnables that were created and submit them one by one (submission is sequential, but the threads run in parallel).
The runnables here are instances of MapTaskRunnable, an inner class of LocalJobRunner.Job; once submit() has been called, each one's run() method is executed.
Let's look at that run() method.

(3) MapTaskRunnable.run

LocalJobRunner.Job.MapTaskRunnable can be thought of as a stand-in for a Container: the MapTask runs inside it!

public void run() {
        try {
          TaskAttemptID mapId = new TaskAttemptID(new TaskID(
              jobId, TaskType.MAP, taskId), 0);
          LOG.info("Starting task: " + mapId);
          mapIds.add(mapId);
            //Create a MapTask object; it represents the current MapTask and drives the overall flow of this task
          MapTask map = new MapTask(systemJobFile.toString(), mapId, taskId,
            info.getSplitIndex(), 1);
         ......
             //run the MapTask
            map.run(localConf, Job.this);
           
    }

(4) MapTask.run()

@Override
  public void run(final JobConf job, final TaskUmbilicalProtocol umbilical)
    throws IOException, ClassNotFoundException, InterruptedException {
    this.umbilical = umbilical;
	//defines the phase breakdown of the entire MapTask
    if (isMapTask()) {
      // If there are no reducers then there won't be any sort. Hence the map 
      // phase will govern the entire attempt's progress.
      if (conf.getNumReduceTasks() == 0) {
        mapPhase = getProgress().addPhase("map", 1.0f);
      } else {
        // If there are reducers then the entire attempt's progress will be 
        // split between the map phase (67%) and the sort phase (33%).
        mapPhase = getProgress().addPhase("map", 0.667f);
        sortPhase  = getProgress().addPhase("sort", 0.333f);
      }
    }
    .......

    if (useNewApi) {
        // start running the Mapper (new API)
      runNewMapper(job, splitMetaInfo, umbilical, reporter);
    } else {
      runOldMapper(job, splitMetaInfo, umbilical, reporter);
    }
    done(umbilical, reporter);
  }

Note: if there is a reduce phase, the MapTask is divided into two phases:

  map ------ 67%

  sort ------ 33%

If there is no reduce phase, the MapTask has only the map phase.

Data is only sorted when there is a reduce phase! Without one, records are processed in the order they are read and written straight out.

(5) runNewMapper

private <INKEY,INVALUE,OUTKEY,OUTVALUE>
  void runNewMapper(final JobConf job,
                    final TaskSplitIndex splitIndex,
                    final TaskUmbilicalProtocol umbilical,
                    TaskReporter reporter
                    ) throws IOException, ClassNotFoundException,
                             InterruptedException {
    // make a task context so we can get the classes
    org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
      new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job, 
                                                                  getTaskID(),
                                                                  reporter);
    // make a mapper -- a MapTask creates exactly one Mapper object
    org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE> mapper =
      (org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>)
        ReflectionUtils.newInstance(taskContext.getMapperClass(), job);
    // make the input format -- create the InputFormat object
    org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
      (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
        ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
    // rebuild the input split -- reconstruct the split assigned to this MapTask
    org.apache.hadoop.mapreduce.InputSplit split = null;
    split = getSplitDetails(new Path(splitIndex.getSplitLocation()),
        splitIndex.getStartOffset());
    LOG.info("Processing split: " + split);

     //Build the MapTask's input object; it drives the whole input side and calls the RecordReader to read the data
    org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
      new NewTrackingRecordReader<INKEY,INVALUE>
        (split, inputFormat, reporter, taskContext);
    
    job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
    org.apache.hadoop.mapreduce.RecordWriter output = null;
    
    //Build the MapTask's output object
    // get an output object
    if (job.getNumReduceTasks() == 0) {
        //If there is no reduce phase, the map output is written out directly
      output = 
        new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
    } else {
        // otherwise create the record collector (sorting buffer)
      output = new NewOutputCollector(taskContext, job, umbilical, reporter);
    }

    org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY, OUTVALUE> 
    mapContext = 
      new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, getTaskID(), 
          input, output, 
          committer, 
          reporter, split);

          //Build the context object used inside the Mapper; it represents the MapTask's full context
    org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context 
        mapperContext = 
          new WrappedMapper<INKEY, INVALUE, OUTKEY, OUTVALUE>().getMapContext(
              mapContext);

    try {
        // Initialize the components needed on the input side,
        // i.e. call RecordReader.initialize()
      input.initialize(split, mapperContext);
        // invoke run() on the Mapper we wrote
      mapper.run(mapperContext);
      mapPhase.complete();
      setPhase(TaskStatus.Phase.SORT);
      statusUpdate(umbilical);
      input.close();
      input = null;
      output.close(mapperContext);
      output = null;
    } finally {
      closeQuietly(input);
      closeQuietly(output, mapperContext);
    }
  }

This function is fairly involved, so let's break it into pieces.

1) The InputFormat
    org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
      (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
        ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
public Class<? extends InputFormat<?,?>> getInputFormatClass() 
     throws ClassNotFoundException {
    return (Class<? extends InputFormat<?,?>>) 
      conf.getClass(INPUT_FORMAT_CLASS_ATTR, TextInputFormat.class);
  }

This fetches the InputFormat: the one we configured if set, otherwise TextInputFormat.class.

org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
      new NewTrackingRecordReader<INKEY,INVALUE>
        (split, inputFormat, reporter, taskContext);

NewTrackingRecordReader

    NewTrackingRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
        org.apache.hadoop.mapreduce.InputFormat<K, V> inputFormat,
        TaskReporter reporter,
        org.apache.hadoop.mapreduce.TaskAttemptContext taskContext)
        throws InterruptedException, IOException {
      this.reporter = reporter;
      this.inputRecordCounter = reporter
          .getCounter(TaskCounter.MAP_INPUT_RECORDS);
      this.fileInputByteCounter = reporter
          .getCounter(FileInputFormatCounter.BYTES_READ);

      List <Statistics> matchedStats = null;
      if (split instanceof org.apache.hadoop.mapreduce.lib.input.FileSplit) {
        matchedStats = getFsStatistics(((org.apache.hadoop.mapreduce.lib.input.FileSplit) split)
            .getPath(), taskContext.getConfiguration());
      }
      fsStats = matchedStats;

      long bytesInPrev = getInputBytes(fsStats);
      this.real = inputFormat.createRecordReader(split, taskContext);
      long bytesInCurr = getInputBytes(fsStats);
      fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
    }

The key line is:

this.real = inputFormat.createRecordReader(split, taskContext);
  public RecordReader<LongWritable, Text> 
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {
    String delimiter = context.getConfiguration().get(
        "textinputformat.record.delimiter");
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
      recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
    return new LineRecordReader(recordDelimiterBytes);
  }

So by default TextInputFormat uses LineRecordReader!
This is exactly why a custom input format usually has to override createRecordReader(), as the sketch below illustrates.
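As a rough sketch of that pattern (the class name and the "never split" rule are invented for illustration; the overridden method signatures are the real new-API ones), a custom format mainly decides which RecordReader to hand back:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical format: reads lines like TextInputFormat but never splits a file
public class WholeFileTextInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;   // one split per file
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
        // this is the hook that inputFormat.createRecordReader(...) above calls into
        return new LineRecordReader();
    }
}

It would be selected in the driver with job.setInputFormatClass(WholeFileTextInputFormat.class).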

2) The Map-side output
    //Build the MapTask's output object
    // get an output object
    if (job.getNumReduceTasks() == 0) {
        //If there is no reduce phase, the map output is written out directly
      output = 
        new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
    } else {
        // otherwise create the record collector (sorting buffer)
      output = new NewOutputCollector(taskContext, job, umbilical, reporter);
    }
NewOutputCollector(org.apache.hadoop.mapreduce.JobContext jobContext,
                       JobConf job,
                       TaskUmbilicalProtocol umbilical,
                       TaskReporter reporter
                       ) throws IOException, ClassNotFoundException {
    //Create the output buffer; it not only collects the records but also sorts them
      collector = createSortingCollector(job, reporter);
   
 // The number of ReduceTasks determines the total number of partitions on the map side (not necessarily the number of partitions that actually receive data)
      partitions = jobContext.getNumReduceTasks();
// if the number of ReduceTasks is > 1, the user-configured Partitioner is used; otherwise the trivial one below, which always returns partition 0
      if (partitions > 1) {
        partitioner = (org.apache.hadoop.mapreduce.Partitioner<K,V>)
          ReflectionUtils.newInstance(jobContext.getPartitionerClass(), job);
      } else {
        partitioner = new org.apache.hadoop.mapreduce.Partitioner<K,V>() {
          @Override
          public int getPartition(K key, V value, int numPartitions) {
            return partitions - 1;
          }
        };
a) Obtaining the Partitioner

The user-defined Partitioner is read from the configuration; if none is set, HashPartitioner.class is used.

public Class<? extends Partitioner<?,?>> getPartitionerClass() 
     throws ClassNotFoundException {
    //read the mapreduce.job.partitioner.class property; if it is not set, fall back to HashPartitioner
    return (Class<? extends Partitioner<?,?>>) 
      conf.getClass(PARTITIONER_CLASS_ATTR, HashPartitioner.class);
  }

How HashPartitioner works: records with the same key always land in the same partition.

public class HashPartitioner<K, V> extends Partitioner<K, V> {

  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

}

So when writing a custom Partitioner (see the sketch just below this list):
  extend Partitioner and implement public int getPartition(Text key, FlowBean value, int numPartitions)!

  Note: the partition number must be an int and must satisfy 0 <= partitionNum < numPartitions
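A minimal sketch of such a Partitioner for the WordCount types above (the routing rule itself is invented purely for illustration):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical rule: words starting with a-m go to partition 0, everything else to partition 1
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions == 1) {
            return 0;                                   // a single reducer leaves nothing to decide
        }
        String word = key.toString();
        if (word.isEmpty()) {
            return 0;
        }
        char first = Character.toLowerCase(word.charAt(0));
        return (first >= 'a' && first <= 'm') ? 0 : 1;  // always satisfies 0 <= result < numPartitions
    }
}

It would be wired up in the driver with job.setPartitionerClass(FirstLetterPartitioner.class) together with job.setNumReduceTasks(2).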

b) Building the sort buffer

collector = createSortingCollector(job, reporter);

collector.init(context);

public void init(MapOutputCollector.Context context
                    ) throws IOException, ClassNotFoundException {
     
     
      partitions = job.getNumReduceTasks();
      rfs = ((LocalFileSystem)FileSystem.getLocal(job)).getRaw();

      //sanity checks
    // Read the spill threshold from mapreduce.map.sort.spill.percent;
     // if it is not configured, 0.8 is used
      final float spillper =
        job.getFloat(JobContext.MAP_SORT_SPILL_PERCENT, (float)0.8);
    // The initial buffer size, read from mapreduce.task.io.sort.mb; 100 (MB) if not configured
      final int sortmb = job.getInt(JobContext.IO_SORT_MB, 100);
    
      indexCacheMemoryLimit = job.getInt(JobContext.INDEX_CACHE_MEMORY_LIMIT,
                                         INDEX_CACHE_MEMORY_LIMIT_DEFAULT);
    
      // QuickSort is the default sorter
      sorter = ReflectionUtils.newInstance(job.getClass("map.sort.class",
            QuickSort.class, IndexedSorter.class), job);
     

      // k/v serialization
    // determine the key comparator
      comparator = job.getOutputKeyComparator();
    // get the key and value types emitted by the Mapper
      keyClass = (Class<K>)job.getMapOutputKeyClass();
      valClass = (Class<V>)job.getMapOutputValueClass();
    
      serializationFactory = new SerializationFactory(job);
    //return the serializer matching the key's type
      keySerializer = serializationFactory.getSerializer(keyClass);
      keySerializer.open(bb);
      valSerializer = serializationFactory.getSerializer(valClass);
      valSerializer.open(bb);

     

      // compression: compress the map output if configured
      if (job.getCompressMapOutput()) {
        Class<? extends CompressionCodec> codecClass =
          job.getMapOutputCompressorClass(DefaultCodec.class);
        codec = ReflectionUtils.newInstance(codecClass, job);
      } else {
        codec = null;
      }

      // combiner: set up the combiner, if one is configured
      final Counters.Counter combineInputCounter =
        reporter.getCounter(TaskCounter.COMBINE_INPUT_RECORDS);
      combinerRunner = CombinerRunner.create(job, getTaskID(), 
                                             combineInputCounter,
                                             reporter, null);
      if (combinerRunner != null) {
        final Counters.Counter combineOutputCounter =
          reporter.getCounter(TaskCounter.COMBINE_OUTPUT_RECORDS);
        combineCollector= new CombineOutputCollector<K,V>(combineOutputCounter, reporter, job);
      } else {
        combineCollector = null;
      }
    }
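The two knobs read above can be tuned from the driver's Configuration before the Job is created. A small sketch using the standard property names (the values are placeholders, not recommendations):

// Grow the in-memory sort buffer from the default 100 MB to 200 MB
conf.setInt("mapreduce.task.io.sort.mb", 200);

// Start spilling when the buffer is 90% full instead of the default 80%
conf.set("mapreduce.map.sort.spill.percent", "0.90");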
c) Determining the key comparator
 public RawComparator getOutputKeyComparator() {
     //Try to read mapreduce.job.output.key.comparator.class as the comparator;
     //it defaults to null, and if defined it must be a RawComparator
    Class<? extends RawComparator> theClass = getClass(
      JobContext.KEY_COMPARATOR, null, RawComparator.class);
     //if the user configured one, instantiate it
    if (theClass != null)
      return ReflectionUtils.newInstance(theClass, this);
     // Otherwise check whether the Mapper's output key is a WritableComparable subtype;
     //if it is, the framework supplies the comparator, and if not an exception is thrown!
    return WritableComparator.get(getMapOutputKeyClass().asSubclass(WritableComparable.class), this);
  }

How to define the key ordering in the Map phase:

  ① put the field(s) you want to compare into the key

  ② provide a RawComparator as the key comparator,

  or have the key implement the WritableComparable interface!

Keys are then compared with that comparator's compare() (or the key's own compareTo())!

  When the comparison involves more than one field, this is called secondary sort! A sketch of option ② follows below.
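A minimal sketch of option ② with a RawComparator (the class name is invented; WritableComparator already implements RawComparator, which is why extending it is the usual shortcut). This one simply reverses the default ordering of Text keys:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Sorts the map output keys in descending instead of ascending order
public class DescendingTextComparator extends WritableComparator {

    public DescendingTextComparator() {
        super(Text.class, true);   // true = create key instances so compare(WritableComparable, ...) works
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return ((Text) b).compareTo((Text) a);   // reverse of the natural order
    }
}

In the driver it would be registered with job.setSortComparatorClass(DescendingTextComparator.class), which populates the mapreduce.job.output.key.comparator.class property read above.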

d) Serialization

① When is serialization needed?

  Whenever there is a reduce phase, the key-value types emitted by Map must be serializable.

② How is it done?

  Implement the Writable interface!

③ Is that mandatory?

  No!

④ When can Writable be skipped, and why implement it otherwise?

  Only when a type implements Writable does Hadoop automatically provide a Writable-based serializer!

  If you supply your own serializer, the type does not have to implement Writable. A sketch of a Writable bean follows below.
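A minimal sketch of a custom Writable value type (the bean and its fields are made up; the only contract is that readFields() reads the fields in exactly the order write() wrote them):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Hypothetical value bean carrying two counters
public class FlowBean implements Writable {
    private long upFlow;
    private long downFlow;

    public FlowBean() { }                        // Hadoop needs the no-arg constructor for deserialization

    public void set(long upFlow, long downFlow) {
        this.upFlow = upFlow;
        this.downFlow = downFlow;
    }

    @Override
    public void write(DataOutput out) throws IOException {     // serialization
        out.writeLong(upFlow);
        out.writeLong(downFlow);
    }

    @Override
    public void readFields(DataInput in) throws IOException {  // deserialization, same field order
        upFlow = in.readLong();
        downFlow = in.readLong();
    }

    @Override
    public String toString() {                   // TextOutputFormat writes value.toString()
        return upFlow + "\t" + downFlow;
    }
}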

3) mapper.run(mapperContext)

This invokes the run() method of the Mapper we wrote (inherited from the Mapper base class):

public void run(Context context) throws IOException, InterruptedException {
    //called exactly once, before any map() call
    setup(context);
    try {
        //calls the RecordReader's nextKeyValue()
      while (context.nextKeyValue()) {
          //for each KEYIN-VALUEIN pair that is read, map() is called once
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
        //called exactly once, after the last map() call
      cleanup(context);
    }
  }

This loop simply calls the RecordReader's nextKeyValue(), getCurrentKey() and getCurrentValue(), and hands each pair to our own map() method:

    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1. Get the line as a String (convert Text to String)
        String line = value.toString();

        // 2. Split the line into an array of words
        String [] worlds = line.split(" ");

        // 3. Iterate over the word array
        for (String world : worlds) {

            // 4. Set the output key to the current word (the value is the constant 1)
            k.set(world);

            // 5. Emit the pair
            context.write(k,v);
        }

    }
    public void write(KEYOUT key, VALUEOUT value) throws IOException,
        InterruptedException {
      mapContext.write(key, value);
    }
    public void write(K key, V value) throws IOException, InterruptedException {
      collector.collect(key, value,
                        partitioner.getPartition(key, value, partitions));
    }

context.write() eventually lands in collector.collect():

    public synchronized void collect(K key, V value, final int partition
                                     ) throws IOException {
      reporter.progress();
      //check that the key and value classes match the configured output types
      if (key.getClass() != keyClass) {
        throw new IOException("Type mismatch in key from map: expected "
                              + keyClass.getName() + ", received "
                              + key.getClass().getName());
      }
      if (value.getClass() != valClass) {
        throw new IOException("Type mismatch in value from map: expected "
                              + valClass.getName() + ", received "
                              + value.getClass().getName());
      }
      //check that the partition number is legal
      if (partition < 0 || partition >= partitions) {
        throw new IOException("Illegal partition for " + key + " (" +
            partition + ")");
      }
      //check whether a previous spill failed
      checkSpillException();
      //check the remaining space in the buffer
      bufferRemaining -= METASIZE;
      if (bufferRemaining <= 0) {
        // start spill if the thread is not running and the soft limit has been
        // reached
        spillLock.lock();
        try {
          do {
            if (!spillInProgress) {
              final int kvbidx = 4 * kvindex;
              final int kvbend = 4 * kvend;
              // serialized, unspilled bytes always lie between kvindex and
              // bufindex, crossing the equator. Note that any void space
              // created by a reset must be included in "used" bytes
              final int bUsed = distanceTo(kvbidx, bufindex);
              final boolean bufsoftlimit = bUsed >= softLimit;
              if ((kvbend + METASIZE) % kvbuffer.length !=
                  equator - (equator % METASIZE)) {
                // spill finished, reclaim space
                resetSpill();
                bufferRemaining = Math.min(
                    distanceTo(bufindex, kvbidx) - 2 * METASIZE,
                    softLimit - bUsed) - METASIZE;
                continue;
              } else if (bufsoftlimit && kvindex != kvend) {
                // spill records, if any collected; check latter, as it may
                // be possible for metadata alignment to hit spill pcnt
                //start a spill
                startSpill();
                final int avgRec = (int)
                  (mapOutputByteCounter.getCounter() /
                  mapOutputRecordCounter.getCounter());
                // leave at least half the split buffer for serialization data
                // ensure that kvindex >= bufindex
                final int distkvi = distanceTo(bufindex, kvbidx);
                final int newPos = (bufindex +
                  Math.max(2 * METASIZE - 1,
                          Math.min(distkvi / 2,
                                   distkvi / (METASIZE + avgRec) * METASIZE)))
                  % kvbuffer.length;
                setEquator(newPos);
                bufmark = bufindex = newPos;
                final int serBound = 4 * kvend;
                // bytes remaining before the lock must be held and limits
                // checked is the minimum of three arcs: the metadata space, the
                // serialization space, and the soft limit
                bufferRemaining = Math.min(
                    // metadata max
                    distanceTo(bufend, newPos),
                    Math.min(
                      // serialization max
                      distanceTo(newPos, serBound),
                      // soft limit
                      softLimit)) - 2 * METASIZE;
              }
            }
          } while (false);
        } finally {
          spillLock.unlock();
        }
      }

      try {
        // serialize key bytes into buffer
        int keystart = bufindex;
        keySerializer.serialize(key);
        if (bufindex < keystart) {
          // wrapped the key; must make contiguous
          bb.shiftBufferedKey();
          keystart = 0;
        }
        // serialize value bytes into buffer
        final int valstart = bufindex;
        valSerializer.serialize(value);
        // It's possible for records to have zero length, i.e. the serializer
        // will perform no writes. To ensure that the boundary conditions are
        // checked and that the kvindex invariant is maintained, perform a
        // zero-length write into the buffer. The logic monitoring this could be
        // moved into collect, but this is cleaner and inexpensive. For now, it
        // is acceptable.
        bb.write(b0, 0, 0);

        // the record must be marked after the preceding write, as the metadata
        // for this record are not yet written
        int valend = bb.markRecord();

        mapOutputRecordCounter.increment(1);
        mapOutputByteCounter.increment(
            distanceTo(keystart, valend, bufvoid));

        // write accounting info
        kvmeta.put(kvindex + PARTITION, partition);
        kvmeta.put(kvindex + KEYSTART, keystart);
        kvmeta.put(kvindex + VALSTART, valstart);
        kvmeta.put(kvindex + VALLEN, distanceTo(valstart, valend));
        // advance kvindex
        kvindex = (kvindex - NMETA + kvmeta.capacity()) % kvmeta.capacity();
      } catch (MapBufferTooSmallException e) {
        LOG.info("Record too large for in-memory buffer: " + e.getMessage());
        spillSingleRecord(key, value, partition);
        mapOutputRecordCounter.increment(1);
        return;
      }
    }

After all the data has been written into the circular buffer, control returns to runNewMapper():

      setPhase(TaskStatus.Phase.SORT);
      statusUpdate(umbilical);
      input.close();
      input = null;
      //inside this close() the buffered data is sorted and flushed
      output.close(mapperContext);
      output = null;
    } finally {
      closeQuietly(input);
      closeQuietly(output, mapperContext);
    }
4) collector.flush();

output.close() ends up in collector.flush(), which eventually reaches sortAndSpill(). This is where the real sorting happens, and the Combiner, if any, is also applied inside this function.

private void sortAndSpill() throws IOException, ClassNotFoundException,
                                       InterruptedException {
      //approximate the length of the output file to be the length of the
      //buffer + header lengths for the partitions
      try {
        // create spill file
        //sort the metadata collected in the buffer
        sorter.sort(MapOutputBuffer.this, mstart, mend, reporter);
        int spindex = mstart;
        final IndexRecord rec = new IndexRecord();
        final InMemValBytes value = new InMemValBytes();
        for (int i = 0; i < partitions; ++i) {
          IFile.Writer<K, V> writer = null;
          try {
            long segmentStart = out.getPos();
            FSDataOutputStream partitionOut = CryptoUtils.wrapIfNecessary(job, out);
            writer = new Writer<K, V>(job, partitionOut, keyClass, valClass, codec,
                                      spilledRecordsCounter);
            //check whether a combiner is defined
            if (combinerRunner == null) {
              // spill directly
              DataInputBuffer key = new DataInputBuffer();
              while (spindex < mend &&
                  kvmeta.get(offsetFor(spindex % maxRec) + PARTITION) == i) {
                final int kvoff = offsetFor(spindex % maxRec);
                int keystart = kvmeta.get(kvoff + KEYSTART);
                int valstart = kvmeta.get(kvoff + VALSTART);
                key.reset(kvbuffer, keystart, valstart - keystart);
                getVBytesForOffset(kvoff, value);
                writer.append(key, value);
                ++spindex;
              }
            } else {
              int spstart = spindex;
              while (spindex < mend &&
                  kvmeta.get(offsetFor(spindex % maxRec)
                            + PARTITION) == i) {
                ++spindex;
              }
              // Note: we would like to avoid the combiner if we've fewer
              // than some threshold of records for a partition
              //run the combiner before spilling
              if (spstart != spindex) {
                combineCollector.setWriter(writer);
                RawKeyValueIterator kvIter =
                  new MRResultIterator(spstart, spindex);
                combinerRunner.combine(kvIter, combineCollector);
              }
            }

            // close the writer
            writer.close();

            // record offsets
            rec.startOffset = segmentStart;
            rec.rawLength = writer.getRawLength() + CryptoUtils.cryptoPadding(job);
            rec.partLength = writer.getCompressedLength() + CryptoUtils.cryptoPadding(job);
            spillRec.putIndex(rec, i);

            writer = null;
          } finally {
            if (null != writer) writer.close();
          }
        }

        if (totalIndexCacheMemory >= indexCacheMemoryLimit) {
          // create spill index file
          Path indexFilename =
              mapOutputFile.getSpillIndexFileForWrite(numSpills, partitions
                  * MAP_OUTPUT_INDEX_RECORD_LENGTH);
          spillRec.writeToFile(indexFilename, job);
        } else {
          indexCacheList.add(spillRec);
          totalIndexCacheMemory +=
            spillRec.size() * MAP_OUTPUT_INDEX_RECORD_LENGTH;
        }
        LOG.info("Finished spill " + numSpills);
        ++numSpills;
      } finally {
        if (out != null) out.close();
      }
    }
5) This part is a bit tangled, so here is a summary

① Partition first: for every output key-value pair, the partitioner is called to compute the partition number before the record is collected into the buffer.

② Records accumulate in the buffer; when the spill threshold is reached, the Sorter sorts everything currently in the buffer.

  Only the index entries are sorted (the sorted order is recorded in the metadata, not by moving the records).

③ Spilling proceeds partition by partition, starting from partition 0; before each spill, if a Combiner is configured, the data is combined first and then spilled.

  Each spill produces a spillX.out file.

④ Once all the data has been collected into the buffer, a final flush() spills whatever residual data never reached the spill threshold.

⑤ After flush(), mergeParts() merges the spill files.

  During the merge, the data of the same partition from the different spill files is merged and then sorted!

  All partitions are then written out as one final.out file!

  Before this final write, if a Combiner is configured and the number of previous spill files is >= 3, the Combiner is invoked once more before the data is written. A sketch of configuring a Combiner follows below.
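For the WordCount job above, the reducer can double as the Combiner, because summing partial counts is associative and commutative. A minimal sketch of wiring it up in WCDriver (this one line is the only addition):

// in WCDriver, after job.setReducerClass(WCReducer.class):
job.setCombinerClass(WCReducer.class);   // combine partial counts on the map side before they are spilled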

3.4 The Reduce phase

Once the map phase has finished, the Reduce phase begins.
We go back to the run() method of LocalJobRunner.Job.

public void run() {
     
        Map<TaskAttemptID, MapOutputFile> mapOutputFiles =
            Collections.synchronizedMap(new HashMap<TaskAttemptID, MapOutputFile>());
        
        List<RunnableWithThrowable> mapRunnables = getMapTaskRunnables(
            taskSplitMetaInfos, jobId, mapOutputFiles);
              
        initCounters(mapRunnables.size(), numReduceTasks);
        ExecutorService mapService = createMapExecutor();
        runTasks(mapRunnables, mapService, "map"); //the map phase finishes inside this call
        //the block below runs the reduce phase
        try {
          if (numReduceTasks > 0) {
            List<RunnableWithThrowable> reduceRunnables = getReduceTaskRunnables(
                jobId, mapOutputFiles);
            ExecutorService reduceService = createReduceExecutor();
            //the reduce tasks start here
            runTasks(reduceRunnables, reduceService, "reduce");
          }
        } 

(1) runTasks

 private void runTasks(List<RunnableWithThrowable> runnables,
        ExecutorService service, String taskType) throws Exception {
      // Start populating the executor with work units.
      // They may begin running immediately (in other threads).
      for (Runnable r : runnables) {
        service.submit(r);
      }

      try {
        service.shutdown(); // Instructs queue to drain.

        // Wait for tasks to finish; do not use a time-based timeout.
        // (See http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6179024)
        LOG.info("Waiting for " + taskType + " tasks");
        service.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
      } catch (InterruptedException ie) {
        // Cancel all threads.
        service.shutdownNow();
        throw ie;
      }

(2) service.submit(r)

This again ends up in the run() method of the ReduceTaskRunnable class:

public void run() {
        try {
          TaskAttemptID reduceId = new TaskAttemptID(new TaskID(
              jobId, TaskType.REDUCE, taskId), 0);
          
			//create the ReduceTask object
          ReduceTask reduce = new ReduceTask(systemJobFile.toString(),
              reduceId, taskId, mapIds.size(), 1);
          ......
            try {
             // run reduceTask.run()
              reduce.run(localConf, Job.this);

    }

(3) ReduceTask.run()

public void run(JobConf job, final TaskUmbilicalProtocol umbilical)
    throws IOException, InterruptedException, ClassNotFoundException {
    job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
    //define the phase breakdown
    if (isMapOrReduce()) {
      copyPhase = getProgress().addPhase("copy");
      sortPhase  = getProgress().addPhase("sort");
      reducePhase = getProgress().addPhase("reduce");
    }
   .....
    
    // Initialize the codec: if the MapTask output was compressed, get the codec here to decompress it
    codec = initCodec();
    RawKeyValueIterator rIter = null;
    //the shuffle consumer plugin copies this reducer's partition from every MapTask's output to the ReduceTask
    ShuffleConsumerPlugin shuffleConsumerPlugin = null;
    
    // Set up the combiner: when the ReduceTask merges the same partition from several MapTasks and runs out of memory,
    //it spills to disk, and before every such spill the combiner is called again!
    Class combinerClass = conf.getCombinerClass();
    CombineOutputCollector combineCollector = 
      (null != combinerClass) ? 
     new CombineOutputCollector(reduceCombineOutputCounter, reporter, conf) : null;

    
   //initialize the shuffle plugin and call its run()
    rIter = shuffleConsumerPlugin.run();

    // free up the data structures -- at this point the copy part of shuffle is done
    mapOutputFilesOnDisk.clear();
    
    // The sort has also completed inside shuffle: all data is now globally ordered by the Mapper output key
    sortPhase.complete();                         // sort is complete
    setPhase(TaskStatus.Phase.REDUCE); 
    statusUpdate(umbilical);
    
    //get the key-value types emitted by the Mapper
    Class keyClass = job.getMapOutputKeyClass();
    Class valueClass = job.getMapOutputValueClass();
    
    //get the grouping comparator
    RawComparator comparator = job.getOutputValueGroupingComparator();

    if (useNewApi) {
        //run the Reducer (new API)
      runNewReducer(job, umbilical, reporter, rIter, comparator, 
                    keyClass, valueClass);
    } else {
      runOldReducer(job, umbilical, reporter, rIter, comparator, 
                    keyClass, valueClass);
    }

    shuffleConsumerPlugin.close();
    done(umbilical, reporter);
  }

(4) Obtaining the grouping comparator

 public RawComparator getOutputValueGroupingComparator() {
     //First try the user-defined comparator from the mapreduce.job.output.group.comparator.class
     //property; it must be a RawComparator
    Class<? extends RawComparator> theClass = getClass(
      JobContext.GROUP_COMPARATOR_CLASS, null, RawComparator.class);
     //if the user did not define one, fall back to the Map-phase key comparator
    if (theClass == null) {
      return getOutputKeyComparator();
    }
    //if the user did define one, use it
    return ReflectionUtils.newInstance(theClass, this);
  }
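A minimal sketch of a custom grouping comparator (the class and the grouping rule are invented for illustration). It decides which consecutive keys count as "the same key" and therefore enter one reduce() call together; here Text keys are grouped by their first character only:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical: keys sharing the same first character go into the same reduce() call
public class FirstCharGroupingComparator extends WritableComparator {

    public FirstCharGroupingComparator() {
        super(Text.class, true);
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        String x = a.toString();
        String y = b.toString();
        char cx = x.isEmpty() ? '\0' : x.charAt(0);
        char cy = y.isEmpty() ? '\0' : y.charAt(0);
        return Character.compare(cx, cy);   // 0 means "same group"
    }
}

It would be registered in the driver with job.setGroupingComparatorClass(FirstCharGroupingComparator.class), which sets the mapreduce.job.output.group.comparator.class property read above.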

(5) runNewReducer

private <INKEY,INVALUE,OUTKEY,OUTVALUE>
  void runNewReducer(JobConf job,
                     final TaskUmbilicalProtocol umbilical,
                     final TaskReporter reporter,
                     RawKeyValueIterator rIter,
                     RawComparator<INKEY> comparator,
                     Class<INKEY> keyClass,
                     Class<INVALUE> valueClass
                     ) throws IOException,InterruptedException, 
                              ClassNotFoundException {
    // wrap value iterator to report progress.
   //wrap the key-value iterator so it can report progress
    final RawKeyValueIterator rawIter = rIter;
    rIter = new RawKeyValueIterator() {
      public void close() throws IOException {
        rawIter.close();
      }
        //On each iteration the key-value is not materialized directly from the data;
        //instead the raw byte[] content of the current key-value is fetched and then
        //deserialized into the reusable key-value instances
      public DataInputBuffer getKey() throws IOException {
        return rawIter.getKey();
      }
      public Progress getProgress() {
        return rawIter.getProgress();
      }
      public DataInputBuffer getValue() throws IOException {
        return rawIter.getValue();
      }
      public boolean next() throws IOException {
        boolean ret = rawIter.next();
        reporter.setProgress(rawIter.getProgress().getProgress());
        return ret;
      }
    };
    // create the task attempt context
    org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
      new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job,
          getTaskID(), reporter);
    // instantiate the Reducer; a ReduceTask creates exactly one Reducer object
    org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE> reducer =
      (org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE>)
        ReflectionUtils.newInstance(taskContext.getReducerClass(), job);
    org.apache.hadoop.mapreduce.RecordWriter<OUTKEY,OUTVALUE> trackedRW = 
      new NewTrackingRecordWriter<OUTKEY, OUTVALUE>(this, taskContext);
    job.setBoolean("mapred.skip.on", isSkipping());
    job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
                                  //the context used inside reduce
    org.apache.hadoop.mapreduce.Reducer.Context 
         reducerContext = createReduceContext(reducer, job, getTaskID(),
                                               rIter, reduceInputKeyCounter, 
                                               reduceInputValueCounter, 
                                               trackedRW,
                                               committer,
                                               reporter, comparator, keyClass,
                                               valueClass);
    try {
        //run reducer.run()
      reducer.run(reducerContext);
    } finally {
      trackedRW.close(reducerContext);
    }
  }

(6) Reducer.run()

public void run(Context context) throws IOException, InterruptedException {
    //setup() is called once, before any reduce() call
    setup(context);
    try {
        //checks whether there is another key to read; all key-values with the same key (as decided by the grouping comparator) enter one reduce() call
      while (context.nextKey()) {
        reduce(context.getCurrentKey(), context.getValues(), context);
        // If a back up store is used, reset it
        Iterator<VALUEIN> iter = context.getValues().iterator();
        if(iter instanceof ReduceContext.ValueIterator) {
          ((ReduceContext.ValueIterator<VALUEIN>)iter).resetBackupStore();        
        }
      }
    } finally {
         //cleanup() is called once, after the last reduce() call
      cleanup(context);
    }
  }

Inside this loop our own reduce() method is invoked, and the resulting key-value pairs are written out.

(7) reduce

protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;

        // 1. Sum up the values
        for (IntWritable value : values) {
            sum += value.get();
        }

        IntWritable v = new IntWritable();
        v.set(sum);

        // 2. Write out the total count for this key
        context.write(key, v);
    }

And with that, the reduce phase is complete.
