This article does not cover the principles of MapReduce. It only records, at the source-code level, my understanding of the execution flow and data flow of Hadoop's MapReduce.

First, here is a figure from the Yahoo Hadoop tutorial:

As the figure shows, before data reaches Map, the InputFormat reads the files stored in HDFS and splits them into task-related InputSplits; a RecordReader then reads those splits and hands what it reads to the map function as its input arguments. Below I look, from the angle of code execution, at how data travels step by step from a file in HDFS to the map function. The Yahoo Hadoop tutorial already explains this process in detail, but being a stickler for details, I prefer to work through the process in the source code itself; only then do I feel on solid ground, as if I have truly grasped the point. So I read this part of the source carefully and am writing this post to record it, for my own later reference.
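Before diving into the source, it helps to see the destination of this data flow: the map function of a user-defined Mapper. A minimal word-count style Mapper of my own (an illustrative sketch, not Hadoop source) looks like this; each key/value pair that the RecordReader extracts arrives as the arguments of map:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// key: the byte offset supplied by LineRecordReader
// value: the text of one line of the input file
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE); // emit (word, 1) for the reduce phase
      }
    }
  }
}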
First, in the Mapper class's run method, the map function is invoked in a loop:
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  ...................................

  /**
   * Expert users can override this method for more complete control over the
   * execution of the Mapper.
   * @param context
   * @throws IOException
   */
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
  }
}
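As the Javadoc above says, expert users can override run for more control. A hedged sketch of such an override (my own illustration, extending the WordCountMapper from earlier) keeps the same nextKeyValue-driven loop but counts the records it feeds to map:

import java.io.IOException;

// Illustrative override of run(); the loop is identical to the stock one.
public class CountingMapper extends WordCountMapper {

  @Override
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    long records = 0;
    while (context.nextKeyValue()) {  // same driving loop as Mapper.run
      map(context.getCurrentKey(), context.getCurrentValue(), context);
      records++;
    }
    // A production job would use context.getCounter(...).increment(1).
    System.err.println("records processed: " + records);
    cleanup(context);
  }
}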
Back in the stock run method: each call to context.nextKeyValue() advances the input, and map is executed once per record. The context here is actually a MapContextImpl, which implements the Context interface (this can be seen, for example, in MultithreadedMapper's run method). Its nextKeyValue, getCurrentKey and getCurrentValue methods are:
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
  return reader.nextKeyValue();
}

@Override
public KEYIN getCurrentKey() throws IOException, InterruptedException {
  return reader.getCurrentKey();
}

@Override
public VALUEIN getCurrentValue() throws IOException, InterruptedException {
  return reader.getCurrentValue();
}
All three simply delegate to a RecordReader. Where does that reader come from? It is passed in when the MapContextImpl is constructed: in MapTask's runNewMapper method, the RecordReader created by the job's InputFormat is wrapped in a NewTrackingRecordReader (a thin wrapper that updates input counters) and handed to the MapContextImpl:

org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
    new NewTrackingRecordReader<INKEY,INVALUE>
      (inputFormat.createRecordReader(split, taskContext), reporter);

job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
org.apache.hadoop.mapreduce.RecordWriter output = null;
..............
org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY, OUTVALUE>
    mapContext =
      new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, getTaskID(),
          input, output, committer, reporter, split);
So the RecordReader is produced by the InputFormat, whose contract consists of exactly two abstract methods: getSplits, which divides the input into logical InputSplits, and createRecordReader, which builds a reader for one split:

public abstract class InputFormat<K, V> {

  public abstract
    List<InputSplit> getSplits(JobContext context
                               ) throws IOException, InterruptedException;

  public abstract
    RecordReader<K,V> createRecordReader(InputSplit split,
                                         TaskAttemptContext context
                                        ) throws IOException,
                                                 InterruptedException;
}
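For file-based input, FileInputFormat already provides a getSplits implementation, so a custom format often only needs to supply createRecordReader. As an illustration (a made-up class of mine, not Hadoop source), here is a format that reuses the stock LineRecordReader but declares files non-splittable, forcing one split per file:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Illustrative: inherits getSplits from FileInputFormat, overrides the rest.
public class WholeFileLineInputFormat
    extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new LineRecordReader(); // default delimiter, line-oriented records
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false; // never cut a file, so each file becomes exactly one split
  }
}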
Which concrete InputFormat a job uses is decided by its configuration; getInputFormatClass (in JobContextImpl) falls back to TextInputFormat when nothing is set:

@SuppressWarnings("unchecked")
public Class<? extends InputFormat<?,?>> getInputFormatClass()
    throws ClassNotFoundException {
  return (Class<? extends InputFormat<?,?>>)
      conf.getClass(INPUT_FORMAT_CLASS_ATTR, TextInputFormat.class);
}
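In other words, the job configuration decides, and a typical driver sets it explicitly. A minimal sketch (the paths are illustrative, the Mapper is the WordCountMapper from above, and I am assuming the Hadoop 2.x API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);
    // Stores INPUT_FORMAT_CLASS_ATTR, which getInputFormatClass reads above.
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/input"));    // illustrative
    FileOutputFormat.setOutputPath(job, new Path("/output")); // illustrative
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}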
TextInputFormat's createRecordReader simply builds a LineRecordReader, passing along an optional custom record delimiter from the configuration:

public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text>
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {
    String delimiter = context.getConfiguration().get(
        "textinputformat.record.delimiter");
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
      recordDelimiterBytes = delimiter.getBytes();
    return new LineRecordReader(recordDelimiterBytes);
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    ......................
  }
}
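A side note on that delimiter lookup: the key "textinputformat.record.delimiter" can be set on the job configuration to change what counts as one record. A small sketch of mine (the blank-line delimiter is just an example, useful for paragraph-shaped records):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ParagraphInputDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // With a blank-line delimiter, each "line" the LineRecordReader yields
    // is a whole paragraph. Set it before Job copies the configuration.
    conf.set("textinputformat.record.delimiter", "\n\n");
    Job job = Job.getInstance(conf, "paragraph input");
    // ... set mapper, input/output paths, etc. as in the driver above
  }
}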
So how does a LineRecordReader read its split? In its initialize method, it takes the FileSplit it is given, opens the underlying file, and seeks to the start of the split, wrapping the stream in a LineReader (with extra care for compressed input):

public void initialize(InputSplit genericSplit,
                       TaskAttemptContext context) throws IOException {
  FileSplit split = (FileSplit) genericSplit;
  ......
  start = split.getStart();
  end = start + split.getLength();
  final Path file = split.getPath();

  // open the file and seek to the start of the split
  final FileSystem fs = file.getFileSystem(job);
  fileIn = fs.open(file);
  if (isCompressedInput()) {
    decompressor = CodecPool.getDecompressor(codec);
    if (codec instanceof SplittableCompressionCodec) {
      final SplitCompressionInputStream cIn =
          ((SplittableCompressionCodec)codec).createInputStream(
              fileIn, decompressor, start, end,
              SplittableCompressionCodec.READ_MODE.BYBLOCK);
      if (null == this.recordDelimiterBytes) {
        in = new LineReader(cIn, job);
      } else {
        in = new LineReader(cIn, job, this.recordDelimiterBytes);
      }
      start = cIn.getAdjustedStart();
      end = cIn.getAdjustedEnd();
      filePosition = cIn;
    } else {
      if (null == this.recordDelimiterBytes) {
        in = new LineReader(codec.createInputStream(fileIn, decompressor),
            job);
      } else {
        in = new LineReader(codec.createInputStream(fileIn,
            decompressor), job, this.recordDelimiterBytes);
      }
      filePosition = fileIn;
    }
  } else {
    fileIn.seek(start);
    if (null == this.recordDelimiterBytes) {
      in = new LineReader(fileIn, job);
    } else {
      in = new LineReader(fileIn, job, this.recordDelimiterBytes);
    }
    filePosition = fileIn;
  }
  // If this is not the first split, we always throw away first record
  // because we always (except the last split) read one extra line in
  // next() method.
  if (start != 0) {
    start += in.readLine(new Text(), 0, maxBytesToConsume(start));
  }
  this.pos = start;
}
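To see the in stream in isolation, org.apache.hadoop.util.LineReader can be driven by hand. A rough standalone sketch (assuming Hadoop 2.x; the local file path is my own illustrative choice):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.LineReader;

public class LineReaderDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("file:///tmp/demo.txt"); // illustrative
    FileSystem fs = path.getFileSystem(conf);
    FSDataInputStream fileIn = fs.open(path);
    fileIn.seek(0); // for a real split this would be the (adjusted) start

    LineReader in = new LineReader(fileIn, conf);
    Text line = new Text();
    int newSize;
    // readLine returns the number of bytes consumed, 0 at end of stream;
    // this is the same value nextKeyValue uses to advance pos below.
    while ((newSize = in.readLine(line)) > 0) {
      System.out.println(line);
    }
    in.close();
  }
}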
nextKeyValue then sets key to the current byte offset pos and reads the next line into value. Note the comment: the reader deliberately reads one extra line lying beyond the upper limit of its split, which is exactly why the next split's reader throws away its first (partial) line in initialize:

public boolean nextKeyValue() throws IOException {
  if (key == null) {
    key = new LongWritable();
  }
  key.set(pos);
  if (value == null) {
    value = new Text();
  }
  int newSize = 0;
  // We always read one extra line, which lies outside the upper
  // split limit i.e. (end - 1)
  while (getFilePosition() <= end) {
    newSize = in.readLine(value, maxLineLength,
        Math.max(maxBytesToConsume(pos), maxLineLength));
    if (newSize == 0) {
      break;
    }
    pos += newSize;
    inputByteCounter.increment(newSize);
    if (newSize < maxLineLength) {
      break;
    }
  }
  if (newSize == 0) {
    key = null;
    value = null;
    return false;
  } else {
    return true;
  }
}
@Override
public LongWritable getCurrentKey() {
  return key;
}

@Override
public Text getCurrentValue() {
  return value;
}
To sum up: in initialize, the FileSplit that is passed in tells the reader which file to read and where in it to start, and from that the real input stream in is created; and as we saw in nextKeyValue, it is this in that reads the file and updates the values of key and value.
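Finally, the whole read path can be exercised outside a running job. A rough harness of my own (assuming the Hadoop 2.x API; the file path is illustrative) that cuts a local file into two FileSplits and drives LineRecordReader exactly the way Mapper.run does through MapContextImpl:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptID;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl;

public class SplitBoundaryDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("file:///tmp/demo.txt"); // illustrative local file
    FileSystem fs = path.getFileSystem(conf);
    long len = fs.getFileStatus(path).getLen();

    // Cut the file at an arbitrary midpoint. The first reader reads one
    // line past 'mid'; the second throws away its partial first line,
    // so together they yield every line exactly once.
    long mid = len / 2;
    FileSplit[] splits = {
        new FileSplit(path, 0, mid, new String[0]),
        new FileSplit(path, mid, len - mid, new String[0])
    };
    for (FileSplit split : splits) {
      LineRecordReader reader = new LineRecordReader();
      reader.initialize(split,
          new TaskAttemptContextImpl(conf, new TaskAttemptID()));
      while (reader.nextKeyValue()) { // the same loop Mapper.run drives
        System.out.println(
            reader.getCurrentKey() + "\t" + reader.getCurrentValue());
      }
      reader.close();
    }
  }
}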