I. The Concept of an Input Split
Before running a job, MapReduce logically divides the raw input data into equal-length chunks called input splits (InputSplit), or "splits" for short.
MapReduce constructs one MapTask per split, and that task invokes the user-defined map method to process every record in the split.
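For orientation (a minimal illustrative sketch, not part of the original; the LineCountMapper class is hypothetical): the framework calls map once per record of the split assigned to a MapTask, so a mapper only ever sees the records of its own split.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: map() is invoked once per record of this task's split
// (for text input, once per line).
public class LineCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text("lines");

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line within the file, value = the line itself
        context.write(word, ONE);
    }
}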
II. Choosing the Split Size
1. Having many splits means the time needed to process each split is much shorter than the time needed to process the whole input (the advantage of divide and conquer).
2. Splits are processed in parallel, and each split is fairly small, so the load balances well: faster machines finish their splits sooner and are free to take on other work.
3. If the splits are too small, however, the total time spent managing splits and constructing map tasks starts to dominate the job's overall execution time.
4. If a split spans two data blocks, part of the split's data has to be shipped over the network to the node running the map task, which consumes bandwidth and lowers efficiency.
5. The best split size is therefore the same as the HDFS block size, which defaults to 128 MB in Hadoop 2.x.
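A rough feel for the trade-off (a toy calculation, not from the original; the 10 GB figure and the split sizes are made up): for a fixed amount of input, shrinking the split size multiplies the number of map tasks, and with it the per-task scheduling and startup overhead.

// Toy calculation: number of map tasks for a 10 GB input at different split sizes.
public class SplitCountDemo {
    public static void main(String[] args) {
        long inputSize = 10L * 1024 * 1024 * 1024;              // 10 GB of input
        long[] splitSizes = {1L << 20, 64L << 20, 128L << 20};  // 1 MB, 64 MB, 128 MB
        for (long splitSize : splitSizes) {
            long tasks = (inputSize + splitSize - 1) / splitSize; // ceiling division
            System.out.println((splitSize >> 20) + " MB splits -> " + tasks + " map tasks");
        }
    }
}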
III. Source Code Walkthrough
1) FileSplit source code
public class FileSplit extends InputSplit implements Writable {
    private Path file;                     // the file to be processed
    private long start;                    // byte offset of this logical split within the file
    private long length;                   // length of this logical split in bytes
    private String[] hosts;                // hosts holding the block data that backs this split
    private SplitLocationInfo[] hostInfos; // extra location info for those hosts

    public FileSplit() {}

    // constructor called when a logical split object is created
    public FileSplit(Path file, long start, long length, String[] hosts) {
        this.file = file;
        this.start = start;
        this.length = length;
        this.hosts = hosts;
    }
    //.....
}
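Inside a running map task, the split backing it can be inspected through the task context. A minimal sketch (the SplitAwareMapper class and its log line are illustrative, not part of the quoted source):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitAwareMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Each MapTask is bound to exactly one InputSplit; for file-based
        // input formats it is a FileSplit carrying path, offset, and length.
        FileSplit split = (FileSplit) context.getInputSplit();
        System.out.println("Processing " + split.getPath()
                + " from offset " + split.getStart()
                + " for " + split.getLength() + " bytes");
    }
}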
2) FileInputFormat source code
public abstract class FileInputFormat<K, V> implements InputFormat<K, V> {
    public static final String NUM_INPUT_FILES;     // constant values omitted in this excerpt
    public static final String INPUT_DIR_RECURSIVE;
    private static final double SPLIT_SLOP = 1.1;    // 10% slack allowed for the last split
    private long minSplitSize = 1;
    //........
    protected FileSplit makeSplit(Path file, long start, long length, String[] hosts) {
        return new FileSplit(file, start, length, hosts);
    }
    // (an overload that also takes a String[] of in-memory hosts is omitted here)
    //.....
    public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
        // get status information for the input files
        FileStatus[] files = listStatus(job);
        // Save the number of input files for metrics/loadgen
        job.setLong(NUM_INPUT_FILES, files.length);
        long totalSize = 0;                       // compute total size
        for (FileStatus file: files) {            // check we have valid files
            //.....
            totalSize += file.getLen();
        }
        long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
        long minSize = Math.max(job.getLong(org.apache.hadoop.mapreduce.lib.input.
            FileInputFormat.SPLIT_MINSIZE, 1), minSplitSize);
        // generate splits
        ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
        NetworkTopology clusterMap = new NetworkTopology();
        for (FileStatus file: files) {
            Path path = file.getPath();
            long length = file.getLen();
            if (length != 0) {
                FileSystem fs = path.getFileSystem(job);
                BlockLocation[] blkLocations;
                //....... (blkLocations is filled in from the file's block locations here)
                if (isSplitable(fs, path)) {
                    long blockSize = file.getBlockSize();
                    long splitSize = computeSplitSize(goalSize, minSize, blockSize);
                    long bytesRemaining = length;
                    // keep cutting full-size splits while more than SPLIT_SLOP (1.1x) of a split remains
                    while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
                        String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations,
                            length - bytesRemaining, splitSize, clusterMap);
                        splits.add(makeSplit(path, length - bytesRemaining, splitSize,
                            splitHosts[0], splitHosts[1]));
                        bytesRemaining -= splitSize;
                    }
                    // whatever is left (at most 1.1x the split size) becomes the final split
                    if (bytesRemaining != 0) {
                        String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations,
                            length - bytesRemaining, bytesRemaining, clusterMap);
                        splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
                            splitHosts[0], splitHosts[1]));
                    }
                } else {
                    //...... (not splittable: the whole file becomes a single split)
                }
            } else {
                //...... (zero-length file: an empty split is created)
            }
        }
        //......
        return splits.toArray(new FileSplit[splits.size()]);
    }
    /**
     * Compute the split size. (This method is quoted from the new-API
     * org.apache.hadoop.mapreduce.lib.input.FileInputFormat, where the split size is
     * clamped between minSize and maxSize; the old mapred API's getSplits above passes
     * goalSize = totalSize / numSplits in place of maxSize.)
     */
    protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }
}
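With the default configuration this clamp always resolves to the block size. A small standalone check that mirrors the formula above (the values are made up):

public class ComputeSplitSizeDemo {
    // same clamp as FileInputFormat: max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L << 20;                                         // 128 MB HDFS block
        System.out.println(computeSplitSize(blockSize, 1, Long.MAX_VALUE));  // defaults -> 134217728 (128 MB)
        System.out.println(computeSplitSize(blockSize, 1, 64L << 20));       // maxSize 64 MB -> 67108864 (splits shrink)
        System.out.println(computeSplitSize(blockSize, 256L << 20, Long.MAX_VALUE)); // minSize 256 MB -> 268435456 (splits grow)
    }
}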
3) TextInputFormat source code
public class TextInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        // the record delimiter defaults to '\n' but can be overridden in the job configuration
        String delimiter = context.getConfiguration().get("textinputformat.record.delimiter");
        byte[] recordDelimiterBytes = null;
        if (null != delimiter)
            recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
        return new LineRecordReader(recordDelimiterBytes);
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // uncompressed files are always splittable; compressed files only if the codec supports splitting
        final CompressionCodec codec = new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
        if (null == codec) {
            return true;
        }
        return codec instanceof SplittableCompressionCodec;
    }
}
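Because the delimiter is read from the job configuration, a custom record separator can be set in the driver (a sketch; the blank-line delimiter and job name are just examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class DelimiterDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Records now end at "\n\n" instead of the default newline,
        // so each blank-line-separated paragraph becomes one value passed to map().
        conf.set("textinputformat.record.delimiter", "\n\n");
        Job job = Job.getInstance(conf, "custom-delimiter-example");
        job.setInputFormatClass(TextInputFormat.class);
        // ... set mapper, reducer, input/output paths as usual ...
    }
}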
4) LineRecordReader source code
public class LineRecordReader extends RecordReader<LongWritable, Text> {
    private static final Log LOG = LogFactory.getLog(LineRecordReader.class);
    public static final String MAX_LINE_LENGTH = "mapreduce.input.linerecordreader.line.maxlength";
    private long start;                 // first byte of the split
    private long pos;                   // current read position
    private long end;                   // first byte past the split
    private SplitLineReader in;
    private FSDataInputStream fileIn;
    private Seekable filePosition;
    private int maxLineLength;
    private LongWritable key;
    private Text value;
    private boolean isCompressedInput;
    private Decompressor decompressor;
    private byte[] recordDelimiterBytes;

    public LineRecordReader() {
    }

    public LineRecordReader(byte[] recordDelimiter) {
        this.recordDelimiterBytes = recordDelimiter;
    }

    public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
        FileSplit split = (FileSplit) genericSplit;
        Configuration job = context.getConfiguration();
        this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
        start = split.getStart();
        end = start + split.getLength();
        final Path file = split.getPath();
        // open the file and seek to the start of the split
        final FileSystem fs = file.getFileSystem(job);
        fileIn = fs.open(file);
        CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
        if (null != codec) {
            //................... (compressed input is handled here)
        } else {
            fileIn.seek(start);
            in = new UncompressedSplitLineReader(
                fileIn, job, this.recordDelimiterBytes,
                split.getLength());
            filePosition = fileIn;
        }
        // If this is not the first split, we always throw away first record
        // because we always (except the last split) read one extra line in
        // next() method.
        if (start != 0) {
            start += in.readLine(new Text(), 0, maxBytesToConsume(start));
        }
        this.pos = start;
    }
    //.....
    public boolean nextKeyValue() throws IOException {
        if (key == null) {
            key = new LongWritable();
        }
        key.set(pos);                   // key = byte offset of the record in the file
        if (value == null) {
            value = new Text();
        }
        int newSize = 0;
        // We always read one extra line, which lies outside the upper
        // split limit i.e. (end - 1)
        while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
            if (pos == 0) {
                newSize = skipUtfByteOrderMark();
            } else {
                newSize = in.readLine(value, maxLineLength, maxBytesToConsume(pos));
                pos += newSize;
            }
            if ((newSize == 0) || (newSize < maxLineLength)) {
                break;
            }
            // line too long. try again
            LOG.info("Skipped line of size " + newSize + " at pos " + (pos - newSize));
        }
        if (newSize == 0) {
            key = null;
            value = null;
            return false;
        } else {
            return true;
        }
    }

    @Override
    public LongWritable getCurrentKey() {
        return key;
    }

    @Override
    public Text getCurrentValue() {
        return value;
    }
}
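The framework drives this reader once per map task; the same calls can be made by hand to see the (offset, line) pairs a mapper would receive. A standalone sketch, assuming a text file path is passed as the first argument and the whole file is treated as one split:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.TaskAttemptID;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl;

public class ReadOneSplit {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path file = new Path(args[0]);               // path to any text file
        long len = FileSystem.get(conf).getFileStatus(file).getLen();

        // One split covering the whole file; in a real job getSplits() produces these.
        FileSplit split = new FileSplit(file, 0, len, new String[0]);
        TaskAttemptContext ctx = new TaskAttemptContextImpl(conf, new TaskAttemptID());

        LineRecordReader reader = new LineRecordReader();
        reader.initialize(split, ctx);
        while (reader.nextKeyValue()) {              // the same calls the framework makes per record
            System.out.println(reader.getCurrentKey() + "\t" + reader.getCurrentValue());
        }
        reader.close();
    }
}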
IV. Split Summary
1) Split size parameters
From the source code, the logic FileInputFormat uses to compute the split size is:
Math.max(minSize, Math.min(maxSize, blockSize));
(this is the new mapreduce API; the older mapred API substitutes goalSize = totalSize / numSplits for maxSize). The split size is therefore determined by the following values:
Parameter | Default | Property
minsize | 1 | mapreduce.input.fileinputformat.split.minsize
maxsize | Long.MAX_VALUE | mapreduce.input.fileinputformat.split.maxsize
blocksize | HDFS block size (128 MB by default) | dfs.blocksize
As you can see, the split size is simply whichever of minsize, maxsize, and blocksize is the middle value.
1. Setting maxsize smaller than blocksize makes the splits smaller, equal to the configured maxsize.
2. Setting minsize larger than blockSize makes the splits larger than blocksize.
3. However, no matter how these parameters are tuned, multiple small files can never be "merged" into a single split: each file is split on its own.
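Both knobs can also be set from a driver through helper methods on the new-API FileInputFormat (a sketch; the 64 MB and 256 MB values are arbitrary):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitTuningDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-tuning-example");
        // Cap the split size at 64 MB (mapreduce.input.fileinputformat.split.maxsize),
        // which yields smaller splits and therefore more map tasks:
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
        // Or raise the minimum to 256 MB (mapreduce.input.fileinputformat.split.minsize)
        // to get splits larger than the block size (uncomment to use):
        // FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
        // ... the rest of the job setup as usual ...
    }
}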
2) How splits are created
1. Get the size and location of each input file.
2. Determine whether the file can be split (some compression formats support splitting, others do not).
3. Compute the split size.
4. While (remaining file size / split size) > 1.1, repeatedly build and record a split description containing: the file path, the split's starting offset, the number of bytes to process, the blocks backing the split, and the hosts those blocks are stored on.
5. When (remaining file size / split size) <= 1.1 and the remainder is not 0, build one final split description with the same fields for whatever is left.
Note the 1.1x slack: the last split may be up to 10% larger than the split size, which avoids producing a tiny trailing split. The sketch below walks through the arithmetic.
3) Details of how splits are read
When there are multiple splits:
1. The first split reads from the beginning and reads one extra line past its end.
2. A split that is neither the first nor the last discards its first line and reads one extra line past its end.
3. The last split discards its first line and reads to the end of the file (there is no extra line left to read).
4. Why: a block boundary rarely falls exactly at the end of a line, so the record straddling the boundary is read in full by the split that owns its beginning, and the next split skips that partial first line.
4) The difference between a split and a block
This follows directly from the source code:
1. A split is logical: it merely records which part of the physical data a map task should process.
2. A block is physical: it is the raw data actually stored in the file system.
For example, suppose the input consists of two files:
file1.txt  260 MB
file2.txt  10 MB
After FileInputFormat's split computation, the resulting splits are:
file1.txt.split1 -- 0 ~ 128 MB
file1.txt.split2 -- 128 MB ~ 260 MB
file2.txt.split1 -- 0 ~ 10 MB