Source study - part 3 - MapReduce: what is a split?


"split" which is a logical concept relatives to a "block" whis is real store unit.

When a client submits a job to the JobTracker (JT), it computes the splits file by file; the TaskTracker (TT) then hands an InputSplit to each map task.

So splits are what mappers are spawned from. If you use FileInputFormat and make isSplitable() return false, the file will NOT be split, so the whole file goes to a single mapper.

A RecordReader is used on the task side to recover records from the data that the client split before submitting to the JT. So, if you want to, you can read a whole split as a single record.

 

Combining FileInputFormat and RecordReader, you can get a single record for a whole file:

a. make isSplitable() return false;

b. override next() in the RecordReader to read the whole split in one call.
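Here is a minimal sketch of that combination, using the old org.apache.hadoop.mapred API (the one with next(), as mentioned above). The class names are illustrative, not from Hadoop itself:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false; // (a) never split: one file -> one split -> one mapper
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }

    static class WholeFileRecordReader implements RecordReader<NullWritable, BytesWritable> {
        private final FileSplit split;
        private final JobConf conf;
        private boolean processed = false;

        WholeFileRecordReader(FileSplit split, JobConf conf) {
            this.split = split;
            this.conf = conf;
        }

        // (b) next() reads the whole split -- which is the whole file, since
        // isSplitable() returned false -- in a single call.
        public boolean next(NullWritable key, BytesWritable value) throws IOException {
            if (processed) {
                return false; // only one record per file
            }
            byte[] contents = new byte[(int) split.getLength()];
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }

        public NullWritable createKey() { return NullWritable.get(); }
        public BytesWritable createValue() { return new BytesWritable(); }
        public long getPos() { return processed ? split.getLength() : 0; }
        public float getProgress() { return processed ? 1.0f : 0.0f; }
        public void close() { }
    }
}

With this InputFormat, map() is invoked exactly once per input file, with the file's entire contents as the value.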


How is the split size computed?

New version formula:

split size = max(minSplitSize, min(maxSplitSize, blockSize))
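In code, this is exactly the expression in FileInputFormat.computeSplitSize(); the surrounding demo class and the concrete numbers are assumptions (in Hadoop of this era, the limits come from mapred.min.split.size and mapred.max.split.size):

public class NewSplitSizeDemo {

    // Same expression as FileInputFormat.computeSplitSize():
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L << 20;    // assumed 64 MB HDFS block
        long minSize = 1L;             // default min split size
        long maxSize = Long.MAX_VALUE; // default max split size
        // With default limits, the split size is simply the block size:
        System.out.println(computeSplitSize(blockSize, minSize, maxSize)); // 67108864
    }
}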

Note that the final number of splits is not simply the file length divided by the split size: a split slop factor is used as an optimization. The loop only cuts a full split while more than SPLIT_SLOP (= 1.1) times the split size remains, so the last split can be up to 10% larger than the others, presumably to save the seek and task-startup cost of a tiny trailing split.
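A self-contained sketch of that optimization (the SPLIT_SLOP constant and the loop shape follow FileInputFormat.getSplits(); the demo class and numbers are mine):

import java.util.ArrayList;
import java.util.List;

public class SplitSlopDemo {

    private static final double SPLIT_SLOP = 1.1; // 10% slack, as in FileInputFormat

    // Returns (offset, length) pairs for a single file of `length` bytes.
    static List<long[]> splitsFor(long length, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long bytesRemaining = length;
        // Only cut a full-size split while the remainder is MORE than 1.1x the
        // split size; otherwise the tail stays in one slightly larger split.
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[]{length - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            splits.add(new long[]{length - bytesRemaining, bytesRemaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 130 MB file with 64 MB splits yields 64 MB + 66 MB,
        // not 64 + 64 + 2 MB: the 2 MB tail is folded into the last split,
        // saving a map task (and an extra seek) for a trivial amount of data.
        for (long[] s : splitsFor(130L << 20, 64L << 20)) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
    }
}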


Old version formula:

split size = max(minSplitSize, min(goalSize, blockSize))

The goalSize comes from dividing the total size of all input files by numMapTasks.

Of course, the split slop factor applies here as well.
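A worked example of the old formula, with assumed numbers (goalSize = totalSize / numMapTasks, as in the old FileInputFormat.getSplits()):

public class OldSplitSizeDemo {

    // Old-formula version of computeSplitSize(): goalSize replaces maxSplitSize.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long totalSize = 10L << 30;              // assumed 10 GB across all input files
        int numMapTasks = 100;                   // the mapred.map.tasks hint
        long goalSize = totalSize / numMapTasks; // ~102 MB per desired mapper
        long blockSize = 64L << 20;              // assumed 64 MB HDFS block
        long minSize = 1L;                       // default min split size
        // min(goalSize, blockSize) = 64 MB: the block size wins, so the job
        // actually gets ~160 mappers rather than the 100 it asked for.
        System.out.println(computeSplitSize(goalSize, minSize, blockSize));
    }
}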

 

Finally, the client generates a split file that summarizes all the splits' info and writes it to the DFS. Since a split is only logical metadata, the application gets a second chance to adapt to the actual input size when the data reaches the mapper.

 

How are records restored from a split?

Yes, it is exciting to talk about this subject, because the splits are computed by the client before the job is submitted without looking at the content: line length is not considered (a line may exceed the mapred.linerecordreader.maxlength threshold), nor is whether the boundary breaks inside a line or inside a non-ASCII (multi-byte) character.

In local mode, it is LocalJobRunner that runs the tasks. Naturally, a LineReader is used to recover records from each split (a fragment of the raw file, really) and push them to a mapper. The important things that make this work are (see the sketch after this list):

A. Each split keeps its raw file (parent file) as a property, and it stores a pair: the current data offset (relative to the raw file) and the data length of the split.

 

B. CR and LF are both single-byte ASCII codes (and those byte values never occur inside a multi-byte UTF-8 character), so line delimiters can be scanned for byte by byte, no matter where a split boundary falls.
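Here is a simplified, self-contained sketch of how a LineReader-style reader applies those two properties to restore whole lines from a byte range of the raw file. This is illustrative plain Java, not Hadoop's actual LineReader, but the ownership rule it implements (a split owns every line that starts inside [start, start+length)) matches what LineRecordReader does:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class SplitLineReaderSketch {

    // Read exactly the lines "owned" by the byte range [start, start+length)
    // of the raw (parent) file -- property A above: each split knows its
    // parent file, its offset, and its length.
    static List<String> readSplit(String rawFile, long start, long length) throws IOException {
        List<String> lines = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(rawFile, "r")) {
            long end = start + length;
            if (start == 0) {
                raf.seek(0);
            } else {
                // Back up one byte and discard everything up to the next line
                // break: the partial first line belongs to the previous split,
                // which reads past its own end to finish it. Scanning for the
                // break bytewise is safe because CR/LF are single bytes (B).
                raf.seek(start - 1);
                raf.readLine();
            }
            // Keep reading whole lines while they START before `end`; the line
            // that straddles the boundary is ours, not the next split's.
            while (raf.getFilePointer() < end) {
                String line = raf.readLine();
                if (line == null) break; // EOF
                lines.add(line);
            }
        }
        return lines;
    }
}

Calling readSplit("data.txt", 0, 64) and readSplit("data.txt", 64, 64) on the same file returns every line exactly once, no matter where byte 64 falls inside a line.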

 

And that is how local mode does it. What about a real cluster? TODO :)

 

 

By the way, there is a trick in LocalJobRunner to avoid re-splitting the raw file; see its job.run():

if (job.getUseNewMapper()) {
  ...
} else {
  ...
}

You could use JobClient.getSplits() instead of it; maybe that is an "optimization" :)

 

 
