Source study - part 3 - MapReduce: what is a split?


"split" which is a logical concept relatives to a "block" whis is real store unit.

When a client submits a job to the JobTracker (JT), it computes the splits file by file; the TaskTracker (TT) then hands an InputSplit to each map task.

So splits are what mappers are spawned from. If you use FileInputFormat and make isSplitable() return false, the file will NOT be split, so the whole file goes to a single mapper.

A RecordReader is used on the task side to recover records from the data that the client split before submitting to the JT. So, if you want to, you can read a whole split as a single record.

 

Combining FileInputFormat and RecordReader, you can get a single record for a whole file:

a. make isSplitable() return false;

b. override next() in the RecordReader to read the whole split in one call.
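Here is a minimal sketch of that combination, using the old org.apache.hadoop.mapred API (the one with next(), as mentioned above). The class names are illustrative, not from Hadoop itself:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false; // (a) never split: one file -> one split -> one mapper
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }

    static class WholeFileRecordReader implements RecordReader<NullWritable, BytesWritable> {
        private final FileSplit split;
        private final JobConf conf;
        private boolean processed = false;

        WholeFileRecordReader(FileSplit split, JobConf conf) {
            this.split = split;
            this.conf = conf;
        }

        // (b) next() reads the whole split -- which is the whole file, since
        // isSplitable() returned false -- in a single call.
        public boolean next(NullWritable key, BytesWritable value) throws IOException {
            if (processed) {
                return false; // only one record per file
            }
            byte[] contents = new byte[(int) split.getLength()];
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }

        public NullWritable createKey() { return NullWritable.get(); }
        public BytesWritable createValue() { return new BytesWritable(); }
        public long getPos() { return processed ? split.getLength() : 0; }
        public float getProgress() { return processed ? 1.0f : 0.0f; }
        public void close() { }
    }
}

With this InputFormat, map() is invoked exactly once per input file, with the file's entire contents as the value.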


How is the split size computed?

New version formula:

split size = max(minSplitSize, min(maxSplitSize, blockSize))
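In code, this is exactly the expression in FileInputFormat.computeSplitSize(); the surrounding demo class and the concrete numbers are assumptions (in Hadoop of this era, the limits come from mapred.min.split.size and mapred.max.split.size):

public class NewSplitSizeDemo {

    // Same expression as FileInputFormat.computeSplitSize():
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L << 20;    // assumed 64 MB HDFS block
        long minSize = 1L;             // default min split size
        long maxSize = Long.MAX_VALUE; // default max split size
        // With default limits, the split size is simply the block size:
        System.out.println(computeSplitSize(blockSize, minSize, maxSize)); // 67108864
    }
}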

Note that the final number of splits is not simply the file length divided by the split size: a split slop factor is used as an optimization. The loop only cuts a full split while more than SPLIT_SLOP (= 1.1) times the split size remains, so the last split can be up to 10% larger than the others, presumably to save the seek and task-startup cost of a tiny trailing split.
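A self-contained sketch of that optimization (the SPLIT_SLOP constant and the loop shape follow FileInputFormat.getSplits(); the demo class and numbers are mine):

import java.util.ArrayList;
import java.util.List;

public class SplitSlopDemo {

    private static final double SPLIT_SLOP = 1.1; // 10% slack, as in FileInputFormat

    // Returns (offset, length) pairs for a single file of `length` bytes.
    static List<long[]> splitsFor(long length, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long bytesRemaining = length;
        // Only cut a full-size split while the remainder is MORE than 1.1x the
        // split size; otherwise the tail stays in one slightly larger split.
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[]{length - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            splits.add(new long[]{length - bytesRemaining, bytesRemaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 130 MB file with 64 MB splits yields 64 MB + 66 MB,
        // not 64 + 64 + 2 MB: the 2 MB tail is folded into the last split,
        // saving a map task (and an extra seek) for a trivial amount of data.
        for (long[] s : splitsFor(130L << 20, 64L << 20)) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
    }
}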


Old version formula:

split size = max(minSplitSize, min(goalSize, blockSize))

The goalSize comes from dividing the total size of all input files by numMapTasks.

Of course, the split slop factor applies here as well.
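A worked example of the old formula, with assumed numbers (goalSize = totalSize / numMapTasks, as in the old FileInputFormat.getSplits()):

public class OldSplitSizeDemo {

    // Old-formula version of computeSplitSize(): goalSize replaces maxSplitSize.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long totalSize = 10L << 30;              // assumed 10 GB across all input files
        int numMapTasks = 100;                   // the mapred.map.tasks hint
        long goalSize = totalSize / numMapTasks; // ~102 MB per desired mapper
        long blockSize = 64L << 20;              // assumed 64 MB HDFS block
        long minSize = 1L;                       // default min split size
        // min(goalSize, blockSize) = 64 MB: the block size wins, so the job
        // actually gets ~160 mappers rather than the 100 it asked for.
        System.out.println(computeSplitSize(goalSize, minSize, blockSize));
    }
}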

 

Finally, the client generates a split file that summarizes all the splits' info and writes it to the DFS. Since a split is only logical metadata, the application gets a second chance to adapt to the actual input size when the data reaches the mapper.

 

How are records restored from a split?

Yes, it is exciting to talk about this subject, because the splits are computed by the client before the job is submitted without looking at the content: line length is not considered (a line may exceed the mapred.linerecordreader.maxlength threshold), nor is whether the boundary breaks inside a line or inside a non-ASCII (multi-byte) character.

In local mode, it is LocalJobRunner that runs the tasks. Naturally, a LineReader is used to recover records from each split (a fragment of the raw file, really) and push them to a mapper. The important things that make this work are (see the sketch after this list):

A. Each split keeps its raw file (parent file) as a property, and it stores a pair: the current data offset (relative to the raw file) and the data length of the split.

 

B. CR and LF are both single-byte ASCII codes (and those byte values never occur inside a multi-byte UTF-8 character), so line delimiters can be scanned for byte by byte, no matter where a split boundary falls.
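Here is a simplified, self-contained sketch of how a LineReader-style reader applies those two properties to restore whole lines from a byte range of the raw file. This is illustrative plain Java, not Hadoop's actual LineReader, but the ownership rule it implements (a split owns every line that starts inside [start, start+length)) matches what LineRecordReader does:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class SplitLineReaderSketch {

    // Read exactly the lines "owned" by the byte range [start, start+length)
    // of the raw (parent) file -- property A above: each split knows its
    // parent file, its offset, and its length.
    static List<String> readSplit(String rawFile, long start, long length) throws IOException {
        List<String> lines = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(rawFile, "r")) {
            long end = start + length;
            if (start == 0) {
                raf.seek(0);
            } else {
                // Back up one byte and discard everything up to the next line
                // break: the partial first line belongs to the previous split,
                // which reads past its own end to finish it. Scanning for the
                // break bytewise is safe because CR/LF are single bytes (B).
                raf.seek(start - 1);
                raf.readLine();
            }
            // Keep reading whole lines while they START before `end`; the line
            // that straddles the boundary is ours, not the next split's.
            while (raf.getFilePointer() < end) {
                String line = raf.readLine();
                if (line == null) break; // EOF
                lines.add(line);
            }
        }
        return lines;
    }
}

Calling readSplit("data.txt", 0, 64) and readSplit("data.txt", 64, 64) on the same file returns every line exactly once, no matter where byte 64 falls inside a line.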

 

And that is how local mode does it. What about a real cluster? TODO :)

 

 

By the way, there is a trick in LocalJobRunner to avoid re-splitting the raw file; see its job.run():

if (job.getUseNewMapper()) {
  ...
} else {
  ...
}

You could use JobClient.getSplits() instead of it; maybe that is an "optimization" :)

 

 
