Enter CombineFileInputFormat, which packs many files into each split so that each mapper has more to process. Unlike simply raising the split size, CombineFileInputFormat takes node and rack locality into account when deciding which blocks to place in the same split, so it avoids the locality penalty that a single oversized split would incur.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueLineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

public class MyCombineFileInputFormat extends CombineFileInputFormat<Text, Text> {

  public static class MyKeyValueLineRecordReader
      implements RecordReader<Text, Text> {

    private final KeyValueLineRecordReader delegate;

    public MyKeyValueLineRecordReader(CombineFileSplit split,
        Configuration conf, Reporter reporter, Integer idx) throws IOException {
      // Carve the idx-th file out of the combined split and hand it to an
      // ordinary KeyValueLineRecordReader.
      FileSplit fileSplit = new FileSplit(split.getPath(idx),
          split.getOffset(idx), split.getLength(idx), split.getLocations());
      delegate = new KeyValueLineRecordReader(conf, fileSplit);
    }

    @Override
    public boolean next(Text key, Text value) throws IOException {
      return delegate.next(key, value);
    }

    @Override
    public Text createKey() {
      return delegate.createKey();
    }

    @Override
    public Text createValue() {
      return delegate.createValue();
    }

    @Override
    public long getPos() throws IOException {
      return delegate.getPos();
    }

    @Override
    public void close() throws IOException {
      delegate.close();
    }

    @Override
    public float getProgress() throws IOException {
      return delegate.getProgress();
    }
  }

  @Override
  public RecordReader<Text, Text> getRecordReader(InputSplit split,
      JobConf job, Reporter reporter) throws IOException {
    return new CombineFileRecordReader<Text, Text>(job,
        (CombineFileSplit) split, reporter,
        (Class) MyKeyValueLineRecordReader.class);
  }
}
CombineFileInputFormat is an abstract class, so you need to extend it and override the getRecordReader() method. CombineFileRecordReader handles the multiple files packed into a CombineFileSplit simply by constructing a new RecordReader for each file in turn. MyKeyValueLineRecordReader creates a KeyValueLineRecordReader for its assigned file and delegates every operation to it.
Remember to set mapred.max.split.size to a small multiple of the block size, in bytes; otherwise no splitting will occur at all and the entire input may be combined into a single split.
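To tie this together, here is a minimal driver sketch using the old mapred API. It is a hedged example, not the definitive setup: the class name SmallFilesDriver and the input/output paths are illustrative, and the 128 MB block size is an assumption you should replace with your cluster's actual value.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Illustrative driver: configures a job to read many small files
// through MyCombineFileInputFormat (defined above).
public class SmallFilesDriver {

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SmallFilesDriver.class);
    conf.setJobName("combine-small-files");

    // Cap each combined split at two 128 MB blocks (assumed block size).
    // Without this, everything may be packed into one split.
    conf.setLong("mapred.max.split.size", 2L * 128 * 1024 * 1024);

    conf.setInputFormat(MyCombineFileInputFormat.class);

    // Hypothetical paths; substitute your own.
    FileInputFormat.addInputPath(conf, new Path("/input/small-files"));
    FileOutputFormat.setOutputPath(conf, new Path("/output"));

    JobClient.runJob(conf);
  }
}
```

Setting the property through JobConf keeps the limit per-job rather than cluster-wide; you could equally pass it on the command line with -D mapred.max.split.size=... if the driver uses ToolRunner.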