Following a tutorial, I wrote a custom InputFormat. One part of it puzzled me, so I debugged it, and it turned out to behave exactly as I suspected. First, the code:
The custom InputFormat:
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class InputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        // Each file is read as a single record, so splitting must be disabled.
        return false;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // The framework calls this once per split and will also call initialize()
        // on the returned reader; calling it here as well is harmless.
        WholeRecordReader recordReader = new WholeRecordReader();
        recordReader.initialize(split, context);
        return recordReader;
    }
}
The RecordReader:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeRecordReader extends RecordReader<NullWritable, BytesWritable> {

    private FileSplit split;
    private Configuration configuration;
    private BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        this.split = (FileSplit) split;
        // Use the job's configuration rather than a fresh Configuration(),
        // so the filesystem settings from the driver are picked up.
        configuration = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!processed) {
            // 1. Allocate a buffer the size of the whole file (the split is the whole file).
            byte[] buffer = new byte[(int) split.getLength()];
            // 2. Get the FileSystem for the split's path.
            Path path = split.getPath();
            FileSystem fs = path.getFileSystem(configuration);
            // 3. Read the entire file into the buffer.
            FSDataInputStream fis = fs.open(path);
            try {
                IOUtils.readFully(fis, buffer, 0, buffer.length);
                // 4. Hand the file contents over as the current value.
                value.set(buffer, 0, buffer.length);
            } finally {
                IOUtils.closeStream(fis);
            }
            processed = true;
            return true;
        }
        return false;
    }

    @Override
    public NullWritable getCurrentKey() throws IOException, InterruptedException {
        return NullWritable.get();
    }

    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        // Either the whole file has been emitted or nothing has.
        return processed ? 1 : 0;
    }

    @Override
    public void close() throws IOException {
        // Nothing to close here; the input stream is closed in nextKeyValue().
    }
}
The Mapper, Reducer, and Driver classes are basically the same as in WordCount, so I won't paste them here.
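For completeness, here is a minimal sketch of the Driver wiring. The class name WholeFileDriver and the command-line argument handling are my own assumptions, not from the tutorial; the only line that really differs from a plain WordCount driver is setInputFormatClass:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal Driver sketch (names are hypothetical); the setMapperClass/setReducerClass
// and output-format calls of the real job are omitted here.
public class WholeFileDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(WholeFileDriver.class);

        // Key difference from WordCount: plug in the custom InputFormat
        // (the class defined above, assumed to be in the same package).
        job.setInputFormatClass(InputFormat.class);

        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(BytesWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}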
Here is what puzzled me at the time: in WholeRecordReader's nextKeyValue() method, processed is set to true after the if block runs. But one call to nextKeyValue() reads only a single file, and for the other files to be read, processed has to be false again. How does that work? At first I thought I had mixed up Java's variable kinds, so I went back over the lifetimes of the three kinds of variables (local, instance, and static); nothing wrong there. If every file were read through the same WholeRecordReader instance, then once processed has been set to true the second read could never succeed. My guess was that a new WholeRecordReader object is created after the first file has been read, so processed starts out as false again for the next file.
I set a breakpoint where the recordReader is created in the InputFormat class and debugged it: a new object really is created each time, and the number of instances matches the number of files.
After that I wanted to find exactly which part of the source code creates the new object, but my source attachment seems to be misconfigured: right at the crucial step nothing is shown. Annoying; I'll look at it again when I get the chance.
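Since I couldn't step into the framework source, here is a minimal sketch of the per-split lifecycle as I understand it. This is my own illustration, not Hadoop's actual code, and the method name runOneSplit is made up. The point is that createRecordReader() is called once per split, so every file gets a fresh WholeRecordReader whose processed flag starts out false:

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class SplitLifecycleSketch {

    // Illustration only: roughly what happens for each input split.
    // Because the files are not splitable, one split corresponds to one file.
    static void runOneSplit(InputFormat inputFormat, InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // A brand-new WholeRecordReader per split, so processed == false again.
        RecordReader<NullWritable, BytesWritable> reader = inputFormat.createRecordReader(split, context);
        reader.initialize(split, context);
        while (reader.nextKeyValue()) {
            // The Mapper's map() would be fed the current key/value here;
            // with WholeRecordReader this loop body runs exactly once per file.
            NullWritable key = reader.getCurrentKey();
            BytesWritable value = reader.getCurrentValue();
        }
        reader.close();
    }
}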
Debugging really is incredibly useful.