Learning Hadoop Step by Step (Part 6)

Latest recommended article published on 2025-08-20 22:58:53

lldustc

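License: CC 4.0 BY-SA

Tags: hadoop, linux, custom job input format

Article link: https://blog.youkuaiyun.com/lldustc_blog/article/details/8171481

This article explains how to customize the input format of a MapReduce job for specific needs. It covers the default job configuration, how to implement a custom InputFormat class, and a worked example.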

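The Hadoop framework is well designed: very little work is needed to get a MapReduce job running. Often, though, we want more control so that a job can meet the special requirements of a particular task, and that means changing the framework's default configuration.
Here are the main defaults that apply to a MapReduce job:
1. Input format (InputFormat): defaults to TextInputFormat, with key/value types LongWritable/Text. The key is the byte offset of the line within the file, and the value is the text of the line.
2. Mapper class: defaults to IdentityMapper, which does no work and emits its input unchanged. (In the newer org.apache.hadoop.mapreduce API, the base Mapper class itself behaves this way.)
3. Reducer class: defaults to IdentityReducer, which does no work and emits its input unchanged. (In the newer API, the base Reducer class behaves this way.)
4. Mapper output key type, mapper output value type, reducer output key type, and reducer output value type: Hadoop already implements many commonly used data types, and we can also define our own.
5. Partitioner class: maps each record of map output to the reduce task that will process it. The default is HashPartitioner, which works well in most cases.
6. Combiner class: can be seen as a local reduce step that optimizes the job by reducing the amount of data sent over the network from the map side to the reduce side.
7. Comparator class: determines how the map output delivered to a reducer is sorted.
8. GroupComparator class: determines which keys are grouped together into a single reduce call.
9. Output format (OutputFormat): determines how the output is stored. The default is TextOutputFormat, which converts each key/value pair to strings separated by \t. The separator can be changed by setting mapred.textoutputformat.separator (named mapreduce.output.textoutputformat.separator in later releases). The similar-looking key.value.separator.in.input.line property applies to the input side, where it is read by KeyValueTextInputFormat.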
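
To make the partitioner item in the list above more concrete, here is a minimal plain-Java sketch of hash partitioning. The expression inside `getPartition` follows the one used by Hadoop's HashPartitioner; everything else (the class name `HashPartitionSketch`, the `assign` helper, and the sample keys) is hypothetical and exists only for illustration. It relies on `String.hashCode()` rather than Hadoop's Writable types, so the partition numbers it prints should not be expected to match a real job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of hash partitioning; no Hadoop classes are used. */
final class HashPartitionSketch {

    /** Maps a key to a reduce task index in the range [0, numReduceTasks). */
    static int getPartition(Object key, int numReduceTasks) {
        // Masking with Integer.MAX_VALUE clears the sign bit, so a negative
        // hashCode() can never produce a negative partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Groups keys by the partition (reduce task) they would be sent to. */
    static Map<Integer, List<String>> assign(List<String> keys, int numReduceTasks) {
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String key : keys) {
            int partition = getPartition(key, numReduceTasks);
            byPartition.computeIfAbsent(partition, p -> new ArrayList<>()).add(key);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        // Hypothetical map output keys; equal keys always share a partition.
        List<String> keys = Arrays.asList("apple", "banana", "cherry", "apple");
        System.out.println(assign(keys, 3));
    }
}
```

Because the partition depends only on the key's hash, every record with the same key reaches the same reduce task, which is the property a reducer relies on.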

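Customizing the MapReduce job input format (InputFormat)
InputFormat is an abstract class with two methods. The first returns all the splits (InputSplit) of the data to be processed; the second creates a reader (RecordReader) that reads the data of one split.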
public abstract class InputFormat<K, V> {
    // Divides the job's input into logical splits.
    public abstract List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException;

    // Creates the reader that turns one split into key/value records.
    public abstract RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException;
}
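For most input that comes from files, there is no need to build an InputFormat from scratch: extend FileInputFormat and implement your own createRecordReader method.
1. The simplest example: if we do not want our files to be split, we only need to override FileInputFormat's isSplitable method. It returns true by default, meaning the file may be split; we change it to return false.
@Override
protected boolean isSplitable(JobContext context, Path filename) {
    return false;  // never split: each file becomes a single split
}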
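
To show what the isSplitable switch changes, here is a simplified plain-Java sketch of how a file's length could be carved into splits. It is modeled loosely on the loop in FileInputFormat, including the 10% slop that avoids a tiny trailing split, but it is not the library's code: the class `SplitSketch`, the method `computeSplits`, and the sizes in `main` are hypothetical, block locations are ignored, and an empty file simply yields no splits here.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of split computation; no Hadoop classes are used. */
final class SplitSketch {

    // A tail of up to 10% over the split size is folded into the last split
    // rather than becoming a very small split of its own.
    static final double SPLIT_SLOP = 1.1;

    /** Returns {offset, length} pairs describing the splits of one file. */
    static List<long[]> computeSplits(long fileLength, long splitSize, boolean splittable) {
        List<long[]> splits = new ArrayList<>();
        if (fileLength <= 0) {
            return splits; // simplification: nothing to read
        }
        if (!splittable) {
            // Corresponds to isSplitable returning false: one split per file.
            splits.add(new long[] {0, fileLength});
            return splits;
        }
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {fileLength - remaining, splitSize});
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits.add(new long[] {fileLength - remaining, remaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Hypothetical sizes: a 300 MB file with a 128 MB split size.
        for (long[] split : computeSplits(300 * mb, 128 * mb, true)) {
            System.out.println("offset=" + split[0] / mb + "MB length=" + split[1] / mb + "MB");
        }
        System.out.println("unsplittable: " + computeSplits(300 * mb, 128 * mb, false).size() + " split");
    }
}
```

With splitting enabled the file is divided among several map tasks; with it disabled the whole file goes to one task, which is what formats that cannot be read from an arbitrary offset need.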

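2. For other, more complex implementations, the main work is writing your own RecordReader class, which decides how the data in a split is read.
Below is an implementation of TimeUrlTextInputFormat, which parses each line into a timestamp and URL pair. Note that it is written against the older org.apache.hadoop.mapred API, where the factory method is getRecordReader rather than createRecordReader.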

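import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueLineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class TimeUrlTextInputFormat extends FileInputFormat<Text, URLWritable> {
    @Override
    public RecordReader<Text, URLWritable> getRecordReader(InputSplit input, JobConf job, Reporter reporter)
            throws IOException {
        return new TimeUrlLineRecordReader(job, (FileSplit) input);
    }
}

class TimeUrlLineRecordReader implements RecordReader<Text, URLWritable> {
    // Delegate that splits each line into a key and a value at the separator.
    private KeyValueLineRecordReader lineReader;
    // Reusable buffers for the delegate's raw key and value.
    private Text lineKey, lineValue;

    public TimeUrlLineRecordReader(JobConf job, FileSplit split) throws IOException {
        lineReader = new KeyValueLineRecordReader(job, split);
        lineKey = lineReader.createKey();
        lineValue = lineReader.createValue();
    }

    // Reads the next line; returns false once the split is exhausted.
    public boolean next(Text key, URLWritable value) throws IOException {
        if (!lineReader.next(lineKey, lineValue)) {
            return false;
        }
        key.set(lineKey);                  // timestamp
        value.set(lineValue.toString());   // URL; a malformed URL raises an IOException
        return true;
    }

    public Text createKey() {
        return new Text("");
    }

    public URLWritable createValue() {
        return new URLWritable();
    }

    public long getPos() throws IOException {
        return lineReader.getPos();
    }

    public float getProgress() throws IOException {
        return lineReader.getProgress();
    }

    public void close() throws IOException {
        lineReader.close();
    }
}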
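
The per-record work done by the reader above can be tried without a cluster. The following plain-Java sketch imitates it under two assumptions: that the line separator is the tab character (the default for KeyValueLineRecordReader), and that a line with no separator yields the whole line as the key and an empty value. The class `TimeUrlParseSketch`, the method `parse`, and the sample line are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.AbstractMap;
import java.util.Map;

/** Plain-Java sketch of the per-line parsing; no Hadoop classes are used. */
final class TimeUrlParseSketch {

    /** Splits a line at the first tab and parses the remainder as a URL. */
    static Map.Entry<String, URL> parse(String line) throws MalformedURLException {
        int tab = line.indexOf('\t');
        // Without a separator the whole line is the key and the value is empty,
        // which then fails URL parsing.
        String key = (tab < 0) ? line : line.substring(0, tab);
        String value = (tab < 0) ? "" : line.substring(tab + 1);
        return new AbstractMap.SimpleEntry<>(key, new URL(value));
    }

    public static void main(String[] args) throws MalformedURLException {
        // Hypothetical input line: timestamp, tab, URL.
        Map.Entry<String, URL> record = parse("12:00:01\thttp://example.com/index.html");
        System.out.println(record.getKey() + " -> " + record.getValue().getHost());
    }
}
```

One consequence worth noting: a single malformed line surfaces as an exception from the reader, so a real job may want to decide whether to skip such lines or fail.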

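import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import org.apache.hadoop.io.Writable;

public class URLWritable implements Writable {
    protected URL url;

    // No-argument constructor, used by createValue() above.
    public URLWritable() { }

    public URLWritable(URL url) {
        this.url = url;
    }

    // Serializes the URL as a single UTF string; url must have been set first.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(url.toString());
    }

    // Restores the URL from the string written by write().
    public void readFields(DataInput in) throws IOException {
        url = new URL(in.readUTF());
    }

    public void set(String s) throws MalformedURLException {
        url = new URL(s);
    }
}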
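
As a quick check of the serialization logic in URLWritable, the same write and read steps can be exercised with the JDK's own data streams. This is only a sketch: `UrlRoundTripSketch`, `serialize`, and `deserialize` are hypothetical names, and the class does not implement Hadoop's Writable interface, it merely repeats the writeUTF/readUTF calls.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.URL;

/** Plain-Java round trip of a URL through DataOutput/DataInput streams. */
final class UrlRoundTripSketch {

    /** Mirrors URLWritable.write: the URL is stored as one UTF string. */
    static byte[] serialize(URL url) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(url.toString());
        }
        return bytes.toByteArray();
    }

    /** Mirrors URLWritable.readFields: the string is read back and re-parsed. */
    static URL deserialize(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return new URL(in.readUTF());
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical URL used only for the demonstration.
        URL original = new URL("http://example.com/index.html");
        URL copy = deserialize(serialize(original));
        // Compare the string forms; URL.equals may trigger host name resolution.
        System.out.println(original.toString().equals(copy.toString()));
    }
}
```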

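The custom input format can then be registered in the MapReduce job. Because TimeUrlTextInputFormat is written against the old API, this is done with JobConf.setInputFormat(TimeUrlTextInputFormat.class); an input format written against the newer org.apache.hadoop.mapreduce API would be registered with Job.setInputFormatClass instead.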