Hadoop InputFormat and OutputFormat

License: CC 4.0 BY-SA

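This article looks at the roles of InputFormat and OutputFormat in the MapReduce framework. It explains what an InputSplit is and the part it plays in reading input data, and describes how different kinds of input are converted into the key-value pairs that a Map can process. It also outlines what OutputFormat does during the output stage: validating the job's output specification and providing the RecordWriter implementation used to write the output files.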

I. InputFormat overview:

(1) The InputFormat class: InputFormat describes and controls the data input of a MapReduce job.

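(2) InputSplit (input split): represents the data assigned to a single map task. An InputSplit does not store the data itself; it stores a split length and an array recording where the data is located. How InputSplits are generated is determined by the InputFormat. The InputFormat's getSplits method generates the InputSplit information, which has two parts: InputSplit metadata and the raw InputSplit information. The metadata is used by the JobTracker to build the data structures for task locality; the raw InputSplit information is used by a Map Task during initialization to find the data it has to process.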
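
To make the "length plus locations, not the data itself" idea concrete, here is a minimal sketch in plain Java. It does not use the Hadoop API: `SplitSketch`, `FileSplitInfo`, and `computeSplits` are hypothetical names, the path and host names are made up, and the logic is simplified on purpose (a fixed split size, and the same host list for every split, whereas in a real cluster the locations depend on where each block is stored).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the InputSplit idea; NOT the Hadoop classes.
class SplitSketch {

    // Hypothetical stand-in for an input split: it holds only metadata
    // (where the data is and how long it is), never the bytes themselves.
    static final class FileSplitInfo {
        final String path;    // file the split belongs to
        final long start;     // byte offset where the split begins
        final long length;    // number of bytes in the split
        final String[] hosts; // nodes assumed to hold the data (locality hint)

        FileSplitInfo(String path, long start, long length, String[] hosts) {
            this.path = path;
            this.start = start;
            this.length = length;
            this.hosts = hosts;
        }

        @Override
        public String toString() {
            return path + ":" + start + "+" + length + " " + Arrays.toString(hosts);
        }
    }

    // Simplified counterpart of getSplits: cut a file of fileLength bytes
    // into consecutive ranges of at most splitSize bytes.
    static List<FileSplitInfo> computeSplits(String path, long fileLength,
                                             long splitSize, String[] hosts) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<FileSplitInfo> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            long length = Math.min(splitSize, fileLength - start);
            splits.add(new FileSplitInfo(path, start, length, hosts));
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 300-byte file with a 128-byte split size gives three splits.
        String[] hosts = {"host1", "host2"};
        for (FileSplitInfo s : computeSplits("/data/input.txt", 300L, 128L, hosts)) {
            System.out.println(s);
        }
    }
}
```

Each map task would receive one such descriptor and read only the byte range it names, which is why the split can be shipped around cheaply.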

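(3) The data a map task processes has already been broken up by the InputFormat, which divides the data set into input splits (InputSplit). The map task passes its split to the InputFormat, which calls getRecordReader to produce a RecordReader. The RecordReader then creates the key and value objects with createKey and createValue and fills them record by record, so that Map receives <key,value> pairs. (These method names come from the older org.apache.hadoop.mapred API.)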
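
The read loop described in (3) can be sketched in plain Java as follows. This is only a model of the pattern, not Hadoop code: `SimpleRecordReader`, `LineReader`, `LongHolder`, and `TextHolder` are hypothetical names, the holders merely stand in for mutable key and value types, and the offsets are character offsets into an in-memory string rather than byte offsets into a file.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the record-reading loop; NOT the Hadoop classes.
class RecordReaderSketch {

    // Mutable holders, so one key object and one value object can be reused.
    static final class LongHolder { long value; }
    static final class TextHolder { String value = ""; }

    // Hypothetical interface with the same shape as the methods named above:
    // createKey, createValue, and a next(key, value) that fills them in.
    interface SimpleRecordReader<K, V> {
        K createKey();
        V createValue();
        boolean next(K key, V value); // false when the split is exhausted
    }

    // Reads lines from an in-memory "split": key = offset, value = line text.
    static final class LineReader implements SimpleRecordReader<LongHolder, TextHolder> {
        private final String data;
        private int pos = 0;

        LineReader(String data) { this.data = data; }

        @Override public LongHolder createKey() { return new LongHolder(); }
        @Override public TextHolder createValue() { return new TextHolder(); }

        @Override
        public boolean next(LongHolder key, TextHolder value) {
            if (pos >= data.length()) {
                return false;
            }
            int end = data.indexOf('\n', pos);
            if (end < 0) {
                end = data.length(); // last line without a trailing newline
            }
            key.value = pos;
            value.value = data.substring(pos, end);
            pos = end + 1;
            return true;
        }
    }

    // What a map runner does: create key/value once, then loop on next().
    static List<String> runMap(String splitData) {
        LineReader reader = new LineReader(splitData);
        LongHolder key = reader.createKey();
        TextHolder value = reader.createValue();
        List<String> seen = new ArrayList<>();
        while (reader.next(key, value)) {
            seen.add(key.value + "=" + value.value); // a real job would call map(key, value) here
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(runMap("hello world\nfoo bar\n")); // prints [0=hello world, 12=foo bar]
    }
}
```

Reusing the same key and value objects across iterations is the design choice this pattern exists for: it avoids allocating a new pair of objects for every record.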

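(4) Hadoop predefines several ways of turning different kinds of input data into <key,value> pairs that Map can process (you can also define your own). They all implement InputFormat: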

*DBInputFormat

*DelegatingInputFormat

*FileInputFormat: CombineFileInputFormat, KeyValueTextInputFormat, NLineInputFormat, SequenceFileInputFormat, TextInputFormat.
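
As an illustration of how two of these formats differ in the pairs they hand to Map: TextInputFormat uses the line's byte offset as the key and the whole line as the value, while KeyValueTextInputFormat splits each line at the first separator (a tab by default) into key and value. The sketch below models only that second splitting rule in plain Java; `KeyValueLineSketch` and `splitLine` are hypothetical names, not Hadoop API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Plain-Java model of "split a line at the first separator"; NOT Hadoop code.
class KeyValueLineSketch {

    // Everything before the first separator becomes the key, everything
    // after it becomes the value. If the separator is absent, the whole
    // line is the key and the value is empty.
    static Map.Entry<String, String> splitLine(String line, char separator) {
        int idx = line.indexOf(separator);
        if (idx < 0) {
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, idx), line.substring(idx + 1));
    }

    public static void main(String[] args) {
        Map.Entry<String, String> kv = splitLine("user1\tclicked\thome", '\t');
        // Only the first tab splits the line; later tabs stay in the value.
        System.out.println("key=" + kv.getKey() + " value=" + kv.getValue());
    }
}
```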

II. OutputFormat overview:

(1) The OutputFormat class: OutputFormat describes and controls the data output of a MapReduce job.

(2) What the MapReduce framework relies on OutputFormat to do:

*Validate the output-specification of the job. For example, check that the output directory doesn't already exist.
*Provide the RecordWriter implementation to be used to write out the output files of the job. Output files are stored in a FileSystem.
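
The two responsibilities above can be sketched in plain Java against the local file system. This is a model only, under stated assumptions: `OutputFormatSketch`, `SimpleRecordWriter`, and the helper names are hypothetical and do not reproduce Hadoop's signatures, and the tab-separated "key, tab, value" line layout is simply the one chosen for this example.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Plain-Java model of the two OutputFormat duties; NOT the Hadoop classes.
class OutputFormatSketch {

    // Hypothetical writer interface: one write call per <key,value> pair.
    interface SimpleRecordWriter extends AutoCloseable {
        void write(String key, String value) throws IOException;
        @Override void close() throws IOException;
    }

    // Duty 1: validate the output specification before the job runs.
    // Here that means refusing to overwrite an existing output directory.
    static void checkOutputSpecs(Path outputDir) throws IOException {
        if (Files.exists(outputDir)) {
            throw new IOException("Output directory " + outputDir + " already exists");
        }
    }

    // Duty 2: provide the writer that produces the job's output file.
    static SimpleRecordWriter getRecordWriter(Path outputDir, String fileName) throws IOException {
        Files.createDirectories(outputDir);
        final BufferedWriter out =
                Files.newBufferedWriter(outputDir.resolve(fileName), StandardCharsets.UTF_8);
        return new SimpleRecordWriter() {
            @Override
            public void write(String key, String value) throws IOException {
                out.write(key + "\t" + value);
                out.write('\n');
            }

            @Override
            public void close() throws IOException {
                out.close();
            }
        };
    }

    public static void main(String[] args) throws IOException {
        Path outDir = Files.createTempDirectory("of-sketch").resolve("out");
        checkOutputSpecs(outDir); // passes: the directory does not exist yet
        try (SimpleRecordWriter writer = getRecordWriter(outDir, "part-00000")) {
            writer.write("hello", "2");
        }
        try {
            checkOutputSpecs(outDir); // a second job targeting the same directory
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Running the check before any task starts means a misconfigured job fails immediately instead of after the map phase has already done its work.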