InputFormat Interface Implementations

佳讯好

Published 2024-03-17 20:13:26

Article link: https://blog.youkuaiyun.com/qq_15304885/article/details/136788013

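This article explains how a MapReduce job reads large files stored in HDFS. It focuses on the differences between TextInputFormat, KeyValueTextInputFormat, and NLineInputFormat, and on the roles the Mapper and Reducer play in processing the records.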

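The input files of a MapReduce job are usually stored in HDFS. Input formats include line-based log files, binary files, and others. These files are typically large, reaching tens of gigabytes or more. So how does MapReduce read this data? We start with the InputFormat interface.

Common InputFormat implementations include TextInputFormat, KeyValueTextInputFormat, NLineInputFormat, CombineTextInputFormat, and custom InputFormat classes.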

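1) TextInputFormat

TextInputFormat is the default InputFormat. Each record is one line of input. The key K is a LongWritable that stores the byte offset of the start of the line within the whole file. The value V is a Text holding the content of the line, excluding any line terminators (newline and carriage return).

Consider the following example, in which a split contains these four text records.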

Rich learning form

Intelligent learning engine

Learning more convenient

From the real demand for more close to the enterprise

Each record is represented as the following key/value pair:

(0,Rich learning form)

(20,Intelligent learning engine)

(49,Learning more convenient)

(75,From the real demand for more close to the enterprise)

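Clearly, the keys are not line numbers. In general, line numbers are hard to obtain, because a file is divided into splits by byte rather than by line.

Offset formula: the key of a line = the key of the previous line + the number of bytes in the previous line (letters, spaces, and punctuation) + the bytes of its line terminator. The keys above assume a two-byte \r\n terminator (for example, a file saved on Windows): 18 + 2 = 20, 20 + 27 + 2 = 49, and 49 + 24 + 2 = 75. With a one-byte \n terminator the keys would be 0, 19, 47, and 72.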
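The offset arithmetic above can be checked with a small plain-Java sketch. It does not use Hadoop at all; the class name LineOffsets and the method lineStartOffsets are hypothetical names introduced only for illustration, and the sketch assumes that lines end in \n or \r\n.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: mimics how the byte-offset keys arise, without using Hadoop.
public class LineOffsets {

    // Returns the byte offset at which each line of the content starts.
    // Assumes lines end in "\n" or "\r\n"; a lone "\r" is not treated as a terminator.
    static List<Long> lineStartOffsets(String content) {
        byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
        List<Long> offsets = new ArrayList<>();
        if (bytes.length > 0) {
            offsets.add(0L); // the first line always starts at byte 0
        }
        for (int i = 0; i < bytes.length; i++) {
            // A new line starts right after each '\n', unless the content ends there.
            if (bytes[i] == '\n' && i + 1 < bytes.length) {
                offsets.add((long) (i + 1));
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        String[] lines = {
            "Rich learning form",
            "Intelligent learning engine",
            "Learning more convenient",
            "From the real demand for more close to the enterprise"
        };
        // Two-byte terminators, as in a file saved on Windows
        System.out.println(lineStartOffsets(String.join("\r\n", lines))); // [0, 20, 49, 75]
        // One-byte terminators, as in a file saved on Linux
        System.out.println(lineStartOffsets(String.join("\n", lines)));   // [0, 19, 47, 72]
    }
}
```

The two printed lists differ only because of the terminator length, which is why the same text can yield different keys on different platforms.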

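2) KeyValueTextInputFormat

Each line is one record, split by a separator into a key and a value. The separator can be set in the driver class with conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, " "); (the constant name is spelled this way in Hadoop). The default separator is the tab character (\t). A line is split at the first occurrence of the separator; if a line contains no separator, the whole line becomes the key and the value is empty. The input format is selected on the job as follows: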

job.setInputFormatClass(KeyValueTextInputFormat.class);

Consider the following example, in which the input is a split containing four records, where → represents a (horizontal) tab character.

Rich→learning form

Intelligent→learning engine

Learning→more convenient

From→the real demand for more close to the enterprise

Each record is represented as the following key/value pair:

(Rich, learning form)

(Intelligent, learning engine)

(Learning, more convenient)

(From, the real demand for more close to the enterprise)

Here the key is the Text sequence that precedes the tab on each line.
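The splitting rule can be imitated in plain Java as a rough sketch. This is not Hadoop's own code: KeyValueSplit and splitKeyValue are hypothetical names introduced for illustration, and the sketch assumes a single-character separator with the split made at its first occurrence.

```java
// Illustrative only: imitates how one line is divided into a key and a value.
public class KeyValueSplit {

    // Splits the line at the first occurrence of the separator.
    // If the separator is absent, the whole line is the key and the value is empty.
    static String[] splitKeyValue(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] {line, ""};
        }
        return new String[] {line.substring(0, pos), line.substring(pos + 1)};
    }

    public static void main(String[] args) {
        String[] lines = {
            "Rich\tlearning form",
            "Intelligent\tlearning engine",
            "Learning\tmore convenient",
            "From\tthe real demand for more close to the enterprise"
        };
        for (String line : lines) {
            String[] kv = splitKeyValue(line, '\t');
            System.out.println("(" + kv[0] + ", " + kv[1] + ")");
        }
    }
}
```

Because only the first separator counts, any later tab characters on a line stay inside the value.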

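3) NLineInputFormat

With NLineInputFormat, the InputSplit handled by each map task is no longer determined by HDFS block boundaries but by the number of lines N configured for NLineInputFormat. The number of splits equals the total number of lines in the input file divided by N; if the division leaves a remainder, the number of splits is the quotient plus 1.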
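The split-count rule is a ceiling division, which the following plain-Java sketch illustrates. It does not call Hadoop; NLineSplits, splitCount, and partition are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: shows how lines are grouped N at a time.
public class NLineSplits {

    // Number of splits = ceil(totalLines / n), computed with integer arithmetic.
    static int splitCount(int totalLines, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        return (totalLines + n - 1) / n;
    }

    // Groups the lines into consecutive chunks of at most n lines each.
    static List<List<String>> partition(List<String> lines, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += n) {
            int end = Math.min(start + n, lines.size());
            splits.add(new ArrayList<>(lines.subList(start, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "Rich learning form",
            "Intelligent learning engine",
            "Learning more convenient",
            "From the real demand for more close to the enterprise");
        System.out.println(splitCount(lines.size(), 2)); // 2
        System.out.println(splitCount(5, 2));            // 3 (quotient 2 plus 1 for the remainder)
        System.out.println(partition(lines, 2).size());  // 2
    }
}
```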

The following example again uses the four input lines from above.

Rich learning form

Intelligent learning engine

Learning more convenient

From the real demand for more close to the enterprise

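For example, if N is 2, each input split contains two lines, so two map tasks are started. The first mapper receives the first two lines:

(0,Rich learning form)

(20,Intelligent learning engine)

The other mapper receives the last two lines:

(49,Learning more convenient)

(75,From the real demand for more close to the enterprise)

The keys and values here are the same as those produced by TextInputFormat; only the way the splits are formed differs.

The following word count job (Mapper, Reducer, and Driver) uses NLineInputFormat with N set to 2.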

package bigdata.b10;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

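/**
 * Mapper class.
 * Input key/value: LongWritable (byte offset of the line), Text (content of the line).
 * Output key/value: Text (a word), IntWritable (the count 1).
 */
public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
//        context.write(new Text(key.toString()),new IntWritable(1));
        // Print the input key to observe what the chosen InputFormat produces
        System.out.println(key.toString());

        // Get one line of data
        String line = value.toString();
        // Split the line on single spaces
        String[] words = line.split(" ");
        // Iterate over the words
        for (String word : words) {
            // Skip empty tokens produced by blank lines or consecutive spaces
            if (word.isEmpty()) {
                continue;
            }
            // Emit each word paired with 1 (a marker to be summed in the reducer)
            context.write(new Text(word), new IntWritable(1));
        }
    }
}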

package bigdata.b10;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * Reducer class.
 * Sums the 1s emitted by the mapper for each word.
 */
public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Running total for this word
        int count = 0;
        // Accumulate the counts
        for (IntWritable intWritable : values) {
            // Convert the IntWritable to an int
            count += intWritable.get();
        }
        // Emit the word with its total count
        context.write(key, new IntWritable(count));
    }
}

package bigdata.b10;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
//import org.apache.hadoop.mapred.lib.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class Driver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        long startTime = System.currentTimeMillis();

        // Hard-coded local input file and output directory for testing;
        // the output directory must not already exist
        args = new String[]{"D:\\test\\test.txt","D:\\test\\inputformat8"};

        // Create the configuration
        Configuration configuration = new Configuration();
//        configuration.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR,",");// separator; it must actually occur in the data
        configuration.set(NLineInputFormat.LINES_PER_MAP,"2");// number of lines per split
        // Create the job
        Job job = Job.getInstance(configuration);
        // Configure the job
        job.setJarByClass(Driver.class);

        // Set the custom Mapper class and the Mapper output types
        job.setMapperClass(Map.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Set the custom Reducer class and the Reducer output types
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Custom partitioner: partition by whether the word length is odd or even
//        job.setPartitionerClass(Partition.class);
//        job.setNumReduceTasks(2);

        // Set a combiner for map-side pre-aggregation
//        job.setCombinerClass(Combiner.class);

        // Combine many small files into fewer splits, and so fewer map tasks (an optimization)
//        job.setInputFormatClass(CombineTextInputFormat.class);
//        CombineTextInputFormat.setMaxInputSplitSize(job,8*1024*1024);// 8 MB
//        CombineTextInputFormat.setMinInputSplitSize(job,2*1024*1024);// 2 MB

        // Use KeyValueTextInputFormat (with the separator set in the configuration above)
//        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // Use NLineInputFormat (with the lines-per-split value set above)
        job.setInputFormatClass(NLineInputFormat.class);

        // Set the input path (a file or a directory)
        FileInputFormat.setInputPaths(job,new Path(args[0]));
        // Set the output path
        FileOutputFormat.setOutputPath(job,new Path(args[1]));

        // Submit the job and wait for it to finish
        boolean success = job.waitForCompletion(true);

        // Print the elapsed time in seconds
        long endTime = System.currentTimeMillis();
        System.out.println((endTime-startTime)/1000);

        // Report failure to the caller through the exit code
        System.exit(success ? 0 : 1);
    }
}
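
To see what the job above should count for the four sample lines without a Hadoop installation, the map and reduce logic can be imitated in plain Java. This is only a sketch under stated assumptions: LocalWordCount and countWords are hypothetical names, and a TreeMap stands in for the shuffle-and-sort phase that groups identical words before the reducer runs.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a single-process imitation of the word count job.
public class LocalWordCount {

    // "Map" step: split each line on spaces; "reduce" step: sum a 1 per occurrence.
    static Map<String, Integer> countWords(String[] lines) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys, like reducer input
        for (String line : lines) {
            for (String word : line.split(" ")) {
                if (word.isEmpty()) {
                    continue; // ignore empty tokens from blank lines or double spaces
                }
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = {
            "Rich learning form",
            "Intelligent learning engine",
            "Learning more convenient",
            "From the real demand for more close to the enterprise"
        };
        // Print word and count separated by a tab, one pair per line
        for (Map.Entry<String, Integer> entry : countWords(lines).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

In this sample, "learning", "more", and "the" each occur twice; "Learning" is counted separately because the comparison is case-sensitive. The choice of InputFormat changes how lines are distributed across map tasks, not the final counts.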