The NLineInputFormat implementation class
With NLineInputFormat, the InputSplit processed by each map task is no longer divided by HDFS block; it is divided by the number of lines N that NLineInputFormat specifies. That is, number of splits = total lines in the input file / N; if this does not divide evenly, number of splits = quotient + 1.
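To make the arithmetic concrete, the rule above is just a ceiling division. A minimal sketch (countSplits is an illustrative helper of ours, not a Hadoop API):
// Illustrative only: countSplits is not part of Hadoop.
public class SplitCountDemo {
    // splits = quotient + 1 when there is a remainder, i.e. ceiling division
    static int countSplits(int totalLines, int n) {
        return (totalLines + n - 1) / n;
    }
    public static void main(String[] args) {
        System.out.println(countSplits(4, 2));   // 4 lines, N = 2 -> 2 splits
        System.out.println(countSplits(11, 3));  // 11 lines, N = 3 -> 4 splits
    }
}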
Here is an example. Suppose a split contains the following 4 text records:
Rich learning form
Intelligent learning engine
Learning more convenient
From the real demand for more close to the enterprise
For example, if N is 2, each input split contains two lines, and two MapTasks are started. One mapper receives the first two records:
0, Rich learning form
19, Intelligent learning engine
The other mapper receives the last two lines:
47, Learning more convenient
72, From the real demand for more close to the enterprise
The keys and values here are the same as those generated by TextInputFormat: the key is the byte offset of the line within the file, and the value is the contents of the line.
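The offsets 0, 19, 47 and 72 can be reproduced by summing each line's byte length plus one newline byte. A small sketch, assuming ASCII text and '\n' line endings:
// Reproduces the keys shown above: each key is the byte offset at which
// the line starts, assuming ASCII text and single-byte '\n' line endings.
public class OffsetDemo {
    public static void main(String[] args) {
        String[] lines = {
                "Rich learning form",
                "Intelligent learning engine",
                "Learning more convenient",
                "From the real demand for more close to the enterprise"
        };
        long offset = 0;
        for (String line : lines) {
            System.out.println(offset + ", " + line);
            offset += line.length() + 1; // +1 for the trailing '\n'
        }
    }
}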
The concrete setup is as follows. In the driver class, set:
// 8. Configure each InputSplit to contain three records
NLineInputFormat.setNumLinesPerSplit(job, 3);
// 9. Use NLineInputFormat to split the input by line count
job.setInputFormatClass(NLineInputFormat.class);
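Under the hood, setNumLinesPerSplit simply stores N in the job configuration; in Hadoop 2.x the property key is NLineInputFormat.LINES_PER_MAP, so the call above should be equivalent to the following (verify the key against your Hadoop version):
// Assumed equivalent of NLineInputFormat.setNumLinesPerSplit(job, 3)
job.getConfiguration().setInt("mapreduce.input.lineinputformat.linespermap", 3);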
An NLineInputFormat usage example
1. Requirements
Count the occurrences of each word, and let the number of lines in each input file determine the number of splits. In this case, every three lines go into one split.
(1) Input data
banzhang ni hao
xihuan hadoop banzhang
banzhang ni hao
xihuan hadoop banzhang
banzhang ni hao
xihuan hadoop banzhang
banzhang ni hao
xihuan hadoop banzhang
banzhang ni hao
xihuan hadoop banzhang banzhang ni hao
xihuan hadoop banzhang
(2) Expected result
The input file has 11 lines and N = 3, so 11 / 3 = 3 with a remainder of 2, giving 3 + 1 = 4 splits:
Number of splits:4
The Mapper class is written as follows:
package com.hadwinling.mapreduce.nline;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/**
* @author :HadwinLing
* @version 1.0
* @description: Mapper for the NLineInputFormat word-count example
* @date 2020/11/12 上午9:58
*/
public class NLineMapper extends Mapper<LongWritable,Text,Text,IntWritable> {
Text k = new Text();
IntWritable v = new IntWritable(1);
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context)
throws IOException, InterruptedException {
// banzhang ni hao
// 1. Get one line
String line = value.toString();
// 2. Split the line into words
String[] words = line.split(" ");
// 3. Emit each word with a count of 1
for (String word : words) {
k.set(word);
context.write(k, v);
}
}
}
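A note on the design: k and v are instance fields created once and reused for every map() call. This is the standard Hadoop idiom; context.write() serializes the current contents immediately, so reusing the same Writable objects is safe and avoids allocating two new objects per word.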
The Reducer class is written as follows:
package com.hadwinling.mapreduce.nline;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/**
* @author :HadwinLing
* @version 1.0
* @description: Reducer for the NLineInputFormat word-count example
* @date 2020/11/12 上午10:02
*/
public class NLineReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
IntWritable v = new IntWritable();
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
// Accumulate the counts emitted for this word
int sum = 0;
for (IntWritable value : values) {
    sum += value.get();
}
v.set(sum);
context.write(key, v);
}
}
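This works because the framework sorts and groups map output by key before calling reduce(): each reduce() invocation receives every count emitted for one word, so summing the values yields that word's total.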
The Driver class is implemented as follows:
package com.hadwinling.mapreduce.nline;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
/**
* @author :HadwinLing
* @version 1.0
* @description: Driver for the NLineInputFormat word-count example
* @date 2020/11/12 上午10:10
*/
public class NLineDriver {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
// 1. Get the Job object
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
// 8. Configure each InputSplit to contain three records
NLineInputFormat.setNumLinesPerSplit(job, 3);
// 9. Use NLineInputFormat to split the input by line count
job.setInputFormatClass(NLineInputFormat.class);
// 2. Set the jar location via the driver class
job.setJarByClass(NLineDriver.class);
// 3. Wire up the Mapper and Reducer
job.setMapperClass(NLineMapper.class);
job.setReducerClass(NLineReducer.class);
// 4. Set the map output key/value types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// 5. Set the final output key/value types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// 6. Set the input and output paths
FileInputFormat.setInputPaths(job,new Path(args[0]));
FileOutputFormat.setOutputPath(job,new Path(args[1]));
// 7. Submit the job and wait for completion
boolean result = job.waitForCompletion(true);
System.exit(result?0:1);
}
}
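One optional tweak, not present in the original code: because the word-count sum is associative and commutative, the Reducer can also be registered as a combiner to pre-aggregate map output and shrink the shuffle:
// Optional (our addition): reuse the Reducer as a combiner.
job.setCombinerClass(NLineReducer.class);
The job is then packaged into a jar and submitted with hadoop jar, passing the input directory as args[0] and the output directory as args[1].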
2. Test
(1) Input data: the same 11-line file shown in the requirements above.
(2) The number of splits reported in the job's console output matches the expectation: Number of splits:4