1. The MapReduce programming model (take, for example, an input file that contains only the three lines shown below)
hive spark hive hbase
hadoop hive spark
sqoop flume scala
These lines are handed to map as (line offset, line content) pairs:
(0, "hive spark hive hbase")
(22, "hadoop hive spark")
(40, "sqoop flume scala")
map output (tracing only the first line as an example): (hive,1), (spark,1), (hive,1), (hbase,1)
In the shuffle stage, values that share the same key are merged into one collection:
(hive,<1,1>), (spark,<1>), (hbase,<1>)
reduce: aggregation
input: (hive,<1,1>), (spark,<1>), (hbase,<1>)
output: (hive,2), (spark,1), (hbase,1)
(A runnable local sketch of this whole map -> shuffle -> reduce flow follows below.)
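As a concrete illustration, here is a minimal sketch in plain Java (no Hadoop dependency) that simulates the map, shuffle, and reduce steps; the class name LocalWordCountSketch and the use of a TreeMap are purely illustrative. Because it processes all three sample lines rather than only the first one, it prints the complete counts for the file, e.g. (hive,3) and (spark,2).

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalWordCountSketch {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "hive spark hive hbase",
                "hadoop hive spark",
                "sqoop flume scala");

        // map + shuffle: each line is split into (word, 1) pairs,
        // and pairs with the same key are grouped into one collection
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }

        // reduce: sum the collection of each key, e.g. (hive, <1, 1, 1>) -> (hive, 3)
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int one : entry.getValue()) {
                sum += one;
            }
            System.out.println("(" + entry.getKey() + "," + sum + ")");
        }
    }
}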
2. How MapReduce processes data
Throughout a MapReduce program, all data flows in (key, value) form.
input: normally no code is needed; you only specify an input path when the MapReduce program runs.
map (core focus): map(key, value, context)
    key: the byte offset of each line -- rarely useful
    value: the content of each line -- the data you actually need to process
shuffle stage:
    key: the key required by the business logic
    value: the values to be aggregated
reduce (core focus): reduce(KEYIN key, Iterable<VALUEIN> values, Context context) (see the skeleton just below)
3. Implementing the MapReduce WordCount in IDEA
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountMapReduce {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // check the arguments
        if (args.length < 2) {
            System.out.println("Usage: wordcount <in> [<in>...] <out>");
            return;
        }
        // 1. load the configuration
        Configuration configuration = new Configuration();
        // 2. create the job
        Job job = Job.getInstance(configuration, "UserWordCountMapReduce");
        job.setJarByClass(WordCountMapReduce.class);
        // a. input
        Path inputPath = new Path(args[0]);
        FileInputFormat.setInputPaths(job, inputPath);
        // b. map
        job.setMapperClass(WordCountMapper.class);
        job.setMapOutputKeyClass(Text.class);          // the map output key is a word (String)
        job.setMapOutputValueClass(IntWritable.class); // the map output value is a count (int)
        // b.b. shuffle (handled by the framework)
        // c. reduce
        job.setReducerClass(WordCountReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // d. output
        Path outputPath = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, outputPath);
        // submit the job and wait for it to finish (true: print progress)
        boolean isSuccess = job.waitForCompletion(true);
        System.exit(isSuccess ? 0 : 1);
    }
    /**
     * class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
     * KEYIN:   the line offset (long)
     * VALUEIN: the line content (String)
     * KEYOUT:
     *     the type of the key output by the map method:
     *     the word being counted (String)
     * VALUEOUT:
     *     the type of the value output by the map method:
     *     the number of occurrences, always 1 here (int)
     */
    private static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private Text mapOutputKey = new Text();
        private static final IntWritable mapOutputValue = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // e.g. the input pair (0, "hive spark hive hbase")
            String valueStr = value.toString();
            String[] items = valueStr.split(" ");
            for (String item : items) {
                mapOutputKey.set(item);
                // use the Context to emit the result of the map method
                // public void write(KEYOUT key, VALUEOUT value)
                context.write(mapOutputKey, mapOutputValue);
            }
        }
    }
    /**
     * public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
     */
    private static class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable outputValue = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // sum all the 1s collected for this word, e.g. (hive, <1, 1>) -> (hive, 2)
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            outputValue.set(sum);
            context.write(key, outputValue);
        }
    }
}
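To actually run the job, package the class into a jar and submit it with the hadoop command. A minimal sketch, assuming the jar is named wordcount.jar and using example HDFS paths (the output directory must not exist before the job starts):

hadoop jar wordcount.jar WordCountMapReduce /user/hadoop/input /user/hadoop/output
hdfs dfs -cat /user/hadoop/output/part-r-00000

With the three sample lines from section 1 as input, the output file written by the single reducer contains one word and its count per line, such as hive 3 and spark 2.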
The code above implements word-frequency counting with a MapReduce program. This is one of my own study notes, and I hope it helps people who are learning this topic. I am still a beginner, so if anything is poorly written, please leave suggestions and I will improve it.