A First Look at MapReduce Programs -------------- WordCount

This post walks through a simple Hadoop MapReduce WordCount program: writing the code, compiling it, packaging the jar, and a problem hit at run time together with its fix.


Program code

package test;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    /*
     * The input key is the byte offset of the line (declared here as Object;
     * TextInputFormat actually supplies LongWritable).
     * Text is the input value type (one line of text).
     * Text-IntWritable is the output key-value pair type.
     */
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());  // split the line supplied by TextInputFormat into whitespace-delimited tokens
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();
    /*
     * Text-IntWritable is the key-value pair type coming in from the map output.
     * The output is word-count key-value pairs, also Text-IntWritable.
     */
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // job configuration
    Job job = Job.getInstance(conf, "word count");  // initialize the Job
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);  
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));  // set the input path
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // set the output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
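
As an aside: the job log further down warns "Implement the Tool interface and execute your application with ToolRunner to remedy this." A minimal sketch of the same driver rewritten that way (the class name WordCountDriver is hypothetical, my own; it reuses TokenizerMapper and IntSumReducer from the class above) might look like:

package test;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class WordCountDriver extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    // getConf() already holds any -D options parsed by ToolRunner
    Job job = Job.getInstance(getConf(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }
  public static void main(String[] args) throws Exception {
    // ToolRunner strips generic options (-D, -files, -libjars, ...) before calling run()
    System.exit(ToolRunner.run(new Configuration(), new WordCountDriver(), args));
  }
}

With this variant, generic options such as -D mapreduce.job.reduces=2 are recognized on the command line instead of being passed straight through in args.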

Create the directories

[root@Vm90 wxl]# ls
output
[root@Vm90 wxl]# mkdir wordcount01
[root@Vm90 wxl]# cd wordcount01
[root@Vm90 wordcount01]# mkdir src
[root@Vm90 wordcount01]# mkdir classes
[root@Vm90 wordcount01]# ls
classes  src
[root@Vm90 wordcount01]# 
[root@Vm90 wordcount01]# cd src/
[root@Vm90 src]# vim WordCount.java
# paste the code above into WordCount.java
# then compile
[root@Vm90 src]# cd ..
[root@Vm90 wordcount01]# ls
classes  src
# compilation needs three jars on the classpath

    hadoop-common-2.6.0.jar
    hadoop-mapreduce-client-core-2.6.0.jar
    hadoop-test-1.2.1.jar
# pick the jars that match your own Hadoop version; the ones below were used in this experiment
    hadoop-common-2.6.0-cdh5.5.0.jar
    hadoop-mapreduce-client-core-2.6.0-cdh5.5.0.jar
    hadoop-test-2.6.0-mr1-cdh5.5.0.jar

[root@Vm90 wordcount01]# javac -Xlint:deprecation -classpath /opt/cloudera/parcels/CDH/jars/hadoop-common-2.6.0-cdh5.5.0.jar:/opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-client-core-2.6.0-cdh5.5.0.jar:/opt/cloudera/parcels/CDH/jars/hadoop-test-2.6.0-mr1-cdh5.5.0.jar -d classes/ src/*.java
------- output (the warning can be ignored; it typically just means the hadoop-annotations jar is not on the compile classpath):
/opt/cloudera/parcels/CDH/jars/hadoop-common-2.6.0-cdh5.5.0.jar(org/apache/hadoop/fs/Path.class): warning: Cannot find annotation method 'value()' in type 'LimitedPrivate': class file for org.apache.hadoop.classification.InterfaceAudience not found
1 warning

# build the jar
[root@Vm90 wordcount01]# jar -cvf wordcount.jar classes/* 
added manifest
adding: classes/test/(in = 0) (out= 0)(stored 0%)
adding: classes/test/WordCount.class(in = 1516) (out= 815)(deflated 46%)
adding: classes/test/WordCount$TokenizerMapper.class(in = 1746) (out= 758)(deflated 56%)
adding: classes/test/WordCount$IntSumReducer.class(in = 1749) (out= 742)(deflated 57%)
[root@Vm90 wordcount01]# ls
classes  src  wordcount.jar
[root@Vm90 wordcount01]#

# upload the test data
[root@Vm90 input]# cat 2.txt 
hello hadoop
bye hadoop
good java
great python
[root@Vm90 wxl]# ls
input  output  wordcount01
[root@Vm90 wxl]# hadoop fs -put input /hbase/
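
Before running the job, it is worth tracing what it should do with this input. Each map() call tokenizes one line and emits a (word, 1) pair per token, 8 pairs in total; because IntSumReducer is also registered as the combiner, the two (hadoop, 1) pairs are merged on the map side, leaving 7 pairs for the reduce phase. These figures reappear in the counters of the successful run below (Map output records=8, Combine output records=7):

map output:     (hello,1) (hadoop,1) (bye,1) (hadoop,1) (good,1) (java,1) (great,1) (python,1)
after combiner: (bye,1) (good,1) (great,1) (hadoop,2) (hello,1) (java,1) (python,1)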

Running the program at this point fails with an error:

[root@Vm90 wxl]# hadoop jar wordcount01/wordcount.jar test.WordCount /hbase/input/ /hbase/output11
17/06/22 14:29:48 INFO client.RMProxy: Connecting to ResourceManager at Vm90/172.16.2.90:8032
17/06/22 14:29:49 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/06/22 14:29:49 WARN mapreduce.JobResourceUploader: No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
17/06/22 14:29:49 INFO input.FileInputFormat: Total input paths to process : 1
17/06/22 14:29:49 INFO mapreduce.JobSubmitter: number of splits:1
17/06/22 14:29:50 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1497340925516_0008
17/06/22 14:29:50 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.
17/06/22 14:29:50 INFO impl.YarnClientImpl: Submitted application application_1497340925516_0008
17/06/22 14:29:50 INFO mapreduce.Job: The url to track the job: http://Vm90:8088/proxy/application_1497340925516_0008/
17/06/22 14:29:50 INFO mapreduce.Job: Running job: job_1497340925516_0008
17/06/22 14:29:58 INFO mapreduce.Job: Job job_1497340925516_0008 running in uber mode : false
17/06/22 14:29:58 INFO mapreduce.Job:  map 0% reduce 0%
17/06/22 14:30:03 INFO mapreduce.Job: Task Id : attempt_1497340925516_0008_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class test.WordCount$TokenizerMapper not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2199)
    at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:196)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:745)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassNotFoundException: Class test.WordCount$TokenizerMapper not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2105)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2197)
    ... 8 more

The cause:

A path problem inside the jar: the error reports test.WordCount$TokenizerMapper not found, but the jar was packed with entries under classes/test/ instead of test/, so the entry paths no longer match the package name (jar -tf wordcount.jar shows the entries). Repackage from inside the classes directory:

[root@Vm90 wordcount01]# cd classes/
[root@Vm90 classes]# ls
test
[root@Vm90 classes]# jar -cvf ../wordcount.jar test
added manifest
adding: test/(in = 0) (out= 0)(stored 0%)
adding: test/WordCount.class(in = 1516) (out= 815)(deflated 46%)
adding: test/WordCount$TokenizerMapper.class(in = 1746) (out= 758)(deflated 56%)
adding: test/WordCount$IntSumReducer.class(in = 1749) (out= 742)(deflated 57%)
[root@Vm90 classes]# ls
test
[root@Vm90 classes]# cd ..
[root@Vm90 wordcount01]# ls
classes  src  wordcount.jar
[root@Vm90 wordcount01]# 
# run it again
[root@Vm90 wxl]# hadoop jar wordcount01/wordcount.jar test.WordCount /hbase/input/ /hbase/output12
17/06/22 14:38:15 INFO client.RMProxy: Connecting to ResourceManager at Vm90/172.16.2.90:8032
17/06/22 14:38:16 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/06/22 14:38:16 INFO input.FileInputFormat: Total input paths to process : 1
17/06/22 14:38:16 INFO mapreduce.JobSubmitter: number of splits:1
17/06/22 14:38:16 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1497340925516_0009
17/06/22 14:38:17 INFO impl.YarnClientImpl: Submitted application application_1497340925516_0009
17/06/22 14:38:17 INFO mapreduce.Job: The url to track the job: http://Vm90:8088/proxy/application_1497340925516_0009/
17/06/22 14:38:17 INFO mapreduce.Job: Running job: job_1497340925516_0009
17/06/22 14:38:24 INFO mapreduce.Job: Job job_1497340925516_0009 running in uber mode : false
17/06/22 14:38:24 INFO mapreduce.Job:  map 0% reduce 0%
17/06/22 14:38:31 INFO mapreduce.Job:  map 100% reduce 0%
17/06/22 14:38:38 INFO mapreduce.Job:  map 100% reduce 50%
17/06/22 14:38:39 INFO mapreduce.Job:  map 100% reduce 100%
17/06/22 14:38:40 INFO mapreduce.Job: Job job_1497340925516_0009 completed successfully
17/06/22 14:38:40 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=120
        FILE: Number of bytes written=344705
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=146
        HDFS: Number of bytes written=54
        HDFS: Number of read operations=9
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
    Job Counters 
        Launched map tasks=1
        Launched reduce tasks=2
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=4784
        Total time spent by all reduces in occupied slots (ms)=11119
        Total time spent by all map tasks (ms)=4784
        Total time spent by all reduce tasks (ms)=11119
        Total vcore-seconds taken by all map tasks=4784
        Total vcore-seconds taken by all reduce tasks=11119
        Total megabyte-seconds taken by all map tasks=4898816
        Total megabyte-seconds taken by all reduce tasks=11385856
    Map-Reduce Framework
        Map input records=4
        Map output records=8
        Map output bytes=79
        Map output materialized bytes=112
        Input split bytes=99
        Combine input records=8
        Combine output records=7
        Reduce input groups=7
        Reduce shuffle bytes=112
        Reduce input records=7
        Reduce output records=7
        Spilled Records=14
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=134
        CPU time spent (ms)=3220
        Physical memory (bytes) snapshot=874729472
        Virtual memory (bytes) snapshot=4710256640
        Total committed heap usage (bytes)=860356608
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=47
    File Output Format Counters 
        Bytes Written=54
[root@Vm90 wxl]#
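
The counters line up with the test data: 4 input lines (Map input records=4), 8 tokens (Map output records=8), 7 distinct words (Reduce output records=7), and two reduce tasks (Launched reduce tasks=2), which is why reduce progress jumps 0% -> 50% -> 100%. Assuming the input file above, the merged contents of the two part-r-* output files (shown here sorted, tab-separated) should be:

bye	1
good	1
great	1
hadoop	2
hello	1
java	1
python	1

That is 54 bytes in total, matching Bytes Written=54 under the File Output Format Counters.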


Further reading:
The First MapReduce Program: WordCount
MapReduce Programs: the map and reduce process
