Hadoop 1.x: Writing Your Own WordCount Program

This article walks through a MapReduce WordCount program and how to improve it, providing the revised client code and the steps to run it. It also walks through the run log to help you understand how the job executes. The improvement uses GenericOptionsParser to simplify command-line argument handling, making the program easier to use and maintain.


1. Steps

   The program is written in the following steps:

   (1) Write your own Mapper class

   (2) Write your own Reducer class

   (3) Write the client (driver) that configures and submits the job

2. Example

package org.dragon.hadoop.mr;

import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * MapReduce example: the WordCount program
 * @author Administrator
 *
 */
public class MyWordCount {

	//Mapper section
	/**
	 * The map class of the WordCount program
	 * KEYIN    input key type   --- starting byte offset of the line
	 * VALUEIN  input value type
	 * KEYOUT   output key type
	 * VALUEOUT output value type
	 */
	static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
		private Text word = new Text();
		private final static IntWritable one = new IntWritable(1);
		//The map operation is invoked once per input line
		protected void map(
				LongWritable key,
				Text value,
				Context context)
				throws java.io.IOException, InterruptedException {
			//Get the text of the current line
			String lineContent = value.toString();
			//Split the line; the default delimiters are " \t\n\r\f"
			StringTokenizer stringTokenizer = new StringTokenizer(lineContent);
			//Iterate over the tokens
			while(stringTokenizer.hasMoreTokens()){
				//Get each word
				String wordValue = stringTokenizer.nextToken();
				//Set the map output key
				word.set(wordValue);
				//Write the (word, 1) pair to the context
				context.write(word, one);
			}
		}
	}
	
	//Reducer section
	/**
	 * The reduce class of the WordCount program
	 * The map output types are the reduce input types
	 *
	 */
	static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
		private IntWritable result = new IntWritable();
		protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws java.io.IOException ,InterruptedException {
			//Accumulator used to count the total occurrences of the key
			int sum = 0;
			//Iterate over the values
			for(IntWritable value :values){
				//Add to the running sum
				sum += value.get();
			}
			//Set the total count for this key
			result.set(sum);
			context.write(key, result);
		}
	}
	
	//client (driver) section
	public static void main(String[] args) throws Exception {
		//Load the Hadoop configuration
		Configuration conf = new Configuration();
		
		//Create the Job with the configuration and a job name
		Job job = new Job(conf,"myjob");
		
		//Set the class whose jar will be used to run the job
		job.setJarByClass(MyWordCount.class);
		
		//Set the Mapper and Reducer classes
		job.setMapperClass(MyMapper.class);
		job.setReducerClass(MyReducer.class);
		
		//Set the input and output paths, passed in at run time
		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		
		//Set the output key and value types
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		
		//Submit the job, wait for completion, and print progress on the client
		boolean isSuccess = job.waitForCompletion(true);
		
		//Exit with the job status
		System.exit(isSuccess?0:1);
	}
}
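
The Mapper and Reducer can be checked without a cluster. Below is a minimal, hypothetical unit-test sketch using MRUnit (org.apache.hadoop.mrunit) and JUnit; neither library is used in the original article, so both would need to be added to the project, and the test class is assumed to live in the same package as MyWordCount so that the package-private MyMapper and MyReducer are visible:

package org.dragon.hadoop.mr;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.junit.Test;

/**
 * Hypothetical MRUnit test for MyWordCount (sketch, not part of the original article).
 */
public class MyWordCountTest {

	@Test
	public void testWordCount() throws IOException {
		MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable> driver =
				MapReduceDriver.newMapReduceDriver(new MyWordCount.MyMapper(), new MyWordCount.MyReducer());
		//Two input lines; the key is the byte offset of the line in the file
		driver.withInput(new LongWritable(0), new Text("hello hadoop"));
		driver.withInput(new LongWritable(13), new Text("hello world"));
		//Expected (word, count) pairs, sorted by key
		driver.withOutput(new Text("hadoop"), new IntWritable(1));
		driver.withOutput(new Text("hello"), new IntWritable(2));
		driver.withOutput(new Text("world"), new IntWritable(1));
		driver.runTest();
	}
}

Here the two lines contain four distinct words, and "hello" appears twice, so the expected reduce output is three pairs with hello counted as 2.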

3. Running

    Once the program above is written, proceed as follows:

   (1) Package the program into a jar file

   (2) Upload the jar to the Hadoop cluster

   (3) Grant the jar execute permission, e.g. chmod -R 755 mywc.jar

   (4) Run it: hadoop jar mywc.jar /opt/data/test/input/ /opt/data/test/output/

   (5) Check the results, e.g. by viewing the part-r-* files under the output directory with hadoop fs -cat

4. Run Log

    The run log is shown below:

16/03/21 01:14:39 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
16/03/21 01:14:40 INFO input.FileInputFormat: Total input paths to process : 2
16/03/21 01:14:40 INFO util.NativeCodeLoader: Loaded the native-hadoop library
16/03/21 01:14:40 WARN snappy.LoadSnappy: Snappy native library not loaded
16/03/21 01:14:41 INFO mapred.JobClient: Running job: job_201603210057_0001
16/03/21 01:14:42 INFO mapred.JobClient:  map 0% reduce 0%
16/03/21 01:18:54 INFO mapred.JobClient:  map 100% reduce 0%
16/03/21 01:19:03 INFO mapred.JobClient:  map 100% reduce 16%
16/03/21 01:19:05 INFO mapred.JobClient:  map 100% reduce 100%
16/03/21 01:19:06 INFO mapred.JobClient: Job complete: job_201603210057_0001
16/03/21 01:19:06 INFO mapred.JobClient: Counters: 29
16/03/21 01:19:06 INFO mapred.JobClient:   Job Counters 
16/03/21 01:19:06 INFO mapred.JobClient:     Launched reduce tasks=1
16/03/21 01:19:06 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=496875
16/03/21 01:19:06 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
16/03/21 01:19:06 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
16/03/21 01:19:06 INFO mapred.JobClient:     Launched map tasks=2
16/03/21 01:19:06 INFO mapred.JobClient:     Data-local map tasks=2
16/03/21 01:19:06 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10321
16/03/21 01:19:06 INFO mapred.JobClient:   File Output Format Counters 
16/03/21 01:19:06 INFO mapred.JobClient:     Bytes Written=163
16/03/21 01:19:06 INFO mapred.JobClient:   FileSystemCounters
16/03/21 01:19:06 INFO mapred.JobClient:     FILE_BYTES_READ=535
16/03/21 01:19:06 INFO mapred.JobClient:     HDFS_BYTES_READ=481
16/03/21 01:19:06 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=161377
16/03/21 01:19:06 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=163
16/03/21 01:19:06 INFO mapred.JobClient:   File Input Format Counters 
16/03/21 01:19:06 INFO mapred.JobClient:     Bytes Read=231
16/03/21 01:19:06 INFO mapred.JobClient:   Map-Reduce Framework
16/03/21 01:19:06 INFO mapred.JobClient:     Map output materialized bytes=541
16/03/21 01:19:06 INFO mapred.JobClient:     Map input records=13
16/03/21 01:19:06 INFO mapred.JobClient:     Reduce shuffle bytes=541
16/03/21 01:19:06 INFO mapred.JobClient:     Spilled Records=100
16/03/21 01:19:06 INFO mapred.JobClient:     Map output bytes=429
16/03/21 01:19:06 INFO mapred.JobClient:     CPU time spent (ms)=328300
16/03/21 01:19:06 INFO mapred.JobClient:     Total committed heap usage (bytes)=291512320
16/03/21 01:19:06 INFO mapred.JobClient:     Combine input records=0
16/03/21 01:19:06 INFO mapred.JobClient:     SPLIT_RAW_BYTES=250
16/03/21 01:19:06 INFO mapred.JobClient:     Reduce input records=50
16/03/21 01:19:06 INFO mapred.JobClient:     Reduce input groups=25
16/03/21 01:19:06 INFO mapred.JobClient:     Combine output records=0
16/03/21 01:19:06 INFO mapred.JobClient:     Physical memory (bytes) snapshot=429342720
16/03/21 01:19:06 INFO mapred.JobClient:     Reduce output records=25
16/03/21 01:19:06 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2143195136
16/03/21 01:19:06 INFO mapred.JobClient:     Map output records=50
The first line of the log is a warning: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

The GenericOptionsParser class interprets the common Hadoop command-line options and, as needed, sets the corresponding values on the Configuration object. You normally do not use GenericOptionsParser directly; instead you implement the Tool interface and run the application through ToolRunner, which calls GenericOptionsParser internally.
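
For reference, the Tool/ToolRunner variant could look roughly like the sketch below. The class MyWordCountTool is hypothetical (it is not part of the original program) and is assumed to sit in the same package as MyWordCount so it can reuse MyMapper and MyReducer:

package org.dragon.hadoop.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyWordCountTool extends Configured implements Tool {

	public int run(String[] args) throws Exception {
		//getConf() already reflects the generic options parsed by ToolRunner
		Job job = new Job(getConf(), "myjob");
		job.setJarByClass(MyWordCountTool.class);
		job.setMapperClass(MyWordCount.MyMapper.class);
		job.setReducerClass(MyWordCount.MyReducer.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		//args here are only the remaining (non-generic) arguments
		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		return job.waitForCompletion(true) ? 0 : 1;
	}

	public static void main(String[] args) throws Exception {
		//ToolRunner applies GenericOptionsParser and passes the remaining args to run()
		System.exit(ToolRunner.run(new Configuration(), new MyWordCountTool(), args));
	}
}

Run this way, generic options such as -D property=value or -conf placed before the input and output paths are applied to the Configuration before run() receives the remaining arguments.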

The original program is now adjusted along these lines; only the client (driver) changes, and it additionally needs an import of org.apache.hadoop.util.GenericOptionsParser:

//client (driver) section
	public static void main(String[] args) throws Exception {
		//Load the Hadoop configuration
		Configuration conf = new Configuration();
		
		/*********************************************optimization start*******************************/
		//Parse the generic Hadoop options; the remaining arguments are the input and output paths
		String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
		if(otherArgs.length != 2){
			System.err.println("Usage: wordcount <in> <out>");
			System.exit(2);
		}
		/*********************************************optimization end*******************************/
		//Create the Job with the configuration and a job name
		Job job = new Job(conf,"myjob");
		
		//Set the class whose jar will be used to run the job
		job.setJarByClass(MyWordCount.class);
		
		//Set the Mapper and Reducer classes
		job.setMapperClass(MyMapper.class);
		job.setReducerClass(MyReducer.class);
		
		//Set the input and output paths, passed in at run time
	//	FileInputFormat.addInputPath(job, new Path(args[0]));
	//	FileOutputFormat.setOutputPath(job, new Path(args[1]));
		
		//Optimized calls using the parsed remaining arguments
		/*********************************************optimized calls start*******************************/
		FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
		FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
		/*********************************************optimized calls end*******************************/
		
		//Set the output key and value types
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		
		//Submit the job, wait for completion, and print progress on the client
		boolean isSuccess = job.waitForCompletion(true);
		
		//Exit with the job status
		System.exit(isSuccess?0:1);
	}

 Repackage and run it following the same steps as above. With GenericOptionsParser in place, generic Hadoop options placed before the path arguments are consumed automatically, and only the remaining two arguments are treated as the input and output paths.
