Hadoop 1.x: Writing Your Own WordCount Program

This article walks through a MapReduce WordCount program and how to improve it, providing the revised client code and the steps to run it. It also walks through the run log to help you understand how the job executes. The improvement uses GenericOptionsParser to simplify command-line argument handling, making the program easier to use and maintain.


1. Steps

   The program is written in the following steps:

   (1) Write your own Mapper class

   (2) Write your own Reducer class

   (3) Write the client (driver) that configures and submits the job

2. Example

package org.dragon.hadoop.mr;

import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * MapReduce example: the WordCount program
 * @author Administrator
 *
 */
public class MyWordCount {

	//Mapper section
	/**
	 * The map class of the WordCount program
	 * KEYIN    input key type   --- starting byte offset of the line
	 * VALUEIN  input value type
	 * KEYOUT   output key type
	 * VALUEOUT output value type
	 */
	static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
		private Text word = new Text();
		private final static IntWritable one = new IntWritable(1);
		//The map operation is invoked once per input line
		protected void map(
				LongWritable key,
				Text value,
				Context context)
				throws java.io.IOException, InterruptedException {
			//Get the text of the current line
			String lineContent = value.toString();
			//Split the line; the default delimiters are " \t\n\r\f"
			StringTokenizer stringTokenizer = new StringTokenizer(lineContent);
			//Iterate over the tokens
			while(stringTokenizer.hasMoreTokens()){
				//Get each word
				String wordValue = stringTokenizer.nextToken();
				//Set the map output key
				word.set(wordValue);
				//Write the (word, 1) pair to the context
				context.write(word, one);
			}
		}
	}
	
	//Reducer section
	/**
	 * The reduce class of the WordCount program
	 * The map output types are the reduce input types
	 *
	 */
	static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
		private IntWritable result = new IntWritable();
		protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws java.io.IOException ,InterruptedException {
			//Accumulator used to count the total occurrences of the key
			int sum = 0;
			//Iterate over the values
			for(IntWritable value :values){
				//Add to the running sum
				sum += value.get();
			}
			//Set the total count for this key
			result.set(sum);
			context.write(key, result);
		}
	}
	
	//client (driver) section
	public static void main(String[] args) throws Exception {
		//Load the Hadoop configuration
		Configuration conf = new Configuration();
		
		//Create the Job with the configuration and a job name
		Job job = new Job(conf,"myjob");
		
		//Set the class whose jar will be used to run the job
		job.setJarByClass(MyWordCount.class);
		
		//Set the Mapper and Reducer classes
		job.setMapperClass(MyMapper.class);
		job.setReducerClass(MyReducer.class);
		
		//Set the input and output paths, passed in at run time
		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		
		//Set the output key and value types
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		
		//Submit the job, wait for completion, and print progress on the client
		boolean isSuccess = job.waitForCompletion(true);
		
		//Exit with the job status
		System.exit(isSuccess?0:1);
	}
}
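
The Mapper and Reducer can be checked without a cluster. Below is a minimal, hypothetical unit-test sketch using MRUnit (org.apache.hadoop.mrunit) and JUnit; neither library is used in the original article, so both would need to be added to the project, and the test class is assumed to live in the same package as MyWordCount so that the package-private MyMapper and MyReducer are visible:

package org.dragon.hadoop.mr;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.junit.Test;

/**
 * Hypothetical MRUnit test for MyWordCount (sketch, not part of the original article).
 */
public class MyWordCountTest {

	@Test
	public void testWordCount() throws IOException {
		MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable> driver =
				MapReduceDriver.newMapReduceDriver(new MyWordCount.MyMapper(), new MyWordCount.MyReducer());
		//Two input lines; the key is the byte offset of the line in the file
		driver.withInput(new LongWritable(0), new Text("hello hadoop"));
		driver.withInput(new LongWritable(13), new Text("hello world"));
		//Expected (word, count) pairs, sorted by key
		driver.withOutput(new Text("hadoop"), new IntWritable(1));
		driver.withOutput(new Text("hello"), new IntWritable(2));
		driver.withOutput(new Text("world"), new IntWritable(1));
		driver.runTest();
	}
}

Here the two lines contain four distinct words, and "hello" appears twice, so the expected reduce output is three pairs with hello counted as 2.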

3. Running

    Once the program above is written, proceed as follows:

   (1) Package the program into a jar file

   (2) Upload the jar to the Hadoop cluster

   (3) Grant the jar execute permission, e.g. chmod -R 755 mywc.jar

   (4) Run it: hadoop jar mywc.jar /opt/data/test/input/ /opt/data/test/output/

   (5) Check the results, e.g. by viewing the part-r-* files under the output directory with hadoop fs -cat

4. Run Log

    The run log is shown below:

16/03/21 01:14:39 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
16/03/21 01:14:40 INFO input.FileInputFormat: Total input paths to process : 2
16/03/21 01:14:40 INFO util.NativeCodeLoader: Loaded the native-hadoop library
16/03/21 01:14:40 WARN snappy.LoadSnappy: Snappy native library not loaded
16/03/21 01:14:41 INFO mapred.JobClient: Running job: job_201603210057_0001
16/03/21 01:14:42 INFO mapred.JobClient:  map 0% reduce 0%
16/03/21 01:18:54 INFO mapred.JobClient:  map 100% reduce 0%
16/03/21 01:19:03 INFO mapred.JobClient:  map 100% reduce 16%
16/03/21 01:19:05 INFO mapred.JobClient:  map 100% reduce 100%
16/03/21 01:19:06 INFO mapred.JobClient: Job complete: job_201603210057_0001
16/03/21 01:19:06 INFO mapred.JobClient: Counters: 29
16/03/21 01:19:06 INFO mapred.JobClient:   Job Counters 
16/03/21 01:19:06 INFO mapred.JobClient:     Launched reduce tasks=1
16/03/21 01:19:06 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=496875
16/03/21 01:19:06 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
16/03/21 01:19:06 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
16/03/21 01:19:06 INFO mapred.JobClient:     Launched map tasks=2
16/03/21 01:19:06 INFO mapred.JobClient:     Data-local map tasks=2
16/03/21 01:19:06 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10321
16/03/21 01:19:06 INFO mapred.JobClient:   File Output Format Counters 
16/03/21 01:19:06 INFO mapred.JobClient:     Bytes Written=163
16/03/21 01:19:06 INFO mapred.JobClient:   FileSystemCounters
16/03/21 01:19:06 INFO mapred.JobClient:     FILE_BYTES_READ=535
16/03/21 01:19:06 INFO mapred.JobClient:     HDFS_BYTES_READ=481
16/03/21 01:19:06 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=161377
16/03/21 01:19:06 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=163
16/03/21 01:19:06 INFO mapred.JobClient:   File Input Format Counters 
16/03/21 01:19:06 INFO mapred.JobClient:     Bytes Read=231
16/03/21 01:19:06 INFO mapred.JobClient:   Map-Reduce Framework
16/03/21 01:19:06 INFO mapred.JobClient:     Map output materialized bytes=541
16/03/21 01:19:06 INFO mapred.JobClient:     Map input records=13
16/03/21 01:19:06 INFO mapred.JobClient:     Reduce shuffle bytes=541
16/03/21 01:19:06 INFO mapred.JobClient:     Spilled Records=100
16/03/21 01:19:06 INFO mapred.JobClient:     Map output bytes=429
16/03/21 01:19:06 INFO mapred.JobClient:     CPU time spent (ms)=328300
16/03/21 01:19:06 INFO mapred.JobClient:     Total committed heap usage (bytes)=291512320
16/03/21 01:19:06 INFO mapred.JobClient:     Combine input records=0
16/03/21 01:19:06 INFO mapred.JobClient:     SPLIT_RAW_BYTES=250
16/03/21 01:19:06 INFO mapred.JobClient:     Reduce input records=50
16/03/21 01:19:06 INFO mapred.JobClient:     Reduce input groups=25
16/03/21 01:19:06 INFO mapred.JobClient:     Combine output records=0
16/03/21 01:19:06 INFO mapred.JobClient:     Physical memory (bytes) snapshot=429342720
16/03/21 01:19:06 INFO mapred.JobClient:     Reduce output records=25
16/03/21 01:19:06 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2143195136
16/03/21 01:19:06 INFO mapred.JobClient:     Map output records=50
The first line of the log is a warning: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

The GenericOptionsParser class interprets the common Hadoop command-line options and, as needed, sets the corresponding values on the Configuration object. You normally do not use GenericOptionsParser directly; instead you implement the Tool interface and run the application through ToolRunner, which calls GenericOptionsParser internally.
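
For reference, the Tool/ToolRunner variant could look roughly like the sketch below. The class MyWordCountTool is hypothetical (it is not part of the original program) and is assumed to sit in the same package as MyWordCount so it can reuse MyMapper and MyReducer:

package org.dragon.hadoop.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyWordCountTool extends Configured implements Tool {

	public int run(String[] args) throws Exception {
		//getConf() already reflects the generic options parsed by ToolRunner
		Job job = new Job(getConf(), "myjob");
		job.setJarByClass(MyWordCountTool.class);
		job.setMapperClass(MyWordCount.MyMapper.class);
		job.setReducerClass(MyWordCount.MyReducer.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		//args here are only the remaining (non-generic) arguments
		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		return job.waitForCompletion(true) ? 0 : 1;
	}

	public static void main(String[] args) throws Exception {
		//ToolRunner applies GenericOptionsParser and passes the remaining args to run()
		System.exit(ToolRunner.run(new Configuration(), new MyWordCountTool(), args));
	}
}

Run this way, generic options such as -D property=value or -conf placed before the input and output paths are applied to the Configuration before run() receives the remaining arguments.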

The original program is now adjusted along these lines; only the client (driver) changes, and it additionally needs an import of org.apache.hadoop.util.GenericOptionsParser:

//client (driver) section
	public static void main(String[] args) throws Exception {
		//Load the Hadoop configuration
		Configuration conf = new Configuration();
		
		/*********************************************optimization start*******************************/
		//Parse the generic Hadoop options; the remaining arguments are the input and output paths
		String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
		if(otherArgs.length != 2){
			System.err.println("Usage: wordcount <in> <out>");
			System.exit(2);
		}
		/*********************************************optimization end*******************************/
		//Create the Job with the configuration and a job name
		Job job = new Job(conf,"myjob");
		
		//Set the class whose jar will be used to run the job
		job.setJarByClass(MyWordCount.class);
		
		//Set the Mapper and Reducer classes
		job.setMapperClass(MyMapper.class);
		job.setReducerClass(MyReducer.class);
		
		//Set the input and output paths, passed in at run time
	//	FileInputFormat.addInputPath(job, new Path(args[0]));
	//	FileOutputFormat.setOutputPath(job, new Path(args[1]));
		
		//Optimized calls using the parsed remaining arguments
		/*********************************************optimized calls start*******************************/
		FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
		FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
		/*********************************************optimized calls end*******************************/
		
		//Set the output key and value types
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		
		//Submit the job, wait for completion, and print progress on the client
		boolean isSuccess = job.waitForCompletion(true);
		
		//Exit with the job status
		System.exit(isSuccess?0:1);
	}

 Repackage and run it following the same steps as above. With GenericOptionsParser in place, generic Hadoop options placed before the path arguments are consumed automatically, and only the remaining two arguments are treated as the input and output paths.
