1. Steps
The program is written in three steps:
(1) Write your own Mapper class
(2) Write your own Reducer class
(3) Write the client (driver) that submits the job
2. Example
package org.dragon.hadoop.mr;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
/**
 * MapReduce example: the WordCount program.
 * @author Administrator
 */
public class MyWordCount {

    // ---------------- Mapper ----------------
    /**
     * The map class of the WordCount program.
     * KEYIN    input key type    -- the byte offset at which the current line starts
     * VALUEIN  input value type  -- the text of the current line
     * KEYOUT   output key type   -- a single word
     * VALUEOUT output value type -- the count 1 for each occurrence
     */
    static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private Text word = new Text();
        private final static IntWritable one = new IntWritable(1);

        // map() is called once per input line
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            // Get the content of the current line
            String lineContent = value.toString();
            // Split the line; the default delimiters are " \t\n\r\f"
            StringTokenizer stringTokenizer = new StringTokenizer(lineContent);
            // Iterate over the tokens
            while (stringTokenizer.hasMoreTokens()) {
                // Get the next word
                String wordValue = stringTokenizer.nextToken();
                // Use it as the map output key
                word.set(wordValue);
                // Write the (word, 1) pair to the context
                context.write(word, one);
            }
        }
    }

    // ---------------- Reducer ----------------
    /**
     * The reduce class of the WordCount program.
     * The map output types are the reduce input types.
     */
    static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws java.io.IOException, InterruptedException {
            // Accumulator for the total number of occurrences of this key
            int sum = 0;
            // Iterate over all counts emitted for this key
            for (IntWritable value : values) {
                sum += value.get();
            }
            // Emit the total count for this key
            result.set(sum);
            context.write(key, result);
        }
    }

    // ---------------- Client ----------------
    public static void main(String[] args) throws Exception {
        // Load the HDFS/Hadoop configuration
        Configuration conf = new Configuration();
        // Create the Job with the configuration and a job name
        Job job = new Job(conf, "myjob");
        // Set the class whose jar is used to run the job
        job.setJarByClass(MyWordCount.class);
        // Set the Mapper and Reducer classes
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        // Set the input and output directories, passed in on the command line
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Set the types of the output key and value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Submit the job, wait for completion, and print progress on the client
        boolean isSuccess = job.waitForCompletion(true);
        // Exit with the job status
        System.exit(isSuccess ? 0 : 1);
    }
}
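As a quick illustration of the data flow (hypothetical sample data, not from the source): suppose the input directory contains a file with the two lines

hadoop mapreduce
hadoop hdfs

The mappers emit (hadoop,1), (mapreduce,1), (hadoop,1), (hdfs,1); the reducer sums the counts per word, so the output file would contain:

hadoop	2
hdfs	1
mapreduce	1

(TextOutputFormat separates key and value with a tab, and keys arrive at the reducer in sorted order.)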
3. Running
Once the program above is written, proceed as follows:
(1) Package it into a jar
(2) Upload the jar to the Hadoop cluster
(3) Give the jar the required permissions, e.g. chmod -R 755 mywc.jar
(4) Run it: hadoop jar mywc.jar /opt/data/test/input/ /opt/data/test/output/ (if the jar's manifest does not declare a main class, add the fully qualified class name after the jar: hadoop jar mywc.jar org.dragon.hadoop.mr.MyWordCount <in> <out>)
(5) View the results, e.g. with the commands shown after this list
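To inspect the output, the standard HDFS shell commands can be used. A minimal sketch, assuming the output directory from step (4); with a single reduce task, the new API writes the result to part-r-00000:

hadoop fs -ls /opt/data/test/output/
hadoop fs -cat /opt/data/test/output/part-r-00000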
4. Run log
The job produced the following log:
16/03/21 01:14:39 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
16/03/21 01:14:40 INFO input.FileInputFormat: Total input paths to process : 2
16/03/21 01:14:40 INFO util.NativeCodeLoader: Loaded the native-hadoop library
16/03/21 01:14:40 WARN snappy.LoadSnappy: Snappy native library not loaded
16/03/21 01:14:41 INFO mapred.JobClient: Running job: job_201603210057_0001
16/03/21 01:14:42 INFO mapred.JobClient: map 0% reduce 0%
16/03/21 01:18:54 INFO mapred.JobClient: map 100% reduce 0%
16/03/21 01:19:03 INFO mapred.JobClient: map 100% reduce 16%
16/03/21 01:19:05 INFO mapred.JobClient: map 100% reduce 100%
16/03/21 01:19:06 INFO mapred.JobClient: Job complete: job_201603210057_0001
16/03/21 01:19:06 INFO mapred.JobClient: Counters: 29
16/03/21 01:19:06 INFO mapred.JobClient: Job Counters
16/03/21 01:19:06 INFO mapred.JobClient: Launched reduce tasks=1
16/03/21 01:19:06 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=496875
16/03/21 01:19:06 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
16/03/21 01:19:06 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
16/03/21 01:19:06 INFO mapred.JobClient: Launched map tasks=2
16/03/21 01:19:06 INFO mapred.JobClient: Data-local map tasks=2
16/03/21 01:19:06 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=10321
16/03/21 01:19:06 INFO mapred.JobClient: File Output Format Counters
16/03/21 01:19:06 INFO mapred.JobClient: Bytes Written=163
16/03/21 01:19:06 INFO mapred.JobClient: FileSystemCounters
16/03/21 01:19:06 INFO mapred.JobClient: FILE_BYTES_READ=535
16/03/21 01:19:06 INFO mapred.JobClient: HDFS_BYTES_READ=481
16/03/21 01:19:06 INFO mapred.JobClient: FILE_BYTES_WRITTEN=161377
16/03/21 01:19:06 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=163
16/03/21 01:19:06 INFO mapred.JobClient: File Input Format Counters
16/03/21 01:19:06 INFO mapred.JobClient: Bytes Read=231
16/03/21 01:19:06 INFO mapred.JobClient: Map-Reduce Framework
16/03/21 01:19:06 INFO mapred.JobClient: Map output materialized bytes=541
16/03/21 01:19:06 INFO mapred.JobClient: Map input records=13
16/03/21 01:19:06 INFO mapred.JobClient: Reduce shuffle bytes=541
16/03/21 01:19:06 INFO mapred.JobClient: Spilled Records=100
16/03/21 01:19:06 INFO mapred.JobClient: Map output bytes=429
16/03/21 01:19:06 INFO mapred.JobClient: CPU time spent (ms)=328300
16/03/21 01:19:06 INFO mapred.JobClient: Total committed heap usage (bytes)=291512320
16/03/21 01:19:06 INFO mapred.JobClient: Combine input records=0
16/03/21 01:19:06 INFO mapred.JobClient: SPLIT_RAW_BYTES=250
16/03/21 01:19:06 INFO mapred.JobClient: Reduce input records=50
16/03/21 01:19:06 INFO mapred.JobClient: Reduce input groups=25
16/03/21 01:19:06 INFO mapred.JobClient: Combine output records=0
16/03/21 01:19:06 INFO mapred.JobClient: Physical memory (bytes) snapshot=429342720
16/03/21 01:19:06 INFO mapred.JobClient: Reduce output records=25
16/03/21 01:19:06 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2143195136
16/03/21 01:19:06 INFO mapred.JobClient: Map output records=50
The first line of the log is a warning: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
The GenericOptionsParser class parses the standard Hadoop command-line options and, where appropriate, sets the corresponding values on the Configuration object. It is usually not used directly; instead, the application implements the Tool interface and is run through ToolRunner, which invokes GenericOptionsParser internally:
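For reference, here is a minimal sketch of that Tool/ToolRunner pattern, assuming a hypothetical driver class MyWordCountTool that reuses the MyMapper and MyReducer classes above; the author's simpler fix, which calls GenericOptionsParser directly, follows below.

package org.dragon.hadoop.mr;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical sketch: a Tool-based driver for the WordCount job above
public class MyWordCountTool extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already carries any values set by the generic options
        Job job = new Job(getConf(), "myjob");
        job.setJarByClass(MyWordCountTool.class);
        job.setMapperClass(MyWordCount.MyMapper.class);
        job.setReducerClass(MyWordCount.MyReducer.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses the generic options (-D, -conf, -fs, -jt, ...)
        // and passes only the remaining arguments to run()
        System.exit(ToolRunner.run(new Configuration(), new MyWordCountTool(), args));
    }
}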
The program above can now be improved accordingly. The client is modified as follows (this also requires adding the import org.apache.hadoop.util.GenericOptionsParser):
    // ---------------- Client ----------------
    public static void main(String[] args) throws Exception {
        // Load the HDFS/Hadoop configuration
        Configuration conf = new Configuration();
        /********************************* improvement start *********************************/
        // Parse the generic Hadoop options; keep only the remaining arguments
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        /********************************* improvement end ***********************************/
        // Create the Job with the configuration and a job name
        Job job = new Job(conf, "myjob");
        // Set the class whose jar is used to run the job
        job.setJarByClass(MyWordCount.class);
        // Set the Mapper and Reducer classes
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        // Set the input and output directories, now taken from the parsed arguments
        // FileInputFormat.addInputPath(job, new Path(args[0]));
        // FileOutputFormat.setOutputPath(job, new Path(args[1]));
        /********************************* changed calls start *******************************/
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        /********************************* changed calls end *********************************/
        // Set the types of the output key and value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Submit the job, wait for completion, and print progress on the client
        boolean isSuccess = job.waitForCompletion(true);
        // Exit with the job status
        System.exit(isSuccess ? 0 : 1);
    }
Repackage and run the job following the same steps as above.
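One caveat when re-running: FileOutputFormat fails the job if the output directory already exists, so it must be removed first, e.g. (assuming the paths above):

hadoop fs -rmr /opt/data/test/output/

(On Hadoop 2.x and later, hadoop fs -rm -r is the non-deprecated form of the same command.)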