1. The MapReduce programming model (take, for example, an input file that contains only the three lines shown below)
hive spark hive hbase
hadoop hive spark
sqoop flume scala
These lines are handed to map as (line offset, line content) pairs:
(0, "hive spark hive hbase")
(22, "hadoop hive spark")
(40, "sqoop flume scala")
map output (tracing only the first line as an example): (hive,1), (spark,1), (hive,1), (hbase,1)
In the shuffle stage, values that share the same key are merged into one collection:
(hive,<1,1>), (spark,<1>), (hbase,<1>)
reduce: aggregation
input: (hive,<1,1>), (spark,<1>), (hbase,<1>)
output: (hive,2), (spark,1), (hbase,1)
(A runnable local sketch of this whole map -> shuffle -> reduce flow follows below.)
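As a concrete illustration, here is a minimal sketch in plain Java (no Hadoop dependency) that simulates the map, shuffle, and reduce steps; the class name LocalWordCountSketch and the use of a TreeMap are purely illustrative. Because it processes all three sample lines rather than only the first one, it prints the complete counts for the file, e.g. (hive,3) and (spark,2).

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalWordCountSketch {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "hive spark hive hbase",
                "hadoop hive spark",
                "sqoop flume scala");

        // map + shuffle: each line is split into (word, 1) pairs,
        // and pairs with the same key are grouped into one collection
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }

        // reduce: sum the collection of each key, e.g. (hive, <1, 1, 1>) -> (hive, 3)
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int one : entry.getValue()) {
                sum += one;
            }
            System.out.println("(" + entry.getKey() + "," + sum + ")");
        }
    }
}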
2. How MapReduce processes data
Throughout a MapReduce program, all data flows in (key, value) form.
input: normally no code is needed; you only specify an input path when the MapReduce program runs.
map (core focus): map(key, value, context)
    key: the byte offset of each line -- rarely useful
    value: the content of each line -- the data you actually need to process
shuffle stage:
    key: the key required by the business logic
    value: the values to be aggregated
reduce (core focus): reduce(KEYIN key, Iterable<VALUEIN> values, Context context) (see the skeleton just below)
3. Implementing the MapReduce WordCount in IDEA
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountMapReduce {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // check the arguments
        if (args.length < 2) {
            System.out.println("Usage: wordcount <in> [<in>...] <out>");
            return;
        }
        // 1. load the configuration
        Configuration configuration = new Configuration();
        // 2. create the job
        Job job = Job.getInstance(configuration, "UserWordCountMapReduce");
        job.setJarByClass(WordCountMapReduce.class);
        // a. input
        Path inputPath = new Path(args[0]);
        FileInputFormat.setInputPaths(job, inputPath);
        // b. map
        job.setMapperClass(WordCountMapper.class);
        job.setMapOutputKeyClass(Text.class);          // the map output key is a word (String)
        job.setMapOutputValueClass(IntWritable.class); // the map output value is a count (int)
        // b.b. shuffle (handled by the framework)
        // c. reduce
        job.setReducerClass(WordCountReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // d. output
        Path outputPath = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, outputPath);
        // submit the job and wait for it to finish (true: print progress)
        boolean isSuccess = job.waitForCompletion(true);
        System.exit(isSuccess ? 0 : 1);
    }
    /**
     * class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
     * KEYIN:   the line offset (long)
     * VALUEIN: the line content (String)
     * KEYOUT:
     *     the type of the key output by the map method:
     *     the word being counted (String)
     * VALUEOUT:
     *     the type of the value output by the map method:
     *     the number of occurrences, always 1 here (int)
     */
    private static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private Text mapOutputKey = new Text();
        private static final IntWritable mapOutputValue = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // e.g. the input pair (0, "hive spark hive hbase")
            String valueStr = value.toString();
            String[] items = valueStr.split(" ");
            for (String item : items) {
                mapOutputKey.set(item);
                // use the Context to emit the result of the map method
                // public void write(KEYOUT key, VALUEOUT value)
                context.write(mapOutputKey, mapOutputValue);
            }
        }
    }
    /**
     * public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
     */
    private static class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable outputValue = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // sum all the 1s collected for this word, e.g. (hive, <1, 1>) -> (hive, 2)
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            outputValue.set(sum);
            context.write(key, outputValue);
        }
    }
}
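To actually run the job, package the class into a jar and submit it with the hadoop command. A minimal sketch, assuming the jar is named wordcount.jar and using example HDFS paths (the output directory must not exist before the job starts):

hadoop jar wordcount.jar WordCountMapReduce /user/hadoop/input /user/hadoop/output
hdfs dfs -cat /user/hadoop/output/part-r-00000

With the three sample lines from section 1 as input, the output file written by the single reducer contains one word and its count per line, such as hive 3 and spark 2.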
The code above implements word-frequency counting with a MapReduce program. This is one of my own study notes, and I hope it helps people who are learning this topic. I am still a beginner, so if anything is poorly written, please leave suggestions and I will improve it.