MapReduce Principles and Programming
What is MapReduce?
map: data mapping
reduce: computation that aggregates the mapped data
- MapReduce is a distributed computing framework. It breaks a large data-processing job into individual tasks that can run in parallel across a server cluster.
- It suits large-scale data processing scenarios: each node processes the data stored on that node.
- Each job consists of a Map part and a Reduce part.
MapReduce Design Ideas
- Divide and conquer: a programming model that simplifies parallel computation.
- Build an abstract model with two operations, Map and Reduce; developers focus on implementing the Mapper and Reducer functions.
- Hide system-level details so that developers can concentrate on the business logic.
MapReduce Characteristics
- Strengths
  - Easy to program
  - Scalable
  - Highly fault tolerant
  - High throughput
- Unsuitable areas (weaknesses)
  - Hard to use for real-time computation
  - Not suited to stream processing
MapReduce Programming Model

MapReduce Execution Process
- Data definition format (see the sketch after this list):
  map: (K1,V1) → list(K2,V2)
  reduce: (K2,list(V2)) → list(K3,V3)
- Execution stages:
  Mapper
  Combiner: local merging of the map output (optional)
  Partitioner: repartitions the map output by key
  Shuffle and Sort: transfers and sorts the data
  Reducer: performs the final computation
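The data definition format above maps directly onto the generic type parameters of Hadoop's Mapper and Reducer classes. A minimal sketch using the WordCount types that appear later in this article (the class names are illustrative only):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map: (K1,V1) → list(K2,V2) corresponds to Mapper<K1,V1,K2,V2>
class SketchMapper extends Mapper<LongWritable, Text, Text, IntWritable> { }

// reduce: (K2,list(V2)) → list(K3,V3) corresponds to Reducer<K2,V2,K3,V3>
class SketchReducer extends Reducer<Text, IntWritable, Text, IntWritable> { }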

Hadoop V1 MR Engine

- Job Tracker
  1. Runs on the master (NameNode) node
  2. Accepts job requests from clients
  3. Dispatches tasks to the Task Trackers
- Task Tracker
  1. Receives task requests from the Job Tracker
  2. Executes the map and reduce operations
  3. Sends heartbeats back to the Job Tracker
Hadoop V2: YARN
How a Hadoop 2 MR job runs on YARN

InputSplit (Input Split)
An input split does not store the data itself; it stores the split length and an array recording where the data is located. Each input split is handled by one Mapper task, so the number of Mappers cannot be specified directly.
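Although the number of Mappers cannot be set directly, it can be influenced indirectly by changing the split size. A minimal sketch, assuming the job reads its input through org.apache.hadoop.mapreduce.lib.input.FileInputFormat (the sizes are illustrative):

// a smaller maximum split size produces more splits and therefore more Mapper tasks
FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // at least 64 MB per split
FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // at most 128 MB per split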

Shuffle Phase
The shuffle phase is the process that moves data from the Map output to the Reduce input.

Key & Value Types
- Must be serializable, for network transfer and persistent storage.
  Built-in types include IntWritable, LongWritable, FloatWritable, DoubleWritable, BooleanWritable, Text, NullWritable, and so on.
- All implement the Writable interface and provide the write() and readFields() methods.
- Keys must additionally implement the WritableComparable interface: the Reduce phase sorts by key, so keys must be comparable (a sketch of a custom key type follows this list).
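A minimal sketch of a custom key type implementing WritableComparable (the class and field names are made up for illustration):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class OrderKey implements WritableComparable<OrderKey> {
    private String customId;
    private long orderTime;

    @Override
    public void write(DataOutput out) throws IOException {    // serialization
        out.writeUTF(customId);
        out.writeLong(orderTime);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialization, same field order
        this.customId = in.readUTF();
        this.orderTime = in.readLong();
    }

    @Override
    public int compareTo(OrderKey other) {                    // used by the sort in the Reduce phase
        int cmp = this.customId.compareTo(other.customId);
        return cmp != 0 ? cmp : Long.compare(this.orderTime, other.orderTime);
    }

    // equals(), hashCode(), getters and setters omitted
}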
InputFormat Interface

Combiner Class
A Combiner is essentially a localized Reduce operation:
it performs local aggregation before the shuffle;
it is a performance optimization and is optional;
its input and output types must be identical.
Condition for reusing a Reducer as the Combiner:
the operation must be commutative and associative (summing counts qualifies; computing an average does not).
Registering a Combiner:
job.setCombinerClass(WCReducer.class)
Partitioner Class
Partitions keys on the Map side. The default is HashPartitioner, which
takes the hash value of the key,
computes that hash modulo the number of Reduce tasks,
and uses the result to decide which Reducer each record is sent to.
Custom Partitioner (a sketch follows below):
extend the abstract class Partitioner and override the getPartition method, then register it with
job.setPartitionerClass(MyPartitioner.class)
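A minimal sketch of a custom Partitioner, assuming Text keys and IntWritable values as in the WordCount example below (the routing rule is invented for illustration):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Illustrative rule: words starting with a-m go to partition 0, the rest are hashed
        // across the remaining partitions. The return value must lie in [0, numPartitions).
        if (numPartitions <= 1) {
            return 0;
        }
        String word = key.toString();
        char first = word.isEmpty() ? 'a' : Character.toLowerCase(word.charAt(0));
        if (first >= 'a' && first <= 'm') {
            return 0;
        }
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}

The number of partitions passed in equals the number of reduce tasks configured with job.setNumReduceTasks(...).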
OutputFormat Interface

Writing an M/R Job (the structure is fixed)
InputFormat
Job job = Job.getInstance(getConf(), "WordCount" );
job.setJarByClass( getClass() );
FileInputFormat.addInputPath(job, new Path(args[0]) );
job.setInputFormatClass( TextInputFormat.class );
OutputFormat
FileOutputFormat.setOutputPath( job, new Path(args[1]) );
job.setOutputFormatClass( TextOutputFormat.class );
Mapper
job.setMapperClass( WCMapper.class );
job.setMapOutputKeyClass( Text.class );
job.setMapOutputValueClass( IntWritable.class );
Reducer
job.setReducerClass( WCReducer.class );
job.setOutputKeyClass( Text.class );
job.setOutputValueClass( IntWritable.class );
Implementing WordCount with MapReduce
Schematic:

- Write the Java code: the Mapper, the Reducer, and the Job (driver).
- Run the M/R job (a concrete example follows the driver code below):
  hadoop jar <jar name> <driver class path inside the jar> <full input path in HDFS> <output path in HDFS>
- Set the M/R parameters.
1. Write the Mapper
public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] words = line.split(" ");
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
2. Write the Reducer
public class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable value : values) {
            total += value.get();
        }
        context.write(key, new IntWritable(total));
    }
}
3. Write the Driver
public class WCDriver {
    public static void main(String[] args) throws Exception {
        // set up the job
        Configuration cfg = new Configuration();
        Job job = Job.getInstance(cfg, "job_wc");
        job.setJarByClass(WCDriver.class);
        // specify the mapper and reducer
        job.setMapperClass(WCMapper.class);
        job.setReducerClass(WCReducer.class);
        // specify the mapper output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // specify the reducer output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // specify the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // run the job
        boolean result = job.waitForCompletion(true);
        System.out.println(result ? "success" : "failure");
        System.exit(result ? 0 : 1);
    }
}
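Once the classes above are packaged into a jar, the job can be submitted with the command format shown earlier. A hypothetical invocation (the jar name, driver package, and HDFS paths are placeholders):

hadoop jar wc.jar com.example.WCDriver /input/words.txt /output/wordcount

Note that the output directory must not already exist; FileOutputFormat refuses to run the job if it does.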
Implementing a Join with MapReduce
- Map-side join: a large file joined with a small file (a minimal sketch follows this list).
- Reduce-side join: the implementation below.
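A minimal sketch of a map-side join, assuming the small customer file is shipped to every node through the distributed cache (the file names, field layout, and class names are assumptions for illustration; the reduce-side join actually implemented in this article follows):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    // customId -> customName, loaded once per map task from the small (customer) file
    private final Map<String, String> customers = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The small file was registered in the driver with
        // job.addCacheFile(new URI("hdfs:///data/customers.csv")), so it is localized
        // next to every map task and can be read by its file name.
        String fileName = new Path(context.getCacheFiles()[0].getPath()).getName();
        try (BufferedReader reader = new BufferedReader(new FileReader(fileName))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split(",");
                customers.put(cols[0], cols[1]); // assumed layout: customId,customName,...
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The big (order) file is the normal job input; the join happens right here in the map,
        // so the customer data is never shuffled and no Reducer is needed.
        String[] cols = value.toString().split(","); // assumed layout: orderId,...,customId,orderStatus
        String customName = customers.getOrDefault(cols[2], "unknown");
        context.write(new Text(cols[2] + "," + customName + "," + cols[0] + "," + cols[3]),
                NullWritable.get());
    }
}

Because there is no reduce step, the driver for this sketch would also call job.setNumReduceTasks(0).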
Reduce-side join: write the value class
public class CustomOrder implements Writable {
    private String customId;
    private String customName;
    private String orderId;
    private String orderStatus;
    private String tableFlag; // "0" = record from the custom (customer) table, "1" = record from the order table

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(customId);
        out.writeUTF(customName);
        out.writeUTF(orderId);
        out.writeUTF(orderStatus);
        out.writeUTF(tableFlag);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.customId = in.readUTF();
        this.customName = in.readUTF();
        this.orderId = in.readUTF();
        this.orderStatus = in.readUTF();
        this.tableFlag = in.readUTF();
    }

    // toString(), getters, and setters omitted; they also need to be implemented.
}
mapper
public class COMapperJoin extends Mapper<LongWritable, Text, Text, CustomOrder> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] columns = line.split(",");
        // each field is wrapped in double quotes; keep only the text between them
        for (int i = 0; i < columns.length; i++) {
            columns[i] = columns[i].split("\"")[1];
        }
        CustomOrder co = new CustomOrder();
        if (columns.length == 4) {        // order table
            co.setCustomId(columns[2]);
            co.setCustomName("");
            co.setOrderId(columns[0]);
            co.setOrderStatus(columns[3]);
            co.setTableFlag("1");
        } else if (columns.length == 9) { // customer table
            co.setCustomId(columns[0]);
            co.setCustomName(columns[1] + "·" + columns[2]);
            co.setOrderId("");
            co.setOrderStatus("");
            co.setTableFlag("0");
        }
        context.write(new Text(co.getCustomId()), co);
        // example of what the reducer receives for one key:
        // {1, {CustomOrder(1,xxx,,,0), CustomOrder(1,20,closed,1)}}
    }
}
reducer
public class COReducerJoin extends Reducer<Text, CustomOrder, CustomOrder, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<CustomOrder> values, Context context) throws IOException, InterruptedException {
        StringBuffer orderIds = new StringBuffer();
        StringBuffer statuses = new StringBuffer();
        CustomOrder customOrder = new CustomOrder();
        for (CustomOrder co : values) {
            if (co.getCustomName().equals("")) {
                // record from the order table: collect its id and status
                orderIds.append(co.getOrderId() + "|");
                statuses.append(co.getOrderStatus() + "|");
            } else {
                // record from the customer table: keep the customer fields
                customOrder.setCustomId(co.getCustomId());
                customOrder.setCustomName(co.getCustomName());
            }
        }
        String orderId = "";
        String status = "";
        if (orderIds.length() > 0) {
            orderId = orderIds.substring(0, orderIds.length() - 1); // drop the trailing "|"
        }
        if (statuses.length() > 0) {
            status = statuses.substring(0, statuses.length() - 1);
        }
        customOrder.setOrderId(orderId);
        customOrder.setOrderStatus(status);
        context.write(customOrder, NullWritable.get());
    }
}
Driver
public class CODriver {
    public static void main(String[] args) throws Exception {
        Configuration cfg = new Configuration();
        Job job = Job.getInstance(cfg, "co_job");
        job.setJarByClass(CODriver.class);
        job.setMapperClass(COMapperJoin.class);
        job.setReducerClass(COReducerJoin.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(CustomOrder.class);
        job.setOutputKeyClass(CustomOrder.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.setInputPaths(job, new Path("file:///F:/IdeaProjects/testhdfss/data"));
        FileOutputFormat.setOutputPath(job, new Path("file:///G:/test/coResult"));
        boolean result = job.waitForCompletion(true);
        System.out.println(result ? "success" : "failure");
        System.exit(result ? 0 : 1);
    }
}
MapReduce is a distributed computing framework suited to large-scale data processing. Its core idea is divide and conquer: Mappers map the data and Reducers aggregate it. MapReduce is easy to program, scalable, highly fault tolerant, and offers high throughput, but it is not suitable for real-time or stream processing. The programming model consists of the Map, Combine, Partition, and Reduce stages, with the Combiner serving as a performance optimization. Hadoop V1 uses a Job Tracker, while V2 introduces YARN for resource management. Custom InputFormat, Partitioner, and OutputFormat implementations let MapReduce adapt to different data-processing needs.