MapReduce Principles and Programming
What is MapReduce?
map: data mapping
reduce: computation that aggregates the mapped data
- MapReduce is a distributed computing framework. It breaks a large data-processing job into individual tasks that can run in parallel across a server cluster.
- It suits large-scale data processing scenarios: each node processes the data stored on that node.
- Each job consists of a Map part and a Reduce part.
MapReduce Design Ideas
- Divide and conquer: a programming model that simplifies parallel computation.
- Build an abstract model with two operations, Map and Reduce; developers focus on implementing the Mapper and Reducer functions.
- Hide system-level details so that developers can concentrate on the business logic.
MapReduce Characteristics
- Strengths
  - Easy to program
  - Scalable
  - Highly fault tolerant
  - High throughput
- Unsuitable areas (weaknesses)
  - Hard to use for real-time computation
  - Not suited to stream processing
MapReduce Programming Model

MapReduce Execution Process
- Data definition format (see the sketch after this list):
  map: (K1,V1) → list(K2,V2)
  reduce: (K2,list(V2)) → list(K3,V3)
- Execution stages:
  Mapper
  Combiner: local merging of the map output (optional)
  Partitioner: repartitions the map output by key
  Shuffle and Sort: transfers and sorts the data
  Reducer: performs the final computation
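The data definition format above maps directly onto the generic type parameters of Hadoop's Mapper and Reducer classes. A minimal sketch using the WordCount types that appear later in this article (the class names are illustrative only):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map: (K1,V1) → list(K2,V2) corresponds to Mapper<K1,V1,K2,V2>
class SketchMapper extends Mapper<LongWritable, Text, Text, IntWritable> { }

// reduce: (K2,list(V2)) → list(K3,V3) corresponds to Reducer<K2,V2,K3,V3>
class SketchReducer extends Reducer<Text, IntWritable, Text, IntWritable> { }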

Hadoop V1 MR Engine

- Job Tracker
  1. Runs on the master (NameNode) node
  2. Accepts job requests from clients
  3. Dispatches tasks to the Task Trackers
- Task Tracker
  1. Receives task requests from the Job Tracker
  2. Executes the map and reduce operations
  3. Sends heartbeats back to the Job Tracker
Hadoop V2: YARN
How a Hadoop 2 MR job runs on YARN

InputSplit (Input Split)
An input split does not store the data itself; it stores the split length and an array recording where the data is located. Each input split is handled by one Mapper task, so the number of Mappers cannot be specified directly.
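Although the number of Mappers cannot be set directly, it can be influenced indirectly by changing the split size. A minimal sketch, assuming the job reads its input through org.apache.hadoop.mapreduce.lib.input.FileInputFormat (the sizes are illustrative):

// a smaller maximum split size produces more splits and therefore more Mapper tasks
FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // at least 64 MB per split
FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // at most 128 MB per split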

Shuffle Phase
The shuffle phase is the process that moves data from the Map output to the Reduce input.

Key & Value Types
- Must be serializable, for network transfer and persistent storage.
  Built-in types include IntWritable, LongWritable, FloatWritable, DoubleWritable, BooleanWritable, Text, NullWritable, and so on.
- All implement the Writable interface and provide the write() and readFields() methods.
- Keys must additionally implement the WritableComparable interface: the Reduce phase sorts by key, so keys must be comparable (a sketch of a custom key type follows this list).
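A minimal sketch of a custom key type implementing WritableComparable (the class and field names are made up for illustration):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class OrderKey implements WritableComparable<OrderKey> {
    private String customId;
    private long orderTime;

    @Override
    public void write(DataOutput out) throws IOException {    // serialization
        out.writeUTF(customId);
        out.writeLong(orderTime);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialization, same field order
        this.customId = in.readUTF();
        this.orderTime = in.readLong();
    }

    @Override
    public int compareTo(OrderKey other) {                    // used by the sort in the Reduce phase
        int cmp = this.customId.compareTo(other.customId);
        return cmp != 0 ? cmp : Long.compare(this.orderTime, other.orderTime);
    }

    // equals(), hashCode(), getters and setters omitted
}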
InputFormat Interface

Combiner Class
A Combiner is essentially a localized Reduce operation:
it performs local aggregation before the shuffle;
it is a performance optimization and is optional;
its input and output types must be identical.
Condition for reusing a Reducer as the Combiner:
the operation must be commutative and associative (summing counts qualifies; computing an average does not).
Registering a Combiner:
job.setCombinerClass(WCReducer.class)
Partitioner Class
Partitions keys on the Map side. The default is HashPartitioner, which
takes the hash value of the key,
computes that hash modulo the number of Reduce tasks,
and uses the result to decide which Reducer each record is sent to.
Custom Partitioner (a sketch follows below):
extend the abstract class Partitioner and override the getPartition method, then register it with
job.setPartitionerClass(MyPartitioner.class)
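A minimal sketch of a custom Partitioner, assuming Text keys and IntWritable values as in the WordCount example below (the routing rule is invented for illustration):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Illustrative rule: words starting with a-m go to partition 0, the rest are hashed
        // across the remaining partitions. The return value must lie in [0, numPartitions).
        if (numPartitions <= 1) {
            return 0;
        }
        String word = key.toString();
        char first = word.isEmpty() ? 'a' : Character.toLowerCase(word.charAt(0));
        if (first >= 'a' && first <= 'm') {
            return 0;
        }
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}

The number of partitions passed in equals the number of reduce tasks configured with job.setNumReduceTasks(...).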
OutputFormat Interface

Writing an M/R Job (the structure is fixed)
InputFormat
Job job = Job.getInstance(getConf(), "WordCount" );
job.setJarByClass( getClass() );
FileInputFormat.addInputPath(job, new Path(args[0]) );
job.setInputFormatClass( TextInputFormat.class );
OutputFormat
FileOutputFormat.setOutputPath( job, new Path(args[1]) );
job.setOutputFormatClass( TextOutputFormat.class );
Mapper
job.setMapperClass( WCMapper.class );
job.setMapOutputKeyClass( Text.class );
job.setMapOutputValueClass( IntWritable.class );
Reducer
job.setReducerClass( WCReducer.class );
job.setOutputKeyClass( Text.class );
job.setOutputValueClass( IntWritable.class );
Implementing WordCount with MapReduce
Schematic:

- Write the Java code: the Mapper, the Reducer, and the Job (driver).
- Run the M/R job (a concrete example follows the driver code below):
  hadoop jar <jar name> <driver class path inside the jar> <full input path in HDFS> <output path in HDFS>
- Set the M/R parameters.
1. Write the Mapper
public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] words = line.split(" ");
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
2. Write the Reducer
public class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable value : values) {
            total += value.get();
        }
        context.write(key, new IntWritable(total));
    }
}
3. Write the Driver
public class WCDriver {
    public static void main(String[] args) throws Exception {
        // set up the job
        Configuration cfg = new Configuration();
        Job job = Job.getInstance(cfg, "job_wc");
        job.setJarByClass(WCDriver.class);
        // specify the mapper and reducer
        job.setMapperClass(WCMapper.class);
        job.setReducerClass(WCReducer.class);
        // specify the mapper output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // specify the reducer output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // specify the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // run the job
        boolean result = job.waitForCompletion(true);
        System.out.println(result ? "success" : "failure");
        System.exit(result ? 0 : 1);
    }
}
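Once the classes above are packaged into a jar, the job can be submitted with the command format shown earlier. A hypothetical invocation (the jar name, driver package, and HDFS paths are placeholders):

hadoop jar wc.jar com.example.WCDriver /input/words.txt /output/wordcount

Note that the output directory must not already exist; FileOutputFormat refuses to run the job if it does.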
Implementing a Join with MapReduce
- Map-side join: a large file joined with a small file (a minimal sketch follows this list).
- Reduce-side join: the implementation below.
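A minimal sketch of a map-side join, assuming the small customer file is shipped to every node through the distributed cache (the file names, field layout, and class names are assumptions for illustration; the reduce-side join actually implemented in this article follows):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    // customId -> customName, loaded once per map task from the small (customer) file
    private final Map<String, String> customers = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The small file was registered in the driver with
        // job.addCacheFile(new URI("hdfs:///data/customers.csv")), so it is localized
        // next to every map task and can be read by its file name.
        String fileName = new Path(context.getCacheFiles()[0].getPath()).getName();
        try (BufferedReader reader = new BufferedReader(new FileReader(fileName))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split(",");
                customers.put(cols[0], cols[1]); // assumed layout: customId,customName,...
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The big (order) file is the normal job input; the join happens right here in the map,
        // so the customer data is never shuffled and no Reducer is needed.
        String[] cols = value.toString().split(","); // assumed layout: orderId,...,customId,orderStatus
        String customName = customers.getOrDefault(cols[2], "unknown");
        context.write(new Text(cols[2] + "," + customName + "," + cols[0] + "," + cols[3]),
                NullWritable.get());
    }
}

Because there is no reduce step, the driver for this sketch would also call job.setNumReduceTasks(0).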
Reduce-side join: write the value class
public class CustomOrder implements Writable {
    private String customId;
    private String customName;
    private String orderId;
    private String orderStatus;
    private String tableFlag; // "0" = record from the custom (customer) table, "1" = record from the order table

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(customId);
        out.writeUTF(customName);
        out.writeUTF(orderId);
        out.writeUTF(orderStatus);
        out.writeUTF(tableFlag);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.customId = in.readUTF();
        this.customName = in.readUTF();
        this.orderId = in.readUTF();
        this.orderStatus = in.readUTF();
        this.tableFlag = in.readUTF();
    }

    // toString(), getters, and setters omitted; they also need to be implemented.
}
mapper
public class COMapperJoin extends Mapper<LongWritable, Text, Text, CustomOrder> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] columns = line.split(",");
        // each field is wrapped in double quotes; keep only the text between them
        for (int i = 0; i < columns.length; i++) {
            columns[i] = columns[i].split("\"")[1];
        }
        CustomOrder co = new CustomOrder();
        if (columns.length == 4) {        // order table
            co.setCustomId(columns[2]);
            co.setCustomName("");
            co.setOrderId(columns[0]);
            co.setOrderStatus(columns[3]);
            co.setTableFlag("1");
        } else if (columns.length == 9) { // customer table
            co.setCustomId(columns[0]);
            co.setCustomName(columns[1] + "·" + columns[2]);
            co.setOrderId("");
            co.setOrderStatus("");
            co.setTableFlag("0");
        }
        context.write(new Text(co.getCustomId()), co);
        // example of what the reducer receives for one key:
        // {1, {CustomOrder(1,xxx,,,0), CustomOrder(1,20,closed,1)}}
    }
}
reducer
public class COReducerJoin extends Reducer<Text, CustomOrder, CustomOrder, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<CustomOrder> values, Context context) throws IOException, InterruptedException {
        StringBuffer orderIds = new StringBuffer();
        StringBuffer statuses = new StringBuffer();
        CustomOrder customOrder = new CustomOrder();
        for (CustomOrder co : values) {
            if (co.getCustomName().equals("")) {
                // record from the order table: collect its id and status
                orderIds.append(co.getOrderId() + "|");
                statuses.append(co.getOrderStatus() + "|");
            } else {
                // record from the customer table: keep the customer fields
                customOrder.setCustomId(co.getCustomId());
                customOrder.setCustomName(co.getCustomName());
            }
        }
        String orderId = "";
        String status = "";
        if (orderIds.length() > 0) {
            orderId = orderIds.substring(0, orderIds.length() - 1); // drop the trailing "|"
        }
        if (statuses.length() > 0) {
            status = statuses.substring(0, statuses.length() - 1);
        }
        customOrder.setOrderId(orderId);
        customOrder.setOrderStatus(status);
        context.write(customOrder, NullWritable.get());
    }
}
Driver
public class CODriver {
    public static void main(String[] args) throws Exception {
        Configuration cfg = new Configuration();
        Job job = Job.getInstance(cfg, "co_job");
        job.setJarByClass(CODriver.class);
        job.setMapperClass(COMapperJoin.class);
        job.setReducerClass(COReducerJoin.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(CustomOrder.class);
        job.setOutputKeyClass(CustomOrder.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.setInputPaths(job, new Path("file:///F:/IdeaProjects/testhdfss/data"));
        FileOutputFormat.setOutputPath(job, new Path("file:///G:/test/coResult"));
        boolean result = job.waitForCompletion(true);
        System.out.println(result ? "success" : "failure");
        System.exit(result ? 0 : 1);
    }
}
MapReduce is a distributed computing framework suited to large-scale data processing. Its core idea is divide and conquer: Mappers map the data and Reducers aggregate it. MapReduce is easy to program, scalable, highly fault tolerant, and offers high throughput, but it is not suitable for real-time or stream processing. The programming model consists of the Map, Combine, Partition, and Reduce stages, with the Combiner serving as a performance optimization. Hadoop V1 uses a Job Tracker, while V2 introduces YARN for resource management. Custom InputFormat, Partitioner, and OutputFormat implementations let MapReduce adapt to different data-processing needs.