18 Reduce Join

Reduce Join tags records with their data source in the Map phase; in the Reduce phase the records are grouped by the join field and the records from the different tables are merged. The example below shows how to join a product information table with order data. However, this approach puts heavy processing pressure on the Reduce side and is prone to data skew; the remedy is to merge the data on the Map side instead.

1 How Reduce Join Works

The Map side's main job: for key/value pairs coming from different tables or files, tag each record to mark which source it comes from. Then use the join field as the key, the remaining fields plus the new tag as the value, and emit the pair.
The Reduce side's main job: grouping by the join field (the key) has already been done by the framework, so within each group we only need to separate the records coming from different files (tagged in the Map phase) and then merge them.
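
A minimal, hypothetical sketch of this tagging idea (the full example below takes a different route and packs all fields into a custom key): the mapper keys every record by pid and prefixes the value with a source tag. The tag strings, class name, and the file-name check here are assumptions for illustration only.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

// Hypothetical sketch: key every record by pid and tag the value with its source file.
public class TagJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String fileName;
    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    protected void setup(Context context) {
        // remember the source file of this split so records can be tagged
        fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fileName.startsWith("order")) {
            // order record: id  pid  amount  -> key by pid, tag as "order"
            outKey.set(fields[1]);
            outValue.set("order\t" + fields[0] + "\t" + fields[2]);
        } else {
            // product record: pid  pname  -> key by pid, tag as "pd"
            outKey.set(fields[0]);
            outValue.set("pd\t" + fields[1]);
        }
        // the reducer receives all records of one pid together and merges them by tag
        context.write(outKey, outValue);
    }
}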

2 Example

1. Requirements

Order table (order.txt):

id      pid     amount
1001    01      1
1002    02      2
1003    03      3
1004    01      4
1005    02      5
1006    03      6

Product table:

pid     pname
01      小米
02      华为
03      格力

Merge the data from the product information table into the order table according to the product pid. The expected result:

id      pname   amount
1001    小米    1
1004    小米    4
1002    华为    2
1005    华为    5
1003    格力    3
1006    格力    6

2. Requirement Analysis
By using the join condition as the Map output key, the records from both tables that satisfy the join condition are sent, together with a tag identifying their source file, to the same ReduceTask, where the records are stitched together in the Reduce phase.

Approach:
Read both files in the Map phase and wrap every record in a bean. Use a secondary sort: sort by pid first, and when pids are equal put the record that carries a product name in front. Group by pid so that records with the same pid land in the same group; the first row of each group then comes from the product table. Iterate over the remaining rows and join that first row's pname onto them (see the worked example below).
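
For instance, with the sample data above, the group for pid = 01 arrives at the reducer sorted like this (the product record first, because the secondary sort puts a real pname ahead of the blank placeholder), and its pname is then copied onto the order records:

sorted group for pid = 01:
0       01      0       小米     (product record)
1001    01      1                (order, blank pname)
1004    01      4                (order, blank pname)

joined output:
1001    01      1       小米
1004    01      4       小米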

3. Code

ReduceDriver class
package com.reducejoin;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * @author andy
 * @version 1.0
 * @date 2020/3/3 0:44
 * @contact andy.freedoms@gmail.com
 * @since JDK 1.8
 */
public class ReduceDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // fall back to local test paths when no arguments are supplied
        if (args == null || args.length < 2)
            args = new String[]{"E:\\temp\\input\\reducejoin", "E:\\temp\\output"};
        Path path = new Path(args[1]);
        Configuration conf = new Configuration();
        FileSystem fs = path.getFileSystem(conf);
        Job job = Job.getInstance(conf);


        // delete the output directory if it already exists so the job can be rerun
        try {
            if (fs.exists(path))
                fs.delete(path, true);
        } catch (IllegalArgumentException e) {
            e.printStackTrace();
        } finally {
            fs.close();
        }


        job.setJarByClass(ReduceDriver.class);
        job.setMapperClass(MyReduceJoinMapper.class);
        job.setReducerClass(MyReduceReducer.class);

        // the bean itself is the map output key, so its compareTo() drives the secondary sort
        job.setMapOutputKeyClass(ReduceBean.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setOutputKeyClass(ReduceBean.class);
        job.setOutputValueClass(NullWritable.class);

        // group keys by pid only, so product and order records with the same pid reach one reduce() call
        job.setGroupingComparatorClass(MyGroup.class);

        FileInputFormat.setInputPaths(job, args[0]);
        FileOutputFormat.setOutputPath(job, path);

        boolean b = job.waitForCompletion(true);

        System.exit(b ? 0 : 1);
    }
}

ReduceBean class

package com.reducejoin;

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
 * @author andy
 * @version 1.0
 * @date 2020/3/2 23:54
 * @contact andy.freedoms@gmail.com
 * @since JDK 1.8
 */
public class ReduceBean implements WritableComparable<ReduceBean> {
    private int id;
    private String pid;
    private int amount;
    private String pname;

    public ReduceBean() {
    }

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    public String getPid() {
        return pid;
    }

    public void setPid(String pid) {
        this.pid = pid;
    }

    public int getAmount() {
        return amount;
    }

    public void setAmount(int amount) {
        this.amount = amount;
    }

    public String getPname() {
        return pname;
    }

    public void setPname(String pname) {
        this.pname = pname;
    }

    public void set(int id,String pid,int amount,String pname){
        setId(id);
        setPid(pid);
        setAmount(amount);
        setPname(pname);
    }

    @Override
    public int compareTo(ReduceBean o) {
        // primary sort: by pid
        int result = this.pid.compareTo(o.getPid());
        if (result == 0) {
            // secondary sort: pname in descending order, so the product record (real pname)
            // sorts ahead of order records, whose pname is a single space
            result = o.getPname().compareTo(this.pname);
        }
        return result;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeUTF(pid);
        out.writeInt(amount);
        out.writeUTF(pname);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        id = in.readInt();
        pid = in.readUTF();
        amount = in.readInt();
        pname = in.readUTF();
    }

    @Override
    public String toString() {
        return id + "\t" + pid + "\t" + amount + "\t" + pname ;
    }
}

MyReduceJoinMapper class

package com.reducejoin;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

/**
 * @author andy
 * @version 1.0
 * @date 2020/3/2 23:53
 * @contact andy.freedoms@gmail.com
 * @since JDK 1.8
 */
public class MyReduceJoinMapper extends Mapper<LongWritable, Text,ReduceBean, NullWritable> {

    private String name;
    ReduceBean k = new ReduceBean();
    NullWritable v = NullWritable.get();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // remember which input file this split comes from, so map() can tell orders from products
        FileSplit split = (FileSplit) context.getInputSplit();
        name = split.getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] line = value.toString().split("\t");
        if ("order.txt".equals(name)) {
            // order record: id  pid  amount; a single space is used as a placeholder pname
            // so that product records (with a real pname) sort ahead of orders inside a pid group
            k.set(Integer.parseInt(line[0]), line[1], Integer.parseInt(line[2]), " ");
        } else {
            // product record: pid  pname; id and amount are unused placeholders
            k.set(0, line[0], 0, line[1]);
        }
        context.write(k, v);
    }
}

MyReduceReducer class

package com.reducejoin;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.util.Iterator;

/**
 * @author andy
 * @version 1.0
 * @date 2020/3/3 0:23
 * @contact andy.freedoms@gmail.com
 * @since JDK 1.8
 */
public class MyReduceReducer extends Reducer<ReduceBean, NullWritable,ReduceBean, NullWritable> {

    @Override
    protected void reduce(ReduceBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        Iterator<NullWritable> iterator = values.iterator();
        // the first record in each pid group is the product record (guaranteed by the secondary sort),
        // so consume it and remember its pname
        iterator.next();
        String pname = key.getPname();

        // as the iterator advances, Hadoop refills "key" with the next record's fields,
        // so every remaining key is an order record: copy the pname onto it and emit it
        while (iterator.hasNext()) {
            iterator.next();
            key.setPname(pname);
            context.write(key, NullWritable.get());
        }
    }
}

MyGroup class

package com.reducejoin;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 * @author andy
 * @version 1.0
 * @date 2020/3/3 0:33
 * @contact andy.freedoms@gmail.com
 * @since JDK 1.8
 */
public class MyGroup extends WritableComparator {
    public MyGroup() {
        // "true" tells WritableComparator to create ReduceBean instances to compare
        super(ReduceBean.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        ReduceBean o1 = (ReduceBean) a;
        ReduceBean o2 = (ReduceBean) b;

        // group by pid only, so the product record and all its orders share one reduce() call
        return o1.getPid().compareTo(o2.getPid());
    }
}

3 Drawbacks of Reduce Join and the Solution

Drawback: with this approach the merge is done in the Reduce phase, so the Reduce side carries a heavy processing load while the Map nodes do very little work; resource utilization is poor, and the Reduce phase is very prone to data skew.

Solution: merge the data on the Map side (a Map-side join). A sketch follows.
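
A minimal sketch of such a Map-side join, assuming the small product table is shipped to every mapper through the distributed cache; the class name and the cached-file handling below are illustrative assumptions, not part of the original example.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a Map-side join: the product table is small enough to be
// loaded into memory, so the join happens in map() and no Reduce phase is needed.
public class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Map<String, String> pidToName = new HashMap<>();
    private final Text outKey = new Text();

    @Override
    protected void setup(Context context) throws IOException {
        // read the cached product file (added in the driver via job.addCacheFile(...))
        URI[] cacheFiles = context.getCacheFiles();
        Path productFile = new Path(cacheFiles[0]);
        FileSystem fs = FileSystem.get(context.getConfiguration());
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(productFile), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");      // pid \t pname
                pidToName.put(fields[0], fields[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");  // id \t pid \t amount
        String pname = pidToName.getOrDefault(fields[1], "NULL");
        outKey.set(fields[0] + "\t" + pname + "\t" + fields[2]);
        context.write(outKey, NullWritable.get());
    }
}

In the driver, the product file would be registered with job.addCacheFile(...) and the Reduce phase disabled with job.setNumReduceTasks(0), so the full product table is never shuffled across the network and the Reduce-side skew problem disappears.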
