1 How Reduce Join Works
Map-side work: tag every key/value pair so that records coming from different tables (or files) can be told apart, then emit the join field as the key and the remaining fields plus the tag as the value.
Reduce-side work: the framework has already grouped the records by the join field (the key), so inside each group we only need to separate the records that came from different files (using the tags added in the Map phase) and merge them.
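As a rough illustration of this flow, here is a minimal sketch of a tag-then-merge mapper/reducer pair. The class names (TaggingJoinMapper, TaggingJoinReducer) and the tag strings are made up for illustration only; the worked example in section 2 uses a secondary-sort variant instead of buffering records in the reducer.
```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Mapper: key = join field (pid), value = source tag + the rest of the record
class TaggingJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String fileName;

    @Override
    protected void setup(Context context) {
        // remember which file this split comes from, so records can be tagged
        fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if ("order.txt".equals(fileName)) {
            // order record "1001  01  1"  ->  key "01", value "order\t1001\t1"
            context.write(new Text(fields[1]), new Text("order\t" + fields[0] + "\t" + fields[2]));
        } else {
            // product record "01  小米"  ->  key "01", value "pd\t小米"
            context.write(new Text(fields[0]), new Text("pd\t" + fields[1]));
        }
    }
}

// Reducer: within one pid group, separate the two sources by tag, then merge
class TaggingJoinReducer extends Reducer<Text, Text, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String pname = "";
        List<String> orders = new ArrayList<>();
        for (Text v : values) {
            String[] parts = v.toString().split("\t", 2);
            if ("pd".equals(parts[0])) {
                pname = parts[1];          // the single product record
            } else {
                orders.add(parts[1]);      // "id\tamount"
            }
        }
        for (String order : orders) {
            String[] idAmount = order.split("\t");
            // output: id  pname  amount
            context.write(new Text(idAmount[0] + "\t" + pname + "\t" + idAmount[1]),
                    NullWritable.get());
        }
    }
}
```
Buffering one side of the join in the reducer's memory is exactly the Reduce-side pressure discussed in section 3.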
2 Example
1. Requirements
Order data table:
| id | pid | amount |
|---|---|---|
| 1001 | 01 | 1 |
| 1002 | 02 | 2 |
| 1003 | 03 | 3 |
| 1004 | 01 | 4 |
| 1005 | 02 | 5 |
| 1006 | 03 | 6 |
Product information table:
| pid | pname |
|---|---|
| 01 | 小米 |
| 02 | 华为 |
| 03 | 格力 |
Merge the product information into the order data, joining on the product pid. The expected result:
| id | pname | amount |
|---|---|---|
| 1001 | 小米 | 1 |
| 1004 | 小米 | 4 |
| 1002 | 华为 | 2 |
| 1005 | 华为 | 5 |
| 1003 | 格力 | 3 |
| 1006 | 格力 | 6 |
2. Analysis
Use the join condition (pid) as the key of the Map output, and tag each record with the file it came from, so that the matching records of both tables are sent to the same ReduceTask; the records are then stitched together in the Reduce phase.
Approach:
Read both files in the Mapper and wrap every record in a bean. Use a secondary sort: sort by pid first and, for equal pids, put the record that carries a pname in front. Group by pid so that all records with the same pid land in the same reduce group, with the product record as the first row; then iterate over the remaining rows of the group and join the first row's pname onto each of them.
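For example, with the sample data above the pid = 01 group reaches reduce() in the following order (order records carry a single-space placeholder pname, which sorts after 小米; the relative order of the two order records is not guaranteed and does not matter):

| id | pid | amount | pname |
|---|---|---|---|
| 0 | 01 | 0 | 小米 |
| 1001 | 01 | 1 |  |
| 1004 | 01 | 4 |  |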
- Code
ReduceDriver class
```java
package com.reducejoin;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;

/**
 * @author andy
 * @version 1.0
 * @date 2020/3/3 0:44
 * @contact andy.freedoms@gmail.com
 * @since JDK 1.8
 */
public class ReduceDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        if (args == null || args.length < 2)
            args = new String[]{"E:\\temp\\input\\reducejoin", "E:\\temp\\output"};
        Path path = new Path(args[1]);
        Configuration conf = new Configuration();
        FileSystem fs = path.getFileSystem(conf);
        Job job = Job.getInstance(conf);
        try {
            // delete the output directory if it already exists, otherwise the job fails
            if (fs.exists(path))
                fs.delete(path, true);
        } catch (IllegalArgumentException e) {
            e.printStackTrace();
        } finally {
            fs.close();
        }
        job.setJarByClass(ReduceDriver.class);
        job.setMapperClass(MyReduceJoinMapper.class);
        job.setReducerClass(MyReduceReducer.class);
        // the bean itself is the key; sorting is driven by ReduceBean.compareTo()
        job.setMapOutputKeyClass(ReduceBean.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(ReduceBean.class);
        job.setOutputValueClass(NullWritable.class);
        // group by pid only, so order and product records with the same pid share one reduce() call
        job.setGroupingComparatorClass(MyGroup.class);
        FileInputFormat.setInputPaths(job, args[0]);
        FileOutputFormat.setOutputPath(job, path);
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}
```
ReduceBean class
```java
package com.reducejoin;
import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
 * @author andy
 * @version 1.0
 * @date 2020/3/2 23:54
 * @contact andy.freedoms@gmail.com
 * @since JDK 1.8
 */
public class ReduceBean implements WritableComparable<ReduceBean> {
    private int id;
    private String pid;
    private int amount;
    private String pname;

    public ReduceBean() {
    }

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    public String getPid() {
        return pid;
    }

    public void setPid(String pid) {
        this.pid = pid;
    }

    public int getAmount() {
        return amount;
    }

    public void setAmount(int amount) {
        this.amount = amount;
    }

    public String getPname() {
        return pname;
    }

    public void setPname(String pname) {
        this.pname = pname;
    }

    public void set(int id, String pid, int amount, String pname) {
        setId(id);
        setPid(pid);
        setAmount(amount);
        setPname(pname);
    }

    @Override
    public int compareTo(ReduceBean o) {
        // primary sort: pid ascending
        int result = this.pid.compareTo(o.getPid());
        if (result == 0) {
            // secondary sort: pname descending, so the product record (non-blank pname)
            // comes before the order records (blank pname) within the same pid
            result = o.getPname().compareTo(this.pname);
        }
        return result;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeUTF(pid);
        out.writeInt(amount);
        out.writeUTF(pname);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // fields must be read back in the same order they were written
        id = in.readInt();
        pid = in.readUTF();
        amount = in.readInt();
        pname = in.readUTF();
    }

    @Override
    public String toString() {
        // output format: id  pid  amount  pname (tab-separated)
        return id + "\t" + pid + "\t" + amount + "\t" + pname;
    }
}
```
MyReduceJoinMapper class
```java
package com.reducejoin;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import java.io.IOException;

/**
 * @author andy
 * @version 1.0
 * @date 2020/3/2 23:53
 * @contact andy.freedoms@gmail.com
 * @since JDK 1.8
 */
public class MyReduceJoinMapper extends Mapper<LongWritable, Text, ReduceBean, NullWritable> {
    private String name;
    ReduceBean k = new ReduceBean();
    NullWritable v = NullWritable.get();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // remember the name of the file this split comes from; it acts as the source tag
        FileSplit split = (FileSplit) context.getInputSplit();
        name = split.getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] line = value.toString().split("\t");
        if ("order.txt".equals(name)) {
            // order record: id, pid, amount; pname is a blank (single-space) placeholder
            k.set(Integer.parseInt(line[0]), line[1], Integer.parseInt(line[2]), " ");
        } else {
            // product record: pid, pname; id and amount are 0 placeholders
            k.set(0, line[0], 0, line[1]);
        }
        context.write(k, v);
    }
}
```
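The mapper tells the two tables apart by file name, so the input directory is assumed to contain two tab-separated files: order.txt (checked explicitly in the code) and the product file, which may have any other name, e.g. pd.txt. With the sample data from the requirements they would look like this:
```
order.txt (tab-separated):
1001	01	1
1002	02	2
1003	03	3
1004	01	4
1005	02	5
1006	03	6

pd.txt (tab-separated):
01	小米
02	华为
03	格力
```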
MyReduceReducer class
```java
package com.reducejoin;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
import java.util.Iterator;

/**
 * @author andy
 * @version 1.0
 * @date 2020/3/3 0:23
 * @contact andy.freedoms@gmail.com
 * @since JDK 1.8
 */
public class MyReduceReducer extends Reducer<ReduceBean, NullWritable, ReduceBean, NullWritable> {
    @Override
    protected void reduce(ReduceBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        Iterator<NullWritable> iterator = values.iterator();
        // consume the first record of the group: thanks to the secondary sort it is the
        // product record, so the shared key object now holds the real pname
        iterator.next();
        String pname = key.getPname();
        while (iterator.hasNext()) {
            // advancing the iterator refills the same key object with the next order record
            iterator.next();
            key.setPname(pname);
            context.write(key, NullWritable.get());
        }
    }
}
```
MyGroup class
```java
package com.reducejoin;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 * @author andy
 * @version 1.0
 * @date 2020/3/3 0:33
 * @contact andy.freedoms@gmail.com
 * @since JDK 1.8
 */
public class MyGroup extends WritableComparator {
    public MyGroup() {
        // true: create ReduceBean instances so compare() receives deserialized objects
        super(ReduceBean.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // group by pid only, ignoring the other fields, so the product record and all
        // order records with the same pid go into one reduce() call
        ReduceBean o1 = (ReduceBean) a;
        ReduceBean o2 = (ReduceBean) b;
        return o1.getPid().compareTo(o2.getPid());
    }
}
```
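For reference, running the job on the sample data should produce output like the following (one ReduceBean.toString() per line). Note that it still carries the pid column, unlike the target table in the requirements, and the order of the rows within one pid group is not guaranteed:
```
1001	01	1	小米
1004	01	4	小米
1002	02	2	华为
1005	02	5	华为
1003	03	3	格力
1006	03	6	格力
```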
3 Drawbacks of Reduce Join and the Solution
Drawback: with this approach all of the merging is done in the Reduce phase, so the Reduce side carries most of the processing load while the Map nodes do very little work; resource utilization is poor, and data skew is very likely to appear in the Reduce phase.
Solution: do the merging on the Map side (Map Join).
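As a hedged sketch of that map-side alternative: assume the small product table pd.txt is shipped to every MapTask via job.addCacheFile() in the driver and the normal input contains only order records (the class name MapJoinMapper and the file layout are assumptions for illustration, not part of the example above):
```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map Join: cache the small product table in each MapTask and join while mapping,
// so no Reducer is needed (job.setNumReduceTasks(0) in the driver).
class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Map<String, String> pidToPname = new HashMap<>();
    private final Text outKey = new Text();

    @Override
    protected void setup(Context context) throws IOException {
        // the driver is assumed to have called: job.addCacheFile(new URI(".../pd.txt"))
        URI[] cacheFiles = context.getCacheFiles();
        FileSystem fs = FileSystem.get(context.getConfiguration());
        try (FSDataInputStream in = fs.open(new Path(cacheFiles[0]));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");   // pid \t pname
                pidToPname.put(fields[0], fields[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // the normal input only contains order records: id \t pid \t amount
        String[] fields = value.toString().split("\t");
        String pname = pidToPname.get(fields[1]);
        outKey.set(fields[0] + "\t" + pname + "\t" + fields[2]);
        context.write(outKey, NullWritable.get());
    }
}
```
Because the join happens while the data is still spread across the MapTasks (with job.setNumReduceTasks(0) there is no shuffle or Reduce phase at all), the Reduce-side bottleneck and the skew problem disappear; the trade-off is that the cached table must be small enough to fit in memory.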
To sum up: Reduce Join tags each record with its source in the Map phase, then groups by the join field and merges the records from the different tables in the Reduce phase; the example shows how to join the product information onto the order data this way. Because all of the merge work lands on the Reduce side, it can overload the reducers and cause data skew, so the remedy is to move the merge to the Map side.