Multi-table join types:
Basic principle
A reduce-side join exploits MapReduce's grouping: records that share the same key (the foreign key) are brought together in the Reducer and merged there. Depending on how the tables relate, a reduce-side join comes in three flavors: 1to1, 1toN, and NtoN. Their usage scenarios are:
- 1to1: each record in table A matches exactly one record in table B. The common field (foreign key) of A and B becomes the map output key; the remaining fields become the map output value, tagged with a data-source flag that says which dataset the value came from. For each input key, the Reducer merges the values according to their source flags.
For example, key1 has one value from A and one from B; key2 has a value from B and one from A; key3 has only a value from A; key4 only a value from B. When the shuffle groups these records, the Reducer picks out each side by its source flag and concatenates them:
(key1, value from A)
(key1, value from B)
(key2, value from B)
(key2, value from A)
(key3, value from A)
(key4, value from B)
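The 1to1 merge above can be sketched outside Hadoop by simulating the shuffle with an in-memory multimap. A minimal plain-Java sketch (names like `OneToOneJoinSketch` and `Tagged` are illustrative, not from the original job):

```java
import java.util.*;

class OneToOneJoinSketch {
    // One shuffled record: the join key, a data-source flag, and the payload.
    record Tagged(String key, int flag, String value) {} // flag 0 = table A, 1 = table B

    // Inner join: emit a key only when both sides supplied a value.
    static Map<String, String> innerJoin(List<Tagged> shuffled) {
        // Simulate the shuffle/group step: collect values per key.
        Map<String, List<Tagged>> groups = new TreeMap<>();
        for (Tagged t : shuffled)
            groups.computeIfAbsent(t.key(), k -> new ArrayList<>()).add(t);

        Map<String, String> joined = new LinkedHashMap<>();
        for (Map.Entry<String, List<Tagged>> e : groups.entrySet()) {
            String fromA = null, fromB = null;
            for (Tagged t : e.getValue())          // pick each side by its flag
                if (t.flag() == 0) fromA = t.value(); else fromB = t.value();
            if (fromA != null && fromB != null)    // both present: a 1to1 match
                joined.put(e.getKey(), fromA + "," + fromB);
        }
        return joined;
    }
}
```

With the six records above, only key1 and key2 survive an inner join; key3 and key4 each miss one side and are dropped.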
- 1toN: one record in table A matches several records in table B. Unlike 1to1, this requires a secondary sort and MultipleInputs. The common field plus the data-source flag are combined into a composite class used as the map output key; records are sorted by (1) the common field, then (2) the source flag, and grouped by the common field alone. The remaining fields, again tagged with the source flag, form a composite class used as the map output value, so the Reducer can tell which dataset each value came from. By reading that flag, the Reducer can implement inner join, left join, right join, and so on.
For example, key1 has value1 from A (a single value) and value1, value2, value3 from B (three values). The composite key's compareTo() places the value from A at the front of each key's group, and the grouping comparator merges records by key alone. The Reducer reads key1's value1 from A and joins it with each of value1, value2, and value3 from B in turn. key2 has only a value from A and key3 only a value from B; an inner join discards both, while a left or right join handles each per its definition:
(key1, value1 from A)
(key1, value1 from B)
(key1, value2 from B)
(key1, value3 from B)
(key2, value1 from A)
(key3, value1 from B)
Because every record of A and B passes through the shuffle, techniques such as a raw comparator and map-output compression can save significant network bandwidth and I/O.
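For instance, map-output compression is a one-line driver setting. A sketch (the property names are the standard MapReduce 2 keys; the Snappy codec choice is an assumption and needs native library support on the cluster — any installed CompressionCodec works):

```java
// Sketch: compress map output to shrink shuffle traffic.
// The codec choice (Snappy) is an assumption, not from the original job.
Configuration conf = job.getConfiguration();
conf.setBoolean("mapreduce.map.output.compress", true);
conf.set("mapreduce.map.output.compress.codec",
        "org.apache.hadoop.io.compress.SnappyCodec");
```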
- NtoN: several records in table A match several records in table B. Like 1toN, it is implemented with a secondary sort. This case is rare in practice and is not covered here.
Query requirement: join the donation dataset with the project dataset on project id. The result fields from each dataset are:
- Donation:id,project id,donor city,date,total
- Project:id,school city,poverty level,primary subject
The equivalent SQL:
SELECT d.donation_id, d.project_id as join_fk, d.donor_city, d.ddate, d.total, p.project_id, p.school_city, p.poverty_level, p.primary_focus_subject
FROM donations as d
INNER JOIN projects as p
ON d.project_id = p.project_id;
Implementation:
Given the data, Project to Donation is a 1toN relationship, joined on project id.
1) Create the Mapper for the Donations data (named Dontiondata in the code). It emits (project_id, donation fields) as a (Doninfo, Doninfo) pair.
public static class Dontiondata extends
        Mapper<LongWritable, Text, Doninfo, Doninfo> {
    Doninfo outputValue = new Doninfo();
    Doninfo outputKey = new Doninfo();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String str = value.toString();
        if (str.contains("_projectid")) // skip the header line
            return;
        String[] words = str.split("\",\"");
        // drop records with any of the needed columns missing
        if (words[4].isEmpty() || words[5].isEmpty() || words[11].isEmpty())
            return;
        outputKey.setInfo(words[1].replaceAll("\"", "")); // project_id, quotes stripped
        outputKey.setFlag(1);   // 1 = donation record
        outputValue.setFlag(1);
        // donor city, state, and donation total columns
        outputValue.setInfo(words[4] + DELIMITER + words[5] + DELIMITER + words[11]);
        context.write(outputKey, outputValue);
        context.getCounter(MyEnum.MAP_DATA).increment(1);
    }
}
2) Create the Mapper for the Projects data (Donproject). It emits (project_id, project fields) as a (Doninfo, Doninfo) pair.
public static class Donproject extends
        Mapper<LongWritable, Text, Doninfo, Doninfo> {
    Doninfo outputValue = new Doninfo();
    Doninfo outputKey = new Doninfo();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String str = value.toString();
        if (str.contains("_projectid")) // skip the header line
            return;
        String[] words = str.split("\",\"");
        // drop records with any of the needed columns missing
        if (words[6].isEmpty() || words[8].isEmpty() || words[11].isEmpty())
            return;
        outputKey.setFlag(0);   // 0 = project record
        outputKey.setInfo(words[0].replaceAll("\"", "")); // project_id, quotes stripped
        outputValue.setFlag(0);
        // school city and related project columns
        outputValue.setInfo(words[6] + DELIMITER + words[8] + DELIMITER + words[11]);
        context.write(outputKey, outputValue);
        context.getCounter(MyEnum.MAP_PROJ).increment(1);
    }
}
3) Register both Mappers with MultipleInputs so each dataset is read by its own Mapper:
MultipleInputs.addInputPath(job, new Path("data/donation-project"), TextInputFormat.class, Donproject.class);
MultipleInputs.addInputPath(job, new Path("data/donation"), TextInputFormat.class, Dontiondata.class);
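Beyond the two MultipleInputs calls, the driver must also register the composite key and value classes, the grouping comparator, and the Reducer. A sketch of the remaining job wiring (the job name and the Tool/getConf() style are assumptions; the output path matches the `hdfs dfs -cat` command shown later):

```java
// Driver sketch: remaining job setup around the MultipleInputs calls above.
Job job = Job.getInstance(getConf(), "donation-project join"); // name is illustrative
job.setJarByClass(Donation.class);

job.setMapOutputKeyClass(Doninfo.class);
job.setMapOutputValueClass(Doninfo.class);
job.setGroupingComparatorClass(Dongroup.class); // merge reduce groups by info only
// job.setSortComparatorClass(Donsort.class);   // enables the raw-comparator optimization

job.setReducerClass(Donreducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileOutputFormat.setOutputPath(job, new Path("output/donation4"));

return job.waitForCompletion(true) ? 0 : 1;
```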
4) Implement the composite key class for project_id and define its custom sort order.
public class Doninfo implements WritableComparable<Doninfo> {
    String info;   // the join key (project_id)
    int flag;      // 0 = project record; 1 = donation record

    // Shuffle sort order: first by info (the join key); for equal info, by
    // flag ascending, so the project record precedes its donations.
    @Override
    public int compareTo(Doninfo arg0) {
        int result = info.compareTo(arg0.getInfo());
        return (result != 0) ? result : flag - arg0.getFlag();
    }

    @Override
    public String toString() {
        return info + "\t" + flag;
    }

    public Doninfo() {
    }

    public Doninfo(String info, int flag) {
        this.info = info;
        this.flag = flag;
    }

    public String getInfo() {
        return info;
    }

    public void setInfo(String info) {
        this.info = info;
    }

    public int getFlag() {
        return flag;
    }

    public void setFlag(int flag) {
        this.flag = flag;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(info);
        out.writeInt(flag);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        info = in.readUTF();
        flag = in.readInt();
    }

    /*
    // Registers the raw comparator from the optimization section below.
    static {
        WritableComparator.define(Doninfo.class, new Donsort());
    }
    */
}
5) Define a grouping comparator so the Reducer merges records by the join key (info) alone, ignoring the flag.
public static class Dongroup extends WritableComparator {
    public Dongroup() {
        super(Doninfo.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Group on info only, so the project record and all donation
        // records for one project_id reach a single reduce() call.
        Doninfo lhs = (Doninfo) a;
        Doninfo rhs = (Doninfo) b;
        return lhs.getInfo().compareTo(rhs.getInfo());
    }
}
6) In the Reducer, use the flag to determine which dataset each input value came from.
public static class Donreducer extends
        Reducer<Doninfo, Doninfo, Text, Text> {
    Text outputValue = new Text();
    Text outputKey = new Text();

    @Override
    protected void reduce(Doninfo key, Iterable<Doninfo> values, Context context)
            throws IOException, InterruptedException {
        String strproj = "";
        String strdata = "";
        Iterator<Doninfo> iter = values.iterator();
        Doninfo firstone = iter.next();
        // Values arrive sorted by flag: the project record (flag 0) first,
        // donation records (flag 1) after it. If the first value has flag 1,
        // this key has no project record; for an inner join, emit nothing.
        if (firstone.getFlag() == 1)
            return;
        strproj = firstone.getInfo();
        // If iter.hasNext() is false, the key has no donation records; for an
        // inner join the loop never runs and nothing is emitted.
        while (iter.hasNext()) {
            strdata = iter.next().getInfo();
            outputKey.set(key.getInfo());
            outputValue.set(strproj + DELIMITER + strdata);
            context.write(outputKey, outputValue);
        }
    }
}
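The flag-ordered iteration in this Reducer generalizes to outer joins. A plain-Java sketch of the per-key merge logic (the name `FlagMergeSketch` is illustrative; the `leftOuter=false` branch mirrors Donreducer's inner join, while `leftOuter=true` keeps an unmatched project with empty donation fields):

```java
import java.util.*;

class FlagMergeSketch {
    /**
     * Merge the value list of one reduce key, already sorted so the project
     * record (flag 0) comes first and donation records (flag 1) follow.
     * leftOuter=false is the inner join; leftOuter=true also emits a project
     * that has no donations, padded with an empty donation field.
     */
    static List<String> merge(List<Map.Entry<Integer, String>> values, boolean leftOuter) {
        List<String> out = new ArrayList<>();
        if (values.isEmpty() || values.get(0).getKey() == 1)
            return out;                           // no project record for this key
        String proj = values.get(0).getValue();
        boolean matched = false;
        for (int i = 1; i < values.size(); i++) { // remaining entries are donations
            out.add(proj + "," + values.get(i).getValue());
            matched = true;
        }
        if (!matched && leftOuter)
            out.add(proj + ",");                  // left join: keep the unmatched project
        return out;
    }
}
```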
16/05/19 14:30:50 INFO mapreduce.Job: Counters: 52
File System Counters
FILE: Number of bytes read=157719008
FILE: Number of bytes written=317804674
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2073053723
HDFS: Number of bytes written=35640930
HDFS: Number of read operations=60
HDFS: Number of large read operations=0
HDFS: Number of write operations=8
Job Counters
Killed map tasks=1
Launched map tasks=16
Launched reduce tasks=4
Data-local map tasks=16
Total time spent by all maps in occupied slots (ms)=359753
Total time spent by all reduces in occupied slots (ms)=122295
Total time spent by all map tasks (ms)=359753
Total time spent by all reduce tasks (ms)=122295
Total vcore-milliseconds taken by all map tasks=359753
Total vcore-milliseconds taken by all reduce tasks=122295
Total megabyte-milliseconds taken by all map tasks=368387072
Total megabyte-milliseconds taken by all reduce tasks=125230080
Map-Reduce Framework
Map input records=5510190
Map output records=2377907
Map output bytes=152963170
Map output materialized bytes=157719368
Input split bytes=4436
Combine input records=0
Combine output records=0
Reduce input groups=1560293
Reduce shuffle bytes=157719368
Reduce input records=2377907
Reduce output records=460882
Spilled Records=4755814
Shuffled Maps =64
Failed Shuffles=0
Merged Map outputs=64
GC time elapsed (ms)=10495
CPU time spent (ms)=162640
Physical memory (bytes) snapshot=5343625216
Virtual memory (bytes) snapshot=16799875072
Total committed heap usage (bytes)=4116709376
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
donation4.Donation$MyEnum
MAP_DATA=1499075
MAP_PROJ=878832
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=35640930
xuefei@xuefei:~/hdp/donation4$ hdfs dfs -cat output/donation4/p* |head
0000702ee4cefdb1e7f89084e50d3c85 Chicago,60616,Cook,LaGrange,IL,10.00
00008c61582d7b4d3a2d89ad88f9eedf Baltimore,21214,Baltimore City,Baltimore,MD,234.17
000096b54de3b748fce836454030c8d5 Los Angeles,90007,Los Angeles,Los Angeles,CA,179.01
000096b54de3b748fce836454030c8d5 Los Angeles,90007,Los Angeles,Hoboken,NJ,29.18
00012e7f0fe1c81ea915d97febd73111 Bald Knob,72010,White,Bentonville,AR,188.30
000238b44c187c254752736e53ee46f2 Roanoke,24015,Roanoke City,New York,NY,361.79
0002c408c59b058ab5f285afd22a5987 Philadelphia,19122,Philadelphia,Philadelphia,PA,270.68
00038236ed975e2e8be5dbd792a71a6a Los Angeles,90042,Los Angeles,Burbank,CA,5.00
00038236ed975e2e8be5dbd792a71a6a Los Angeles,90042,Los Angeles,Loveland,CO,17.05
00038236ed975e2e8be5dbd792a71a6a Los Angeles,90042,Los Angeles,Los Angeles,CA,5.00
Optimization:
Use a raw comparator to sort the map output keys directly on their serialized bytes, avoiding key deserialization during the sort.
public class Donation extends Configured implements Tool {
    static final String DELIMITER = ",";

    public enum MyEnum { MAP_DATA, MAP_PROJ }

    public static class Donsort extends WritableComparator {
        public Donsort() {
            super(Doninfo.class, true);
        }

        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            // Doninfo.write() emits writeUTF(info) -- a 2-byte length followed
            // by the string bytes -- then writeInt(flag), so the flag starts
            // at offset 2 + length (the original draft read it at length + 4).
            int len11 = readUnsignedShort(b1, s1);
            int len21 = readUnsignedShort(b2, s2);
            int flag1 = readInt(b1, s1 + 2 + len11);
            int flag2 = readInt(b2, s2 + 2 + len21);
            // Compare the info bytes first (byte order matches String order
            // for the ASCII ids used here), then break ties on the flag.
            int result = compareBytes(b1, s1 + 2, len11, b2, s2 + 2, len21);
            return (result != 0) ? result : flag1 - flag2;
        }
    }
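The byte layout Donsort relies on can be double-checked with plain java.io: Doninfo.write() calls writeUTF then writeInt, and writeUTF emits a 2-byte unsigned length followed by the (modified UTF-8) string bytes, so the flag begins at offset 2 + length. A stdlib-only sketch (class name is illustrative):

```java
import java.io.*;

class OffsetCheck {
    /** Serialize (info, flag) exactly the way Doninfo.write() does. */
    static byte[] serialize(String info, int flag) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeUTF(info);   // 2-byte length + modified-UTF-8 bytes
        out.writeInt(flag);   // 4 bytes, big-endian
        return bos.toByteArray();
    }

    /** Read a big-endian int at the given offset, as WritableComparator.readInt does. */
    static int readIntAt(byte[] b, int off) {
        return ((b[off] & 0xff) << 24) | ((b[off + 1] & 0xff) << 16)
             | ((b[off + 2] & 0xff) << 8) | (b[off + 3] & 0xff);
    }

    /** Recover the flag from a serialized record using the 2 + length offset. */
    static int flagOf(byte[] b) {
        int len = ((b[0] & 0xff) << 8) | (b[1] & 0xff); // readUnsignedShort
        return readIntAt(b, 2 + len);                   // flag sits right after the string
    }
}
```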
Job counters after the optimization:
File System Counters
FILE: Number of bytes read=157719008
FILE: Number of bytes written=317808134
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2073053723
HDFS: Number of bytes written=29083726
HDFS: Number of read operations=60
HDFS: Number of large read operations=0
HDFS: Number of write operations=8
Job Counters
Killed reduce tasks=1
Launched map tasks=16
Launched reduce tasks=4
Data-local map tasks=16
Total time spent by all maps in occupied slots (ms)=319294
Total time spent by all reduces in occupied slots (ms)=96833
Total time spent by all map tasks (ms)=319294 // with the raw comparator, total map time (sorting and spilling output keys to local disk) drops by about 11%
Total time spent by all reduce tasks (ms)=96833 // with the raw comparator, total reduce time (merging input keys) drops by about 21%
Total vcore-milliseconds taken by all map tasks=319294
Total vcore-milliseconds taken by all reduce tasks=96833
Total megabyte-milliseconds taken by all map tasks=326957056
Total megabyte-milliseconds taken by all reduce tasks=99156992
Map-Reduce Framework
Map input records=5510190
Map output records=2377907
Map output bytes=152963170
Map output materialized bytes=157719368
Input split bytes=4436
Combine input records=0
Combine output records=0
Reduce input groups=1546816
Reduce shuffle bytes=157719368
Reduce input records=2377907
Reduce output records=374510
Spilled Records=4755814
Shuffled Maps =64
Failed Shuffles=0
Merged Map outputs=64
GC time elapsed (ms)=9466
CPU time spent (ms)=145950
Physical memory (bytes) snapshot=5281873920
Virtual memory (bytes) snapshot=16811524096
Total committed heap usage (bytes)=4075290624
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
donation4.Donation$MyEnum
MAP_DATA=1499075
MAP_PROJ=878832
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=29083726