Hadoop sampler, multiple input paths, sampling only one file (MultipleInputs + getSample(conf.getInputFormat)...

Sampling with two input files in Hadoop


Tags:

#big-data #java

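This post describes a way to sample only one of two input files (the large one) in a Hadoop job. A custom RecordReader tells the two files apart by the length of their rows (the number of columns), and MultipleInputs is combined with InputSampler so that only the chosen file is sampled.

I had been working on the sampler and thought I was done, but a new problem came up: my job has two input files, and the design calls for sampling only the large one for now (later each file will be sampled separately). When there is a single input file to sample, the sampler code looks like this: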

Path input = new Path(args[0].toString());
input = input.makeQualified(input.getFileSystem(conf));

InputSampler.IntervalSampler<Text, NullWritable> sampler = new InputSampler.IntervalSampler<Text, NullWritable>(0.4, 5);

// This statement means two partitions
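For context, the two arguments of InputSampler.IntervalSampler are a sampling frequency (0.4 here) and the maximum number of splits to sample (5 here). The following is a minimal pure-Java sketch of interval-style selection, not Hadoop's own code: it assumes a record is kept whenever the ratio of kept records to seen records is below the frequency. The class and method names are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; it does not depend on Hadoop.
class IntervalSamplingSketch {

    // Keeps a key whenever kept/seen is still below the requested frequency.
    static <K> List<K> sample(List<K> keys, double freq) {
        List<K> samples = new ArrayList<>();
        long seen = 0;
        long kept = 0;
        for (K key : keys) {
            ++seen;
            if ((double) kept / seen < freq) {
                ++kept;
                samples.add(key);
            }
        }
        return samples;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            keys.add("k" + i);
        }
        // Prints the subset of keys selected at frequency 0.4.
        System.out.println(sample(keys, 0.4));
    }
}
```

Under this assumption the selection is deterministic and spread evenly over the input, which is why an interval sampler suits input that is already sorted.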

// Prototype: K[] getSample(InputFormat<K,V> inf, JobConf job)

String skewuri_out = args[2] + "/sample_list"; // holds the sampling result, not the partitioning result
FileSystem fs = FileSystem.get(URI.create(skewuri_out), conf);
FSDataOutputStream fs_out = fs.create(new Path(skewuri_out));

final InputFormat inf = conf.getInputFormat(); // the InputFormat configured on the JobConf
Object[] p = sampler.getSample(inf, conf); // the sampled keys; the variable must be declared as Object[], switching it to a Writable array type did not work, and I do not know why
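One plausible explanation for the Object[] restriction, offered as an assumption rather than a confirmed reading of the Hadoop source: if a generic method builds its K[] result from a list with the no-argument toArray(), the array is an Object[] at runtime, and assigning it to a more specific array type fails with a ClassCastException. The sketch below reproduces that pattern in plain Java; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; it does not depend on Hadoop.
class GenericArrayCastSketch {

    // The no-argument toArray() always returns an Object[] at runtime,
    // so the cast to K[] is unchecked and only "works" after erasure.
    @SuppressWarnings("unchecked")
    static <K> K[] collect(List<K> samples) {
        return (K[]) samples.toArray();
    }

    // Assigning the result to String[] forces a runtime cast, which fails.
    static boolean typedAssignmentFails() {
        List<String> samples = new ArrayList<>();
        samples.add("a");
        try {
            String[] typed = collect(samples);
            return typed == null; // not reached if the cast fails
        } catch (ClassCastException e) {
            return true;
        }
    }

    // Assigning the result to Object[] needs no cast, so it succeeds.
    static boolean objectAssignmentWorks() {
        List<String> samples = new ArrayList<>();
        samples.add("a");
        Object[] raw = collect(samples);
        return raw.length == 1;
    }

    public static void main(String[] args) {
        System.out.println("typed assignment fails: " + typedAssignmentFails());
        System.out.println("Object[] assignment works: " + objectAssignmentWorks());
    }
}
```

If this is indeed what happens, keeping the variable as Object[] and casting individual elements when they are used is the safe way to consume the samples.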

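But here is the problem. Suppose I write two Mapper classes, Map1class and Map2class, and each one processes data from a different input path. For now both inputs use the same input format, so this can be done with MultipleInputs: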

MultipleInputs.addInputPath(conf, new Path(args[0]), Definemyself.class, Map1class.class);
MultipleInputs.addInputPath(conf, new Path(args[1]), Definemyself.class, Map2class.class);

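// Definemyself.class is my own class; it extends FileInputFormat and implements the WritableComparable interface.

// Extending FileInputFormat is needed for sampling. It implements WritableComparable because I wanted each record serialized as a whole during the join. I cannot explain serialization well myself; think of it as something like a struct in C: a single unit that can be printed with toString().

Its declaration is: public class Definemyself extends FileInputFormat<Text,Text> implements WritableComparable{...}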

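This problem has been bothering me since last night. Last week I was dreaming about sampling, and this week I am still dreaming about sampling. At noon I went out to eat with my husband so we could talk it through properly. My reasoning was that since the system provides MultipleInputs, and JobConf can also call getInputFormat(), there must be a way to use the two together; otherwise the design would contradict itself, and only a fool would build a system like that.

Finally, after getting back from lunch, I tried my idea again and solved it: how to sample the files of only one input path when there are two input paths. The answer is very simple. See below.

First, modify Definemyself.class so that its output shows which table a record came from. I used the simplest approach: the rows of the two tables have different lengths, so I call the one with longer rows the long table and the other the short table. The main change is to the next function in this class. Below is the most important function in Definemyself.class.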

class EmployeeRecordReader implements RecordReader<Text, Text> {
    private LineRecordReader in;
    private LongWritable junk = new LongWritable();
    private Text line = new Text();
    private int KEY_LENGTH = 10;

    // Only next() is shown; the constructor and the other RecordReader methods are omitted.
    public boolean next(Text key, Text value) throws IOException {
        if (!in.next(junk, line)) {
            return false; // no more input
        }
        String[] lines = line.toString().split(",");
        String deptno;
        if (lines.length == 9) { // the large table, i.e. the one to be sampled
            // System.out.println("The line in next of Employee is " + line);
            deptno = lines[7];
            // Many posts on total-order sorting, for integers or for strings, use the
            // pattern below. When a row has many columns and only one of them is the
            // sort key, value.set has to change so that the whole row is emitted; the
            // value = new Text() seen in other blogs only suits global sorting of plain
            // strings. The if/else could be dropped; it stays as a reminder that a key
            // taking too many bytes needs extra handling.
            if (deptno.length() < KEY_LENGTH) {
                key.set(deptno);
                value.set(line + "," + "0"); // trailing "0" marks a long-table row
            } else {
                key.set(deptno);
                value.set(line + "," + "0");
            }
        } else { // the short table
            System.out.println("(for short.txt)The line in next of Employee is " + line);
            deptno = lines[0]; // in the file with longer rows, deptno seems to be column 5; use this index for now
            if (deptno.length() < KEY_LENGTH /* &&(!lines[8].equals("0")) */) {
                key.set(deptno);
                value.set(line + "," + "1"); // trailing "1" marks a short-table row
            } else {
                key.set(deptno);
                value.set(line + "," + "1");
            }
        }
        return true;
    }
}
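The tagging idea can be tried out without Hadoop. The sketch below is a simplified stand-in for the next() logic, assuming the same rule as the post (nine columns means the long table, key in column 7, tag "0"; anything else is the short table, key in column 0, tag "1"). The class name, the helper names, and the sample rows are hypothetical.

```java
// Hypothetical illustration class; it does not depend on Hadoop.
class RecordTagSketch {
    static final int LONG_TABLE_COLUMNS = 9;

    // Returns {key, taggedValue} for one input line.
    static String[] tag(String line) {
        String[] cols = line.split(",");
        if (cols.length == LONG_TABLE_COLUMNS) {
            return new String[] { cols[7], line + ",0" }; // long table
        }
        return new String[] { cols[0], line + ",1" };     // short table
    }

    // What a mapper could do later: read the trailing tag to find the source table.
    static boolean isFromLongTable(String taggedValue) {
        return taggedValue.endsWith(",0");
    }

    public static void main(String[] args) {
        String[] longRow = tag("7369,SMITH,CLERK,7902,1980,800,0,D20,x");
        String[] shortRow = tag("D20,RESEARCH");
        System.out.println(longRow[0] + " -> " + longRow[1]);
        System.out.println(shortRow[0] + " -> " + shortRow[1]);
    }
}
```

Because the tag is appended as the last field, a mapper only needs the end of the value to decide which table a record belongs to.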

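With the change above, both tables can be read through this one class, so there is no need to write two classes that extend FileInputFormat and implement WritableComparable, one per table. Now for how to make the sampler sample only one file. The answer is almost laughably simple: use MultipleInputs to register only the input path that should be sampled, then call the sampler. Once sampling is finished, close the streams and files that belong to it, and only then use MultipleInputs to register the second input path. This way the files under path one (which may contain several text files) are sampled first, and afterwards the files from both path one and path two go into map, where the mapper uses some extra information to tell which path each record came from.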

MultipleInputs.addInputPath(conf, new Path(args[0]), Definemyself.class, Mapclass.class); // first input path

/********* Sampling below ********** for more sampling details see my other post, the "different perspective" one ***********/

Path input = new Path(args[0].toString());
input = input.makeQualified(input.getFileSystem(conf));

InputSampler.RandomSampler<Text, NullWritable> sampler = new InputSampler.RandomSampler<Text, NullWritable>(0.4, 20, 5);

/........... details omitted here ................/

IOUtils.closeStream(fs_out); // close the stream; everything related to sampling is finished

/............... add any other work needed here, e.g. the distributed cache or Hashtable handling ............../

MultipleInputs.addInputPath(conf, new Path(args[3]), Definemyself.class, Mapclass.class); // finally, register the second input path

JobClient.runJob(conf);
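The reason the ordering matters can be shown with a toy model: whatever reads the job's input paths at sampling time sees only the paths registered so far. The sketch below is not Hadoop code. FakeJob and pathsSeenBySampler are hypothetical stand-ins for the JobConf and the sampler, and the two paths are made-up examples.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; it does not depend on Hadoop.
class SampleThenAddSketch {

    // Stand-in for the job configuration: it only tracks registered input paths.
    static class FakeJob {
        final List<String> inputPaths = new ArrayList<>();

        void addInputPath(String path) {
            inputPaths.add(path);
        }
    }

    // Stand-in for the sampler: it takes a snapshot of the paths registered right now.
    static List<String> pathsSeenBySampler(FakeJob job) {
        return new ArrayList<>(job.inputPaths);
    }

    public static void main(String[] args) {
        FakeJob job = new FakeJob();
        job.addInputPath("/data/long");                  // 1. register the path to sample
        List<String> sampled = pathsSeenBySampler(job);  // 2. sample
        job.addInputPath("/data/short");                 // 3. register the second path
        System.out.println("sampled: " + sampled);
        System.out.println("job inputs: " + job.inputPaths);
    }
}
```

In this model the sampler never learns about the second path, while the job itself still runs over both, which mirrors the behavior described in the post.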