Inverted Index
If you are not sure what an inverted index is, see the following post first:
倒排索引详解 (a detailed explanation of inverted indexes)
In short, an inverted index maps each word to the documents it appears in, together with a per-document count. The goal here is to build a simple inverted index with Hadoop MapReduce.
Preparing the files
A few simple text files:
a.txt
hello tom
hello kitty
hello jack
b.txt
hello jerry
hello tom
hello tim
c.txt
hello tom
hello jack
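For reference, the three sample files can be created locally with commands like the following (nothing special here; any editor works just as well). Uploading them to HDFS is covered in the "Running the jar" section at the end.

    printf 'hello tom\nhello kitty\nhello jack\n' > a.txt
    printf 'hello jerry\nhello tom\nhello tim\n' > b.txt
    printf 'hello tom\nhello jack\n' > c.txt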
How it works
1. The final result we want is a single file. Each line holds a word, followed by the files it appears in and the count per file; files that do not contain the word are simply absent from that line:
hello   a.txt->3   b.txt->3   c.txt->2
jack    a.txt->1   c.txt->1
jerry   b.txt->1
kitty   a.txt->1
tim     b.txt->1
tom     a.txt->1   b.txt->1   c.txt->1
2. To produce this output, the reducer emits Text keys and Text values, e.g. the key "hello" with the value "a.txt->3 b.txt->3 c.txt->2" (entries separated by tabs in the actual output).
3. The mapper's input key and value types are LongWritable and Text (the key is the byte offset of the line, the value is the line itself); its output key and value are both Text.
map input: <0,"hello tom"> <10,"hello kitty"> ...
map output: <"hello->a.txt","1"> <"hello->a.txt","1"> <"hello->a.txt","1"> <"jack->a.txt","1">
<"kitty->a.txt","1"> <"tom->a.txt","1"> ... // only the records from a.txt are listed here
4. The combiner (a special reducer that runs on the map-side output) receives the mapper's records grouped by key:
<"hello->a.txt",{1,1,1}> <"jack->a.txt",{1}>
<"kitty->a.txt",{1}> <"tom->a.txt",{1}> // from a.txt
<"hello->b.txt",{1,1,1}> <"jerry->b.txt",{1}>
<"tim->b.txt",{1}> <"tom->b.txt",{1}> // from b.txt
<"hello->c.txt",{1,1}> <"jack->c.txt",{1}>
<"tom->c.txt",{1}> // from c.txt
It sums the 1s and, at the same time, moves the file name out of the key and into the value, so its output is
<"hello","a.txt->3"> <"jack","a.txt->1"> <"kitty","a.txt->1">
<"tom","a.txt->1"> ... // again only the records for a.txt are listed
5. The reducer's input is the combiner's output, now grouped by word:
<"hello",{"a.txt->3","b.txt->3","c.txt->2"}> // the word hello
<"jack",{"a.txt->1","c.txt->1"}> // the word jack
...
For each word, the reducer simply concatenates the values, separated by tabs, which yields exactly the lines shown in step 1, e.g. <"hello","a.txt->3 b.txt->3 c.txt->2">. The code follows.
The code
package cn.master.hadoop.mr.ii;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class InverseIndex {

    public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        private Text k = new Text();
        private Text v = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(" ");
            // Find out which file this split belongs to; getName() gives just the
            // file name (e.g. "a.txt"), whereas toString() would give the full URI.
            FileSplit inputSplit = (FileSplit) context.getInputSplit();
            String path = inputSplit.getPath().getName();
            // Emit <"word->file", "1"> for every word on the line.
            for (String w : words) {
                k.set(w + "->" + path);
                v.set("1");
                context.write(k, v);
            }
        }
    }
    // This Combiner does the per-file counting and also rewrites the key, so it is not
    // an optional optimization here: the job only produces the desired output if it runs.
    // (Strictly speaking, Hadoop treats combiners as optional, so relying on one for
    // correctness is a tutorial simplification.)
    public static class IndexCombiner extends Reducer<Text, Text, Text, Text> {
        private Text k = new Text();
        private Text v = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Split "word->file" back into the word and the file name.
            String[] wordAndPath = key.toString().split("->");
            String word = wordAndPath[0];
            String path = wordAndPath[1];
            // Sum the 1s emitted by the mapper for this word in this file.
            int counter = 0;
            for (Text t : values) {
                counter += Integer.parseInt(t.toString());
            }
            // Emit <word, "file->count">, e.g. <"hello", "a.txt->3">.
            k.set(word);
            v.set(path + "->" + counter);
            context.write(k, v);
        }
    }
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        private Text v = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Concatenate every "file->count" entry for this word, separated by tabs.
            StringBuilder result = new StringBuilder();
            for (Text t : values) {
                result.append(t.toString()).append("\t");
            }
            v.set(result.toString().trim());
            context.write(key, v);
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(InverseIndex.class);

        // Map phase
        job.setMapperClass(IndexMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        // Combine and reduce phases
        job.setCombinerClass(IndexCombiner.class);
        job.setReducerClass(IndexReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Exit with a non-zero status if the job fails.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
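One practical note: FileOutputFormat refuses to start if the output directory already exists. If you plan to re-run the job while testing, a small helper like the one below (my own addition, not part of the original code) can be called from main() as deleteIfExists(conf, args[1]) right before FileOutputFormat.setOutputPath(...); it needs an extra import of org.apache.hadoop.fs.FileSystem.

    // Hypothetical helper: delete the previous run's output so the job can start again.
    private static void deleteIfExists(Configuration conf, String dir) throws IOException {
        Path outputPath = new Path(dir);
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(outputPath)) {
            fs.delete(outputPath, true); // recursive delete
        }
    }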
Running the jar
Finally, package the project into a jar named index.jar, upload a.txt, b.txt and c.txt into an ii directory on HDFS, and pick an output path such as iiout. Note that the output directory must not exist before the job runs; Hadoop creates it and fails if it is already there. Then run the jar with the matching paths:
hadoop jar /root/index.jar cn.master.hadoop.mr.ii.InverseIndex /ii /iiout
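Before running the command above, upload the inputs; when the job finishes, read the result back from HDFS. A minimal sequence (the /ii and /iiout paths are simply the ones used above; adjust them to your cluster):

    # upload the sample files
    hadoop fs -mkdir -p /ii
    hadoop fs -put a.txt b.txt c.txt /ii

    # after the job completes, inspect the result
    hadoop fs -cat /iiout/part-r-00000

With the three sample files, the output should match the table in the "How it works" section, e.g. a line like "hello   a.txt->3   b.txt->3   c.txt->2".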