Inverted Index
If you are not sure what an inverted index is, see the following post first:
倒排索引详解 (a detailed explanation of inverted indexes)
In short, an inverted index maps each word to the documents it appears in, together with a per-document count. The goal here is to build a simple inverted index with Hadoop MapReduce.
Preparing the files
A few simple text files:
a.txt
hello tom
hello kitty
hello jack
b.txt
hello jerry
hello tom
hello tim
c.txt
hello tom
hello jack
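For reference, the three sample files can be created locally with commands like the following (nothing special here; any editor works just as well). Uploading them to HDFS is covered in the "Running the jar" section at the end.

    printf 'hello tom\nhello kitty\nhello jack\n' > a.txt
    printf 'hello jerry\nhello tom\nhello tim\n' > b.txt
    printf 'hello tom\nhello jack\n' > c.txt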
How it works
1. The final result we want is a single file. Each line holds a word, followed by the files it appears in and the count per file; files that do not contain the word are simply absent from that line:
hello   a.txt->3   b.txt->3   c.txt->2
jack    a.txt->1   c.txt->1
jerry   b.txt->1
kitty   a.txt->1
tim     b.txt->1
tom     a.txt->1   b.txt->1   c.txt->1
2. To produce this output, the reducer emits Text keys and Text values, e.g. the key "hello" with the value "a.txt->3 b.txt->3 c.txt->2" (entries separated by tabs in the actual output).
3. The mapper's input key and value types are LongWritable and Text (the key is the byte offset of the line, the value is the line itself); its output key and value are both Text.
map input: <0,"hello tom"> <10,"hello kitty"> ...
map output: <"hello->a.txt","1"> <"hello->a.txt","1"> <"hello->a.txt","1"> <"jack->a.txt","1">
<"kitty->a.txt","1"> <"tom->a.txt","1"> ... // only the records from a.txt are listed here
4. The combiner (a special reducer that runs on the map-side output) receives the mapper's records grouped by key:
<"hello->a.txt",{1,1,1}> <"jack->a.txt",{1}>
<"kitty->a.txt",{1}> <"tom->a.txt",{1}> // from a.txt
<"hello->b.txt",{1,1,1}> <"jerry->b.txt",{1}>
<"tim->b.txt",{1}> <"tom->b.txt",{1}> // from b.txt
<"hello->c.txt",{1,1}> <"jack->c.txt",{1}>
<"tom->c.txt",{1}> // from c.txt
It sums the 1s and, at the same time, moves the file name out of the key and into the value, so its output is
<"hello","a.txt->3"> <"jack","a.txt->1"> <"kitty","a.txt->1">
<"tom","a.txt->1"> ... // again only the records for a.txt are listed
5. The reducer's input is the combiner's output, now grouped by word:
<"hello",{"a.txt->3","b.txt->3","c.txt->2"}> // the word hello
<"jack",{"a.txt->1","c.txt->1"}> // the word jack
...
For each word, the reducer simply concatenates the values, separated by tabs, which yields exactly the lines shown in step 1, e.g. <"hello","a.txt->3 b.txt->3 c.txt->2">. The code follows.
The code
package cn.master.hadoop.mr.ii;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class InverseIndex {

    public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        private Text k = new Text();
        private Text v = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(" ");
            // Find out which file this split belongs to; getName() gives just the
            // file name (e.g. "a.txt"), whereas toString() would give the full URI.
            FileSplit inputSplit = (FileSplit) context.getInputSplit();
            String path = inputSplit.getPath().getName();
            // Emit <"word->file", "1"> for every word on the line.
            for (String w : words) {
                k.set(w + "->" + path);
                v.set("1");
                context.write(k, v);
            }
        }
    }
    // This Combiner does the per-file counting and also rewrites the key, so it is not
    // an optional optimization here: the job only produces the desired output if it runs.
    // (Strictly speaking, Hadoop treats combiners as optional, so relying on one for
    // correctness is a tutorial simplification.)
    public static class IndexCombiner extends Reducer<Text, Text, Text, Text> {
        private Text k = new Text();
        private Text v = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Split "word->file" back into the word and the file name.
            String[] wordAndPath = key.toString().split("->");
            String word = wordAndPath[0];
            String path = wordAndPath[1];
            // Sum the 1s emitted by the mapper for this word in this file.
            int counter = 0;
            for (Text t : values) {
                counter += Integer.parseInt(t.toString());
            }
            // Emit <word, "file->count">, e.g. <"hello", "a.txt->3">.
            k.set(word);
            v.set(path + "->" + counter);
            context.write(k, v);
        }
    }
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        private Text v = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Concatenate every "file->count" entry for this word, separated by tabs.
            StringBuilder result = new StringBuilder();
            for (Text t : values) {
                result.append(t.toString()).append("\t");
            }
            v.set(result.toString().trim());
            context.write(key, v);
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(InverseIndex.class);

        // Map phase
        job.setMapperClass(IndexMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        // Combine and reduce phases
        job.setCombinerClass(IndexCombiner.class);
        job.setReducerClass(IndexReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Exit with a non-zero status if the job fails.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
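One practical note: FileOutputFormat refuses to start if the output directory already exists. If you plan to re-run the job while testing, a small helper like the one below (my own addition, not part of the original code) can be called from main() as deleteIfExists(conf, args[1]) right before FileOutputFormat.setOutputPath(...); it needs an extra import of org.apache.hadoop.fs.FileSystem.

    // Hypothetical helper: delete the previous run's output so the job can start again.
    private static void deleteIfExists(Configuration conf, String dir) throws IOException {
        Path outputPath = new Path(dir);
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(outputPath)) {
            fs.delete(outputPath, true); // recursive delete
        }
    }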
Running the jar
Finally, package the project into a jar named index.jar, upload a.txt, b.txt and c.txt into an ii directory on HDFS, and pick an output path such as iiout. Note that the output directory must not exist before the job runs; Hadoop creates it and fails if it is already there. Then run the jar with the matching paths:
hadoop jar /root/index.jar cn.master.hadoop.mr.ii.InverseIndex /ii /iiout
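Before running the command above, upload the inputs; when the job finishes, read the result back from HDFS. A minimal sequence (the /ii and /iiout paths are simply the ones used above; adjust them to your cluster):

    # upload the sample files
    hadoop fs -mkdir -p /ii
    hadoop fs -put a.txt b.txt c.txt /ii

    # after the job completes, inspect the result
    hadoop fs -cat /iiout/part-r-00000

With the three sample files, the output should match the table in the "How it works" section, e.g. a line like "hello   a.txt->3   b.txt->3   c.txt->2".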