Inverted Index with Hadoop

Inverted Index

If you are not sure what an inverted index is, see the following link:
Inverted Index Explained

In short, a forward index maps each document to the words it contains, while an inverted index maps each word to the documents in which it appears (here, together with a per-document occurrence count).
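
For intuition, the same index can be sketched on a single machine in plain Java before going distributed. This is an illustrative sketch only (the class and variable names below are my own, not part of the Hadoop program that follows):

import java.util.Map;
import java.util.TreeMap;

public class InMemoryIndex {

    public static void main(String[] args) {
        // The three sample files from this post, keyed by file name.
        Map<String, String> docs = new TreeMap<>();
        docs.put("a.txt", "hello tom hello kitty hello jack");
        docs.put("b.txt", "hello jerry hello tom hello tim");
        docs.put("c.txt", "hello tom hello jack");

        // word -> (file -> occurrence count)
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().split(" ")) {
                index.computeIfAbsent(word, w -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }

        // Prints e.g. hello={a.txt=3, b.txt=3, c.txt=2}
        index.forEach((word, postings) -> System.out.println(word + "=" + postings));
    }

}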

Goal: build a simple inverted index with Hadoop.

Preparing the Files

A few simple text files:
a.txt

hello tom
hello kitty
hello jack

b.txt

hello jerry
hello tom
hello tim

c.txt

hello tom
hello jack

Implementation Analysis

1. The final result we want is a single file whose content looks like this (each word, followed by the files it appears in and its count in each; a word that never occurs in a file simply has no entry for that file):
hello   a.txt->3   b.txt->3   c.txt->2
jack    a.txt->1   c.txt->1
jerry   b.txt->1
kitty   a.txt->1
tim     b.txt->1
tom     a.txt->1   b.txt->1   c.txt->1
2. To produce this output, the reducer must emit both key and value as Text, e.g. key "hello" with the tab-separated value "a.txt->3  b.txt->3  c.txt->2".
3. The mapper's input key and value types are LongWritable and Text; its output key and value types are Text and Text.
map input: <0, "hello tom"> <10, "hello kitty"> ...
map output: <"hello->a.txt", "1"> <"hello->a.txt", "1"> <"hello->a.txt", "1"> <"jack->a.txt", "1">
<"kitty->a.txt", "1"> <"tom->a.txt", "1"> ... // only the pairs for a.txt are listed
4. The combiner (essentially a local reduce) receives:
<"hello->a.txt", {1,1,1}> <"jack->a.txt", {1}>
<"kitty->a.txt", {1}> <"tom->a.txt", {1}>    // file a.txt
<"hello->b.txt", {1,1,1}> <"jerry->b.txt", {1}>
<"tim->b.txt", {1}> <"tom->b.txt", {1}>      // file b.txt
<"hello->c.txt", {1,1}> <"jack->c.txt", {1}>
<"tom->c.txt", {1}>                          // file c.txt
It sums the 1s, splits each key back into word and file, and emits
<"hello", "a.txt->3"> <"jack", "a.txt->1"> <"kitty", "a.txt->1">
<"tom", "a.txt->1"> ... // again only the pairs for a.txt are listed
5. The reducer's input is the combiner's output, now grouped by word:
<"hello", {"a.txt->3", "b.txt->3", "c.txt->2"}> <"jack", {"a.txt->1", "c.txt->1"}> ...
The reducer simply concatenates the values for each word, which yields exactly the final result shown in step 1:
<"hello", "a.txt->3  b.txt->3  c.txt->2">   // the word hello
<"jack", "a.txt->1  c.txt->1">              // the word jack

The code follows.

Code Implementation

package cn.master.hadoop.mr.ii;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InverseIndex {

    // Mapper: for every word in a line, emit <"word->filename", "1">.
    public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text>{

        private Text k = new Text();
        private Text v = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line  =  value.toString();
            String[] words = line.split(" ");
            FileSplit inputSplit = (FileSplit) context.getInputSplit();
            // Use getName() rather than toString() so the key carries just the
            // file name (e.g. "a.txt") instead of the full HDFS path.
            String path = inputSplit.getPath().getName();
            for(String w : words){
                k.set(w + "->" + path);
                v.set("1");
                context.write(k, v);
            }
        }

    }

    // Combiner (a local reduce): sums the 1s for each "word->file" key and
    // re-emits <"word", "file->count">. Because it changes the key/value
    // format, the job depends on it actually running; MapReduce treats
    // combiners as an optional optimization, so this combiner is not
    // "pluggable"/removable -- a demonstration shortcut rather than a
    // production-safe pattern.
    public static class IndexCombiner extends Reducer<Text, Text, Text, Text>{

        private Text k = new Text();
        private Text v = new Text();
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String[] wordAndPath = key.toString().split("->");
            String word = wordAndPath[0];
            String path = wordAndPath[1];
            int counter = 0;
            for(Text t : values){
                counter += Integer.parseInt(t.toString());
            }
            k.set(word);
            v.set(path + "->" + counter);
            context.write(k, v);
        }
    }

    // Reducer: concatenates all "file->count" values for the same word.
    public static class IndexReducer extends Reducer<Text, Text, Text, Text>{

        private Text v = new Text();
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Join the "file->count" entries with tabs; trim the trailing tab.
            StringBuilder result = new StringBuilder();
            for(Text t : values){
                result.append(t.toString()).append("\t");
            }
            v.set(result.toString().trim());
            context.write(key, v);
        }

    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(InverseIndex.class);
        job.setMapperClass(IndexMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        job.setCombinerClass(IndexCombiner.class);
        job.setReducerClass(IndexReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Exit with a nonzero status if the job fails.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

}
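
For quick iteration, the same classes can also be exercised with Hadoop's local job runner, which executes the whole map/combine/reduce pipeline in-process against the local filesystem, no cluster required. Below is a minimal sketch of such a driver; the class name LocalIndexTest and the input/output directory names are placeholders, and it assumes the Hadoop client libraries are on the classpath:

package cn.master.hadoop.mr.ii;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalIndexTest {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Run MapReduce in-process against the local filesystem.
        conf.set("mapreduce.framework.name", "local");
        conf.set("fs.defaultFS", "file:///");
        Job job = Job.getInstance(conf);
        job.setJarByClass(InverseIndex.class);
        job.setMapperClass(InverseIndex.IndexMapper.class);
        job.setCombinerClass(InverseIndex.IndexCombiner.class);
        job.setReducerClass(InverseIndex.IndexReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // "input" is a local directory holding a.txt, b.txt and c.txt;
        // "output" must not exist yet.
        FileInputFormat.setInputPaths(job, new Path("input"));
        FileOutputFormat.setOutputPath(job, new Path("output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

}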

Running the JAR

Finally, package the project as a jar named index.jar, upload a.txt, b.txt, and c.txt to an HDFS directory named ii, and choose an output directory iiout (it must not exist beforehand; the job creates it). Then run the jar, adjusting the paths to your setup:
hadoop jar /root/index.jar cn.master.hadoop.mr.ii.InverseIndex /ii /iiout
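
For reference, the end-to-end command sequence might look like this, assuming the text files sit in the current local directory (part-r-00000 is the default output file name when a job runs a single reducer):

hadoop fs -mkdir -p /ii
hadoop fs -put a.txt b.txt c.txt /ii
hadoop jar /root/index.jar cn.master.hadoop.mr.ii.InverseIndex /ii /iiout
hadoop fs -cat /iiout/part-r-00000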
