MapReduce in Practice: Building an Inverted Index

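This post shows how to build an inverted index with MapReduce in two steps: a first MapReduce job produces a per-file count for each word, and a second job turns those counts into the final inverted index. The walkthrough includes the mapper and reducer code for both jobs.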


Source: https://www.bilibili.com/video/av36033875?from=search&seid=12700632591522714293

Requirements

We have a large amount of text (documents, web pages) and need to build a search index over it.
Input:

  • File a.txt:

pdc pdc
a b

  • File b.txt:

pdc pdc
a b

  • File c.txt:

pdc pdc
a b

Expected result:

pdc a.txt-->2 b.txt-->2 c.txt-->2
a a.txt-->1 b.txt-->1 c.txt-->1
b a.txt-->1 b.txt-->1 c.txt-->1

Analysis

The work can be split into two steps:
1. Result of the first MapReduce job:

pdc--a.txt 2
pdc--b.txt 2
pdc--c.txt 2
a--a.txt 1
a--b.txt 1
a--c.txt 1
b--a.txt 1
b--b.txt 1
b--c.txt 1

2. Result of the second MapReduce job:

pdc c.txt-->2 b.txt-->2 a.txt-->2
a c.txt-->1 b.txt-->1 a.txt-->1
b c.txt-->1 b.txt-->1 a.txt-->1
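
The two steps can be tried out without a cluster. The following is a minimal sketch in plain Java, not Hadoop code: in-memory maps stand in for the shuffle phase, and the names InvertedIndexSketch, firstPass and secondPass are hypothetical helpers introduced only for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class InvertedIndexSketch {

    // Step 1: count occurrences per "word--file" key,
    // mirroring what the first job writes out.
    static Map<String, Integer> firstPass(Map<String, List<String>> files) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<String>> file : files.entrySet()) {
            for (String line : file.getValue()) {
                for (String word : line.split(" ")) {
                    if (word.isEmpty()) {
                        continue; // skip tokens produced by repeated spaces
                    }
                    counts.merge(word + "--" + file.getKey(), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Step 2: regroup by word and join the "file-->count" entries with tabs.
    static Map<String, String> secondPass(Map<String, Integer> counts) {
        Map<String, StringBuilder> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            String[] fields = e.getKey().split("--", 2); // {word, file}
            grouped.computeIfAbsent(fields[0], w -> new StringBuilder())
                   .append(fields[1]).append("-->").append(e.getValue()).append('\t');
        }
        Map<String, String> index = new TreeMap<>();
        // trim() drops the trailing tab left by the loop above
        grouped.forEach((word, sb) -> index.put(word, sb.toString().trim()));
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        for (String name : new String[] {"a.txt", "b.txt", "c.txt"}) {
            files.put(name, Arrays.asList("pdc pdc", "a b"));
        }
        secondPass(firstPass(files))
                .forEach((word, postings) -> System.out.println(word + "\t" + postings));
    }
}
```

For the sample input, the entry for pdc is a.txt-->2, b.txt-->2, c.txt-->2 joined by tabs. The sketch sorts its keys, so the file order is fixed; in a real job the order in which values reach the reducer is not guaranteed, which is why the listing above shows c.txt first.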

Writing the code

1. First MapReduce job:

FirstIndexMapper:

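import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FirstIndexMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

	// Reused across calls to avoid allocating a new object per record
	private final Text k = new Text();
	private final IntWritable one = new IntWritable(1);
	
	@Override
	protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
		// 1. Get the name of the file this split belongs to
		FileSplit inputSplit = (FileSplit) context.getInputSplit();
		String name = inputSplit.getPath().getName();
		// 2. Read one line
		String line = value.toString();
		// 3. Split the line into words
		String[] words = line.split(" ");
		// 4. Tie each word to the file name and emit it with a count of 1
		for (String word : words) {
			if (word.isEmpty()) {
				continue; // skip empty tokens caused by repeated spaces
			}
			k.set(word + "--" + name);
			context.write(k, one);
		}
	}
}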

FirstIndexReducer:

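import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FirstIndexReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
	
	@Override
	protected void reduce(Text key, Iterable<IntWritable> values,
			Context context) throws IOException, InterruptedException {
		int count = 0;
		// Sum the counts for this "word--file" key
		for (IntWritable value : values) {
			count += value.get();
		}
		// Emit the key with its total
		context.write(key, new IntWritable(count));
	}
}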

FirstIndexDriver:

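import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FirstIndexDriver {

	public static void main(String[] args) throws Exception {
		
		Configuration conf = new Configuration();

		Job job = Job.getInstance(conf);
		// Must reference this driver class so Hadoop can locate the jar
		job.setJarByClass(FirstIndexDriver.class);

		job.setMapperClass(FirstIndexMapper.class);
		job.setReducerClass(FirstIndexReducer.class);

		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);

		// args[0]: input directory, args[1]: output directory (must not exist yet)
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));

		// Report job success or failure through the exit code
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}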

At this point each output record looks like this, with a tab between the key and the count: pdc--a.txt	2
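
To make the intermediate format concrete, here is a small plain-Java sketch of how one such record is taken apart and turned into a posting in the second job. The class IntermediateRecordSketch and its methods splitRecord and toPosting are hypothetical names used only for illustration; the null return for malformed lines is an added safeguard, not part of the original code.

```java
class IntermediateRecordSketch {

    // Splits "word--file<TAB>count" into {word, "file<TAB>count"}.
    // Returns null when the "--" separator is missing or either side is empty,
    // so a caller can skip the line instead of failing on fields[1].
    static String[] splitRecord(String line) {
        String[] fields = line.split("--", 2);
        if (fields.length < 2 || fields[0].isEmpty() || fields[1].isEmpty()) {
            return null;
        }
        return fields;
    }

    // Turns "file<TAB>count" into the posting form "file-->count".
    static String toPosting(String fileAndCount) {
        return fileAndCount.replace("\t", "-->");
    }

    public static void main(String[] args) {
        String[] fields = splitRecord("pdc--a.txt\t2");
        // prints: pdc => a.txt-->2
        System.out.println(fields[0] + " => " + toPosting(fields[1]));
    }
}
```

One limitation of this format is that a word which itself contains "--" would be split in the wrong place; the sample data does not contain such words.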

2. Second MapReduce job:

SecondIndexMapper:

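import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SecondIndexMapper extends Mapper<LongWritable, Text, Text, Text> {

	Text k = new Text();
	Text v = new Text();
	
	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		// 1. Read one line of the first job's output
		String line = value.toString();
		// 2. Split on "--"
		String[] fields = line.split("--");
		if (fields.length < 2) {
			return; // skip lines that do not contain the separator
		}
		k.set(fields[0]); // e.g. pdc
		v.set(fields[1]); // e.g. a.txt<TAB>2
		// 3. Emit the word as key and "file<TAB>count" as value
		context.write(k, v);
	}
}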

SecondIndexReducer:

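import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SecondIndexReducer extends Reducer<Text, Text, Text, Text> {

	@Override
	protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
		// Input for one key, for example:
		// pdc a.txt<TAB>2
		// pdc b.txt<TAB>2
		// pdc c.txt<TAB>2
		// becomes:
		// pdc c.txt-->2 b.txt-->2 a.txt-->2
		StringBuilder sb = new StringBuilder();
		for (Text value : values) {
			if (sb.length() > 0) {
				sb.append("\t"); // separate postings without a trailing tab
			}
			sb.append(value.toString().replace("\t", "-->"));
		}
		context.write(key, new Text(sb.toString()));
	}
}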

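Finally, adapt the driver for the second job: register SecondIndexMapper and SecondIndexReducer, set both the map output value class and the final output value class to Text.class, and use the first job's output directory as the input path.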
