public static class RemoveRepeatMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final Gson gson = new Gson();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Parse the JSON line into a BBS post and key it by URL, so that
        // all copies of the same post meet at the same reducer.
        BBS bbs = gson.fromJson(value.toString(), BBS.class);
        context.write(new Text(bbs.getUrl()), value);
    }
}

public static class RemoveRepeatReducer extends Reducer<Text, Text, NullWritable, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Posts sharing a URL arrive together; emit only the first copy
        // and drop the duplicates.
        java.util.Iterator<Text> it = values.iterator();
        if (it.hasNext()) {
            context.write(NullWritable.get(), it.next());
        }
    }
}
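For completeness, a driver for wiring these two classes into a deduplication job might look like the following sketch. The job name, the enclosing RemoveRepeat class, and the input/output paths are illustrative assumptions, not taken from the source:

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "remove-repeat");      // job name assumed
job.setJarByClass(RemoveRepeat.class);                 // hypothetical enclosing class
job.setMapperClass(RemoveRepeatMapper.class);
job.setReducerClass(RemoveRepeatReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));  // filtered source files
FileOutputFormat.setOutputPath(job, new Path(args[1]));// deduplicated output
System.exit(job.waitForCompletion(true) ? 0 : 1);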
In these formulas, $n_{t,n}$ in Eq. (1) denotes the number of times term $t$ occurs in document $n$, and $\sum_{k} n_{k,n}$ denotes the total number of terms in document $n$; $|D|$ in Eq. (2) denotes the total number of documents in the corpus, and $|\{d \in D : t \in d\}|$ denotes the number of documents that contain term $t$; the result of Eq. (3) is the relevance of term $t$ to document $n$.
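Equations (1) through (3) themselves are not reproduced in this excerpt; assuming the standard TF-IDF definitions that the explanation above describes, they would read:

$$\mathrm{tf}_{t,n} = \frac{n_{t,n}}{\sum_{k} n_{k,n}} \qquad (1)$$
$$\mathrm{idf}_{t} = \log \frac{|D|}{|\{ d \in D : t \in d \}|} \qquad (2)$$
$$R_{t,n} = \mathrm{tf}_{t,n} \times \mathrm{idf}_{t} \qquad (3)$$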
The improvement made in this system is to take the region of the document in which a term appears (title or body) into account when computing Rank. The idea of the improved algorithm is as follows:
(1) Let R denote the relevance of term i to post j; the larger the value, the more relevant they are. R is initialized to 0.
(2) Count the occurrences of term i in the title of post j; each occurrence adds 5 to R.
(3) Count the occurrences of term i in the body of post j; each occurrence adds 1 to R.
(4) Over all posts, count the number of posts that contain term i, denoted num.
(5) Finally, set R to R/num (see the sketch after this list).
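A minimal Java sketch of this calculation, assuming non-overlapping, non-empty terms; the method and helper names are illustrative, not from the source:

public static float computeRank(String term, String title, String body,
                                long numPostsContainingTerm) {
    float r = 0f;
    r += 5f * countOccurrences(title, term); // step (2): each title hit adds 5
    r += 1f * countOccurrences(body, term);  // step (3): each body hit adds 1
    // step (5): normalize by the number of posts containing the term
    return numPostsContainingTerm > 0 ? r / numPostsContainingTerm : 0f;
}

// Hypothetical helper: counts non-overlapping occurrences of term in text.
private static int countOccurrences(String text, String term) {
    int count = 0;
    for (int idx = text.indexOf(term); idx >= 0;
         idx = text.indexOf(term, idx + term.length())) {
        count++;
    }
    return count;
}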
Compared with computing Rank, recording Position is much simpler; a brief description follows.
When the IKAnalyzer segmenter emits a term, it also outputs the start and end positions at which the term occurs in the post (as offsets from the beginning of the post). Position information can therefore be represented by just two integers; this system represents it as (start, end).
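As a sketch of how these offsets are obtained, assuming the IK Analyzer 2012 core API (IKSegmenter and Lexeme; class names may differ in other versions):

import java.io.IOException;
import java.io.StringReader;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class PositionDemo {
    public static void main(String[] args) throws IOException {
        String post = "...";  // a post title or body
        IKSegmenter seg = new IKSegmenter(new StringReader(post), true); // smart mode
        // Each Lexeme carries the term text plus its start/end offsets
        // relative to the beginning of the input, i.e. (start, end).
        for (Lexeme lex = seg.next(); lex != null; lex = seg.next()) {
            System.out.println(lex.getLexemeText()
                    + " (" + lex.getBeginPosition() + "," + lex.getEndPosition() + ")");
        }
    }
}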
Because Rank and Position must be passed between the Map and Reduce phases of the MapReduce computation, this system wraps them in a class, RecordWritable, which implements the WritableComparable interface from Hadoop's IO package so that the Rank and Position information can be serialized for transfer. The core implementation of RecordWritable is as follows:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class RecordWritable implements WritableComparable<RecordWritable> {
    private LongWritable DID = new LongWritable();    // document (post) ID
    private FloatWritable rank = new FloatWritable(); // relevance of the term to the post
    private Text positions = new Text();              // (start,end) offsets of the term

    public RecordWritable() {
    }

    public RecordWritable(LongWritable DID, FloatWritable rank, Text positions) {
        set(DID, rank, positions);
    }

    public void set(LongWritable DID, FloatWritable rank, Text positions) {
        this.DID.set(DID.get());
        this.rank.set(rank.get());
        this.positions.set(positions);
    }

    public void set(long DID, float rank, String positions) {
        this.DID.set(DID);
        this.rank.set(rank);
        this.positions.set(positions);
    }

    public void set(long DID, RankPosition rankPosition) {
        this.DID.set(DID);
        this.rank.set(rankPosition.getRank());
        this.positions.set(rankPosition.getPositions());
    }

    public void setDID(long DID) {
        this.DID.set(DID);
    }

    public void setRank(float rank) {
        this.rank.set(rank);
    }

    public void setPositions(Text positions) {
        this.positions.set(positions);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialize the three fields in the same order they were written.
        this.DID.readFields(in);
        this.rank.readFields(in);
        this.positions.readFields(in);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize the three fields for transfer between Map and Reduce.
        this.DID.write(out);
        this.rank.write(out);
        this.positions.write(out);
    }

    public LongWritable getDID() {
        return DID;
    }

    public FloatWritable getRank() {
        return rank;
    }

    public Text getPositions() {
        return positions;
    }

    @Override
    public boolean equals(Object obj) {
        if (obj instanceof RecordWritable) {
            RecordWritable tmp = (RecordWritable) obj;
            return this.DID.equals(tmp.DID) && this.rank.equals(tmp.rank)
                    && this.positions.equals(tmp.positions);
        }
        return false;
    }

    @Override
    public int hashCode() {
        // Keep hashCode consistent with equals, as required when instances
        // are partitioned or used as keys.
        return 31 * (31 * DID.hashCode() + rank.hashCode()) + positions.hashCode();
    }

    @Override
    public String toString() {
        return this.DID.toString() + ":" + this.rank.toString() + ":"
                + this.positions.toString();
    }

    @Override
    public int compareTo(RecordWritable tmp) {
        // Order records by document ID.
        return this.DID.compareTo(tmp.DID);
    }
}
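As a sketch of how such a custom type would typically be registered with the indexing job (the job name here is an assumption for illustration, not from the source):

Job job = Job.getInstance(conf, "build-inverted-index"); // job name assumed
job.setMapOutputKeyClass(Text.class);              // index term
job.setMapOutputValueClass(RecordWritable.class);  // DID, rank, and positions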
This article has presented a method for processing BBS forum posts, covering the preprocessing and filtering of source files and the construction of the inverted index. It explained in detail how an improved TF-IDF algorithm computes the relevance (Rank) of each index term and how each term's position (Position) is recorded, so that later retrieval can produce relevance-ranked results and summaries.