Lucene 3.x ships with a reworked API: the methods deprecated in the transitional 2.9 release have been removed entirely in 3.0. Fortunately there wasn't much for me to change; the work was mostly updating the TokenStream-related code, and the TokenStream overhaul seems to be the biggest change in 3.0.
A new TokenStream API has been introduced with Lucene 2.9. This API has moved from being Token-based to Attribute-based. While Token still exists in 2.9 as a convenience class, the preferred way to store the information of a Token is to use AttributeImpls.
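To make the shift concrete, here is a small self-contained sketch of the attribute-based pattern (toy class names, no Lucene dependency): the consumer fetches a shared, mutable attribute object once up front, and each incrementToken() call overwrites that object's state in place instead of handing back a freshly allocated Token.

```java
import java.util.Arrays;
import java.util.Iterator;

// Plays the role of TermAttribute: one mutable instance, reused for every token.
class TermAttr {
    private String term;
    void setTermBuffer(String s) { term = s; }
    String term() { return term; }
}

// Plays the role of a Tokenizer: splits its input on whitespace.
class ToyTokenizer {
    private final Iterator<String> words;
    final TermAttr termAtt = new TermAttr();   // the shared attribute

    ToyTokenizer(String text) {
        words = Arrays.asList(text.split("\\s+")).iterator();
    }

    // Attribute-based contract: advance and mutate the attribute in place;
    // return false when the stream is exhausted.
    boolean incrementToken() {
        if (!words.hasNext()) return false;
        termAtt.setTermBuffer(words.next());
        return true;
    }
}

public class AttributePatternDemo {
    public static void main(String[] args) {
        ToyTokenizer ts = new ToyTokenizer("new token stream api");
        TermAttr termAtt = ts.termAtt;      // fetch the shared attribute once
        StringBuilder out = new StringBuilder();
        while (ts.incrementToken()) {       // state is overwritten each call
            out.append(termAtt.term()).append('|');
        }
        System.out.println(out);            // prints: new|token|stream|api|
    }
}
```

The point of the pattern is allocation: one attribute object lives for the whole stream, which is why clearAttributes() must be called at the top of each incrementToken() in real Lucene code.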
Chinese word segmentation here is handled by mmseg4j 1.8.2, which targets Lucene 2.x, so mmseg4j is the first thing to fix.

All of the Lucene-facing code lives in the com.chenlb.mmseg4j.analysis package, and there isn't much to change. The main task is replacing MMSegTokenizer's next() with boolean incrementToken():
// class MMSegTokenizer, ported to the Lucene 3.0 attribute-based API
public MMSegTokenizer(Seg seg, Reader input) {
    super(input);
    mmSeg = new MMSeg(input, seg);
    // register the attributes this tokenizer will populate
    offsetAtt = addAttribute(OffsetAttribute.class);
    termAtt = addAttribute(TermAttribute.class);
}

@Override
public boolean incrementToken() throws IOException {
    clearAttributes();
    Word word = mmSeg.next();
    if (word != null) {
        // copy the segmented word into the shared attributes
        termAtt.setTermBuffer(word.getString());
        offsetAtt.setOffset(word.getStartOffset(), word.getEndOffset());
        return true;
    } else {
        return false;
    }
}
The old Token-based API was perhaps easier to grasp, but the Attribute-based one is more concise. Anywhere you previously called next(), you now call incrementToken() instead, for example in a method like this:
static void printTokenStream(TokenStream ts) throws IOException {
    // getAttribute is generified in Lucene 3.0, so no cast is needed
    TermAttribute termAtt = ts.getAttribute(TermAttribute.class);
    while (ts.incrementToken()) {
        System.out.println(termAtt.term());
    }
}
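To exercise the ported tokenizer end to end, something like the following should work. This is a sketch only: it assumes Lucene 3.0 and mmseg4j 1.8.2 on the classpath, and uses mmseg4j's ComplexSeg and Dictionary.getInstance(); adjust both to your own setup.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

import com.chenlb.mmseg4j.ComplexSeg;
import com.chenlb.mmseg4j.Dictionary;
import com.chenlb.mmseg4j.analysis.MMSegTokenizer;

public class MMSegDemo {
    public static void main(String[] args) throws IOException {
        // load mmseg4j's default dictionary
        Dictionary dic = Dictionary.getInstance();
        TokenStream ts = new MMSegTokenizer(
                new ComplexSeg(dic), new StringReader("研究生命起源"));
        printTokenStream(ts);
    }

    static void printTokenStream(TokenStream ts) throws IOException {
        TermAttribute termAtt = ts.getAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
            System.out.println(termAtt.term());
        }
    }
}
```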