Lucene 3.0 Tokenization Example (Reposted)

License: CC 4.0 BY-SA

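This article uses an example to walk through how Lucene's standard tokenizer works, including how the different attribute types expose features of the text, and shows how these attributes are used in scenarios such as phrase search.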

Source: http://hxraid.iteye.com/blog/634577

First, let's use the code below to print what the standard tokenizer produces (it also runs under 2.9).

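import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.Version;

class StandardTest {
    public static void main(String[] args) throws IOException {
        // Input stream
        StringReader s = new StringReader("I'm a student. these are apples");
        // Standard tokenizer
        TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_CURRENT, s);
        // Standard filter
        tokenStream = new StandardFilter(tokenStream);
        // Lowercase filter
        tokenStream = new LowerCaseFilter(tokenStream);

        // The casts are redundant in 3.0 but keep the code compilable under 2.9
        TermAttribute termAtt = (TermAttribute) tokenStream.getAttribute(TermAttribute.class);
        TypeAttribute typeAtt = (TypeAttribute) tokenStream.getAttribute(TypeAttribute.class);
        OffsetAttribute offsetAtt = (OffsetAttribute) tokenStream.getAttribute(OffsetAttribute.class);
        PositionIncrementAttribute posAtt = (PositionIncrementAttribute) tokenStream.getAttribute(PositionIncrementAttribute.class);

        System.out.println("termAtt       typeAtt       offsetAtt       posAtt");
        // Each call advances to the next token and refreshes the attribute objects
        while (tokenStream.incrementToken()) {
            System.out.println(termAtt.term() + " " + typeAtt.type() + " (" + offsetAtt.startOffset() + "," + offsetAtt.endOffset() + ")   " + posAtt.getPositionIncrement());
        }
        tokenStream.close();
    }
}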

Output:

termAtt	typeAtt	offsetAtt	posAtt
i'm	<APOSTROPHE>	(0,3)	1
a	<ALPHANUM>	(4,5)	1
student	<ALPHANUM>	(6,13)	1
these	<ALPHANUM>	(15,20)	1
are	<ALPHANUM>	(21,24)	1
apples	<ALPHANUM>	(25,31)	1

When we discussed StandardTokenizer earlier, we already covered these four token attributes. Here we review these Lucene basics once more.

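As of Lucene 3.0, the tokens in a TokenStream are no longer returned by the next() method; the incrementToken() method is used instead (see the loop above). The attribute-based API was already available in 2.9, which is why the example also runs there. Each call to incrementToken() advances to the next token and exposes four kinds of attribute information about it (from the org.apache.lucene.analysis.tokenattributes package):

Using the example above:

Original text: I'm a student. these are apples

TokenStream: [1: I'm] [2: a] [3: student] [4: these] [5: are] [6: apples]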

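(1) TermAttribute: the token's string content, for example "I'm".

(2) TypeAttribute: the token's category (as discussed earlier). For example, I'm belongs to <APOSTROPHE>, the type for tokens containing an apostrophe.

(3) OffsetAttribute: where the token starts and ends in the original text. The start offset is the index of the token's first character and the end offset is one past its last character. For example, the offsets of I'm are (0,3).

(4) PositionIncrementAttribute: this one is a little special. It gives the distance, counted in word positions in the original text, between the current token in the tokenStream and the token before it.

For example, in the tokenStream the token before [2: a] is [1: I'm]. They are one position apart in the original text, so the PositionIncrementAttribute of token "a" is 1. If a token is the first word of the original text, the value defaults to 1. That is why every PositionIncrementAttribute in the example above is 1.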
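The offset convention in (3) can be checked without Lucene at all. The sketch below is plain Java, and the class name OffsetDemo and the method slice are made up for illustration. It only shows that an (startOffset, endOffset) pair from the output table selects exactly the token's characters when the end is treated as exclusive.

```java
// Plain-Java check of the offset convention; no Lucene classes are used.
class OffsetDemo {
    // Returns the part of text covered by a (startOffset, endOffset) pair.
    // String.substring is end-exclusive too, so the offsets pass straight through.
    static String slice(String text, int startOffset, int endOffset) {
        return text.substring(startOffset, endOffset);
    }

    public static void main(String[] args) {
        String text = "I'm a student. these are apples";
        System.out.println(slice(text, 0, 3));   // prints I'm
        System.out.println(slice(text, 6, 13));  // prints student
        System.out.println(slice(text, 25, 31)); // prints apples
    }
}
```

Note that the slices come from the original text, so the first one is "I'm" with a capital I even though the lowercase filter changed the term itself to "i'm".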

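If we then filter with a stop-word list, the TokenStream becomes: [1: I'm] [2: student] [3: apples]. Now the PositionIncrementAttribute of student is no longer 1 but 2, its distance from [1: I'm] in the original text, because the stop word "a" between them was removed. For apples the value becomes 3, because "these" and "are" were removed between student and apples. (This assumes the stop filter is configured to preserve position increments.)

So what is this attribute good for? Quite a lot. Suppose we want to search for the phrase "student apples". Clearly the user wants documents in which student and apples appear right next to each other. Say we find a document (such as the example string above) that contains both student and apples. Because the PositionIncrementAttribute of apples is 3 rather than 1, the two words cannot be adjacent in the original text, so the document does not match the phrase. This is how position increments make phrase search straightforward.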
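The bookkeeping behind this can be sketched in plain Java. This is a simulation under stated assumptions, not Lucene code. The class PositionIncrementDemo, its methods, and the three-word stop list are hypothetical, and positions are numbered from 1 to match the [1: I'm] notation used above. Each surviving token gets an increment of one plus the number of stop words skipped since the previous surviving token, and a phrase check then compares absolute positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java simulation of position increments; no Lucene classes are used.
class PositionIncrementDemo {
    // Hypothetical stop list containing only the words removed in the example.
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "these", "are"));

    // Returns the position increment of every token that is not a stop word:
    // 1 plus the number of stop words skipped since the previous kept token.
    static List<Integer> increments(List<String> tokens, Set<String> stopWords) {
        List<Integer> result = new ArrayList<Integer>();
        int skipped = 0;
        for (String token : tokens) {
            if (stopWords.contains(token)) {
                skipped++;
                continue;
            }
            result.add(skipped + 1);
            skipped = 0;
        }
        return result;
    }

    // Turns increments into absolute 1-based positions with a running sum.
    static List<Integer> positions(List<Integer> increments) {
        List<Integer> result = new ArrayList<Integer>();
        int position = 0;
        for (int increment : increments) {
            position += increment;
            result.add(position);
        }
        return result;
    }

    // Two tokens form an exact phrase only if the second directly follows the first.
    static boolean adjacent(int firstPosition, int secondPosition) {
        return secondPosition - firstPosition == 1;
    }

    public static void main(String[] args) {
        List<String> tokens =
                Arrays.asList("i'm", "a", "student", "these", "are", "apples");
        List<Integer> inc = increments(tokens, STOP_WORDS);
        List<Integer> pos = positions(inc);
        System.out.println(inc); // prints [1, 2, 3]
        System.out.println(pos); // prints [1, 3, 6]
        // student is at position 3 and apples at 6, so "student apples" is no phrase match
        System.out.println(adjacent(pos.get(1), pos.get(2))); // prints false
    }
}
```

Keeping the gaps in the increments is what lets the check reject "student apples" here: had the filter renumbered the surviving tokens as 1, 2, 3, the two words would wrongly look adjacent.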

There are in fact two more attributes: PayloadAttribute and FlagsAttribute.