Lucene in Action, Chapter 2 (2) (深…

Latest recommended article published 2017-02-12 15:06:08

License: CC 4.0 BY-SA

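This post covers the Field concept in Lucene and how to use it: how a Field is processed, the vector space model, multiple fields with the same name, boosting documents and fields, and what norms do and how to control them.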

Field in Detail

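A Document is the basic unit of indexing and searching, and a Field is the basic unit of stored data. A Field has a name and a value, plus a number of options that control how it behaves.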

I. The three Field processing options described in the previous post

new Field("city", "Den Haag", Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS);

1. Whether the field is indexed

This is expressed with Field.Index.ANALYZED, Field.Index.NO, and so on.

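ANALYZED
The value is run through the analyzer, and the resulting tokens are indexed.

ANALYZED_NO_NORMS
A variant of ANALYZED. ANALYZED stores norms, which carry information such as the index-time boost; ANALYZED_NO_NORMS does not, which saves memory at search time.

NO
The field is not indexed, so it cannot be searched.

NOT_ANALYZED
The value is not run through the analyzer; it is indexed as a single token. This is commonly used for exact matching, for example on file names and IDs.

NOT_ANALYZED_NO_NORMS
The corresponding variant of NOT_ANALYZED: the value is indexed as a single token and no norms are stored.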
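The difference between ANALYZED and NOT_ANALYZED can be sketched in plain Java without Lucene. This is only a toy model: the class IndexModeSketch and its methods are hypothetical, and the "analyzer" here simply lowercases the value and splits it on non-alphanumeric characters, which is an assumption and not what any particular Lucene analyzer does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical toy model, not Lucene code.
class IndexModeSketch {
    // ANALYZED-like behaviour: lowercase, then split on anything
    // that is not a letter or a digit. Each piece becomes a term.
    static List<String> analyzedTerms(String value) {
        List<String> terms = new ArrayList<String>();
        String lower = value.toLowerCase(Locale.ROOT);
        for (String token : lower.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // NOT_ANALYZED-like behaviour: the whole value is one term, unchanged.
    static List<String> notAnalyzedTerms(String value) {
        List<String> terms = new ArrayList<String>();
        terms.add(value);
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyzedTerms("Den Haag"));
        System.out.println(notAnalyzedTerms("report-2010.pdf"));
    }
}
```

With the analyzed variant, a search for "haag" can match the city field; with the single-token variant, only the exact string "report-2010.pdf" matches, which is what you want for IDs and file names.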

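2. Whether the index stores term vectors

A term is a token produced by the analyzer. When term vectors are enabled for a field, each document stores a term vector for that field: the unique terms the document contains (a term that occurs several times is stored only once), together with, optionally, the position and character offsets of each occurrence in the field. This information can be used later, for example to highlight a matched term.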

3. Whether the field's value is stored in the index

This is expressed with Field.Store.YES or Field.Store.NO.

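II. The vector space model and term vectors

A term is a token produced by the analyzer. A term vector stores the unique terms a document contains (a term that occurs several times is stored only once), together with the position and offsets of each occurrence in the field. This information can be used later, for example to highlight a matched term.

This is the same idea as in Chapter 2 of Programming Collective Intelligence.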

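We can call f.setOmitTermFreqAndPositions(true) ("omit" means to leave out or skip) to tell IndexWriter not to store term counts and positions for the field, which reduces disk usage.

The fifth parameter of the Field constructor controls whether a term vector is stored.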

Field(String name, String value, Field.Store store, Field.Index index, Field.TermVector termVector)

TermVector.YES

Records the unique terms that occurred, and their counts, in each document, but doesn't store any positions or offsets information

TermVector.WITH_POSITIONS

Records the unique terms and their counts, and also the positions of each occurrence of every term, but no offsets

TermVector.WITH_OFFSETS

Records the unique terms and their counts, with the offsets (start and end character position) of each occurrence of every term, but no positions

TermVector.WITH_POSITIONS_OFFSETS

Stores unique terms and their counts, along with positions and offsets

TermVector.NO

Doesn't store any term vector information

For example:

Field f = new Field("city", "Den Haag", Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS);
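To make the term vector options more concrete, here is a plain-Java sketch of the data each one records: counts (YES), token positions (WITH_POSITIONS), and start/end character offsets (WITH_OFFSETS). It does not use Lucene; the class TermVectorSketch, its Entry type, and the simple lowercase-and-split tokenization are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy model, not Lucene code.
class TermVectorSketch {
    // Everything a term vector could hold for one unique term of one field.
    static final class Entry {
        int freq;                                                  // occurrence count
        final List<Integer> positions = new ArrayList<Integer>();  // token index of each occurrence
        final List<int[]> offsets = new ArrayList<int[]>();        // {start, end} character offsets
    }

    static Map<String, Entry> build(String text) {
        Map<String, Entry> vector = new LinkedHashMap<String, Entry>();
        Matcher m = Pattern.compile("[\\p{L}\\p{N}]+").matcher(text);
        int position = 0;
        while (m.find()) {
            String term = m.group().toLowerCase(Locale.ROOT);
            Entry entry = vector.get(term);
            if (entry == null) {            // each unique term is stored only once
                entry = new Entry();
                vector.put(term, entry);
            }
            entry.freq++;
            entry.positions.add(position++);
            entry.offsets.add(new int[] {m.start(), m.end()});
        }
        return vector;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Entry> e : build("Den Haag is Den Haag").entrySet()) {
            System.out.println(e.getKey() + " freq=" + e.getValue().freq
                    + " positions=" + e.getValue().positions);
        }
    }
}
```

For the text "Den Haag is Den Haag", the term "den" is stored once with a count of 2, positions 0 and 3, and offsets 0 to 3 and 12 to 15. The offsets are what a highlighter needs to mark the matching characters in the original text.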

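III. Whether to store the field's value

Field.Store.YES and Field.Store.NO tell IndexWriter whether to store the Field's value. A very long article body, for example, does not need to be stored, while a short field such as the title can be, which saves space. If a large value does need to be stored, it can be compressed first with a Lucene utility class, org.apache.lucene.document.CompressionTools. Note that compression saves disk space at the cost of extra CPU work.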

IV. Some commonly used combinations of Field.Index, Field.Store, and TermVector

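V. Handling multiple fields with the same name

For example, a book can have several authors. Lucene allows a document to contain several fields with the same name.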

Document doc = new Document();
for (String author : authors) {
    // each add() appends another "author" field to the same document
    doc.add(new Field("author", author, Field.Store.YES, Field.Index.ANALYZED));
}

VI. Boosting documents and fields

Boosting means giving each document or field a weight factor that reflects its relative importance.

Boosting can be done at index time or at search time.

By default no document is boosted; in other words, every boost factor is 1.0.

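1. Boosting a field

The Field class has a setBoost method for this. Keep in mind that to change the boost of a document that is already indexed, you have to delete the document and add a new one, or update the document. The basic unit of update in Lucene is the document, and an update is a delete followed by an add.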
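The "update is delete plus add" rule can be illustrated with a small in-memory model. This is a sketch only: UpdateSketch and Doc are hypothetical names, the list stands in for the index, and nothing here is Lucene API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical toy model, not Lucene code.
class UpdateSketch {
    static final class Doc {
        final String id;
        final String title;
        final float titleBoost;   // index-time boost, fixed once the document is added

        Doc(String id, String title, float titleBoost) {
            this.id = id;
            this.title = title;
            this.titleBoost = titleBoost;
        }
    }

    private final List<Doc> docs = new ArrayList<Doc>();

    void add(Doc doc) {
        docs.add(doc);
    }

    // Removes every document with the given id and returns how many were removed.
    int delete(String id) {
        int removed = 0;
        for (Iterator<Doc> it = docs.iterator(); it.hasNext();) {
            if (it.next().id.equals(id)) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    // There is no in-place edit: an update replaces the whole document.
    void update(String id, Doc replacement) {
        delete(id);
        add(replacement);
    }

    int size() {
        return docs.size();
    }

    Doc get(String id) {
        for (Doc doc : docs) {
            if (doc.id.equals(id)) {
                return doc;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        UpdateSketch index = new UpdateSketch();
        index.add(new Doc("1", "Lucene in Action", 1.0f));
        index.update("1", new Doc("1", "Lucene in Action", 2.0f));
        System.out.println(index.size() + " document(s), boost " + index.get("1").titleBoost);
    }
}
```

The point of the model is that changing a boost means rebuilding the whole document, including every field that did not change.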

2. Boosting a document

In effect, this applies the document's boost value to every field of the document.
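As a small worked example of how the two kinds of boost interact: to the best of my knowledge, Lucene 3.x multiplies the document's boost into each field's own boost at index time. The sketch below shows only that arithmetic; BoostSketch and effectiveFieldBoost are hypothetical names, not Lucene API.

```java
// Hypothetical toy model, not Lucene code.
class BoostSketch {
    // Assumption: the effective index-time boost of a field is the
    // document boost multiplied by the field's own boost.
    static float effectiveFieldBoost(float documentBoost, float fieldBoost) {
        return documentBoost * fieldBoost;
    }

    public static void main(String[] args) {
        // Unboosted document and field: the factor stays at the default 1.0.
        System.out.println(effectiveFieldBoost(1.0f, 1.0f));
        // Document boosted by 1.5 with a title field boosted by 1.2.
        System.out.println(effectiveFieldBoost(1.5f, 1.2f));
    }
}
```

Under this assumption, a document boost of 1.5 with no field boosts behaves like setting a boost of 1.5 on every field.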

VII. Norms

Options such as ANALYZED_NO_NORMS control whether norms are stored. In everyday usage, "norm" means a standard or benchmark.

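Where norms come from

A field has a boost value, which is a float, and so does each document. At index time these boosts are combined (together with a length normalization factor) and encoded into a single byte, stored per field per document. At search time the norms are loaded into memory and decoded back into floats. Because one byte cannot hold full float precision, the decoded value is only an approximation of the original.

Since norms are loaded into memory, they can use a lot of memory at search time.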
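The float-to-byte step is easier to see in code. The following sketch is modeled on the 3-bit-mantissa, 5-bit-exponent encoding that, as far as I know, Lucene 3.x uses for norms; treat the exact bit layout as an assumption. NormByteSketch, encode, and decode are hypothetical names.

```java
// Hypothetical sketch, not Lucene code.
class NormByteSketch {
    // Shifted bit pattern of the smallest exponent the byte can represent.
    private static final int ZERO_POINT = (63 - 15) << 3;

    // Encode a float into one byte: 5 exponent bits and 3 mantissa bits.
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);   // drop all but the top 3 mantissa bits
        if (small <= ZERO_POINT) {
            // zero or negative maps to 0; tiny positives map to the smallest code
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ZERO_POINT + 0x100) {
            return (byte) -1;           // too large: clamp to the biggest code
        }
        return (byte) (small - ZERO_POINT);
    }

    // Decode the byte back into a float; the dropped mantissa bits are gone.
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float[] samples = {1.0f, 1.1f, 1.25f, 1.5f, 2.0f};
        for (float f : samples) {
            System.out.println(f + " -> " + decode(encode(f)));
        }
    }
}
```

With this scheme 1.0, 1.25, 1.5, and 2.0 survive the round trip, but 1.1 comes back as 1.0, so small differences in boost are lost. On the memory side, one byte per field per document means that, for example, 10 million documents with norms on 10 fields need about 100 MB.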

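Things to watch out for with norms

Once you decide to turn norms off for a field, you must rebuild the entire index. During a merge, if even one document in the index has norms for that field, the merged index will carry norms for it as well.

You can change a norm with IndexReader's setNorm method, but it is better not to use it. According to its Javadoc, write support will be removed in Lucene 4.0 and there will be no replacement, as shown here: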

void setNorm(int doc, String field, byte value)
Deprecated. Write support will be removed in Lucene 4.0. There will be no replacement for this method.