Lucene tidbits

This post digs into the compression techniques used in Lucene's index, including keyword (term) compression, number compression, and field-data file compression. It also looks at how the indexing process can be tuned, with tips for speeding up indexing such as using a single writer and turning off the compound file format.


What is the relationship between a field and a term?

[Terms are what you get by tokenizing a field.]

 

===============

 

To keep index files small, Lucene also applies compression. First, the keywords in the term dictionary are compressed:

each term is stored as <prefix length, suffix> relative to the previous term. For example, if the current term is "阿拉伯语" and the previous term is "阿拉伯", then "阿拉伯语" is compressed to <3,语>.

Second, numbers are compressed heavily: only the delta from the previous value is stored (a smaller number takes fewer bytes to encode).

For example, if the current document number is 16389 (3 bytes when stored uncompressed) and the previous document number is 16382, then the value stored after compression is 7 (just one byte).
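
To make the two tricks concrete, here is a minimal, self-contained Java sketch — illustrative only, not Lucene's actual code — of shared-prefix term compression and delta + variable-length integer coding:

    import java.io.ByteArrayOutputStream;

    public class CompressionSketch {
        // Encode a term as <prefix length, suffix> relative to the previous term.
        static String prefixEncode(String prev, String current) {
            int p = 0, max = Math.min(prev.length(), current.length());
            while (p < max && prev.charAt(p) == current.charAt(p)) p++;
            return "<" + p + "," + current.substring(p) + ">";
        }

        // Write a non-negative int as 7-bit groups, high bit = "more bytes follow"
        // (the same idea as Lucene's VInt encoding).
        static void writeVInt(ByteArrayOutputStream out, int value) {
            while ((value & ~0x7F) != 0) {
                out.write((value & 0x7F) | 0x80);
                value >>>= 7;
            }
            out.write(value);
        }

        public static void main(String[] args) {
            System.out.println(prefixEncode("阿拉伯", "阿拉伯语"));   // prints <3,语>
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            writeVInt(out, 16389 - 16382);                            // delta = 7
            System.out.println("delta stored in " + out.size() + " byte(s)"); // 1
        }
    }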

 

 

 

The field data file is compressed with zlib.

Are zlib/ZIP actually worse than PPMd?
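
In Lucene 3.0.x the old Field.Store.COMPRESS option is gone; as far as I can tell, the equivalent is compressing stored values yourself via CompressionTools, which wraps java.util.zip (zlib/DEFLATE). A hedged sketch:

    import org.apache.lucene.document.CompressionTools;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class CompressedStoredField {
        public static void main(String[] args) throws Exception {
            String longText = "some large stored value";   // placeholder data
            byte[] compressed = CompressionTools.compressString(longText); // zlib
            Document doc = new Document();
            doc.add(new Field("body", compressed, Field.Store.YES)); // binary field
            // at read time:
            String restored = CompressionTools.decompressString(compressed);
            System.out.println(restored.equals(longText));  // true
        }
    }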

 

++++++++++++++++++

 

Lucene indexes may be composed of multiple sub-indexes, or segments.

 

Each segment has a separate field info (.fnm) file.

 

A token is a single occurrence of a term: it carries the term text, the corresponding start and end offsets, and a type string.
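
A small sketch against the 3.0.x analysis API that prints each token's text, offsets, and type (the field name "f" and the sample text are arbitrary):

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
    import org.apache.lucene.util.Version;

    public class TokenDump {
        public static void main(String[] args) throws Exception {
            TokenStream ts = new StandardAnalyzer(Version.LUCENE_30)
                    .tokenStream("f", new StringReader("Lucene in Action"));
            TermAttribute term = ts.addAttribute(TermAttribute.class);
            OffsetAttribute off = ts.addAttribute(OffsetAttribute.class);
            TypeAttribute type = ts.addAttribute(TypeAttribute.class);
            while (ts.incrementToken()) {
                // one line per token: term text + [start,end) offsets + type string
                System.out.println(term.term() + " [" + off.startOffset() + ","
                        + off.endOffset() + ") " + type.type());
            }
            ts.close();
        }
    }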

 

A filter restricts a query to some subset of the index. It acts a bit like the WHERE clause in SQL, but with a difference: it is not part of the regular query;

it only preprocesses the data source and then hands the result to the query. Note that it performs preprocessing rather than filtering the query results,

so using a filter can be expensive: a single query may take up to a hundred times longer.
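
For example (a sketch against the 3.0.x API; the "category" field is made up), restricting a query to one slice of the index:

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryWrapperFilter;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    public class FilteredSearch {
        // Restrict 'query' to documents whose "category" field equals "news".
        static TopDocs searchNewsOnly(IndexSearcher searcher, Query query)
                throws IOException {
            Filter filter = new QueryWrapperFilter(
                    new TermQuery(new Term("category", "news")));
            return searcher.search(query, filter, 10);
        }
    }

Wrapping the filter in a CachingWrapperFilter amortizes that preprocessing cost across repeated queries, which is the usual answer to the hundred-fold slowdown above.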

 

+++++++++++++++++

 

While building the index, only one thread is doing the processing.

 

When this lock file is present, a writer is currently modifying the index (adding or removing documents).
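
A sketch of inspecting that lock from code, using the 3.0.x static helpers (the index path is a placeholder):

    import java.io.File;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class LockCheck {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(new File("/path/to/index"));
            if (IndexWriter.isLocked(dir)) {
                // Only safe if no writer is really alive, e.g. after a crash.
                IndexWriter.unlock(dir);
            }
        }
    }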

 

++++++++++++++++++

 

      while (left < right && array[left].fieldInfo.name.compareTo(partition.fieldInfo.name) <= 0)  // the "<=" here should be "<"

        ++left;

 

++++++++++ Merge +++++++++++

 

mergeFields always does a two-way merge; multi-way merging is not used.

 

During a merge, the I/O is carried out via IndexReader.

 

?When does a merge get triggered

MergeFactor

 

Classes involved and the file types they map to:

FieldInfos → .fnm

 

How is it handled when a single file is too large to load entirely into memory?

 

+++++++++++++++++++++++

Ways to speed up indexing:

 

>Open a single writer and re-use it for the duration of your indexing session.

>Turn off the compound file format (but you may run out of file descriptors).

>Re-use Document and Field instances (see the sketch after this list).

>Always add fields in the same order to your Document when using stored fields or term vectors.
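
A sketch of the reuse pattern from the two tips above (the [id, body] record layout is made up): one Document, fields added once in a fixed order, values swapped per record:

    import java.io.IOException;
    import java.util.List;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class ReuseSketch {
        // records: [id, body] pairs (hypothetical input format)
        static void indexAll(IndexWriter writer, List<String[]> records)
                throws IOException {
            Document doc = new Document();
            Field id = new Field("id", "", Field.Store.YES, Field.Index.NOT_ANALYZED);
            Field body = new Field("body", "", Field.Store.NO, Field.Index.ANALYZED);
            doc.add(id);    // always the same field order
            doc.add(body);
            for (String[] r : records) {
                id.setValue(r[0]);    // reuse the instances, just swap the values
                body.setValue(r[1]);
                writer.addDocument(doc);
            }
        }
    }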

 

>Use autoCommit=false when you open your IndexWriter   (this setting no longer exists in 3.0.1)

 

>Turn off any features you are not in fact using

 

>Use a faster analyzer.

 

>Speed up document construction.

 

>Don't optimize unless you really need to (for faster searching)

 

>Index into separate indices then merge.

 

>setMaxBufferedDocs(int maxBufferedDocs)

Controls how many documents are buffered in memory before a new segment is written; a larger value speeds up indexing. The default is -1 (disabled). From the javadoc: "Determines the minimal number of documents required before the buffered in-memory documents are flushed as a new segment. Large values generally give faster indexing. Disabled by default (writer flushes by RAM usage)."

 

>We can write the index into a RAMDirectory first and flush it to the FSDirectory in batches once it grows large enough, reducing the number of disk I/O operations.
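
A sketch of that staging pattern against the 3.0.x API (fsWriter is assumed to be the long-lived writer on the on-disk index):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class RamStaging {
        static void flushBatch(IndexWriter fsWriter /* long-lived on-disk writer */)
                throws Exception {
            Directory ramDir = new RAMDirectory();
            IndexWriter ramWriter = new IndexWriter(ramDir,
                    new StandardAnalyzer(Version.LUCENE_30), true,
                    IndexWriter.MaxFieldLength.UNLIMITED);
            // ... add one batch of documents to ramWriter ...
            ramWriter.close();
            // one bulk transfer to disk instead of many small writes
            fsWriter.addIndexesNoOptimize(new Directory[] { ramDir });
        }
    }

That said, once the writer flushes by RAM usage (setRAMBufferSizeMB), it already buffers in memory, so this staging trick buys less than it did in older releases.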

 

>setMergeFactor   (at around 100k docs, 100-200 is a good range)

 

>setRAMBufferSizeMB (default: 16 MB)

From the javadoc: "Generally for faster indexing performance it's best to flush by RAM usage instead of document count and use as large a RAM buffer as you can."
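
Pulling the last few knobs together, a hypothetical tuning method for a 3.0.x writer (the numbers are illustrative, not recommendations from this post):

    import org.apache.lucene.index.IndexWriter;

    public class Tuning {
        static void tuneForBulkIndexing(IndexWriter writer) {
            writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH); // flush by RAM only
            writer.setRAMBufferSizeMB(256.0);  // default is 16 MB
            writer.setMergeFactor(100);        // 100-200 was good around 100k docs
            writer.setUseCompoundFile(false);  // faster, but more file descriptors
        }
    }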

 

 

>Optimize date-range restrictions (when searches must be limited to a given time range).

 

 

>Quick tips:

Keep the index small: eliminate norms and term vectors when they are not needed, and set the Store flag on a field only if it is a must.

Obvious, but an oft-repeated mistake: create only one instance of Searcher and reuse it.

Keep the index on fast disks (in RAM, if you are paranoid).

 

>Use multiple threads with one IndexWriter
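
IndexWriter.addDocument is thread-safe, so the usual pattern is a worker pool feeding one shared writer; a sketch (the pool size is arbitrary):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    public class ConcurrentIndexing {
        static void indexAll(final IndexWriter writer, Iterable<Document> docs)
                throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(4);
            for (final Document doc : docs) {
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            writer.addDocument(doc);  // safe to call concurrently
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }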

 

+++++++++++++++++++++

 

 

dspeak@lm-vm01:~/luowl/soft/lucene-3.0.1/src/demo$ java -cp .:/home/dspeak/luowl/search-node/lib/*  org.apache.lucene.demo.IndexFiles ~/luowl/soft/data1.xml

 

 

+++++++++++++++++++

 

 

Every optimize rebuilds the index from scratch.

 

 

+++++++++++++++++++++

 

LEVEL_LOG_SPAN: what does "level" mean?

"Whenever extra segments (beyond the merge factor upper bound) are encountered, all segments within the level are merged."

[Segments are grouped by level.]

 

++++++++++++++

 

We keep a separate Posting hash and other state for each thread and then merge postings hashes from all threads when writing the segment.

 

++++++++++++++++++++++++

 

 

      /*

      This is the current indexing chain:

 

      DocConsumer / DocConsumerPerThread

        --> code: DocFieldProcessor / DocFieldProcessorPerThread

          --> DocFieldConsumer / DocFieldConsumerPerThread / DocFieldConsumerPerField

            --> code: DocFieldConsumers / DocFieldConsumersPerThread / DocFieldConsumersPerField

              --> code: DocInverter / DocInverterPerThread / DocInverterPerField

                --> InvertedDocConsumer / InvertedDocConsumerPerThread / InvertedDocConsumerPerField

                  --> code: TermsHash / TermsHashPerThread / TermsHashPerField

                    --> TermsHashConsumer / TermsHashConsumerPerThread / TermsHashConsumerPerField

                      --> code: FreqProxTermsWriter / FreqProxTermsWriterPerThread / FreqProxTermsWriterPerField

                      --> code: TermVectorsTermsWriter / TermVectorsTermsWriterPerThread / TermVectorsTermsWriterPerField

                --> InvertedDocEndConsumer / InvertedDocConsumerPerThread / InvertedDocConsumerPerField

                  --> code: NormsWriter / NormsWriterPerThread / NormsWriterPerField

              --> code: StoredFieldsWriter / StoredFieldsWriterPerThread / StoredFieldsWriterPerField

    */

 

 

++++++++++++++++++++++++++++++    

 

What is TermVector used for?

[Counting term frequencies, finding where terms occur in a document, and so on.]
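
A sketch of both ends in 3.0.x (the field name "body" is arbitrary): store a term vector at index time, then read the per-document frequencies back:

    import java.io.IOException;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;

    public class TermVectorDump {
        // Index time: ask Lucene to keep a term vector for this field.
        static Field bodyField(String text) {
            return new Field("body", text, Field.Store.NO, Field.Index.ANALYZED,
                    Field.TermVector.WITH_POSITIONS_OFFSETS);
        }

        // Read time: per-document term frequencies (cast to TermPositionVector
        // if the positions/offsets are also needed).
        static void dump(IndexReader reader, int docId) throws IOException {
            TermFreqVector tfv = reader.getTermFreqVector(docId, "body");
            if (tfv == null) return;  // no term vector stored for this doc/field
            String[] terms = tfv.getTerms();
            int[] freqs = tfv.getTermFrequencies();
            for (int i = 0; i < terms.length; i++) {
                System.out.println(terms[i] + " x" + freqs[i]);
            }
        }
    }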

 

++++++++++++++++++++

Based on reading the code and testing: even without an optimize, a flush also happens on close, and FormatPostingsDocsWriter.addDoc gets called.
