Lucene tidbits

This post digs into the compression techniques used in Lucene's index, including keyword (term) compression, number compression, and field-data file compression. It also looks at how the indexing process can be tuned, with tips for speeding up indexing such as using a single writer and turning off the compound file format.


What is the relationship between a field and a term?

[Terms are what you get by tokenizing a field.]

 

===============

 

To keep index files small, Lucene also applies compression. First, the keywords in the term dictionary are compressed:

each term is stored as <prefix length, suffix> relative to the previous term. For example, if the current term is "阿拉伯语" and the previous term is "阿拉伯", then "阿拉伯语" is compressed to <3,语>.

Second, numbers are compressed heavily: only the delta from the previous value is stored (a smaller number takes fewer bytes to encode).

For example, if the current document number is 16389 (3 bytes when stored uncompressed) and the previous document number is 16382, then the value stored after compression is 7 (just one byte).
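
To make the two tricks concrete, here is a minimal, self-contained Java sketch — illustrative only, not Lucene's actual code — of shared-prefix term compression and delta + variable-length integer coding:

    import java.io.ByteArrayOutputStream;

    public class CompressionSketch {
        // Encode a term as <prefix length, suffix> relative to the previous term.
        static String prefixEncode(String prev, String current) {
            int p = 0, max = Math.min(prev.length(), current.length());
            while (p < max && prev.charAt(p) == current.charAt(p)) p++;
            return "<" + p + "," + current.substring(p) + ">";
        }

        // Write a non-negative int as 7-bit groups, high bit = "more bytes follow"
        // (the same idea as Lucene's VInt encoding).
        static void writeVInt(ByteArrayOutputStream out, int value) {
            while ((value & ~0x7F) != 0) {
                out.write((value & 0x7F) | 0x80);
                value >>>= 7;
            }
            out.write(value);
        }

        public static void main(String[] args) {
            System.out.println(prefixEncode("阿拉伯", "阿拉伯语"));   // prints <3,语>
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            writeVInt(out, 16389 - 16382);                            // delta = 7
            System.out.println("delta stored in " + out.size() + " byte(s)"); // 1
        }
    }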

 

 

 

The field data file is compressed with zlib.

Are zlib/ZIP actually worse than PPMd?
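
In Lucene 3.0.x the old Field.Store.COMPRESS option is gone; as far as I can tell, the equivalent is compressing stored values yourself via CompressionTools, which wraps java.util.zip (zlib/DEFLATE). A hedged sketch:

    import org.apache.lucene.document.CompressionTools;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class CompressedStoredField {
        public static void main(String[] args) throws Exception {
            String longText = "some large stored value";   // placeholder data
            byte[] compressed = CompressionTools.compressString(longText); // zlib
            Document doc = new Document();
            doc.add(new Field("body", compressed, Field.Store.YES)); // binary field
            // at read time:
            String restored = CompressionTools.decompressString(compressed);
            System.out.println(restored.equals(longText));  // true
        }
    }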

 

++++++++++++++++++

 

Lucene indexes may be composed of multiple sub-indexes, or segments.

 

Each segment has a separate field info (.fnm) file.

 

A token is a single occurrence of a term: it carries the term text, the corresponding start and end offsets, and a type string.
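
A small sketch against the 3.0.x analysis API that prints each token's text, offsets, and type (the field name "f" and the sample text are arbitrary):

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
    import org.apache.lucene.util.Version;

    public class TokenDump {
        public static void main(String[] args) throws Exception {
            TokenStream ts = new StandardAnalyzer(Version.LUCENE_30)
                    .tokenStream("f", new StringReader("Lucene in Action"));
            TermAttribute term = ts.addAttribute(TermAttribute.class);
            OffsetAttribute off = ts.addAttribute(OffsetAttribute.class);
            TypeAttribute type = ts.addAttribute(TypeAttribute.class);
            while (ts.incrementToken()) {
                // one line per token: term text + [start,end) offsets + type string
                System.out.println(term.term() + " [" + off.startOffset() + ","
                        + off.endOffset() + ") " + type.type());
            }
            ts.close();
        }
    }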

 

A filter restricts a query to some subset of the index. It acts a bit like the WHERE clause in SQL, but with a difference: it is not part of the regular query;

it only preprocesses the data source and then hands the result to the query. Note that it performs preprocessing rather than filtering the query results,

so using a filter can be expensive: a single query may take up to a hundred times longer.
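
For example (a sketch against the 3.0.x API; the "category" field is made up), restricting a query to one slice of the index:

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryWrapperFilter;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    public class FilteredSearch {
        // Restrict 'query' to documents whose "category" field equals "news".
        static TopDocs searchNewsOnly(IndexSearcher searcher, Query query)
                throws IOException {
            Filter filter = new QueryWrapperFilter(
                    new TermQuery(new Term("category", "news")));
            return searcher.search(query, filter, 10);
        }
    }

Wrapping the filter in a CachingWrapperFilter amortizes that preprocessing cost across repeated queries, which is the usual answer to the hundred-fold slowdown above.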

 

+++++++++++++++++

 

While building the index, only one thread is doing the processing.

 

When this lock file is present, a writer is currently modifying the index (adding or removing documents).
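
A sketch of inspecting that lock from code, using the 3.0.x static helpers (the index path is a placeholder):

    import java.io.File;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class LockCheck {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(new File("/path/to/index"));
            if (IndexWriter.isLocked(dir)) {
                // Only safe if no writer is really alive, e.g. after a crash.
                IndexWriter.unlock(dir);
            }
        }
    }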

 

++++++++++++++++++

 

      while (left < right && array[left].fieldInfo.name.compareTo(partition.fieldInfo.name) <= 0)  // the "<=" here should be "<"

        ++left;

 

++++++++++ Merge +++++++++++

 

mergeFields always does a two-way merge; multi-way merging is not used.

 

During a merge, the I/O is carried out via IndexReader.

 

?When does a merge get triggered

MergeFactor

 

Classes involved and the file types they map to:

FieldInfos → .fnm

 

How is it handled when a single file is too large to load entirely into memory?

 

+++++++++++++++++++++++

Ways to speed up indexing:

 

>Open a single writer and re-use it for the duration of your indexing session.

>Turn off the compound file format (but you may run out of file descriptors).

>Re-use Document and Field instances (see the sketch after this list).

>Always add fields in the same order to your Document when using stored fields or term vectors.
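
A sketch of the reuse pattern from the two tips above (the [id, body] record layout is made up): one Document, fields added once in a fixed order, values swapped per record:

    import java.io.IOException;
    import java.util.List;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class ReuseSketch {
        // records: [id, body] pairs (hypothetical input format)
        static void indexAll(IndexWriter writer, List<String[]> records)
                throws IOException {
            Document doc = new Document();
            Field id = new Field("id", "", Field.Store.YES, Field.Index.NOT_ANALYZED);
            Field body = new Field("body", "", Field.Store.NO, Field.Index.ANALYZED);
            doc.add(id);    // always the same field order
            doc.add(body);
            for (String[] r : records) {
                id.setValue(r[0]);    // reuse the instances, just swap the values
                body.setValue(r[1]);
                writer.addDocument(doc);
            }
        }
    }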

 

>Use autoCommit=false when you open your IndexWriter   (this setting no longer exists in 3.0.1)

 

>Turn off any features you are not in fact using

 

>Use a faster analyzer.

 

>Speed up document construction.

 

>Don't optimize unless you really need to (for faster searching)

 

>Index into separate indices then merge.

 

>setMaxBufferedDocs(int maxBufferedDocs)

Controls how many documents are buffered in memory before a new segment is written; a larger value speeds up indexing. The default is -1 (disabled). From the javadoc: "Determines the minimal number of documents required before the buffered in-memory documents are flushed as a new segment. Large values generally give faster indexing. Disabled by default (writer flushes by RAM usage)."

 

>We can write the index into a RAMDirectory first and flush it to the FSDirectory in batches once it grows large enough, reducing the number of disk I/O operations.
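
A sketch of that staging pattern against the 3.0.x API (fsWriter is assumed to be the long-lived writer on the on-disk index):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class RamStaging {
        static void flushBatch(IndexWriter fsWriter /* long-lived on-disk writer */)
                throws Exception {
            Directory ramDir = new RAMDirectory();
            IndexWriter ramWriter = new IndexWriter(ramDir,
                    new StandardAnalyzer(Version.LUCENE_30), true,
                    IndexWriter.MaxFieldLength.UNLIMITED);
            // ... add one batch of documents to ramWriter ...
            ramWriter.close();
            // one bulk transfer to disk instead of many small writes
            fsWriter.addIndexesNoOptimize(new Directory[] { ramDir });
        }
    }

That said, once the writer flushes by RAM usage (setRAMBufferSizeMB), it already buffers in memory, so this staging trick buys less than it did in older releases.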

 

>setMergeFactor   (at around 100k docs, 100-200 is a good range)

 

>setRAMBufferSizeMB (default: 16 MB)

From the javadoc: "Generally for faster indexing performance it's best to flush by RAM usage instead of document count and use as large a RAM buffer as you can."
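
Pulling the last few knobs together, a hypothetical tuning method for a 3.0.x writer (the numbers are illustrative, not recommendations from this post):

    import org.apache.lucene.index.IndexWriter;

    public class Tuning {
        static void tuneForBulkIndexing(IndexWriter writer) {
            writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH); // flush by RAM only
            writer.setRAMBufferSizeMB(256.0);  // default is 16 MB
            writer.setMergeFactor(100);        // 100-200 was good around 100k docs
            writer.setUseCompoundFile(false);  // faster, but more file descriptors
        }
    }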

 

 

>Optimize date-range restrictions (when searches must be limited to a given time range).

 

 

>Quick tips:

Keep the index small: eliminate norms and term vectors when they are not needed, and set the Store flag on a field only if it is a must.

Obvious, but an oft-repeated mistake: create only one instance of Searcher and reuse it.

Keep the index on fast disks (in RAM, if you are paranoid).

 

>Use multiple threads with one IndexWriter
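
IndexWriter.addDocument is thread-safe, so the usual pattern is a worker pool feeding one shared writer; a sketch (the pool size is arbitrary):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    public class ConcurrentIndexing {
        static void indexAll(final IndexWriter writer, Iterable<Document> docs)
                throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(4);
            for (final Document doc : docs) {
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            writer.addDocument(doc);  // safe to call concurrently
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }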

 

+++++++++++++++++++++

 

 

dspeak@lm-vm01:~/luowl/soft/lucene-3.0.1/src/demo$ java -cp .:/home/dspeak/luowl/search-node/lib/*  org.apache.lucene.demo.IndexFiles ~/luowl/soft/data1.xml

 

 

+++++++++++++++++++

 

 

Every optimize rebuilds the index from scratch.

 

 

+++++++++++++++++++++

 

LEVEL_LOG_SPAN: what does "level" mean?

"Whenever extra segments (beyond the merge factor upper bound) are encountered, all segments within the level are merged."

[Segments are grouped by level.]

 

++++++++++++++

 

We keep a separate Posting hash and other state for each thread and then merge postings hashes from all threads when writing the segment.

 

++++++++++++++++++++++++

 

 

      /*

      This is the current indexing chain:

 

      DocConsumer / DocConsumerPerThread

        --> code: DocFieldProcessor / DocFieldProcessorPerThread

          --> DocFieldConsumer / DocFieldConsumerPerThread / DocFieldConsumerPerField

            --> code: DocFieldConsumers / DocFieldConsumersPerThread / DocFieldConsumersPerField

              --> code: DocInverter / DocInverterPerThread / DocInverterPerField

                --> InvertedDocConsumer / InvertedDocConsumerPerThread / InvertedDocConsumerPerField

                  --> code: TermsHash / TermsHashPerThread / TermsHashPerField

                    --> TermsHashConsumer / TermsHashConsumerPerThread / TermsHashConsumerPerField

                      --> code: FreqProxTermsWriter / FreqProxTermsWriterPerThread / FreqProxTermsWriterPerField

                      --> code: TermVectorsTermsWriter / TermVectorsTermsWriterPerThread / TermVectorsTermsWriterPerField

                --> InvertedDocEndConsumer / InvertedDocConsumerPerThread / InvertedDocConsumerPerField

                  --> code: NormsWriter / NormsWriterPerThread / NormsWriterPerField

              --> code: StoredFieldsWriter / StoredFieldsWriterPerThread / StoredFieldsWriterPerField

    */

 

 

++++++++++++++++++++++++++++++    

 

What is TermVector used for?

[Counting term frequencies, finding where terms occur in a document, and so on.]
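
A sketch of both ends in 3.0.x (the field name "body" is arbitrary): store a term vector at index time, then read the per-document frequencies back:

    import java.io.IOException;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;

    public class TermVectorDump {
        // Index time: ask Lucene to keep a term vector for this field.
        static Field bodyField(String text) {
            return new Field("body", text, Field.Store.NO, Field.Index.ANALYZED,
                    Field.TermVector.WITH_POSITIONS_OFFSETS);
        }

        // Read time: per-document term frequencies (cast to TermPositionVector
        // if the positions/offsets are also needed).
        static void dump(IndexReader reader, int docId) throws IOException {
            TermFreqVector tfv = reader.getTermFreqVector(docId, "body");
            if (tfv == null) return;  // no term vector stored for this doc/field
            String[] terms = tfv.getTerms();
            int[] freqs = tfv.getTermFrequencies();
            for (int i = 0; i < terms.length; i++) {
                System.out.println(terms[i] + " x" + freqs[i]);
            }
        }
    }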

 

++++++++++++++++++++

Based on reading the code and testing: even without an optimize, a flush also happens on close, and FormatPostingsDocsWriter.addDoc gets called.
