一、Indexing numeric values
In early versions of Lucene a numeric value such as "1900" was treated as text: it was just a string, with no notion of magnitude or range. In practice we often need to index numbers, for example book prices or the send and receive times of email.
Lucene 2.9 added this numeric-index capability through the NumericField class. Its setXxxValue methods (setIntValue, setLongValue, setFloatValue, setDoubleValue) assign the numeric value:
doc.add(new NumericField("price").setDoubleValue(19.99));
This price field can now be used for searching and sorting, and it can be matched exactly, just like a textual field.
A document may also contain several NumericFields with the same name, for example:
doc.add(new NumericField("price").setDoubleValue(19.99));
doc.add(new NumericField("price").setDoubleValue(9.99));
NumericField is also what we use when indexing dates and times; the approach is to convert them to a numeric value first:
doc.add(new NumericField("timestamp").setLongValue(new Date().getTime()));
For coarser precision, index a single calendar field instead:
Calendar cal = Calendar.getInstance();
cal.setTime(date);
doc.add(new NumericField("dayOfMonth").setIntValue(cal.get(Calendar.DAY_OF_MONTH)));
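The two conversions above, full precision as epoch milliseconds versus coarse precision as a single calendar field, can be tried without Lucene at all. The sketch below is plain JDK code; the class and method names are illustrative only, not part of Lucene's API.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.TimeZone;

// Plain-JDK sketch of the two conversions: full precision (epoch millis,
// for setLongValue) versus coarse precision (a single calendar field,
// for setIntValue). Class and method names here are illustrative only.
public class DateToNumeric {

    // Full-precision timestamp: suitable for range queries over instants.
    static long epochMillis(Date d) {
        return d.getTime();
    }

    // Coarse value: lets you ask e.g. "everything sent on the 15th".
    static int dayOfMonth(Date d, TimeZone tz) {
        Calendar cal = Calendar.getInstance(tz);
        cal.setTime(d);
        return cal.get(Calendar.DAY_OF_MONTH);
    }

    public static void main(String[] args) {
        Date epoch = new Date(0L); // 1970-01-01T00:00:00 UTC
        System.out.println(epochMillis(epoch));                             // 0
        System.out.println(dayOfMonth(epoch, TimeZone.getTimeZone("UTC"))); // 1
    }
}
```

Indexing the full timestamp supports range queries over exact instants; indexing only the day of month supports coarser queries at a fraction of the term count.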
One more note on multi-valued numeric fields: when a document holds several same-name numeric fields, NumericRangeQuery and NumericRangeFilter treat the values with logical "or", so the document matches if any one of its values falls inside the range. Sorting on such a multi-valued field, however, is undefined.
二、Field truncation
When we instantiate an IndexWriter we pass in the parameter IndexWriter.MaxFieldLength.UNLIMITED:
new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_36), IndexWriter.MaxFieldLength.UNLIMITED);
This parameter is for field truncation. When we index a field's value we don't know in advance how long it will be; it may be very long, or it may be non-textual binary content, and that can cause problems. This parameter lets us limit how much of the field gets indexed. The limit can also be changed on an existing writer:
writer.setMaxFieldLength(maxFieldLength)
Here maxFieldLength means the maximum number of terms that will be indexed for a single field in a document. If the writer started out without truncation and a limit is set later, documents already indexed are not affected.
Use this feature with caution: it can leave the index incomplete and hurt the search experience.
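As a standalone illustration of the truncation rule (not Lucene's actual analysis chain), the sketch below shows what "index at most maxFieldLength terms" means; a whitespace split stands in for a real analyzer.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of the truncation rule: only the first maxFieldLength
// terms of a field are indexed, the rest are silently dropped. A whitespace
// split stands in for a real analyzer here.
public class FieldTruncation {

    static List<String> indexedTerms(String fieldValue, int maxFieldLength) {
        String[] terms = fieldValue.split("\\s+");
        return Arrays.asList(terms).subList(0, Math.min(terms.length, maxFieldLength));
    }

    public static void main(String[] args) {
        // With maxFieldLength = 3, "fox" and "jumps" are never indexed,
        // so a search for them would miss this document.
        System.out.println(indexedTerms("the quick brown fox jumps", 3)); // [the, quick, brown]
    }
}
```

This is exactly why truncation must be used carefully: terms past the cutoff simply do not exist in the index, and no error is reported.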
三、Near-real-time search
When we frequently modify an index through a writer and want to search the latest content, we can call the writer's getReader method. It returns a read-only IndexReader that reflects the writer's changes, committed or uncommitted:
IndexReader reader = writer.getReader();
Remember to close this reader once you are done with it. Lucene's own explanation of the method:
This provides "near real-time" searching, in that changes made during an IndexWriter session can be quickly made available for searching without closing the writer nor calling commit(). Note that this is functionally equivalent to calling flush() and then using IndexReader.open(org.apache.lucene.store.Directory) to open a new reader. But the turnaround time of this method should be faster since it avoids the potentially costly commit(). You must close the IndexReader returned by this method once you are done using it.
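Setting Lucene aside, the contract of getReader() can be pictured with a toy "writer" whose read-only snapshot covers uncommitted changes. The class below is purely illustrative and involves no Lucene API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy analogy for getReader() (no Lucene involved): a "writer" whose
// snapshot() returns a read-only view covering committed AND uncommitted
// documents, so readers see changes without waiting for commit().
public class NrtWriter {

    private final List<String> committed = new ArrayList<>();
    private final List<String> uncommitted = new ArrayList<>();

    void addDocument(String doc) { uncommitted.add(doc); }

    void commit() {
        committed.addAll(uncommitted);
        uncommitted.clear();
    }

    // Read-only snapshot, analogous to the read-only IndexReader.
    List<String> snapshot() {
        List<String> view = new ArrayList<>(committed);
        view.addAll(uncommitted);
        return Collections.unmodifiableList(view);
    }

    public static void main(String[] args) {
        NrtWriter w = new NrtWriter();
        w.addDocument("doc1");
        // Visible although commit() has not been called yet:
        System.out.println(w.snapshot()); // [doc1]
    }
}
```

The point of the analogy: the snapshot is cheap because it skips the durable commit, which is precisely the turnaround advantage the javadoc describes.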
四、Optimizing the index
Why optimize: updating an index frequently through an IndexWriter produces many segments (and each segment consists of several files). When searching, Lucene must search every segment and then merge the results, which slows searching down; too many segments also consume more file descriptors. The goal of optimizing is to merge the scattered segments into at most N segments and so speed up searching.
IndexWriter provides four methods for optimization:
optimize() reduces the index to a single segment, not returning until the operation is finished.
optimize(int maxNumSegments), also known as partial optimize, reduces the index to at most maxNumSegments segments. Because the final merge down to one segment is the most costly, optimizing to, say, five segments should be quite a bit faster than optimizing down to one segment, allowing you to trade less optimization time for slower search speed.
optimize(boolean doWait) is just like optimize(), except if doWait is false the call returns immediately while the necessary merges take place in the background. Note that doWait=false only works for a merge scheduler that runs merges in background threads, such as the default ConcurrentMergeScheduler.
optimize(int maxNumSegments, boolean doWait) is a partial optimize that runs in the background if doWait is false.
Note: optimizing requires extra disk space. The merged segment files are written first, and the original segments are deleted only after the merge has completed, so you must make sure enough free disk space is available.
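What a merge actually does can be sketched generically: combine k sorted "segments" into one. The code below is a plain k-way merge, not Lucene's merge policy, but it shows why merging everything into a single segment is the most expensive case: every entry must be visited.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Generic k-way merge: each "segment" is a sorted list, and the merge
// produces one sorted list. This is NOT Lucene's implementation, only an
// illustration of why a full merge is expensive: every entry is visited.
public class SegmentMerge {

    static List<Integer> merge(List<List<Integer>> segments) {
        // Heap entries are {value, segmentIndex, positionInSegment}.
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int s = 0; s < segments.size(); s++) {
            if (!segments.get(s).isEmpty()) {
                heap.add(new int[] {segments.get(s).get(0), s, 0});
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            merged.add(top[0]);
            List<Integer> seg = segments.get(top[1]);
            int next = top[2] + 1;
            if (next < seg.size()) {
                heap.add(new int[] {seg.get(next), top[1], next});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Three small "segments", as frequent flushes would leave behind:
        List<List<Integer>> segments =
            List.of(List.of(1, 4, 9), List.of(2, 3), List.of(5));
        System.out.println(merge(segments)); // [1, 2, 3, 4, 5, 9]
    }
}
```

Merging down to five segments instead of one lets most of the data stay untouched, which is the trade-off optimize(int maxNumSegments) exposes.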
五、Lucene's Directory
Directory dir = FSDirectory.open(new File(indexDir));
writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_36), false, IndexWriter.MaxFieldLength.UNLIMITED);
TODO.
六、IndexReader and IndexWriter under multiple threads
1. Any number of IndexReaders may be open on the same index, whether in one JVM or across several JVMs. Within a single JVM, however, resources are used more efficiently when multiple threads share one IndexReader. An IndexReader can even be opened while an IndexWriter is modifying the index.
2. Only one IndexWriter may be open on an index at a time. When an IndexWriter is instantiated it acquires a write lock, which is released only when that writer is closed; this is what enforces the one-writer rule.
3. Any number of threads may share a single IndexReader or IndexWriter, because both classes are thread-safe.
4. The write lock is realized as a write.lock file in the index directory; its presence is what keeps a second writer from opening.
For example, opening two IndexWriters on the same index at once throws a LockObtainFailedException:
IndexWriter writer1 = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_36), IndexWriter.MaxFieldLength.UNLIMITED);
IndexWriter writer2 = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_36), IndexWriter.MaxFieldLength.UNLIMITED);
Exception in thread "main" org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@D:\work\eclipseworkpace\lucenelearn\charpter2-1\write.lock
at org.apache.lucene.store.Lock.obtain(Lock.java:84)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1098)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:953)
at charpter2.ChangeIndex.<init>(ChangeIndex.java:46)
at charpter2.ChangeIndex.main(ChangeIndex.java:177)
You can choose the locking mechanism yourself with Directory's setLockFactory method. Note that setLockFactory must be called before the IndexWriter constructor:
dir = FSDirectory.open(new File(indexDir));
dir.setLockFactory(lockFactory);
this.writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_36), IndexWriter.MaxFieldLength.UNLIMITED);
IndexWriter.isLocked(Directory) checks whether an index is locked. IndexWriter.unlock(Directory) unlocks an index, but it is best avoided: unlocking an index that a writer still holds can render the index unusable.
Lucene ships four LockFactory implementations:
NativeFSLockFactory: this is the default locking for FSDirectory, using java.nio native OS locking, which will never leave leftover lock files when the JVM exits. But this locking implementation may not work correctly over certain shared file systems, notably NFS.
SimpleFSLockFactory: uses Java's File.createNewFile API, which may be more portable across different file systems than NativeFSLockFactory. Be aware that if the JVM crashes or IndexWriter isn't closed before the JVM exits, this may leave a leftover write.lock file, which you must manually remove.
SingleInstanceLockFactory: creates a lock entirely in memory. This is the default locking implementation for RAMDirectory. Use this when you know all IndexWriters will be instantiated in a single JVM.
NoLockFactory: disables locking entirely. Be careful! Only use this when you are absolutely certain that Lucene's normal locking safeguard isn't necessary, for example when using a private RAMDirectory with a single IndexWriter instance.
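The SimpleFSLockFactory technique described above is easy to reproduce with plain JDK calls. The sketch below uses File.createNewFile the same way, including the stale write.lock failure mode; the class name and layout are ours, not Lucene's.

```java
import java.io.File;
import java.io.IOException;

// Plain-JDK sketch of SimpleFSLockFactory's technique: File.createNewFile
// is atomic, so whichever "writer" creates write.lock first holds the lock.
// If the process dies before release(), the file is left behind, which is
// exactly the stale write.lock the description above warns about.
public class WriteLock {

    private final File lockFile;

    WriteLock(File indexDir) { this.lockFile = new File(indexDir, "write.lock"); }

    boolean obtain() {
        try {
            return lockFile.createNewFile(); // false if the lock is already held
        } catch (IOException e) {
            return false;
        }
    }

    void release() { lockFile.delete(); }

    public static void main(String[] args) {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        WriteLock first = new WriteLock(dir);
        WriteLock second = new WriteLock(dir);
        first.release(); // clear any stale lock left by a previous run
        System.out.println(first.obtain());  // true: lock acquired
        System.out.println(second.obtain()); // false: a second writer is refused
        first.release();
    }
}
```

The refusal of the second obtain() is the file-level version of the LockObtainFailedException shown earlier.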
七、Debugging Lucene
this.writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_36), IndexWriter.MaxFieldLength.UNLIMITED);
this.writer.setInfoStream(System.out);
After setInfoStream(PrintStream) is called, Lucene prints detailed information about every step of its indexing process, as below:
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: commit: start
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: commit: enter lock
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: commit: now prepare
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: prepareCommit: flush
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: now trigger flush reason=explicit flush
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: start flush: applyAllDeletes=true
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: index before flush _0(3.6.1):C2/1 _2(3.6.1):C2/1 _4(3.6.1):C1
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: DW: flush postings as segment _5 numDocs=1
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: DW: new segment has no vectors
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: DW: flushedFiles=[_5.fdt, _5.prx, _5.fnm, _5.nrm, _5.tis, _5.fdx, _5.frq, _5.tii]
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: DW: flush: segment=_5(3.6.1):C1
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: DW: ramUsed=0.095 MB newFlushedSize=0 MB (0 MB w/o doc stores) docs/MB=3,898.052 new/old=0.191%
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: DW: flush: push buffered deletes startSize=98 frozenSize=1024
BD 0 [Sun Mar 18 17:54:58 CST 2012; main]: push deletes 1 deleted terms (unique count=1) bytesUsed=1024 delGen=1 packetCount=1
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: DW: flush: delGen=1
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: DW: flush time 314 msec
IFD [Sun Mar 18 17:54:58 CST 2012; main]: now checkpoint "segments_5" [4 segments ; isCommit = false]
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: apply all deletes during flush
BD 0 [Sun Mar 18 17:54:58 CST 2012; main]: applyDeletes: infos=[_0(3.6.1):C2/1, _2(3.6.1):C2/1, _4(3.6.1):C1, _5(3.6.1):C1] packetCount=1
BD 0 [Sun Mar 18 17:54:58 CST 2012; main]: seg=_5(3.6.1):C1 segGen=1 segDeletes=[ 1 deleted terms (unique count=1) bytesUsed=1024]; coalesced deletes=[null] delCount=0
BD 0 [Sun Mar 18 17:54:58 CST 2012; main]: seg=_4(3.6.1):C1/1 segGen=0 coalesced deletes=[CoalescedDeletes(termSets=1,queries=0)] delCount=1 100% deleted
BD 0 [Sun Mar 18 17:54:58 CST 2012; main]: seg=_2(3.6.1):C2/1 segGen=0 coalesced deletes=[CoalescedDeletes(termSets=1,queries=0)] delCount=0
BD 0 [Sun Mar 18 17:54:58 CST 2012; main]: seg=_0(3.6.1):C2/1 segGen=0 coalesced deletes=[CoalescedDeletes(termSets=1,queries=0)] delCount=0
BD 0 [Sun Mar 18 17:54:58 CST 2012; main]: applyDeletes took 27 msec
IFD [Sun Mar 18 17:54:58 CST 2012; main]: now checkpoint "segments_5" [4 segments ; isCommit = false]
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: drop 100% deleted segments: [_4(3.6.1):C1/1]
IFD [Sun Mar 18 17:54:58 CST 2012; main]: now checkpoint "segments_5" [3 segments ; isCommit = false]
IFD [Sun Mar 18 17:54:58 CST 2012; main]: delete "_4_1.del"
BD 0 [Sun Mar 18 17:54:58 CST 2012; main]: prune sis=org.apache.lucene.index.SegmentInfos@d08633 minGen=2 packetCount=1
BD 0 [Sun Mar 18 17:54:58 CST 2012; main]: pruneDeletes: prune 1 packets; 0 packets remain
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: clearFlushPending
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: LMP: findMerges: 3 segments
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: LMP: seg=_0(3.6.1):C2/1 level=2.240549 size=0.000 MB
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: LMP: seg=_2(3.6.1):C2/1 level=2.240549 size=0.000 MB
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: LMP: seg=_5(3.6.1):C1 level=2.429752 size=0.000 MB
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: LMP: level -1.0 to 2.429752: 3 segments
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: CMS: now merge
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: CMS: index: _0(3.6.1):C2/1 _2(3.6.1):C2/1 _5(3.6.1):C1
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: CMS: no more merges pending; now return
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: startCommit(): start
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: startCommit index=_0(3.6.1):C2/1 _2(3.6.1):C2/1 _5(3.6.1):C1 changeCount=4
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: done all syncs
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: commit: pendingCommit != null
IW 0 [Sun Mar 18 17:54:59 CST 2012; main]: commit: wrote segments file "segments_6"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: now checkpoint "segments_6" [3 segments ; isCommit = true]
IFD [Sun Mar 18 17:54:59 CST 2012; main]: deleteCommits: now decRef commit "segments_5"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: delete "_4.prx"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: unable to remove file "_4.prx": java.io.IOException: Cannot delete D:\work\eclipseworkpace\lucenelearn\charpter2-1\_4.prx; Will re-try later.
IFD [Sun Mar 18 17:54:59 CST 2012; main]: delete "_4.fnm"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: delete "_4.fdx"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: unable to remove file "_4.fdx": java.io.IOException: Cannot delete D:\work\eclipseworkpace\lucenelearn\charpter2-1\_4.fdx; Will re-try later.
IFD [Sun Mar 18 17:54:59 CST 2012; main]: delete "_4.frq"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: unable to remove file "_4.frq": java.io.IOException: Cannot delete D:\work\eclipseworkpace\lucenelearn\charpter2-1\_4.frq; Will re-try later.
IFD [Sun Mar 18 17:54:59 CST 2012; main]: delete "_4.tis"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: unable to remove file "_4.tis": java.io.IOException: Cannot delete D:\work\eclipseworkpace\lucenelearn\charpter2-1\_4.tis; Will re-try later.
IFD [Sun Mar 18 17:54:59 CST 2012; main]: delete "_4.tii"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: delete "_4.nrm"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: unable to remove file "_4.nrm": java.io.IOException: Cannot delete D:\work\eclipseworkpace\lucenelearn\charpter2-1\_4.nrm; Will re-try later.
IFD [Sun Mar 18 17:54:59 CST 2012; main]: delete "_4.fdt"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: unable to remove file "_4.fdt": java.io.IOException: Cannot delete D:\work\eclipseworkpace\lucenelearn\charpter2-1\_4.fdt; Will re-try later.
IFD [Sun Mar 18 17:54:59 CST 2012; main]: delete "segments_5"
IW 0 [Sun Mar 18 17:54:59 CST 2012; main]: commit: done
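The infoStream idea can be mimicked without Lucene: hand a component a PrintStream and have it print one prefixed line per internal step. The class below is an illustrative stand-in, not Lucene's IndexWriter.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

// Illustrative stand-in for the infoStream idea (not Lucene's IndexWriter):
// the component writes one prefixed line per internal step to whatever
// PrintStream it was handed, so output can go to System.out, a file, or
// an in-memory buffer.
public class InfoStreamDemo {

    private PrintStream infoStream;

    void setInfoStream(PrintStream out) { this.infoStream = out; }

    private void message(String msg) {
        if (infoStream != null) infoStream.println("IW: " + msg);
    }

    void commit() {
        message("commit: start");
        message("commit: done");
    }

    public static void main(String[] args) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        InfoStreamDemo writer = new InfoStreamDemo();
        writer.setInfoStream(new PrintStream(buf));
        writer.commit();
        System.out.print(buf); // IW: commit: start / IW: commit: done
    }
}
```

Passing System.out reproduces console traces like the log above; passing a buffer or file stream captures them for later inspection.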