一、Indexing numeric values
In early versions of Lucene a numeric value such as "1900" was treated as text: it was just a string, with no notion of magnitude or range. In practice we often need to index numbers, for example book prices or the send and receive times of email.
Lucene 2.9 added this numeric-index capability through the NumericField class. Its setXxxValue methods (setIntValue, setLongValue, setFloatValue, setDoubleValue) assign the numeric value:
doc.add(new NumericField("price").setDoubleValue(19.99));
This price field can now be used for searching and sorting, and it can be matched exactly, just like a textual field.
A document may also contain several NumericFields with the same name, for example:
doc.add(new NumericField("price").setDoubleValue(19.99));
doc.add(new NumericField("price").setDoubleValue(9.99));
NumericField is also what we use when indexing dates and times; the approach is to convert them to a numeric value first:
doc.add(new NumericField("timestamp").setLongValue(new Date().getTime()));
For coarser precision, index a single calendar field instead:
Calendar cal = Calendar.getInstance();
cal.setTime(date);
doc.add(new NumericField("dayOfMonth").setIntValue(cal.get(Calendar.DAY_OF_MONTH)));
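The two conversions above, full precision as epoch milliseconds versus coarse precision as a single calendar field, can be tried without Lucene at all. The sketch below is plain JDK code; the class and method names are illustrative only, not part of Lucene's API.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.TimeZone;

// Plain-JDK sketch of the two conversions: full precision (epoch millis,
// for setLongValue) versus coarse precision (a single calendar field,
// for setIntValue). Class and method names here are illustrative only.
public class DateToNumeric {

    // Full-precision timestamp: suitable for range queries over instants.
    static long epochMillis(Date d) {
        return d.getTime();
    }

    // Coarse value: lets you ask e.g. "everything sent on the 15th".
    static int dayOfMonth(Date d, TimeZone tz) {
        Calendar cal = Calendar.getInstance(tz);
        cal.setTime(d);
        return cal.get(Calendar.DAY_OF_MONTH);
    }

    public static void main(String[] args) {
        Date epoch = new Date(0L); // 1970-01-01T00:00:00 UTC
        System.out.println(epochMillis(epoch));                             // 0
        System.out.println(dayOfMonth(epoch, TimeZone.getTimeZone("UTC"))); // 1
    }
}
```

Indexing the full timestamp supports range queries over exact instants; indexing only the day of month supports coarser queries at a fraction of the term count.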
One more note on multi-valued numeric fields: when a document holds several same-name numeric fields, NumericRangeQuery and NumericRangeFilter treat the values with logical "or", so the document matches if any one of its values falls inside the range. Sorting on such a multi-valued field, however, is undefined.
二、Field truncation
When we instantiate an IndexWriter we pass in the parameter IndexWriter.MaxFieldLength.UNLIMITED:
new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_36), IndexWriter.MaxFieldLength.UNLIMITED);
This parameter is for field truncation. When we index a field's value we don't know in advance how long it will be; it may be very long, or it may be non-textual binary content, and that can cause problems. This parameter lets us limit how much of the field gets indexed. The limit can also be changed on an existing writer:
writer.setMaxFieldLength(maxFieldLength)
Here maxFieldLength means the maximum number of terms that will be indexed for a single field in a document. If the writer started out without truncation and a limit is set later, documents already indexed are not affected.
Use this feature with caution: it can leave the index incomplete and hurt the search experience.
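As a standalone illustration of the truncation rule (not Lucene's actual analysis chain), the sketch below shows what "index at most maxFieldLength terms" means; a whitespace split stands in for a real analyzer.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of the truncation rule: only the first maxFieldLength
// terms of a field are indexed, the rest are silently dropped. A whitespace
// split stands in for a real analyzer here.
public class FieldTruncation {

    static List<String> indexedTerms(String fieldValue, int maxFieldLength) {
        String[] terms = fieldValue.split("\\s+");
        return Arrays.asList(terms).subList(0, Math.min(terms.length, maxFieldLength));
    }

    public static void main(String[] args) {
        // With maxFieldLength = 3, "fox" and "jumps" are never indexed,
        // so a search for them would miss this document.
        System.out.println(indexedTerms("the quick brown fox jumps", 3)); // [the, quick, brown]
    }
}
```

This is exactly why truncation must be used carefully: terms past the cutoff simply do not exist in the index, and no error is reported.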
三、Near-real-time search
When we frequently modify an index through a writer and want to search the latest content, we can call the writer's getReader method. It returns a read-only IndexReader that reflects the writer's changes, committed or uncommitted:
IndexReader reader = writer.getReader();
Remember to close this reader once you are done with it. Lucene's own explanation of the method:
This provides "near real-time" searching, in that changes made during an IndexWriter session can be quickly made available for searching without closing the writer nor calling commit(). Note that this is functionally equivalent to calling flush() and then using IndexReader.open(org.apache.lucene.store.Directory) to open a new reader. But the turnaround time of this method should be faster since it avoids the potentially costly commit(). You must close the IndexReader returned by this method once you are done using it.
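Setting Lucene aside, the contract of getReader() can be pictured with a toy "writer" whose read-only snapshot covers uncommitted changes. The class below is purely illustrative and involves no Lucene API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy analogy for getReader() (no Lucene involved): a "writer" whose
// snapshot() returns a read-only view covering committed AND uncommitted
// documents, so readers see changes without waiting for commit().
public class NrtWriter {

    private final List<String> committed = new ArrayList<>();
    private final List<String> uncommitted = new ArrayList<>();

    void addDocument(String doc) { uncommitted.add(doc); }

    void commit() {
        committed.addAll(uncommitted);
        uncommitted.clear();
    }

    // Read-only snapshot, analogous to the read-only IndexReader.
    List<String> snapshot() {
        List<String> view = new ArrayList<>(committed);
        view.addAll(uncommitted);
        return Collections.unmodifiableList(view);
    }

    public static void main(String[] args) {
        NrtWriter w = new NrtWriter();
        w.addDocument("doc1");
        // Visible although commit() has not been called yet:
        System.out.println(w.snapshot()); // [doc1]
    }
}
```

The point of the analogy: the snapshot is cheap because it skips the durable commit, which is precisely the turnaround advantage the javadoc describes.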
四、Optimizing the index
Why optimize: updating an index frequently through an IndexWriter produces many segments (and each segment consists of several files). When searching, Lucene must search every segment and then merge the results, which slows searching down; too many segments also consume more file descriptors. The goal of optimizing is to merge the scattered segments into at most N segments and so speed up searching.
IndexWriter provides four methods for optimization:
optimize() reduces the index to a single segment, not returning until the operation is finished.
optimize(int maxNumSegments), also known as partial optimize, reduces the index to at most maxNumSegments segments. Because the final merge down to one segment is the most costly, optimizing to, say, five segments should be quite a bit faster than optimizing down to one segment, allowing you to trade less optimization time for slower search speed.
optimize(boolean doWait) is just like optimize(), except if doWait is false the call returns immediately while the necessary merges take place in the background. Note that doWait=false only works for a merge scheduler that runs merges in background threads, such as the default ConcurrentMergeScheduler.
optimize(int maxNumSegments, boolean doWait) is a partial optimize that runs in the background if doWait is false.
Note: optimizing requires extra disk space. The merged segment files are written first, and the original segments are deleted only after the merge has completed, so you must make sure enough free disk space is available.
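What a merge actually does can be sketched generically: combine k sorted "segments" into one. The code below is a plain k-way merge, not Lucene's merge policy, but it shows why merging everything into a single segment is the most expensive case: every entry must be visited.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Generic k-way merge: each "segment" is a sorted list, and the merge
// produces one sorted list. This is NOT Lucene's implementation, only an
// illustration of why a full merge is expensive: every entry is visited.
public class SegmentMerge {

    static List<Integer> merge(List<List<Integer>> segments) {
        // Heap entries are {value, segmentIndex, positionInSegment}.
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int s = 0; s < segments.size(); s++) {
            if (!segments.get(s).isEmpty()) {
                heap.add(new int[] {segments.get(s).get(0), s, 0});
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            merged.add(top[0]);
            List<Integer> seg = segments.get(top[1]);
            int next = top[2] + 1;
            if (next < seg.size()) {
                heap.add(new int[] {seg.get(next), top[1], next});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Three small "segments", as frequent flushes would leave behind:
        List<List<Integer>> segments =
            List.of(List.of(1, 4, 9), List.of(2, 3), List.of(5));
        System.out.println(merge(segments)); // [1, 2, 3, 4, 5, 9]
    }
}
```

Merging down to five segments instead of one lets most of the data stay untouched, which is the trade-off optimize(int maxNumSegments) exposes.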
五、Lucene's Directory
Directory dir = FSDirectory.open(new File(indexDir));
writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_36), false, IndexWriter.MaxFieldLength.UNLIMITED);
TODO.
六、IndexReader and IndexWriter under multiple threads
1. Any number of IndexReaders may be open on the same index, whether in one JVM or across several JVMs. Within a single JVM, however, resources are used more efficiently when multiple threads share one IndexReader. An IndexReader can even be opened while an IndexWriter is modifying the index.
2. Only one IndexWriter may be open on an index at a time. When an IndexWriter is instantiated it acquires a write lock, which is released only when that writer is closed; this is what enforces the one-writer rule.
3. Any number of threads may share a single IndexReader or IndexWriter, because both classes are thread-safe.
4. The write lock is realized as a write.lock file in the index directory; its presence is what keeps a second writer from opening.
For example, opening two IndexWriters on the same index at once throws a LockObtainFailedException:
IndexWriter writer1 = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_36), IndexWriter.MaxFieldLength.UNLIMITED);
IndexWriter writer2 = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_36), IndexWriter.MaxFieldLength.UNLIMITED);
Exception in thread "main" org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@D:\work\eclipseworkpace\lucenelearn\charpter2-1\write.lock
at org.apache.lucene.store.Lock.obtain(Lock.java:84)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1098)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:953)
at charpter2.ChangeIndex.<init>(ChangeIndex.java:46)
at charpter2.ChangeIndex.main(ChangeIndex.java:177)
You can choose the locking mechanism yourself with Directory's setLockFactory method. Note that setLockFactory must be called before the IndexWriter constructor:
dir = FSDirectory.open(new File(indexDir));
dir.setLockFactory(lockFactory);
this.writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_36), IndexWriter.MaxFieldLength.UNLIMITED);
IndexWriter.isLocked(Directory) checks whether an index is locked. IndexWriter.unlock(Directory) unlocks an index, but it is best avoided: unlocking an index that a writer still holds can render the index unusable.
Lucene ships four LockFactory implementations:
NativeFSLockFactory: this is the default locking for FSDirectory, using java.nio native OS locking, which will never leave leftover lock files when the JVM exits. But this locking implementation may not work correctly over certain shared file systems, notably NFS.
SimpleFSLockFactory: uses Java's File.createNewFile API, which may be more portable across different file systems than NativeFSLockFactory. Be aware that if the JVM crashes or IndexWriter isn't closed before the JVM exits, this may leave a leftover write.lock file, which you must manually remove.
SingleInstanceLockFactory: creates a lock entirely in memory. This is the default locking implementation for RAMDirectory. Use this when you know all IndexWriters will be instantiated in a single JVM.
NoLockFactory: disables locking entirely. Be careful! Only use this when you are absolutely certain that Lucene's normal locking safeguard isn't necessary, for example when using a private RAMDirectory with a single IndexWriter instance.
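The SimpleFSLockFactory technique described above is easy to reproduce with plain JDK calls. The sketch below uses File.createNewFile the same way, including the stale write.lock failure mode; the class name and layout are ours, not Lucene's.

```java
import java.io.File;
import java.io.IOException;

// Plain-JDK sketch of SimpleFSLockFactory's technique: File.createNewFile
// is atomic, so whichever "writer" creates write.lock first holds the lock.
// If the process dies before release(), the file is left behind, which is
// exactly the stale write.lock the description above warns about.
public class WriteLock {

    private final File lockFile;

    WriteLock(File indexDir) { this.lockFile = new File(indexDir, "write.lock"); }

    boolean obtain() {
        try {
            return lockFile.createNewFile(); // false if the lock is already held
        } catch (IOException e) {
            return false;
        }
    }

    void release() { lockFile.delete(); }

    public static void main(String[] args) {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        WriteLock first = new WriteLock(dir);
        WriteLock second = new WriteLock(dir);
        first.release(); // clear any stale lock left by a previous run
        System.out.println(first.obtain());  // true: lock acquired
        System.out.println(second.obtain()); // false: a second writer is refused
        first.release();
    }
}
```

The refusal of the second obtain() is the file-level version of the LockObtainFailedException shown earlier.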
七、Debugging Lucene
this.writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_36), IndexWriter.MaxFieldLength.UNLIMITED);
this.writer.setInfoStream(System.out);
After setInfoStream(PrintStream) is called, Lucene prints detailed information about every step of its indexing process, as below:
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: commit: start
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: commit: enter lock
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: commit: now prepare
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: prepareCommit: flush
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: now trigger flush reason=explicit flush
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: start flush: applyAllDeletes=true
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: index before flush _0(3.6.1):C2/1 _2(3.6.1):C2/1 _4(3.6.1):C1
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: DW: flush postings as segment _5 numDocs=1
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: DW: new segment has no vectors
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: DW: flushedFiles=[_5.fdt, _5.prx, _5.fnm, _5.nrm, _5.tis, _5.fdx, _5.frq, _5.tii]
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: DW: flush: segment=_5(3.6.1):C1
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: DW: ramUsed=0.095 MB newFlushedSize=0 MB (0 MB w/o doc stores) docs/MB=3,898.052 new/old=0.191%
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: DW: flush: push buffered deletes startSize=98 frozenSize=1024
BD 0 [Sun Mar 18 17:54:58 CST 2012; main]: push deletes 1 deleted terms (unique count=1) bytesUsed=1024 delGen=1 packetCount=1
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: DW: flush: delGen=1
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: DW: flush time 314 msec
IFD [Sun Mar 18 17:54:58 CST 2012; main]: now checkpoint "segments_5" [4 segments ; isCommit = false]
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: apply all deletes during flush
BD 0 [Sun Mar 18 17:54:58 CST 2012; main]: applyDeletes: infos=[_0(3.6.1):C2/1, _2(3.6.1):C2/1, _4(3.6.1):C1, _5(3.6.1):C1] packetCount=1
BD 0 [Sun Mar 18 17:54:58 CST 2012; main]: seg=_5(3.6.1):C1 segGen=1 segDeletes=[ 1 deleted terms (unique count=1) bytesUsed=1024]; coalesced deletes=[null] delCount=0
BD 0 [Sun Mar 18 17:54:58 CST 2012; main]: seg=_4(3.6.1):C1/1 segGen=0 coalesced deletes=[CoalescedDeletes(termSets=1,queries=0)] delCount=1 100% deleted
BD 0 [Sun Mar 18 17:54:58 CST 2012; main]: seg=_2(3.6.1):C2/1 segGen=0 coalesced deletes=[CoalescedDeletes(termSets=1,queries=0)] delCount=0
BD 0 [Sun Mar 18 17:54:58 CST 2012; main]: seg=_0(3.6.1):C2/1 segGen=0 coalesced deletes=[CoalescedDeletes(termSets=1,queries=0)] delCount=0
BD 0 [Sun Mar 18 17:54:58 CST 2012; main]: applyDeletes took 27 msec
IFD [Sun Mar 18 17:54:58 CST 2012; main]: now checkpoint "segments_5" [4 segments ; isCommit = false]
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: drop 100% deleted segments: [_4(3.6.1):C1/1]
IFD [Sun Mar 18 17:54:58 CST 2012; main]: now checkpoint "segments_5" [3 segments ; isCommit = false]
IFD [Sun Mar 18 17:54:58 CST 2012; main]: delete "_4_1.del"
BD 0 [Sun Mar 18 17:54:58 CST 2012; main]: prune sis=org.apache.lucene.index.SegmentInfos@d08633 minGen=2 packetCount=1
BD 0 [Sun Mar 18 17:54:58 CST 2012; main]: pruneDeletes: prune 1 packets; 0 packets remain
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: clearFlushPending
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: LMP: findMerges: 3 segments
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: LMP: seg=_0(3.6.1):C2/1 level=2.240549 size=0.000 MB
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: LMP: seg=_2(3.6.1):C2/1 level=2.240549 size=0.000 MB
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: LMP: seg=_5(3.6.1):C1 level=2.429752 size=0.000 MB
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: LMP: level -1.0 to 2.429752: 3 segments
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: CMS: now merge
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: CMS: index: _0(3.6.1):C2/1 _2(3.6.1):C2/1 _5(3.6.1):C1
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: CMS: no more merges pending; now return
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: startCommit(): start
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: startCommit index=_0(3.6.1):C2/1 _2(3.6.1):C2/1 _5(3.6.1):C1 changeCount=4
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: done all syncs
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: commit: pendingCommit != null
IW 0 [Sun Mar 18 17:54:59 CST 2012; main]: commit: wrote segments file "segments_6"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: now checkpoint "segments_6" [3 segments ; isCommit = true]
IFD [Sun Mar 18 17:54:59 CST 2012; main]: deleteCommits: now decRef commit "segments_5"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: delete "_4.prx"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: unable to remove file "_4.prx": java.io.IOException: Cannot delete D:\work\eclipseworkpace\lucenelearn\charpter2-1\_4.prx; Will re-try later.
IFD [Sun Mar 18 17:54:59 CST 2012; main]: delete "_4.fnm"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: delete "_4.fdx"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: unable to remove file "_4.fdx": java.io.IOException: Cannot delete D:\work\eclipseworkpace\lucenelearn\charpter2-1\_4.fdx; Will re-try later.
IFD [Sun Mar 18 17:54:59 CST 2012; main]: delete "_4.frq"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: unable to remove file "_4.frq": java.io.IOException: Cannot delete D:\work\eclipseworkpace\lucenelearn\charpter2-1\_4.frq; Will re-try later.
IFD [Sun Mar 18 17:54:59 CST 2012; main]: delete "_4.tis"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: unable to remove file "_4.tis": java.io.IOException: Cannot delete D:\work\eclipseworkpace\lucenelearn\charpter2-1\_4.tis; Will re-try later.
IFD [Sun Mar 18 17:54:59 CST 2012; main]: delete "_4.tii"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: delete "_4.nrm"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: unable to remove file "_4.nrm": java.io.IOException: Cannot delete D:\work\eclipseworkpace\lucenelearn\charpter2-1\_4.nrm; Will re-try later.
IFD [Sun Mar 18 17:54:59 CST 2012; main]: delete "_4.fdt"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: unable to remove file "_4.fdt": java.io.IOException: Cannot delete D:\work\eclipseworkpace\lucenelearn\charpter2-1\_4.fdt; Will re-try later.
IFD [Sun Mar 18 17:54:59 CST 2012; main]: delete "segments_5"
IW 0 [Sun Mar 18 17:54:59 CST 2012; main]: commit: done
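The infoStream idea can be mimicked without Lucene: hand a component a PrintStream and have it print one prefixed line per internal step. The class below is an illustrative stand-in, not Lucene's IndexWriter.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

// Illustrative stand-in for the infoStream idea (not Lucene's IndexWriter):
// the component writes one prefixed line per internal step to whatever
// PrintStream it was handed, so output can go to System.out, a file, or
// an in-memory buffer.
public class InfoStreamDemo {

    private PrintStream infoStream;

    void setInfoStream(PrintStream out) { this.infoStream = out; }

    private void message(String msg) {
        if (infoStream != null) infoStream.println("IW: " + msg);
    }

    void commit() {
        message("commit: start");
        message("commit: done");
    }

    public static void main(String[] args) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        InfoStreamDemo writer = new InfoStreamDemo();
        writer.setInfoStream(new PrintStream(buf));
        writer.commit();
        System.out.print(buf); // IW: commit: start / IW: commit: done
    }
}
```

Passing System.out reproduces console traces like the log above; passing a buffer or file stream captures them for later inspection.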