Reading Notes on "Lucene in Action"

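These notes cover performance tuning for Lucene applications, the key classes, and storage strategies, including the core indexing classes, the core searching classes, the field storage options, and the performance-tuning steps. They also touch on practical advice about memory-mapped directories, concurrency, thread usage, and hardware upgrades, with the aim of helping developers make Lucene applications run more efficiently.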

P27: The core indexing classes you need to understand: IndexWriter, Directory, Analyzer, Document, Field

P29: The core searching classes you need to understand: IndexSearcher, Term, Query, TermQuery, TopDocs

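P34: When you retrieve search results, you get back only the stored fields. For example, a field that was indexed but not stored will not appear in the returned document. This is a frequent source of confusion.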

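P42: Unfortunately, although updating a document in place is a frequently requested feature, Lucene cannot do it. Instead, it deletes the entire previous document and then adds a new one. This requires the new document to contain all of the fields, including those that did not change from the original document.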

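P45: Store.NO: do not store the field's value. This option is often used together with Index.ANALYZED for large text fields whose content does not need to be retrieved, such as the body of a web page or other text content.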

P47: Summary table of field options, including how each field is created and example uses
Index | Store | TermVector | Example usage
NOT_ANALYZED_NO_NORMS | YES | NO | Identifiers (filenames, primary keys), telephone and Social Security numbers, URLs, personal names, dates, and textual fields for sorting
ANALYZED | YES | WITH_POSITIONS_OFFSETS | Document title, document abstract
ANALYZED | NO | WITH_POSITIONS_OFFSETS | Document body
NO | YES | NO | Document type, database primary key (if not used for searching)
NOT_ANALYZED | NO | NO | Hidden keywords

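P51: Note: if you turn off norms partway through indexing, you must rebuild the entire index. If even a single document has norms for a field, then once segments are merged every document consumes one byte for that field, even the documents indexed with norms turned off. This happens because Lucene does not store norms sparsely.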

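P56: While an index is being optimized, Lucene needs extra disk space, at peak up to three times the original size of the index. Once optimization completes, the index is usually smaller than it was before.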

P58: MMapDirectory: a Directory that accesses files through memory-mapped I/O. It is a good choice on a 64-bit JRE, or on a 32-bit JRE when the index is relatively small.

However, on a machine with enough RAM, most operating systems use free memory as an I/O cache. This means that, after warming up, FSDirectory will be almost as fast as RAMDirectory.

P67
Lucene uses a simple approach to record deleted documents in the index: the document is marked as deleted in a bit array, which is a quick operation, but the data corresponding to that document still consumes disk space in the index. This technique is necessary because in an inverted index, a given document’s terms are scattered all over the place, and it’d be impractical to try to reclaim that space when the document is deleted. It’s not until segments are merged, either by normal merging over time or by an explicit call to optimize, that these bytes are reclaimed.
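
The mark-then-reclaim behavior in this passage can be illustrated with a toy model. This is only a sketch in plain Java, not Lucene's actual implementation: `ToySegment` and its methods are hypothetical names, and a `java.util.BitSet` stands in for the deletion bit array.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy model of an index segment. Hypothetical; not Lucene's real data structure. */
class ToySegment {
    private final List<String> docs = new ArrayList<>();
    private final BitSet deleted = new BitSet(); // one bit per document slot

    /** Appends a document and returns its number within this segment. */
    int add(String doc) {
        docs.add(doc);
        return docs.size() - 1;
    }

    /** Deleting only flips a bit; the document's data stays where it is. */
    void delete(int docId) {
        if (docId < 0 || docId >= docs.size()) {
            throw new IndexOutOfBoundsException("no such document: " + docId);
        }
        deleted.set(docId);
    }

    /** Number of occupied slots, deleted documents included. */
    int maxDoc() {
        return docs.size();
    }

    /** Number of live (not deleted) documents. */
    int numDocs() {
        return docs.size() - deleted.cardinality();
    }

    /** Merging copies only live documents into a new segment, which reclaims the space. */
    static ToySegment merge(ToySegment... segments) {
        ToySegment merged = new ToySegment();
        for (ToySegment s : segments) {
            for (int i = 0; i < s.docs.size(); i++) {
                if (!s.deleted.get(i)) {
                    merged.add(s.docs.get(i));
                }
            }
        }
        return merged;
    }
}
```

After a delete, `maxDoc()` stays the same while `numDocs()` drops by one; only the segment returned by `merge` is actually smaller.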

P76
Lucene’s primary searching API
Class | Purpose
IndexSearcher | Gateway to searching an index. All searches come through an IndexSearcher instance using any of the several overloaded search methods.
Query (and subclasses) | Concrete subclasses encapsulate logic for a particular query type. Instances of Query are passed to an IndexSearcher’s search method.
QueryParser | Processes a human-entered (and readable) expression into a concrete Query object.
TopDocs | Holds the top scoring documents, returned by IndexSearcher.search.
ScoreDoc | Provides access to each search result in TopDocs.

P147
The only built-in analyzer capable of doing anything useful with Asian text is the StandardAnalyzer, which recognizes some ranges of the Unicode space as CJK characters and tokenizes them individually.

P346
A well-tuned Lucene application is like a well-maintained car: it will operate for years without problems, requiring only a small, informed investment on your part. You can take pride in that!

P347
Ask yourself, honestly (use a mirror, if necessary): would your time be better spent improving the user interface or tuning relevance? You can always improve performance by simply rolling out more or faster hardware, so always consider that option first.

P348
Simple performance-tuning steps:
Upgrade to the latest release of Lucene.
Upgrade to the latest release of Java; then try tuning the JVM’s performance settings.
Run your JVM with the -server switch.
Use a local file system for your index.
Run a Java profiler, or collect your own rough timing using System.nanoTime, to verify your performance problem is in fact Lucene and not elsewhere in your application stack.
Do not reopen IndexWriter or IndexReader/IndexSearcher any more frequently than required.
Use multiple threads.
Use faster hardware.
Put as much physical memory as you can in your computers, and configure Lucene to take advantage of all of it.
Budget enough memory, CPU, and file descriptors for your peak usage.
Turn off any fields or features that your application isn’t using. Be ruthless!
Group multiple text fields into a single text field.?
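
For the rough-timing step in the list above, a helper along these lines may be enough. It is a minimal sketch: `RoughTimer` and its methods are made-up names, and the `Runnable` stands in for whatever search or indexing call you want to measure.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for coarse wall-clock timing; not part of Lucene. */
final class RoughTimer {
    private RoughTimer() {}

    /** Runs the task once and returns the elapsed time in milliseconds. */
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    /** Runs the task several times and returns the average time per run in milliseconds. */
    static double averageMillis(Runnable task, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / 1_000_000.0 / runs;
    }
}
```

Timing the Lucene call and the surrounding application code separately gives a first indication of where the time actually goes, before reaching for a full profiler.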

P356
If you’re not on Windows, use NIOFSDirectory, which has better concurrency, instead of FSDirectory. If you’re running with a 64-bit JVM, try MMapDirectory as well.
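
Memory-mapped file access, the mechanism behind MMapDirectory, can be tried with the JDK alone. The following is a minimal sketch using `FileChannel.map`; the class `MmapRead` and its method are hypothetical, and it shows only the JDK mechanism, not how Lucene uses it internally.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical demo of reading a file through a memory mapping. */
final class MmapRead {
    private MmapRead() {}

    /** Maps the whole file read-only and sums its bytes as unsigned values. */
    static long sumBytes(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // A single MappedByteBuffer covers at most Integer.MAX_VALUE bytes;
            // this sketch does not handle larger files.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long sum = 0;
            while (buf.hasRemaining()) {
                sum += buf.get() & 0xFF; // reads come from the mapped pages
            }
            return sum;
        }
    }
}
```

Mapped files consume virtual address space rather than Java heap, which is why the note above ties the advice to 64-bit JVMs.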

P356
Be sure you’re using enough threads to fully utilize the computer’s hardware. Increase the thread count until throughput no longer improves, but don’t add so many threads that latency gets worse. There’s a sweet spot: find it!

P357
Therefore, it’s critical to use threads for indexing and searching. Otherwise, you’re simply not fully utilizing the computer. It’s like buying a Corvette and driving it no faster than 20 mph!
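
One way to look for the thread-count sweet spot mentioned on P356 is to run the same batch of work with increasing pool sizes and compare throughput. The following is a sketch under assumptions: `ThreadSweep` is a hypothetical class, and the `Runnable` stands in for one search or one document-indexing call against a shared, thread-safe Lucene object.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical harness for measuring throughput at a given thread count. */
final class ThreadSweep {
    private ThreadSweep() {}

    /** Runs `tasks` copies of `work` on a fixed-size pool and returns tasks per second. */
    static double throughput(int threads, int tasks, Runnable work) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            long start = System.nanoTime();
            List<Future<?>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                futures.add(pool.submit(work));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for completion and surface any failure
            }
            double seconds = (System.nanoTime() - start) / 1e9;
            return tasks / seconds;
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling `throughput` with 1, 2, 4, 8, ... threads and the same workload shows where the numbers stop improving. Per-request latency would need to be recorded separately, since this sketch measures throughput only.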

P358
Lucene is thread-safe: sharing IndexSearcher, IndexReader, IndexWriter, and so forth across multiple threads is perfectly fine.

P365
Figure 11.3 shows the disk usage over time while indexing all documents from Wikipedia, finishing with an optimize call. The final disk usage was 14.2 GB, but the peak disk usage was 32.4 GB, which was reached while several large concurrent merges were running.

P367
A good rule of thumb is to measure the total size of your index. Let’s call that X. Then make sure that, at all times, you have two times that amount of free disk space on the file system where the index is stored.
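
This rule of thumb can be turned into a simple check. The following is a sketch that assumes the index lives in a single directory; `DiskHeadroom` and its methods are hypothetical names, and the factor of 2 comes from the quoted passage.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Hypothetical check of free disk space against the index size. */
final class DiskHeadroom {
    private DiskHeadroom() {}

    /** Total size in bytes of all regular files under the index directory (the X above). */
    static long indexSizeBytes(Path indexDir) throws IOException {
        try (Stream<Path> files = Files.walk(indexDir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> {
                            try {
                                return Files.size(p);
                            } catch (IOException e) {
                                throw new UncheckedIOException(e);
                            }
                        })
                        .sum();
        }
    }

    /** True if the free space is at least factor times the index size. */
    static boolean hasHeadroom(long indexBytes, long freeBytes, int factor) {
        return freeBytes >= Math.multiplyExact(indexBytes, (long) factor);
    }

    /** Checks the file system that holds indexDir against the two-times rule. */
    static boolean hasHeadroom(Path indexDir) throws IOException {
        long free = Files.getFileStore(indexDir).getUsableSpace();
        return hasHeadroom(indexSizeBytes(indexDir), free, 2);
    }
}
```

A check like this could run before a large merge or optimize, or periodically as part of monitoring.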

P373
In a production server environment, you should set both of these sizes [the JVM's initial and maximum heap sizes, -Xms and -Xmx] to the same value, so the JVM doesn’t spend time growing and shrinking the heap.

P373
On Unix, run vmstat 1 to print virtual memory statistics, once per second.
Using top on Unix, check the Mem: line.
For example, on Linux there is a kernel parameter called swappiness; setting it to 0 forces the OS to never swap out RAM for I/O cache.

P374
During indexing, one big usage of RAM is the buffer used by IndexWriter, which you can control with setRAMBufferSizeMB.