A Lucene segment contains all the index files, such as .tim and .tip, and can be regarded as a mini standalone index

Latest recommended article published on 2023-06-03 11:42:33

weixin_33849942


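This post looks at the concept of index segments in Lucene and how they work. Each segment can be viewed as a mini index or shard that contains all the files it needs, such as .tim and .tip. When documents are added, Lucene usually creates a new segment to avoid the cost of reindexing. When a search runs, the query is fanned out over all segments, so that from the client's perspective the ranking is the same as when searching a single-segment index.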

A Lucene index segment can be viewed as a "mini" index or a shard. Each segment is a collection of all the files needed for an index, including .tim and .tip. If you list your Lucene index directory, you'll see that files belonging to the same segment share the same base name and differ only in their extension. In fact, if you force a merge down to one segment, you'll end up with an index made of a single segment.
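To make the naming convention concrete, here is a minimal Python sketch that groups a directory listing by segment. It is not part of Lucene: the helper `group_by_segment`, the sample listing, and the assumed naming pattern (`_<segment>` optionally followed by a codec suffix, then an extension) are illustrative assumptions, and real file names vary by Lucene version.

```python
import re
from collections import defaultdict

# Assumed naming scheme: "_<segment>[_<codec suffix>].<extension>".
SEGMENT_FILE = re.compile(r"(_[0-9a-z]+)(?:_[^.]*)?\.([A-Za-z0-9]+)")

def group_by_segment(filenames):
    """Map each segment name to the sorted list of its file extensions."""
    groups = defaultdict(list)
    for name in filenames:
        match = SEGMENT_FILE.fullmatch(name)
        if match:  # skips non-segment files such as segments_N and write.lock
            groups[match.group(1)].append(match.group(2))
    return {segment: sorted(exts) for segment, exts in groups.items()}

# Hypothetical directory listing with two segments.
listing = ["_0.si", "_0.tim", "_0.tip", "_1.si", "_1.tim", "_1.tip",
           "segments_2", "write.lock"]
grouped = group_by_segment(listing)
```

With this listing, `grouped` has one entry per segment (`_0` and `_1`), each holding its own `.tim` and `.tip` among other files.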

Each segment contains an index of a subset of your document collection. Lucene usually creates a new segment when new documents are added to a working index, to avoid (or rather, to delay and batch for later) the cost of reindexing.
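The append-then-merge behaviour can be pictured with a toy model. The class `ToySegmentedIndex` below is hypothetical and deliberately simplistic (a segment is just a tuple of document strings); it only illustrates that new documents land in a new segment while existing segments stay untouched, and that a forced merge collapses everything into one.

```python
class ToySegmentedIndex:
    """Hypothetical toy model of a segmented index; not Lucene's API."""

    def __init__(self):
        self.segments = []  # each segment is an immutable tuple of documents

    def add_documents(self, docs):
        # New documents go into a brand-new segment; nothing is rewritten.
        self.segments.append(tuple(docs))

    def force_merge(self):
        # Pay the rewriting cost once, for all segments together.
        merged = tuple(doc for segment in self.segments for doc in segment)
        self.segments = [merged]

index = ToySegmentedIndex()
index.add_documents(["lucene segment", "term dictionary"])
index.add_documents(["solr cloud"])
segments_before = len(index.segments)

index.force_merge()
segments_after = len(index.segments)
```

Here `segments_before` is 2 and `segments_after` is 1, with all three documents now in the single remaining segment.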

When a search is executed, Lucene fans the query out over all segments, and all the index-wide statistics required for relevance ranking (such as idf) are combined, so from the client's perspective the ranking is the same as when searching an index of one segment. Note that the other famous stat, tf, is per-document, so it is already available at the segment reader layer.
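A small sketch shows why combining the statistics makes the segment layout invisible to ranking. The functions below are illustrative, not Lucene code; the idf formula is the BM25-style form `log(1 + (N - n + 0.5) / (n + 0.5))`, assumed here for the example, and the documents are made up.

```python
import math

def doc_freq(segment, term):
    """Number of documents in one segment that contain the term."""
    return sum(1 for doc in segment if term in doc.split())

def idf(doc_count, df):
    # BM25-style idf, assumed for illustration.
    return math.log(1 + (doc_count - df + 0.5) / (df + 0.5))

def index_wide_idf(segments, term):
    # Sum the per-segment statistics first, then compute idf once.
    total_docs = sum(len(segment) for segment in segments)
    total_df = sum(doc_freq(segment, term) for segment in segments)
    return idf(total_docs, total_df)

segments = [
    ["lucene index", "lucene segment"],
    ["solr cloud", "lucene merge", "term dictionary"],
]
single_segment = [[doc for segment in segments for doc in segment]]

multi = index_wide_idf(segments, "lucene")
single = index_wide_idf(single_segment, "lucene")
```

Because the document counts and document frequencies are summed before idf is computed, `multi` and `single` are identical.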

Now things get more interesting when you have Lucene indexes spread across machines (as is the case in Solr Cloud, which is one of the distributed search services built on Lucene). Due to performance and complexity concerns, Solr Cloud doesn't aggregate global stats across the cluster (yet), so each machine uses its own stats for the index it holds (which could consist of multiple segments :).
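The effect of purely local statistics can be seen with made-up numbers. In the sketch below, the shard names, document counts, and document frequencies are all hypothetical, and the BM25-style idf formula is again an assumption for illustration; it does not reproduce Solr's actual scoring code.

```python
import math

def idf(doc_count, df):
    # BM25-style idf, assumed for illustration.
    return math.log(1 + (doc_count - df + 0.5) / (df + 0.5))

# Hypothetical per-shard statistics for the term "lucene".
shards = {
    "shard1": {"doc_count": 100, "doc_freq": 50},  # term is common here
    "shard2": {"doc_count": 100, "doc_freq": 2},   # term is rare here
}

# Each shard scores with only the statistics of the index it holds.
local_idf = {name: idf(s["doc_count"], s["doc_freq"]) for name, s in shards.items()}

# What a globally aggregated statistic would give instead.
global_idf = idf(
    sum(s["doc_count"] for s in shards.values()),
    sum(s["doc_freq"] for s in shards.values()),
)
```

In this setup the same term gets a low idf on `shard1` and a high idf on `shard2`, with the global value falling in between, so scores from different shards are not strictly comparable when the term distribution is skewed.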

Source: https://www.quora.com/Are-the-individual-tim-and-tip-files-term-dictionaries-of-a-Lucene-index-segment-updated-when-a-new-segment-is-added-to-Lucene