Lucene Core Index Class and Index Structure (tii,tis,frq,nrm...)

envykok

License: CC 4.0 BY-SA

Tags: lucene, structure, class, file, each, storage

Article link: https://blog.youkuaiyun.com/envykok/article/details/6337449

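This article introduces the core indexing and searching classes of the Lucene search engine and examines the structure of Lucene's index files in detail, including how key components such as FieldInfo, StoredFields, and TermDictionary work.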

Understanding the core indexing classes

As you saw in our Indexer class, you need the following classes to perform the simplest indexing procedure:

■ IndexWriter

■ Directory

■ Analyzer

■ Document

■ Field

Indexing with Lucene breaks down into three main operations: converting data to text, analyzing it, and saving it to the index.

Understanding the core searching classes

The basic search interface that Lucene provides is as straightforward as the one for indexing. Only a few classes are needed to perform the basic search operation:

■ IndexSearcher

■ Term

■ Query

■ TermQuery

■ Hits

http://lucene.apache.org/java/2_4_0/fileformats.html

Field Info

Field names are stored in the field info file, with suffix .fnm.

FieldInfos (.fnm) --> FieldsCount, <FieldName, FieldBits> ^FieldsCount

FieldsCount --> VInt

FieldName --> String

FieldBits --> Byte
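VInt appears throughout these grammars but is not defined in this excerpt. In the Lucene file-format documentation it is a variable-length encoding for non-negative integers: each byte carries seven payload bits, lowest-order group first, and the high bit of a byte signals that another byte follows. A minimal sketch of that encoding follows; the class and method names are chosen for illustration only and are not Lucene's API.

```java
import java.io.ByteArrayOutputStream;

// Illustrative VInt codec; class and method names are hypothetical, not Lucene's API.
final class VIntCodec {
    // Encode a non-negative int: 7 payload bits per byte, least-significant group first.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit set: more bytes follow
            value >>>= 7;
        }
        out.write(value); // final byte has the high bit clear
        return out.toByteArray();
    }

    // Decode a single VInt that starts at the first byte of the array.
    static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // high bit clear: this was the last byte
            }
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(encode(127).length);     // 1
        System.out.println(encode(128).length);     // 2
        System.out.println(decode(encode(16384)));  // 16384
    }
}
```

Small values such as field counts and field numbers therefore usually cost a single byte on disk.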

Stored Fields

Stored fields are represented by two files:

The field index, or .fdx file.

This contains, for each document, a pointer to its field data, as follows:

FieldIndex (.fdx) --> <FieldValuesPosition> ^SegSize

FieldValuesPosition --> Uint64

This is used to find the location within the field data file of the fields of a particular document. Because it contains fixed-length data, this file may be easily randomly accessed. The position of document n's field data is the Uint64 at n*8 in this file.
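The n*8 rule can be sketched as follows, assuming the .fdx contents are laid out exactly as described here (no file header) and that Uint64 values are stored high-order byte first, as Lucene writes its fixed-width integers. The class name and the in-memory byte array are illustrative assumptions.

```java
import java.nio.ByteBuffer;

// Sketch of random access into .fdx contents held in memory; names are illustrative.
final class FieldIndexLookup {
    // Return the .fdt position of document docNum: the Uint64 stored at byte offset docNum*8.
    static long fieldDataPosition(byte[] fdx, int docNum) {
        // ByteBuffer is big-endian by default, matching high-order-byte-first storage.
        return ByteBuffer.wrap(fdx).getLong(docNum * 8);
    }

    public static void main(String[] args) {
        // Build a fake three-document .fdx whose pointers are 0, 57 and 130.
        ByteBuffer fdx = ByteBuffer.allocate(3 * 8);
        fdx.putLong(0L).putLong(57L).putLong(130L);
        System.out.println(fieldDataPosition(fdx.array(), 2)); // 130
    }
}
```

No scanning is needed: one multiplication gives the offset, and one 8-byte read gives the pointer into the .fdt file.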

The field data, or .fdt file.

This contains the stored fields of each document, as follows:

FieldData (.fdt) --> <DocFieldData> ^SegSize

DocFieldData --> FieldCount, <FieldNum, Bits, Value> ^FieldCount

FieldCount --> VInt

FieldNum --> VInt

Term Dictionary

The term dictionary is represented as two files:

The term infos, or .tis file.

TermInfoFile (.tis)--> TIVersion, TermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermInfos

TIVersion --> UInt32

TermCount --> UInt64

IndexInterval --> UInt32

SkipInterval --> UInt32

MaxSkipLevels --> UInt32

TermInfos --> <TermInfo> ^TermCount

TermInfo --> <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>

Term --> <PrefixLength, Suffix, FieldNum>

Suffix --> String

PrefixLength, DocFreq, FreqDelta, ProxDelta, SkipDelta --> VInt

This file is sorted by Term. Terms are ordered first lexicographically (by UTF-16 character code) by the term's field name, and within that lexicographically (by UTF-16 character code) by the term's text.

TIVersion names the version of the format of this file and is -2 in Lucene 1.4.

Term text prefixes are shared. The PrefixLength is the number of initial characters from the previous term which must be prepended to a term's suffix in order to form the term's text. Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".
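The "bone"/"boy" example can be reproduced with a few lines of Java. This is only a sketch of the idea; the helper names are hypothetical and do not correspond to Lucene classes.

```java
// Sketch of shared-prefix term coding; helper names are hypothetical.
final class PrefixCoding {
    // Number of leading characters shared by the previous and the current term.
    static int sharedPrefixLength(String previous, String current) {
        int limit = Math.min(previous.length(), current.length());
        int i = 0;
        while (i < limit && previous.charAt(i) == current.charAt(i)) {
            i++;
        }
        return i;
    }

    // Rebuild a term's text from the previous term, a PrefixLength and a Suffix.
    static String rebuild(String previous, int prefixLength, String suffix) {
        return previous.substring(0, prefixLength) + suffix;
    }

    public static void main(String[] args) {
        int prefix = sharedPrefixLength("bone", "boy");
        String suffix = "boy".substring(prefix);
        System.out.println(prefix + " " + suffix);            // 2 y
        System.out.println(rebuild("bone", prefix, suffix));  // boy
    }
}
```

Because the file is sorted by term, neighbouring terms tend to share long prefixes, which is what makes this scheme pay off.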

FieldNum determines the term's field, whose name is stored in the .fnm file.

DocFreq is the count of documents which contain the term.

FreqDelta determines the position of this term's TermFreqs within the .frq file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file).

ProxDelta determines the position of this term's TermPositions within the .prx file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file). For fields with omitTf true, this will be 0 since prox information is not stored.

SkipDelta determines the position of this term's SkipData within the .frq file. In particular, it is the number of bytes after TermFreqs that the SkipData starts. In other words, it is the length of the TermFreq data. SkipDelta is only stored if DocFreq is not smaller than SkipInterval.
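Since FreqDelta and ProxDelta are differences from the previous term's position, a reader recovers absolute file positions by keeping a running sum while it walks the terms in order. A minimal sketch, with made-up delta values and a hypothetical class name:

```java
import java.util.Arrays;

// Sketch: turning per-term deltas into absolute file positions; names and data are illustrative.
final class DeltaPointers {
    // Accumulate FreqDelta (or ProxDelta) values, in term order, into absolute positions.
    static long[] absolutePositions(int[] deltas) {
        long[] positions = new long[deltas.length];
        long current = 0;
        for (int i = 0; i < deltas.length; i++) {
            current += deltas[i]; // the first term's delta is zero
            positions[i] = current;
        }
        return positions;
    }

    public static void main(String[] args) {
        int[] freqDeltas = {0, 12, 7, 30}; // invented example values
        System.out.println(Arrays.toString(absolutePositions(freqDeltas))); // [0, 12, 19, 49]
    }
}
```

Storing deltas keeps the numbers small, so each one typically fits in one or two VInt bytes.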

The term info index, or .tii file.

This contains every IndexInterval-th entry from the .tis file, along with its location in the .tis file. It is designed to be read entirely into memory and used to provide random access to the .tis file.

The structure of this file is very similar to the .tis file, with the addition of one item per record, the IndexDelta.

TermInfoIndex (.tii)--> TIVersion, IndexTermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermIndices

TIVersion --> UInt32

IndexTermCount --> UInt64

IndexInterval --> UInt32

SkipInterval --> UInt32

TermIndices --> <TermInfo, IndexDelta> ^{IndexTermCount}

IndexDelta --> VLong

IndexDelta determines the position of this term's TermInfo within the .tis file. In particular, it is the difference between the position of this term's entry in that file and the position of the previous term's entry.

SkipInterval is the fraction of TermDocs stored in skip tables. It is used to accelerate TermDocs.skipTo(int). Larger values result in smaller indexes, greater acceleration, but fewer accelerable cases, while smaller values result in bigger indexes, less acceleration (in case of a small value for MaxSkipLevels) and more accelerable cases.

MaxSkipLevels is the maximum number of skip levels stored for each term in the .frq file. A low value results in smaller indexes but less acceleration; a larger value results in slightly larger indexes but greater acceleration. See the format of the .frq file for more information about skip levels.
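To get a feel for how SkipInterval and MaxSkipLevels interact, the sketch below counts skip levels and entries per level for one term. It assumes that the number of levels is floor(log(DocFreq)/log(SkipInterval)) capped at MaxSkipLevels, as I understand the Lucene documentation, and it uses the DocFreq/(SkipInterval^(Level + 1)) entry count from the .frq grammar. The class name and sample numbers are illustrative.

```java
// Sketch of skip-list sizing under the stated assumptions; not Lucene's implementation.
final class SkipLevels {
    // Number of skip levels for a term: floor(log_skipInterval(docFreq)), capped at maxSkipLevels.
    static int numSkipLevels(int docFreq, int skipInterval, int maxSkipLevels) {
        if (skipInterval < 2) {
            throw new IllegalArgumentException("skipInterval must be at least 2");
        }
        int levels = 0;
        for (int n = docFreq; n >= skipInterval; n /= skipInterval) {
            levels++;
        }
        return Math.min(levels, maxSkipLevels);
    }

    // Entries on a level (0 is the lowest): docFreq / skipInterval^(level + 1).
    static int entriesAtLevel(int docFreq, int skipInterval, int level) {
        long divisor = 1;
        for (int i = 0; i <= level; i++) {
            divisor *= skipInterval;
        }
        return (int) (docFreq / divisor);
    }

    public static void main(String[] args) {
        System.out.println(numSkipLevels(1000, 16, 10));  // 2
        System.out.println(entriesAtLevel(1000, 16, 0));  // 62
        System.out.println(entriesAtLevel(1000, 16, 1));  // 3
    }
}
```

A term whose DocFreq is below SkipInterval gets zero levels, which is consistent with SkipDelta being stored only when DocFreq is not smaller than SkipInterval.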

Frequencies

The .frq file contains the lists of documents which contain each term, along with the frequency of the term in that document (if omitTf is false).

FreqFile (.frq) --> <TermFreqs, SkipData> ^TermCount

TermFreqs --> <TermFreq> ^DocFreq

TermFreq --> DocDelta[, Freq?]

SkipData --> <<SkipLevelLength, SkipLevel> ^{NumSkipLevels-1}, SkipLevel> <SkipDatum>

SkipLevel --> <SkipDatum> ^{DocFreq/(SkipInterval^(Level + 1))}

SkipDatum --> DocSkip,PayloadLength?,FreqSkip,ProxSkip,SkipChildLevelPointer?

DocDelta,Freq,DocSkip,PayloadLength,FreqSkip,ProxSkip --> VInt

SkipChildLevelPointer --> VLong
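The grammar lists DocDelta and Freq without saying how they combine. According to the Lucene file-format page linked earlier, when omitTf is false DocDelta/2 is the gap from the previous document number, an odd DocDelta means the frequency is one, and an even DocDelta means the frequency follows as a separate VInt. The sketch below applies that rule to values that have already been read as VInts; the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of TermFreqs decoding for fields with omitTf false; names are illustrative.
final class TermFreqsDecoder {
    // Decode {docNumber, freq} pairs from the VInt values of one TermFreqs run.
    static List<int[]> decode(int[] vints) {
        List<int[]> postings = new ArrayList<>();
        int doc = 0;
        int i = 0;
        while (i < vints.length) {
            int docDelta = vints[i++];
            doc += docDelta >>> 1; // upper bits: gap from the previous document number
            // low bit set: freq is 1; otherwise freq is the next VInt
            int freq = (docDelta & 1) != 0 ? 1 : vints[i++];
            postings.add(new int[] {doc, freq});
        }
        return postings;
    }

    public static void main(String[] args) {
        // A term occurring once in document 7 and three times in document 11.
        for (int[] p : decode(new int[] {15, 8, 3})) {
            System.out.println("doc=" + p[0] + " freq=" + p[1]);
        }
    }
}
```

Folding the common freq = 1 case into the low bit of DocDelta saves one VInt for every document in which the term occurs only once.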

Positions

The .prx file contains the lists of positions that each term occurs at within documents. Note that fields with omitTf true do not store anything into this file, and if all fields in the index have omitTf true then the .prx file will not exist.

ProxFile (.prx) --> <TermPositions> ^TermCount

TermPositions --> <Positions> ^DocFreq

Positions --> <PositionDelta,Payload?> ^Freq

Payload --> <PayloadLength?,PayloadData>

PositionDelta --> VInt

PayloadLength --> VInt

PayloadData --> byte^{PayloadLength}

TermPositions are ordered by term (the term is implicit, from the .tis file).

Positions entries are ordered by increasing document number (the document number is implicit from the .frq file).

Normalization Factors

Pre-2.1: There's a norm file for each indexed field with a byte for each document. The .f[0-9]* file contains, for each document, a byte that encodes a value that is multiplied into the score for hits on that field:

Norms (.f[0-9]*) --> <Byte> ^SegSize

2.1 and above: There's a single .nrm file containing all norms:

AllNorms (.nrm) --> NormsHeader,<Norms> ^{NumFieldsWithNorms}

Norms --> <Byte> ^SegSize

NormsHeader --> 'N','R','M',Version

Version --> Byte

NormsHeader has 4 bytes, the last of which is the format version for this file, currently -1.

Each byte encodes a floating point value. Bits 0-2 contain the 3-bit mantissa, and bits 3-7 contain the 5-bit exponent.
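A norm byte can be expanded back into a float with a little bit manipulation. The sketch below mirrors, as far as I recall it, the decoding Lucene performs in its Similarity and SmallFloat classes (a zero byte maps to zero, and the exponent is re-biased by 48); treat the constants as an assumption to check against the Lucene version you use. The class name is illustrative.

```java
// Sketch of norm-byte decoding under the stated assumptions; not copied from Lucene.
final class NormDecoder {
    // Expand one norm byte (3-bit mantissa, 5-bit exponent) into a float.
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f; // a zero byte encodes exactly zero
        }
        int mantissa = b & 0x07;        // bits 0-2
        int exponent = (b >> 3) & 0x1F; // bits 3-7
        int bits = ((exponent + 48) << 24) | (mantissa << 21);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        System.out.println(decode((byte) 124)); // 1.0
        System.out.println(decode((byte) 0));   // 0.0
    }
}
```

With only eight bits per value the precision is coarse, but a norm then costs a single byte per document and field.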

Reference:

http://blog.youkuaiyun.com/uniorg/archive/2010/12/23/6093539.aspx

http://blog.youkuaiyun.com/panglaohutrue/archive/2005/01/15/254257.aspx

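Lucene's dictionary files use neither the B-tree structure found in commercial databases nor a hash. Instead, the .tii and .tis files together form a two-level file structure.

A group point is produced in the .tis file once every group span: whenever a term's ordinal in the .tis file (counting from 0) is divisible by IndexInterval, the predecessor of the current term is saved in the .tii file as a group point (the first group point is the empty string ""). The .tii and .tis files are written alternately until the two-level structure is complete.

Some may ask why the .tii file does not use hashing. The answer is that a lookup needs to find the terms adjacent to the query term, and hashing cannot do that (a hash cannot locate ranges).

The .tii file stores pointers into the .tis file. At search time the .tii file is preloaded into memory, and a binary search finds the nearest group-point term that is less than or equal to the query term. Starting from the .tis position that this group point refers to, the terms in the .tis file are scanned sequentially until either the query term is found or a term that sorts after the query term is reached (which means the dictionary does not contain the query term).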
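The lookup just described can be sketched with two in-memory lists standing in for the two files. This is a simplified illustration: it samples every IndexInterval-th term itself rather than its predecessor, it stores list indexes instead of byte offsets, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified two-level term lookup; lists stand in for the .tis and .tii files.
final class TwoLevelTermLookup {
    private final List<String> terms;                              // sorted terms (.tis)
    private final List<String> indexTerms = new ArrayList<>();     // sampled terms (.tii)
    private final List<Integer> indexPointers = new ArrayList<>(); // positions in terms

    TwoLevelTermLookup(List<String> sortedTerms, int indexInterval) {
        this.terms = sortedTerms;
        for (int i = 0; i < sortedTerms.size(); i += indexInterval) {
            indexTerms.add(sortedTerms.get(i));
            indexPointers.add(i);
        }
    }

    // Return the position of query in the term list, or -1 if it is absent.
    int find(String query) {
        // Step 1: binary search the in-memory index for the last sampled term <= query.
        int lo = 0;
        int hi = indexTerms.size() - 1;
        int block = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (indexTerms.get(mid).compareTo(query) <= 0) {
                block = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (block < 0) {
            return -1; // query sorts before every term
        }
        // Step 2: scan sequentially from the position the index entry points to.
        for (int i = indexPointers.get(block); i < terms.size(); i++) {
            int cmp = terms.get(i).compareTo(query);
            if (cmp == 0) {
                return i;
            }
            if (cmp > 0) {
                break; // passed the place where query would be
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> sorted = List.of("apple", "bone", "boy", "cat", "dog", "emu", "fox");
        TwoLevelTermLookup lookup = new TwoLevelTermLookup(sorted, 3);
        System.out.println(lookup.find("boy")); // 2
        System.out.println(lookup.find("cow")); // -1
    }
}
```

The sequential scan never visits more than IndexInterval terms, so the interval trades memory used by the in-memory index against the length of each scan.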

http://javenstudio.org/blog/lucene-indexfile-structure-4

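3.2.3.3 Term dictionary (.tii and .tis)

The term dictionary is stored in the following two files. The first is the file that stores term information (TermInfoFile), that is, the .tis file, whose format is as follows: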

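Version	Item	Count	Type	Description
All versions	TIVersion	1	UInt32	Version of this file's format; -2 in Lucene 1.4
	TermCount	1	UInt64
	IndexInterval	1	UInt32
	SkipInterval	1	UInt32
	MaxSkipLevels	1	UInt32
	TermInfos	1	TermInfo…
	TermInfos->TermInfo	TermCount	TermInfo
	TermInfo->Term	TermCount	Term
	Term->PrefixLength	TermCount	VInt	Term text prefixes are shared. This value is the number of initial characters from the previous term that must be prepended to this term's suffix to form the term's text. For example, if the previous term is "bone" and the current term is "boy", PrefixLength is 2 and the suffix is "y".
	Term->Suffix	TermCount	String	See above
	Term->FieldNum	TermCount	VInt	Determines the term's field, whose name is stored in the .fnm file.
	TermInfo->DocFreq	TermCount	VInt	Number of documents that contain the term
	TermInfo->FreqDelta	TermCount	VInt	Determines the position of this term's TermFreqs within the .frq file. Specifically, it is the difference between the position of this term's data in that file and the position of the previous term's data; it is 0 for the first term.
	TermInfo->ProxDelta	TermCount	VInt	Determines the position of this term's TermPositions within the .prx file. Specifically, it is the difference between the position of this term's data in that file and the position of the previous term's data; it is 0 for the first term. For fields with omitTf set to true it is also 0, because prox information is not stored.
	TermInfo->SkipDelta	TermCount	VInt	Determines the position of this term's SkipData within the .frq file. Specifically, it is the number of bytes after TermFreqs at which the SkipData starts; in other words, it is the length of the TermFreq data. SkipDelta is stored only if DocFreq is not smaller than SkipInterval.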

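The TermInfoFile is sorted by Term: first by the term's field name (in UTF-16 character code order), and within that by the term's text (also in UTF-16 character code order).

The structure is shown in the following figure:

The other file is the index of the term information, that is, the .tii file. It contains every IndexInterval-th entry from the .tis file, along with its location in the .tis file. It is designed to be read entirely into memory and used to provide random access to the .tis file. Its structure is very similar to that of the .tis file, with one additional item per record, the IndexDelta. The format is as follows: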

Version	Item	Count	Type	Description
All versions	TIVersion	1	UInt32	Same as in .tis
	IndexTermCount	1	UInt64	Same as in .tis
	IndexInterval	1	UInt32	Same as in .tis
	SkipInterval	1	UInt32	The fraction of TermDocs stored in skip tables; it is used to accelerate TermDocs.skipTo(int). Larger values result in smaller indexes and greater acceleration but fewer accelerable cases, while smaller values result in bigger indexes, less acceleration (in case of a small value for MaxSkipLevels), and more accelerable cases.
	MaxSkipLevels	1	UInt32	The maximum number of skip levels stored for each term in the .frq file. A low value results in smaller indexes but less acceleration; a larger value results in slightly larger indexes but greater acceleration. See the .frq file format for details about skip levels.
	TermIndices	IndexTermCount	TermIndice	Same as in .tis
	TermIndice->TermInfo	IndexTermCount	TermInfo	Same as in .tis
	TermIndice->IndexDelta	IndexTermCount	VLong	Determines the position of this term's TermInfo within the .tis file. Specifically, it is the difference between the position of this term's entry and the position of the previous term's entry.

The structure is shown in the following figure: