Lucene Core Index Class and Index Structure (tii,tis,frq,nrm...)

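This article introduces the core indexing and searching classes of the Lucene search engine and examines the structure of Lucene's index files, including how key components such as FieldInfo, StoredFields, and TermDictionary work.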


 

Understanding the core indexing classes

As you saw in our Indexer class, you need the following classes to perform the simplest indexing procedure:

IndexWriter

Directory

Analyzer

Document

 

Field

 

 

Indexing with Lucene breaks down into three main operations: converting data to text, analyzing it, and saving it to the index.

 

 

 

Understanding the core searching classes

The basic search interface that Lucene provides is as straightforward as the one for indexing. Only a few classes are needed to perform the basic search operation:

IndexSearcher

Term

Query

TermQuery

Hits

 

 

 

http://lucene.apache.org/java/2_4_0/fileformats.html

Field Info 

Field names are stored in the field info file, with suffix .fnm.

FieldInfos (.fnm) --> FieldsCount, <FieldName, FieldBits>^FieldsCount

FieldsCount --> VInt

FieldName --> String

FieldBits --> Byte
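
The .fnm grammar above can be made concrete with a small round trip. The following is a minimal sketch, not Lucene's own code. It assumes the VInt encoding from the Lucene file format page (seven data bits per byte, low-order group first, high bit set when more bytes follow) and assumes a String is written as a VInt byte length followed by UTF-8 bytes. The class name FieldInfosSketch and its methods are hypothetical, and FieldBits is treated as an opaque byte.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: encodes and decodes the FieldInfos layout
// FieldsCount, <FieldName, FieldBits>^FieldsCount in memory.
final class FieldInfosSketch {

    // Writes a VInt: 7 data bits per byte, low-order group first,
    // high bit set on every byte except the last.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Reads a VInt written by writeVInt.
    static int readVInt(ByteBuffer in) {
        int b = in.get() & 0xFF;
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.get() & 0xFF;
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // Encodes field name -> FieldBits pairs in insertion order.
    static byte[] encode(Map<String, Integer> fields) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, fields.size());                  // FieldsCount
        for (Map.Entry<String, Integer> e : fields.entrySet()) {
            byte[] name = e.getKey().getBytes(StandardCharsets.UTF_8);
            writeVInt(out, name.length);                // assumed String layout
            out.write(name, 0, name.length);            // FieldName
            out.write(e.getValue());                    // FieldBits (one byte)
        }
        return out.toByteArray();
    }

    // Decodes the bytes produced by encode.
    static Map<String, Integer> decode(byte[] data) {
        ByteBuffer in = ByteBuffer.wrap(data);
        int fieldsCount = readVInt(in);
        Map<String, Integer> fields = new LinkedHashMap<>();
        for (int i = 0; i < fieldsCount; i++) {
            byte[] name = new byte[readVInt(in)];
            in.get(name);
            fields.put(new String(name, StandardCharsets.UTF_8), in.get() & 0xFF);
        }
        return fields;
    }
}
```

With this encoding the value 300 occupies two bytes, 0xAC followed by 0x02, while any value below 128 fits in one byte.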

 

 

Stored Fields 

Stored fields are represented by two files:

  1. The field index, or .fdx file.

    This contains, for each document, a pointer to its field data, as follows:

    FieldIndex (.fdx) --> <FieldValuesPosition>^SegSize

    FieldValuesPosition --> Uint64

    This is used to find the location within the field data file of the fields of a particular document. Because it contains fixed-length data, this file may be easily randomly accessed. The position of document n's field data is the Uint64 at n*8 in this file.

  2. The field data, or .fdt file.

    This contains the stored fields of each document, as follows:

    FieldData (.fdt) --> <DocFieldData>^SegSize

    DocFieldData --> FieldCount, <FieldNum, Bits, Value>^FieldCount

    FieldCount --> VInt

    FieldNum --> VInt
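
The .fdx lookup rule described above (the position of document n's field data is the Uint64 at n*8) amounts to one absolute read. Below is a minimal sketch that follows the layout exactly as stated in this section. It assumes the 64-bit values are stored high byte first and that the buffer holds nothing but those values; real index files may carry extra header bytes that this sketch ignores. FieldIndexSketch is a hypothetical name.

```java
import java.nio.ByteBuffer;

// Hypothetical helper: random access into raw .fdx contents.
final class FieldIndexSketch {

    // Returns the offset in the .fdt file at which the stored fields
    // of document docId begin, read from the .fdx bytes.
    static long fieldDataPosition(ByteBuffer fdx, int docId) {
        // ByteBuffer is big-endian by default, matching the assumed layout.
        return fdx.getLong(docId * 8);
    }

    // Builds an in-memory .fdx image from a list of .fdt offsets.
    static ByteBuffer build(long... positions) {
        ByteBuffer fdx = ByteBuffer.allocate(positions.length * 8);
        for (long p : positions) {
            fdx.putLong(p);
        }
        return fdx;
    }
}
```

For an image built from the offsets 0, 57, and 130, fieldDataPosition(fdx, 1) returns 57 without touching the other entries, which is the point of keeping the records fixed-length.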

Term Dictionary

The term dictionary is represented as two files:

  1. The term infos, or .tis file.

    TermInfoFile (.tis)--> TIVersion, TermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermInfos

    TIVersion --> UInt32

    TermCount --> UInt64

    IndexInterval --> UInt32

    SkipInterval --> UInt32

    MaxSkipLevels --> UInt32

    TermInfos --> <TermInfo>^TermCount

    TermInfo --> <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>

    Term --> <PrefixLength, Suffix, FieldNum>

    Suffix --> String

    PrefixLength, DocFreq, FreqDelta, ProxDelta, SkipDelta --> VInt

    This file is sorted by Term. Terms are ordered first lexicographically (by UTF16 character code) by the term's field name, and within that lexicographically (by UTF16 character code) by the term's text.

    TIVersion names the version of the format of this file and is -2 in Lucene 1.4.

    Term text prefixes are shared. The PrefixLength is the number of initial characters from the previous term which must be pre-pended to a term's suffix in order to form the term's text. Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".

    FieldNum determines the term's field, whose name is stored in the .fnm file.

    DocFreq is the count of documents which contain the term.

    FreqDelta determines the position of this term's TermFreqs within the .frq file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file).

    ProxDelta determines the position of this term's TermPositions within the .prx file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file). For fields with omitTf true, this will be 0 since prox information is not stored.

    SkipDelta determines the position of this term's SkipData within the .frq file. In particular, it is the number of bytes after TermFreqs that the SkipData starts. In other words, it is the length of the TermFreq data. SkipDelta is only stored if DocFreq is not smaller than SkipInterval.
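
Prefix sharing and the delta-coded pointers described for the .tis file can be illustrated together. The sketch below is a simplification under stated assumptions: it takes already-parsed PrefixLength, Suffix, and FreqDelta values as parallel arrays, ignores FieldNum (all terms are assumed to belong to one field), and handles only FreqDelta, since ProxDelta accumulates the same way. TermInfoDecodeSketch and Entry are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: rebuilds term text and absolute .frq offsets
// from prefix-shared, delta-coded TermInfo values.
final class TermInfoDecodeSketch {

    // One decoded entry: full term text plus absolute offset into .frq.
    static final class Entry {
        final String text;
        final long freqPointer;

        Entry(String text, long freqPointer) {
            this.text = text;
            this.freqPointer = freqPointer;
        }
    }

    static List<Entry> decode(int[] prefixLengths, String[] suffixes, long[] freqDeltas) {
        List<Entry> entries = new ArrayList<>();
        String previous = "";   // the first term has nothing to share
        long freqPointer = 0;   // deltas are relative to the previous term
        for (int i = 0; i < suffixes.length; i++) {
            // Take PrefixLength characters from the previous term, then append the suffix.
            String text = previous.substring(0, prefixLengths[i]) + suffixes[i];
            freqPointer += freqDeltas[i];
            entries.add(new Entry(text, freqPointer));
            previous = text;
        }
        return entries;
    }
}
```

Feeding in the records (0, "bone", 0), (2, "y", 12), and (0, "cat", 5) yields "bone" at offset 0, "boy" at offset 12, and "cat" at offset 17. The second record is the "bone"/"boy" case from the text.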

  2. The term info index, or .tii file.

    This contains every IndexInterval-th entry from the .tis file, along with its location in the .tis file. This is designed to be read entirely into memory and used to provide random access to the .tis file.

    The structure of this file is very similar to the .tis file, with the addition of one item per record, the IndexDelta.

    TermInfoIndex (.tii)--> TIVersion, IndexTermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermIndices

    TIVersion --> UInt32

    IndexTermCount --> UInt64

    IndexInterval --> UInt32

    SkipInterval --> UInt32

    TermIndices --> <TermInfo, IndexDelta>^IndexTermCount

    IndexDelta --> VLong

    IndexDelta determines the position of this term's TermInfo within the .tis file. In particular, it is the difference between the position of this term's entry in that file and the position of the previous term's entry.

    SkipInterval is the fraction of TermDocs stored in skip tables. It is used to accelerate TermDocs.skipTo(int). Larger values result in smaller indexes, greater acceleration, but fewer accelerable cases, while smaller values result in bigger indexes, less acceleration (in case of a small value for MaxSkipLevels) and more accelerable cases.

    MaxSkipLevels is the maximum number of skip levels stored for each term in the .frq file. A low value results in smaller indexes but less acceleration; a larger value results in slightly larger indexes but greater acceleration. See the format of the .frq file for more information about skip levels.
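
To get a feel for how SkipInterval and MaxSkipLevels trade index size against acceleration, one can count the skip entries per level using the SkipLevel rule from the .frq grammar, DocFreq/(SkipInterval^(Level + 1)). The following is a rough sketch under the assumption that the division is an integer division and that a level is stored only when it has at least one entry. SkipLevelSketch is a hypothetical name, not part of Lucene.

```java
import java.util.Arrays;

// Hypothetical helper: estimates how many skip entries each level holds.
final class SkipLevelSketch {

    // Returns the entry count per skip level, level 0 first.
    // The result is empty when the term has too few documents for any skip data.
    static int[] entriesPerLevel(int docFreq, int skipInterval, int maxSkipLevels) {
        int[] counts = new int[maxSkipLevels];
        int levels = 0;
        long span = skipInterval;   // documents covered by one entry at this level
        while (levels < maxSkipLevels && docFreq / span > 0) {
            counts[levels++] = (int) (docFreq / span);
            span *= skipInterval;
        }
        return Arrays.copyOf(counts, levels);
    }
}
```

For a term with DocFreq 1000 and SkipInterval 16, this gives 62 entries on level 0 and 3 on level 1. Capping MaxSkipLevels at 1 drops the second level, which saves a few bytes but removes the longer jumps.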

Frequencies

The .frq file contains the lists of documents which contain each term, along with the frequency of the term in that document (if omitTf is false).

FreqFile (.frq) --> <TermFreqs, SkipData>^TermCount

TermFreqs --> <TermFreq>^DocFreq

TermFreq --> DocDelta[, Freq?]

SkipData --> <<SkipLevelLength, SkipLevel>^(NumSkipLevels-1), SkipLevel> <SkipDatum>

SkipLevel --> <SkipDatum>^(DocFreq/(SkipInterval^(Level + 1)))

SkipDatum --> DocSkip,PayloadLength?,FreqSkip,ProxSkip,SkipChildLevelPointer?

DocDelta,Freq,DocSkip,PayloadLength,FreqSkip,ProxSkip --> VInt

SkipChildLevelPointer --> VLong
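
The grammar above does not spell out how DocDelta and Freq interact. As I read the Lucene 2.4 file format page linked earlier, when omitTf is false DocDelta/2 is the gap to the previous document number, an odd DocDelta means the frequency is one, and an even DocDelta means the frequency follows as a separate VInt. The sketch below decodes one term's TermFreqs under that reading. It takes already-decoded VInt values as input, and TermFreqsSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: decodes one TermFreqs list (omitTf false).
final class TermFreqsSketch {

    // Turns the VInt values of one TermFreqs list into {docId, freq} pairs.
    static List<int[]> decode(int[] vints) {
        List<int[]> postings = new ArrayList<>();
        int doc = 0;
        int i = 0;
        while (i < vints.length) {
            int docDelta = vints[i++];
            doc += docDelta >>> 1;                 // upper bits: gap to previous doc
            int freq = (docDelta & 1) != 0
                    ? 1                            // odd: frequency is exactly one
                    : vints[i++];                  // even: frequency is the next VInt
            postings.add(new int[] {doc, freq});
        }
        return postings;
    }
}
```

The sequence 15, 8, 3 decodes to document 7 with frequency 1 and document 11 with frequency 3. Folding the common frequency-one case into the low bit saves one byte for every such posting.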

 

 

 

Positions

The .prx file contains the lists of positions that each term occurs at within documents. Note that fields with omitTf true do not store anything into this file, and if all fields in the index have omitTf true then the .prx file will not exist.

ProxFile (.prx) --> <TermPositions>^TermCount

TermPositions --> <Positions>^DocFreq

Positions --> <PositionDelta,Payload?>^Freq

Payload --> <PayloadLength?,PayloadData>

PositionDelta --> VInt

PayloadLength --> VInt

PayloadData --> byte^PayloadLength

TermPositions are ordered by term (the term is implicit, from the .tis file).

Positions entries are ordered by increasing document number (the document number is implicit from the .frq file).
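
Because both the term and the document number are implicit, a reader has to walk the .prx data in step with the frequencies from the .frq file. The sketch below shows that walk for the simplest case. It assumes payloads are disabled, so that each PositionDelta is the plain gap to the previous position within the same document (starting from zero for each document). TermPositionsSketch is a hypothetical name.

```java
// Hypothetical helper: decodes one term's TermPositions (payloads disabled).
final class TermPositionsSketch {

    // positionDeltas: the VInt values of one TermPositions entry.
    // freqs: the term's frequency in each document, in .frq order.
    // Returns the absolute positions grouped per document.
    static int[][] decode(int[] positionDeltas, int[] freqs) {
        int[][] positions = new int[freqs.length][];
        int next = 0;
        for (int d = 0; d < freqs.length; d++) {
            positions[d] = new int[freqs[d]];
            int position = 0;   // deltas restart at zero for every document
            for (int k = 0; k < freqs[d]; k++) {
                position += positionDeltas[next++];
                positions[d][k] = position;
            }
        }
        return positions;
    }
}
```

A term with frequencies 1 and 2 in two consecutive documents and the deltas 4, 5, 4 resolves to position 4 in the first document and positions 5 and 9 in the second.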

 

 

Normalization Factors

Pre-2.1: There's a norm file for each indexed field with a byte for each document. The .f[0-9]* file contains, for each document, a byte that encodes a value that is multiplied into the score for hits on that field:

Norms (.f[0-9]*) --> <Byte>^SegSize

2.1 and above: There's a single .nrm file containing all norms:

AllNorms (.nrm) --> NormsHeader,<Norms>^NumFieldsWithNorms

Norms --> <Byte>^SegSize

NormsHeader --> 'N','R','M',Version

Version --> Byte

NormsHeader has 4 bytes, last of which is the format version for this file, currently -1.

Each byte encodes a floating point value. Bits 0-2 contain the 3-bit mantissa, and bits 3-7 contain the 5-bit exponent.
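
The byte-to-float conversion can be sketched in a few lines. The code below follows my recollection of the decoding in Lucene's Similarity class of that era: a zero byte maps to 0.0, otherwise 48 is added to the 5-bit exponent and the result is placed, together with the 3-bit mantissa, into the bit pattern of an IEEE single float. Treat it as an illustration rather than the library's exact source. NormDecodeSketch is a hypothetical name.

```java
// Hypothetical helper: decodes a single norm byte into a float.
final class NormDecodeSketch {

    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;                      // zero is a special case
        }
        int mantissa = b & 0x07;              // low three bits
        int exponent = (b >> 3) & 0x1F;       // next five bits
        // Re-bias the exponent and place both parts into float bits.
        int bits = ((exponent + 48) << 24) | (mantissa << 21);
        return Float.intBitsToFloat(bits);
    }
}
```

Under this scheme the byte 124 (0x7C) decodes to 1.0 and the byte 120 (0x78) decodes to 0.5. With only three mantissa bits the precision is coarse, which is the price of storing one byte per document and field.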

 

 

Reference:

http://blog.youkuaiyun.com/uniorg/archive/2010/12/23/6093539.aspx

 

http://blog.youkuaiyun.com/panglaohutrue/archive/2005/01/15/254257.aspx

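Lucene's dictionary file uses neither the B-tree structure of commercial databases nor hashing. Instead, the .tii and .tis files together form a two-level file structure.

A group point is created in the .tis file at every group interval: whenever a term's ordinal in the .tis file (counting from 0) is divisible by IndexInterval, the predecessor of the current term is saved in the .tii file as a group point (the first group point is the empty string ""). The .tii and .tis files are written alternately until the two-level structure is complete.

Why does the .tii file not use hashing? Because a lookup has to find the terms adjacent to the query term, and hashing cannot do that (a hash cannot locate a range).

The .tii file stores pointers into the .tis file. At search time the .tii file is preloaded into memory, and a binary search finds the group-point term that is nearest to and less than or equal to the query term. Starting at the .tis position that this group point refers to, the terms in the .tis file are scanned sequentially until either the query term is found or a term that sorts after the query term is reached (which means the query term is not in the dictionary).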
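
The two-step lookup just described (binary search over the in-memory index, then a short sequential scan) can be sketched with plain arrays. This is a simplified model, not Lucene's implementation: a sorted String array stands in for the .tis file, the index keeps every IndexInterval-th term itself rather than its predecessor, and only one field is considered. TermDictionarySketch is a hypothetical name.

```java
import java.util.Arrays;

// Hypothetical model of the two-level term dictionary.
final class TermDictionarySketch {
    private final String[] terms;        // stands in for the sorted .tis entries
    private final String[] indexTerms;   // stands in for the in-memory .tii entries
    private final int indexInterval;

    TermDictionarySketch(String[] sortedTerms, int indexInterval) {
        this.terms = sortedTerms;
        this.indexInterval = indexInterval;
        int n = (sortedTerms.length + indexInterval - 1) / indexInterval;
        this.indexTerms = new String[n];
        for (int i = 0; i < n; i++) {
            indexTerms[i] = sortedTerms[i * indexInterval];
        }
    }

    // Returns the ordinal of term in the dictionary, or -1 if it is absent.
    int find(String term) {
        // Step 1: binary search for the last index term <= the query term.
        int slot = Arrays.binarySearch(indexTerms, term);
        if (slot < 0) {
            slot = -slot - 2;   // insertion point minus one
        }
        if (slot < 0) {
            return -1;          // query term sorts before every entry
        }
        // Step 2: scan at most indexInterval entries of the main file.
        int end = Math.min(terms.length, (slot + 1) * indexInterval);
        for (int i = slot * indexInterval; i < end; i++) {
            int cmp = terms[i].compareTo(term);
            if (cmp == 0) {
                return i;
            }
            if (cmp > 0) {
                break;          // passed the place where the term would be
            }
        }
        return -1;
    }
}
```

String.compareTo orders by UTF-16 code unit, which matches the sort order stated for the .tis file. A larger IndexInterval shrinks the in-memory index but lengthens the sequential scan, which is the trade-off behind that header field.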

 

http://javenstudio.org/blog/lucene-indexfile-structure-4

 

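3.2.3.3 Term dictionary (.tii and .tis)

The term dictionary is stored in the following two files. The first, the .tis file, stores the term information (TermInfoFile). Its format is as follows:

Version | Item | Count | Type | Description
All versions | TIVersion | 1 | UInt32 | Records the version of this file's format; -2 in Lucene 1.4.
All versions | TermCount | 1 | UInt64 |
All versions | IndexInterval | 1 | UInt32 |
All versions | SkipInterval | 1 | UInt32 |
All versions | MaxSkipLevels | 1 | UInt32 |
All versions | TermInfos | 1 | TermInfo… |
All versions | TermInfos->TermInfo | TermCount | TermInfo |
All versions | TermInfo->Term | TermCount | Term |
All versions | Term->PrefixLength | TermCount | VInt | Term text prefixes are shared. This value is the number of initial characters from the previous term that must be prepended to this term's suffix to form the term's text. For example, if the previous term is "bone" and the current term is "boy", PrefixLength is 2 and the suffix is "y".
All versions | Term->Suffix | TermCount | String | As above.
All versions | Term->FieldNum | TermCount | VInt | Determines the term's field, whose name is stored in the .fnm file.
All versions | TermInfo->DocFreq | TermCount | VInt | The number of documents that contain the term.
All versions | TermInfo->FreqDelta | TermCount | VInt | Determines the position of this term's TermFreqs within the .frq file. Specifically, it is the difference between the position of this term's data in that file and the position of the previous term's data; it is 0 for the first term.
All versions | TermInfo->ProxDelta | TermCount | VInt | Determines the position of this term's TermPositions within the .prx file. Specifically, it is the difference between the position of this term's data in that file and the position of the previous term's data; it is 0 for the first term. For fields with omitTf set to true it is also 0, because prox information is not stored.
All versions | TermInfo->SkipDelta | TermCount | VInt | Determines the position of this term's SkipData within the .frq file. Specifically, it is the number of bytes after TermFreqs at which the SkipData starts; in other words, the length of the TermFreq data. SkipDelta is stored only if DocFreq is not smaller than SkipInterval.

The TermInfoFile is sorted by Term: first by the term's field name (in UTF-16 character code order), then by the term's text (in UTF-16 character code order).

The structure is shown in the figure below: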

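The second is the term info index, the .tii file. It contains every IndexInterval-th entry from the .tis file, stored together with that entry's location in the .tis file. It is designed to be read entirely into memory and used to provide random access to the .tis file. Its structure is very similar to that of the .tis file, with one additional item per record, the IndexDelta. The format is as follows:

Version | Item | Count | Type | Description
All versions | TIVersion | 1 | UInt32 | As in .tis.
All versions | IndexTermCount | 1 | UInt64 | As in .tis.
All versions | IndexInterval | 1 | UInt32 | As in .tis.
All versions | SkipInterval | 1 | UInt32 | The fraction of TermDocs stored in skip tables, used to accelerate TermDocs.skipTo(int). Larger values result in smaller indexes and greater acceleration but fewer accelerable cases, while smaller values result in bigger indexes, less acceleration (in case of a small value for MaxSkipLevels), and more accelerable cases.
All versions | MaxSkipLevels | 1 | UInt32 | The maximum number of skip levels stored for each term in the .frq file. A low value results in smaller indexes but less acceleration; a larger value results in slightly larger indexes but greater acceleration. See the .frq file format for details about skip levels.
All versions | TermIndices | IndexTermCount | TermIndice | As in .tis.
All versions | TermIndice->TermInfo | IndexTermCount | TermInfo | As in .tis.
All versions | TermIndice->IndexDelta | IndexTermCount | VLong | Determines the position of this term's TermInfo within the .tis file. Specifically, it is the difference between the position of this term's entry and the position of the previous term's entry.

The structure is shown in the figure below: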

