Understanding the core indexing classes
As you saw in our Indexer class, you need the following classes to perform the
simplest indexing procedure:
■ IndexWriter
■ Directory
■ Analyzer
■ Document
■ Field
Indexing with Lucene breaks down into three main operations: converting data to text, analyzing it, and saving it to the index.
Understanding the core searching classes
The basic search interface that Lucene provides is as straightforward as the one
for indexing. Only a few classes are needed to perform the basic search operation:
■ IndexSearcher
■ Term
■ Query
■ TermQuery
■ Hits
http://lucene.apache.org/java/2_4_0/fileformats.html
Field Info
Field names are stored in the field info file, with suffix .fnm.
FieldInfos (.fnm) --> FieldsCount, <FieldName, FieldBits> FieldsCount
FieldsCount --> VInt
FieldName --> String
FieldBits --> Byte
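The grammar above can be read with a few lines of code. The following is a minimal sketch, not Lucene's own reader. It assumes the VInt encoding described in the Lucene file-format documentation (seven data bits per byte, low-order group first, high bit set when more bytes follow) and assumes a String is a VInt byte length followed by UTF-8 bytes. The function names and the sample bytes are made up for illustration.

```python
def read_vint(buf, pos):
    """Decode one VInt starting at buf[pos]; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift    # low 7 bits carry data
        if (b & 0x80) == 0:             # high bit clear: this was the last byte
            return value, pos
        shift += 7


def read_string(buf, pos):
    """Assumed String layout: VInt byte length, then UTF-8 bytes."""
    length, pos = read_vint(buf, pos)
    return buf[pos:pos + length].decode("utf-8"), pos + length


def parse_fnm(buf):
    """Return [(field_name, field_bits), ...] following the .fnm grammar."""
    count, pos = read_vint(buf, 0)      # FieldsCount
    fields = []
    for _ in range(count):
        name, pos = read_string(buf, pos)   # FieldName
        bits = buf[pos]                     # FieldBits (one byte)
        pos += 1
        fields.append((name, bits))
    return fields


# Hypothetical two-field file, built by hand for illustration only.
sample = bytes([2, 5]) + b"title" + bytes([0x01, 4]) + b"body" + bytes([0x01])
fields = parse_fnm(sample)
```

The position of a field in this list is its field number, which is what FieldNum refers to elsewhere in the index files.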
Stored Fields
Stored fields are represented by two files:
- The field index, or .fdx file.
This contains, for each document, a pointer to its field data, as follows:
FieldIndex (.fdx) --> <FieldValuesPosition> SegSize
FieldValuesPosition --> Uint64
This is used to find the location within the field data file of the fields of a particular document. Because it contains fixed-length data, this file may be easily randomly accessed. The position of document n's field data is the Uint64 at n*8 in this file.
- The field data, or .fdt file.
This contains the stored fields of each document, as follows:
FieldData (.fdt) --> <DocFieldData> SegSize
DocFieldData --> FieldCount, <FieldNum, Bits, Value> FieldCount
FieldCount --> VInt
FieldNum --> VInt
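Because every .fdx entry is a fixed 8 bytes, finding a document's stored fields is a single seek. The sketch below illustrates the n*8 rule from the field index description. It assumes the Uint64 is written high-order byte first (big-endian), and the sample offsets are invented.

```python
import struct


def field_data_offset(fdx, n):
    """Return the offset into the .fdt file for document n, given raw .fdx bytes."""
    (offset,) = struct.unpack_from(">Q", fdx, n * 8)   # Uint64 at byte n*8
    return offset


# Hypothetical .fdx contents for a three-document segment.
fdx = struct.pack(">3Q", 0, 17, 42)
offsets = [field_data_offset(fdx, n) for n in range(3)]
```

A reader would then seek to that offset in the .fdt file and parse FieldCount followed by the field entries.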
Term Dictionary
The term dictionary is represented as two files:
- The term infos, or .tis file.
TermInfoFile (.tis)--> TIVersion, TermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermInfos
TIVersion --> UInt32
TermCount --> UInt64
IndexInterval --> UInt32
SkipInterval --> UInt32
MaxSkipLevels --> UInt32
TermInfos --> <TermInfo> TermCount
TermInfo --> <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>
Term --> <PrefixLength, Suffix, FieldNum>
Suffix --> String
PrefixLength, DocFreq, FreqDelta, ProxDelta, SkipDelta --> VInt
This file is sorted by Term. Terms are ordered first lexicographically (by UTF-16 character code) by the term's field name, and within that lexicographically (by UTF-16 character code) by the term's text.
TIVersion names the version of the format of this file and is -2 in Lucene 1.4.
Term text prefixes are shared. The PrefixLength is the number of initial characters from the previous term which must be pre-pended to a term's suffix in order to form the term's text. Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".
FieldNum determines the term's field, whose name is stored in the .fnm file.
DocFreq is the count of documents which contain the term.
FreqDelta determines the position of this term's TermFreqs within the .frq file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file).
ProxDelta determines the position of this term's TermPositions within the .prx file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file). For fields with omitTf true, this will be 0 since prox information is not stored.
SkipDelta determines the position of this term's SkipData within the .frq file. In particular, it is the number of bytes after TermFreqs that the SkipData starts. In other words, it is the length of the TermFreq data. SkipDelta is only stored if DocFreq is not smaller than SkipInterval.
- The term info index, or .tii file.
This contains every IndexInterval-th entry from the .tis file, along with its location in the .tis file. It is designed to be read entirely into memory and used to provide random access to the .tis file.
The structure of this file is very similar to the .tis file, with the addition of one item per record, the IndexDelta.
TermInfoIndex (.tii)--> TIVersion, IndexTermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermIndices
TIVersion --> UInt32
IndexTermCount --> UInt64
IndexInterval --> UInt32
SkipInterval --> UInt32
TermIndices --> <TermInfo, IndexDelta> IndexTermCount
IndexDelta --> VLong
IndexDelta determines the position of this term's TermInfo within the .tis file. In particular, it is the difference between the position of this term's entry in that file and the position of the previous term's entry.
SkipInterval is the fraction of TermDocs stored in skip tables. It is used to accelerate TermDocs.skipTo(int). Larger values result in smaller indexes, greater acceleration, but fewer accelerable cases, while smaller values result in bigger indexes, less acceleration (in case of a small value for MaxSkipLevels) and more accelerable cases.
MaxSkipLevels is the maximum number of skip levels stored for each term in the .frq file. A low value results in smaller indexes but less acceleration; a larger value results in slightly larger indexes but greater acceleration. See the format of the .frq file for more information about skip levels.
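The prefix sharing and the delta-coded file positions described for the .tis file can be undone in one pass over the records. This is a rough sketch that works on already-parsed values instead of raw bytes. RawTermInfo and decode_term_infos are hypothetical names, the sample numbers are invented, and SkipDelta is left out for brevity.

```python
from typing import NamedTuple


class RawTermInfo(NamedTuple):
    prefix_length: int   # characters shared with the previous term's text
    suffix: str          # remaining characters of this term's text
    field_num: int
    doc_freq: int
    freq_delta: int      # distance from the previous term's .frq position
    prox_delta: int      # distance from the previous term's .prx position


def decode_term_infos(records):
    """Yield (field_num, text, doc_freq, freq_pos, prox_pos) for each record."""
    prev_text = ""
    freq_pos = prox_pos = 0
    for r in records:
        text = prev_text[:r.prefix_length] + r.suffix   # rebuild the full term text
        freq_pos += r.freq_delta                        # deltas accumulate to positions
        prox_pos += r.prox_delta
        yield (r.field_num, text, r.doc_freq, freq_pos, prox_pos)
        prev_text = text


# Invented records; the second one is the "bone" -> "boy" example from the text.
records = [
    RawTermInfo(0, "bone", 0, 3, 0, 0),
    RawTermInfo(2, "y", 0, 5, 12, 30),
    RawTermInfo(3, "s", 0, 1, 20, 44),
]
decoded = list(decode_term_infos(records))
```

Because each record depends on the one before it, the .tis file can only be decoded sequentially from a known starting point, which is why the .tii file stores periodic entry points.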
Frequencies
The .frq file contains the lists of documents which contain each term, along with the frequency of the term in that document (if omitTf is false).
FreqFile (.frq) --> <TermFreqs, SkipData> TermCount
TermFreqs --> <TermFreq> DocFreq
TermFreq --> DocDelta[, Freq?]
SkipData --> <<SkipLevelLength, SkipLevel> NumSkipLevels-1, SkipLevel> <SkipDatum>
SkipLevel --> <SkipDatum> DocFreq/(SkipInterval^(Level + 1))
SkipDatum --> DocSkip,PayloadLength?,FreqSkip,ProxSkip,SkipChildLevelPointer?
DocDelta,Freq,DocSkip,PayloadLength,FreqSkip,ProxSkip --> VInt
SkipChildLevelPointer --> VLong
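The grammar does not say how DocDelta and Freq combine. According to the Lucene file-format documentation linked above, when omitTf is false DocDelta/2 is the gap to the previous document number, an odd DocDelta means the frequency is one, and an even DocDelta is followed by the frequency as a separate VInt. When omitTf is true, DocDelta is the plain gap. The sketch below applies that rule to a list of already-decoded VInts. The function name is hypothetical.

```python
def decode_term_freqs(vints, doc_freq, omit_tf=False):
    """Return [(doc_number, freq_or_None), ...] for one term's TermFreqs."""
    it = iter(vints)
    doc = 0
    postings = []
    for _ in range(doc_freq):
        delta = next(it)
        if omit_tf:
            doc += delta                  # plain document gap, no frequency stored
            postings.append((doc, None))
            continue
        doc += delta >> 1                 # upper bits hold the document gap
        freq = 1 if (delta & 1) else next(it)   # low bit set: frequency is one
        postings.append((doc, freq))
    return postings


# A term occurring once in document 7 and three times in document 11.
postings = decode_term_freqs([15, 8, 3], doc_freq=2)
```

Folding the common "frequency is one" case into the low bit saves one VInt per posting for terms that appear once per document.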
Positions
The .prx file contains the lists of positions that each term occurs at within documents. Note that fields with omitTf true do not store anything into this file, and if all fields in the index have omitTf true then the .prx file will not exist.
ProxFile (.prx) --> <TermPositions> TermCount
TermPositions --> <Positions> DocFreq
Positions --> <PositionDelta,Payload?> Freq
Payload --> <PayloadLength?,PayloadData>
PositionDelta --> VInt
PayloadLength --> VInt
PayloadData --> <byte> PayloadLength
TermPositions are ordered by term (the term is implicit, from the .tis file).
Positions entries are ordered by increasing document number (the document number is implicit from the .frq file).
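Positions are stored as gaps too. The following sketch assumes payloads are disabled, so that each PositionDelta is simply the distance from the previous occurrence in the same document (or from zero for the first occurrence). The per-document frequencies come from the .frq file. The function name is hypothetical.

```python
def decode_positions(deltas, freqs):
    """Return one list of absolute positions per document.

    deltas: PositionDelta values for one term, in file order.
    freqs:  the term's frequency in each document, taken from the .frq file.
    """
    it = iter(deltas)
    result = []
    for freq in freqs:
        pos = 0                      # the running position restarts for each document
        doc_positions = []
        for _ in range(freq):
            pos += next(it)
            doc_positions.append(pos)
        result.append(doc_positions)
    return result


# A term at position 4 in one document, and at positions 5 and 9 in the next.
positions = decode_positions([4, 5, 4], freqs=[1, 2])
```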
Normalization Factors
Pre-2.1: There's a norm file for each indexed field with a byte for each document. The .f[0-9]* file contains, for each document, a byte that encodes a value that is multiplied into the score for hits on that field:
Norms (.f[0-9]*) --> <Byte> SegSize
2.1 and above: There's a single .nrm file containing all norms:
AllNorms (.nrm) --> NormsHeader,<Norms> NumFieldsWithNorms
Norms --> <Byte> SegSize
NormsHeader --> 'N','R','M',Version
Version --> Byte
NormsHeader has 4 bytes, last of which is the format version for this file, currently -1.
Each byte encodes a floating point value. Bits 0-2 contain the 3-bit mantissa, and bits 3-7 contain the 5-bit exponent.
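To make the byte layout concrete, here is a sketch of how such a byte could be expanded into a float. It follows my understanding of how Lucene's Similarity.decodeNorm behaves. The exponent bias of 15 and the exact bit placement are assumptions that are not stated in the text above, so treat the function as illustrative only.

```python
import struct


def decode_norm(b):
    """Expand a one-byte norm (3-bit mantissa, 5-bit exponent) into a float."""
    b &= 0xFF
    if b == 0:
        return 0.0
    mantissa = b & 0x07            # bits 0-2
    exponent = (b >> 3) & 0x1F     # bits 3-7
    # Assumed layout: place the small fields into an IEEE 754 single bit pattern.
    bits = ((exponent + (63 - 15)) << 24) | (mantissa << 21)
    return struct.unpack(">f", struct.pack(">I", bits))[0]


one = decode_norm(124)    # 0x7C
half = decode_norm(120)   # 0x78
```

With only 256 possible values, the encoding trades precision for a one-byte-per-document cost, which matters because norms are typically held in memory.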
Reference:
http://blog.youkuaiyun.com/uniorg/archive/2010/12/23/6093539.aspx
http://blog.youkuaiyun.com/panglaohutrue/archive/2005/01/15/254257.aspx
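The term dictionary in Lucene neither uses the B-tree structure found in commercial databases nor is it built by hashing. Instead, the .tii and .tis files together form a two-level file structure.
In the .tis file, a group point is created at every group interval: whenever the number of a term in the .tis file (counting from 0) is evenly divisible by IndexInterval, the predecessor of the current term is saved in the .tii file as a group point (the first group point is ""). The .tii and .tis files are written alternately until the two-level structure is complete.
One might ask why the .tii file does not store its entries with a hash. The answer is that a query needs to find the terms adjacent to the query term, and hashing cannot do that (a hash cannot locate a range).
The .tii file stores pointers into the .tis file. At search time the .tii file is prefetched into memory, and a binary search finds the nearest group-point term that is less than or equal to the query term. Starting from the .tis position that this group point's pointer refers to, the terms in the .tis file are scanned sequentially until either the query term is found or a term that sorts after the query term is reached (which means the query term is not in the dictionary).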
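The two-level lookup can be sketched with plain lists. This is a simplification and not Lucene's implementation: list indexes stand in for byte offsets into the .tis file, every interval-th term itself is used as the index entry, and Python's string ordering stands in for UTF-16 ordering (the two agree for the ASCII terms used here). All names are hypothetical.

```python
from bisect import bisect_right


def build_index(tis_terms, interval):
    """Pick every interval-th term as an index entry, with its position."""
    index_terms = tis_terms[::interval]
    index_pointers = list(range(0, len(tis_terms), interval))
    return index_terms, index_pointers


def lookup(term, index_terms, index_pointers, tis_terms):
    """Return the position of term in tis_terms, or -1 if it is absent."""
    # Level 1: binary search the in-memory index for the last entry <= term.
    i = bisect_right(index_terms, term) - 1
    if i < 0:
        return -1                      # term sorts before the first entry
    # Level 2: sequential scan of the main file from the indexed position.
    pos = index_pointers[i]
    while pos < len(tis_terms) and tis_terms[pos] <= term:
        if tis_terms[pos] == term:
            return pos
        pos += 1
    return -1                          # passed the slot where term would be


tis_terms = ["apple", "bone", "boy", "cat", "dog", "eel", "fox"]
index_terms, index_pointers = build_index(tis_terms, interval=3)
found = lookup("dog", index_terms, index_pointers, tis_terms)
missing = lookup("cow", index_terms, index_pointers, tis_terms)
```

The sequential scan never visits more than IndexInterval terms, so the interval sets the trade-off between the memory used by the index and the cost of each lookup.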
http://javenstudio.org/blog/lucene-indexfile-structure-4
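3.2.3.3 Term dictionary (.tii and .tis)
The term dictionary is stored in the following two files. The first is the file that stores term information (TermInfoFile), the .tis file. Its format is as follows:
Version | Item | Count | Type | Description |
All versions | TIVersion | 1 | UInt32 | Records the version of this file; -2 in version 1.4 |
All versions | TermCount | 1 | UInt64 | |
All versions | IndexInterval | 1 | UInt32 | |
All versions | SkipInterval | 1 | UInt32 | |
All versions | MaxSkipLevels | 1 | UInt32 | |
All versions | TermInfos | 1 | TermInfo… | |
All versions | TermInfos->TermInfo | TermCount | TermInfo | |
All versions | TermInfo->Term | TermCount | Term | |
All versions | Term->PrefixLength | TermCount | VInt | Term text prefixes are shared. This value is the number of leading characters taken from the previous term's text that must be prepended to this term's suffix to form the term's text. For example, if the previous term is "bone" and the current term is "boy", PrefixLength is 2 and the suffix is "y". |
All versions | Term->Suffix | TermCount | String | See above |
All versions | Term->FieldNum | TermCount | VInt | Determines the term's field; field names are stored in the .fnm file. |
All versions | TermInfo->DocFreq | TermCount | VInt | The number of documents that contain the term |
All versions | TermInfo->FreqDelta | TermCount | VInt | Determines the position of this term's TermFreqs within the .frq file. Specifically, it is the difference between the position of this term's data in that file and the position of the previous term's data; it is 0 for the first term. |
All versions | TermInfo->ProxDelta | TermCount | VInt | Determines the position of this term's TermPositions within the .prx file. Specifically, it is the difference between the position of this term's data in that file and the position of the previous term's data; it is 0 for the first term. For fields with omitTf set to true it is also 0, because prox information is not stored. |
All versions | TermInfo->SkipDelta | TermCount | VInt | Determines the position of this term's SkipData within the .frq file. Specifically, it is the number of bytes after TermFreqs at which the SkipData starts; in other words, it is the length of the TermFreq data. SkipDelta is only stored if DocFreq is not smaller than SkipInterval. |
The TermInfoFile is sorted by Term: first by the term's field name (in UTF-16 code order), and then by the term's text (in UTF-16 code order).
[Figure: structure of the .tis file]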
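The second is the index file for the term information, the .tii file. It contains every IndexInterval-th entry from the .tis file, stored together with its position in the .tis file. It is designed to be read entirely into memory so that it can provide random access to the .tis file. The structure of this file is very similar to that of the .tis file, with one additional item per record, the IndexDelta. Its format is as follows:
Version | Item | Count | Type | Description |
All versions | TIVersion | 1 | UInt32 | Same as in .tis |
All versions | IndexTermCount | 1 | UInt64 | Same as in .tis |
All versions | IndexInterval | 1 | UInt32 | Same as in .tis |
All versions | SkipInterval | 1 | UInt32 | The fraction of TermDocs stored in skip tables; used to accelerate TermDocs.skipTo(int). Larger values result in smaller indexes and greater acceleration but fewer accelerable cases, while smaller values result in bigger indexes, less acceleration (in case of a small value for MaxSkipLevels) and more accelerable cases. |
All versions | MaxSkipLevels | 1 | UInt32 | The maximum number of skip levels stored for each term in the .frq file. A low value results in smaller indexes but less acceleration; a larger value results in slightly larger indexes but greater acceleration. See the .frq file format for details about skip levels. |
All versions | TermIndices | IndexTermCount | TermIndice | Same as in .tis |
All versions | TermIndice->TermInfo | IndexTermCount | TermInfo | Same as in .tis |
All versions | TermIndice->IndexDelta | IndexTermCount | VLong | Determines the position of this term's TermInfo within the .tis file. Specifically, it is the difference between the position of this term's entry and the position of the previous term's entry. |
[Figure: structure of the .tii file]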