Lucene Study Notes, Part 4: Analysis of the Lucene Indexing Process (4)
December 23, 2010
Segment merging will be discussed in a later chapter; here we only look at how the buffered index information is written to disk.
Writing the index to disk involves the following steps:
Get the name of the segment to be written: String segment = docWriter.getSegment();
DocumentsWriter flushes its buffered information into the segment: docWriter.flush(flushDocStores);
Create the new segment descriptor: newSegment = new SegmentInfo(segment, flushedDocCount, directory, false, true, docStoreOffset, docStoreSegment, docStoreIsCompoundFile, docWriter.hasProx());
Push the pending deletes: docWriter.pushDeletes();
Create the compound file (cfs): docWriter.createCompoundFile(segment);
Apply the deletes: applyDeletes();
The code is:
SegmentInfo newSegment = null;
final int numDocs = docWriter.getNumDocsInRAM(); // total number of buffered documents
String docStoreSegment = docWriter.getDocStoreSegment(); // name of the segment the stored fields and term vectors are written to, e.g. "_0"
int docStoreOffset = docWriter.getDocStoreOffset(); // offset within that doc-store segment for the stored fields and term vectors
String segment = docWriter.getSegment(); // segment name, e.g. "_0"
As explained in detail in the chapter on Lucene's index file format, stored fields and term vectors may be kept in a different segment than the indexed fields.
The flush itself consists of two phases:
Closing the doc stores (stored fields and term vectors) along the basic indexing chain
Writing the indexing results into the segment, following the structure of the basic indexing chain
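For orientation, everything analysed in this chapter is triggered from the application side simply by adding documents and committing. A minimal sketch, assuming the Lucene 2.9/3.0 API of the period this article describes (the index path and field name are arbitrary examples):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class FlushDemo {
  public static void main(String[] args) throws Exception {
    FSDirectory dir = FSDirectory.open(new File("/tmp/demo-index")); // example path
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30),
        true, IndexWriter.MaxFieldLength.LIMITED);
    Document doc = new Document();
    doc.add(new Field("title", "hello lucene", Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc); // buffered in DocumentsWriter's in-memory structures
    writer.commit();         // drives the flush path analysed below
    writer.close();
  }
}

commit() (or close()) is what ultimately calls docWriter.flush(), creates the SegmentInfo and runs applyDeletes() as listed above.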
6.2.1 Closing the doc stores (stored fields and term vectors) along the basic indexing chain
The code essentially walks the basic indexing chain and closes the stored-field and term-vector writers:
consumer(DocFieldProcessor).closeDocStore(flushState);
consumer(DocInverter).closeDocStore(state);
consumer(TermsHash).closeDocStore(state);
consumer(FreqProxTermsWriter).closeDocStore(state);
if (nextTermsHash != null) nextTermsHash.closeDocStore(state);
consumer(TermVectorsTermsWriter).closeDocStore(state);
endConsumer(NormsWriter).closeDocStore(state);
fieldsWriter(StoredFieldsWriter).closeDocStore(state);
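To make the shape of this call chain concrete, here is a small schematic in the same spirit (the class names are invented for illustration; they are not Lucene's real classes): each consumer simply forwards closeDocStore to the consumer(s) below it, so closing the top of the chain closes every writer that owns doc-store files.

import java.io.IOException;

interface ChainConsumer {
  void closeDocStore() throws IOException;
}

// Mirrors the role of DocFieldProcessor: forwards to its inverter chain and to the stored-fields writer.
class ProcessorLike implements ChainConsumer {
  private final ChainConsumer consumer;
  private final ChainConsumer fieldsWriter;
  ProcessorLike(ChainConsumer consumer, ChainConsumer fieldsWriter) {
    this.consumer = consumer;
    this.fieldsWriter = fieldsWriter;
  }
  public void closeDocStore() throws IOException {
    consumer.closeDocStore();     // walk down the chain first
    fieldsWriter.closeDocStore(); // then close the stored-fields side
  }
}

// Mirrors a leaf writer such as TermVectorsTermsWriter or StoredFieldsWriter.
class LeafWriterLike implements ChainConsumer {
  private final String name;
  LeafWriterLike(String name) { this.name = name; }
  public void closeDocStore() {
    System.out.println(name + ".closeDocStore()"); // a real writer closes its files here
  }
}

Calling closeDocStore() on the top-level ProcessorLike instance reproduces the cascade listed above.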
Of these, only the following two closeDocStore implementations do real work:
void closeDocStore(final SegmentWriteState state) throws IOException {
if (tvx != null) {
// Write empty entries into the tvd file for documents that store no term vectors; even such documents keep a slot in tvx and tvd.
fill(state.numDocsInStore - docWriter.getDocStoreOffset());
// close the output streams of tvx, tvf and tvd
tvx.close();
tvf.close();
tvd.close();
tvx = null;
// Record the written file names so that they can later be bundled into a single compound (cfs) file.
state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_INDEX_EXTENSION);
state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_FIELDS_EXTENSION);
state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_DOCUMENTS_EXTENSION);
// Remove them from DocumentsWriter's openFiles member; they may later be deleted by IndexFileDeleter.
docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_INDEX_EXTENSION);
docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_FIELDS_EXTENSION);
docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_DOCUMENTS_EXTENSION);
lastDocID = 0;
}
}
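To illustrate what fill() achieves (a simplified, hypothetical slot file, not Lucene's actual tvx/tvd layout): every document owns exactly one pointer slot in the index stream, and a document that stored no vectors gets a slot that simply points at the current end of the data stream, i.e. a zero-length entry.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Toy illustration only: one 8-byte offset per document, pointing into a data stream.
class ToySlotWriter {
  private final ByteArrayOutputStream dataBytes = new ByteArrayOutputStream();
  private final DataOutputStream data = new DataOutputStream(dataBytes);
  private final DataOutputStream index;
  private int lastDocID = 0;

  ToySlotWriter(DataOutputStream index) { this.index = index; }

  // Pad slots for documents [lastDocID, docID) that stored nothing.
  void fill(int docID) throws IOException {
    while (lastDocID < docID) {
      index.writeLong(data.size()); // empty slot: current end of the data stream
      lastDocID++;
    }
  }

  void addDoc(int docID, byte[] vectors) throws IOException {
    fill(docID);                  // pad any skipped documents first
    index.writeLong(data.size()); // this document's slot
    data.write(vectors);
    lastDocID = docID + 1;
  }
}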
public void closeDocStore(SegmentWriteState state) throws IOException {
// close the fdx and fdt output streams
fieldsWriter.close();
--> fieldsStream.close();
--> indexStream.close();
fieldsWriter = null;
lastDocID = 0;
// record the written file names
state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_EXTENSION);
state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_INDEX_EXTENSION);
state.docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_EXTENSION);
state.docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_INDEX_EXTENSION);
}

6.2.2 Writing the indexing results into the segment along the basic indexing chain
The code is:
consumer(DocFieldProcessor).flush(threads, flushState);
// Recycle fieldHash so it can be used for the next round of indexing; for efficiency, the objects in the indexing chain are reused.
Map<DocFieldConsumerPerThread, Collection<DocFieldConsumerPerField>> childThreadsAndFields = new HashMap<DocFieldConsumerPerThread, Collection<DocFieldConsumerPerField>>();
for ( DocConsumerPerThread thread : threads) {
DocFieldProcessorPerThread perThread = (DocFieldProcessorPerThread) thread;
childThreadsAndFields.put(perThread.consumer, perThread.fields());
perThread.trimFields(state);
}
// write the stored fields
--> fieldsWriter(StoredFieldsWriter).flush(state);
// write the indexed fields
--> consumer(DocInverter).flush(childThreadsAndFields, state);
// write the field metadata (fnm file) and record the file name so it can later go into the cfs file
--> final String fileName = state.segmentFileName(IndexFileNames.FIELD_INFOS_EXTENSION);
--> fieldInfos.write(state.directory, fileName);
--> state.flushedFiles.add(fileName);
This step, too, follows the basic indexing chain:

6.2.2.1 Writing the stored fields
As the code shows, this step would write the fdx and fdt files, but closeDocStore above has already written them, reset state.numDocsInStore to zero and set fieldsWriter to null, so nothing actually happens here.
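From the application's point of view, only fields added with Field.Store.YES ever reach fdt/fdx. A short sketch, again assuming the Lucene 2.9/3.0 API (the field names and the pre-existing writer/directory are placeholders):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

class StoredFieldDemo {
  static void addAndRead(IndexWriter writer, Directory dir) throws Exception {
    Document doc = new Document();
    // Stored and indexed: the raw value goes to fdt/fdx, the tokens to the inverted index.
    doc.add(new Field("title", "hello stored fields", Field.Store.YES, Field.Index.ANALYZED));
    // Indexed only: nothing for this field is written to the doc store.
    doc.add(new Field("body", "inverted only, not stored", Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(doc);
    writer.commit();

    IndexReader reader = IndexReader.open(dir, true);      // read-only reader
    Document stored = reader.document(reader.maxDoc() - 1); // the document just added (sketch: assumes no concurrent changes)
    System.out.println(stored.get("title")); // read back from the stored fields
    System.out.println(stored.get("body"));  // null: the field was not stored
    reader.close();
  }
}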
6.2.2.2 Writing the indexed fields

6.2.2.2.1 Writing the postings and the term-vector information
The code is TermsHash.flush(Map<...>, SegmentWriteState):
// write the postings
--> consumer(FreqProxTermsWriter).flush(childThreadsAndFields, state);
// recycle the RawPostingList objects
--> shrinkFreePostings(threadsAndFields, state);
// write the term-vector information
--> if (nextTermsHash != null) nextTermsHash.flush(nextThreadsAndFields, state);
--> consumer(TermVectorsTermsWriter).flush(childThreadsAndFields, state);

6.2.2.2.1.1 Writing the postings
The code is FreqProxTermsWriter.flush(Map<...>, SegmentWriteState):
(a) Sort all fields by name so that fields with the same name can be processed together:
Collections.sort(allFields);
final int numAllFields = allFields.size();
(b) Create the consumer that writes the postings:
final FormatPostingsFieldsConsumer consumer = new FormatPostingsFieldsWriter(state, fieldInfos);
int start = 0;
(c) For each field:
while(start < numAllFields) {
...
// For every field instance with this name, create a FreqProxFieldMergeState and advance it to its first term.
FreqProxFieldMergeState fms = mergeStates[i] = new FreqProxFieldMergeState(fields[i]);
boolean result = fms.nextTerm();
}
(1) Add the field. There may be several field instances, but since they all share the same name, the information of the first one is enough. The returned object is used to add this field's terms.
final FormatPostingsTermsConsumer termsConsumer = consumer.addField(fields[0].fieldInfo);
FreqProxFieldMergeState[] termStates = new FreqProxFieldMergeState[numFields];
final boolean currentFieldOmitTermFreqAndPositions = fields[0].fieldInfo.omitTermFreqAndPositions;
(2) The following while loop iterates over every field that still has unprocessed terms and handles the terms in dictionary order. When all terms of a field have been processed, numFields is decremented and the field is removed from the mergeStates array. The loop exits only when every term of every field has been processed.
while(numFields > 0) {
(2-1) Find the next term, in dictionary order, across all fields. Several same-named field instances may contain the same term, so all numFields merge states are examined to find the smallest next term; numToMerge ends up as the number of field instances that contain this term.
termStates[0] = mergeStates[0];
int numToMerge = 1;
for(int i=1;i<numFields;i++) {
final char[] text = mergeStates[i].text;
final int textOffset = mergeStates[i].textOffset;
final int cmp = compareText(text, textOffset, termStates[0].text, termStates[0].textOffset);
if (cmp < 0) {
termStates[0] = mergeStates[i];
numToMerge = 1;
} else if (cmp == 0)
termStates[numToMerge++] = mergeStates[i];
}
...
while(numToMerge > 0) {
(2-3-1) Find the smallest document ID:
FreqProxFieldMergeState minState = termStates[0];
for(int i=1;i<numToMerge;i++)
if (termStates[i].docID < minState.docID)
minState = termStates[i];
...
final int code = prox.readVInt();
position += code >> 1;
final int payloadLength;
// If this position carries payload information, read it from the byte pool; otherwise the length is set to zero.
if ((code & 1) != 0) {
payloadLength = prox.readVInt();
if (payloadBuffer == null || payloadBuffer.length < payloadLength)
payloadBuffer = new byte[payloadLength];
prox.readBytes(payloadBuffer, 0, payloadLength);
} else
payloadLength = 0;
...
// FreqProxFieldMergeState.nextDoc() decodes the next document ID and term frequency from the freq byte slice:
docID += code >>> 1;
if ((code & 1) != 0)
termFreq = 1;
else
termFreq = freq.readVInt();
}
return true;
}
(2-3-2) The code that adds the document ID and term frequency is as follows:
FormatPostingsPositionsConsumer FormatPostingsDocsWriter.addDoc(int docID, int termDocFreq) {
final int delta = docID - lastDocID;
// When the number of documents reaches a multiple of skipInterval, add a skip-list entry.
if ((++df % skipInterval) == 0) {
skipListWriter.setSkipData(lastDocID, storePayloads, posWriter.lastPayloadLength);
skipListWriter.bufferSkip(df);
}
lastDocID = docID;
if (omitTermFreqAndPositions)
out.writeVInt(delta);
else if (1 == termDocFreq)
out.writeVInt((delta<<1) | 1);
else {
out.writeVInt(delta<<1);
out.writeVInt(termDocFreq);
}
...
// In FormatPostingsPositionsWriter.addPosition, position deltas and payloads are written in the same VInt style:
if (payloadLength > 0)
out.writeBytes(payload, payloadLength);
} else
out.writeVInt(delta);
}
(2-4) Write the skip lists and the term dictionary (tii, tis) to their files.
Flushing the buffered skip lists to the file:
DefaultSkipListWriter(MultiLevelSkipListWriter).writeSkip(IndexOutput) {
long skipPointer = output.getFilePointer();
if (skipBuffer == null || skipBuffer.length == 0) return skipPointer;
// As analysed in the chapter on the index file format, higher levels come first and lower levels last; every level except the lowest is preceded by its length.
for (int level = numberOfSkipLevels - 1; level > 0; level--) {
long length = skipBuffer[level].getFilePointer();
if (length > 0) {
output.writeVLong(length);
skipBuffer[level].writeTo(output);
}
}
// write the lowest level
skipBuffer[0].writeTo(output);
return skipPointer;
}
Writing the term dictionary (TermInfo) to the tii and tis files: the tii file acts like a skip list over the tis file; every indexInterval-th term of tis is copied into tii so that a term can be located quickly.
TermInfosWriter therefore has a member variable other, itself of type TermInfosWriter, and a member variable isIndex indicating whether the object writes the tii file or the tis file:
If a TermInfosWriter object has isIndex=false, it writes the tis file, and its other points to the TermInfosWriter that writes the tii file.
If a TermInfosWriter object has isIndex=true, it writes the tii file, and its other points to the TermInfosWriter that writes the tis file.
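The tii/tis split is essentially sampling: every indexInterval-th term of the sorted dictionary is copied into the small index, so a lookup can binary-search the index and then scan at most indexInterval entries of the full dictionary. A minimal sketch of that idea in plain Java (this is the concept only, not the actual TermInfosWriter/TermInfosReader logic):

import java.util.ArrayList;
import java.util.List;

class TermIndexSketch {
  // Keep every indexInterval-th term as the "index" (tii-like).
  static List<String> buildIndex(List<String> sortedTerms, int indexInterval) {
    List<String> index = new ArrayList<String>();
    for (int i = 0; i < sortedTerms.size(); i += indexInterval) {
      index.add(sortedTerms.get(i));
    }
    return index;
  }

  // Locate a term: binary search the small index, then scan at most indexInterval entries (tis-like).
  static int find(List<String> sortedTerms, List<String> index, int indexInterval, String term) {
    int lo = 0, hi = index.size() - 1, block = 0;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (index.get(mid).compareTo(term) <= 0) { block = mid; lo = mid + 1; }
      else hi = mid - 1;
    }
    int start = block * indexInterval;
    int end = Math.min(start + indexInterval, sortedTerms.size());
    for (int i = start; i < end; i++) {
      if (sortedTerms.get(i).equals(term)) return i;
    }
    return -1;
  }
}

With indexInterval = 128 (Lucene's default), the index stays small enough to keep in memory while a lookup still touches only a short run of the on-disk dictionary.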
TermInfosWriter.add(int fieldNumber, byte[] termBytes, int termBytesLength, TermInfo ti) {
// If the total number of terms is a multiple of indexInterval, the term should also be written to the tii file.
if (!isIndex && size % indexInterval == 0)
other.add(lastFieldNumber, lastTermBytes, lastTermBytesLength, lastTi);
// write the term to the tis file
writeTerm(fieldNumber, termBytes, termBytesLength);
output.writeVInt(ti.docFreq); // write doc freq
output.writeVLong(ti.freqPointer - lastTi.freqPointer); // write pointers
output.writeVLong(ti.proxPointer - lastTi.proxPointer);
if (ti.docFreq >= skipInterval) {
output.writeVInt(ti.skipOffset);
}
if (isIndex) {
output.writeVLong(other.output.getFilePointer() - lastIndexPointer);
lastIndexPointer = other.output.getFilePointer(); // write pointer
}
lastFieldNumber = fieldNumber;
lastTi.set(ti);
size++;
}

6.2.2.2.1.2 Writing the term-vector information
The code is:
TermVectorsTermsWriter.flush(Map<...> threadsAndFields, final SegmentWriteState state) {
if (tvx != null) {
if (state.numDocsInStore > 0)
fill(state.numDocsInStore - docWriter.getDocStoreOffset());
tvx.flush();
tvd.flush();
tvf.flush();
}
for (Map.Entry<...> entry : threadsAndFields.entrySet()) {
for (final TermsHashConsumerPerField field : entry.getValue()) {
TermVectorsTermsWriterPerField perField = (TermVectorsTermsWriterPerField) field;
perField.termsHashPerField.reset();
perField.shrinkHash();
}
TermVectorsTermsWriterPerThread perThread = (TermVectorsTermsWriterPerThread) entry.getKey();
perThread.termsHashPerThread.reset(true);
}
}
As the code shows, this step would write the tvx, tvd and tvf files, but closeDocStore above has already written them and set tvx to null, so nothing is actually written here; the method merely clears postingsHash so the objects can be reused in the next round of indexing.
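For reference, whether a document contributes anything to tvx/tvd/tvf is decided when the document is built: only fields created with a Field.TermVector option other than NO are recorded. A small sketch (field names are arbitrary, Lucene 2.9/3.0 API assumed):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

class TermVectorDemo {
  static Document buildDoc() {
    Document doc = new Document();
    doc.add(new Field("body", "term vectors keep per-document term statistics",
        Field.Store.NO, Field.Index.ANALYZED,
        Field.TermVector.WITH_POSITIONS_OFFSETS)); // terms, positions and offsets go to tvf
    doc.add(new Field("title", "no vectors for this field",
        Field.Store.YES, Field.Index.ANALYZED,
        Field.TermVector.NO)); // contributes nothing to the term-vector files
    return doc;
  }
}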
6.2.2.2.2 Writing the normalization factors (norms)
The code is:
NormsWriter.flush(Map<...> threadsAndFields, SegmentWriteState state) {
final Map<FieldInfo, List<NormsWriterPerField>> byField = new HashMap<FieldInfo, List<NormsWriterPerField>>();
// Walk all fields and group the NormsWriterPerField instances of same-named fields into one list.
for (final Map.Entry<...> entry : threadsAndFields.entrySet()) {
final Collection fields = entry.getValue();
final Iterator fieldsIt = fields.iterator();
while (fieldsIt.hasNext()) {
final NormsWriterPerField perField = (NormsWriterPerField) fieldsIt.next();
List<NormsWriterPerField> l = byField.get(perField.fieldInfo);
if (l == null) {
l = new ArrayList<NormsWriterPerField>();
byField.put(perField.fieldInfo, l);
}
l.add(perField);
}
// Record the written file name so it can later go into the cfs file.
final String normsFileName = state.segmentName + "." + IndexFileNames.NORMS_EXTENSION;
state.flushedFiles.add(normsFileName);
IndexOutput normsOut = state.directory.createOutput(normsFileName);
try {
// write the header of the nrm file
normsOut.writeBytes(SegmentMerger.NORMS_HEADER, 0, SegmentMerger.NORMS_HEADER.length);
final int numField = fieldInfos.size();
int normCount = 0;
// process each field
for(int fieldNumber=0;fieldNumber<numField;fieldNumber++) {
final FieldInfo fieldInfo = fieldInfos.fieldInfo(fieldNumber);
// get the list of NormsWriterPerField objects for this field name
List<NormsWriterPerField> toMerge = byField.get(fieldInfo);
int upto = 0;
if (toMerge != null) {
final int numFields = toMerge.size();
normCount++;
final NormsWriterPerField[] fields = new NormsWriterPerField[numFields];
int[] uptos = new int[numFields];
for(int j=0;j<numFields;j++)
fields[j] = toMerge.get(j);
int numLeft = numFields;
// merge the same-named fields
while(numLeft > 0) {
// find the smallest document ID among the same-named fields
int minLoc = 0;
int minDocID = fields[0].docIDs[uptos[0]];
for(int j=1;j<numLeft;j++) {
final int docID = fields[j].docIDs[uptos[j]];
if (docID < minDocID) {
minDocID = docID;
minLoc = j;
}
}
...
The last step listed at the beginning of this chapter, applyDeletes(), applies the buffered deletes in three passes: by term, by document ID, and by query.
// Delete by term: for each buffered term, walk its postings in the new segment reader.
...
while (docs.next()) {
int docID = docs.doc();
if (docIDStart + docID >= limit)
break;
reader.deleteDocument(docID);
any = true;
}
}
} finally {
docs.close();
}
// Delete by document ID.
for (Integer docIdInt : deletesFlushed.docIDs) {
int docID = docIdInt.intValue();
if (docID >= docIDStart && docID < docEnd) {
reader.deleteDocument(docID - docIDStart);
any = true;
}
}
// Delete by query: score every buffered query against the new reader and delete each hit whose document ID is below the query's limit.
...
if (scorer != null) {
while (true) {
int doc = scorer.nextDoc();
if (((long) docIDStart) + doc >= limit)
break;
reader.deleteDocument(doc);
any = true;
}
}
}
searcher.close();
return any;
}
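For completeness, the by-term and by-query deletes that applyDeletes() replays come straight from the public IndexWriter API; they are only buffered at call time and take effect at the next flush/commit. A minimal sketch (the field names, term and query are examples):

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;

class DeleteDemo {
  static void queueDeletes(IndexWriter writer) throws Exception {
    writer.deleteDocuments(new Term("id", "42"));                  // buffered: delete by term
    writer.deleteDocuments(new TermQuery(new Term("tag", "old"))); // buffered: delete by query
    writer.commit(); // flush the segment, then applyDeletes() removes the matching documents
  }
}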