Lucene Study Notes, Part 4: An Analysis of the Lucene Indexing Process (4)

December 23, 2010
  Segment merging will be discussed in a later chapter; here we only discuss the process of writing the buffered index information to disk.
  Writing the index to disk involves the following steps (a simplified sketch of how they fit together follows this list):
  Get the name of the segment to be written: String segment = docWriter.getSegment();
  DocumentsWriter writes the buffered information into the segment: docWriter.flush(flushDocStores);
  Create the new segment's metadata object: newSegment = new SegmentInfo(segment, flushedDocCount, directory, false, true, docStoreOffset, docStoreSegment, docStoreIsCompoundFile, docWriter.hasProx());
  Prepare the deletions: docWriter.pushDeletes();
  Create the cfs (compound) segment file: docWriter.createCompoundFile(segment);
  Apply the deletions: applyDeletes();
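  A minimal sketch, assembled only from the calls listed above, of how these steps fit together. It is not the literal IndexWriter source: locking, error handling and flag handling are omitted, and the useCompoundFile flag and the int return value of docWriter.flush are assumptions made for illustration.
  String segment = docWriter.getSegment();                 // name of the new segment, e.g. "_0"
  if (segment != null) {
    int flushedDocCount = docWriter.flush(flushDocStores); // write buffered postings, stored fields, vectors
    SegmentInfo newSegment = new SegmentInfo(segment, flushedDocCount, directory,
        false, true, docStoreOffset, docStoreSegment,
        docStoreIsCompoundFile, docWriter.hasProx());
    docWriter.pushDeletes();                               // move buffered deletes to the flushed set
    if (useCompoundFile) {                                 // assumed flag, for illustration only
      docWriter.createCompoundFile(segment);               // pack the segment's files into one .cfs file
      newSegment.setUseCompoundFile(true);
    }
    applyDeletes();                                        // apply the flushed deletes
  }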
  6.1 Getting the name of the segment to write
  Code:
  SegmentInfo newSegment = null;
  final int numDocs = docWriter.getNumDocsInRAM(); //total number of buffered documents
  String docStoreSegment = docWriter.getDocStoreSegment(); //name of the segment (doc store) the stored fields and term vectors are written to, e.g. "_0"
  int docStoreOffset = docWriter.getDocStoreOffset(); //offset of these documents within the doc-store segment
  String segment = docWriter.getSegment(); //segment name, e.g. "_0"
  As described in detail in the chapter on Lucene's index file format, the stored fields and term vectors may be kept in a different segment (a shared doc store) from the inverted index; for example, when segments _0 and _1 share _0's doc store, _1's docStoreOffset equals the number of documents in _0.
  6.2 Writing the buffered contents into the segment
  Code: docWriter.flush(flushDocStores); this process consists of the following two stages:
  Closing the doc stores (stored fields and term vectors) along the basic indexing chain
  Writing the indexing results into the segment along the basic indexing chain
  6.2.1 Closing the doc stores (stored fields and term vectors) along the basic indexing chain
  Code: following the basic indexing chain, the stored-field and term-vector files are closed:
  consumer(DocFieldProcessor).closeDocStore(flushState);
  consumer(DocInverter).closeDocStore(state);
  consumer(TermsHash).closeDocStore(state);
  consumer(FreqProxTermsWriter).closeDocStore(state);
  if (nextTermsHash != null) nextTermsHash.closeDocStore(state);
  consumer(TermVectorsTermsWriter).closeDocStore(state);
  endConsumer(NormsWriter).closeDocStore(state);
  fieldsWriter(StoredFieldsWriter).closeDocStore(state);
  Of the chain, the two closeDocStore implementations that do real work are the following.
  TermVectorsTermsWriter.closeDocStore:
  void closeDocStore(final SegmentWriteState state) throws IOException {
    if (tvx != null) {
      //For documents that store no term vectors, write zero entries into the tvd file; even when no term vectors are stored, a slot is still reserved for each document in tvx and tvd.
  fill(state.numDocsInStore - docWriter.getDocStoreOffset());
  //关闭tvx, tvf, tvd文件的写入流
  tvx.close();
  tvf.close();
  tvd.close();
  tvx = null;
  //Record the names of the files written, so that later they can be combined into a single cfs (compound) file.
  state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_INDEX_EXTENSION);
  state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_FIELDS_EXTENSION);
  state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_DOCUMENTS_EXTENSION);
  //Remove them from DocumentsWriter's openFiles member; they may later be deleted by IndexFileDeleter.
  docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_INDEX_EXTENSION);
  docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_FIELDS_EXTENSION);
  docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_DOCUMENTS_EXTENSION);
  lastDocID = 0;
  }
  }
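  For reference, a sketch of the fill call used above, as I recall it from the Lucene 2.9/3.0 TermVectorsTermsWriter (reproduced from memory, so treat the exact member names as approximate): for every document up to the target, it writes a tvx entry pointing at a tvd record that declares zero term-vector fields.
  void fill(int docID) throws IOException {
    final int docStoreOffset = docWriter.getDocStoreOffset();
    final int end = docID + docStoreOffset;
    if (lastDocID < end) {
      final long tvfPosition = tvf.getFilePointer();
      while (lastDocID < end) {
        tvx.writeLong(tvd.getFilePointer()); // pointer into tvd for this document
        tvd.writeVInt(0);                    // this document has 0 fields with term vectors
        tvx.writeLong(tvfPosition);          // pointer into tvf (nothing written there)
        lastDocID++;
      }
    }
  }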
  StoredFieldsWriter.closeDocStore:
  public void closeDocStore(SegmentWriteState state) throws IOException {
    //Close the fdx and fdt output streams
  fieldsWriter.close();
  --> fieldsStream.close();
  --> indexStream.close();
  fieldsWriter = null;
  lastDocID = 0;
  //Record the names of the files written
  state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_EXTENSION);
  state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_INDEX_EXTENSION);
  state.docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_EXTENSION);
  state.docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_INDEX_EXTENSION);
  }
  6.2.2 Writing the indexing results into the segment along the basic indexing chain
  Code: consumer(DocFieldProcessor).flush(threads, flushState); //the fieldHash is reclaimed for the next round of indexing; for efficiency, the objects in the indexing chain are reused.
  Map<DocFieldConsumerPerThread, Collection<DocFieldConsumerPerField>> childThreadsAndFields = new HashMap<DocFieldConsumerPerThread, Collection<DocFieldConsumerPerField>>();
  for ( DocConsumerPerThread thread : threads) {
  DocFieldProcessorPerThread perThread = (DocFieldProcessorPerThread) thread;
  childThreadsAndFields.put(perThread.consumer, perThread.fields());
  perThread.trimFields(state);
  }
  //Write the stored fields
  --> fieldsWriter(StoredFieldsWriter).flush(state);
  //Write the indexed (inverted) fields
  --> consumer(DocInverter).flush(childThreadsAndFields, state);
  //Write the field metadata (FieldInfos) and record the file name, so that the cfs file can be created later
  --> final String fileName = state.segmentFileName(IndexFileNames.FIELD_INFOS_EXTENSION);
  --> fieldInfos.write(state.directory, fileName);
  --> state.flushedFiles.add(fileName);
  This process, too, follows the basic indexing chain:
  6.2.2.1 Writing the stored fields
  The code is StoredFieldsWriter.flush (sketched below). As the code shows, it writes the fdx and fdt files; but the closeDocStore above has already written them, set state.numDocsInStore to zero and fieldsWriter to null, so at this point it actually does nothing.
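  For reference, a minimal sketch of StoredFieldsWriter.flush as I recall it from the Lucene 2.9/3.0 sources (reproduced from memory, so details may differ slightly):
  synchronized public void flush(SegmentWriteState state) throws IOException {
    if (state.numDocsInStore > 0) {
      initFieldsWriter();                                          // open fdx/fdt if they are not open yet
      fill(state.numDocsInStore - docWriter.getDocStoreOffset());  // pad entries for documents without stored fields
    }
    if (fieldsWriter != null)
      fieldsWriter.flush();                                        // flush the buffered fdx/fdt output
  }
  If closeDocStore has already run, fieldsWriter is null and numDocsInStore is zero, so both branches are skipped.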
  6.2.2.2 Writing the indexed fields
  The code is DocInverter.flush, which proceeds along the indexing chain in the following steps.
  6.2.2.2.1 Writing the postings and term vectors
  Code: TermsHash.flush(Map<InvertedDocConsumerPerThread, Collection<InvertedDocConsumerPerField>>, SegmentWriteState)
  //Write the postings
  --> consumer(FreqProxTermsWriter).flush(childThreadsAndFields, state);
  //Reclaim the RawPostingList objects
  --> shrinkFreePostings(threadsAndFields, state);
  //Write the term vectors
  --> if (nextTermsHash != null) nextTermsHash.flush(nextThreadsAndFields, state);
  --> consumer(TermVectorsTermsWriter).flush(childThreadsAndFields, state);
  6.2.2.2.1.1 Writing the postings
  Code: FreqProxTermsWriter.flush(Map<TermsHashConsumerPerThread, Collection<TermsHashConsumerPerField>>, SegmentWriteState)
  (a) Sort all fields by name, so that same-named fields (from different threads) can be processed together
  Collections.sort(allFields);
  final int numAllFields = allFields.size();
  (b) Create the postings writer object
  final FormatPostingsFieldsConsumer consumer = new FormatPostingsFieldsWriter(state, fieldInfos);
  int start = 0;
  (c) For each field (same-named fields from all threads are processed together)
  while(start < numAllFields) {
    // ... collect all FreqProxTermsWriterPerField objects sharing this field name into fields[] ...
    appendPostings(fields, consumer); // merge their postings and write them to the files
    // ... free the memory of the processed fields and advance start past them ...
  }
  The merging is done by FreqProxTermsWriter.appendPostings(FreqProxTermsWriterPerField[], FormatPostingsFieldsConsumer):
  int numFields = fields.length;
  final FreqProxFieldMergeState[] mergeStates = new FreqProxFieldMergeState[numFields];
  for(int i=0;i<numFields;i++) {
    FreqProxFieldMergeState fms = mergeStates[i] = new FreqProxFieldMergeState(fields[i]);
    boolean result = fms.nextTerm(); //for every field, position on its first term
  }
  (1) Add this field. Although there are several per-thread fields here, they all have the same name, so the information of the first one is enough. The returned object is used to add the terms of this field.
  final FormatPostingsTermsConsumer termsConsumer = consumer.addField(fields[0].fieldInfo);
  FreqProxFieldMergeState[] termStates = new FreqProxFieldMergeState[numFields];
  final boolean currentFieldOmitTermFreqAndPositions = fields[0].fieldInfo.omitTermFreqAndPositions;
  (2) This while loop iterates over every field that still has unprocessed terms, handling those terms in lexicographic order. Once all terms of a field have been processed, numFields is decremented and the field is removed from the mergeStates array. The loop only exits when all terms of all fields have been processed.
  while(numFields > 0) {
  (2-1) Find the next term, in lexicographic order, across all fields. Several of the same-named fields may contain the same term, so all numFields fields are scanned for their current term; numToMerge is the number of fields that contain this term.
  termStates[0] = mergeStates[0];
  int numToMerge = 1;
  for(int i=1;i<numFields;i++) {
    final char[] text = mergeStates[i].text;
    final int textOffset = mergeStates[i].textOffset;
    final int cmp = compareText(text, textOffset, termStates[0].text, termStates[0].textOffset);
    if (cmp < 0) {
      termStates[0] = mergeStates[i];
      numToMerge = 1;
    } else if (cmp == 0)
      termStates[numToMerge++] = mergeStates[i];
  }
  (2-2) Add this term. The returned object is used to add the documents that contain it.
  final FormatPostingsDocsConsumer docConsumer = termsConsumer.addTerm(termStates[0].text, termStates[0].textOffset);
  (2-3) This loop interleaves the document lists of all the fields that contain the current term, always taking the smallest document number next.
  while(numToMerge > 0) {
  (2-3-1) Find the smallest document number
  FreqProxFieldMergeState minState = termStates[0];
  for(int i=1;i<numToMerge;i++)
    if (termStates[i].docID < minState.docID)
      minState = termStates[i];
  (2-3-2) Add the document number and term frequency; the positions and payloads (when positions are not omitted) are then copied from the in-memory prox stream.
  final int termDocFreq = minState.termFreq;
  final FormatPostingsPositionsConsumer posConsumer = docConsumer.addDoc(minState.docID, termDocFreq);
  final ByteSliceReader prox = minState.prox;
  int position = 0;
  for(int j=0;j<termDocFreq;j++) {
    final int code = prox.readVInt();
    position += code >> 1;
  final int payloadLength;
  // If this position carries a payload, read it from the byte pool; otherwise the payload length is zero.
  if ((code & 1) != 0) {
  payloadLength = prox.readVInt();
  if (payloadBuffer == null || payloadBuffer.length < payloadLength)
    payloadBuffer = new byte[payloadLength];
  prox.readBytes(payloadBuffer, 0, payloadLength);
  } else
    payloadLength = 0;
  posConsumer.addPosition(position, payloadBuffer, 0, payloadLength);
  }
  posConsumer.finish();
  After the current document has been copied, minState.nextDoc() advances to this term's next document; when a field has no more documents for the term it is removed from termStates, and when it has no more terms at all it is removed from mergeStates. In FreqProxFieldMergeState.nextDoc(), the document number and term frequency are decoded from the in-memory freq stream:
  final int code = freq.readVInt();
  if (field.omitTermFreqAndPositions)
    docID += code;
  else {
    docID += code >>> 1;
    if ((code & 1) != 0)
      termFreq = 1;
    else
      termFreq = freq.readVInt();
  }
  return true;
  }
  (2-3-2) The code that adds the document number and term frequency is FormatPostingsDocsWriter.addDoc:
  FormatPostingsPositionsConsumer FormatPostingsDocsWriter.addDoc(int docID, int termDocFreq) {
    final int delta = docID - lastDocID;
    //When the number of documents reaches a multiple of skipInterval, buffer a skip-list entry.
  if ((++df % skipInterval) == 0) {
  skipListWriter.setSkipData(lastDocID, storePayloads, posWriter.lastPayloadLength);
  skipListWriter.bufferSkip(df);
  }
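    // For example, with Lucene's default skipInterval of 16, a skip entry is buffered after the 16th, 32nd, 48th, ... document that contains this term.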
  lastDocID = docID;
  if (omitTermFreqAndPositions)
  out.writeVInt(delta);
  else if (1 == termDocFreq)
    out.writeVInt((delta<<1) | 1);
  else {
    out.writeVInt(delta<<1);
    out.writeVInt(termDocFreq);
  }
  return posWriter;
  }
  The positions (and payloads) of the document are written to the prx file by FormatPostingsPositionsWriter.addPosition:
  void addPosition(int position, byte[] payload, int payloadOffset, int payloadLength) {
    final int delta = position - lastPosition;
    lastPosition = position;
    if (storePayloads) {
      //If the payload length changed, write (delta<<1)|1 followed by the new length; otherwise write delta<<1.
      if (payloadLength != lastPayloadLength) {
        lastPayloadLength = payloadLength;
        out.writeVInt((delta<<1)|1);
        out.writeVInt(payloadLength);
      } else
        out.writeVInt(delta << 1);
      if (payloadLength > 0)
        out.writeBytes(payload, payloadOffset, payloadLength);
    } else
      out.writeVInt(delta);
  }
  (2-4) Write the skip lists and the term dictionary (tii, tis) to their files
  Writing the buffered skip data: DefaultSkipListWriter(MultiLevelSkipListWriter).writeSkip(IndexOutput) {
    long skipPointer = output.getFilePointer();
    if (skipBuffer == null || skipBuffer.length == 0) return skipPointer;
    //As analyzed in the chapter on the index file format, the higher levels come first and the lowest level last; every level except the lowest is preceded by its length.
  for (int level = numberOfSkipLevels - 1; level > 0; level--) {
  long length = skipBuffer[level].getFilePointer();
  if (length > 0) {
  output.writeVLong(length);
  skipBuffer[level].writeTo(output);
  }
  }
  //Write the lowest level
  skipBuffer[0].writeTo(output);
  return skipPointer;
  }
  Writing the term dictionary (TermInfo) to the tii and tis files: the tii file is a skip-list-like index over the tis file; every indexInterval-th term of the tis file is also placed in the tii file, so that terms can be located quickly.
  The TermInfosWriter class therefore has a member variable other, itself of type TermInfosWriter, and a member variable isIndex that indicates whether the object writes the tii file or the tis file.
  If a TermInfosWriter object has isIndex=false, it writes the tis file, and its other points to the TermInfosWriter that writes the tii file.
  If a TermInfosWriter object has isIndex=true, it writes the tii file, and its other points to the TermInfosWriter that writes the tis file.
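  To make the sampling concrete, a purely illustrative sketch follows; tisWriter, tiiWriter and their write method are hypothetical stand-ins rather than the real TermInfosWriter API, and 128 is Lucene's default indexInterval. The real logic is TermInfosWriter.add, shown next.
  int indexInterval = 128;               // Lucene's default term index interval
  for (int i = 0; i < terms.length; i++) {
    tisWriter.write(terms[i]);           // every term goes into the tis file
    if (i % indexInterval == 0) {
      tiiWriter.write(terms[i]);         // every 128th term also goes into the tii file
    }
  }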
  TermInfosWriter.add(int fieldNumber, byte[] termBytes, int termBytesLength, TermInfo ti) {
    //If the total number of terms so far is a multiple of indexInterval, the term should also be written to the tii file
  if (!isIndex && size % indexInterval == 0)
  other.add(lastFieldNumber, lastTermBytes, lastTermBytesLength, lastTi);
  //Write the term to the tis file
  writeTerm(fieldNumber, termBytes, termBytesLength);
  output.writeVInt(ti.docFreq); // write doc freq
  output.writeVLong(ti.freqPointer - lastTi.freqPointer); // write pointers
  output.writeVLong(ti.proxPointer - lastTi.proxPointer);
  if (ti.docFreq >= skipInterval) {
  output.writeVInt(ti.skipOffset);
  }
  if (isIndex) {
  output.writeVLong(other.output.getFilePointer() - lastIndexPointer);
  lastIndexPointer = other.output.getFilePointer(); // write pointer
  }
  lastFieldNumber = fieldNumber;
  lastTi.set(ti);
  size++;
  }
  6.2.2.2.1.2 Writing the term vectors
  Code: TermVectorsTermsWriter.flush(Map<TermsHashConsumerPerThread, Collection<TermsHashConsumerPerField>> threadsAndFields, final SegmentWriteState state) {
    if (tvx != null) {
      if (state.numDocsInStore > 0)
        fill(state.numDocsInStore - docWriter.getDocStoreOffset());
      tvx.flush();
      tvd.flush();
      tvf.flush();
    }
    for (Map.Entry<TermsHashConsumerPerThread, Collection<TermsHashConsumerPerField>> entry : threadsAndFields.entrySet()) {
      for (final TermsHashConsumerPerField field : entry.getValue()) {
        TermVectorsTermsWriterPerField perField = (TermVectorsTermsWriterPerField) field;
        perField.termsHashPerField.reset();
        perField.shrinkHash();
      }
      TermVectorsTermsWriterPerThread perThread = (TermVectorsTermsWriterPerThread) entry.getKey();
      perThread.termsHashPerThread.reset(true);
    }
  }
  As the code shows, this would write the tvx, tvd and tvf files, but closeDocStore above has already written them and set tvx to null, so in fact nothing is written here; the method merely clears postingsHash so the objects can be reused for the next round of indexing.
  6.2.2.2.2 Writing the normalization factors (norms)
  Code: NormsWriter.flush(Map<InvertedDocEndConsumerPerThread, Collection<InvertedDocEndConsumerPerField>> threadsAndFields, SegmentWriteState state) {
    final Map<FieldInfo, List<NormsWriterPerField>> byField = new HashMap<FieldInfo, List<NormsWriterPerField>>();
    //Iterate over all fields, putting the NormsWriterPerField objects of same-named fields into the same list.
    for (final Map.Entry<InvertedDocEndConsumerPerThread, Collection<InvertedDocEndConsumerPerField>> entry : threadsAndFields.entrySet()) {
  final Collection<InvertedDocEndConsumerPerField> fields = entry.getValue();
  final Iterator<InvertedDocEndConsumerPerField> fieldsIt = fields.iterator();
  while (fieldsIt.hasNext()) {
  final NormsWriterPerField perField = (NormsWriterPerField) fieldsIt.next();
  List<NormsWriterPerField> l = byField.get(perField.fieldInfo);
  if (l == null) {
  l = new ArrayList<NormsWriterPerField>();
  byField.put(perField.fieldInfo, l);
  }
  l.add(perField);
  }
  //Record the name of the file written, so the cfs file can be created later.
  final String normsFileName = state.segmentName + "." + IndexFileNames.NORMS_EXTENSION;
  state.flushedFiles.add(normsFileName);
  IndexOutput normsOut = state.directory.createOutput(normsFileName);
  try {
  //Write the nrm file header
  normsOut.writeBytes(SegmentMerger.NORMS_HEADER, 0, SegmentMerger.NORMS_HEADER.length);
  final int numField = fieldInfos.size();
  int normCount = 0;
  //Process each field
  for(int fieldNumber=0;fieldNumber<numField;fieldNumber++) {
  final FieldInfo fieldInfo = fieldInfos.fieldInfo(fieldNumber);
  //Get the list of same-named fields
  List<NormsWriterPerField> toMerge = byField.get(fieldInfo);
  int upto = 0;
  if (toMerge != null) {
  final int numFields = toMerge.size();
  normCount++;
  final NormsWriterPerField[] fields = new NormsWriterPerField[numFields];
  int[] uptos = new int[numFields];
  for(int j=0;j<numFields;j++)
    fields[j] = toMerge.get(j);
  int numLeft = numFields;
  //Process the same-named fields together
  while(numLeft > 0) {
    //Find the smallest document number among all the same-named fields
    int minLoc = 0;
    int minDocID = fields[0].docIDs[uptos[0]];
    for(int j=1;j<numLeft;j++) {
      final int docID = fields[j].docIDs[uptos[j]];
      if (docID < minDocID) {
        minDocID = docID;
        minLoc = j;
      }
    }
    //... documents in between are filled with the default norm, then the norm byte of the smallest document is written to the nrm file, and the field that supplied it is advanced (and removed once exhausted) ...
  }
  The last of the steps listed at the beginning of this section, applyDeletes(), deletes the documents whose deletion has been buffered. It is invoked with an IndexReader for each segment and that segment's starting document number docIDStart; documents can be deleted by term, by document number, or by query.
  //Delete by term.
  TermDocs docs = reader.termDocs();
  try {
    for (Entry<Term, BufferedDeletes.Num> entry : deletesFlushed.terms.entrySet()) {
      Term term = entry.getKey();
      docs.seek(term);
      int limit = entry.getValue().getNum();
      while (docs.next()) {
        int docID = docs.doc();
        if (docIDStart + docID >= limit)
  break;
  reader.deleteDocument(docID);
  any = true;
  }
  }
  } finally {
  docs.close();
  }
  //Delete by document number.
  for (Integer docIdInt : deletesFlushed.docIDs) {
  int docID = docIdInt.intValue();
  if (docID >= docIDStart && docID < docIDEnd) {
    reader.deleteDocument(docID - docIDStart);
    any = true;
  }
  }
  //Delete by query.
  IndexSearcher searcher = new IndexSearcher(reader);
  for (Entry<Query, Integer> entry : deletesFlushed.queries.entrySet()) {
    Query query = entry.getKey();
    int limit = entry.getValue().intValue();
    Weight weight = query.weight(searcher);
    Scorer scorer = weight.scorer(reader, true, false);
    if (scorer != null) {
      while (true) {
        int doc = scorer.nextDoc();
        if (((long) docIDStart) + doc >= limit)
          break;
        reader.deleteDocument(doc);
        any = true;
      }
    }
  }
  searcher.close();
  return any;
  }
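  To connect this to the public API, here is a small usage sketch assuming the Lucene 3.0-era API (only standard public classes are used). The deletion buffered by deleteDocuments below is exactly the kind of entry that pushDeletes() moves into deletesFlushed and that applyDeletes() finally applies during the flush triggered by commit().
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.store.RAMDirectory;
  import org.apache.lucene.util.Version;

  public class DeleteThenFlush {
    public static void main(String[] args) throws Exception {
      RAMDirectory dir = new RAMDirectory();
      IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30),
                                           IndexWriter.MaxFieldLength.LIMITED);
      Document doc = new Document();
      doc.add(new Field("id", "42", Field.Store.YES, Field.Index.NOT_ANALYZED));
      writer.addDocument(doc);                      // buffered by DocumentsWriter
      writer.deleteDocuments(new Term("id", "42")); // buffered deletion
      writer.commit();                              // triggers the flush described in this section
      writer.close();
    }
  }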