The earlier posts in this series analyzed the Solr side of index processing; the component that actually works with the index files is Lucene, so below we walk through the indexing flow based on Lucene 5.3.1.
Before starting, allow me to borrow a figure showing the flow of the Lucene indexing chain.
We normally write documents into an index through IndexWriter:
indexWriter.addDocument(doc1);
or, for several documents at once:
indexWriter.addDocuments(docs);
A minimal, self-contained usage example is sketched below.
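For readers who want to try this themselves, here is a minimal sketch of opening a writer and adding one document. It is only illustrative: the field names ("id", "body") and the in-memory RAMDirectory are choices made for this example, not anything the post prescribes.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class AddDocumentExample {
  public static void main(String[] args) throws Exception {
    Directory dir = new RAMDirectory();                      // in-memory directory, just for the demo
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    try (IndexWriter indexWriter = new IndexWriter(dir, config)) {
      Document doc1 = new Document();
      doc1.add(new StringField("id", "1", Field.Store.YES)); // not tokenized, stored as-is
      doc1.add(new TextField("body", "hello lucene indexing chain", Field.Store.NO)); // tokenized and inverted
      indexWriter.addDocument(doc1);                         // the call analyzed in this post
      indexWriter.commit();
    }
  }
}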
addDocument in turn simply calls updateDocument with a null term, so the method we really need to look at is updateDocument:
public void updateDocument(Term term, Iterable<? extends IndexableField> doc) throws IOException {
  ensureOpen(); // make sure the IndexWriter is still open
  try {
    boolean success = false;
    try {
      if (docWriter.updateDocument(doc, analyzer, term)) { // delegate the update to the DocumentsWriter
        processEvents(true, false);
      }
      success = true;
    } finally {
      if (!success) {
        if (infoStream.isEnabled("IW")) {
          infoStream.message("IW", "hit exception updating document");
        }
      }
    }
  } catch (AbortingException | OutOfMemoryError tragedy) {
    tragicEvent(tragedy, "updateDocument");
  }
}
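As a side note, the term parameter is what makes this an "update": when it is non-null, any documents matching the term are deleted and the new document is added in one atomic step; when it is null the call degenerates to a plain add. A hedged usage sketch, continuing the earlier example inside the same try-with-resources block (the "id" field is again just an illustrative unique-key field, and org.apache.lucene.index.Term must be imported in addition to the classes above):

Document newDoc = new Document();
newDoc.add(new StringField("id", "1", Field.Store.YES));
newDoc.add(new TextField("body", "updated body text", Field.Store.NO));
indexWriter.updateDocument(new Term("id", "1"), newDoc); // delete-by-term plus add, atomically
indexWriter.commit();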
As you can see, this step delegates to DocumentsWriter.updateDocument:
boolean updateDocument(final Iterable<? extends IndexableField> doc, final Analyzer analyzer,
    final Term delTerm) throws IOException, AbortingException {
  boolean hasEvents = preUpdate(); // first apply pending work, e.g. queued deletes/updates
  final ThreadState perThread = flushControl.obtainAndLock(); // obtain a ThreadState from the pool
  final DocumentsWriterPerThread flushingDWPT;
  try {
    // This must happen after we've pulled the ThreadState because IW.close
    // waits for all ThreadStates to be released:
    ensureOpen();
    ensureInitialized(perThread); // make sure the ThreadState's dwpt is initialized
    assert perThread.isInitialized();
    final DocumentsWriterPerThread dwpt = perThread.dwpt;
    final int dwptNumDocs = dwpt.getNumDocsInRAM();
    try {
      dwpt.updateDocument(doc, analyzer, delTerm); // let the dwpt update the document
    } catch (AbortingException ae) {
      flushControl.doOnAbort(perThread);
      dwpt.abort();
      throw ae;
    } finally {
      // We don't know whether the document actually
      // counted as being indexed, so we must subtract here to
      // accumulate our separate counter:
      numDocsInRAM.addAndGet(dwpt.getNumDocsInRAM() - dwptNumDocs);
    }
    final boolean isUpdate = delTerm != null;
    flushingDWPT = flushControl.doAfterDocument(perThread, isUpdate);
  } finally {
    perThreadPool.release(perThread);
  }
  return postUpdate(flushingDWPT, hasEvents); // post-processing, e.g. triggering a flush if needed
}
The important piece in the code above is ThreadState; Lucene's own javadoc describes it as follows:
/**
* {@link ThreadState} references and guards a
* {@link DocumentsWriterPerThread} instance that is used during indexing to
* build a in-memory index segment. {@link ThreadState} also holds all flush
* related per-thread data controlled by {@link DocumentsWriterFlushControl}.
* <p>
* A {@link ThreadState}, its methods and members should only accessed by one
* thread a time. Users must acquire the lock via {@link ThreadState#lock()}
* and release the lock in a finally block via {@link ThreadState#unlock()}
* before accessing the state.
*/
Roughly speaking, the DocumentsWriterPerThread (dwpt) guarded by a ThreadState is what builds the in-memory index segment while documents are indexed, and the ThreadState also carries the per-thread flush-related state managed by DocumentsWriterFlushControl.
A ThreadState is obtained by looking at the pool's freeList: if the freeList is empty a new ThreadState is created and returned, otherwise one is taken from the list; when the caller is finished it is handed back to the freeList. Early Lucene versions capped the pool at 8 ThreadStates by default; there is no such fixed default any more. The pooling idea is sketched below.
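The following is a minimal, self-contained sketch of that pooling idea, not the real DocumentsWriterPerThreadPool code; every class and method name in it is made up for illustration, and the real implementation carries much more bookkeeping (flush control, aborted states, and so on).

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.ReentrantLock;

final class ThreadStatePoolSketch {

  // Stands in for Lucene's ThreadState: a lock plus the per-thread writer it guards.
  static final class State extends ReentrantLock {
    Object dwpt; // in Lucene this would be the DocumentsWriterPerThread
  }

  private final Deque<State> freeList = new ArrayDeque<>();

  // Mirrors the obtainAndLock() idea: reuse a pooled State if one is free, else create one.
  synchronized State obtainAndLock() {
    State state = freeList.isEmpty() ? new State() : freeList.pop();
    state.lock(); // the caller indexes while holding this lock
    return state;
  }

  // Mirrors release(): unlock and hand the State back for reuse.
  synchronized void release(State state) {
    state.unlock();
    freeList.push(state);
  }
}

Once a ThreadState has been obtained, DocumentsWriter uses its dwpt to update the document; DocumentsWriterPerThread.updateDocument looks like this: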
public void updateDocument(Iterable<? extends IndexableField> doc, Analyzer analyzer, Term delTerm) throws IOException, AbortingException {
  testPoint("DocumentsWriterPerThread addDocument start");
  assert deleteQueue != null;
  reserveOneDoc();
  docState.doc = doc;
  docState.analyzer = analyzer;
  docState.docID = numDocsInRAM;
  if (INFO_VERBOSE && infoStream.isEnabled("DWPT")) {
    infoStream.message("DWPT", Thread.currentThread().getName() + " update delTerm=" + delTerm + " docID=" + docState.docID + " seg=" + segmentInfo.name);
  }
  // Even on exception, the document is still added (but marked
  // deleted), so we don't need to un-reserve at that point.
  // Aborting exceptions will actually "lose" more than one
  // document, so the counter will be "wrong" in that case, but
  // it's very hard to fix (we can't easily distinguish aborting
  // vs non-aborting exceptions):
  boolean success = false;
  try {
    try {
      consumer.processDocument(); // hand the document to the indexing chain (DefaultIndexingChain)
    } finally {
      docState.clear();
    }
    success = true;
  } finally {
    if (!success) {
      // mark document as deleted
      deleteDocID(docState.docID);
      numDocsInRAM++;
    }
  }
  finishDocument(delTerm);
}
The consumer.processDocument() call above actually invokes DefaultIndexingChain.processDocument:

public void processDocument() throws IOException, AbortingException {
  // How many indexed field names we've seen (collapses
  // multiple field instances by the same name):
  int fieldCount = 0;
  long fieldGen = nextFieldGen++;
  // NOTE: we need two passes here, in case there are
  // multi-valued fields, because we must process all
  // instances of a given field at once, since the
  // analyzer is free to reuse TokenStream across fields
  // (i.e., we cannot have more than one TokenStream
  // running "at once"):
  termsHash.startDocument(); // preparation before processing: reset the per-document field state
  fillStoredFields(docState.docID);
  startStoredFields();
  boolean aborting = false;
  try {
    for (IndexableField field : docState.doc) {
      fieldCount = processField(field, fieldGen, fieldCount); // process each field of the document
    }
  } catch (AbortingException ae) {
    aborting = true;
    throw ae;
  } finally {
    if (aborting == false) {
      // Finish each indexed field name seen in the document:
      for (int i = 0; i < fieldCount; i++) {
        fields[i].finish();
      }
      finishStoredFields();
    }
  }
  try {
    termsHash.finishDocument();
  } catch (Throwable th) {
    // Must abort, on the possibility that on-disk term
    // vectors are now corrupt:
    throw AbortingException.wrap(th);
  }
}
private int processField(IndexableField field, long fieldGen, int fieldCount) throws IOException, AbortingException {
  String fieldName = field.name();
  IndexableFieldType fieldType = field.fieldType();
  PerField fp = null;
  if (fieldType.indexOptions() == null) {
    throw new NullPointerException("IndexOptions must not be null (field: \"" + field.name() + "\")");
  }
  // Invert indexed fields:
  if (fieldType.indexOptions() != IndexOptions.NONE) {
    // if the field omits norms, the boost cannot be indexed.
    if (fieldType.omitNorms() && field.boost() != 1.0f) {
      throw new UnsupportedOperationException("You cannot set an index-time boost: norms are omitted for field '"
          + field.name() + "'");
    }
    fp = getOrAddField(fieldName, fieldType, true); // get the PerField; each dwpt keeps exactly one per field name, used to build the in-memory index for that field
    boolean first = fp.fieldGen != fieldGen;
    fp.invert(field, first); // run the analyzer to produce the token stream and build the inverted structures for this field
    if (first) {
      fields[fieldCount++] = fp;
      fp.fieldGen = fieldGen;
    }
  } else {
    verifyUnIndexedFieldType(fieldName, fieldType);
  }
  // Add stored fields:
  if (fieldType.stored()) {
    if (fp == null) {
      fp = getOrAddField(fieldName, fieldType, false);
    }
    if (fieldType.stored()) {
      try {
        storedFieldsWriter.writeField(fp.fieldInfo, field); // stored field: write its value into the stored-fields (document) file
      } catch (Throwable th) {
        throw AbortingException.wrap(th);
      }
    }
  }
  DocValuesType dvType = fieldType.docValuesType(); // handle doc values
  if (dvType == null) {
    throw new NullPointerException("docValuesType cannot be null (field: \"" + fieldName + "\")");
  }
  if (dvType != DocValuesType.NONE) {
    if (fp == null) {
      fp = getOrAddField(fieldName, fieldType, false);
    }
    indexDocValue(fp, dvType, field);
  }
  return fieldCount;
}
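To make the three branches concrete, here is an illustrative document (field names invented for the example; it additionally needs org.apache.lucene.document.StoredField and NumericDocValuesField) and the branch of processField each field goes through:

Document doc = new Document();
// TextField: indexOptions != NONE, so it goes through fp.invert() (tokenized into the inverted index)
doc.add(new TextField("title", "lucene index chain", Field.Store.NO));
// StoredField: indexOptions == NONE but stored() == true, so only storedFieldsWriter.writeField() runs
doc.add(new StoredField("raw", "original json payload"));
// NumericDocValuesField: docValuesType != NONE, so indexDocValue() is called for it
doc.add(new NumericDocValuesField("price", 42L));
// A second value for "title": its PerField already carries the current fieldGen, so first == false,
// it is inverted into the same PerField and not added to fields[] again
doc.add(new TextField("title", "second value of a multi-valued field", Field.Store.NO));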
fp.invert() is where most of the index structures are produced; they are kept in the in-memory buffer and only written to disk later by a flush(). We will look at those structures in detail in later posts, as this series on index creation will span several articles. The above covers the basic indexing chain; corrections are welcome.