Lucene 6.6.1 Source Code Analysis --- Adding Documents

This article walks through the document-adding flow in Lucene: how documents are written into memory, how the flush mechanism is triggered, and how multi-threaded writing is supported. It also shows how DocumentsWriterPerThread processes a document, touching on token-stream analysis and inverted-index construction.


Entry Point

Lucene exposes document addition through IndexWriter (a minimal usage sketch follows the list):
org.apache.lucene.index.IndexWriter#addDocument adds a single document
org.apache.lucene.index.IndexWriter#addDocuments adds multiple documents in one call
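
Before diving into the source, here is a minimal sketch of what calling this entry point looks like from application code. It is illustrative only: the RAMDirectory, the field names, and the sample text are assumptions, not taken from the article.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class AddDocumentExample {
  public static void main(String[] args) throws Exception {
    Directory dir = new RAMDirectory();  // in-memory directory, for illustration only
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    try (IndexWriter writer = new IndexWriter(dir, config)) {
      Document doc = new Document();
      doc.add(new TextField("title", "Lucene in Action", Field.Store.YES));
      doc.add(new TextField("body", "adding a document goes through updateDocument", Field.Store.NO));
      long seqNo = writer.addDocument(doc);  // the entry point analyzed below; returns a sequence number
      System.out.println("sequence number = " + seqNo);
      writer.commit();
    }
  }
}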

This article uses addDocument as the entry point to examine the document-adding flow.

public long addDocument(Iterable<? extends IndexableField> doc) throws IOException {
    return updateDocument(null, doc);
  }
public long updateDocument(Term term, Iterable<? extends IndexableField> doc) throws IOException {
    ensureOpen();
    try {
      boolean success = false;
      try {
        long seqNo = docWriter.updateDocument(doc, analyzer, term);
        if (seqNo < 0) {
          seqNo = - seqNo;
          processEvents(true, false);
        }
        success = true;
        return seqNo;
      } finally {
        if (!success) {
          if (infoStream.isEnabled("IW")) {
            infoStream.message("IW", "hit exception updating document");
          }
        }
      }
    } catch (AbortingException | VirtualMachineError tragedy) {
      tragicEvent(tragedy, "updateDocument");

      // dead code but javac disagrees:
      return -1;
    }
  }
  1. ensureOpen() first confirms that the index is open; a document can only be inserted into an open index.
  2. DocumentsWriter#updateDocument is then called to insert the document, returning an operation sequence number.
  3. Various events may be produced while the document is inserted; processEvents handles them afterwards.

So what does updateDocument actually do? Let's dig a level deeper.

long updateDocument(final Iterable<? extends IndexableField> doc, final Analyzer analyzer,
      final Term delTerm) throws IOException, AbortingException {

    boolean hasEvents = preUpdate();

    final ThreadState perThread = flushControl.obtainAndLock();

    final DocumentsWriterPerThread flushingDWPT;
    long seqNo;
    try {
      // This must happen after we've pulled the ThreadState because IW.close
      // waits for all ThreadStates to be released:
      ensureOpen();
      ensureInitialized(perThread);
      assert perThread.isInitialized();
      final DocumentsWriterPerThread dwpt = perThread.dwpt;
      final int dwptNumDocs = dwpt.getNumDocsInRAM();
      try {
        seqNo = dwpt.updateDocument(doc, analyzer, delTerm); 
      } catch (AbortingException ae) {
        flushControl.doOnAbort(perThread);
        dwpt.abort();
        throw ae;
      } finally {
        // We don't know whether the document actually
        // counted as being indexed, so we must subtract here to
        // accumulate our separate counter:
        numDocsInRAM.addAndGet(dwpt.getNumDocsInRAM() - dwptNumDocs);
      }
      final boolean isUpdate = delTerm != null;
      flushingDWPT = flushControl.doAfterDocument(perThread, isUpdate);

      assert seqNo > perThread.lastSeqNo: "seqNo=" + seqNo + " lastSeqNo=" + perThread.lastSeqNo;
      perThread.lastSeqNo = seqNo;

    } finally {
      perThreadPool.release(perThread);
    }

    if (postUpdate(flushingDWPT, hasEvents)) {
      seqNo = -seqNo;
    }
    
    return seqNo;
  }

The first call here is preUpdate, which, judging by the name, performs some preparatory work before the insert. Let's see what it actually does.

private boolean preUpdate() throws IOException, AbortingException {
    ensureOpen();
    boolean hasEvents = false;
    if (flushControl.anyStalledThreads() || flushControl.numQueuedFlushes() > 0) {
      // Help out flushing any queued DWPTs so we can un-stall:
      do {
        // Try pick up pending threads here if possible
        DocumentsWriterPerThread flushingDWPT;
        while ((flushingDWPT = flushControl.nextPendingFlush()) != null) {
          // Don't push the delete here since the update could fail!
          hasEvents |= doFlush(flushingDWPT);
        }
        
        flushControl.waitIfStalled(); // block if stalled
      } while (flushControl.numQueuedFlushes() != 0); // still queued DWPTs try help flushing
    }
    return hasEvents;
  }

1. Confirm once more that the index is open.
2. If any data needs flushing, flush it via doFlush.

A brief look at flush

A brief explanation of Lucene's flush is in order here. As noted in the previous article on index creation, only one IndexWriter instance can operate on a Lucene index at a time. To keep write performance up, however, multi-threaded writing is a must, so IndexWriter maintains a pool, perThreadPool, holding multiple writer states (perThread). A perThread does not write data straight to disk either, since frequent I/O would inevitably hurt performance; it first buffers data in memory, and once the buffered data reaches a threshold it is flushed to disk. Managing these flushes is entirely the job of flushControl. In the code above, the condition flushControl.anyStalledThreads() || flushControl.numQueuedFlushes() > 0 means that when writer threads are stalled because too much data has accumulated, or the flush queue already contains DWPTs waiting to be flushed, the current thread helps flush before inserting. This is only a rough description of Lucene's flush mechanism; a dedicated article will cover it in detail.
In short, preUpdate flushes whatever data is waiting to be flushed before the new document is inserted.
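
For reference, the thresholds that flushControl and the flush policy enforce come from IndexWriterConfig. A minimal sketch of configuring them (the 64 MB value is an arbitrary example, and dir is assumed to be an already-opened Directory):

IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
config.setRAMBufferSizeMB(64.0);                                  // flush once buffered data exceeds ~64 MB
config.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);  // or flush by document count instead of RAM
IndexWriter writer = new IndexWriter(dir, config);                // flushControl enforces these limits via the FlushPolicy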

After preUpdate, a writer thread state is obtained from perThreadPool to write the document:

final ThreadState perThread = flushControl.obtainAndLock();

ThreadState obtainAndLock() {
    final ThreadState perThread = perThreadPool.getAndLock(Thread
        .currentThread(), documentsWriter);
    boolean success = false;
    try {
      if (perThread.isInitialized() && perThread.dwpt.deleteQueue != documentsWriter.deleteQueue) {
        // There is a flush-all in process and this DWPT is
        // now stale -- enroll it for flush and try for
        // another DWPT:
        addFlushableState(perThread);
      }
      success = true;
      // simply return the ThreadState even in a flush all case sine we already hold the lock
      return perThread;
    } finally {
      if (!success) { // make sure we unlock if this fails
        perThreadPool.release(perThread);
      }
    }
  }

The code above shows that what we get from perThreadPool is not the DocumentsWriterPerThread one might expect, but a ThreadState.

final static class ThreadState extends ReentrantLock {
    DocumentsWriterPerThread dwpt;
    // TODO this should really be part of DocumentsWriterFlushControl
    // write access guarded by DocumentsWriterFlushControl
    volatile boolean flushPending = false;
    // TODO this should really be part of DocumentsWriterFlushControl
    // write access guarded by DocumentsWriterFlushControl
    long bytesUsed = 0;

    // set by DocumentsWriter after each indexing op finishes
    volatile long lastSeqNo;

ThreadState is a wrapper around DocumentsWriterPerThread. Its flushPending field records whether the perThread currently has a flush pending, and bytesUsed records how much data the thread is currently holding; flushControl relies on this information to manage the flushing of each thread. In addition, ThreadState extends ReentrantLock, using the re-entrant lock to guarantee thread safety.
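
The pattern of a pool handing out state objects that are themselves the locks guarding their fields can be illustrated with a simplified sketch. This is not Lucene code; WriterState and WriterStatePool are hypothetical stand-ins for ThreadState and DocumentsWriterPerThreadPool.

import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.ReentrantLock;

class WriterState extends ReentrantLock {
  long bytesUsed = 0;            // guarded by this lock, like ThreadState.bytesUsed
  boolean flushPending = false;  // guarded by this lock, like ThreadState.flushPending
}

class WriterStatePool {
  private final ConcurrentLinkedQueue<WriterState> free = new ConcurrentLinkedQueue<>();

  WriterState getAndLock() {
    WriterState state = free.poll();
    if (state == null) {
      state = new WriterState();  // grow the pool on demand
    }
    state.lock();                 // the caller owns the state until release()
    return state;
  }

  void release(WriterState state) {
    state.unlock();
    free.offer(state);            // make the state available to other indexing threads
  }
}

Because the state object is the lock, the thread that has obtained it can mutate bytesUsed and flushPending without further synchronization, which mirrors how obtainAndLock and perThreadPool.release are paired in updateDocument above.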

Once the perThread is obtained, its DocumentsWriterPerThread instance dwpt is taken out and dwpt.updateDocument is called to insert the data.

The next call is flushingDWPT = flushControl.doAfterDocument(perThread, isUpdate). We already know flushControl manages flush operations, so this method presumably relates to flushing as well. Let's step into it to verify.

synchronized DocumentsWriterPerThread doAfterDocument(ThreadState perThread, boolean isUpdate) {
    try {
      commitPerThreadBytes(perThread);
      if (!perThread.flushPending) {
        if (isUpdate) {
          flushPolicy.onUpdate(this, perThread);
        } else {
          flushPolicy.onInsert(this, perThread);
        }
        if (!perThread.flushPending && perThread.bytesUsed > hardMaxBytesPerDWPT) {
          // Safety check to prevent a single DWPT exceeding its RAM limit. This
          // is super important since we can not address more than 2048 MB per DWPT
          setFlushPending(perThread);
        }
      }
      final DocumentsWriterPerThread flushingDWPT;
      if (fullFlush) {
        if (perThread.flushPending) {
          checkoutAndBlock(perThread);
          flushingDWPT = nextPendingFlush();
        } else {
          flushingDWPT = null;
        }
      } else {
        flushingDWPT = tryCheckoutForFlush(perThread);
      }
      return flushingDWPT;
    } finally {
      boolean stalled = updateStallState();
      assert assertNumDocsSinceStalled(stalled) && assertMemory();
    }
  }

commitPerThreadBytes(perThread) updates the amount of memory the perThread holds; since a document has just been inserted, its bytesUsed should grow. The flush policy is then consulted (onUpdate or onInsert) to decide whether this perThread should be marked flushPending. Finally, tryCheckoutForFlush checks the perThread out for flushing if it is pending; if that succeeds, the returned DWPT can be flushed next.
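
The default policy consulted here is FlushByRamOrCountsPolicy. The following is a simplified sketch of the kind of RAM-based decision its onInsert/onUpdate make; it is not the actual Lucene implementation, indexWriterConfig is assumed to be available as a field (as it is in the real policy class), and findLargestNonPendingWriter stands for the policy helper that picks the DWPT holding the most buffered bytes.

// Simplified sketch, in the spirit of FlushByRamOrCountsPolicy: once the total buffered bytes
// cross the configured limit, mark the DWPT holding the most bytes as flushPending so that
// doAfterDocument() above can check it out for flushing.
void onInsert(DocumentsWriterFlushControl control, ThreadState perThread) {
  double ramBufferMB = indexWriterConfig.getRAMBufferSizeMB();
  if (ramBufferMB != IndexWriterConfig.DISABLE_AUTO_FLUSH) {
    long limit = (long) (ramBufferMB * 1024 * 1024);
    long totalBuffered = control.activeBytes() + control.getDeleteBytesUsed();
    if (totalBuffered >= limit) {
      control.setFlushPending(findLargestNonPendingWriter(control, perThread));
    }
  }
}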

postUpdate(flushingDWPT, hasEvents) then performs the actual flush on the checked-out DWPT, so inserting a document is immediately followed by flush handling.

Let's continue down the add path and see how org.apache.lucene.index.DocumentsWriterPerThread#updateDocument inserts the document.

public long updateDocument(Iterable<? extends IndexableField> doc, Analyzer analyzer, Term delTerm) throws IOException, AbortingException {
    testPoint("DocumentsWriterPerThread addDocument start");
    assert deleteQueue != null;
    reserveOneDoc();
    docState.doc = doc;
    docState.analyzer = analyzer;
    docState.docID = numDocsInRAM;
    if (INFO_VERBOSE && infoStream.isEnabled("DWPT")) {
      infoStream.message("DWPT", Thread.currentThread().getName() + " update delTerm=" + delTerm + " docID=" + docState.docID + " seg=" + segmentInfo.name);
    }
    // Even on exception, the document is still added (but marked
    // deleted), so we don't need to un-reserve at that point.
    // Aborting exceptions will actually "lose" more than one
    // document, so the counter will be "wrong" in that case, but
    // it's very hard to fix (we can't easily distinguish aborting
    // vs non-aborting exceptions):
    boolean success = false;
    try {
      try {
        consumer.processDocument();
      } finally {
        docState.clear();
      }
      success = true;
    } finally {
      if (!success) {
        // mark document as deleted
        deleteDocID(docState.docID);
        numDocsInRAM++;
      }
    }

    return finishDocument(delTerm);
  }

This method stores the document content, the analyzer, and the document ID in docState, so that consumer.processDocument() can read the document information from docState. The consumer is dedicated to indexing fields: a Lucene document is composed of fields, so indexing a document decomposes into indexing each of the fields it contains.
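
To make the "document is a set of fields" point concrete, here is a small sketch of a document mixing an analyzed field, an exact-match keyword field, and a stored-only field; the field names and values are arbitrary examples.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

class DocExample {
  Document buildExampleDoc() {
    Document doc = new Document();
    // analyzed and indexed: processDocument()/processField() run the analyzer over it and build postings
    doc.add(new TextField("body", "Lucene decomposes indexing into per-field work", Field.Store.NO));
    // indexed as a single un-analyzed token: useful for exact-match keys
    doc.add(new StringField("id", "doc-42", Field.Store.YES));
    // stored only, never inverted: handled by the stored-fields writer (see startStoredFields below)
    doc.add(new StoredField("price", 9.99));
    return doc;
  }
}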

public void processDocument() throws IOException, AbortingException {

    // How many indexed field names we've seen (collapses
    // multiple field instances by the same name):
    int fieldCount = 0;

    long fieldGen = nextFieldGen++;

    // NOTE: we need two passes here, in case there are
    // multi-valued fields, because we must process all
    // instances of a given field at once, since the
    // analyzer is free to reuse TokenStream across fields
    // (i.e., we cannot have more than one TokenStream
    // running "at once"):

    termsHash.startDocument();

    startStoredFields(docState.docID);

    boolean aborting = false;
    try {
      for (IndexableField field : docState.doc) {
        fieldCount = processField(field, fieldGen, fieldCount);
      }
    } catch (AbortingException ae) {
      aborting = true;
      throw ae;
    } finally {
      if (aborting == false) {
        // Finish each indexed field name seen in the document:
        for (int i=0;i<fieldCount;i++) {
          fields[i].finish();
        }
        finishStoredFields();
      }
    }

    try {
      termsHash.finishDocument();
    } catch (Throwable th) {
      // Must abort, on the possibility that on-disk term
      // vectors are now corrupt:
      throw AbortingException.wrap(th);
    }
  }

In this method, Lucene breaks the document into its fields and processes them in a loop:

for (IndexableField field : docState.doc) {
  fieldCount = processField(field, fieldGen, fieldCount);
}

processField covers token-stream analysis and the construction of the inverted index. There is a lot of ground there, so it is left for a separate chapter.
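
The token-stream half of that work can still be previewed from application code: an Analyzer turns a field's text into a TokenStream, and each term it emits ends up as a posting in the inverted index. A minimal sketch, where StandardAnalyzer, the field name, and the sample text are illustrative assumptions:

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenStreamExample {
  public static void main(String[] args) throws IOException {
    try (Analyzer analyzer = new StandardAnalyzer();
         TokenStream ts = analyzer.tokenStream("body", "Lucene builds an inverted index per field")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();                                // mandatory before the first incrementToken()
      while (ts.incrementToken()) {
        System.out.println(term.toString());     // each term becomes a posting for field "body"
      }
      ts.end();
    }
  }
}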

Summary

This article traced the insertion of a single document in Lucene and made three main points:

  • A document is first written to memory; once the buffered data reaches a threshold, a flush moves it from memory to disk.
  • Document insertion supports multiple threads.
  • Inserting a document ultimately decomposes into inserting its fields.