Lucene 6.6.1 Source Code Analysis --- Adding Documents

This article walks through the document-adding flow in Lucene: how documents are written into memory, how the flush mechanism is triggered, and how multi-threaded writing is supported. It also shows how DocumentsWriterPerThread processes a document, touching on token-stream analysis and inverted-index construction.


Entry Point

Lucene exposes document addition through IndexWriter (a minimal usage sketch follows the list):
org.apache.lucene.index.IndexWriter#addDocument adds a single document
org.apache.lucene.index.IndexWriter#addDocuments adds multiple documents in one call
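
Before diving into the source, here is a minimal sketch of what calling this entry point looks like from application code. It is illustrative only: the RAMDirectory, the field names, and the sample text are assumptions, not taken from the article.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class AddDocumentExample {
  public static void main(String[] args) throws Exception {
    Directory dir = new RAMDirectory();  // in-memory directory, for illustration only
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    try (IndexWriter writer = new IndexWriter(dir, config)) {
      Document doc = new Document();
      doc.add(new TextField("title", "Lucene in Action", Field.Store.YES));
      doc.add(new TextField("body", "adding a document goes through updateDocument", Field.Store.NO));
      long seqNo = writer.addDocument(doc);  // the entry point analyzed below; returns a sequence number
      System.out.println("sequence number = " + seqNo);
      writer.commit();
    }
  }
}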

This article uses addDocument as the entry point to examine the document-adding flow.

public long addDocument(Iterable<? extends IndexableField> doc) throws IOException {
    return updateDocument(null, doc);
  }
public long updateDocument(Term term, Iterable<? extends IndexableField> doc) throws IOException {
    ensureOpen();
    try {
      boolean success = false;
      try {
        long seqNo = docWriter.updateDocument(doc, analyzer, term);
        if (seqNo < 0) {
          seqNo = - seqNo;
          processEvents(true, false);
        }
        success = true;
        return seqNo;
      } finally {
        if (!success) {
          if (infoStream.isEnabled("IW")) {
            infoStream.message("IW", "hit exception updating document");
          }
        }
      }
    } catch (AbortingException | VirtualMachineError tragedy) {
      tragicEvent(tragedy, "updateDocument");

      // dead code but javac disagrees:
      return -1;
    }
  }
  1. ensureOpen() first confirms that the index is open; a document can only be inserted into an open index.
  2. DocumentsWriter#updateDocument is then called to insert the document, returning an operation sequence number.
  3. Various events may be produced while the document is inserted; processEvents handles them afterwards.

So what does updateDocument actually do? Let's dig a level deeper.

long updateDocument(final Iterable<? extends IndexableField> doc, final Analyzer analyzer,
      final Term delTerm) throws IOException, AbortingException {

    boolean hasEvents = preUpdate();

    final ThreadState perThread = flushControl.obtainAndLock();

    final DocumentsWriterPerThread flushingDWPT;
    long seqNo;
    try {
      // This must happen after we've pulled the ThreadState because IW.close
      // waits for all ThreadStates to be released:
      ensureOpen();
      ensureInitialized(perThread);
      assert perThread.isInitialized();
      final DocumentsWriterPerThread dwpt = perThread.dwpt;
      final int dwptNumDocs = dwpt.getNumDocsInRAM();
      try {
        seqNo = dwpt.updateDocument(doc, analyzer, delTerm); 
      } catch (AbortingException ae) {
        flushControl.doOnAbort(perThread);
        dwpt.abort();
        throw ae;
      } finally {
        // We don't know whether the document actually
        // counted as being indexed, so we must subtract here to
        // accumulate our separate counter:
        numDocsInRAM.addAndGet(dwpt.getNumDocsInRAM() - dwptNumDocs);
      }
      final boolean isUpdate = delTerm != null;
      flushingDWPT = flushControl.doAfterDocument(perThread, isUpdate);

      assert seqNo > perThread.lastSeqNo: "seqNo=" + seqNo + " lastSeqNo=" + perThread.lastSeqNo;
      perThread.lastSeqNo = seqNo;

    } finally {
      perThreadPool.release(perThread);
    }

    if (postUpdate(flushingDWPT, hasEvents)) {
      seqNo = -seqNo;
    }
    
    return seqNo;
  }

The first call here is preUpdate, which, judging by the name, performs some preparatory work before the insert. Let's see what it actually does.

private boolean preUpdate() throws IOException, AbortingException {
    ensureOpen();
    boolean hasEvents = false;
    if (flushControl.anyStalledThreads() || flushControl.numQueuedFlushes() > 0) {
      // Help out flushing any queued DWPTs so we can un-stall:
      do {
        // Try pick up pending threads here if possible
        DocumentsWriterPerThread flushingDWPT;
        while ((flushingDWPT = flushControl.nextPendingFlush()) != null) {
          // Don't push the delete here since the update could fail!
          hasEvents |= doFlush(flushingDWPT);
        }
        
        flushControl.waitIfStalled(); // block if stalled
      } while (flushControl.numQueuedFlushes() != 0); // still queued DWPTs try help flushing
    }
    return hasEvents;
  }

1. Confirm once more that the index is open.
2. If any data needs flushing, flush it via doFlush.

A brief look at flush

A brief explanation of Lucene's flush is in order here. As noted in the previous article on index creation, only one IndexWriter instance can operate on a Lucene index at a time. To keep write performance up, however, multi-threaded writing is a must, so IndexWriter maintains a pool, perThreadPool, holding multiple writer states (perThread). A perThread does not write data straight to disk either, since frequent I/O would inevitably hurt performance; it first buffers data in memory, and once the buffered data reaches a threshold it is flushed to disk. Managing these flushes is entirely the job of flushControl. In the code above, the condition flushControl.anyStalledThreads() || flushControl.numQueuedFlushes() > 0 means that when writer threads are stalled because too much data has accumulated, or the flush queue already contains DWPTs waiting to be flushed, the current thread helps flush before inserting. This is only a rough description of Lucene's flush mechanism; a dedicated article will cover it in detail.
In short, preUpdate flushes whatever data is waiting to be flushed before the new document is inserted.
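
For reference, the thresholds that flushControl and the flush policy enforce come from IndexWriterConfig. A minimal sketch of configuring them (the 64 MB value is an arbitrary example, and dir is assumed to be an already-opened Directory):

IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
config.setRAMBufferSizeMB(64.0);                                  // flush once buffered data exceeds ~64 MB
config.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);  // or flush by document count instead of RAM
IndexWriter writer = new IndexWriter(dir, config);                // flushControl enforces these limits via the FlushPolicy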

After preUpdate, a writer thread state is obtained from perThreadPool to write the document:

final ThreadState perThread = flushControl.obtainAndLock();

ThreadState obtainAndLock() {
    final ThreadState perThread = perThreadPool.getAndLock(Thread
        .currentThread(), documentsWriter);
    boolean success = false;
    try {
      if (perThread.isInitialized() && perThread.dwpt.deleteQueue != documentsWriter.deleteQueue) {
        // There is a flush-all in process and this DWPT is
        // now stale -- enroll it for flush and try for
        // another DWPT:
        addFlushableState(perThread);
      }
      success = true;
      // simply return the ThreadState even in a flush all case sine we already hold the lock
      return perThread;
    } finally {
      if (!success) { // make sure we unlock if this fails
        perThreadPool.release(perThread);
      }
    }
  }

The code above shows that what we get from perThreadPool is not the DocumentsWriterPerThread one might expect, but a ThreadState.

final static class ThreadState extends ReentrantLock {
    DocumentsWriterPerThread dwpt;
    // TODO this should really be part of DocumentsWriterFlushControl
    // write access guarded by DocumentsWriterFlushControl
    volatile boolean flushPending = false;
    // TODO this should really be part of DocumentsWriterFlushControl
    // write access guarded by DocumentsWriterFlushControl
    long bytesUsed = 0;

    // set by DocumentsWriter after each indexing op finishes
    volatile long lastSeqNo;

ThreadState is a wrapper around DocumentsWriterPerThread. Its flushPending field records whether the perThread currently has a flush pending, and bytesUsed records how much data the thread is currently holding; flushControl relies on this information to manage the flushing of each thread. In addition, ThreadState extends ReentrantLock, using the re-entrant lock to guarantee thread safety.
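
The pattern of a pool handing out state objects that are themselves the locks guarding their fields can be illustrated with a simplified sketch. This is not Lucene code; WriterState and WriterStatePool are hypothetical stand-ins for ThreadState and DocumentsWriterPerThreadPool.

import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.ReentrantLock;

class WriterState extends ReentrantLock {
  long bytesUsed = 0;            // guarded by this lock, like ThreadState.bytesUsed
  boolean flushPending = false;  // guarded by this lock, like ThreadState.flushPending
}

class WriterStatePool {
  private final ConcurrentLinkedQueue<WriterState> free = new ConcurrentLinkedQueue<>();

  WriterState getAndLock() {
    WriterState state = free.poll();
    if (state == null) {
      state = new WriterState();  // grow the pool on demand
    }
    state.lock();                 // the caller owns the state until release()
    return state;
  }

  void release(WriterState state) {
    state.unlock();
    free.offer(state);            // make the state available to other indexing threads
  }
}

Because the state object is the lock, the thread that has obtained it can mutate bytesUsed and flushPending without further synchronization, which mirrors how obtainAndLock and perThreadPool.release are paired in updateDocument above.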

Once the perThread is obtained, its DocumentsWriterPerThread instance dwpt is taken out and dwpt.updateDocument is called to insert the data.

The next call is flushingDWPT = flushControl.doAfterDocument(perThread, isUpdate). We already know flushControl manages flush operations, so this method presumably relates to flushing as well. Let's step into it to verify.

synchronized DocumentsWriterPerThread doAfterDocument(ThreadState perThread, boolean isUpdate) {
    try {
      commitPerThreadBytes(perThread);
      if (!perThread.flushPending) {
        if (isUpdate) {
          flushPolicy.onUpdate(this, perThread);
        } else {
          flushPolicy.onInsert(this, perThread);
        }
        if (!perThread.flushPending && perThread.bytesUsed > hardMaxBytesPerDWPT) {
          // Safety check to prevent a single DWPT exceeding its RAM limit. This
          // is super important since we can not address more than 2048 MB per DWPT
          setFlushPending(perThread);
        }
      }
      final DocumentsWriterPerThread flushingDWPT;
      if (fullFlush) {
        if (perThread.flushPending) {
          checkoutAndBlock(perThread);
          flushingDWPT = nextPendingFlush();
        } else {
          flushingDWPT = null;
        }
      } else {
        flushingDWPT = tryCheckoutForFlush(perThread);
      }
      return flushingDWPT;
    } finally {
      boolean stalled = updateStallState();
      assert assertNumDocsSinceStalled(stalled) && assertMemory();
    }
  }

commitPerThreadBytes(perThread) updates the amount of memory the perThread holds; since a document has just been inserted, its bytesUsed should grow. The flush policy is then consulted (onUpdate or onInsert) to decide whether this perThread should be marked flushPending. Finally, tryCheckoutForFlush checks the perThread out for flushing if it is pending; if that succeeds, the returned DWPT can be flushed next.
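
The default policy consulted here is FlushByRamOrCountsPolicy. The following is a simplified sketch of the kind of RAM-based decision its onInsert/onUpdate make; it is not the actual Lucene implementation, indexWriterConfig is assumed to be available as a field (as it is in the real policy class), and findLargestNonPendingWriter stands for the policy helper that picks the DWPT holding the most buffered bytes.

// Simplified sketch, in the spirit of FlushByRamOrCountsPolicy: once the total buffered bytes
// cross the configured limit, mark the DWPT holding the most bytes as flushPending so that
// doAfterDocument() above can check it out for flushing.
void onInsert(DocumentsWriterFlushControl control, ThreadState perThread) {
  double ramBufferMB = indexWriterConfig.getRAMBufferSizeMB();
  if (ramBufferMB != IndexWriterConfig.DISABLE_AUTO_FLUSH) {
    long limit = (long) (ramBufferMB * 1024 * 1024);
    long totalBuffered = control.activeBytes() + control.getDeleteBytesUsed();
    if (totalBuffered >= limit) {
      control.setFlushPending(findLargestNonPendingWriter(control, perThread));
    }
  }
}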

postUpdate(flushingDWPT, hasEvents) then performs the actual flush on the checked-out DWPT, so inserting a document is immediately followed by flush handling.

Let's continue down the add path and see how org.apache.lucene.index.DocumentsWriterPerThread#updateDocument inserts the document.

public long updateDocument(Iterable<? extends IndexableField> doc, Analyzer analyzer, Term delTerm) throws IOException, AbortingException {
    testPoint("DocumentsWriterPerThread addDocument start");
    assert deleteQueue != null;
    reserveOneDoc();
    docState.doc = doc;
    docState.analyzer = analyzer;
    docState.docID = numDocsInRAM;
    if (INFO_VERBOSE && infoStream.isEnabled("DWPT")) {
      infoStream.message("DWPT", Thread.currentThread().getName() + " update delTerm=" + delTerm + " docID=" + docState.docID + " seg=" + segmentInfo.name);
    }
    // Even on exception, the document is still added (but marked
    // deleted), so we don't need to un-reserve at that point.
    // Aborting exceptions will actually "lose" more than one
    // document, so the counter will be "wrong" in that case, but
    // it's very hard to fix (we can't easily distinguish aborting
    // vs non-aborting exceptions):
    boolean success = false;
    try {
      try {
        consumer.processDocument();
      } finally {
        docState.clear();
      }
      success = true;
    } finally {
      if (!success) {
        // mark document as deleted
        deleteDocID(docState.docID);
        numDocsInRAM++;
      }
    }

    return finishDocument(delTerm);
  }

This method stores the document content, the analyzer, and the document ID in docState, so that consumer.processDocument() can read the document information from docState. The consumer is dedicated to indexing fields: a Lucene document is composed of fields, so indexing a document decomposes into indexing each of the fields it contains.
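
To make the "document is a set of fields" point concrete, here is a small sketch of a document mixing an analyzed field, an exact-match keyword field, and a stored-only field; the field names and values are arbitrary examples.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

class DocExample {
  Document buildExampleDoc() {
    Document doc = new Document();
    // analyzed and indexed: processDocument()/processField() run the analyzer over it and build postings
    doc.add(new TextField("body", "Lucene decomposes indexing into per-field work", Field.Store.NO));
    // indexed as a single un-analyzed token: useful for exact-match keys
    doc.add(new StringField("id", "doc-42", Field.Store.YES));
    // stored only, never inverted: handled by the stored-fields writer (see startStoredFields below)
    doc.add(new StoredField("price", 9.99));
    return doc;
  }
}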

public void processDocument() throws IOException, AbortingException {

    // How many indexed field names we've seen (collapses
    // multiple field instances by the same name):
    int fieldCount = 0;

    long fieldGen = nextFieldGen++;

    // NOTE: we need two passes here, in case there are
    // multi-valued fields, because we must process all
    // instances of a given field at once, since the
    // analyzer is free to reuse TokenStream across fields
    // (i.e., we cannot have more than one TokenStream
    // running "at once"):

    termsHash.startDocument();

    startStoredFields(docState.docID);

    boolean aborting = false;
    try {
      for (IndexableField field : docState.doc) {
        fieldCount = processField(field, fieldGen, fieldCount);
      }
    } catch (AbortingException ae) {
      aborting = true;
      throw ae;
    } finally {
      if (aborting == false) {
        // Finish each indexed field name seen in the document:
        for (int i=0;i<fieldCount;i++) {
          fields[i].finish();
        }
        finishStoredFields();
      }
    }

    try {
      termsHash.finishDocument();
    } catch (Throwable th) {
      // Must abort, on the possibility that on-disk term
      // vectors are now corrupt:
      throw AbortingException.wrap(th);
    }
  }

In this method, Lucene breaks the document into its fields and processes them in a loop:

for (IndexableField field : docState.doc) {
  fieldCount = processField(field, fieldGen, fieldCount);
}

processField covers token-stream analysis and the construction of the inverted index. There is a lot of ground there, so it is left for a separate chapter.
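
The token-stream half of that work can still be previewed from application code: an Analyzer turns a field's text into a TokenStream, and each term it emits ends up as a posting in the inverted index. A minimal sketch, where StandardAnalyzer, the field name, and the sample text are illustrative assumptions:

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenStreamExample {
  public static void main(String[] args) throws IOException {
    try (Analyzer analyzer = new StandardAnalyzer();
         TokenStream ts = analyzer.tokenStream("body", "Lucene builds an inverted index per field")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();                                // mandatory before the first incrementToken()
      while (ts.incrementToken()) {
        System.out.println(term.toString());     // each term becomes a posting for field "body"
      }
      ts.end();
    }
  }
}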

Summary

This article traced the insertion of a single document in Lucene and made three main points:

  • A document is first written to memory; once the buffered data reaches a threshold, a flush moves it from memory to disk.
  • Document insertion supports multiple threads.
  • Inserting a document ultimately decomposes into inserting its fields.