Entry point
Adding documents in Lucene is provided by IndexWriter:
org.apache.lucene.index.IndexWriter#addDocument adds a single document
org.apache.lucene.index.IndexWriter#addDocuments adds a batch of documents
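Before diving into the internals, here is a minimal caller-side sketch of adding a document (the index path, analyzer and field names are illustrative, not taken from this article):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;

public class AddDocumentDemo {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(Paths.get("/tmp/lucene-demo-index")); // illustrative path
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    try (IndexWriter writer = new IndexWriter(dir, config)) {
      Document doc = new Document();
      doc.add(new StringField("id", "1", Field.Store.YES));              // indexed as-is, not tokenized
      doc.add(new TextField("content", "hello lucene", Field.Store.NO)); // tokenized by the analyzer
      long seqNo = writer.addDocument(doc); // single document; returns an operation sequence number
      writer.commit();
    }
  }
}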
This article takes addDocument as the entry point to walk through the document-adding flow; in IndexWriter it looks like this:
public long addDocument(Iterable<? extends IndexableField> doc) throws IOException {
  return updateDocument(null, doc);
}

public long updateDocument(Term term, Iterable<? extends IndexableField> doc) throws IOException {
  ensureOpen();
  try {
    boolean success = false;
    try {
      long seqNo = docWriter.updateDocument(doc, analyzer, term);
      if (seqNo < 0) {
        seqNo = -seqNo;
        processEvents(true, false);
      }
      success = true;
      return seqNo;
    } finally {
      if (!success) {
        if (infoStream.isEnabled("IW")) {
          infoStream.message("IW", "hit exception updating document");
        }
      }
    }
  } catch (AbortingException | VirtualMachineError tragedy) {
    tragicEvent(tragedy, "updateDocument");
    // dead code but javac disagrees:
    return -1;
  }
}
- First, ensureOpen() confirms that the index is open; a document can only be inserted into an open index.
- Then DocumentsWriter#updateDocument is called to insert the document, returning an operation sequence number.
- Inserting a document can generate various events; after the insert, processEvents handles them.
So what happens inside DocumentsWriter#updateDocument? Let's dig a little deeper.
long updateDocument(final Iterable<? extends IndexableField> doc, final Analyzer analyzer,
                    final Term delTerm) throws IOException, AbortingException {
  boolean hasEvents = preUpdate();

  final ThreadState perThread = flushControl.obtainAndLock();
  final DocumentsWriterPerThread flushingDWPT;
  long seqNo;

  try {
    // This must happen after we've pulled the ThreadState because IW.close
    // waits for all ThreadStates to be released:
    ensureOpen();
    ensureInitialized(perThread);
    assert perThread.isInitialized();
    final DocumentsWriterPerThread dwpt = perThread.dwpt;
    final int dwptNumDocs = dwpt.getNumDocsInRAM();
    try {
      seqNo = dwpt.updateDocument(doc, analyzer, delTerm);
    } catch (AbortingException ae) {
      flushControl.doOnAbort(perThread);
      dwpt.abort();
      throw ae;
    } finally {
      // We don't know whether the document actually
      // counted as being indexed, so we must subtract here to
      // accumulate our separate counter:
      numDocsInRAM.addAndGet(dwpt.getNumDocsInRAM() - dwptNumDocs);
    }

    final boolean isUpdate = delTerm != null;
    flushingDWPT = flushControl.doAfterDocument(perThread, isUpdate);

    assert seqNo > perThread.lastSeqNo: "seqNo=" + seqNo + " lastSeqNo=" + perThread.lastSeqNo;
    perThread.lastSeqNo = seqNo;
  } finally {
    perThreadPool.release(perThread);
  }

  if (postUpdate(flushingDWPT, hasEvents)) {
    seqNo = -seqNo;
  }

  return seqNo;
}
The first call here is preUpdate; judging by the name, it does some preparation before the insert. Let's step inside and see what it actually does:
private boolean preUpdate() throws IOException, AbortingException {
  ensureOpen();
  boolean hasEvents = false;

  if (flushControl.anyStalledThreads() || flushControl.numQueuedFlushes() > 0) {
    // Help out flushing any queued DWPTs so we can un-stall:
    do {
      // Try pick up pending threads here if possible
      DocumentsWriterPerThread flushingDWPT;
      while ((flushingDWPT = flushControl.nextPendingFlush()) != null) {
        // Don't push the delete here since the update could fail!
        hasEvents |= doFlush(flushingDWPT);
      }

      flushControl.waitIfStalled(); // block if stalled
    } while (flushControl.numQueuedFlushes() != 0); // still queued DWPTs try help flushing
  }
  return hasEvents;
}
1. It confirms again that the index is open.
2. If there is data that needs flushing, it performs the flush via doFlush.
A brief look at flush
It is worth briefly explaining Lucene's flush mechanism here. As noted in the previous article on index creation, only one IndexWriter instance can operate on a Lucene index at a time. To get good write throughput, though, concurrent writes from multiple threads are a must, so IndexWriter maintains a pool, perThreadPool, holding multiple per-thread writer states (perThread). A perThread does not write data straight to disk either, since frequent I/O would likewise hurt performance: it first buffers data in memory, and once the buffered data reaches a certain threshold it flushes that data to disk. This flushing is entirely the responsibility of flushControl. In the code above, the condition flushControl.anyStalledThreads() || flushControl.numQueuedFlushes() > 0 means: when threads are stalled because the in-memory data has hit its limit, or there are DWPTs queued for flushing, help carry out the flush work. This is only a rough outline of Lucene's flush mechanism; a dedicated article will cover the details.
In short, preUpdate flushes whatever data is already due for a flush.
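For reference, the in-memory thresholds that trigger flushing are configured on IndexWriterConfig. A minimal sketch with illustrative values (config is the IndexWriterConfig from the usage example at the top):

// Flush once the buffered documents take up roughly 64 MB of RAM ...
config.setRAMBufferSizeMB(64.0);
// ... and do not also trigger flushes by document count (a concrete count could be set instead).
config.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);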
After preUpdate, a writer state has to be obtained from perThreadPool to write the document:
final ThreadState perThread = flushControl.obtainAndLock();
ThreadState obtainAndLock() {
  final ThreadState perThread = perThreadPool.getAndLock(Thread
      .currentThread(), documentsWriter);
  boolean success = false;
  try {
    if (perThread.isInitialized() && perThread.dwpt.deleteQueue != documentsWriter.deleteQueue) {
      // There is a flush-all in process and this DWPT is
      // now stale -- enroll it for flush and try for
      // another DWPT:
      addFlushableState(perThread);
    }
    success = true;
    // simply return the ThreadState even in a flush all case since we already hold the lock
    return perThread;
  } finally {
    if (!success) { // make sure we unlock if this fails
      perThreadPool.release(perThread);
    }
  }
}
As the code above shows, what we get from perThreadPool is not the DocumentsWriterPerThread one might have expected, but a ThreadState.
final static class ThreadState extends ReentrantLock {
  DocumentsWriterPerThread dwpt;
  // TODO this should really be part of DocumentsWriterFlushControl
  // write access guarded by DocumentsWriterFlushControl
  volatile boolean flushPending = false;
  // TODO this should really be part of DocumentsWriterFlushControl
  // write access guarded by DocumentsWriterFlushControl
  long bytesUsed = 0;
  // set by DocumentsWriter after each indexing op finishes
  volatile long lastSeqNo;
  // ... remaining fields and methods omitted
}
ThreadState is thus a wrapper around DocumentsWriterPerThread: its flushPending flag records whether this perThread currently has a flush pending, and bytesUsed records how much data the thread is holding, information that helps flushControl manage flushing across threads. In addition, ThreadState extends ReentrantLock and relies on the lock to guarantee thread safety.
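To make the locking idea concrete, here is a small self-contained sketch of the obtain/release pattern (class and method names are made up for illustration; this is not Lucene's pool implementation). The state object is itself the lock, so whoever obtains it has exclusive access to the wrapped data until it is released:

import java.util.concurrent.locks.ReentrantLock;

// Illustrative only -- a ThreadState-like wrapper that *is* a lock.
class StateLock extends ReentrantLock {
  long bytesUsed;       // how much data this writer state currently buffers
  boolean flushPending; // set when this state has been marked for flushing
}

class PoolSketch {
  private final StateLock[] states = { new StateLock(), new StateLock() };

  StateLock obtainAndLock() {
    for (StateLock s : states) {
      if (s.tryLock()) {      // take the first free state
        return s;
      }
    }
    StateLock s = states[0];  // all busy: block on one of them
    s.lock();
    return s;
  }

  void release(StateLock s) {
    s.unlock();               // callers invoke this from a finally block
  }
}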
After obtaining the perThread, its DocumentsWriterPerThread instance dwpt is taken out and dwpt.updateDocument is called to insert the data.
Next, flushingDWPT = flushControl.doAfterDocument(perThread, isUpdate) is called. We already know flushControl manages flushing, so this method is presumably flush-related as well. Let's step into it to confirm:
synchronized DocumentsWriterPerThread doAfterDocument(ThreadState perThread, boolean isUpdate) {
  try {
    commitPerThreadBytes(perThread);
    if (!perThread.flushPending) {
      if (isUpdate) {
        flushPolicy.onUpdate(this, perThread);
      } else {
        flushPolicy.onInsert(this, perThread);
      }

      if (!perThread.flushPending && perThread.bytesUsed > hardMaxBytesPerDWPT) {
        // Safety check to prevent a single DWPT exceeding its RAM limit. This
        // is super important since we can not address more than 2048 MB per DWPT
        setFlushPending(perThread);
      }
    }
    final DocumentsWriterPerThread flushingDWPT;
    if (fullFlush) {
      if (perThread.flushPending) {
        checkoutAndBlock(perThread);
        flushingDWPT = nextPendingFlush();
      } else {
        flushingDWPT = null;
      }
    } else {
      flushingDWPT = tryCheckoutForFlush(perThread);
    }

    return flushingDWPT;
  } finally {
    boolean stalled = updateStallState();
    assert assertNumDocsSinceStalled(stalled) && assertMemory();
  }
}
commitPerThreadBytes(perThread) updates the amount of data held by the perThread; since a document has just been inserted, its bytesUsed should grow. Next the flush policy is consulted (onUpdate or onInsert), which may mark the perThread as flush-pending. Finally, tryCheckoutForFlush checks whether the pending perThread can be checked out for flushing; if so, it is returned as the DWPT to flush.
postUpdate(flushingDWPT, hasEvents) is then where that flush is actually carried out, so flush handling follows every document insert.
Now let's keep drilling into the add-document operation and see how org.apache.lucene.index.DocumentsWriterPerThread#updateDocument inserts the document:
public long updateDocument(Iterable<? extends IndexableField> doc, Analyzer analyzer, Term delTerm) throws IOException, AbortingException {
  testPoint("DocumentsWriterPerThread addDocument start");
  assert deleteQueue != null;
  reserveOneDoc();
  docState.doc = doc;
  docState.analyzer = analyzer;
  docState.docID = numDocsInRAM;
  if (INFO_VERBOSE && infoStream.isEnabled("DWPT")) {
    infoStream.message("DWPT", Thread.currentThread().getName() + " update delTerm=" + delTerm + " docID=" + docState.docID + " seg=" + segmentInfo.name);
  }
  // Even on exception, the document is still added (but marked
  // deleted), so we don't need to un-reserve at that point.
  // Aborting exceptions will actually "lose" more than one
  // document, so the counter will be "wrong" in that case, but
  // it's very hard to fix (we can't easily distinguish aborting
  // vs non-aborting exceptions):
  boolean success = false;
  try {
    try {
      consumer.processDocument();
    } finally {
      docState.clear();
    }
    success = true;
  } finally {
    if (!success) {
      // mark document as deleted
      deleteDocID(docState.docID);
      numDocsInRAM++;
    }
  }
  return finishDocument(delTerm);
}
This method stores the document content, the analyzer and the document ID into docState, so that consumer.processDocument() can retrieve the document information from docState. The consumer is dedicated to indexing fields: since a Lucene document is made up of fields, indexing a document breaks down into indexing the fields it contains.
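As a reminder of what those fields look like from the caller's side, a small illustration (field names and values are made up):

Document doc = new Document();
doc.add(new StringField("id", "42", Field.Store.YES));                // indexed but not tokenized
doc.add(new TextField("title", "Lucene in Action", Field.Store.YES)); // tokenized by the analyzer
doc.add(new StoredField("price", 35.99f));                            // stored only, not indexed

processDocument below iterates over exactly such a collection of IndexableFields: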
public void processDocument() throws IOException, AbortingException {

  // How many indexed field names we've seen (collapses
  // multiple field instances by the same name):
  int fieldCount = 0;

  long fieldGen = nextFieldGen++;

  // NOTE: we need two passes here, in case there are
  // multi-valued fields, because we must process all
  // instances of a given field at once, since the
  // analyzer is free to reuse TokenStream across fields
  // (i.e., we cannot have more than one TokenStream
  // running "at once"):

  termsHash.startDocument();

  startStoredFields(docState.docID);

  boolean aborting = false;
  try {
    for (IndexableField field : docState.doc) {
      fieldCount = processField(field, fieldGen, fieldCount);
    }
  } catch (AbortingException ae) {
    aborting = true;
    throw ae;
  } finally {
    if (aborting == false) {
      // Finish each indexed field name seen in the document:
      for (int i = 0; i < fieldCount; i++) {
        fields[i].finish();
      }
      finishStoredFields();
    }
  }

  try {
    termsHash.finishDocument();
  } catch (Throwable th) {
    // Must abort, on the possibility that on-disk term
    // vectors are now corrupt:
    throw AbortingException.wrap(th);
  }
}
In this method Lucene breaks the document down into its fields and processes them in a loop:
for (IndexableField field : docState.doc) {
  fieldCount = processField(field, fieldGen, fieldCount);
}
processField covers analyzing the token stream and building the inverted index. That is a lot of ground, so we will analyze it in a separate chapter.
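One detail worth highlighting from the comment in processDocument: a field name may occur several times in one document (a multi-valued field), and all instances of that name must be processed together because the analyzer is free to reuse a single TokenStream. An illustration of such a document (field name and values are made up):

Document doc = new Document();
doc.add(new TextField("tag", "search", Field.Store.NO));
doc.add(new TextField("tag", "lucene", Field.Store.NO)); // same field name, second value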
Summary
This article walked through the process of inserting a document in Lucene, covering three main points:
- A document is first written to memory; once the documents buffered in memory reach a certain threshold, a flush is started to move the data from memory to disk.
- Document insertion supports multiple threads (see the sketch below).
- Inserting a document ultimately breaks down into inserting its fields.
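As a minimal sketch of the second point (thread count, field names and document count are illustrative; writer is an already opened IndexWriter as in the usage example at the top, and the same imports plus java.util.concurrent are assumed):

ExecutorService pool = Executors.newFixedThreadPool(4);
for (int i = 0; i < 1000; i++) {
  final int id = i;
  pool.submit(() -> {
    Document doc = new Document();
    doc.add(new StringField("id", Integer.toString(id), Field.Store.YES));
    doc.add(new TextField("body", "document body " + id, Field.Store.NO));
    try {
      writer.addDocument(doc); // IndexWriter is thread-safe; each indexing thread gets its own DWPT internally
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  });
}
pool.shutdown();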