The earlier posts in this series analyzed the Solr side of index processing; the component that actually works with the index files is Lucene, so below we walk through the indexing flow based on Lucene 5.3.1.
Before starting, allow me to borrow a figure showing the flow of the Lucene indexing chain.
We normally write documents into an index through IndexWriter:
indexWriter.addDocument(doc1);
or, for several documents at once:
indexWriter.addDocuments(docs);
A minimal, self-contained usage example is sketched below.
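For readers who want to try this themselves, here is a minimal sketch of opening a writer and adding one document. It is only illustrative: the field names ("id", "body") and the in-memory RAMDirectory are choices made for this example, not anything the post prescribes.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class AddDocumentExample {
  public static void main(String[] args) throws Exception {
    Directory dir = new RAMDirectory();                      // in-memory directory, just for the demo
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    try (IndexWriter indexWriter = new IndexWriter(dir, config)) {
      Document doc1 = new Document();
      doc1.add(new StringField("id", "1", Field.Store.YES)); // not tokenized, stored as-is
      doc1.add(new TextField("body", "hello lucene indexing chain", Field.Store.NO)); // tokenized and inverted
      indexWriter.addDocument(doc1);                         // the call analyzed in this post
      indexWriter.commit();
    }
  }
}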
addDocument in turn simply calls updateDocument with a null term, so the method we really need to look at is updateDocument:
public void updateDocument(Term term, Iterable<? extends IndexableField> doc) throws IOException {
  ensureOpen(); // make sure the IndexWriter is still open
  try {
    boolean success = false;
    try {
      if (docWriter.updateDocument(doc, analyzer, term)) { // delegate the update to the DocumentsWriter
        processEvents(true, false);
      }
      success = true;
    } finally {
      if (!success) {
        if (infoStream.isEnabled("IW")) {
          infoStream.message("IW", "hit exception updating document");
        }
      }
    }
  } catch (AbortingException | OutOfMemoryError tragedy) {
    tragicEvent(tragedy, "updateDocument");
  }
}
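As a side note, the term parameter is what makes this an "update": when it is non-null, any documents matching the term are deleted and the new document is added in one atomic step; when it is null the call degenerates to a plain add. A hedged usage sketch, continuing the earlier example inside the same try-with-resources block (the "id" field is again just an illustrative unique-key field, and org.apache.lucene.index.Term must be imported in addition to the classes above):

Document newDoc = new Document();
newDoc.add(new StringField("id", "1", Field.Store.YES));
newDoc.add(new TextField("body", "updated body text", Field.Store.NO));
indexWriter.updateDocument(new Term("id", "1"), newDoc); // delete-by-term plus add, atomically
indexWriter.commit();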
As you can see, this step delegates to DocumentsWriter.updateDocument:
boolean updateDocument(final Iterable<? extends IndexableField> doc, final Analyzer analyzer,
    final Term delTerm) throws IOException, AbortingException {
  boolean hasEvents = preUpdate(); // first apply pending work, e.g. queued deletes/updates
  final ThreadState perThread = flushControl.obtainAndLock(); // obtain a ThreadState from the pool
  final DocumentsWriterPerThread flushingDWPT;
  try {
    // This must happen after we've pulled the ThreadState because IW.close
    // waits for all ThreadStates to be released:
    ensureOpen();
    ensureInitialized(perThread); // make sure the ThreadState's dwpt is initialized
    assert perThread.isInitialized();
    final DocumentsWriterPerThread dwpt = perThread.dwpt;
    final int dwptNumDocs = dwpt.getNumDocsInRAM();
    try {
      dwpt.updateDocument(doc, analyzer, delTerm); // let the dwpt update the document
    } catch (AbortingException ae) {
      flushControl.doOnAbort(perThread);
      dwpt.abort();
      throw ae;
    } finally {
      // We don't know whether the document actually
      // counted as being indexed, so we must subtract here to
      // accumulate our separate counter:
      numDocsInRAM.addAndGet(dwpt.getNumDocsInRAM() - dwptNumDocs);
    }
    final boolean isUpdate = delTerm != null;
    flushingDWPT = flushControl.doAfterDocument(perThread, isUpdate);
  } finally {
    perThreadPool.release(perThread);
  }
  return postUpdate(flushingDWPT, hasEvents); // post-processing, e.g. triggering a flush if needed
}
The important piece in the code above is ThreadState; Lucene's own javadoc describes it as follows:
/**
* {@link ThreadState} references and guards a
* {@link DocumentsWriterPerThread} instance that is used during indexing to
* build a in-memory index segment. {@link ThreadState} also holds all flush
* related per-thread data controlled by {@link DocumentsWriterFlushControl}.
* <p>
* A {@link ThreadState}, its methods and members should only accessed by one
* thread a time. Users must acquire the lock via {@link ThreadState#lock()}
* and release the lock in a finally block via {@link ThreadState#unlock()}
* before accessing the state.
*/
Roughly speaking, the DocumentsWriterPerThread (dwpt) guarded by a ThreadState is what builds the in-memory index segment while documents are indexed, and the ThreadState also carries the per-thread flush-related state managed by DocumentsWriterFlushControl.
A ThreadState is obtained by looking at the pool's freeList: if the freeList is empty a new ThreadState is created and returned, otherwise one is taken from the list; when the caller is finished it is handed back to the freeList. Early Lucene versions capped the pool at 8 ThreadStates by default; there is no such fixed default any more. The pooling idea is sketched below.
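The following is a minimal, self-contained sketch of that pooling idea, not the real DocumentsWriterPerThreadPool code; every class and method name in it is made up for illustration, and the real implementation carries much more bookkeeping (flush control, aborted states, and so on).

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.ReentrantLock;

final class ThreadStatePoolSketch {

  // Stands in for Lucene's ThreadState: a lock plus the per-thread writer it guards.
  static final class State extends ReentrantLock {
    Object dwpt; // in Lucene this would be the DocumentsWriterPerThread
  }

  private final Deque<State> freeList = new ArrayDeque<>();

  // Mirrors the obtainAndLock() idea: reuse a pooled State if one is free, else create one.
  synchronized State obtainAndLock() {
    State state = freeList.isEmpty() ? new State() : freeList.pop();
    state.lock(); // the caller indexes while holding this lock
    return state;
  }

  // Mirrors release(): unlock and hand the State back for reuse.
  synchronized void release(State state) {
    state.unlock();
    freeList.push(state);
  }
}

Once a ThreadState has been obtained, DocumentsWriter uses its dwpt to update the document; DocumentsWriterPerThread.updateDocument looks like this: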
public void updateDocument(Iterable<? extends IndexableField> doc, Analyzer analyzer, Term delTerm) throws IOException, AbortingException {
  testPoint("DocumentsWriterPerThread addDocument start");
  assert deleteQueue != null;
  reserveOneDoc();
  docState.doc = doc;
  docState.analyzer = analyzer;
  docState.docID = numDocsInRAM;
  if (INFO_VERBOSE && infoStream.isEnabled("DWPT")) {
    infoStream.message("DWPT", Thread.currentThread().getName() + " update delTerm=" + delTerm + " docID=" + docState.docID + " seg=" + segmentInfo.name);
  }
  // Even on exception, the document is still added (but marked
  // deleted), so we don't need to un-reserve at that point.
  // Aborting exceptions will actually "lose" more than one
  // document, so the counter will be "wrong" in that case, but
  // it's very hard to fix (we can't easily distinguish aborting
  // vs non-aborting exceptions):
  boolean success = false;
  try {
    try {
      consumer.processDocument(); // hand the document to the indexing chain (DefaultIndexingChain)
    } finally {
      docState.clear();
    }
    success = true;
  } finally {
    if (!success) {
      // mark document as deleted
      deleteDocID(docState.docID);
      numDocsInRAM++;
    }
  }
  finishDocument(delTerm);
}
The consumer.processDocument() call above actually invokes DefaultIndexingChain.processDocument:

public void processDocument() throws IOException, AbortingException {
  // How many indexed field names we've seen (collapses
  // multiple field instances by the same name):
  int fieldCount = 0;
  long fieldGen = nextFieldGen++;
  // NOTE: we need two passes here, in case there are
  // multi-valued fields, because we must process all
  // instances of a given field at once, since the
  // analyzer is free to reuse TokenStream across fields
  // (i.e., we cannot have more than one TokenStream
  // running "at once"):
  termsHash.startDocument(); // preparation before processing: reset the per-document field state
  fillStoredFields(docState.docID);
  startStoredFields();
  boolean aborting = false;
  try {
    for (IndexableField field : docState.doc) {
      fieldCount = processField(field, fieldGen, fieldCount); // process each field of the document
    }
  } catch (AbortingException ae) {
    aborting = true;
    throw ae;
  } finally {
    if (aborting == false) {
      // Finish each indexed field name seen in the document:
      for (int i = 0; i < fieldCount; i++) {
        fields[i].finish();
      }
      finishStoredFields();
    }
  }
  try {
    termsHash.finishDocument();
  } catch (Throwable th) {
    // Must abort, on the possibility that on-disk term
    // vectors are now corrupt:
    throw AbortingException.wrap(th);
  }
}
private int processField(IndexableField field, long fieldGen, int fieldCount) throws IOException, AbortingException {
  String fieldName = field.name();
  IndexableFieldType fieldType = field.fieldType();
  PerField fp = null;
  if (fieldType.indexOptions() == null) {
    throw new NullPointerException("IndexOptions must not be null (field: \"" + field.name() + "\")");
  }
  // Invert indexed fields:
  if (fieldType.indexOptions() != IndexOptions.NONE) {
    // if the field omits norms, the boost cannot be indexed.
    if (fieldType.omitNorms() && field.boost() != 1.0f) {
      throw new UnsupportedOperationException("You cannot set an index-time boost: norms are omitted for field '"
          + field.name() + "'");
    }
    fp = getOrAddField(fieldName, fieldType, true); // get the PerField; each dwpt keeps exactly one per field name, used to build the in-memory index for that field
    boolean first = fp.fieldGen != fieldGen;
    fp.invert(field, first); // run the analyzer to produce the token stream and build the inverted structures for this field
    if (first) {
      fields[fieldCount++] = fp;
      fp.fieldGen = fieldGen;
    }
  } else {
    verifyUnIndexedFieldType(fieldName, fieldType);
  }
  // Add stored fields:
  if (fieldType.stored()) {
    if (fp == null) {
      fp = getOrAddField(fieldName, fieldType, false);
    }
    if (fieldType.stored()) {
      try {
        storedFieldsWriter.writeField(fp.fieldInfo, field); // stored field: write its value into the stored-fields (document) file
      } catch (Throwable th) {
        throw AbortingException.wrap(th);
      }
    }
  }
  DocValuesType dvType = fieldType.docValuesType(); // handle doc values
  if (dvType == null) {
    throw new NullPointerException("docValuesType cannot be null (field: \"" + fieldName + "\")");
  }
  if (dvType != DocValuesType.NONE) {
    if (fp == null) {
      fp = getOrAddField(fieldName, fieldType, false);
    }
    indexDocValue(fp, dvType, field);
  }
  return fieldCount;
}
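To make the three branches concrete, here is an illustrative document (field names invented for the example; it additionally needs org.apache.lucene.document.StoredField and NumericDocValuesField) and the branch of processField each field goes through:

Document doc = new Document();
// TextField: indexOptions != NONE, so it goes through fp.invert() (tokenized into the inverted index)
doc.add(new TextField("title", "lucene index chain", Field.Store.NO));
// StoredField: indexOptions == NONE but stored() == true, so only storedFieldsWriter.writeField() runs
doc.add(new StoredField("raw", "original json payload"));
// NumericDocValuesField: docValuesType != NONE, so indexDocValue() is called for it
doc.add(new NumericDocValuesField("price", 42L));
// A second value for "title": its PerField already carries the current fieldGen, so first == false,
// it is inverted into the same PerField and not added to fields[] again
doc.add(new TextField("title", "second value of a multi-valued field", Field.Store.NO));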
fp.invert() is where most of the index structures are produced; they are kept in the in-memory buffer and only written to disk later by a flush(). We will look at those structures in detail in later posts, as this series on index creation will span several articles. The above covers the basic indexing chain; corrections are welcome.