Lucene Posting List Merging Algorithms: Union

License: CC 4.0 BY-SA

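This article looks at how Lucene implements the union operation when merging posting lists. Using a queue data structure, ScorerDocQueue, Lucene steps through the documents and finds those that satisfy the minimum number of matches. The core method, advanceAfterCurrent(), relies on heap adjustment to keep the smallest document id at the top of the queue, so the union can be traversed efficiently. Understanding this process is important for understanding Lucene's query optimization.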

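The previous article covered the intersection algorithm Lucene uses when merging posting lists. This article continues with the algorithm for computing the union of posting lists.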

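For the intersection, Lucene uses an array of posting lists as its data structure; each element of the array is an iterator that represents one posting list.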

For the union, by contrast, it uses a queue. The queue is initialized in the constructor of the DisjunctionSumScorer class:

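    Iterator si = subScorers.iterator();
    scorerDocQueue = new ScorerDocQueue(nrScorers);
    while (si.hasNext()) {
        Scorer se = (Scorer) si.next();
        if (se.nextDoc() != NO_MORE_DOCS) { // doc() method will be used in scorerDocQueue.
            scorerDocQueue.insert(se);
        }
    }

The code above adds the posting lists to the queue one by one, skipping any list that has no documents at all. ScorerDocQueue is not a plain FIFO queue: it is a priority queue backed by a binary min-heap ordered by each scorer's current document id, so the scorer positioned on the smallest document is always at the top.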
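The initialization can be tried out without Lucene. The following is only a minimal sketch: `PostingCursor` is a hypothetical stand-in for a Lucene `Scorer`, and `java.util.PriorityQueue` is swapped in for `ScorerDocQueue`. It assumes each posting list is a sorted array of document ids.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class PostingQueueSketch {
    // Same sentinel idea as Lucene's NO_MORE_DOCS.
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical stand-in for a Scorer: iterates one sorted posting list. */
    static final class PostingCursor {
        private final int[] docs;
        private int pos = -1; // not positioned until nextDoc() is called

        PostingCursor(int... docs) {
            this.docs = docs;
        }

        /** Moves to the next document and returns its id, or NO_MORE_DOCS. */
        int nextDoc() {
            pos++;
            return docID();
        }

        /** Current document id; -1 before the first nextDoc() call. */
        int docID() {
            if (pos < 0) {
                return -1;
            }
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }
    }

    /** Positions every cursor on its first document and queues the non-empty ones. */
    static PriorityQueue<PostingCursor> buildQueue(List<PostingCursor> cursors) {
        Comparator<PostingCursor> byDoc = Comparator.comparingInt(PostingCursor::docID);
        PriorityQueue<PostingCursor> queue =
                new PriorityQueue<>(Math.max(1, cursors.size()), byDoc);
        for (PostingCursor cursor : cursors) {
            if (cursor.nextDoc() != NO_MORE_DOCS) { // skip empty posting lists
                queue.add(cursor);
            }
        }
        return queue;
    }

    public static void main(String[] args) {
        PriorityQueue<PostingCursor> queue = buildQueue(Arrays.asList(
                new PostingCursor(5, 9), new PostingCursor(), new PostingCursor(2, 5)));
        // The empty list is dropped; the cursor on doc 2 sits at the top.
        System.out.println(queue.size() + " " + queue.peek().docID()); // prints "2 2"
    }
}
```

Calling `nextDoc()` before the insert matters: the queue orders entries by their current document id, so a cursor must already be positioned on its first document when it is added.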

At query time, when BooleanScorer2 calls nextDoc(), the call ends up in a very important method of the DisjunctionSumScorer class, advanceAfterCurrent():

    do { // repeat until minimum nr of matchers
        currentDoc = scorerDocQueue.topDoc();
        currentScore = scorerDocQueue.topScore();
        nrMatchers = 1;
        do { // Until all subscorers are after currentDoc
            if (!scorerDocQueue.topNextAndAdjustElsePop()) {
                if (scorerDocQueue.size() == 0) {
                    break; // nothing more to advance, check for last match.
                }
            }
            if (scorerDocQueue.topDoc() != currentDoc) {
                break; // All remaining subscorers are after currentDoc.
            }
            currentScore += scorerDocQueue.topScore();
            nrMatchers++;
        } while (true);

        if (nrMatchers >= minimumNrMatchers) {
            return true;
        } else if (scorerDocQueue.size() < minimumNrMatchers) {
            return false;
        }
    } while (true);

The code above uses two nested do...while loops and ends up on a document that satisfies at least N of the conditions (minimumNrMatchers). The important part is the topNextAndAdjustElsePop() method, which mainly does the following work:

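    int i = 1;
    HeapedScorerDoc node = heap[i]; // save top node
    int j = i << 1; // find smaller child
    int k = j + 1;
    if ((k <= size) && (heap[k].doc < heap[j].doc)) {
        j = k;
    }
    while ((j <= size) && (heap[j].doc < node.doc)) {
        heap[i] = heap[j]; // shift up child
        i = j;
        j = i << 1;
        k = j + 1;
        if (k <= size && (heap[k].doc < heap[j].doc)) {
            j = k;
        }
    }
    heap[i] = node; // install saved node
    topHSD = heap[1];

This code restores heap order after the top entry has changed: the saved top node is sifted down until the entry with the smallest document id is back at heap[1]. It is the sift-down step of a binary min-heap, the same primitive that heap sort is built on, but it is not a full sort. A single call costs O(log n) time for a queue of n scorers and O(1) extra space. The O(n log n) figure applies only to sorting all n elements.

When the document at the top of the queue differs from the current document, every posting list that contains the current document has been counted. If the number of posting lists containing that document meets the minimum requirement, the document is a member of the union.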
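To see the logarithmic cost concretely, the same sift-down can be run on a plain array of document ids. This is an illustrative sketch only: the class and method names are made up for this example, the heap stores bare `int` ids instead of `HeapedScorerDoc` objects, and the returned level count exists purely for demonstration.

```java
import java.util.Arrays;

class DownHeapSketch {
    /**
     * Sifts the root of a 1-based binary min-heap down to its place.
     * heap[1..size] holds document ids; heap[0] is unused.
     * Returns how many levels the root moved down.
     */
    static int downHeap(int[] heap, int size) {
        if (size < 1) {
            return 0; // empty heap, nothing to adjust
        }
        int i = 1;
        int node = heap[i]; // save top node
        int moves = 0;
        int j = i << 1; // left child
        if (j + 1 <= size && heap[j + 1] < heap[j]) {
            j++; // right child is smaller
        }
        while (j <= size && heap[j] < node) {
            heap[i] = heap[j]; // shift the smaller child up
            i = j;
            j = i << 1;
            if (j + 1 <= size && heap[j + 1] < heap[j]) {
                j++;
            }
            moves++;
        }
        heap[i] = node; // install saved node
        return moves;
    }

    public static void main(String[] args) {
        // The top scorer has just advanced from a small doc id to 12,
        // so the root violates heap order; everything below it is still a valid heap.
        int[] heap = {0, 12, 3, 5, 7, 4, 9, 8};
        int moves = downHeap(heap, 7);
        System.out.println(moves + " " + Arrays.toString(heap));
        // prints "2 [0, 3, 4, 5, 7, 12, 9, 8]"
    }
}
```

With 7 entries the heap has 3 levels, so the root can move down at most 2 levels, which is where the O(log n) bound per adjustment comes from.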

Once nextDoc() has moved past the last doc, the union operation is finished.
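The whole traversal, from the first document to the end of the union, can be sketched in a few lines. This is not Lucene's code: it is a simplified illustration that uses `java.util.PriorityQueue` in place of `ScorerDocQueue`, ignores scoring, and assumes every posting list is a strictly increasing array of document ids. The name `minMatch` plays the role of `minimumNrMatchers`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class MinMatchUnionSketch {
    /** Returns the doc ids that occur in at least minMatch of the sorted posting lists. */
    static List<Integer> union(int[][] postings, int minMatch) {
        // Each queue entry is {listIndex, position}, ordered by the doc id it points at.
        Comparator<int[]> byDoc = Comparator.comparingInt(e -> postings[e[0]][e[1]]);
        PriorityQueue<int[]> queue = new PriorityQueue<>(Math.max(1, postings.length), byDoc);
        for (int i = 0; i < postings.length; i++) {
            if (postings[i].length > 0) {
                queue.add(new int[] {i, 0});
            }
        }

        List<Integer> result = new ArrayList<>();
        // Stop early once fewer lists remain than are required to match.
        while (!queue.isEmpty() && queue.size() >= minMatch) {
            int[] first = queue.peek();
            int currentDoc = postings[first[0]][first[1]];
            int matchers = 0;
            // Advance every list that is currently positioned on currentDoc.
            while (!queue.isEmpty()
                    && postings[queue.peek()[0]][queue.peek()[1]] == currentDoc) {
                int[] top = queue.poll();
                matchers++;
                top[1]++; // move this list to its next doc
                if (top[1] < postings[top[0]].length) {
                    queue.add(top); // re-insert, O(log k) for k lists
                }
            }
            if (matchers >= minMatch) {
                result.add(currentDoc);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] postings = {{1, 3, 5, 7}, {3, 4, 7}, {2, 3, 7, 9}};
        System.out.println(union(postings, 1)); // prints "[1, 2, 3, 4, 5, 7, 9]"
        System.out.println(union(postings, 2)); // prints "[3, 7]"
    }
}
```

With `minMatch` set to 1 this is an ordinary union. Larger values keep only documents shared by that many lists, and the early exit mirrors the `scorerDocQueue.size() < minimumNrMatchers` check in advanceAfterCurrent().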

Note: the set difference is comparatively simple, so it is not covered here.