Lucene Retrieval Scoring Model

iterate7

License: CC 4.0 BY-SA

Categories: Algorithms, Distributed Search. Tags: lucene, vsm, Boolean model, search, scoring formula

Article link: https://blog.youkuaiyun.com/iterate7/article/details/79348412

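This article introduces Lucene's retrieval scoring model, which combines the Boolean model with the vector space model. It walks through the scoring formula, including the VSM score, coord, queryNorm, tf, idf, t.getBoost and norm, and uses examples to explain the role and computation of each part.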

Scoring mechanism and ideas

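A search generally consists of retrieving docs from the collection with a query and then ranking them.
At its core this is a ranking problem. Retrieval itself is comparatively simple: with an inverted index, a term leads directly to the documents that contain it (the most basic approach).
Lucene uses two models: the Boolean model and the vector space model. The [Boolean model](http://blog.youkuaiyun.com/iterate7/article/details/77206613) is responsible for retrieving the matching documents; the vector space model is responsible for scoring and ranking them.
In the vector space model, the query and each doc are mapped to vectors, usually term vectors, with tf-idf as the weights, so that they can be compared and ranked in the same feature space.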

Lucene combines Boolean model (BM) of Information Retrieval with Vector Space Model (VSM) of Information Retrieval - documents “approved” by BM are scored by VSM.

Scoring formulas

VSM scoring formula

$\text{cosine-similarity} = \frac{V(q)\cdot V(d)}{|V(q)|\,|V(d)|}$
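
To make the cosine formula concrete, here is a minimal Java sketch that treats V(q) and V(d) as sparse maps from term to weight. The class name `CosineSketch`, the map representation and the sample weights are assumptions made for illustration; Lucene does not expose its vectors this way.

```java
import java.util.Map;

final class CosineSketch {
    // Dot product over the terms that both sparse vectors contain.
    static double dot(Map<String, Double> a, Map<String, Double> b) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                sum += e.getValue() * w;
            }
        }
        return sum;
    }

    // Euclidean length |V| of a sparse vector.
    static double length(Map<String, Double> v) {
        return Math.sqrt(dot(v, v));
    }

    // V(q) . V(d) / (|V(q)| |V(d)|); returns 0 when either vector is empty.
    static double cosine(Map<String, Double> q, Map<String, Double> d) {
        double denom = length(q) * length(d);
        return denom == 0.0 ? 0.0 : dot(q, d) / denom;
    }

    public static void main(String[] args) {
        // Illustrative weights only; real weights would be tf-idf values.
        Map<String, Double> q = Map.of("chinese", 1.0, "book", 1.0);
        Map<String, Double> d = Map.of("book", 2.0, "about", 1.0);
        System.out.println(cosine(q, d)); // 2 / (sqrt(2) * sqrt(5))
    }
}
```

Only the shared term "book" contributes to the dot product, which is why documents that share no term with the query score 0.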

Lucene conceptual scoring formula

$\text{score}(q,d) = \text{coord-factor}(q,d) \cdot \text{query-boost}(q) \cdot \frac{V(q)\cdot V(d)}{|V(q)|} \cdot \text{doc-len-norm}(d) \cdot \text{doc-boost}(d)$

Lucene practical scoring formula

$\text{score}(q,d) = \text{coord-factor}(q,d) \cdot \text{queryNorm}(q) \cdot \sum_{t~in~q} \left( \text{tf}(t~in~d) \cdot \text{idf}(t)^2 \cdot t.\text{getBoost}() \cdot \text{norm}(t,d) \right)$
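
The practical formula can be read as a loop over the query terms. The sketch below is a plain transcription of it, not Lucene's implementation: `PracticalScoreSketch` and `TermStats` are hypothetical names, every factor is passed in by the caller, and tf is taken as the square root of the frequency, as this article describes for the default similarity.

```java
final class PracticalScoreSketch {
    /** Per-term inputs; in this sketch the caller supplies all of them. */
    static final class TermStats {
        final float freq;   // occurrences of the term in the document
        final float idf;    // inverse document frequency of the term
        final float boost;  // query-time boost, t.getBoost()
        final float norm;   // norm(t,d) as decoded from the index

        TermStats(float freq, float idf, float boost, float norm) {
            this.freq = freq;
            this.idf = idf;
            this.boost = boost;
            this.norm = norm;
        }
    }

    // coord * queryNorm * sum over terms of tf * idf^2 * boost * norm
    static float score(float coord, float queryNorm, TermStats[] terms) {
        float sum = 0f;
        for (TermStats t : terms) {
            float tf = (float) Math.sqrt(t.freq);
            sum += tf * t.idf * t.idf * t.boost * t.norm;
        }
        return coord * queryNorm * sum;
    }

    public static void main(String[] args) {
        // Single-term query: coord is 1 and queryNorm reduces to 1 / idf.
        float idf = 1.4054651f;
        TermStats[] terms = { new TermStats(1f, idf, 1f, 0.625f) };
        System.out.println(score(1f, 1f / idf, terms));
    }
}
```

For a single unboosted term the queryNorm cancels one of the two idf factors, so the result is simply tf · idf · norm.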

coord(q,d)

The coordination factor: the more of the query terms a document contains, the better the match.

public float coord(int overlap, int maxOverlap) {
    return overlap / (float)maxOverlap;
}

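overlap: the number of query terms that the current document matches
maxOverlap: the total number of terms in the query
For example, search for "english book" against the document "this is a chinese book".
The overlap for this document is 1 (only book matches, so one condition is satisfied), while maxOverlap is 2 (the query has two terms, book and english).
The resulting coord value for this query and document is therefore 0.5.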
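
The "english book" example can be checked with a few lines of Java. This is a sketch under simplifying assumptions: `CoordSketch` and its two-string overload are hypothetical, and terms are obtained by lowercasing and splitting on whitespace rather than by a real Lucene analyzer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class CoordSketch {
    // Same arithmetic as the coord method shown above.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // Hypothetical helper: counts how many distinct query terms occur in the document.
    static float coord(String query, String doc) {
        Set<String> docTerms = new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\s+")));
        Set<String> queryTerms = new HashSet<>(Arrays.asList(query.toLowerCase().split("\\s+")));
        int overlap = 0;
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                overlap++;
            }
        }
        return coord(overlap, queryTerms.size());
    }

    public static void main(String[] args) {
        System.out.println(coord("english book", "this is a chinese book")); // 0.5
    }
}
```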

queryNorm(q)

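Query normalization. It normalizes the query's term weights only and does not affect document ranking, because every document matched by the same query is multiplied by the same factor. Its purpose is to make scores from different queries comparable.
The formula is: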

$\text{queryNorm}(q) = \text{queryNorm}(sumOfSquaredWeights) = \frac{1}{\sqrt{q.\text{getBoost}()^2 \cdot \sum_{t~in~q} \left( \text{idf}(t) \cdot t.\text{getBoost}() \right)^2}}$
Code:

public float queryNorm(float sumOfSquaredWeights) {
    return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
}
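
The code above only takes the final square root; the argument sumOfSquaredWeights is the expression under the root in the formula. A minimal sketch of how the two pieces fit together follows. The class `QueryNormSketch` and the array-based signature are assumptions for illustration, not Lucene's API.

```java
final class QueryNormSketch {
    // q.getBoost()^2 * sum over query terms of (idf(t) * t.getBoost())^2
    static float sumOfSquaredWeights(float queryBoost, float[] idf, float[] termBoost) {
        float sum = 0f;
        for (int i = 0; i < idf.length; i++) {
            float weight = idf[i] * termBoost[i];
            sum += weight * weight;
        }
        return queryBoost * queryBoost * sum;
    }

    // Same arithmetic as the queryNorm method shown above.
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    public static void main(String[] args) {
        // One unboosted term with idf = 1.4054651.
        float s = sumOfSquaredWeights(1f, new float[] {1.4054651f}, new float[] {1f});
        System.out.println(queryNorm(s));
    }
}
```

With a single unboosted term, queryNorm is just 1 / idf(t), which is why it has no effect on the relative order of the documents.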

tf(t in d)

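The number of times t occurs in d, with the square root taken, presumably to damp the growth of the score as the frequency rises.
For example, for the document "this is book about chinese book" and the search term "book", the term's freq in the document is 2, so the tf value is the square root of 2, i.e. 1.4142135.
Code: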

public float tf(float freq) {
    return (float)Math.sqrt(freq);
}

idf(t)

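Inverse document frequency, which measures how well a term discriminates between documents. A large value means the term separates documents well; a value at the low end of the range means nearly every document contains the term, so it has little discriminating power.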

public float idf(long docFreq, long numDocs) {
    return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);
}

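The two parameters are:

docFreq: the number of documents in which the term occurs, i.e. how many documents match the term.
numDocs: the total number of documents in the index.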

The formula is smoothed by adding 1 to docFreq:

$\text{idf}(t) = 1 + \log\left(\frac{numDocs}{docFreq+1}\right)$

For example, suppose there are three documents:

this book is about english
this book is about chinese
this book is about japan
The search term is "chinese". docFreq is 1, because only the second document contains the term, and numDocs is 3. The resulting idf value is:

(float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0) = ln(3/(1+1)) + 1 = ln(1.5) + 1 = 0.40546510810816 + 1 = 1.40546510810816
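
The tf and idf examples can be reproduced with the two methods exactly as this article lists them, wrapped in a small class. `TfIdfSketch` is a hypothetical name chosen for this sketch; the method bodies are copied from the snippets above.

```java
final class TfIdfSketch {
    // Square root of the raw term frequency.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // 1 + ln(numDocs / (docFreq + 1))
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        System.out.println(tf(2f));     // "book" occurs twice in the document
        System.out.println(idf(1, 3));  // "chinese" occurs in 1 of 3 documents
        System.out.println(idf(3, 3));  // "book" occurs in all 3 documents
    }
}
```

The last call shows the other end of the range: a term that occurs in every document gets a smaller idf than a term that occurs in only one.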

t.getBoost

The query-time boost of term t. It is simply a weighting factor: for example, to give matches on chinese more weight, set its boost to 2.

norm(t,d)

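This is the per-term normalization factor; its purpose is to rank shorter documents ahead of longer ones when they match equally well.
norm(t,d) = doc.getBoost() · lengthNorm · ∏ f.getBoost()
lengthNorm code: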

public float lengthNorm(FieldInvertState state) {
    final int numTerms;
    if (discountOverlaps)
        numTerms = state.getLength() - state.getNumOverlap();
    else
        numTerms = state.getLength();
    return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));
}

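doc.getBoost is the document boost;
f.getBoost is the field boost: the higher it is, the more important the field; the default is 1.0;
lengthNorm: the more terms a field contains (the longer the document), the smaller this value; the shorter the document, the larger the value.
So with default boosts, the norm is essentially determined by lengthNorm.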

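For example, take the document: chinese book
With the search term chinese, numTerms is 2 and lengthNorm is 1/sqrt(2) = 0.70710678118655.

Unfortunately, if you inspect the score in es with explain, the lengthNorm it shows is only 0.625.
The official explanation is precision: the norm is compressed when it is stored and decompressed at query time, and this round trip is lossy, i.e. decode(encode(0.707)) = 0.625.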

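Note:
At index time the norm value is compressed (encoded) into a single byte and stored in the index. At search time the stored norm is decompressed (decoded) back into a float. Both encode and decode are provided by Similarity. According to the official documentation, the process is not reversible because of the precision loss, e.g. decode(encode(0.89)) = 0.75.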

Next, look at the source of Lucene's DefaultSimilarity class; the core methods are:
/** Cache of decoded bytes. */
private static final float[] NORM_TABLE = new float[256];

static {
    for (int i = 0; i < 256; i++) {
        NORM_TABLE[i] = SmallFloat.byte315ToFloat((byte) i);
    }
}

// Runs at index time: encodes the norm into a single 8-bit byte
public final long encodeNormValue(float f) {
    return SmallFloat.floatToByte315(f);
}

// Runs at search time: restores the norm to a float that takes part in scoring
public final float decodeNormValue(long norm) {
    return NORM_TABLE[(int) (norm & 0xFF)]; // & 0xFF maps negative bytes to positive above 127
}

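Look closely at decodeNormValue: the (int) cast there only converts the stored long into an index into NORM_TABLE. The precision is lost earlier, in encodeNormValue, where the float is squeezed into a single byte. Why accept that loss? Presumably for compactness and speed; this is left open for discussion.
Because the norm value is stored directly in the index, retrieval only has to decode it, which makes scoring very fast.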
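
To see where 0.625 comes from, the sketch below re-implements a one-byte float encoding in the style of SmallFloat's "315" format. It is written from my reading of that scheme, not copied from Lucene, so the class name `NormByteSketch` and the exact bit handling are assumptions that should be checked against the SmallFloat source of the Lucene version in use.

```java
final class NormByteSketch {
    // Assumed layout: drop the low 21 bits of the IEEE-754 float, then shift the
    // exponent range so that the remaining bits fit in one unsigned byte.
    private static final int EXPONENT_OFFSET = (63 - 15) << 3;

    // Lossy: everything below the kept mantissa bits is truncated, not rounded.
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        if (small <= EXPONENT_OFFSET) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // zero, negative, or underflow
        }
        if (small >= EXPONENT_OFFSET + 0x100) {
            return (byte) -1;                         // overflow: largest code
        }
        return (byte) (small - EXPONENT_OFFSET);
    }

    // Inverse bit shuffle; the truncated bits come back as zeros.
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float lengthNorm = (float) (1.0 / Math.sqrt(2)); // two-term field
        System.out.println(decode(encode(lengthNorm)));  // 0.625
        System.out.println(decode(encode(1.0f)));        // 1.0
    }
}
```

Under this layout the representable values between 0.5 and 1 are 0.5, 0.625, 0.75 and 0.875, so 0.7071 is truncated down to 0.625, which matches the fieldNorm that explain reports.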

Explanation and example

In es, the _explain API can be used to inspect how a score was computed.

Suppose the document is:

chinese book
and the query is:

{
  "query": {
    "match": {
      "content": "chinese"
    }
  }
}

The result returned by explain is:

{
    "_index": "scoretest",
    "_type": "test",
    "_id": "2",
    "matched": true,
    "explanation": {
        "value": 0.8784157,
        "description": "weight(content:chinese in 1) [PerFieldSimilarity], result of:",
        "details": [
            {
                "value": 0.8784157,
                "description": "fieldWeight in 1, product of:",
                "details": [
                    {
                        "value": 1,
                        "description": "tf(freq=1.0), with freq of:",
                        "details": [
                            {
                                "value": 1,
                                "description": "termFreq=1.0"
                            }
                        ]
                    },
                    {
                        "value": 1.4054651,
                        "description": "idf(docFreq=1, maxDocs=3)"
                    },
                    {
                        "value": 0.625,
                        "description": "fieldNorm(doc=1)"
                    }
                ]
            }
        ]
    }
}
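
The explain output lists three factors under "fieldWeight ... product of:". Multiplying the reported values is a quick sanity check that they account for the final score. The class `ExplainCheckSketch` is a hypothetical name; the numbers are taken from the JSON above.

```java
final class ExplainCheckSketch {
    // fieldWeight as the product of the three factors that explain lists.
    static float fieldWeight(float tf, float idf, float fieldNorm) {
        return tf * idf * fieldNorm;
    }

    public static void main(String[] args) {
        float tf = 1f;            // tf(freq=1.0)
        float idf = 1.4054651f;   // idf(docFreq=1, maxDocs=3)
        float fieldNorm = 0.625f; // fieldNorm(doc=1), the decoded one-byte norm
        float reported = 0.8784157f;

        float computed = fieldWeight(tf, idf, fieldNorm);
        System.out.println(computed);
        System.out.println(Math.abs(computed - reported) < 1e-6f);
    }
}
```

Note that the check uses the decoded norm 0.625; plugging in the exact lengthNorm 1/sqrt(2) would give about 0.9938 instead, which is how the lossy norm encoding shows up in real scores.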