Elasticsearch Scoring: BM25

Scoring

Setup

PUT /score_study
{
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "english"
            }
        }
    }
}
PUT /score_study/_doc/1
{
    "title": "Elastic Stack技术栈简介 2020 最新ElasticSearch教程"
}

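How Scoring Works

Parameters

tf: term frequency component
idf: inverse document frequency
freq: number of occurrences of the term within the document
b: length normalization parameter, default 0.75. It controls how strongly field-length normalization applies: 0.0 disables normalization and 1.0 applies full normalization.
k1: term saturation parameter, default 1.2. It controls how quickly the term-frequency contribution rises toward saturation: the smaller the value, the faster it saturates; the larger the value, the slower it saturates.
dl: length of the field, i.e. the number of tokens in the field for this document; it can be checked with _analyze
avgdl: average length of the field, i.e. the mean token count of the field across all documents
n: number of documents containing the term
N: total number of documents with the field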
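To make the roles of k1 and b concrete, here is a minimal Python sketch. It uses the tf expression that the explain output reports, freq / (freq + k1 * (1 - b + b * dl / avgdl)). The function name `bm25_tf` and the sample field lengths (10 and 100) are illustrative assumptions, not part of Elasticsearch.

```python
def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    """tf component in the form reported by the explain output."""
    norm = 1 - b + b * dl / avgdl  # field-length normalization factor
    return freq / (freq + k1 * norm)


# Saturation: each extra occurrence adds less, and tf stays below 1.
for freq in (1, 2, 5, 20):
    print(freq, round(bm25_tf(freq, dl=10, avgdl=10), 4))

# A smaller k1 saturates faster: tf is already higher at freq=1.
print(bm25_tf(1, 10, 10, k1=0.5), bm25_tf(1, 10, 10, k1=2.0))

# b=0 disables length normalization: a long field scores like a short one.
print(bm25_tf(1, dl=100, avgdl=10, b=0.0) == bm25_tf(1, dl=10, avgdl=10, b=0.0))

# b=1 applies full normalization: the long field is penalized more than at b=0.75.
print(bm25_tf(1, dl=100, avgdl=10, b=1.0) < bm25_tf(1, dl=100, avgdl=10, b=0.75))
```

Both comparisons at the end print True, which matches the description of b above.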

With only one document in the index, the document with ID 1 contains the term elastic, so it matches the search term elastic:

tf = freq / (freq + k1 * (1 - b + b * dl / avgdl)) = 1 / (1 + 1.2 * (1 - 0.75 + 0.75 * 13 / 13)) = 0.45454544
idf = log(1 + (N - n + 0.5) / (n + 0.5)) = log(1 + (1 - 1 + 0.5) / (1 + 0.5)) = 0.2876821
score(freq=1.0) = boost * idf * tf = 2.2 * 0.2876821 * 0.45454544 = 0.2876821
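The arithmetic above can be reproduced with a short Python sketch. The function names are hypothetical, and the boost of 2.2 is simply copied from the explain output. It looks like the default query boost of 1.0 multiplied by (k1 + 1), but that reading is an assumption here rather than something the output states.

```python
import math


def bm25_idf(n, N):
    """idf = log(1 + (N - n + 0.5) / (n + 0.5)), using the natural logarithm."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    """tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def bm25_score(freq, dl, avgdl, n, N, boost=2.2, k1=1.2, b=0.75):
    # boost=2.2 is copied from the explain output, not derived here.
    return boost * bm25_idf(n, N) * bm25_tf(freq, dl, avgdl, k1, b)


# Single-document index: freq=1, dl=avgdl=13, n=N=1.
score = bm25_score(freq=1, dl=13, avgdl=13, n=1, N=1)
print(round(bm25_tf(1, 13, 13), 7), round(bm25_idf(1, 1), 7), round(score, 7))
```

Because dl equals avgdl in a one-document index, tf is 1 / 2.2, so boost * tf is 1 and the score equals the idf. The sketch uses 64-bit floats while the explain output shows 32-bit values, so the last digits can differ slightly.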

GET /score_study/_explain/1
{
    "query": {
        "match": {
            "title": "elastic"
        }
    }
}
Response:
{
    "_index": "score_study",
    "_id": "1",
    "matched": true,
    "explanation": {
        "value": 0.2876821,
        "description": "weight(title:elast in 0) [PerFieldSimilarity], result of:",
        "details": [
            {
                "value": 0.2876821,
                "description": "score(freq=1.0), computed as boost * idf * tf from:",
                "details": [
                    {
                        "value": 2.2,
                        "description": "boost",
                        "details": []
                    },
                    {
                        "value": 0.2876821,
                        "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                        "details": [
                            {
                                "value": 1,
                                "description": "n, number of documents containing term",
                                "details": []
                            },
                            {
                                "value": 1,
                                "description": "N, total number of documents with field",
                                "details": []
                            }
                        ]
                    },
                    {
                        "value": 0.45454544,
                        "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                        "details": [
                            {
                                "value": 1.0,
                                "description": "freq, occurrences of term within document",
                                "details": []
                            },
                            {
                                "value": 1.2,
                                "description": "k1, term saturation parameter",
                                "details": []
                            },
                            {
                                "value": 0.75,
                                "description": "b, length normalization parameter",
                                "details": []
                            },
                            {
                                "value": 13.0,
                                "description": "dl, length of field",
                                "details": []
                            },
                            {
                                "value": 13.0,
                                "description": "avgdl, average length of field",
                                "details": []
                            }
                        ]
                    }
                ]
            }
        ]
    }
}
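The dl value of 13 in this response is the token count of the title field. The _analyze API is the authoritative way to check it. The Python sketch below only approximates the tokenization, under the assumption that each run of ASCII letters or digits becomes one token and each CJK character becomes its own token. It ignores lowercasing, stemming, and stop-word handling, and `approx_tokens` is a hypothetical helper.

```python
import re

# Assumption: ASCII letter/digit runs form one token each, and every CJK
# ideograph is emitted as a separate token. This only approximates the analyzer.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def approx_tokens(text):
    """Split text into approximate tokens for counting the field length."""
    return TOKEN_RE.findall(text)


title = "Elastic Stack技术栈简介 2020 最新ElasticSearch教程"
tokens = approx_tokens(title)
print(len(tokens), tokens)
```

Under that assumption the title yields 13 tokens: three English words, one number, and nine single-character CJK tokens.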

Now add a second document:

PUT /score_study/_doc/2
{
    "title": "Elastic Stack技术栈简介 2020 Elastic"
}
Then query it:
GET /score_study/_explain/2
{
    "query": {
        "match": {
            "title": "elastic"
        }
    }
}
Response:
{
    "_index": "score_study",
    "_id": "2",
    "matched": true,
    "explanation": {
        "value": 0.2642025,
        "description": "weight(title:elast in 0) [PerFieldSimilarity], result of:",
        "details": [
            {
                "value": 0.2642025,
                "description": "score(freq=2.0), computed as boost * idf * tf from:",
                "details": [
                    {
                        "value": 2.2,
                        "description": "boost",
                        "details": []
                    },
                    {
                        "value": 0.18232156,
                        "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                        "details": [
                            {
                                "value": 2,
                                "description": "n, number of documents containing term",
                                "details": []
                            },
                            {
                                "value": 2,
                                "description": "N, total number of documents with field",
                                "details": []
                            }
                        ]
                    },
                    {
                        "value": 0.6586826,
                        "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                        "details": [
                            {
                                "value": 2.0,
                                "description": "freq, occurrences of term within document",
                                "details": []
                            },
                            {
                                "value": 1.2,
                                "description": "k1, term saturation parameter",
                                "details": []
                            },
                            {
                                "value": 0.75,
                                "description": "b, length normalization parameter",
                                "details": []
                            },
                            {
                                "value": 9.0,
                                "description": "dl, length of field",
                                "details": []
                            },
                            {
                                "value": 11.0,
                                "description": "avgdl, average length of field",
                                "details": []
                            }
                        ]
                    }
                ]
            }
        ]
    }
}

After adding the second document, the values for document 2 are:

tf = freq / (freq + k1 * (1 - b + b * dl / avgdl)) = 2 / (2 + 1.2 * (1 - 0.75 + 0.75 * 9 / 11)) = 0.6586826
idf = log(1 + (N - n + 0.5) / (n + 0.5)) = log(1 + (2 - 2 + 0.5) / (2 + 0.5)) = 0.18232156
score(freq=2.0) = boost * idf * tf = 2.2 * 0.18232156 * 0.6586826 = 0.2642025
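As a cross-check, the following Python sketch recomputes the values for document 2, including where avgdl = 11 comes from. The dictionary `field_lengths` is a hypothetical container for the dl values that the explain responses report; the boost of 2.2 is copied from the same output.

```python
import math

field_lengths = {1: 13, 2: 9}  # dl values reported by the explain output
avgdl = sum(field_lengths.values()) / len(field_lengths)  # (13 + 9) / 2

k1, b, boost = 1.2, 0.75, 2.2
freq, n, N = 2, 2, 2  # "elastic" occurs twice in document 2; both documents match

tf = freq / (freq + k1 * (1 - b + b * field_lengths[2] / avgdl))
idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
score = boost * idf * tf
print(avgdl, round(tf, 7), round(idf, 7), round(score, 7))
```

Note that doubling freq from 1 to 2 does not double tf, which is the saturation effect governed by k1.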

The explanation for the original document 1 now returns:

{
    "_index": "score_study",
    "_id": "1",
    "matched": true,
    "explanation": {
        "value": 0.16969931,
        "description": "weight(title:elast in 0) [PerFieldSimilarity], result of:",
        "details": [
            {
                "value": 0.16969931,
                "description": "score(freq=1.0), computed as boost * idf * tf from:",
                "details": [
                    {
                        "value": 2.2,
                        "description": "boost",
                        "details": []
                    },
                    {
                        "value": 0.18232156,
                        "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                        "details": [
                            {
                                "value": 2,
                                "description": "n, number of documents containing term",
                                "details": []
                            },
                            {
                                "value": 2,
                                "description": "N, total number of documents with field",
                                "details": []
                            }
                        ]
                    },
                    {
                        "value": 0.42307693,
                        "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                        "details": [
                            {
                                "value": 1.0,
                                "description": "freq, occurrences of term within document",
                                "details": []
                            },
                            {
                                "value": 1.2,
                                "description": "k1, term saturation parameter",
                                "details": []
                            },
                            {
                                "value": 0.75,
                                "description": "b, length normalization parameter",
                                "details": []
                            },
                            {
                                "value": 13.0,
                                "description": "dl, length of field",
                                "details": []
                            },
                            {
                                "value": 11.0,
                                "description": "avgdl, average length of field",
                                "details": []
                            }
                        ]
                    }
                ]
            }
        ]
    }
}

What changed for document 1:

tf = freq / (freq + k1 * (1 - b + b * dl / avgdl)) = 1 / (1 + 1.2 * (1 - 0.75 + 0.75 * 13 / 11)) = 0.42307693
idf = log(1 + (N - n + 0.5) / (n + 0.5)) = log(1 + (2 - 2 + 0.5) / (2 + 0.5)) = 0.18232156
score(freq=1.0) = boost * idf * tf = 2.2 * 0.18232156 * 0.42307693 = 0.16969931
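Document 1 itself did not change, yet its score dropped from 0.2876821 to 0.16969931. A small Python sketch makes the two causes visible side by side; the helper `factors` and the constant `BOOST` are hypothetical names, and the inputs are the values from the explain responses.

```python
import math


def factors(freq, dl, avgdl, n, N, k1=1.2, b=0.75):
    """Return (idf, tf) using the formulas from the explain output."""
    idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
    tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return idf, tf


BOOST = 2.2  # copied from the explain output

before = factors(freq=1, dl=13, avgdl=13, n=1, N=1)  # index holds only document 1
after = factors(freq=1, dl=13, avgdl=11, n=2, N=2)   # document 2 has been added

for label, (idf, tf) in (("before", before), ("after", after)):
    print(label, round(idf, 7), round(tf, 7), round(BOOST * idf * tf, 7))
```

Both factors shrink. The idf falls because the term now appears in two documents, and the tf falls because avgdl dropped from 13 to 11, which makes document 1 longer than average.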

Now add a third document that does not contain the term:

PUT /score_study/_doc/3
{
    "title": "test Stack技术栈简介 2020 test"
}

Querying the explanation for document 1 now returns:

{
    "_index": "score_study",
    "_id": "1",
    "matched": true,
    "explanation": {
        "value": 0.42512262,
        "description": "weight(title:elast in 0) [PerFieldSimilarity], result of:",
        "details": [
            {
                "value": 0.42512262,
                "description": "score(freq=1.0), computed as boost * idf * tf from:",
                "details": [
                    {
                        "value": 2.2,
                        "description": "boost",
                        "details": []
                    },
                    {
                        "value": 0.47000363,
                        "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                        "details": [
                            {
                                "value": 2,
                                "description": "n, number of documents containing term",
                                "details": []
                            },
                            {
                                "value": 3,
                                "description": "N, total number of documents with field",
                                "details": []
                            }
                        ]
                    },
                    {
                        "value": 0.41114056,
                        "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                        "details": [
                            {
                                "value": 1.0,
                                "description": "freq, occurrences of term within document",
                                "details": []
                            },
                            {
                                "value": 1.2,
                                "description": "k1, term saturation parameter",
                                "details": []
                            },
                            {
                                "value": 0.75,
                                "description": "b, length normalization parameter",
                                "details": []
                            },
                            {
                                "value": 13.0,
                                "description": "dl, length of field",
                                "details": []
                            },
                            {
                                "value": 10.333333,
                                "description": "avgdl, average length of field",
                                "details": []
                            }
                        ]
                    }
                ]
            }
        ]
    }
}

What changed for document 1:

tf = freq / (freq + k1 * (1 - b + b * dl / avgdl)) = 1 / (1 + 1.2 * (1 - 0.75 + 0.75 * 13 / 10.333333)) = 0.41114056
idf = log(1 + (N - n + 0.5) / (n + 0.5)) = log(1 + (3 - 2 + 0.5) / (2 + 0.5)) = 0.47000363
score(freq=1.0) = boost * idf * tf = 2.2 * 0.47000363 * 0.41114056 = 0.42512262
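Adding a document without the term raises N while n stays at 2, so the term becomes rarer and its idf goes up. The Python sketch below illustrates that trend and recomputes document 1's score. It assumes document 3 has 9 tokens, which is consistent with the reported avgdl of 10.333333; the values N = 10 and N = 100 are hypothetical and only show the direction of the effect.

```python
import math


def bm25_idf(n, N):
    """idf = log(1 + (N - n + 0.5) / (n + 0.5)), using the natural logarithm."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


# n stays at 2 while the index grows with documents that lack the term.
for N in (2, 3, 10, 100):
    print(N, round(bm25_idf(2, N), 7))

# Document 1 after adding document 3: avgdl = (13 + 9 + 9) / 3.
avgdl = (13 + 9 + 9) / 3
tf = 1 / (1 + 1.2 * (1 - 0.75 + 0.75 * 13 / avgdl))
score = 2.2 * bm25_idf(2, 3) * tf  # 2.2 is the boost from the explain output
print(round(avgdl, 6), round(tf, 7), round(score, 7))
```

The tf drops slightly because avgdl shrinks again, but the higher idf outweighs it, so document 1's score rises compared with the two-document index.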

Here log is the natural logarithm (base e).

References

https://en.wikipedia.org/wiki/Okapi_BM25
https://www.elastic.co/guide/cn/elasticsearch/guide/current/pluggable-similarites.html