Scoring
Setup
PUT /score_study
{
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "english"
}
}
}
}
PUT /score_study/_doc/1
{
"title": "Elastic Stack技术栈简介 2020 最新ElasticSearch教程"
}
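Scoring mechanism in detail
Parameters
tf, term frequency
idf, inverse document frequency
freq, number of occurrences of the term within the document (occurrences of term within document)
b, length normalization parameter, default 0.75. It controls how strongly field-length normalization applies: 0.0 disables normalization and 1.0 applies full normalization.
k1, term saturation parameter, default 1.2. It controls how quickly the term-frequency component rises toward saturation: the smaller the value, the faster saturation is reached; the larger the value, the more slowly.
dl, length of the field in tokens (length of field); it can be checked with _analyze
avgdl, average length of the field in tokens across all documents (average length of field)
n, number of documents containing the term (number of documents containing term)
N, total number of documents with the field (total number of documents with field)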
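The parameters above combine into the tf, idf and score formulas that the explain output prints. The following is a minimal sketch of those formulas in Python, not Elasticsearch's own code. The function names are hypothetical, and the boost default of 2.2 is simply the value the explain output reports for this query.

```python
import math


def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    """tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))"""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def bm25_idf(n, N):
    """idf = log(1 + (N - n + 0.5) / (n + 0.5)), natural logarithm."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_score(freq, dl, avgdl, n, N, boost=2.2, k1=1.2, b=0.75):
    """score = boost * idf * tf; boost=2.2 is copied from the explain output."""
    return boost * bm25_idf(n, N) * bm25_tf(freq, dl, avgdl, k1, b)


# Saturation: with dl == avgdl, tf grows with freq but stays below 1.
saturation = [bm25_tf(freq, dl=13, avgdl=13) for freq in (1, 2, 5, 10, 100)]
print([round(v, 4) for v in saturation])

# Single-document case: freq=1, dl=avgdl=13, n=N=1.
print(bm25_score(freq=1, dl=13, avgdl=13, n=1, N=1))
```

Python computes in double precision, so the last digits may differ slightly from the values in the explain output.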
When the index holds only document 1, the search term elastic matches only the elastic token in that document:
tf = freq / (freq + k1 * (1 - b + b * dl / avgdl)) = 1 / (1 + 1.2 * (1- 0.75 + 0.75 * 13 / 13)) = 0.45454544
idf = log(1 + (N - n + 0.5) / (n + 0.5)) = log(1 + (1 -1 + 0.5)/ (1 + 0.5))= 0.2876821
score(freq=1.0) = boost * idf * tf = 2.2 * 0.45454544 * 0.2876821 = 0.2876821
GET /score_study/_explain/1
{
"query": {
"match": {
"title": "elastic"
}
}
}
Response
{
"_index": "score_study",
"_id": "1",
"matched": true,
"explanation": {
"value": 0.2876821,
"description": "weight(title:elast in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.2876821,
"description": "score(freq=1.0), computed as boost * idf * tf from:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 0.2876821,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 1,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 1,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 0.45454544,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1.0,
"description": "freq, occurrences of term within document",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 13.0,
"description": "dl, length of field",
"details": []
},
{
"value": 13.0,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
}
}
Add a second document
PUT /score_study/_doc/2
{
"title": "Elastic Stack技术栈简介 2020 Elastic"
}
Then query it
GET /score_study/_explain/2
{
"query": {
"match": {
"title": "elastic"
}
}
}
Response
{
"_index": "score_study",
"_id": "2",
"matched": true,
"explanation": {
"value": 0.2642025,
"description": "weight(title:elast in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.2642025,
"description": "score(freq=2.0), computed as boost * idf * tf from:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 0.18232156,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 2,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 2,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 0.6586826,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 2.0,
"description": "freq, occurrences of term within document",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 9.0,
"description": "dl, length of field",
"details": []
},
{
"value": 11.0,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
}
}
After the second document is added, document 2 scores as follows:
tf = freq / (freq + k1 * (1 - b + b * dl / avgdl)) = 2 / (2 + 1.2 * (1 - 0.75 + 0.75 * 9 / 11)) = 0.6586826
idf = log(1 + (N - n + 0.5) / (n + 0.5)) = log(1 + (2 - 2 + 0.5) / (2 + 0.5)) = 0.18232156
score(freq=2.0) = boost * idf * tf = 2.2 * 0.18232156 * 0.6586826 = 0.2642025
The original document 1 now returns
{
"_index": "score_study",
"_id": "1",
"matched": true,
"explanation": {
"value": 0.16969931,
"description": "weight(title:elast in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.16969931,
"description": "score(freq=1.0), computed as boost * idf * tf from:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 0.18232156,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 2,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 2,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 0.42307693,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1.0,
"description": "freq, occurrences of term within document",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 13.0,
"description": "dl, length of field",
"details": []
},
{
"value": 11.0,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
}
}
What changed for document 1 (avgdl dropped from 13 to 11, and n and N both rose to 2):
tf = freq / (freq + k1 * (1 - b + b * dl / avgdl)) = 1 / (1 + 1.2 * (1- 0.75 + 0.75 * 13 / 11)) = 0.42307693
idf = log(1 + (N - n + 0.5) / (n + 0.5)) = log(1 + (2 -2 + 0.5)/ (2 + 0.5))= 0.18232156
score(freq=1.0) = boost * idf * tf = 2.2 * 0.42307693 * 0.18232156 = 0.16969931
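The drop in document 1's score comes from two effects: idf falls because both documents now contain the term, and tf falls because document 1 (dl = 13) is now longer than the average (avgdl = 11). The sketch below recomputes both documents with the numbers from the explain output. The helper name is hypothetical and boost = 2.2 is taken from that output.

```python
import math


def bm25_score(freq, dl, avgdl, n, N, boost=2.2, k1=1.2, b=0.75):
    """boost * idf * tf, using the formulas printed by the explain API."""
    tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
    return boost * idf * tf


# Document 1 alone in the index: dl == avgdl == 13, n == N == 1.
doc1_before = bm25_score(freq=1, dl=13, avgdl=13, n=1, N=1)

# Both documents indexed: avgdl == (13 + 9) / 2 == 11, n == N == 2.
doc1_after = bm25_score(freq=1, dl=13, avgdl=11, n=2, N=2)
doc2_after = bm25_score(freq=2, dl=9, avgdl=11, n=2, N=2)

print(doc1_before, doc1_after, doc2_after)
```

Document 2 ranks higher because the term occurs twice in a shorter field, so its tf is larger while idf is shared by both documents.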
Add a third, unrelated document
PUT /score_study/_doc/3
{
"title": "test Stack技术栈简介 2020 test"
}
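The field lengths (dl) and their average (avgdl) can be estimated without calling _analyze. The sketch below is only an approximation of the analyzer, under the assumption that each run of ASCII letters or digits becomes one token and each CJK ideograph becomes its own token. It ignores stop words and stemming, which do not change the token count for these three titles. The pattern and function name are hypothetical.

```python
import re

# Assumption: one token per ASCII alphanumeric run, one token per CJK ideograph.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def approx_token_count(text: str) -> int:
    """Rough token count for a title; not a replacement for _analyze."""
    return len(TOKEN_PATTERN.findall(text))


titles = {
    1: "Elastic Stack技术栈简介 2020 最新ElasticSearch教程",
    2: "Elastic Stack技术栈简介 2020 Elastic",
    3: "test Stack技术栈简介 2020 test",
}

lengths = {doc_id: approx_token_count(title) for doc_id, title in titles.items()}
avgdl = sum(lengths.values()) / len(lengths)

print(lengths)  # per-document dl estimate
print(avgdl)    # average field length estimate
```

For these titles the estimate gives lengths of 13, 9 and 9, which agrees with the dl and avgdl values reported by the explain output.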
Querying document 1 now returns
{
"_index": "score_study",
"_id": "1",
"matched": true,
"explanation": {
"value": 0.42512262,
"description": "weight(title:elast in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.42512262,
"description": "score(freq=1.0), computed as boost * idf * tf from:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 0.47000363,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 2,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 3,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 0.41114056,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1.0,
"description": "freq, occurrences of term within document",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 13.0,
"description": "dl, length of field",
"details": []
},
{
"value": 10.333333,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
}
}
What changed for document 1 (N rose to 3 while n stayed at 2, and avgdl dropped to 10.333333):
tf = freq / (freq + k1 * (1 - b + b * dl / avgdl)) = 1 / (1 + 1.2 * (1 - 0.75 + 0.75 * 13 / 10.333333)) = 0.41114056
idf = log(1 + (N - n + 0.5) / (n + 0.5)) = log(1 + (3 - 2 + 0.5) / (2 + 0.5)) = 0.47000363
score(freq=1.0) = boost * idf * tf = 2.2 * 0.47000363 * 0.41114056 = 0.42512262
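Adding a document that does not contain the term raises document 1's score, because idf grows as N grows while n stays fixed. The sketch below shows that trend and recomputes the three-document score. The function name is hypothetical, boost = 2.2 is taken from the explain output, and the rows for N = 4 and N = 5 are illustrative only and were not measured on the index.

```python
import math


def bm25_idf(n, N):
    """idf = log(1 + (N - n + 0.5) / (n + 0.5)), natural logarithm."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


# n stays at 2 (documents 1 and 2 contain the term) while N grows.
for total_docs in (2, 3, 4, 5):
    print(total_docs, round(bm25_idf(2, total_docs), 8))

# Document 1 with three documents indexed: avgdl = (13 + 9 + 9) / 3.
avgdl = (13 + 9 + 9) / 3
tf = 1 / (1 + 1.2 * (1 - 0.75 + 0.75 * 13 / avgdl))
score = 2.2 * bm25_idf(2, 3) * tf
print(score)
```

The tf component falls slightly because avgdl shrinks, but the rise in idf outweighs it, so the final score is higher than in the two-document case.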
log here is the natural logarithm (base e).
References
https://en.wikipedia.org/wiki/Okapi_BM25
https://www.elastic.co/guide/cn/elasticsearch/guide/current/pluggable-similarites.html