Elasticsearch实战(二十)---ES相关度分数评分算法分析及相关度分数优化

Elasticsearch实战—ES相关度分数评分算法分析

ES相关度评分算法靠三个部分来依次实现,没有先后顺序,是一个逐层推进的逻辑

  1. Boolean模型 根据过滤条件true,false来过滤doc
  2. TFIDF模型
  3. VSM空间向量模型

1.ES相关度分数评分算法

1.1 Booolean

boolean根据搜索条件过滤doc的国车过是不做相关度分数计算的,只是为了标记出来哪些doc是符合搜索条件要求的

1.2 TFIDF模型

了解文档分词处理的都听过TFIDF模型,TF词频,IDF逆文本频率,说白了就是单词term出现了再所有文档中出现了多少次,出现越多,说明这个单词越没有标识度,越不重要,和文档的相关度分数越低
比如下面多个文章ABC中出现多次吃饭,一个文章C中出现一次原子弹,那肯定原子弹肯定对文章C很重要 很有标识度,原子弹这个单词对C来说 权重很高,这就是TFIDF模型

文章DOC A :{ 吃饭, 喝酒, 喝茶}
文章DOC B: {吃饭,原子弹}
文章DOC C:{吃饭, 喝酒}

在这里插入图片描述

1.3 VSM空间向量模型

VSM这个就更为专业

  • 我们从document出发。document由若干个term组成,通过TF/IDF算法计算后,我们可以得知每一个term在document中的权重,而不同的term又会根据自己的权重影响当前document的相关度得分
  • 我们将当前document中出现的所有term的权重组合起来,形成一条向量 Document Vector, Document Vector可能会有多条
  • 有了向量可以根据 余弦函数cos计算两个向量的夹角,夹角越大,说明偏离越远,两个向量越不相似,进而得出文章不相似

Document = {term1, term2, …… ,termN}
Document Vector = {weight1, weight2, …… ,weightN}
在这里插入图片描述

2.ES相关度分数优化

2.1 准备数据

先构造 index:testquery, 然后构造mapping结构, 插入测试数据

#构建 库index testquer
put /testquery

#构建mapping结构
put /testquery/_mapping
{
    "properties" : {
      "address" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          },
          "copy_to" : [
            "info"
          ]
        },
      "age" : {
          "type" : "long"
        },
      "area" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
      "city" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
      "content" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
      "deptName" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          },
          "fielddata" : true
        },
      "empId" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
      "info" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
      "mobile" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          },
          "copy_to" : [
            "info"
          ]
        },
      "name" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          },
          "copy_to" : [
            "info"
          ]
        },
      "provice" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          },
          "fielddata" : true
        },
      "salary" : {
          "type" : "long"
        },
      "sex" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
 		 "addtime" : {
          "type":"date",
          //时间格式 epoch_millis表示毫秒
          "format":"yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
        }
    }
}

插入测试数据

put /testquery/_bulk
{"index":{"_id": 1},"addtime":"1658041203000"}
{"empId" : "111","name" : "员工1","age" : 20,"sex" : "男","mobile" : "19000001111","salary":1333,"deptName" : "技术部","provice" : "湖北省","city":"武汉","area":"光谷大道","address":"湖北省武汉市洪山区光谷大厦","content" : "i like to write best elasticsearch article", "addtime":"1658140003000"}
{"index":{"_id": 2}}
{"empId" : "222","name" : "员工2","age" : 25,"sex" : "男","mobile" : "19000002222","salary":15963,"deptName" : "销售部","provice" : "湖北省","city":"武汉","area":"江汉区","address" : "湖北省武汉市江汉路","content" : "i think java is the best programming language"}
{"index":{"_id": 3},"addtime":"1658040045600"}
{ "empId" : "333","name" : "员工3","age" : 30,"sex" : "男","mobile" : "19000003333","salary":20000,"deptName" : "技术部","provice" : "湖北省","city":"武汉","area":"经济技术开发区","address" : "湖北省武汉市经济开发区","content" : "i am only an elasticsearch beginner"}
{"index":{"_id": 4},"addtime":"1658040012000"}
{"empId" : "444","name" : "员工4","age" : 20,"sex" : "女","mobile" : "19000004444","salary":5600,"deptName" : "销售部","provice" : "湖北省","city":"武汉","area":"沌口开发区","address" : "湖北省武汉市沌口开发区","content" : "elasticsearch and hadoop are all very good solution, i am a beginner"}
{"index":{"_id": 5},"addtime":"1658040593000"}
{ "empId" : "555","name" : "员工5","age" : 20,"sex" : "男","mobile" : "19000005555","salary":9665,"deptName" : "测试部","provice" : "湖北省","city":"高新开发区","area":"武汉","address" : "湖北省武汉市东湖隧道","content" : "spark is best big data solution based on scala ,an programming language similar to java"}
{"index":{"_id": 6},"addtime":"1658043403000"}
{"empId" : "666","name" : "员工6","age" : 30,"sex" : "女","mobile" : "19000006666","salary":30000,"deptName" : "技术部","provice" : "武汉市","city":"湖北省","area":"江汉区","address" : "湖北省武汉市江汉路","content" : "i like java developer","addtime":"1658041003000"}
{"index":{"_id": 7}}
{"empId" : "777","name" : "员工7","age" : 60,"sex" : "女","mobile" : "19000007777","salary":52130,"deptName" : "测试部","provice" : "湖北省","city":"黄冈市","area":"边城区","address" : "湖北省黄冈市边城区","content" : "i like elasticsearch developer","addtime":"1658040008000"}
{"index":{"_id": 8}}
{"empId" : "888","name" : "员工8","age" : 19,"sex" : "女","mobile" : "19000008888","salary":60000,"deptName" : "技术部","provice" : "湖北省","city":"武汉","area":"汉阳区","address" : "湖北省武汉市江汉大学","content" : "i like spark language","addtime":"1656040003000"}
{"index":{"_id": 9}}
{"empId" : "999","name" : "员工9","age" : 40,"sex" : "男","mobile" : "19000009999","salary":23000,"deptName" : "销售部","provice" : "河南省","city":"郑州市","area":"二七区","address" : "河南省郑州市郑州大学","content" : "i like java developer","addtime":"1608040003000"}
{"index":{"_id": 10}}
{"empId" : "101010","name" : "张湖北","age" : 35,"sex" : "男","mobile" : "19000001010","salary":18000,"deptName" : "测试部","provice" : "湖北省","city":"武汉","area":"高新开发区","address" : "湖北省武汉市东湖高新","content" : "i like java developer i also like  elasticsearch","addtime":"1654040003000"}
{"index":{"_id": 11}}
{"empId" : "111111","name" : "王河南","age" : 61,"sex" : "男","mobile" : "19000001011","salary":10000,"deptName" : "销售部",,"provice" : "河南省","city":"开封市","area":"金明区","address" : "河南省开封市河南大学","content" : "i am not like  java ","addtime":"1658740003000"}
{"index":{"_id": 12}}
{"empId" : "121212","name" : "张大学","age" : 26,"sex" : "女","mobile" : "19000001012","salary":1321,"deptName" : "测试部",,"provice" : "河南省","city":"开封市","area":"金明区","address" : "河南省开封市河南大学","content" : "i am java developer  thing java is good","addtime":"165704003000"}
{"index":{"_id": 13}}
{"empId" : "131313","name" : "李江汉","age" : 36,"sex" : "男","mobile" : "19000001013","salary":1125,"deptName" : "销售部","provice" : "河南省","city":"郑州市","area":"二七区","address" : "河南省郑州市二七区","content" : "i like java and java is very best i like it do you like java ","addtime":"1658140003000"}
{"index":{"_id": 14}}
{"empId" : "141414","name" : "王技术","age" : 45,"sex" : "女","mobile" : "19000001014","salary":6222,"deptName" : "测试部",,"provice" : "河南省","city":"郑州市","area":"金水区","address" : "河南省郑州市金水区","content" : "i like c++","addtime":"1656040003000"}
{"index":{"_id": 15}}
{"empId" : "151515","name" : "张测试","age" : 18,"sex" : "男","mobile" : "19000001015","salary":20000,"deptName" : "技术部",,"provice" : "河南省","city":"郑州市","area":"高新开发区","address" : "河南省郑州高新开发区","content" : "i think spark is good","addtime":"1658040003000"}
2.2 Boost 增加搜索条件权重

设置boost查询条件权重可以实现影响搜索结果评分的目的,比如 查询条件后面加上boost,实现当前条件关联度倍增的效果

#不加boost条件查询
get /testquery/_search
{
  "query":{
   "bool": {
     "should": [
       {
         "match": {
           "provice.keyword": "湖北省"
         }
       },
       {
         "match": {
           "address": "开发区"
         }
       }
     ]
   }
  }
}

不加条件查询结果
在这里插入图片描述

现在给 address地址 加权重boost,认为address包含开发的排名更优先
然后员工4 中address包含开发,分数直接飙升到 12.44
员工1中address并没有开发 两个字,所以address 的 boost对员工1的分数没有影响依旧是 0.344分
在这里插入图片描述

2.3 Negative boost 削弱搜索条件权重

设置negative boost 削弱查询条件的权重 可以实现影响搜索结果评分的目的,削弱查询条件对分数的影响

#设置 negative_boost 权重为 1 看下结果
get /testquery/_search
{
  "query":{
   "boosting": {
     "positive": {
       "match": {
         "provice.keyword": "湖北省"
       }
     },
     "negative": {
       "match": {
         "deptName.keyword": "销售部"
       }
     },
     "negative_boost": 1
   }
  }
}

设置 negative_boost 权重为 1 看下结果
员工1:0.344 湖北省技术部
员工2:0.344 湖北省销售部
在这里插入图片描述

现在 negative_boost修改为 0.2 看下结果
员工1:0.344 湖北技术部 不受影响,因为他的部门deptname不是销售部,所以削弱销售部的权重不影响他
员工2:0.068 湖北销售部 受影响,分数明显降低,相关度降低
在这里插入图片描述

2.4 Function score 自定义相关分数算法

场景:
现在我想把 相关度分数和 文章的浏览量关联起来, 浏览量越大,分数越高,怎么实现

分数算法有几个关键点

  1. query内部使用 function_score 表明我要使用自定义相关度分数
  2. function_score内部 使用 field_value_factor 表明参与到分数计算的字段 设置,及按照什么来计算等
  3. function_score 的 field表示 对哪个字段进行积分
  4. modifier表示 对哪个字段进行积分 比如 ln, log1p, log2p log 等等算式
  5. factor 表示 对 你要计算的字段 field 的值 与 factor 相乘 处理
  6. boost_mode表示 分数 旧分数和新分数 如何处理 累加/减/乘/除/max/min 等等
  7. max_boost表示 限制计算出来的分数不要超过max_boost指定的值 , 不是最终得分不超过多少
    我们下一篇文章 单独讲解一下 如何实现这种场景及 自定义相关度分数算法如何实现, 每个参数都是如何使用的详解

至此 我们已经学习了 ES相关度分数评分算法分析, 也了解了 ES 实现相关度分析底层原理 使用 boolean模型,TFIDF,VSM空间向量模型计算相关度,也会使用 boost, negativeboost 来增加,削弱 查询条件权重 等等

下一篇我们着重讲解下 如何实现自定义算法 function score ES相关度分数评分优化及FunctionScore 自定义相关度分数算法

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值