进阶-第14__深度探秘搜索技术_使用most_fields策略进行cross-fields search弊端大揭秘

两点一刻

于 2019-03-10 10:40:16 发布

阅读量185

点赞数

CC 4.0 BY-SA版权

分类专栏： elasticsearch 文章标签： elasticsearch

版权声明：本文为博主原创文章，遵循 CC 4.0 BY-SA 版权协议，转载请附上原文出处链接和本声明。

本文链接：https://blog.youkuaiyun.com/qq_35524586/article/details/88375613

elasticsearch 专栏收录该内容

180 篇文章

订阅专栏

探讨了在Elasticsearch中使用most_fields类型进行cross-fields搜索的挑战，包括如何处理多个字段中的标识符，如人名和地址，以及TF/IDF算法在匹配特定term时的局限性。

cross-fields搜索，一个唯一标识，跨了多个field。比如一个人，标识，是姓名；一个建筑，它的标识是地址。姓名可以散落在多个field中，比如first_name和last_name中，地址可以散落在country，province，city中。

跨多个field搜索一个标识，比如搜索一个人名，或者一个地址，就是cross-fields搜索

初步来说，如果要实现，可能用most_fields比较合适。因为best_fields是优先搜索单个field最匹配的结果，cross-fields本身就不是一个field的问题了。

插入测试数据

POST /forum/article/_bulk

{ "update": { "_id": "1"} }

{ "doc" : {"author_first_name" : "Peter", "author_last_name" : "Smith"} }

{ "update": { "_id": "2"} }

{ "doc" : {"author_first_name" : "Smith", "author_last_name" : "Williams"} }

{ "update": { "_id": "3"} }

{ "doc" : {"author_first_name" : "Jack", "author_last_name" : "Ma"} }

{ "update": { "_id": "4"} }

{ "doc" : {"author_first_name" : "Robbin", "author_last_name" : "Li"} }

{ "update": { "_id": "5"} }

{ "doc" : {"author_first_name" : "Tonny", "author_last_name" : "Peter Smith"} }//这个是最期望排到最前面的

Multi_match mos_fields 搜索

GET /forum/article/_search

{

"query": {

"multi_match": {

"query": "Peter Smith",

"type": "most_fields",

"fields": [ "author_first_name", "author_last_name" ]

}

}

}

搜索结果

{

"took": 1,

"timed_out": false,

"_shards": {

"total": 5,

"successful": 5,

"failed": 0

},

"hits": {

"total": 3,

"max_score": 0.6931472,

"hits": [

{

"_index": "forum",

"_type": "article",

"_id": "2",

"_score": 0.6931472,

"_source": {

"articleID": "KDKE-B-9947-#kL5",

"userID": 1,

"hidden": false,

"postDate": "2017-01-02",

"tag": [

"java"

],

"tag_cnt": 1,

"view_cnt": 50,

"title": "this is java blog",

"content": "i think java is the best programming language",

"sub_title": "learned a lot of course",

"author_first_name": "Smith",

"author_last_name": "Williams"

}

},

{

"_index": "forum",

"_type": "article",

"_id": "1",

"_score": 0.5753642,

"_source": {

"articleID": "XHDK-A-1293-#fJ3",

"userID": 1,

"hidden": false,

"postDate": "2017-01-01",

"tag": [

"java",

"hadoop"

],

"tag_cnt": 2,

"view_cnt": 30,

"title": "this is java and elasticsearch blog",

"content": "i like to write best elasticsearch article",

"sub_title": "learning more courses",

"author_first_name": "Peter",

"author_last_name": "Smith"

}

},

{

"_index": "forum",

"_type": "article",

"_id": "5",

"_score": 0.51623213,

"_source": {

"articleID": "DHJK-B-1395-#Ky5",

"userID": 3,

"hidden": false,

"postDate": "2017-03-01",

"tag": [

"elasticsearch"

],

"tag_cnt": 1,

"view_cnt": 10,

"title": "this is spark blog",

"content": "spark is best big data solution based on scala ,an programming language similar to java",

"sub_title": "haha, hello world",

"author_first_name": "Tonny",

"author_last_name": "Peter Smith"

}

}

]

}

}

Peter Smith，匹配author_first_name，匹配到了Smith，这时候它的分数很高，为什么啊？？？

因为IDF分数高，IDF分数要高，那么这个匹配到的term（Smith），在所有doc中的出现频率要低，author_first_name field中，Smith就出现过1次

Peter Smith这个人，doc 1，Smith在author_last_name中，但是author_last_name出现了两次Smith，所以导致doc 1的IDF分数较低

不要有过多的疑问，一定是这样吗？

问题1：只是找到尽可能多的field匹配的doc，而不是某个field完全匹配的doc

问题2：most_fields，没办法用minimum_should_match去掉长尾数据，就是匹配的特别少的结果

问题3：TF/IDF算法，比如Peter Smith和Smith Williams，搜索Peter Smith的时候，由于first_name中很少有Smith的，所以query在所有document中的频率很低，得到的分数很高，可能Smith Williams反而会排在Peter Smith前面

评论

被折叠的条评论为什么被折叠?

到【灌水乐园】发言

查看更多评论

添加红包

成就一亿技术人!

hope_wisdom

发出的红包

实付元

使用余额支付

点击重新获取

扫码支付

钱包余额 0

抵扣说明：

1.余额是钱包充值的虚拟货币，按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载，可以购买VIP、付费专栏及课程。