Problem:
The index contains a field with the value "第十人民医院" (Tenth People's Hospital). Analyzing it with IK gives the following:
POST http://localhost:9200/development_hospitals/_analyze?pretty&field=hospital.names&analyzer=ik
{
"tokens": [
{
"token": "第十",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "十人",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 1
},
{
"token": "十",
"start_offset": 1,
"end_offset": 2,
"type": "TYPE_CNUM",
"position": 2
},
{
"token": "人民医院",
"start_offset": 2,
"end_offset": 6,
"type": "CN_WORD",
"position": 3
},
{
"token": "人民",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 4
},
{
"token": "人",
"start_offset": 2,
"end_offset": 3,
"type": "COUNT",
"position": 5
},
{
"token": "民医院",
"start_offset": 3,
"end_offset": 6,
"type": "CN_WORD",
"position": 6
},
{
"token": "医院",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 7
}
]
}
Building a match query in Postman:

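Analysis:
Reference: The Definitive Guide [2.x] | Elastic
A phrase search depends on token positions: the query tokens must appear in the field at the same relative positions. In the output above, "第十" is at position 0 and "十" is at position 2, with "十人" in between at position 1. Analyzing the query "第十" with ik_max_word, however, puts "十" at position 1, directly after "第十", so the phrase does not line up with the indexed tokens: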
{
"tokens": [
{
"token": "第十",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "十",
"start_offset": 1,
"end_offset": 2,
"type": "TYPE_CNUM",
"position": 1
}
]
}
Solution:
Using the ik_smart analyzer avoids this problem. The ik_smart results for "第十人民医院" and "第十" are, respectively:
{
"tokens": [
{
"token": "第十",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "人民医院",
"start_offset": 2,
"end_offset": 6,
"type": "CN_WORD",
"position": 1
}
]
}
{
"tokens": [
{
"token": "第十",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
}
]
}
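With ik_smart the query token "第十" sits at position 0 in both the field and the query, so the phrase matches reliably.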
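The difference between the two analyzers can be checked with a small sketch. This is a simplified model of a zero-slop phrase match, not Lucene's actual implementation: it only tests whether every query token occurs in the field at the same relative position. The function name `phrase_matches` and the variable names are hypothetical; the token positions are copied from the `_analyze` outputs above.

```python
from typing import Dict, List, Set, Tuple

Token = Tuple[str, int]  # (term, position) as reported by _analyze


def phrase_matches(doc_tokens: List[Token], query_tokens: List[Token]) -> bool:
    """Simplified slop-0 phrase check (an assumption-level model):
    every query term must occur in the document at the same relative
    position it has in the analyzed query."""
    if not query_tokens:
        return False
    # Map each document term to the set of positions where it occurs.
    positions: Dict[str, Set[int]] = {}
    for term, pos in doc_tokens:
        positions.setdefault(term, set()).add(pos)
    first_term, first_pos = query_tokens[0]
    # Try every occurrence of the first query term as the phrase start.
    for start in positions.get(first_term, set()):
        offset = start - first_pos
        if all(pos + offset in positions.get(term, set())
               for term, pos in query_tokens):
            return True
    return False


# ik_max_word: field "第十人民医院" and query "第十"
doc_max = [("第十", 0), ("十人", 1), ("十", 2), ("人民医院", 3),
           ("人民", 4), ("人", 5), ("民医院", 6), ("医院", 7)]
query_max = [("第十", 0), ("十", 1)]

# ik_smart: same field and query
doc_smart = [("第十", 0), ("人民医院", 1)]
query_smart = [("第十", 0)]

print(phrase_matches(doc_max, query_max))      # False: "十" is at 2, not 1
print(phrase_matches(doc_smart, query_smart))  # True
```

Under this model the ik_max_word phrase fails only because "十人" pushes "十" one position further in the field than in the query.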
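Best practice:
match_phrase matching is very strict, so it also misses some relevant results. In my view it is better to combine both approaches in a bool query and to boost the match_phrase clause. For example, index name twice with two different analyzers, then use match_phrase on the ik_smart field and match on the ik_max_word field: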
{
"query": {
"bool": {
"should": [
{"match_phrase": {"name1": {"query": "第十", "boost": 2}}},
{"match": {"name2": "第十"}}
]
}
},
"explain": true
}
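One way to set this up is sketched below. It builds the `properties` fragment of a mapping that indexes the name under both analyzers, plus the combined query body, as plain Python dicts. The helper names `build_properties` and `build_query` are hypothetical, and the field type is an assumption that depends on the Elasticsearch version ("string" on 2.x, "text" on 5.x and later). How the fragment is wrapped in a full mapping also varies by version, so that part is left out.

```python
import json


def build_properties(field_type: str = "text") -> dict:
    """Mapping `properties` fragment: the same name indexed twice.

    field_type is an assumption: "string" on Elasticsearch 2.x,
    "text" on 5.x and later.
    """
    return {
        "name1": {"type": field_type, "analyzer": "ik_smart"},
        "name2": {"type": field_type, "analyzer": "ik_max_word"},
    }


def build_query(text: str, phrase_boost: float = 2) -> dict:
    """Bool query: boosted match_phrase on name1, plain match on name2."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match_phrase": {"name1": {"query": text,
                                                "boost": phrase_boost}}},
                    {"match": {"name2": text}},
                ]
            }
        },
        "explain": True,
    }


body = build_query("第十")
# Serialize to the JSON that would be sent as the search request body.
print(json.dumps(body, ensure_ascii=False, indent=2))
```

Documents that match the exact phrase on name1 score higher because of the boost, while the looser match on name2 still brings in partially relevant results.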