The key to pinyin search is converting Chinese characters into pinyin, so all we need is an Elasticsearch plugin that does this conversion. GitHub hosts exactly such a plugin: the elasticsearch-analysis-pinyin project.
Installing the pinyin plugin
First, download the pinyin plugin release that matches your Elasticsearch version; you can find it in the releases section of the plugin's GitHub page.
Then install it by extracting the archive into the plugins directory of your Elasticsearch installation.
Finally, restart Elasticsearch.
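As an alternative to unzipping manually, the plugin CLI can install directly from a release URL. This is a sketch: the version in the URL is an assumption, and the plugin version must match your Elasticsearch version exactly.

```shell
# Hypothetical example: replace 7.17.6 with your exact Elasticsearch version.
bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v7.17.6/elasticsearch-analysis-pinyin-7.17.6.zip
```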
Testing
In Kibana, run the following command to test it:
POST _analyze
{
  "text": ["张学友", "刘德华"],
  "analyzer": "pinyin"
}
Result:
{
  "tokens" : [
    {
      "token" : "zhang",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "zxy",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "xue",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "you",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "liu",
      "start_offset" : 1,
      "end_offset" : 1,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "ldh",
      "start_offset" : 1,
      "end_offset" : 1,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "de",
      "start_offset" : 1,
      "end_offset" : 1,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "hua",
      "start_offset" : 1,
      "end_offset" : 1,
      "type" : "word",
      "position" : 5
    }
  ]
}
Combining analyzers
During text analysis, Elasticsearch uses an analyzer. We have loosely called it a "tokenizer" before, but it is really an analyzer, which generally consists of two parts:
- Tokenizer: splits the text content into terms
- Filter: token filters that further process the terms, e.g. pinyin conversion or synonym expansion
We can combine the various analysis plugins we have installed, using them as the tokenizer or as filters, to build a custom analyzer.
Example:
PUT /goods
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_pinyin": {
          "tokenizer": "ik_smart",
          "filter": [
            "py"
          ]
        }
      },
      "filter": {
        "py": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "name": {
        "type": "completion",
        "analyzer": "my_pinyin",
        "search_analyzer": "ik_smart"
      },
      "title": {
        "type": "text",
        "analyzer": "my_pinyin",
        "search_analyzer": "ik_smart"
      },
      "price": {
        "type": "long"
      }
    }
  }
}
Note the settings of the pinyin token filter. Per the plugin's README:
- keep_full_pinyin: false drops the per-character full pinyin (e.g. liu, de, hua as separate terms)
- keep_joined_full_pinyin: true keeps the joined full pinyin (e.g. liudehua)
- keep_original: true also keeps the original Chinese term
- limit_first_letter_length: 16 caps the length of the first-letter abbreviation (e.g. ldh)
- none_chinese_pinyin_tokenize: false leaves non-Chinese text alone instead of splitting it into pinyin terms
Testing the custom analyzer
Run a test in Kibana to see the tokenization:
POST /goods/_analyze
{
  "text": "你好,华为",
  "analyzer": "my_pinyin"
}
Result:
{
  "tokens" : [
    {
      "token" : "你好",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "nihao",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "nh",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "华为",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "huawei",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "hw",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 1
    }
  ]
}
Testing pinyin completion
With the pinyin analyzer in place, we can provide autocompletion even when the user types pinyin.
First, insert some data:
PUT /goods/_bulk
{"index": {"_id": 1}}
{"id": 1, "name": ["小米", "手机"], "title": "小米10手机"}
{"index": {"_id": 2}}
{"id": 2, "name": ["小米", "空调"], "title": "小米空调"}
{"index": {"_id": 3}}
{"id": 3, "name": ["sony", "mp3"], "title": "sony播放器"}
{"index": {"_id": 4}}
{"id": 4, "name": ["松下", "电视"], "title": "松下电视"}
Then issue an autocomplete query, using a prefix for the suggestion:
POST /goods/_search
{
  "suggest": {
    "name_suggest": {
      "prefix": "s",
      "completion": {
        "field": "name"
      }
    }
  }
}
Note that the keyword we typed is a single letter: s.
The result:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "suggest" : {
    "name_suggest" : [
      {
        "text" : "s",
        "offset" : 0,
        "length" : 1,
        "options" : [
          {
            "text" : "sony",
            "_index" : "goods",
            "_type" : "_doc",
            "_id" : "3",
            "_score" : 1.0,
            "_source" : {
              "id" : 3,
              "name" : "sony",
              "title" : "sony播放器"
            }
          },
          {
            "text" : "手机",
            "_index" : "goods",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : 1.0,
            "_source" : {
              "id" : 1,
              "name" : "手机",
              "title" : "小米手机"
            }
          },
          {
            "text" : "松下",
            "_index" : "goods",
            "_type" : "_doc",
            "_id" : "4",
            "_score" : 1.0,
            "_source" : {
              "id" : 4,
              "name" : "松下",
              "title" : "松下电视"
            }
          }
        ]
      }
    ]
  }
}
The returned suggestions are sony, 松下 (songxia), and 手机 (shouji), all of which begin with s. Pretty cool, right?
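Because the filter was configured with keep_joined_full_pinyin, longer pinyin prefixes should work as well. A hypothetical follow-up query, not from the original test run, assuming the same index and data:

```json
POST /goods/_search
{
  "suggest": {
    "name_suggest": {
      "prefix": "xiaomi",
      "completion": {
        "field": "name"
      }
    }
  }
}
```

This should suggest 小米, since its joined full pinyin, xiaomi, matches the typed prefix.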