Custom dictionary:
1. Add the dictionary
mkdir -p elasticsearch-2.4.4/plugins/analysis-ik/config/custom
vi elasticsearch-2.4.4/plugins/analysis-ik/config/custom/ext_word.dic
博世
bosch
Notes:
1) One word per line
2) Encoding must be UTF-8 without a BOM
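The two notes above can be checked mechanically before restarting. The following is a minimal sketch, not part of the IK plugin: the function name `check_dictionary` is hypothetical, and treating whitespace inside a line as "more than one word" is an assumption about how the dictionary is meant to be written.

```python
import codecs
from typing import List


def check_dictionary(raw: bytes) -> List[str]:
    """Return a list of problems found in the raw bytes of a dictionary file."""
    problems = []
    # Note 2: the file must not start with a UTF-8 byte order mark.
    if raw.startswith(codecs.BOM_UTF8):
        problems.append("file starts with a UTF-8 BOM")
        raw = raw[len(codecs.BOM_UTF8):]
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return problems + ["file is not valid UTF-8"]
    # Note 1: one word per line (assumption: inner whitespace means several words).
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line.split()) > 1:
            problems.append(f"line {number} contains more than one word")
    return problems
```

To check a file on disk, read it in binary mode and pass the bytes in, for example `check_dictionary(open(path, "rb").read())`; an empty list means no problem was detected by this sketch.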
2. Modify the IK configuration (IKAnalyzer.cfg.xml)
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Configure your own extension dictionaries here -->
    <entry key="ext_dict">custom/ext_word.dic;custom/single_word_low_freq.dic</entry>
    <!-- Configure your own extension stopword dictionaries here -->
    <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
    <!-- Configure remote extension dictionaries here -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- Configure remote extension stopword dictionaries here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
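Because the dictionary paths in `ext_dict` are separated by semicolons and must match the files actually created, it can help to list them from the configuration. This is a small sketch using only the Python standard library; the function name `dictionary_paths` and the `SAMPLE` constant are illustrative, and the sample is a shortened copy of the configuration above.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def dictionary_paths(config_xml: bytes) -> Dict[str, List[str]]:
    """Map each active <entry> key to its list of ';'-separated paths."""
    root = ET.fromstring(config_xml)
    result = {}
    # Commented-out entries are XML comments, so findall() does not return them.
    for entry in root.findall("entry"):
        paths = [p.strip() for p in (entry.text or "").split(";") if p.strip()]
        if paths:
            result[entry.get("key")] = paths
    return result


SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
  <entry key="ext_dict">custom/ext_word.dic;custom/single_word_low_freq.dic</entry>
  <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
  <!-- <entry key="remote_ext_dict">words_location</entry> -->
</properties>"""

for key, paths in dictionary_paths(SAMPLE).items():
    print(key, paths)
```

Each returned path is relative to the IK config directory, so every one of them should exist there before the restart.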
3. Restart Elasticsearch
Synonym configuration:
1. Add the dictionary
mkdir -p elasticsearch-2.4.4/config/analysis
vi elasticsearch-2.4.4/config/analysis/synonym.txt
博世,bosch
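Notes:
1) One group of synonyms per line, with the terms separated by commas
2) Encoding must be UTF-8 without a BOM
3) Elasticsearch must be restarted after synonym.txt is modified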
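The line format described in the notes can be validated with a short parser. This is a sketch under stated assumptions: the name `parse_synonyms` is hypothetical, it only handles the comma-separated groups used in this note (not explicit `=>` mappings), it skips blank lines and lines starting with `#`, and the full-width comma check is a heuristic for a common typing mistake.

```python
from typing import List


def parse_synonyms(text: str) -> List[List[str]]:
    """Parse comma-separated synonym groups, one group per line."""
    groups = []
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        if "，" in line:
            # Heuristic: a full-width comma is not treated as a separator here.
            raise ValueError(f"line {number}: full-width comma found, use ','")
        terms = [t.strip() for t in line.split(",") if t.strip()]
        if len(terms) < 2:
            raise ValueError(f"line {number}: a synonym group needs at least two terms")
        groups.append(terms)
    return groups


print(parse_synonyms("博世,bosch\n"))
```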
2. Modify the index configuration
Create a new business index 2_syn and add the synonym filter synonym_filter.
The settings are as follows:
{
  "index": {
    "analysis": {
      "filter": {
        "light_english_stemmer": {
          "type": "stemmer",
          "language": "light_english"
        },
        "special_character_spliter": {
          "type": "word_delimiter",
          "preserve_original": "true"
        },
        "synonym_filter": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt"
        }
      },
      "analyzer": {
        "charSplit": {
          "filter": ["lowercase", "synonym_filter"],
          "type": "custom",
          "tokenizer": "ngram_tokenizer"
        },
        "optik_smart": {
          "filter": ["lowercase", "light_english_stemmer", "special_character_spliter", "synonym_filter"],
          "type": "custom",
          "tokenizer": "ik_smart"
        },
        "optik": {
          "filter": ["lowercase", "light_english_stemmer", "special_character_spliter", "synonym_filter"],
          "type": "custom",
          "tokenizer": "ik"
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "token_chars": ["letter", "digit", "punctuation"],
          "min_gram": "1",
          "type": "nGram",
          "max_gram": "30"
        }
      }
    }
  }
}
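Each analyzer in these settings refers to its filters by name, so a misspelled filter name only shows up when the index is created. The sketch below is a hypothetical pre-flight check, not an Elasticsearch API: `undefined_filters` and `BUILTIN_FILTERS` are illustrative names, and the built-in set is deliberately limited to the one built-in filter these settings use.

```python
from typing import Dict, List

# Built-in filter names referenced by the settings in this note (incomplete on purpose).
BUILTIN_FILTERS = {"lowercase"}


def undefined_filters(analysis: dict) -> Dict[str, List[str]]:
    """Return, per analyzer, the filter names that are neither defined nor built in."""
    defined = set(analysis.get("filter", {})) | BUILTIN_FILTERS
    missing = {}
    for name, analyzer in analysis.get("analyzer", {}).items():
        unknown = [f for f in analyzer.get("filter", []) if f not in defined]
        if unknown:
            missing[name] = unknown
    return missing


# A trimmed copy of the "analysis" section, keeping only the synonym-related parts.
ANALYSIS = {
    "filter": {
        "synonym_filter": {"type": "synonym", "synonyms_path": "analysis/synonym.txt"}
    },
    "analyzer": {
        "charSplit": {
            "type": "custom",
            "tokenizer": "ngram_tokenizer",
            "filter": ["lowercase", "synonym_filter"],
        }
    },
}

print(undefined_filters(ANALYSIS))
```

Pass in the value of the `"analysis"` key from the settings; an empty result means every referenced filter is either defined in the settings or in the built-in set.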
3. Test the synonyms
GET /2_syn/_analyze?analyzer=optik&pretty=true&text=博世
Result:
{
  "tokens": [
    {
      "token": "博世",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "bosch",
      "start_offset": 0,
      "end_offset": 2,
      "type": "SYNONYM",
      "position": 0
    }
  ]
}
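In this result the synonym shares `position` 0 and the offsets of the original token, which is how the two terms become interchangeable at search time. A minimal sketch for checking that programmatically follows; the helper name `tokens_by_position` is hypothetical and `RESPONSE` is the result shown above.

```python
import json
from collections import defaultdict
from typing import Dict, List


def tokens_by_position(response: dict) -> Dict[int, List[str]]:
    """Group the tokens of an _analyze response by their position."""
    grouped = defaultdict(list)
    for token in response["tokens"]:
        grouped[token["position"]].append(token["token"])
    return dict(grouped)


RESPONSE = json.loads("""
{ "tokens": [
  { "token": "博世", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0 },
  { "token": "bosch", "start_offset": 0, "end_offset": 2, "type": "SYNONYM", "position": 0 }
] }
""")

grouped = tokens_by_position(RESPONSE)
# The synonym is active if both terms appear at the same position.
print("bosch" in grouped[0] and "博世" in grouped[0])
```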
4. Data migration
Use the reindex API to migrate the data:
POST _reindex
{
  "source": { "index": "2" },
  "dest": { "index": "2_syn" }
}
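The same reindex call can be issued from a script. The sketch below only builds the HTTP request with the standard library; the base URL `http://localhost:9200` and the helper name `reindex_request` are assumptions, and the call that actually sends the request is left commented out.

```python
import json
import urllib.request


def reindex_request(base_url: str, source: str, dest: str) -> urllib.request.Request:
    """Build a POST /_reindex request copying documents from source to dest."""
    body = {"source": {"index": source}, "dest": {"index": dest}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_reindex",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = reindex_request("http://localhost:9200", "2", "2_syn")
print(request.get_method(), request.full_url)

# To run the migration against a live cluster (assumed to be reachable):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```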
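Known issues:
1) Any change to the synonym dictionary synonym.txt requires restarting Elasticsearch
2) A term that IK cannot segment into the correct token will not match its synonyms, so the synonym file must be used together with the custom dictionary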
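One way to reduce the second issue is to list synonym terms that are absent from the custom dictionary. This is a coarse sketch with a hypothetical helper name, `terms_missing_from_dictionary`: it also flags terms that IK may already segment correctly on its own (for example words in its main dictionary or plain English words), so the output is a list of candidates to review, not of confirmed problems.

```python
from typing import Iterable, List


def terms_missing_from_dictionary(
    synonym_lines: Iterable[str], dictionary_words: Iterable[str]
) -> List[str]:
    """Return synonym terms, in first-seen order, that are not in the dictionary."""
    known = {w.strip() for w in dictionary_words if w.strip()}
    missing = []
    for line in synonym_lines:
        for term in line.split(","):
            term = term.strip()
            if term and term not in known and term not in missing:
                missing.append(term)
    return missing


print(terms_missing_from_dictionary(["博世,bosch"], ["博世", "bosch"]))
```

Both arguments can be the lines of the respective files, for example `open(path, encoding="utf-8")` for each of them.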