Integrating the IK Chinese Analyzer with Elasticsearch 6.x

License: CC 4.0 BY-SA

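This article describes how to integrate the IK analyzer into Elasticsearch, covering version selection, online installation, restarting the service, and testing tokenization with curl. It demonstrates the difference between the ik_max_word and ik_smart analyzers and shows in detail how to index and search Chinese text with IK.

Elasticsearch is an open-source, real-time distributed search and analytics engine built on Apache Lucene(TM). It is used for full-text search, structured search, analytics, and any combination of the three. The IK Analysis plugin integrates the Lucene IK analyzer into Elasticsearch and supports custom dictionaries.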

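1. Choosing the IK version

The IK version to install is determined by the Elasticsearch version, as shown in the table below.

IK version	ES version
master	6.x -> master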
6.3.0	6.3.0
6.2.4	6.2.4
6.1.3	6.1.3
5.6.8	5.6.8
5.5.3	5.5.3
5.4.3	5.4.3
5.3.3	5.3.3
5.2.2	5.2.2
5.1.2	5.1.2
1.10.6	2.4.6
1.9.5	2.3.5
1.8.1	2.2.1
1.7.0	2.1.1
1.5.0	2.0.0
1.2.6	1.0.0
1.2.5	0.90.x
1.1.3	0.20.x
1.0.0	0.16.2 -> 0.19.0

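The earlier article "ELK 6.3.1 Installation and Deployment" covered installing and deploying Elasticsearch 6.3.1, so the matching IK version chosen here is also 6.3.1.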
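Matching the IK version to the ES version can be scripted. The sketch below is only an illustration: `parse_es_version` and `ik_plugin_url` are hypothetical helper names, the `Version: X.Y.Z` text is an assumption about what `bin/elasticsearch --version` prints, and the URL pattern is taken from the install command in section 2.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Elasticsearch or the IK plugin).

# Read text from stdin and print the first "X.Y.Z" that follows "Version: "
# (assumed format of `bin/elasticsearch --version`).
parse_es_version() {
  sed -n 's/.*Version: \([0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Build the release archive URL for a given version, following the pattern of
# the install command in section 2. Whether a release actually exists for
# that version still has to be checked on the IK releases page.
ik_plugin_url() {
  printf 'https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v%s/elasticsearch-analysis-ik-%s.zip\n' "$1" "$1"
}

# Offline demonstration with the version used in this article.
ik_plugin_url 6.3.1
```

On a real installation the two helpers could be chained as `ik_plugin_url "$(bin/elasticsearch --version | parse_es_version)"`.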

2. Online installation

bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.1/elasticsearch-analysis-ik-6.3.1.zip
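Before restarting, it can be worth confirming that the plugin was registered. The sketch below assumes that `bin/elasticsearch-plugin list` prints one plugin name per line and that IK appears under the name `analysis-ik`; `has_plugin` is a hypothetical helper, so check the name against your own output.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the plugin list on stdin contains the given
# name on a line of its own (default: analysis-ik, an assumed plugin name).
has_plugin() {
  grep -qxF "${1:-analysis-ik}"
}

# Offline demonstration on a hand-written plugin list.
if printf 'analysis-icu\nanalysis-ik\n' | has_plugin; then
  echo "IK plugin found"
fi
```

Against a real node this would be `bin/elasticsearch-plugin list | has_plugin`. Once ES has been restarted, `curl 'http://lee:9200/_cat/plugins?v'` lists the plugins loaded on each node.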

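3. Restart ES

ps -ef | grep elasticsearch   # find the ES process ID

kill -9 <pid>   # kill the ES process (substitute the process ID found above)

bin/elasticsearch -d && tail -f logs/elasticsearch.log   # restart ES in the background and follow its log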
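When the restart is scripted, tailing the log is less convenient than polling the HTTP port until the node answers. The following is a minimal sketch; `wait_for_es` is a hypothetical helper, and the host `lee` and port 9200 are simply the ones used in this article.

```shell
#!/bin/sh
# Hypothetical helper: poll an Elasticsearch HTTP endpoint until it answers.
# $1 = base URL, $2 = maximum number of attempts (default 30), one per second.
# Returns 0 as soon as a request succeeds, 1 if every attempt fails.
wait_for_es() {
  wf_url=$1
  wf_left=${2:-30}
  while [ "$wf_left" -gt 0 ]; do
    # -s: silent, -f: treat HTTP errors as failure, -o /dev/null: drop the body
    if curl -sf -o /dev/null "$wf_url"; then
      return 0
    fi
    wf_left=$((wf_left - 1))
    sleep 1
  done
  return 1
}
```

A possible use is `bin/elasticsearch -d -p pid && wait_for_es http://lee:9200 60`. With the `-p pid` option the process ID is written to the file `pid`, so the node can later be stopped with `kill "$(cat pid)"`, which sends SIGTERM rather than SIGKILL.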

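4. Testing IK

IK supports two analyzers for Chinese tokenization, ik_smart and ik_max_word. They differ as follows:

ik_max_word: splits the text at the finest granularity. For example, "内地港澳同胞:港珠澳大桥让港澳与国家融合更紧密" is split into "内地, 港澳同胞, 港澳, 同胞, 港, 珠, 澳, 大桥, 让, 港澳, 与国, 国家, 融合, 更紧, 紧密", exhausting every possible combination;

ik_smart: splits the text at the coarsest granularity. For example, "内地港澳同胞:港珠澳大桥让港澳与国家融合更紧密" is split into "内地, 港澳同胞, 港, 珠, 澳, 大桥, 让, 港澳, 与, 国家, 融合, 更, 紧密".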
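The difference between the two analyzers is easier to see when the token lists are compared mechanically. The sketch below uses two hypothetical helpers: `extract_tokens` assumes a pretty-printed `_analyze` response with each `"token"` field on its own line, and `only_in_first` prints the lines of one file that are missing from another.

```shell
#!/bin/sh
# Hypothetical helpers for comparing two _analyze responses.

# Print one token per line from a pretty-printed _analyze response on stdin
# (assumes each "token" field sits on its own line, as with ?pretty).
extract_tokens() {
  sed -n 's/.*"token" *: *"\([^"]*\)".*/\1/p'
}

# Print the lines of file $1 that do not occur in file $2, keeping the order.
only_in_first() {
  awk 'FILENAME == ARGV[1] { seen[$0] = 1; next } !($0 in seen)' "$2" "$1"
}

# Offline demonstration with the two token lists quoted above.
tmp_max=$(mktemp) && tmp_smart=$(mktemp) || exit 1
printf '%s\n' 内地 港澳同胞 港澳 同胞 港 珠 澳 大桥 让 港澳 与国 国家 融合 更紧 紧密 > "$tmp_max"
printf '%s\n' 内地 港澳同胞 港 珠 澳 大桥 让 港澳 与 国家 融合 更 紧密 > "$tmp_smart"
only_in_first "$tmp_max" "$tmp_smart"   # tokens produced only by ik_max_word
rm -f "$tmp_max" "$tmp_smart"
```

For the sample sentence this prints 同胞, 与国 and 更紧, the extra overlapping words that ik_max_word emits. With a running node, the output of each curl request can be piped through `extract_tokens` to produce the two input files.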

4.1 Tokenizing with ik_max_word

Request with the text as JSON:

curl -XGET 'http://lee:9200/_analyze?pretty' -H 'Content-Type:application/json' -d '
{
  "analyzer": "ik_max_word",
  "text": "内地港澳同胞:港珠澳大桥让港澳与国家融合更紧密"
}'

Tokenization result:

{
  "tokens" : [
    { "token" : "内地", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 },
    { "token" : "港澳同胞", "start_offset" : 2, "end_offset" : 6, "type" : "CN_WORD", "position" : 1 },
    { "token" : "港澳", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 2 },
    { "token" : "同胞", "start_offset" : 4, "end_offset" : 6, "type" : "CN_WORD", "position" : 3 },
    { "token" : "港", "start_offset" : 7, "end_offset" : 8, "type" : "CN_CHAR", "position" : 4 },
    { "token" : "珠", "start_offset" : 8, "end_offset" : 9, "type" : "CN_CHAR", "position" : 5 },
    { "token" : "澳", "start_offset" : 9, "end_offset" : 10, "type" : "CN_CHAR", "position" : 6 },
    { "token" : "大桥", "start_offset" : 10, "end_offset" : 12, "type" : "CN_WORD", "position" : 7 },
    { "token" : "让", "start_offset" : 12, "end_offset" : 13, "type" : "CN_CHAR", "position" : 8 },
    { "token" : "港澳", "start_offset" : 13, "end_offset" : 15, "type" : "CN_WORD", "position" : 9 },
    { "token" : "与国", "start_offset" : 15, "end_offset" : 17, "type" : "CN_WORD", "position" : 10 },
    { "token" : "国家", "start_offset" : 16, "end_offset" : 18, "type" : "CN_WORD", "position" : 11 },
    { "token" : "融合", "start_offset" : 18, "end_offset" : 20, "type" : "CN_WORD", "position" : 12 },
    { "token" : "更紧", "start_offset" : 20, "end_offset" : 22, "type" : "CN_WORD", "position" : 13 },
    { "token" : "紧密", "start_offset" : 21, "end_offset" : 23, "type" : "CN_WORD", "position" : 14 }
  ]
}

4.2 Tokenizing with ik_smart

Request with the text as JSON:

curl -XGET 'http://lee:9200/_analyze?pretty' -H 'Content-Type:application/json' -d '
{
  "analyzer": "ik_smart",
  "text": "内地港澳同胞:港珠澳大桥让港澳与国家融合更紧密"
}'

Tokenization result:

{
  "tokens" : [
    { "token" : "内地", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 },
    { "token" : "港澳同胞", "start_offset" : 2, "end_offset" : 6, "type" : "CN_WORD", "position" : 1 },
    { "token" : "港", "start_offset" : 7, "end_offset" : 8, "type" : "CN_CHAR", "position" : 2 },
    { "token" : "珠", "start_offset" : 8, "end_offset" : 9, "type" : "CN_CHAR", "position" : 3 },
    { "token" : "澳", "start_offset" : 9, "end_offset" : 10, "type" : "CN_CHAR", "position" : 4 },
    { "token" : "大桥", "start_offset" : 10, "end_offset" : 12, "type" : "CN_WORD", "position" : 5 },
    { "token" : "让", "start_offset" : 12, "end_offset" : 13, "type" : "CN_CHAR", "position" : 6 },
    { "token" : "港澳", "start_offset" : 13, "end_offset" : 15, "type" : "CN_WORD", "position" : 7 },
    { "token" : "与", "start_offset" : 15, "end_offset" : 16, "type" : "CN_CHAR", "position" : 8 },
    { "token" : "国家", "start_offset" : 16, "end_offset" : 18, "type" : "CN_WORD", "position" : 9 },
    { "token" : "融合", "start_offset" : 18, "end_offset" : 20, "type" : "CN_WORD", "position" : 10 },
    { "token" : "更", "start_offset" : 20, "end_offset" : 21, "type" : "CN_CHAR", "position" : 11 },
    { "token" : "紧密", "start_offset" : 21, "end_offset" : 23, "type" : "CN_WORD", "position" : 12 }
  ]
}

4.3 Searching with IK tokenization

4.3.1 Create the index

curl -XPUT http://lee:9200/index

4.3.2 Define the mapping

curl -XPOST http://lee:9200/index/fulltext/_mapping -H 'Content-Type:application/json' -d '
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_max_word"
    }
  }
}'
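The mapping sets the index-time analyzer (`analyzer`) and the query-time analyzer (`search_analyzer`) separately, so the two do not have to be the same. The sketch below generates the mapping body for one text field; `ik_mapping_body` is a hypothetical helper, and the second call shows a variant that searches with ik_smart, given purely as an illustration rather than a recommendation from the source.

```shell
#!/bin/sh
# Hypothetical helper: print a mapping body for one text field.
# $1 = field name, $2 = index-time analyzer, $3 = search-time analyzer.
ik_mapping_body() {
  printf '{"properties":{"%s":{"type":"text","analyzer":"%s","search_analyzer":"%s"}}}\n' "$1" "$2" "$3"
}

# Offline demonstration: the mapping from this section, then a variant that
# uses ik_smart at search time.
ik_mapping_body content ik_max_word ik_max_word
ik_mapping_body content ik_max_word ik_smart
```

The generated body can be passed to the request above with `-d "$(ik_mapping_body content ik_max_word ik_smart)"`, and the stored mapping can be inspected with `curl 'http://lee:9200/index/_mapping?pretty'`.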

4.3.3 Index documents

curl -XPOST http://lee:9200/index/fulltext/1 -H 'Content-Type:application/json' -d '
{"content":"美国留给伊拉克的是个烂摊子吗"}
'

curl -XPOST http://lee:9200/index/fulltext/2 -H 'Content-Type:application/json' -d '
{"content":"公安部：各地校车将享最高路权"}
'

curl -XPOST http://lee:9200/index/fulltext/3 -H 'Content-Type:application/json' -d '
{"content":"中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"}
'

curl -XPOST http://lee:9200/index/fulltext/4 -H 'Content-Type:application/json' -d '
{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}
'
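The four requests above can also be sent as a single call to the `_bulk` endpoint, which expects newline-delimited JSON: an action line followed by a document line for each document. The sketch below builds such a body from plain text lines. `bulk_body` is a hypothetical helper, and it assumes the text contains no double quotes, backslashes or control characters, because it performs no JSON escaping.

```shell
#!/bin/sh
# Hypothetical helper: turn lines of text on stdin into a _bulk request body.
# Document IDs are assigned 1, 2, 3, ... in input order. No JSON escaping is
# done, so the input must not contain double quotes, backslashes or control
# characters.
bulk_body() {
  awk '{ printf "{\"index\":{\"_id\":\"%d\"}}\n{\"content\":\"%s\"}\n", NR, $0 }'
}

# Offline demonstration with the first two documents of this section.
printf '%s\n' '美国留给伊拉克的是个烂摊子吗' '公安部：各地校车将享最高路权' | bulk_body
```

To send the body, pipe it into `curl -XPOST http://lee:9200/index/fulltext/_bulk -H 'Content-Type:application/x-ndjson' --data-binary @-`. The `--data-binary` option matters here, because `-d` strips the newlines that the bulk format depends on.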

4.3.4 Query

curl -XPOST http://lee:9200/index/fulltext/_search -H 'Content-Type:application/json' -d '
{
  "query" : { "match" : { "content" : "中国" }},
  "highlight" : {
    "pre_tags" : ["<tag1>", "<tag2>"],
    "post_tags" : ["</tag1>", "</tag2>"],
    "fields" : {
      "content" : {}
    }
  }
}
'

4.3.5 Query result

{
  "took": 136,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.6489038,
    "hits": [{
      "_index": "index",
      "_type": "fulltext",
      "_id": "4",
      "_score": 0.6489038,
      "_source": {
        "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
      },
      "highlight": {
        "content": ["<tag1>中国</tag1>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"]
      }
    }, {
      "_index": "index",
      "_type": "fulltext",
      "_id": "3",
      "_score": 0.2876821,
      "_source": {
        "content": "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"
      },
      "highlight": {
        "content": ["中韩渔警冲突调查：韩警平均每天扣1艘<tag1>中国</tag1>渔船"]
      }
    }]
  }
}

References

https://blog.youkuaiyun.com/baymax_007/article/details/81670082

https://github.com/medcl/elasticsearch-analysis-ik/