Integrating the IK Chinese Analyzer with Elasticsearch 6.x

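This article shows how to integrate the IK analyzer into Elasticsearch: choosing the version, installing the plugin online, restarting the service, and testing tokenization with curl. It demonstrates the difference between the ik_max_word and ik_smart analyzers, and walks through indexing and searching Chinese text with IK.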

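Elasticsearch is an open-source, real-time, distributed search and analytics engine built on Apache Lucene(TM). It is used for full-text search, structured search, analytics, and any combination of the three. The IK Analysis plugin integrates the Lucene IK analyzer into Elasticsearch and supports custom dictionaries.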

1. Choose the IK version

The IK version to install is determined by the Elasticsearch version, as shown in the table below.

IK version    ES version
master        6.x -> master
6.3.0         6.3.0
6.2.4         6.2.4
6.1.3         6.1.3
5.6.8         5.6.8
5.5.3         5.5.3
5.4.3         5.4.3
5.3.3         5.3.3
5.2.2         5.2.2
5.1.2         5.1.2
1.10.6        2.4.6
1.9.5         2.3.5
1.8.1         2.2.1
1.7.0         2.1.1
1.5.0         2.0.0
1.2.6         1.0.0
1.2.5         0.90.x
1.1.3         0.20.x
1.0.0         0.16.2 -> 0.19.0

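The earlier article "ELK 6.3.1 Installation and Deployment" covered installing and deploying Elasticsearch 6.3.1, so the matching IK version chosen here is also 6.3.1.

2. Online installation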

bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.1/elasticsearch-analysis-ik-6.3.1.zip
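
The release URL above repeats the version number twice, so it is easy to mistype when adapting the command to another Elasticsearch version. The following is a minimal sketch that builds the URL from a single version string. The helper name ik_plugin_url and the variable ES_VERSION are illustrative, and the URL pattern is inferred from the one 6.3.1 URL used in this article, so check that a release actually exists for your version before relying on it.

```shell
# Hypothetical helper: print the IK release URL for a given Elasticsearch version.
# The pattern is inferred from the 6.3.1 URL in this article (an assumption for other versions).
ik_plugin_url() {
  version="$1"
  printf 'https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v%s/elasticsearch-analysis-ik-%s.zip\n' \
    "$version" "$version"
}

ES_VERSION="6.3.1"   # the version used in this article
ik_plugin_url "$ES_VERSION"

# To install, run from the Elasticsearch home directory:
# bin/elasticsearch-plugin install "$(ik_plugin_url "$ES_VERSION")"
```

Keeping the version in one variable makes it harder to end up with a plugin build that does not match the running Elasticsearch version.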

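3. Restart Elasticsearch

ps -ef | grep elasticsearch   # find the Elasticsearch process ID

kill <pid>   # stop Elasticsearch (sends SIGTERM so it can shut down cleanly; use kill -9 only if the process does not exit)

bin/elasticsearch -d && tail -f logs/elasticsearch.log   # restart Elasticsearch as a daemon and follow the log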

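4. Testing IK

IK supports two analyzers for Chinese text, ik_smart and ik_max_word. They differ as follows:

ik_max_word: splits the text at the finest granularity. For example, it splits “内地港澳同胞:港珠澳大桥让港澳与国家融合更紧密” into “内地, 港澳同胞, 港澳, 同胞, 港, 珠, 澳, 大桥, 让, 港澳, 与国, 国家, 融合, 更紧, 紧密”, exhausting the possible word combinations.

ik_smart: splits the text at the coarsest granularity. For example, it splits “内地港澳同胞:港珠澳大桥让港澳与国家融合更紧密” into “内地, 港澳同胞, 港, 珠, 澳, 大桥, 让, 港澳, 与, 国家, 融合, 更, 紧密”.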
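
To make the difference concrete, the sketch below compares the two token lists quoted above as sets. The tokens are copied from this article's example sentence. The temporary directory and the variable names only_max_word and only_smart are illustrative.

```shell
# Token lists copied from the example sentence above, one token per line after printf.
workdir=$(mktemp -d)

printf '%s\n' 内地 港澳同胞 港澳 同胞 港 珠 澳 大桥 让 港澳 与国 国家 融合 更紧 紧密 \
  | LC_ALL=C sort -u > "$workdir/max_word.txt"
printf '%s\n' 内地 港澳同胞 港 珠 澳 大桥 让 港澳 与 国家 融合 更 紧密 \
  | LC_ALL=C sort -u > "$workdir/smart.txt"

# comm requires both inputs to be sorted with the same collation, hence LC_ALL=C throughout.
only_max_word=$(LC_ALL=C comm -23 "$workdir/max_word.txt" "$workdir/smart.txt")
only_smart=$(LC_ALL=C comm -13 "$workdir/max_word.txt" "$workdir/smart.txt")

echo "Only in ik_max_word:"; echo "$only_max_word"
echo "Only in ik_smart:";    echo "$only_smart"

rm -rf "$workdir"
```

For these lists, ik_max_word additionally emits the overlapping words 同胞, 与国 and 更紧, while ik_smart keeps the single characters 与 and 更 instead.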

4.1 Tokenizing with ik_max_word

Input text as JSON:

curl -XGET 'http://lee:9200/_analyze?pretty' -H 'Content-Type:application/json' -d '
{
  "analyzer": "ik_max_word",
  "text": "内地港澳同胞:港珠澳大桥让港澳与国家融合更紧密"
}'

Resulting tokens:

{
  "tokens" : [
    { "token" : "内地", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 },
    { "token" : "港澳同胞", "start_offset" : 2, "end_offset" : 6, "type" : "CN_WORD", "position" : 1 },
    { "token" : "港澳", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 2 },
    { "token" : "同胞", "start_offset" : 4, "end_offset" : 6, "type" : "CN_WORD", "position" : 3 },
    { "token" : "港", "start_offset" : 7, "end_offset" : 8, "type" : "CN_CHAR", "position" : 4 },
    { "token" : "珠", "start_offset" : 8, "end_offset" : 9, "type" : "CN_CHAR", "position" : 5 },
    { "token" : "澳", "start_offset" : 9, "end_offset" : 10, "type" : "CN_CHAR", "position" : 6 },
    { "token" : "大桥", "start_offset" : 10, "end_offset" : 12, "type" : "CN_WORD", "position" : 7 },
    { "token" : "让", "start_offset" : 12, "end_offset" : 13, "type" : "CN_CHAR", "position" : 8 },
    { "token" : "港澳", "start_offset" : 13, "end_offset" : 15, "type" : "CN_WORD", "position" : 9 },
    { "token" : "与国", "start_offset" : 15, "end_offset" : 17, "type" : "CN_WORD", "position" : 10 },
    { "token" : "国家", "start_offset" : 16, "end_offset" : 18, "type" : "CN_WORD", "position" : 11 },
    { "token" : "融合", "start_offset" : 18, "end_offset" : 20, "type" : "CN_WORD", "position" : 12 },
    { "token" : "更紧", "start_offset" : 20, "end_offset" : 22, "type" : "CN_WORD", "position" : 13 },
    { "token" : "紧密", "start_offset" : 21, "end_offset" : 23, "type" : "CN_WORD", "position" : 14 }
  ]
}

4.2 Tokenizing with ik_smart

Input text as JSON:

curl -XGET 'http://lee:9200/_analyze?pretty' -H 'Content-Type:application/json' -d '
{
  "analyzer": "ik_smart",
  "text": "内地港澳同胞:港珠澳大桥让港澳与国家融合更紧密"
}'

Resulting tokens:

{
  "tokens" : [
    { "token" : "内地", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 },
    { "token" : "港澳同胞", "start_offset" : 2, "end_offset" : 6, "type" : "CN_WORD", "position" : 1 },
    { "token" : "港", "start_offset" : 7, "end_offset" : 8, "type" : "CN_CHAR", "position" : 2 },
    { "token" : "珠", "start_offset" : 8, "end_offset" : 9, "type" : "CN_CHAR", "position" : 3 },
    { "token" : "澳", "start_offset" : 9, "end_offset" : 10, "type" : "CN_CHAR", "position" : 4 },
    { "token" : "大桥", "start_offset" : 10, "end_offset" : 12, "type" : "CN_WORD", "position" : 5 },
    { "token" : "让", "start_offset" : 12, "end_offset" : 13, "type" : "CN_CHAR", "position" : 6 },
    { "token" : "港澳", "start_offset" : 13, "end_offset" : 15, "type" : "CN_WORD", "position" : 7 },
    { "token" : "与", "start_offset" : 15, "end_offset" : 16, "type" : "CN_CHAR", "position" : 8 },
    { "token" : "国家", "start_offset" : 16, "end_offset" : 18, "type" : "CN_WORD", "position" : 9 },
    { "token" : "融合", "start_offset" : 18, "end_offset" : 20, "type" : "CN_WORD", "position" : 10 },
    { "token" : "更", "start_offset" : 20, "end_offset" : 21, "type" : "CN_CHAR", "position" : 11 },
    { "token" : "紧密", "start_offset" : 21, "end_offset" : 23, "type" : "CN_WORD", "position" : 12 }
  ]
}
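
The full _analyze response carries offsets and positions that get in the way when you only want to compare token lists. The following is a rough sketch that keeps just the "token" values, one per line. The helper name extract_tokens is illustrative, and the grep/sed approach assumes tokens contain no escaped double quotes; a real JSON parser would be more robust.

```shell
# Hypothetical helper: read an _analyze JSON response on stdin and print one token per line.
# Assumes token values contain no escaped double quotes.
extract_tokens() {
  grep -o '"token" *: *"[^"]*"' | sed 's/^"token" *: *"//; s/"$//'
}

# Offline demonstration using the first two entries of the response shown above.
sample='{ "tokens" : [ { "token" : "内地", "start_offset" : 0 }, { "token" : "港澳同胞", "start_offset" : 2 } ] }'
printf '%s\n' "$sample" | extract_tokens

# Against a live node (host "lee" as used in this article):
# curl -s -XGET 'http://lee:9200/_analyze' -H 'Content-Type:application/json' \
#   -d '{"analyzer": "ik_smart", "text": "内地港澳同胞:港珠澳大桥让港澳与国家融合更紧密"}' | extract_tokens
```

Running the helper once per analyzer and diffing the two outputs gives a quick view of where ik_max_word and ik_smart disagree.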

4.3 Searching with the analyzer

4.3.1 Create the index

curl -XPUT http://lee:9200/index

4.3.2 Define the mapping

curl -XPOST http://lee:9200/index/fulltext/_mapping -H 'Content-Type:application/json' -d '
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_max_word"
    }
  }
}'

4.3.3 Index documents

curl -XPOST http://lee:9200/index/fulltext/1 -H 'Content-Type:application/json' -d '
{"content":"美国留给伊拉克的是个烂摊子吗"}
'

curl -XPOST http://lee:9200/index/fulltext/2 -H 'Content-Type:application/json' -d '
{"content":"公安部:各地校车将享最高路权"}
'

curl -XPOST http://lee:9200/index/fulltext/3 -H 'Content-Type:application/json' -d '
{"content":"中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"}
'

curl -XPOST http://lee:9200/index/fulltext/4 -H 'Content-Type:application/json' -d '
{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}
'

4.3.4 Query

curl -XPOST http://lee:9200/index/fulltext/_search -H 'Content-Type:application/json' -d '
{
  "query" : { "match" : { "content" : "中国" }},
  "highlight" : {
    "pre_tags" : ["<tag1>", "<tag2>"],
    "post_tags" : ["</tag1>", "</tag2>"],
    "fields" : {
      "content" : {}
    }
  }
}
'

4.3.5 Query result

{
  "took": 136,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.6489038,
    "hits": [{
      "_index": "index",
      "_type": "fulltext",
      "_id": "4",
      "_score": 0.6489038,
      "_source": {
        "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
      },
      "highlight": {
        "content": ["<tag1>中国</tag1>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"]
      }
    }, {
      "_index": "index",
      "_type": "fulltext",
      "_id": "3",
      "_score": 0.2876821,
      "_source": {
        "content": "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"
      },
      "highlight": {
        "content": ["中韩渔警冲突调查:韩警平均每天扣1艘<tag1>中国</tag1>渔船"]
      }
    }]
  }
}

References

https://blog.youkuaiyun.com/baymax_007/article/details/81670082

https://github.com/medcl/elasticsearch-analysis-ik/