This article first shows how to install and use the IK Chinese analyzer in Elasticsearch (ES), then compares IK's output with that of the default standard analyzer, and finally gives a Java demo for accessing ES.
1. Install the Chinese analyzer
- Download the IK release that matches your ES version. Downloads and the IK/ES version compatibility table are at: https://github.com/medcl/elasticsearch-analysis-ik
- Unzip the downloaded zip file, enter the root of the extracted directory, and build it with Maven: mvn package
- After the build succeeds, go to the target/releases directory, where you will find an elasticsearch-analysis-ik-xx.xx.xx.zip file; unzip it.
- Create an ik directory under the plugins directory of your ES installation and copy all the files extracted in the previous step into it.
- Restart ES and you are done.
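The steps above can be condensed into a shell sketch. This is a setup outline, not a tested script; ES_HOME and IK_VERSION are placeholders you must set for your environment, and the zip file name must match what your build actually produced:

```shell
# Assumptions: IK_VERSION matches your ES version; ES_HOME points at your ES install.
IK_VERSION=x.y.z
git clone https://github.com/medcl/elasticsearch-analysis-ik.git
cd elasticsearch-analysis-ik
mvn package                                    # build the plugin
mkdir -p "$ES_HOME/plugins/ik"
unzip "target/releases/elasticsearch-analysis-ik-$IK_VERSION.zip" \
      -d "$ES_HOME/plugins/ik"                 # install the built files into plugins/ik
# finally, restart ES so the plugin is loaded
```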
2. Test the Chinese analyzer
We run the same sentence through the IK analyzer and the standard analyzer and compare the results.
First, tokenize with IK. (Note: recent IK releases register the analyzers as ik_max_word and ik_smart; the name ik shown here only works with older versions.)
GET /_analyze
{
"analyzer": "ik",
"text": "只有社会主义才能救中国"
}
The output is shown below. As you can see, IK tokenizes the sentence according to Chinese word semantics:
{
"tokens": [
{
"token": "只有",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "社会主义",
"start_offset": 2,
"end_offset": 6,
"type": "CN_WORD",
"position": 1
},
{
"token": "社会",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 2
},
{
"token": "主义",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 3
},
{
"token": "才能救",
"start_offset": 6,
"end_offset": 9,
"type": "CN_WORD",
"position": 4
},
{
"token": "才能",
"start_offset": 6,
"end_offset": 8,
"type": "CN_WORD",
"position": 5
},
{
"token": "救",
"start_offset": 8,
"end_offset": 9,
"type": "CN_CHAR",
"position": 6
},
{
"token": "中国",
"start_offset": 9,
"end_offset": 11,
"type": "CN_WORD",
"position": 7
}
]
}
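The multi-granularity tokens above (社会主义 alongside 社会 and 主义) come from dictionary-based segmentation: at each position IK emits every dictionary word it finds, falling back to single characters. The sketch below is a toy illustration of that idea, not IK's real algorithm, and its mini-dictionary is a hypothetical stand-in for IK's dictionary; it also shows the standard analyzer's per-character splitting for contrast:

```python
# Toy ik_max_word-style segmenter: emit every dictionary word found in the
# text, longest-first at each start position. DICT is a hypothetical
# mini-dictionary, not IK's real one.
DICT = {"只有", "社会主义", "社会", "主义", "才能救", "才能", "救", "中国"}
MAX_LEN = max(len(w) for w in DICT)

def ik_max_word_like(text):
    tokens = []
    for start in range(len(text)):
        # try the longest candidate first, like IK's fine-grained mode
        for length in range(min(MAX_LEN, len(text) - start), 0, -1):
            word = text[start:start + length]
            if word in DICT:
                tokens.append((word, start, start + length))
    return tokens

def standard_like(text):
    # the standard analyzer splits CJK text into single characters
    return [(ch, i, i + 1) for i, ch in enumerate(text)]

sentence = "只有社会主义才能救中国"
print([t[0] for t in ik_max_word_like(sentence)])
# ['只有', '社会主义', '社会', '主义', '才能救', '才能', '救', '中国']
print([t[0] for t in standard_like(sentence)])
# ['只', '有', '社', '会', '主', '义', '才', '能', '救', '中', '国']
```

With this tiny dictionary the toy segmenter happens to reproduce the IK token list above exactly, which makes the contrast with the per-character output easy to see.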
Next we use the standard analyzer. As the results below show, it simply splits the Chinese text into individual characters:
GET /_analyze
{
"analyzer": "standard",
"text": "只有社会主义才能救中国"
}
The output is:
{
"tokens": [
{
"token": "只",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "有",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "社",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "会",
"start_offset": 3,
"end_offset": 4,
"type": "<IDEOGRAPHIC>",
"position": 3
},
{
"token": "主",
"start_offset": 4,
"end_offset": 5,
"type": "<IDEOGRAPHIC>",
"position": 4
},
{
"token": "义",
"start_offset": 5,
"end_offset": 6,
"type": "<IDEOGRAPHIC>",
"position": 5
},
{
"token": "才",
"start_offset": 6,
"end_offset": 7,
"type": "<IDEOGRAPHIC>",
"position": 6
},
{
"token": "能",
"start_offset": 7,
"end_offset": 8,
"type": "<IDEOGRAPHIC>",
"position": 7
},
{
"token": "救",
"start_offset": 8,
"end_offset": 9,
"type": "<IDEOGRAPHIC>",
"position": 8
},
{
"token": "中",
"start_offset": 9,
"end_offset": 10,
"type": "<IDEOGRAPHIC>",
"position": 9