04 - IK Chinese Analyzer

软件开发初学者

Category: elasticsearch. Tags: elasticsearch

Article link: https://blog.youkuaiyun.com/u011743790/article/details/106001969

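1 Environment setup

Download the IK analyzer from: https://github.com/medcl/elasticsearch-analysis-ik/releases
Note: The IK analyzer version must exactly match the Elasticsearch server version; otherwise Elasticsearch will fail to start.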
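
The version rule in the note above can be checked before installing. The following is a minimal sketch, assuming the release archive is named like elasticsearch-analysis-ik-7.6.1.zip; the version 7.6.1 and the helper names are illustrative, not taken from the original setup.

```python
import re

# Assumption: release archives follow the pattern
# "elasticsearch-analysis-ik-<major>.<minor>.<patch>.zip".
_IK_ARCHIVE = re.compile(r"^elasticsearch-analysis-ik-(\d+\.\d+\.\d+)\.zip$")


def ik_version_from_archive(filename: str) -> str:
    """Extract the version string from an IK release archive name (hypothetical helper)."""
    match = _IK_ARCHIVE.match(filename)
    if match is None:
        raise ValueError(f"unexpected archive name: {filename}")
    return match.group(1)


def versions_match(archive_name: str, es_version: str) -> bool:
    """Return True only when the plugin version equals the Elasticsearch version exactly."""
    return ik_version_from_archive(archive_name) == es_version.strip()


if __name__ == "__main__":
    # Illustrative values only; substitute your own archive name and server version.
    print(versions_match("elasticsearch-analysis-ik-7.6.1.zip", "7.6.1"))
```

The Elasticsearch version to compare against is reported in the "version.number" field of the JSON returned by the server's root endpoint.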

2 Copy the plugin into the Elasticsearch plugins directory

Note: Stop the Elasticsearch service first. Then create an ik directory under plugins and copy the contents of the plugin archive into it.
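
The copy step can also be scripted. This is a rough sketch, assuming the plugin was downloaded as a zip archive and that es_home points at the Elasticsearch installation directory; the function name install_ik and both paths are placeholders.

```python
import zipfile
from pathlib import Path


def install_ik(archive: Path, es_home: Path) -> Path:
    """Extract the plugin archive into <es_home>/plugins/ik and return that directory.

    Call this only while Elasticsearch is stopped.
    """
    target = es_home / "plugins" / "ik"
    # Create plugins/ik if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    return target
```

For example, install_ik(Path("elasticsearch-analysis-ik-7.6.1.zip"), Path("/opt/elasticsearch")) would unpack the archive into /opt/elasticsearch/plugins/ik; both arguments here are illustrative.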

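3 Restart the Elasticsearch service

Note: On startup, Elasticsearch reports that the IK Chinese analysis plugin has been loaded. To list the loaded plugins yourself, run the command "elasticsearch-plugin list" from the bin directory under the Elasticsearch installation directory.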

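4 Test the analysis plugin with Kibana

The IK analyzer provides two segmentation modes: ik_smart (minimal segmentation, the fewest and coarsest tokens) and ik_max_word (finest-grained segmentation, every word it can find).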

4.1 ik_smart (minimal segmentation)

// Request
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "中国共产党"
}

// Response
{
  "tokens" : [
    {
      "token" : "中国共产党",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 0
    }
  ]
}

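Note: The text is kept as a single token. start_offset is the index in the input text where the token starts, and end_offset is the index where it ends (exclusive, so a 5-character token starting at 0 ends at 5).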
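
To make the offsets concrete, here is a small sketch that slices the input text with the start_offset and end_offset values from the ik_smart response above. It assumes the text contains only characters from the Basic Multilingual Plane, where Python string indices line up with the offsets Elasticsearch reports; token_slices is an illustrative helper name.

```python
# Input text and the token copied from the ik_smart response.
text = "中国共产党"
tokens = [{"token": "中国共产党", "start_offset": 0, "end_offset": 5}]


def token_slices(text, tokens):
    """Return the substring of text covered by each token's offsets."""
    return [text[t["start_offset"]:t["end_offset"]] for t in tokens]


if __name__ == "__main__":
    print(token_slices(text, tokens))  # ['中国共产党']
```

Each slice equals the corresponding "token" value, which confirms that end_offset is exclusive.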

4.2 ik_max_word (finest-grained segmentation)

// Request
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "中国共产党"
}

// Response
{
  "tokens" : [
    {
      "token" : "中国共产党",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "中国",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "国共",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "共产党",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "共产",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "党",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 5
    }
  ]
}
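
One way to see the difference between the two responses is to check whether the token spans overlap. The sketch below uses only the offsets shown in the two responses; has_overlaps is an illustrative helper, and the observation applies to this example rather than being a general guarantee about either mode.

```python
# (start_offset, end_offset) pairs copied from the two responses.
SMART_SPANS = [(0, 5)]
MAX_WORD_SPANS = [(0, 5), (0, 2), (1, 3), (2, 5), (2, 4), (4, 5)]


def has_overlaps(spans):
    """Return True if any two spans cover a common character of the input text."""
    furthest_end = 0
    for start, end in sorted(spans):
        # A span that starts before an earlier span has ended overlaps it.
        if start < furthest_end:
            return True
        furthest_end = max(furthest_end, end)
    return False


if __name__ == "__main__":
    print(has_overlaps(SMART_SPANS))     # False
    print(has_overlaps(MAX_WORD_SPANS))  # True
```

Here ik_smart yields a single non-overlapping token, while ik_max_word emits overlapping tokens such as 中国 (0 to 2) and 国共 (1 to 3), so a search for any of those shorter words can match the same text.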