Using the IK Chinese Analyzer with Elasticsearch

weixin_33913377

License: CC 4.0 BY-SA

Original link: https://my.oschina.net/tkyuan/blog/734055

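This article explains why Elasticsearch's built-in analyzer handles Chinese poorly and walks through installing, configuring, and using the IK analyzer, including its two analysis strategies, ik_max_word and ik_smart.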

I. Background

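Elasticsearch ships with the standard analyzer, but it handles Chinese poorly. Given “我是中国人” (“I am Chinese”), it emits one token per character: “我”, “是”, “中”, “国”, “人”. That is clearly not the desired result; tokens such as “中国” (“China”) and “中国人” (“Chinese person”) are far more useful for search. A Chinese analysis plugin is therefore needed, and elasticsearch-analysis-ik is an open-source plugin that meets this need.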
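The per-character behaviour described above can be observed through the _analyze API. The following is a minimal sketch, assuming an Elasticsearch 2.x node reachable at localhost:9200; the helper names `analyze_url` and `analyze_text` and the `ES_URL` variable are illustrative, not part of Elasticsearch.

```shell
#!/bin/sh
# Assumption: an Elasticsearch 2.x node, by default at localhost:9200.

# Build the _analyze URL for a given analyzer name.
analyze_url() {
    printf '%s/_analyze?analyzer=%s&pretty' "${ES_URL:-localhost:9200}" "$1"
}

# Send text to the _analyze API and print the raw JSON response.
# $1 = analyzer name, $2 = text to analyze
analyze_text() {
    curl -s -XGET "$(analyze_url "$1")" -d "$2"
}

# Example (requires a running node):
# analyze_text standard '我是中国人'
```

Running the commented example against a node without IK should show each Chinese character as its own token.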

II. Installing and deploying IK

Download the IK package that matches your Elasticsearch version from https://github.com/medcl/elasticsearch-analysis-ik.
Unzip elasticsearch-analysis-ik-1.8.1.zip (assuming you run ES 2.2.1) and copy everything it contains into your-es-root/plugins/ik, creating the ik directory by hand if it does not exist.
If you run ES 2.2.0 but downloaded IK 1.8.1, edit the setting in plugin-descriptor.properties so that it reads:

elasticsearch.version=2.2.0
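The version override can also be scripted. This is a minimal sketch, assuming the descriptor uses plain `key=value` lines with an ASCII equals sign; the function name `set_ik_es_version` and the `ES_HOME` variable in the usage comment are illustrative.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the elasticsearch.version line in an IK plugin descriptor.
# $1 = path to plugin-descriptor.properties, $2 = Elasticsearch version (e.g. 2.2.0)
set_ik_es_version() {
    descriptor="$1"
    version="$2"
    tmp="${descriptor}.tmp"
    # Replace the whole line; every other property is copied unchanged.
    sed "s/^elasticsearch\.version=.*/elasticsearch.version=${version}/" "$descriptor" > "$tmp" &&
        mv "$tmp" "$descriptor"
}

# Example (the path is an assumption; adjust it to your installation):
# set_ik_es_version "$ES_HOME/plugins/ik/plugin-descriptor.properties" 2.2.0
```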

Restart ES to complete the installation; if it starts without errors, the plugin is installed.
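If the node does report an error on startup, one thing worth checking is which Elasticsearch version the installed descriptor declares. A small sketch, assuming the plugin was copied to plugins/ik under the ES root as described above; the function name `ik_declared_es_version` is illustrative.

```shell
#!/bin/sh
# Hypothetical helper: print the Elasticsearch version the installed IK plugin declares.
# $1 = Elasticsearch root directory (the "your-es-root" of the text)
ik_declared_es_version() {
    descriptor="$1/plugins/ik/plugin-descriptor.properties"
    if [ ! -f "$descriptor" ]; then
        echo "missing: $descriptor" >&2
        return 1
    fi
    # Print only the value after the '=' sign.
    sed -n 's/^elasticsearch\.version=//p' "$descriptor"
}

# Example (the path is an assumption):
# ik_declared_es_version /opt/elasticsearch
```

The printed value should equal the version of the node you are running.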

III. Using the IK analyzers

1. Analysis strategies

IK provides two analysis strategies:

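ik_max_word: splits text at the finest granularity and exhausts the possible combinations. For example, “中华人民共和国国歌” is split into “中华人民共和国,中华人民,中华,华人,人民共和国,人民,人,民,共和国,共和,和,国国,国歌”. The exact token list depends on the IK version and its dictionary.
ik_smart: splits text at the coarsest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国,国歌”.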

2. Specifying a field's analyzer when creating the mapping

Create the mapping:

curl -XPUT 'localhost:9200/testik?pretty' -d '
{
    "mappings": {
        "user": {
            "properties": {
                "id": {
                    "type": "long",
                    "index": "no"
                },
                "name": {
                    "type": "string",
                    "store": "no",
                    "term_vector": "with_positions_offsets",
                    "analyzer": "ik_max_word",
                    "search_analyzer": "ik_max_word",
                    "include_in_all": "true",
                    "boost": 8
                },
                "sex": {
                    "type": "string",
                    "index": "not_analyzed"
                },
                "cityName": {
                    "type": "string",
                    "index": "not_analyzed"
                },
                "favorite": {
                    "type": "string"
                },
                "age": {
                    "type": "long",
                    "index": "no"
                },
                "location": {
                    "type": "geo_point"
                }
            }
        }
    }
}'

Here the name field uses the ik_max_word strategy for both indexing and searching.
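Once the mapping exists, a document can be indexed and then searched on the name field. The following is a minimal sketch, assuming an Elasticsearch 2.x node at localhost:9200 and the `user` type from the mapping; the function names and the `ES_URL` variable are illustrative, and the index name is passed in explicitly so that it matches whichever index holds your mapping.

```shell
#!/bin/sh
# Assumption: an Elasticsearch 2.x node, by default at localhost:9200.
ES_URL="${ES_URL:-localhost:9200}"

# Build a match query body for one field. Illustrative only: the arguments are
# not JSON-escaped, so they must not contain double quotes or backslashes.
match_query_body() {
    printf '{"query":{"match":{"%s":"%s"}}}' "$1" "$2"
}

# Index one document of type "user".
# $1 = index name, $2 = document id, $3 = value for the name field
index_user() {
    curl -s -XPUT "$ES_URL/$1/user/$2" -d "{\"name\":\"$3\"}"
}

# Search the name field. The query text is analyzed with the field's
# search_analyzer, which the mapping sets to ik_max_word.
# $1 = index name, $2 = query text
search_name() {
    curl -s -XPOST "$ES_URL/$1/_search?pretty" -d "$(match_query_body name "$2")"
}

# Example (requires a running node; replace your-index with your index name):
# index_user your-index 1 '中华人民共和国国歌'
# search_name your-index '国歌'
```

A newly indexed document becomes searchable only after the next index refresh, so a search issued immediately after indexing may return no hits.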

ik_max_word in practice:

> curl -XGET 'localhost:9200/testik/_analyze?analyzer=ik_max_word' -d '中华人民共和国国歌'
{
    "tokens": [
        {
            "token": "中华人民共和国",
            "start_offset": 0,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "中华人民",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "中华",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "华人",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "人民共和国",
            "start_offset": 2,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "人民",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "共和国",
            "start_offset": 4,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 6
        },
        {
            "token": "共和",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 7
        },
        {
            "token": "国",
            "start_offset": 6,
            "end_offset": 7,
            "type": "CN_CHAR",
            "position": 8
        },
        {
            "token": "国歌",
            "start_offset": 7,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 9
        }
    ]
}

ik_smart in practice:

> curl -XGET 'localhost:9200/testik/_analyze?analyzer=ik_smart' -d '中华人民共和国国歌'
{
    "tokens": [
        {
            "token": "中华人民共和国",
            "start_offset": 0,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "国歌",
            "start_offset": 7,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 1
        }
    ]
}
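The JSON responses above are verbose when only the tokens matter. As a rough sketch, the token values can be pulled out with grep and sed; this assumes a grep that supports `-o` (GNU and BSD grep do) and token text without embedded double quotes. The function name `list_tokens` is illustrative, and a JSON-aware tool would be more robust.

```shell
#!/bin/sh
# Hypothetical helper: read an _analyze JSON response on stdin
# and print one token per line.
list_tokens() {
    # Keep only the "token": "..." pairs, then strip the key and the quotes.
    grep -o '"token" *: *"[^"]*"' | sed 's/^"token" *: *"//; s/"$//'
}

# Example (requires a running node with the IK plugin installed):
# curl -s 'localhost:9200/testik/_analyze?analyzer=ik_smart' -d '中华人民共和国国歌' | list_tokens
```

Piping the two responses through this helper makes it easy to compare ik_max_word and ik_smart side by side.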