Analysis (Tokenization)
The process of converting text into a series of words; each resulting word is called a term or token.
Underlying principle: the inverted index (each term maps to the list of documents that contain it).
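A minimal sketch of what an inverted index looks like conceptually (the documents and doc IDs here are made up, not actual Lucene output):
// doc 1: "hello world", doc 2: "hello elasticsearch"
{
  "hello": [1, 2],          // term -> posting list of matching doc IDs
  "world": [1],
  "elasticsearch": [2]
}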
Components of an analyzer and their invocation order:
1. Character Filter: pre-processes the raw text (e.g. stripping HTML tags) before tokenization.
2. Tokenizer: splits the filtered text into individual terms according to a set of rules.
3. Token Filter: further processes the terms produced in step 2, e.g. lowercasing. (A combined example follows below.)
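To illustrate all three stages in one call, here is a minimal sketch using only built-in components (html_strip char filter, standard tokenizer, lowercase token filter); it should return the terms hello and world:
POST _analyze
{
  "char_filter": ["html_strip"], // 1. strip HTML tags from the raw text
  "tokenizer": "standard",       // 2. split into terms
  "filter": ["lowercase"],       // 3. lowercase each term
  "text": "<b>Hello</b> World"
}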
1. Testing with a specified analyzer
Request:
POST _analyze
{
"analyzer": "standard", //指定默认的分词器
"text": "hello world" //分词文本
}
Response:
The text is split into two terms.
{
"tokens": [ //分词结果
{
"token": "hello", //
"start_offset": 0, //起始偏移量
"end_offset": 5, //结束偏移量
"type": "<ALPHANUM>",
"position": 0 //位置
},
{
"token": "world",
"start_offset": 6,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
}
]
}
2. Testing against a specific field
Request
POST /test_index/_analyze // test_index is the index to test against; if it does not exist, an error is returned.
{
"field": "name", //测试的字段
"text": "hello world"
}
Response
{
"tokens": [ //分词结果
{
"token": "hello", //
"start_offset": 0, //起始偏移量
"end_offset": 5, //结束偏移量
"type": "<ALPHANUM>",
"position": 0 //位置
},
{
"token": "world",
"start_offset": 6,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
}
]
}
3. Testing with a custom combination of analyzer components
POST _analyze
{
"tokenizer": "standard",
"filter": ["lowercase"], //自定义的分词器,全部转化成小写
"text": "heLLo World"
}
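This should return the terms hello and world, both lowercased by the lowercase token filter.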
Built-in analyzers in ES
Standard
- Splits on word boundaries, supports multiple languages. Tokenizer: standard
- Lowercases terms. Token Filters: standard, lowercase, stop (disabled by default)
POST _analyze
{
"analyzer": "standard", //指定默认的分词器
"text": "Hello World" //分词文本
}
Returns: hello, world
Simple
POST _analyze
{
"analyzer": "simple",
"text": "heLLo World 2018"
}
The result contains only the letter tokens (hello, world), lowercased; digits, underscores, etc. are dropped.
Whitespace
POST _analyze
{
"analyzer": "whitespace",
"text": "heLLo World 2018"
}
Returns: heLLo, World, 2018 (splits on whitespace only; case is preserved).
Stop
POST _analyze
{
"analyzer": "stop",
"text": "the heLLo World 2018"
}
Returns: hello, world (stop words such as "the" are removed, and non-letter tokens such as 2018 are dropped).
Keyword
POST _analyze
{
"analyzer": "keyword",
"text": "the heLLo World 2018"
}
Returns the entire input as a single term:
{
"tokens": [
{
"token": "the heLLo World 2018",
"start_offset": 0,
"end_offset": 21,
"type": "word",
"position": 0
}
]
}
Pattern
- Splits using a regular expression. Tokenizer: pattern. Token Filters: lowercase, stop (disabled by default)
- The default pattern is \W+, i.e. non-word characters act as separators.
POST _analyze
{
"analyzer": "pattern",
"text": "the heLLo 'World -2018"
}
Returns:
{
"tokens": [
{
"token": "the",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "hello",
"start_offset": 4,
"end_offset": 9,
"type": "word",
"position": 1
},
{
"token": "world",
"start_offset": 11,
"end_offset": 16,
"type": "word",
"position": 2
},
{
"token": "2018",
"start_offset": 19,
"end_offset": 23,
"type": "word",
"position": 3
}
]
}
4. Chinese word segmentation
- Challenges
- Chinese segmentation splits a sentence into individual words. In English, words are naturally separated by spaces; Chinese has no such formal delimiter.
- Different contexts can yield very different segmentations, e.g. cross ambiguity. Both of the following segmentations look plausible:
今天/民政局/发/放女朋友 今天/民政/局/发放/女/朋友
4.1 The IK analyzer
1. Segments both Chinese and English text; supports the ik_smart and ik_max_word modes.
2. Supports custom dictionaries and hot-reloading of the segmentation dictionary.
https://github.com/medcl/elasticsearch-analysis-ik
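Assuming the IK plugin is installed (it is not bundled with ES), a minimal sketch of testing it looks like the request below; the exact tokens depend on the plugin version and dictionary:
POST _analyze
{
  "analyzer": "ik_max_word", // or "ik_smart" for coarser-grained segmentation
  "text": "今天民政局发放女朋友"
}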
5. Custom analysis
- When the built-in analyzers cannot meet your needs, you can customize the analysis.
- This is done by combining Character Filters, a Tokenizer, and Token Filters.
5.1 Character Filters
Example:
POST _analyze
{
"tokenizer": "keyword",
"char_filter": [
"html_strip" //去掉html中的符号
],
"text": "<p>I'm so <b>happy</b>!</p>"
}
- Built-in Tokenizers include:
- standard: splits on standard word boundaries
- letter: splits on non-letter characters
- whitespace: splits on whitespace
- uax_url_email: behaves like standard, but does not split emails and URLs
Example:
POST _analyze
{
"tokenizer": "uax_url_email",
"text": "www.baidu.com 1112@qq.com hello world"
}
Result:
{
"tokens": [
{
"token": "www.baidu.com",
"start_offset": 0,
"end_offset": 13,
"type": "<URL>",
"position": 0
},
{
"token": "1112@qq.com",
"start_offset": 14,
"end_offset": 25,
"type": "<EMAIL>",
"position": 1
},
{
"token": "hello",
"start_offset": 26,
"end_offset": 31,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "world",
"start_offset": 32,
"end_offset": 37,
"type": "<ALPHANUM>",
"position": 3
}
]
}
- ngram and edge_ngram: character n-gram splitting, similar to search-as-you-type suggestions in Baidu/Google
POST _analyze
{
"tokenizer": "ngram", //会一次查询出来你每个字后面的词,edge_ngram只会查询出第一个词后面的词
"text": "你好"
}
{
"tokens": [
{
"token": "你",
"start_offset": 0,
"end_offset": 1,
"type": "word",
"position": 0
},
{
"token": "你好",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 1
},
{
"token": "好",
"start_offset": 1,
"end_offset": 2,
"type": "word",
"position": 2
}
]
}
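For contrast, a minimal sketch of the same text with the edge_ngram tokenizer; with the default settings (min_gram 1, max_gram 2) it should emit only the prefixes 你 and 你好:
POST _analyze
{
  "tokenizer": "edge_ngram", // only grams anchored at the beginning of the token
  "text": "你好"
}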
- path_hierarchy: splits along file-path separators
Example:
POST _analyze
{
"tokenizer": "path_hierarchy",
"text": "/baidu/com"
}
Result:
{
"tokens": [
{
"token": "/baidu",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 0
},
{
"token": "/baidu/com",
"start_offset": 0,
"end_offset": 10,
"type": "word",
"position": 0
}
]
}
5.2 Token Filter
- Adds, removes, or modifies the terms output by the tokenizer.
Built-in filters; multiple filters can be chained and are applied in order:
- lowercase: lowercases every term
- stop: removes stop words
- ngram and edge_ngram: split terms into n-grams
- synonym: adds synonym terms (see the sketch after the example below)
POST _analyze
{
"tokenizer": "standard",
"filter": [
"stop",
"lowercase",
{
"type":"ngram",
"min_gram":4, 最小四个
"max_gram":4 最大四个
}
],
"text": "a hello world"
}
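The synonym filter from the list above, shown as a minimal sketch with an inline synonym rule (the words used here are just an illustration):
POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "synonym",
      "synonyms": ["quick, fast"] // treat quick and fast as equivalent terms
    }
  ],
  "text": "a quick car"
}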
5.3 Custom analyzers
Defined in the settings of an index.
Structure:
PUT test_index
{
"settings": {
"analysis": {
"char_filter": {},
"tokenizer": {},
"filter": {},
"analyzer": {}
}
}
}
Creating your own analyzer:
PUT test_index1
{
"settings": {
"analysis": {
"analyzer": {
"my_first_analyzer":{
"type":"custom",
"tokenizer":"standard",
"char_filter":[
"html_strip"
],
"filter":[
"lowercase",
"asciifolding"
]
}
}
}
}
}
Testing the custom analyzer:
POST /test_index1/_analyze
{
"analyzer": "my_first_analyzer",
"text": "<p>I'm so <b>happy</b>!</p>"
}
Result:
{
"tokens": [
{
"token": "i'm",
"start_offset": 3,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "so",
"start_offset": 12,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "happy",
"start_offset": 18,
"end_offset": 27,
"type": "<ALPHANUM>",
"position": 2
}
]
}
6. When analysis is applied
- When a document is created or updated (index time), the document's fields are analyzed.
- At query time (search time), the query string is analyzed.
6.1 Index-time analysis is configured via the analyzer property of each field in the index mapping; when not specified, the default is standard.
Demo:
PUT test_index2
{
"mappings": {
"doc":{
"properties": {
"title":{
"type":"text",
"analyzer":"whitespace" //指定分词器
}
}
}
}
}
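Note: the doc level in these mappings is an ES 6.x mapping type; from ES 7 onward mappings are typeless and that level is omitted.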
6.2 Specifying the analyzer at query time
1. Specify the analyzer directly in the query:
POST /test_index/_search
{
"query": {
"match": {
"message": {
"query": "hello",
"analyzer": "standard" //指定分词器
}
}
}
}
2. Set search_analyzer in the index mapping:
PUT /test_index
{
"mappings": {
"doc": {
"properties": {
"title":{
"type": "text",
"analyzer": "whitespace",
"search_analyzer": "standard"
}
}
}
}
}
7. Official documentation
Refer to the official documentation often:
https://www.elastic.co/guide/index.html