Elasticsearch: Analyzers Explained

Original article. First published 2024-12-18 16:24:03; last modified 2024-12-18 16:25:41. License: CC 4.0 BY-SA.

Tags: #elasticsearch #big-data #search-engine #full-text-search

What Is an Analyzer

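1. Analyzer overview. An analyzer is a means of analyzing and processing text. It splits the original document into smaller terms according to predefined rules, and how fine-grained those terms are depends on the analyzer's rules.
A commonly used Chinese analyzer is IK, which offers two modes that differ in granularity: ik_max_word and ik_smart. For English text, the standard analyzer is the usual choice.
ik_max_word splits the text at the finest granularity and exhausts the possible word combinations, which makes it suitable for Term Query.
ik_smart splits the text at the coarsest granularity, which makes it suitable for Phrase queries.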
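The difference between the two granularities can be illustrated with a toy dictionary scan. This is only a sketch under loud assumptions: the five-word `DICTIONARY`, the function names `finest_grained` and `coarsest_grained`, and the use of forward maximum matching for the coarse mode are all made up for illustration. IK uses a much larger dictionary and its own disambiguation logic, so its real output may differ.

```python
# Hypothetical mini dictionary; the real IK dictionary is far larger.
DICTIONARY = {"努力", "努力学习", "力学", "学习", "编程"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def finest_grained(text):
    """Emit every dictionary word found at any offset (overlaps allowed),
    plus each character that no dictionary word covers."""
    spans = []
    covered = [False] * len(text)
    for start in range(len(text)):
        for end in range(start + 1, min(len(text), start + MAX_WORD_LEN) + 1):
            if text[start:end] in DICTIONARY:
                spans.append((start, end))
                for i in range(start, end):
                    covered[i] = True
    # Characters outside every dictionary word become single-character tokens.
    spans.extend((i, i + 1) for i, hit in enumerate(covered) if not hit)
    # Order by start offset, longer token first when starts are equal.
    spans.sort(key=lambda span: (span[0], -span[1]))
    return [text[s:e] for s, e in spans]


def coarsest_grained(text):
    """Forward maximum matching: at each position take the longest
    dictionary word, falling back to a single character."""
    tokens = []
    start = 0
    while start < len(text):
        end = start + 1
        for candidate in range(min(len(text), start + MAX_WORD_LEN), start, -1):
            if text[start:candidate] in DICTIONARY:
                end = candidate
                break
        tokens.append(text[start:end])
        start = end
    return tokens


if __name__ == "__main__":
    sentence = "布布努力学习编程"
    print(finest_grained(sentence))
    print(coarsest_grained(sentence))
```

The fine-grained scan keeps overlapping words, so more distinct terms end up in the index. The coarse scan produces a single non-overlapping sequence of tokens.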
The following requests run each analyzer through the _analyze API:

GET _analyze
{
  "text": ["布布努力学习编程"],
  "analyzer": "ik_max_word"
}
{
  "tokens" : [
    {
      "token" : "布",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "布",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "努力学习",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "努力",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "力学",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "学习",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "编程",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 6
    }
  ]
}
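The offsets in this response show that several ik_max_word tokens cover the same characters. The sketch below lists the overlapping pairs. The helper name `overlapping_pairs` is hypothetical, and the token list is copied from the response above, reduced to the three fields the check needs.

```python
# Tokens copied from the ik_max_word response, keeping only text and offsets.
TOKENS = [
    {"token": "布", "start_offset": 0, "end_offset": 1},
    {"token": "布", "start_offset": 1, "end_offset": 2},
    {"token": "努力学习", "start_offset": 2, "end_offset": 6},
    {"token": "努力", "start_offset": 2, "end_offset": 4},
    {"token": "力学", "start_offset": 3, "end_offset": 5},
    {"token": "学习", "start_offset": 4, "end_offset": 6},
    {"token": "编程", "start_offset": 6, "end_offset": 8},
]


def overlapping_pairs(tokens):
    """Return (first, second) token texts whose character ranges intersect.
    Offsets are treated as half-open ranges [start_offset, end_offset)."""
    pairs = []
    for i, a in enumerate(tokens):
        for b in tokens[i + 1:]:
            if a["start_offset"] < b["end_offset"] and b["start_offset"] < a["end_offset"]:
                pairs.append((a["token"], b["token"]))
    return pairs


if __name__ == "__main__":
    for first, second in overlapping_pairs(TOKENS):
        print(first, "overlaps", second)
```

Every overlapping pair falls inside the span 努力学习 (offsets 2 to 6), which is where the finest-grained mode emits both the long word and the shorter words it contains.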

GET _analyze
{
  "text": ["布布努力学习编程"],
  "analyzer": "ik_smart"
}
{
  "tokens" : [
    {
      "token" : "布",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" :