ElasticSearch 50: Index Management - Hands-On Practice: Modifying Analyzers and Building Your Own Custom Analyzer

This article covers Elasticsearch's default analyzer and how to configure it, including how to enable the English stop-word token filter, and then walks through the concrete steps for defining a custom analyzer.


1. The default analyzer
standard
standard tokenizer: splits text on word boundaries
standard token filter: does nothing
lowercase token filter: converts all letters to lowercase
stop token filter (disabled by default): removes stop words such as a, the, it, and so on
Together these make up the standard analyzer pipeline; an equivalent custom definition is sketched below.
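To make the pipeline concrete, here is a minimal sketch of a custom analyzer that reproduces what standard does out of the box (the index name analyzer_demo and the analyzer name standard_like are placeholders chosen for illustration); the stop filter is omitted because it is disabled by default:

PUT /analyzer_demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_like": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}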

2. Modifying analyzer settings

Example: enable the English stop-word token filter on top of the standard analyzer.
Here, es_std is the name we give this analyzer:
PUT /index0
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std":{
          "type":"standard",
          "stopwords":"_english_"
        }
      }
    }
  }
}



Test:

Analyze "a little dog" with the standard analyzer:

GET /index0/_analyze
{
  "analyzer":"standard",
  "text":"a little dog"
}
Result:
{
  "tokens": [
    {
      "token": "a",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "little",
      "start_offset": 2,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "dog",
      "start_offset": 9,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}


Now analyze "a little dog" with the es_std analyzer we just defined. In the result, the stop word has been filtered out:

GET /index0/_analyze
{
  "analyzer":"es_std",
  "text":"a little dog"
}
Result:

{
  "tokens": [
    {
      "token": "little",
      "start_offset": 2,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "dog",
      "start_offset": 9,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
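Note that es_std only exists inside index0, so the _analyze request has to go through that index. A built-in analyzer such as standard can also be tested without naming an index (a quick check using the same text as above):

GET /_analyze
{
  "analyzer": "standard",
  "text": "a little dog"
}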





3. Defining your own custom analyzer
Example:
char_filter: a character filter of type mapping that defines our own replacement rule; here we map & to and, and name the filter &_to_and
my_stopwords: a token filter of type stop with our own stop-word list; here we use two stop words, a and the
my_analyzer: an analyzer of type custom. Before tokenizing, html_strip removes HTML tags and &_to_and (our own character filter) replaces & with and; tokenization uses the standard tokenizer; the my_stopwords filter removes our stop words; and lowercase converts all tokens to lowercase

PUT /index0
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and":{
          "type":"mapping",
          "mappings":["&=> and"]
        }
      },
      "filter":{
        "my_stopwords":{
          "type":"stop",
          "stopwords":["a","the"]
        }
      },
      "analyzer":{
        "my_analyzer":{
          "type":"custom",
          "char_filter":["html_strip","&_to_and"],
          "tokenizer":"standard",
          "filter":["lowercase","my_stopwords"]
        }
      }
    }
  }
}


Running this returns an error because the index already exists:
{
  "error": {
    "root_cause": [
      {
        "type": "index_already_exists_exception",
        "reason": "index [index0/zeKanPhhTR-6fiUjKRoe9g] already exists",
        "index_uuid": "zeKanPhhTR-6fiUjKRoe9g",
        "index": "index0"
      }
    ],
    "type": "index_already_exists_exception",
    "reason": "index [index0/zeKanPhhTR-6fiUjKRoe9g] already exists",
    "index_uuid": "zeKanPhhTR-6fiUjKRoe9g",
    "index": "index0"
  },
  "status": 400
}

We first delete the existing index with DELETE /index0 and then run the PUT again.
This time it succeeds:
{
  "acknowledged": true,
  "shards_acknowledged": true
}
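You can read the stored analysis configuration back to confirm that the character filter, token filter, and analyzer were all registered (the response echoes the analysis block we sent):

GET /index0/_settings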



Test our custom analyzer my_analyzer.
Sample text: tom and jery in the a house <a> & me HAHA
In the result below you can see that a and the have been filtered out, HAHA is lowercased, & becomes and, and the <a> tag has been stripped:

GET /index0/_analyze
{
  "analyzer": "my_analyzer",
  "text":"tom and jery in the a house <a> & me HAHA"
}

Result:

{
  "tokens": [
    {
      "token": "tom",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "and",
      "start_offset": 4,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "jery",
      "start_offset": 8,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "in",
      "start_offset": 13,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "house",
      "start_offset": 22,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "and",
      "start_offset": 32,
      "end_offset": 33,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "me",
      "start_offset": 34,
      "end_offset": 36,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "haha",
      "start_offset": 37,
      "end_offset": 41,
      "type": "<ALPHANUM>",
      "position": 9
    }
  ]
}





4. Using our custom analyzer in the index
Map the content field of my_type so that it uses our custom analyzer my_analyzer. Note that adding a mapping is a PUT request, not a GET:
PUT /index0/_mapping/my_type
{
    "properties":{
        "content":{
            "type":"text",
            "analyzer":"my_analyzer"
        }
    }
}
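To see the analyzer working end to end, we can index a document into my_type and then search the content field. This is a sketch assuming the type-based API used above (the document id 1 and the sample text are made up). Because & is mapped to and at index time, and the match query analyzes its input with the field's analyzer, searching for "and" should find the document:

PUT /index0/my_type/1
{
  "content": "Tom & Jerry in the house"
}

GET /index0/my_type/_search
{
  "query": {
    "match": {
      "content": "and"
    }
  }
}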









