Elasticsearch Study Notes (4): Analyzers, Part 1: char_filter

Elasticsearch Analyzers

Analysis and analyzers:
Analysis is the process of breaking full text into terms (tokens); it is carried out by an analyzer.

An analyzer is made up of three parts: character filters, a tokenizer, and token filters.
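
As a rough sketch of how the three parts fit together in a custom analyzer definition (the index name demo_index, the analyzer name my_analyzer, and the particular built-in filters chosen here are my own placeholders, not from the original examples):

PUT demo_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}

The character filters run first on the raw text, the tokenizer then splits the filtered text into tokens, and the token filters post-process those tokens.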

(1) Character filters:
Character filters preprocess the text before it reaches the tokenizer, for example by stripping HTML tags or replacing characters.
The built-in character filters are:
"html_strip": strips HTML tags
"mapping": replaces characters according to a configured mapping
"pattern_replace": performs regex-based replacement

Because the text is modified before tokenization, this preprocessing affects the position and offset information that the downstream tokenizer reports.
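
To illustrate, a minimal sketch that combines html_strip with the standard tokenizer (instead of the keyword tokenizer used in the example below):

POST _analyze
{
  "tokenizer": "standard",
  "char_filter": ["html_strip"],
  "text": "<p>hello world</p>"
}

The tokens hello and world should come back with offsets that still count the stripped <p> tag, because character filters record offset corrections back to the original input.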

Char filter: html_strip

POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<p>hello world</p>"
}

The output is:

{
  "tokens" : [
    {
      "token" : """
hello world
""",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}

As you can see, the HTML tags have been stripped. Note that end_offset is still 18, the length of the original <p>hello world</p>, because offsets refer to the text before the character filter ran.

Char filter: mapping

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "test_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "test_char_filter"
          ]
        }
      },
      "char_filter": {
        "test_char_filter": {
          "type": "mapping",
          "mappings": [
            "1 => 2"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "test_analyzer",
  "text": "My license plate is 111"
}

The output is:

{
  "tokens" : [
    {
      "token" : "My license plate is 222",
      "start_offset" : 0,
      "end_offset" : 23,
      "type" : "word",
      "position" : 0
    }
  ]
}

As you can see, every 1 has been replaced with a 2, so 111 becomes 222.
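
The mappings array can also hold multiple rules. A sketch using the _analyze API's support for inline (transient) char filter definitions, with example mappings and text of my own:

POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        ":) => happy",
        ":( => sad"
      ]
    }
  ],
  "text": "good news :) bad news :("
}

This should emit a single keyword token in which the emoticons have been replaced by the words happy and sad.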

Char filter: pattern_replace

PUT test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "test_pattern_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "test_pattern_char_filter"
          ]
        }
      },
      "char_filter": {
        "test_pattern_char_filter": {
          "type": "pattern_replace",
          "pattern": "111",
          "replacement": "2"
        }
      }
    }
  }
}

POST test_index/_analyze
{
  "analyzer": "test_pattern_analyzer",
  "text": "My license plate is 111"
}

The output is:

{
  "tokens" : [
    {
      "token" : "My license plate is 2",
      "start_offset" : 0,
      "end_offset" : 23,
      "type" : "word",
      "position" : 0
    }
  ]
}

As you can see, the substring matching the regex (111) has been replaced with 2.
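
The pattern_replace char filter also supports capture groups in the replacement string. A sketch, again using an inline char filter definition and sample text of my own:

POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "(\\d+)-(?=\\d)",
      "replacement": "$1_"
    }
  ],
  "text": "My credit card is 123-456-789"
}

The text should come through as "My credit card is 123_456_789": each hyphen between digit groups is replaced with an underscore, while the digits themselves are kept via the $1 back-reference.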
