Elasticsearch Analyzers
Analysis and Analyzer:
Analysis is the process of breaking full text into tokens; it is carried out by an analyzer.
An analyzer consists of three parts: character filters, a tokenizer, and token filters.
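The three stages run in order: character filters preprocess the raw text, the tokenizer splits it into terms, and token filters transform the resulting terms (lowercasing, removing stop words, and so on). As a minimal sketch, a custom analyzer wiring all three stages together could be defined like this (demo_index and full_pipeline_analyzer are illustrative names; html_strip, standard, and lowercase are built-ins):
PUT demo_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "full_pipeline_analyzer": {
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}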
(1) Character filter:
Preprocesses the text before it reaches the tokenizer, for example stripping HTML tags or replacing characters.
Built-in character filters:
"html_strip": removes HTML tags
"mapping": replaces characters according to a mapping
"pattern_replace": performs regex-based replacement
Note that this preprocessing affects the position and offset information the tokenizer later reports.
Character filter: html_strip
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [ "html_strip" ],
  "text": "<p>hello world</p>"
}
Output:
{
  "tokens" : [
    {
      "token" : """
hello world
""",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}
As you can see, the HTML tags were removed. Note that end_offset is still 18, the length of the original text including the tags: the character filter rewrites what the tokenizer sees, but offsets keep pointing into the original input, which is the effect on position and offset information mentioned above.
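By default html_strip removes every tag. The filter also takes an escaped_tags parameter listing tags that should be left alone; a minimal sketch, assuming the illustrative names my_html_index, keep_b_analyzer, and keep_b_char_filter:
PUT my_html_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "keep_b_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [ "keep_b_char_filter" ]
        }
      },
      "char_filter": {
        "keep_b_char_filter": {
          "type": "html_strip",
          "escaped_tags": [ "b" ]
        }
      }
    }
  }
}
Analyzing "<p>hello <b>world</b></p>" with keep_b_analyzer should strip the <p> tags while leaving <b>world</b> intact.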
Character filter: mapping
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "test_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "test_char_filter"
          ]
        }
      },
      "char_filter": {
        "test_char_filter": {
          "type": "mapping",
          "mappings": [
            "1 => 2"
          ]
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "test_analyzer",
  "text": "My license plate is 111"
}
Output:
{
  "tokens" : [
    {
      "token" : "My license plate is 222",
      "start_offset" : 0,
      "end_offset" : 23,
      "type" : "word",
      "position" : 0
    }
  ]
}
As you can see, every 1 was replaced with 2.
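A mapping char filter can hold several rules at once, and the keys may span multiple characters. A sketch along the lines of the emoticon example in the Elasticsearch docs (the index and analyzer names here are made up):
PUT emoticon_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "emoticon_analyzer": {
          "tokenizer": "standard",
          "char_filter": [ "emoticon_char_filter" ]
        }
      },
      "char_filter": {
        "emoticon_char_filter": {
          "type": "mapping",
          "mappings": [
            ":) => happy",
            ":( => sad"
          ]
        }
      }
    }
  }
}
Analyzing "I feel :)" with emoticon_analyzer should produce the tokens [I, feel, happy], because the emoticon is rewritten before the tokenizer runs.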
Character filter: pattern_replace
PUT test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "test_pattern_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "test_pattern_char_filter"
          ]
        }
      },
      "char_filter": {
        "test_pattern_char_filter": {
          "type": "pattern_replace",
          "pattern": "111",
          "replacement": "2"
        }
      }
    }
  }
}
POST test_index/_analyze
{
  "analyzer": "test_pattern_analyzer",
  "text": "My license plate is 111"
}
Output:
{
  "tokens" : [
    {
      "token" : "My license plate is 2",
      "start_offset" : 0,
      "end_offset" : 23,
      "type" : "word",
      "position" : 0
    }
  ]
}
As you can see, everything matching the pattern was replaced with 2.
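The replacement string of pattern_replace also supports capture groups, referenced as $1, $2, and so on. A sketch that joins dash-separated digit groups with underscores, following the example in the Elasticsearch docs (the index and filter names are illustrative):
PUT digits_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "digit_analyzer": {
          "tokenizer": "standard",
          "char_filter": [ "digit_char_filter" ]
        }
      },
      "char_filter": {
        "digit_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}
With this filter, "My phone is 123-456-789" becomes "My phone is 123_456_789" before tokenization, so the standard tokenizer keeps the number together as a single token instead of splitting it at the dashes.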