Define two analyzers that filter out HTML markup, so the generated index contains no HTML tags:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_text_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["html_char_filter"]
        },
        "html_keyword_analyzer": {
          "tokenizer": "keyword",
          "filter": ["trim"],
          "char_filter": ["html_char_filter"]
        }
      },
      "char_filter": {
        "html_char_filter": {
          "type": "html_strip"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "fields": {
          "html_text": {
            "type": "text",
            "analyzer": "html_text_analyzer",
            "search_analyzer": "simple"
          },
          "html_keyword": {
            "type": "text",
            "analyzer": "html_keyword_analyzer"
          }
        }
      }
    }
  }
}
The mappings define sub-fields under content, so a single field is indexed three ways (the default analyzer, html_text, and html_keyword), which makes comparison easy.
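As a baseline for that comparison (a sanity check added here, not part of the original walkthrough), note that the default analyzer on content does not strip tags at all. You can see this with _analyze, which falls back to the standard analyzer when no analyzer is specified:

POST my_index/_analyze
{
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}

Here the tag names themselves (p, b) come back as tokens, since the standard tokenizer simply splits on non-alphanumeric characters — exactly the noise the html_strip char filter is meant to remove.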
Check the mappings we just created:

GET my_index/_mapping
Now let's test html_text_analyzer:
POST my_index/_analyze
{
  "analyzer": "html_text_analyzer",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
// Response — note that the offsets refer to positions in the original HTML, because the char filter preserves them
{
  "tokens" : [
    {
      "token" : "I'm",
      "start_offset" : 3,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "so",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "happy",
      "start_offset" : 18,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}
Now let's test html_keyword_analyzer:

POST my_index/_analyze
{
  "analyzer": "html_keyword_analyzer",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
// Response — the HTML markup is removed and the whole text is indexed as a single keyword (the trim filter strips the newlines html_strip inserts in place of the <p> tags)
{
  "tokens" : [
    {
      "token" : "I'm so happy!",
      "start_offset" : 0,
      "end_offset" : 32,
      "type" : "word",
      "position" : 0
    }
  ]
}
Finally, index some data to test and compare:
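A minimal sketch of that comparison, assuming a hypothetical document with id 1 holding the same sample text:

PUT my_index/_doc/1
{
  "content": "<p>I&apos;m so <b>happy</b>!</p>"
}

// Should match via the default analyzer, which kept the tag name "p" as a token
GET my_index/_search
{
  "query": { "match": { "content": "p" } }
}

// Should match via html_text_analyzer, which stripped the tags before tokenizing
GET my_index/_search
{
  "query": { "match": { "content.html_text": "happy" } }
}

// Should match only on the exact stripped-and-trimmed string
GET my_index/_search
{
  "query": { "term": { "content.html_keyword": "I'm so happy!" } }
}

The first query hitting on "p" is the tell: the default index treats the markup as content, while the two custom sub-fields see only the text.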