1. First create the tokenizer my_hanlp and the analyzer my_hanlp_analyzer, and attach a lowercase token filter;
// Define a custom tokenizer on top of HanLP
PUT demo1/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_hanlp_analyzer": {
          "tokenizer": "my_hanlp",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "my_hanlp": {
          "type": "hanlp_index",
          "enable_stop_dictionary": true,
          "enable_custom_config": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "my_hanlp_analyzer"
      }
    }
  }
}
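Before indexing anything, you can confirm that the analyzer lowercases tokens by running it through the _analyze API. This is just a sanity check; the exact token boundaries depend on your HanLP dictionaries and stop-word configuration:
// Verify the analyzer: English tokens should come back lowercased
GET /demo1/_analyze
{
  "analyzer": "my_hanlp_analyzer",
  "text": "Welcome To BEIJING"
}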
2. Add two test documents
POST /demo1/_doc/1
{
  "title": "我们 us wELCOME To Beijing.",
  "content": "1 this is a test CONTENT。 在对阵匈牙利的赛前新闻发布会上,C罗将自己面前的可口可乐瓶子移开。"
}

POST /demo1/_doc/2
{
  "title": "2 我们 us welcome To Beijing. ",
  "content": "2 This is a test content 2 Welcome TO BEIJING。 在对阵匈牙利的赛前新闻发布会上,C罗将自己面前的可口可乐瓶子移开"
}
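Note that the mapping above only assigns my_hanlp_analyzer to content. Since title is not declared, Elasticsearch maps it dynamically as a text field analyzed with the default standard analyzer, which also lowercases tokens; that is why the mixed-case query in the next step matches on title as well. You can inspect the generated mapping to confirm:
// Check the dynamically generated mapping for title
GET /demo1/_mapping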
3. Test a search query with mixed case
GET /demo1/_search
{
  "query": {
    "multi_match": {
      "query": "BEIjing",
      "fields": ["title", "content"]
    }
  },
  "highlight": {
    "fields": {
      "title": {},
      "content": {}
    }
  }
}
Result:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.6602099,
    "hits" : [
      {
        "_index" : "demo1",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.6602099,
        "_source" : {
          "title" : "2 我们 us welcome To Beijing. ",
          "content" : "2 This is a test content 2 Welcome TO BEIJING。 在对阵匈牙利的赛前新闻发布会上,C罗将自己面前的可口可乐瓶子移开"
        },
        "highlight" : {
          "title" : [
            "2 我们 us welcome To <em>Beijing</em>."
          ],
          "content" : [
            "2 This is a test content 2 Welcome TO <em>BEIJING</em>。 在对阵匈牙利的赛前新闻发布会上,C罗将自己面前的可口可乐瓶子移开"
          ]
        }
      },
      {
        "_index" : "demo1",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.18824537,
        "_source" : {
          "title" : "我们 us wELCOME To Beijing.",
          "content" : "1 this is a test CONTENT。 在对阵匈牙利的赛前新闻发布会上,C罗将自己面前的可口可乐瓶子移开。"
        },
        "highlight" : {
          "title" : [
            "我们 us wELCOME To <em>Beijing</em>."
          ]
        }
      }
    ]
  }
}
Both documents are returned even though the query "BEIjing" matches the case of neither indexed value. Document 1 only highlights on title, since its content field contains no "Beijing" at all.

This article showed how to create a custom tokenizer my_hanlp in Elasticsearch, built on hanlp_index with stop-word removal and a custom dictionary enabled. The test data and queries demonstrate the effect of letter case, and my_hanlp_analyzer performs lowercase conversion on top of my_hanlp so that search is case-insensitive.