Elasticsearch: Fuzzy (Wildcard) Matching and Tokenization

苗小刀

Article link: https://blog.youkuaiyun.com/weixin_44487662/article/details/106553267

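With Elasticsearch we mostly rely on analyzed (tokenized) full-text search, but sometimes we need SQL LIKE-style fuzzy matching. This post records a problem I ran into while doing that kind of query. ("Fuzzy" here means substring matching with the wildcard query, not the edit-distance based fuzzy query.)
Project background
The data originally lived in a PostgreSQL (pg) database. As the volume kept growing, and since the company's business is log processing (mainly log analysis and aggregation), we decided to move to ES and migrate the existing data. The field in question was used for fuzzy matching before, so it had to keep supporting fuzzy matching in ES.
The index structure is as follows (a single field, name, of type text):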

PUT test5
{
  "settings": {
    "refresh_interval": "30s"
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      }
    }
  }
}
Add some data (这是一个名称 means "this is a name"):
POST test5/_doc/1
{
  "name":"这是一个名称123456"
}
POST test5/_doc/2
{
  "name":"这是一个名称dfd"
}
POST test5/_doc/3
{
  "name":"这是一个名称dfd123456"
}
POST test5/_doc/4
{
  "name":"dfd123456"
}

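Then I used the wildcard query that ES provides for fuzzy matching.

The DSL is as follows (the wildcard clause does the fuzzy match, and value holds the pattern to look for):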
GET test5/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "wildcard": {
            "name": {
              "value": "*称df*"
            }
          }
        }
      ]
    }
  }
}
Result:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
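To my surprise, the wildcard query found nothing.

So I went looking for the cause, because by our business definition a search like this should find matches. Then I remembered that ES analyzes (tokenizes) the contents of a text field.
A piece of text entering ES is tokenized as follows:

The DSL is as follows: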

GET test5/_analyze
{
  "text": ["这是一个名称dfd"]
}
Result:
{
  "tokens" : [
    {
      "token" : "这",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "一",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "个",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "名",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "称",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "dfd",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 6
    }
  ]
}
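It turns out that the default ES analyzer splits Chinese text into single characters, while a run of letters and digits such as dfd stays together as one token (each piece is called a term). The wildcard pattern we pass in is compared against each term individually: if some term matches, the document is returned, otherwise there is no result. A pattern like *称df* spans two terms (称 and dfd), so it can never match. With this mapping, fuzzy matching on Chinese only works when the input is a single character, which our business clearly cannot accept.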
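
The behaviour described above can be reproduced outside ES with a small Python sketch. It is only an approximation: standard_like_tokens is a hypothetical stand-in for the standard analyzer (one token per CJK ideograph, one token per run of ASCII letters and digits), and Python's fnmatchcase stands in for the wildcard matcher (it also treats [ ] specially, which the ES wildcard query does not).

```python
import re
from fnmatch import fnmatchcase


def standard_like_tokens(text):
    """Rough stand-in for the standard analyzer: one token per CJK
    ideograph, one token per run of ASCII letters/digits, lowercased."""
    return [t.lower() for t in re.findall(r"[\u4e00-\u9fff]|[A-Za-z0-9]+", text)]


def wildcard_hits(pattern, terms):
    """Return the terms that the wildcard pattern matches in full."""
    return [t for t in terms if fnmatchcase(t, pattern)]


terms = standard_like_tokens("这是一个名称dfd")
print(terms)                           # the individual terms
print(wildcard_hits("*称df*", terms))  # pattern spans two terms
print(wildcard_hits("*df*", terms))    # pattern fits inside one term
```

The pattern is never tested against the original string, only against one term at a time, which is why a fragment that crosses a token boundary finds nothing.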

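Since tokenization looked like the problem, the obvious idea was to change the analyzer. After some searching I switched to the IK analyzer, which is available on GitHub.
After downloading it, create an ik folder under the plugins directory of Elasticsearch and unzip the archive straight into it:
elasticsearch-7.4.2\plugins\ik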

After unzipping, restart ES.
Then check the tokenization result directly.

Request:
GET test5/_analyze
{
  "analyzer": "ik_max_word",
  "text": "这是一个名称dfd"
}
Result:
{
  "tokens" : [
    {
      "token" : "这是",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "一个",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "一",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "TYPE_CNUM",
      "position" : 2
    },
    {
      "token" : "个",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "COUNT",
      "position" : 3
    },
    {
      "token" : "名称",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "dfd",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "ENGLISH",
      "position" : 5
    }
  ]
}

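The result is much friendlier: the text is no longer split into single characters. But it still does not meet our original goal of fuzzy matching, because the wildcard query is still matched against individual terms.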

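At this point I figured that the choice of analyzer was probably not the real cause, since a fuzzy query on a text field is always evaluated against its terms. So the new idea was to give the existing text field an additional keyword sub-field and run the wildcard query against that (a keyword field is not tokenized).
The DSL is as follows (name stays a text field, and fields adds a second, keyword-typed view of it):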

PUT test5/_mapping
{
  "properties": {
    "name": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    }
  }
}
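
To see why a keyword sub-field helps, here is a minimal Python sketch that treats each stored value as one single term, the way a keyword field does. The names keyword_term and search_keyword are hypothetical, fnmatchcase is only an approximation of the ES wildcard matcher, and the documents are the four added earlier.

```python
from fnmatch import fnmatchcase

IGNORE_ABOVE = 256  # same limit as ignore_above in the mapping above

docs = {
    "1": "这是一个名称123456",
    "2": "这是一个名称dfd",
    "3": "这是一个名称dfd123456",
    "4": "dfd123456",
}


def keyword_term(value, ignore_above=IGNORE_ABOVE):
    """A keyword field indexes the whole value as one term; values longer
    than ignore_above characters are not indexed (None in this sketch)."""
    return value if len(value) <= ignore_above else None


def search_keyword(pattern):
    """Return ids of documents whose whole value matches the pattern."""
    hits = []
    for doc_id, value in docs.items():
        term = keyword_term(value)
        if term is not None and fnmatchcase(term, pattern):
            hits.append(doc_id)
    return hits


print(search_keyword("*个名称dfd*"))
print(search_keyword("*称df*"))
```

Because the whole value is one term, a fragment that crosses what used to be a token boundary can now match. The ignore_above setting is worth keeping in mind: a value longer than that limit would not be indexed in the sub-field, so a wildcard query on it could not find it.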
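Query DSL, run against the keyword sub-field name.keyword (documents indexed before the mapping change need to be re-indexed first, for example with POST test5/_update_by_query, so that name.keyword gets populated):
GET test5/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "wildcard": {
            "name.keyword": {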
              "value": "*个名称dfd*"
            }
          }
        }
      ]
    }
  }
}
Result:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test5",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "name" : "这是一个名称dfd"
        }
      },
      {
        "_index" : "test5",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "name" : "这是一个名称dfd123456"
        }
      }
    ]
  }
}
The fuzzy match now succeeds.
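
When the search fragment comes from user input, it may itself contain * or ?. The following Python sketch builds the request body for a "contains" style match and backslash-escapes those characters first. The helper names are hypothetical, and the escaping rule is an assumption based on Lucene's wildcard syntax, so it should be verified against the ES version in use. The bool/must wrapper from the query above is omitted here because there is only one clause.

```python
import json


def escape_wildcard(fragment):
    """Backslash-escape the characters that are special in a wildcard pattern."""
    return "".join("\\" + ch if ch in "\\*?" else ch for ch in fragment)


def build_contains_query(field, fragment):
    """Build a 'contains' style wildcard query body for a keyword field."""
    return {
        "query": {
            "wildcard": {
                field: {"value": "*" + escape_wildcard(fragment) + "*"}
            }
        }
    }


body = build_contains_query("name.keyword", "个名称dfd")
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON can be pasted after GET test5/_search, or sent as the request body by whatever HTTP client the project already uses.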

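Note: this only shows that fuzzy matching can be supported. I would advise against relying on it: on large data sets wildcard queries, especially ones that start with *, become very slow.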
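
A rough way to picture the cost, as a simplified Python model and not how Lucene is actually implemented: with a sorted list of terms, a prefix pattern can jump to the right place by binary search, while a pattern with a leading * has to look at every term. The term values and function names below are made up for illustration.

```python
from bisect import bisect_left
from fnmatch import fnmatchcase


def prefix_lookup(sorted_terms, prefix):
    """Find terms starting with prefix; returns (matches, terms_examined)."""
    i = bisect_left(sorted_terms, prefix)  # jump to the first candidate
    matches = []
    examined = 0
    while i < len(sorted_terms):
        examined += 1
        if not sorted_terms[i].startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(sorted_terms[i])
        i += 1
    return matches, examined


def contains_lookup(sorted_terms, fragment):
    """Find terms containing fragment; every term has to be examined."""
    matches = [t for t in sorted_terms if fnmatchcase(t, "*" + fragment + "*")]
    return matches, len(sorted_terms)


terms = sorted("log-%05d" % n for n in range(10000))
print(prefix_lookup(terms, "log-0123")[1])   # terms examined for a prefix
print(contains_lookup(terms, "0123")[1])     # terms examined for *0123*
```

In this model the work for the leading-wildcard lookup grows with the number of distinct terms, and on a keyword field holding long, mostly unique values that is roughly the number of documents.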
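You may wonder: why does ES match only terms and not my whole text? My personal understanding is that ES is built for search, so what it cares about is keywords, that is, terms. Once the terms have been produced and written to the index, queries no longer look at the original text.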

This is just my own, rather shallow, understanding. If anyone understands this more deeply, please explain; I would be very grateful.