Elasticsearch, as you know, is all about search


Reference
https://www.elastic.co/guide/cn/elasticsearch/guide/current/index.html

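Two kinds of operations in Elasticsearch search

Scoring

filter: a filtering operation; no scoring or relevance calculation is performed
search: a query; scoring and relevance calculation are required

Queries

Exact-value queries
Full-text search

Exact-value queries

term: term-based queries

Because of the way an inverted index works, checking whether an entire field is equal to a value is hard to compute. How would we determine whether a particular document contains only the term we are looking for? We would first have to find the matching entry in the inverted index and collect the document IDs, and then scan every row of the inverted index to see whether those documents also contain any other terms.

As you can imagine, this would be both inefficient and expensive. For that reason, term and terms are must-contain operations, not must-equal-exactly operations.

Exact equality

If the field is an array, exact equality can be emulated by indexing the array length in a separate field and filtering on it as well: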

{ "tags" : ["search"], "tag_count" : 1 }
{ "tags" : ["search", "open_source"], "tag_count" : 2 }

GET /my_index/my_type/_search
{
    "query": {
        "constant_score" : {
            "filter" : {
                 "bool" : {
                    "must" : [
                        { "term" : { "tags" : "search" } }, 
                        { "term" : { "tag_count" : 1 } } 
                    ]
                }
            }
        }
    }
}
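
The "must contain" behaviour described above can be illustrated with a toy inverted index. This is only a minimal Python sketch, not how Elasticsearch stores data internally; the helper name build_inverted_index and the in-memory docs dictionary are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs, field):
    """Map each term found in `field` to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for term in doc[field]:
            index[term].add(doc_id)
    return index


# The two sample documents from the text, keyed by a made-up document ID.
docs = {
    1: {"tags": ["search"], "tag_count": 1},
    2: {"tags": ["search", "open_source"], "tag_count": 2},
}
tags_index = build_inverted_index(docs, "tags")

# A term lookup answers "which documents contain this term?" with one access.
contains_search = tags_index["search"]

# "Equals exactly" is emulated by additionally requiring tag_count == 1.
only_search = {doc_id for doc_id in contains_search if docs[doc_id]["tag_count"] == 1}
```

Both documents contain search, so the term lookup alone cannot tell them apart; only the extra tag_count condition narrows the result to the first document.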

Range queries

gt: > greater than
lt: < less than
gte: >= greater than or equal to
lte: <= less than or equal to

GET /my_store/products/_search
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "range" : {
                    "price" : {
                        "gte" : 20,
                        "lt"  : 40
                    }
                }
            }
        }
    }
}

Date ranges

"range" : {
    "timestamp" : {
        "gt" : "2014-01-01 00:00:00",
        "lt" : "2014-01-07 00:00:00"
    }
}

When applied to date fields, the range query supports date math. For example, to find all documents whose timestamp falls within the past hour:

"range" : {
    "timestamp" : {
        "gt" : "now-1h"
    }
}

is null and is not null

GET /my_index/posts/_search
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "exists" : { "field" : "tags" }
            }
        }
    }
}

GET /my_index/posts/_search
{
    "query" : {
        "constant_score" : {
            "filter": {
                "missing" : { "field" : "tags" }
            }
        }
    }
}

Checking objects

"people":{
    "name":"z",
    "age":20
}

What is actually stored is:

{
    "people.name":"z",
    "people.age":20
}

So checking whether people exists

{
    "exists" : { "field" : "people" }
}

===> is converted to

{
    "bool": {
        "should": [
            { "exists": { "field": "people.name" }},
            { "exists": { "field": "people.age" }}
        ]
    }
}
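
The flattening and the rewrite of exists on an object can be sketched in a few lines of Python. This is a hedged illustration of the idea only: flatten and expand_exists are hypothetical helper names, and the list of leaf fields is passed in by hand rather than read from a real mapping.

```python
def flatten(obj, prefix=""):
    """Flatten a nested dict into dotted field names such as people.name."""
    flat = {}
    for key, value in obj.items():
        path = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=path + "."))
        else:
            flat[path] = value
    return flat


def expand_exists(field, leaf_fields):
    """Rewrite exists on an object field into a should over its leaf fields."""
    leaves = [f for f in leaf_fields if f.startswith(field + ".")]
    return {"bool": {"should": [{"exists": {"field": f}} for f in leaves]}}


stored = flatten({"people": {"name": "z", "age": 20}})
rewritten = expand_exists("people", list(stored))
```

Here stored holds the two dotted fields shown above, and rewritten has the same shape as the bool query with two exists clauses.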

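Caching

Automatic caching behavior
In older versions of Elasticsearch, the default behavior was to cache everything that could be cached. This often meant that the system cached bitsets too aggressively, and performance suffered from cache thrashing. In addition, many filters are very fast to evaluate but substantially slower to cache (and to reuse from the cache). Caching such filters makes little sense, because it is cheaper simply to execute the filter again.

Checking the inverted index is very fast, and most query components are rarely reused. Take a term filter on a "user_id" field: with millions of users, any particular user ID occurs rarely. Caching the bitset for this filter is not worthwhile, because the cached result will probably be evicted before it is used again.

This kind of cache churn has a serious effect on performance. Worse, it makes it hard for developers to tell well-performing caches from useless ones.

To address this, Elasticsearch caches queries automatically based on how often they are used. If a non-scoring query has been used several times (the number depends on the query type) within the last 256 queries, it becomes a candidate for caching. However, not every segment is guaranteed to cache the bitset. Only segments that hold more than 10,000 documents (or more than 3% of the total number of documents) cache the bitset. Small segments are fast to search and to merge, so caching them brings little benefit.

Once cached, a non-scoring bitset stays in the cache until it is evicted. Eviction is LRU-based: once the cache is full, the least recently used filter is evicted.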
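
The three rules above (usage frequency, segment size, LRU eviction) can be put together in a small Python sketch. It is not Elasticsearch's implementation: the class FilterCacheSketch, the capacity and min_uses parameters, and the choice to combine the two segment thresholds as "whichever is larger" are all assumptions made for illustration.

```python
from collections import OrderedDict, deque

HISTORY_SIZE = 256         # "the last 256 queries", taken from the text
MIN_SEGMENT_DOCS = 10_000  # segment size threshold, taken from the text
MIN_SEGMENT_RATIO = 0.03   # 3% of all documents, taken from the text


class FilterCacheSketch:
    def __init__(self, capacity, min_uses):
        self.capacity = capacity   # hypothetical: maximum number of cached bitsets
        self.min_uses = min_uses   # hypothetical: uses needed to become a candidate
        self.history = deque(maxlen=HISTORY_SIZE)
        self.cache = OrderedDict()

    def record_use(self, filter_key):
        """Remember that a filter was used; old entries fall out automatically."""
        self.history.append(filter_key)

    def is_candidate(self, filter_key):
        """A filter qualifies once it appears often enough in the recent history."""
        return self.history.count(filter_key) >= self.min_uses

    @staticmethod
    def segment_is_eligible(segment_docs, total_docs):
        # Assumption: the two thresholds combine as "whichever is larger".
        return segment_docs > max(MIN_SEGMENT_DOCS, MIN_SEGMENT_RATIO * total_docs)

    def get(self, key):
        """Return a cached bitset and mark it as most recently used."""
        if key not in self.cache:
            return None
        self.cache.move_to_end(key)
        return self.cache[key]

    def put(self, key, bitset):
        """Store a bitset, evicting the least recently used entry when full."""
        self.cache[key] = bitset
        self.cache.move_to_end(key)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)
```

With a capacity of 2, storing a third bitset evicts whichever of the first two was touched least recently, which is the LRU rule described above.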

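Full-text search

If a high-level query such as match is run against a date or integer field, it treats the query string as a date or an integer, respectively.
If it is run against an exact-value (not_analyzed) string field, it treats the whole query string as a single term.
But if it is run against a full-text (analyzed) field, it first passes the query string through the appropriate analyzer to produce the list of terms to query.

For keyword fields, a non-scoring filter is the better choice.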

GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "title": "QUICK!"
        }
    }
}
            --
            ||
            ||
            \/
    
GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "must": [
        {"term": {
          "title": "quick"
        }}
      ]
    }
  }
}

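Improving precision

For a multi-word match, the query string is passed through the analyzer to produce several tokens, which are then converted into low-level term queries. By default these term queries are combined with or; the operator parameter selects whether or or and is used.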

GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "title": {      
                "query":    "BROWN DOG!",
                "operator": "and"
            }
        }
    }
}

Use minimum_should_match to control the minimum share of terms that must match

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "title": {
        "query":                "quick brown dog",
        "minimum_should_match": "75%"
      }
    }
  }
}

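Parameter	Example	Explanation
Integer	3	A fixed value, regardless of the number of optional clauses.
Negative integer	-2	The total number of optional clauses minus this number is required.
Percentage	75%	This percentage of the total number of optional clauses is required. The number computed from the percentage is rounded down and used as the minimum.
Negative percentage	-25%	This percentage of the total number of optional clauses may be missing. The number computed from the percentage is rounded down and then subtracted from the total to determine the minimum.
Combination	3<90%	A positive integer, followed by the less-than sign, followed by any of the specifiers above is a conditional specification. If the number of optional clauses is equal to (or less than) the integer, they are all required; if it is greater than the integer, the specification applies. In this example: with 1 to 3 clauses they are all required, but with 4 or more clauses only 90% are required.
Multiple combinations	2<-25% 9<-3	Multiple conditional specifications can be separated by spaces, each one being valid only for numbers greater than the one before it. In this example: with 1 or 2 clauses, all are required; with 3 to 9 clauses, all but 25% are required; with more than 9 clauses, all but 3 are required.

Notes:

When dealing with percentages, negative values can be used to get different behavior in edge cases. With 4 clauses, 75% and -25% mean the same thing, but with 5 clauses 75% means 3 are required while -25% means 4 are required.

If the calculation based on the specification determines that no optional clauses are needed, the usual rules about BooleanQueries still apply at search time (a BooleanQuery containing no required clauses must still match at least one optional clause).

Whatever number the calculation arrives at, a value greater than the number of optional clauses or less than 1 is never used (that is, no matter how low or how high the result of the calculation is, the minimum number of required matches is never lower than 1 or greater than the number of clauses).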
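
The rules in the table and notes can be checked with a short Python sketch. It follows the descriptions above rather than the actual Elasticsearch or Lucene source; the function names are hypothetical, and the sketch assumes that conditional specifications contain no spaces around the less-than sign.

```python
def _simple(clause_count, spec):
    """Evaluate a plain integer or percentage specification."""
    if spec.endswith("%"):
        percent = int(spec[:-1])
        share = clause_count * abs(percent) // 100  # rounded down
        return clause_count - share if percent < 0 else share
    value = int(spec)
    return clause_count + value if value < 0 else value


def minimum_should_match(clause_count, spec):
    """Return how many of `clause_count` optional clauses must match."""
    spec = str(spec).strip()
    required = clause_count
    if "<" in spec:
        # Conditional specifications, e.g. "2<-25% 9<-3".
        for part in spec.split():
            bound, inner = part.split("<", 1)
            if clause_count <= int(bound):
                break
            required = _simple(clause_count, inner)
    else:
        required = _simple(clause_count, spec)
    # Per the last note: never below 1, never above the number of clauses.
    return max(1, min(clause_count, required))
```

For example, minimum_should_match(5, "75%") and minimum_should_match(5, "-25%") give 3 and 4, matching the edge case described in the notes.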

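Combining queries

The query below finds all documents of this type whose title contains quick and does not contain lazy. The should clauses only increase relevance; they do not affect whether a document matches.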

GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "must":     { "match": { "title": "quick" }},
      "must_not": { "match": { "title": "lazy"  }},
      "should": [
                  { "match": { "title": "brown" }},
                  { "match": { "title": "dog"   }}
      ]
    }
  }
}

Controlling precision

GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "brown" }},
        { "match": { "title": "fox"   }},
        { "match": { "title": "dog"   }}
      ],
      // at least two of the match clauses must match
      "minimum_should_match": 2 
    }
  }
}

bool and match

{
    "match": {
        "title": {
            "query":    "brown fox",
            "operator": "and"
        }
    }
}
            --
            ||
            ||
            \/
{
  "bool": {
    "must": [
      { "term": { "title": "brown" }},
      { "term": { "title": "fox"   }}
    ]
  }
}

{
    "match": {
        "title": {
            "query":                "quick brown fox",
            "minimum_should_match": "75%"
        }
    }
}
            --
            ||
            ||
            \/
{
  "bool": {
    "should": [
      { "term": { "title": "brown" }},
      { "term": { "title": "fox"   }},
      { "term": { "title": "quick" }}
    ],
    "minimum_should_match": 2 
  }
}
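
The two rewrites above follow one pattern, which can be sketched in Python. This is only an illustration: toy_analyze is a crude stand-in for a real analyzer (it lowercases and splits on non-word characters), and match_to_bool is a hypothetical helper, not an Elasticsearch API.

```python
import re


def toy_analyze(text):
    """Crude stand-in for an analyzer: lowercase, split on non-word characters."""
    return [token for token in re.split(r"\W+", text.lower()) if token]


def match_to_bool(field, query, operator="or", minimum_should_match=None):
    """Rewrite a match query into the equivalent bool query over term queries."""
    terms = [{"term": {field: token}} for token in toy_analyze(query)]
    if operator == "and":
        return {"bool": {"must": terms}}
    body = {"should": terms}
    if minimum_should_match is not None:
        body["minimum_should_match"] = minimum_should_match
    return {"bool": body}


and_query = match_to_bool("title", "BROWN DOG!", operator="and")
or_query = match_to_bool("title", "quick brown fox", minimum_should_match=2)
```

With operator and, every term becomes a must clause; otherwise the terms become should clauses and minimum_should_match is carried over unchanged.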

Boosting

GET /_search
{
    "query": {
        "bool": {
            "must": {
                "match": {  
                    "content": {
                        "query":    "full text search",
                        "operator": "and"
                    }
                }
            },
            "should": [
                { "match": {
                    "content": {
                        "query": "Elasticsearch",
                        "boost": 3 
                    }
                }},
                { "match": {
                    "content": {
                        "query": "Lucene",
                        "boost": 2 
                    }
                }}
            ]
        }
    }
}

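boost is a relevance weight; the resulting increase or decrease in score is not linear.

Analyzer used for the search text

The analyzer defined in the query itself, else
The search_analyzer defined in the field mapping, else
The analyzer defined in the field mapping, else
The analyzer named default_search in the index settings, which defaults to
The analyzer named default in the index settings, which defaults to
The standard analyzer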
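
The fallback order above amounts to "take the first analyzer that is defined". A minimal Python sketch, assuming the field mapping and index settings are passed in as plain dictionaries (the function name and the flat dictionary layout are hypothetical simplifications):

```python
def resolve_search_analyzer(query_analyzer=None, field_mapping=None, index_settings=None):
    """Return the first analyzer name found, following the order in the text."""
    field_mapping = field_mapping or {}
    index_settings = index_settings or {}
    candidates = [
        query_analyzer,                           # analyzer set in the query itself
        field_mapping.get("search_analyzer"),     # search_analyzer in the field mapping
        field_mapping.get("analyzer"),            # analyzer in the field mapping
        index_settings.get("default_search"),     # default_search in the index settings
        index_settings.get("default"),            # default in the index settings
    ]
    for name in candidates:
        if name:
            return name
    return "standard"
```

For instance, a field mapped with only an analyzer of english resolves to english, while a field with no analyzer settings at all falls through to standard.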

Index templates

PUT _template/template_1
{
//indices whose names start with te will use this template
  "template": "te*",
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "type1": {
      "_source": {
        "enabled": false
      },
      "properties": {
        "host_name": {
          "type": "keyword"
        },
        "created_at": {
          "type": "date",
          "format": "EEE MMM dd HH:mm:ss Z YYYY"
        }
      }
    }
  }
}
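
How an index name is matched against a template pattern such as te* can be pictured with Python's fnmatch module. This is only a stand-in: fnmatch also understands ? and [] patterns, which may differ from what Elasticsearch accepts, and templates_for is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def templates_for(index_name, templates):
    """Return the names of the templates whose pattern matches the index name."""
    return [name for name, pattern in templates.items() if fnmatchcase(index_name, pattern)]


# Template name mapped to its pattern, as in the PUT request above.
templates = {"template_1": "te*"}
```

An index called test_logs would pick up template_1, while one called logs would not.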

Deleting a template

DELETE /_template/template_1

Retrieving templates

GET /_template/template_1
GET /_template/temp*
GET /_template/template_1,template_2

Multi-field text search

boost is the query weight; to raise the weight of a clause, set it to a value greater than 1.

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { 
            "title":  {
              "query": "War and Peace",
              "boost": 2
        }}},
        { "match": { 
            "author":  {
              "query": "Leo Tolstoy",
              "boost": 2
        }}},
        { "bool":  { 
            "should": [
              { "match": { "translator": "Constance Garnett" }},
              { "match": { "translator": "Louise Maude"      }}
            ]
        }}
      ]
    }
  }
}

Best fields

Index the data
PUT /my_index/my_type/1
{
    "title": "Quick brown rabbits",
    "body":  "Brown rabbits are commonly seen."
}

PUT /my_index/my_type/2
{
    "title": "Keeping pets healthy",
    "body":  "My quick brown fox eats rabbits on a regular basis."
}

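The bool query runs both of the queries in the should clause.
It adds their scores together (call this s).
It multiplies the sum by the number of matching clauses (call this count).
It divides the result by the total number of clauses, here 2 (call this num).
(s1 + s2)*count/num

Document 1 contains the word brown in both fields, so both match clauses succeed and each has a score. Document 2 contains both brown and fox in the body field, but neither word in the title field. The high score from the body query is added to the zero score from the title query and then multiplied by one half, which gives a lower overall score than document 1.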

{
    "query": {
        "bool": {
            "should": [
                { "match": { "title": "Brown fox" }},
                { "match": { "body":  "Brown fox" }}
            ]
        }
    }
}
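
The formula (s1 + s2)*count/num can be tried out with a few lines of Python. The clause scores below are made-up numbers chosen only to reproduce the ranking described in the text; real scores come from the relevance calculation.

```python
def bool_should_score(clause_scores):
    """Sum the matching scores, multiply by the match count, divide by the clause count."""
    matching = [score for score in clause_scores if score > 0]
    return sum(matching) * len(matching) / len(clause_scores)


# Hypothetical per-clause scores in the order [title, body].
doc1_score = bool_should_score([0.5, 0.5])  # brown matches in both fields
doc2_score = bool_should_score([0.0, 1.5])  # brown and fox match in body only
```

Even though document 2 has the single best field match, it ends up with 0.75 against 1.0 for document 1, because its sum is halved.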

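So how can this be solved?

Instead of the bool query, we can use dis_max, the Disjunction Max Query. Disjunction means or, just as conjunction means and. The Disjunction Max Query returns any document that matches any of the queries, but uses only the score of the best-matching query as the score of the whole query: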

{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Brown fox" }},
                { "match": { "body":  "Brown fox" }}
            ]
        }
    }
}

Tuning dis_max

By specifying the tie_breaker parameter, the scores of the other matching clauses can be taken into account as well

{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Quick pets" }},
                { "match": { "body":  "Quick pets" }}
            ],
            "tie_breaker": 0.3
        }
    }
}

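The tie_breaker parameter offers a middle ground between dis_max and bool. The score is calculated as follows:

Take the _score of the best-matching clause.
Multiply the score of each of the other matching clauses by the tie_breaker.
Add them all up and normalize.

tie_breaker can be a floating-point value between 0 and 1, where 0 uses only the best-matching clause (plain dis_max behavior) and 1 treats all matching clauses as equally important. The best value has to be found by tuning against your data and queries, but a reasonable value should be close to zero (in the range 0.1 - 0.4), so that it does not overwhelm the best-match nature of dis_max.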
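
The steps above can be sketched in Python as follows. The normalization step is left out, the function name is hypothetical, and the clause scores are made-up numbers used only to show the effect of tie_breaker.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times the other matching clause scores."""
    matching = [score for score in clause_scores if score > 0]
    if not matching:
        return 0.0
    best = max(matching)
    return best + tie_breaker * (sum(matching) - best)


plain = dis_max_score([0.5, 1.5])                        # only the best clause counts
blended = dis_max_score([0.5, 1.5], tie_breaker=0.5)     # other clause counts for half
like_sum = dis_max_score([0.5, 1.5], tie_breaker=1.0)    # all clauses count equally
```

With tie_breaker at 0 the result is the best score alone, and with tie_breaker at 1 it equals the plain sum of the matching clause scores.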

Most fields and cross fields

{
    "multi_match": {
        "query":                "Quick brown fox",
        "type":                 "best_fields", 
        "fields":               [ "title", "body" ],
        "tie_breaker":          0.3,
        "minimum_should_match": "30%" 
    }
}

Wildcard matching on field names

{
    "multi_match": {
        "query":  "Quick brown fox",
        "fields": "*_title"
    }
}

Boosting individual fields

{
    "multi_match": {
        "query":  "Quick brown fox",
        "fields": [ "*_title", "chapter_title^2" ] 
    }
}

Cross fields

{
  "query": {
    "bool": {
      "should": [
        { "match": { "street":    "Poland Street W1V" }},
        { "match": { "city":      "Poland Street W1V" }},
        { "match": { "country":   "Poland Street W1V" }},
        { "match": { "postcode":  "Poland Street W1V" }}
      ]
    }
  }
}

GET my_index/my_type/_search
{
  "query": {
    "multi_match": {
      "query": "brown fox",
      "type": "most_fields",
       "operator":    "or",
       "minimum_should_match":"75%",
      "fields": ["title","body"]
    }
  }
}

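Problems with most_fields

Searching with most_fields also has some problems that are not immediately obvious:

It is designed to find the most fields matching any of the words, rather than the most matching words across all fields.
It cannot use the operator or minimum_should_match parameters to reduce the long tail of less relevant results.
Term frequencies differ from field to field and can interfere with each other, producing badly ordered results.

Solution

These problems exist only because we are dealing with multiple fields. If all of these fields are combined into a single field, the problems disappear. This can be done by adding a full_name field to the person document: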

{
    "first_name":  "Peter",
    "last_name":   "Smith",
    "full_name":   "Peter Smith"
}

Custom _all fields

PUT /my_index
{
    "mappings": {
        "person": {
            "properties": {
                "first_name": {
                    "type":     "string",
                    "copy_to":  "full_name" 
                },
                "last_name": {
                    "type":     "string",
                    "copy_to":  "full_name" 
                },
                "full_name": {
                    "type":     "string"
                }
            }
        }
    }
}

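copy_to has no effect on multi-fields (one field indexed in several different ways); it applies only to the main field. A multi-field has no source of its own and depends on the main field.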

PUT /my_index
{
    "mappings": {
        "person": {
            "properties": {
                "first_name": {
                    "type":     "string",
                    "copy_to":  "full_name", 
                    "fields": {
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                },
                "full_name": {
                    "type":     "string"
                }
            }
        }
    }
}

Cross fields

GET /books/_search
{
    "query": {
        "multi_match": {
            "query":       "peter smith",
            "type":        "cross_fields",
            "fields":      [ "title^2", "description" ] 
        }
    }
}

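For a document to match, peter and smith must both appear in the same field, either the first_name field or the last_name field:

Field-centric matching (as a side effect, several fields matching the same word score higher than several fields matching different words):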
(+first_name:peter +first_name:smith)
(+last_name:peter  +last_name:smith)

A term-centric approach uses the following logic instead:
+(first_name:peter last_name:peter)
+(first_name:smith last_name:smith)

In other words, the terms peter and smith must both be present, but they may appear in any field.
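
The difference between the two kinds of logic can be made concrete with a small Python sketch. The document is represented as a dictionary from field name to a list of already-analyzed tokens, and both function names are hypothetical; scoring is ignored and only matching is modelled.

```python
def field_centric_and(doc, terms, fields):
    """Match if at least one single field contains every term."""
    return any(all(term in doc.get(field, ()) for term in terms) for field in fields)


def term_centric_and(doc, terms, fields):
    """Match if every term appears in at least one of the fields."""
    return all(any(term in doc.get(field, ()) for field in fields) for term in terms)


doc = {"first_name": ["peter"], "last_name": ["smith"]}
terms = ["peter", "smith"]
fields = ["first_name", "last_name"]

field_centric_match = field_centric_and(doc, terms, fields)
term_centric_match = term_centric_and(doc, terms, fields)
```

For the Peter Smith document, the field-centric check fails because no single field holds both words, while the term-centric check succeeds.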

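The custom _all approach is a good solution, as long as the mapping is set up before the documents are indexed. However, Elasticsearch also provides a search-time solution: the multi_match query with the cross_fields type. cross_fields takes a term-centric approach, which is quite different from the field-centric approach used by best_fields and most_fields. It treats all of the fields as one big field and looks for each term in any of the fields.

For the cross_fields query to work optimally, all fields should use the same analyzer. Fields that share an analyzer are grouped together and treated as blended fields.

If fields with a different analysis chain are included, they are added to the query in the same way as for best_fields. For example, if we add the title field to the preceding query (assuming it uses a different analyzer), the explanation looks like this: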

(+title:peter +title:smith)
(
  +blended("peter", fields: [first_name, last_name])
  +blended("smith", fields: [first_name, last_name])
)
This is particularly important when using the minimum_should_match and operator parameters.

Boosting

GET /books/_search
{
    "query": {
        "multi_match": {
            "query":       "peter smith",
            "type":        "cross_fields",
            "fields":      [ "title^2", "description" ] 
        }
    }
}

Exact-value (not_analyzed) fields: mixing not_analyzed fields with analyzed fields in a multi_match query is of little use.
