Elasticsearch 5.5.1 (Core 10)

License: CC 4.0 BY-SA

Original link: https://my.oschina.net/mdxlcj/blog/1529501

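This article walks through Elasticsearch search techniques in detail: multi-index and multi-type search, paging pitfalls, the difference between exact-value matching and full-text search, how the inverted index works, custom analyzers, and more.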

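1. Multi-index and multi-type search patterns

How to search the data in multiple indices and multiple types with a single request:
/_search: searches all data in every type of every index
/index1/_search: searches all types in one specified index
/index1,index2/_search: searches two indices at once
/*1,*2/_search: matches multiple indices with wildcards
/index1/type1/_search: searches one specified type in one index
/index1/type1,type2/_search: searches multiple types in one index
/index1,index2/type1,type2/_search: searches multiple types across multiple indices
/_all/type1,type2/_search: _all stands for all indices, so this searches the specified types in every index

One point to note: a search request from the client does not have to be served by a primary shard. It can also be served by replica shards, which spreads the load and improves throughput.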
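
The URL patterns listed above all follow one rule: optional index names, optional type names, then _search. A minimal Python sketch of that rule follows; the helper name search_path is hypothetical and is not part of any Elasticsearch client.

```python
def search_path(indices=None, types=None):
    """Build a search URL path from optional index and type names."""
    parts = []
    if indices:
        parts.append(",".join(indices))
    elif types:
        # With types but no indices, _all stands for every index.
        parts.append("_all")
    if types:
        parts.append(",".join(types))
    parts.append("_search")
    return "/" + "/".join(parts)


print(search_path())
print(search_path(["index1", "index2"], ["type1", "type2"]))
print(search_path(types=["type1", "type2"]))
```

Wildcard patterns such as *1 can be passed as index names unchanged, since the helper only joins strings.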

2. Problems caused by paging

GET /test_index/test_type/_search?from=0&size=3

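This request can run into the deep paging problem. Because a distributed Elasticsearch index is spread over multiple shards, a search goes through a coordinating node (which may not hold any of the index's data itself) that forwards the request to the nodes holding the shards. Suppose there are 6000 documents spread over 3 shards and you want page 100 at 10 documents per page (from=990, size=10). No shard knows which of its documents fall inside that global range, so each shard has to return its own top 1000 hits (from + size). The coordinating node then sorts those 3000 candidates by relevance and keeps only the 10 that make up the requested page.

The deeper the page, the more CPU, memory, and network bandwidth this costs, so avoid deep paging where you can.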
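
To make the cost concrete, here is a small Python simulation of the scatter/gather step. It is only a sketch under assumptions: 3 shards, random relevance scores, and hypothetical function names (shard_top, coordinate); it does not reproduce Elasticsearch's actual query and fetch phases.

```python
import heapq
import random


def shard_top(shard_docs, from_, size):
    """Each shard returns its own top (from + size) hits, best score first."""
    return heapq.nlargest(from_ + size, shard_docs, key=lambda d: d["score"])


def coordinate(shards, from_, size):
    """Merge the per-shard candidates, sort globally, and cut out the page."""
    candidates = []
    for shard_docs in shards:
        candidates.extend(shard_top(shard_docs, from_, size))
    candidates.sort(key=lambda d: d["score"], reverse=True)
    return candidates[from_:from_ + size], len(candidates)


random.seed(0)
docs = [{"id": i, "score": random.random()} for i in range(6000)]
shards = [docs[i::3] for i in range(3)]  # 3 shards with 2000 documents each

page, merged = coordinate(shards, from_=990, size=10)
print(len(page), merged)  # hits on the page, candidates the coordinator sorted
```

The coordinator has to sort 300 times more entries than it returns, and that ratio grows with the page number.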

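3. The difference between + and -: + means the term must be present, - means the term must not be present

GET /test_index/test_type/_search?q=test_field:test
GET /test_index/test_type/_search?q=+test_field:test (same effect as the request above)
GET /test_index/test_type/_search?q=-test_field:test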

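4. How the _all metadata field works and what it is for

GET /test_index/test_type/_search?q=test

This does not search every field one by one. When a document is indexed, Elasticsearch concatenates the values of all its fields into a single string, stores that as the _all field, and indexes it. A search that does not name a field runs against this _all field.

For example: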

{
  "name": "jack",
  "age": 26,
  "email": "jack@sina.com",
  "address": "guangzhou"
}

"jack 26 jack@sina.com guangzhou" becomes the value of this document's _all field, which is then analyzed and added to the inverted index.

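The construction of the _all value for the document above can be sketched in a few lines of Python. This is an illustration only: build_all_field is a hypothetical helper, and it simply joins the values in field order.

```python
def build_all_field(doc):
    """Join every field value into one string, in field order."""
    return " ".join(str(value) for value in doc.values())


doc = {"name": "jack", "age": 26, "email": "jack@sina.com", "address": "guangzhou"}
all_value = build_all_field(doc)
print(all_value)
```

A query without a field name, such as q=test, is then matched against the terms of this one combined string rather than against each field separately.
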
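5. Mappings created automatically by Elasticsearch

GET /website/article/_search?q=2017           3 hits
GET /website/article/_search?q=2017-01-01     3 hits (the query is split into 2017, 01, 01)
GET /website/article/_search?q=post_date:2017-01-01    1 hit
GET /website/article/_search?q=post_date:2017    1 hit

Why do the results differ? When Elasticsearch creates a mapping automatically, it gives different fields different data types, and each data type is analyzed and searched differently. That is why searching the _all field and searching the post_date field behave so differently.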

GET /website/_mapping/article

{
  "website": {
    "mappings": {
      "article": {
        "properties": {
          "author_id": {
            "type": "long"
          },
          "content": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "post_date": {
            "type": "date"
          },
          "title": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}
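
A mapping like the one above comes from type guessing on the first document that is indexed. The Python sketch below imitates that guessing for a few JSON value types. It is an approximation under assumptions: the sample article is made up, guess_type and guess_mapping are hypothetical names, and the date check accepts only the yyyy-MM-dd form, whereas real date detection supports configurable formats.

```python
import re

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def guess_type(value):
    """Guess a field mapping from one sample value."""
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, str) and DATE_RE.match(value):
        return {"type": "date"}
    # Any other string: analyzed text plus an exact-value keyword sub-field.
    return {"type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}}


def guess_mapping(doc):
    return {"properties": {field: guess_type(v) for field, v in doc.items()}}


# Hypothetical sample document, not taken from the original data set.
article = {"post_date": "2017-01-01", "title": "my first article",
           "content": "this is my first article", "author_id": 11400}
mapping = guess_mapping(article)
print(mapping["properties"]["post_date"], mapping["properties"]["author_id"])
```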

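6. Exact value and full text

exact value:

2017-01-01 as an exact value: you must search for 2017-01-01 to find it
Searching for just 01 finds nothing

full text:

2017-01-01 is split into 2017, 01, 01, so searching for 2017 or 01 finds it
china: searching for cn also finds china
likes: searching for like also finds likes
Tom: searching for tom also finds Tom
like: searching for the synonym love also finds like

In other words, full text search does not just match one complete value. It matches after the value has been split into terms (tokenization), and, depending on how the field's analyzer is configured, it can also match across abbreviations, tenses, letter case, and synonyms.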

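7. Inverted index

With tokenization alone, closely related words such as mom and mother end up as different terms in the index, so a search for one will not find the other. To fix this, build the inverted index with normalization applied.

Next, what an analyzer is:

It splits text into terms and normalizes them (which improves recall)

Given a sentence, the analyzer splits it into individual words and normalizes each word (tense conversion, singular/plural conversion)
recall: raising recall means increasing the number of results a search is able to find

character filter: preprocesses the text before tokenization. The most common examples are stripping HTML tags (<span>hello</span> --> hello) and replacing & with and (I&you --> I and you)

tokenizer: splits the text into tokens, hello you and me --> hello, you, and, me

token filter: lowercase, stop word, synonym, for example dogs --> dog, liked --> like, Tom --> tom, a/the/an --> removed, mother --> mom, small --> little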
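
The three stages can be chained as in the following toy Python pipeline. It is a sketch, not Elasticsearch's implementation: the stop-word, stemming, and synonym tables are tiny hand-written dictionaries assumed for illustration, and the tokenizer just splits on whitespace.

```python
import re

STOP_WORDS = {"a", "an", "the"}
STEMS = {"dogs": "dog", "liked": "like"}          # assumed stemming table
SYNONYMS = {"mother": "mom", "small": "little"}   # assumed synonym table


def char_filter(text):
    """Preprocess raw text: strip HTML tags, spell out the ampersand."""
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", " and ")


def tokenizer(text):
    """Split the filtered text into tokens on whitespace."""
    return text.split()


def token_filter(tokens):
    """Lowercase, drop stop words, then apply stemming and synonyms."""
    out = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            continue
        token = STEMS.get(token, token)
        token = SYNONYMS.get(token, token)
        out.append(token)
    return out


def analyze(text):
    return token_filter(tokenizer(char_filter(text)))


print(analyze("<span>The Dogs</span> liked Tom&mother"))
# ['dog', 'like', 'tom', 'and', 'mom']
```

Because mother is stored as mom, a query that goes through the same pipeline matches either spelling, which is the recall gain described above.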

8. Built-in analyzers

Set the shape to semi-transparent by calling set_trans(5)

standard analyzer: set, the, shape, to, semi, transparent, by, calling, set_trans, 5 (standard is the default)
simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans
whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
language analyzer (an analyzer for a specific language, for example english): set, shape, semi, transpar, call, set_tran, 5
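
For this sample sentence, the first three outputs can be reproduced with short regular expressions in Python. These are rough stand-ins assumed for illustration, not the real analyzers: the standard analyzer actually follows Unicode text segmentation rules, and the function names below are hypothetical.

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"


def whitespace_analyzer(text):
    """Split on whitespace only; keep case and punctuation."""
    return text.split()


def simple_analyzer(text):
    """Split on anything that is not a letter, then lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


def standard_like_analyzer(text):
    """Rough stand-in: runs of letters, digits, and underscores, lowercased."""
    return re.findall(r"\w+", text.lower())


for analyzer in (standard_like_analyzer, simple_analyzer, whitespace_analyzer):
    print(analyzer.__name__, analyzer(TEXT))
```

The difference shows up on set_trans(5): the standard-like version keeps set_trans and 5, the simple version splits at the underscore and drops the digit, and the whitespace version keeps the whole chunk.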

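9. Query string analysis

The query string must be analyzed with the same analyzer that was used when the index was built
The query string is handled differently for exact value fields and full text fields

date: exact value
_all: full text

Say we have a document with a field whose value is: hello you and me, and an inverted index is built from it
We want to search the index that holds this document, and the search text is hello me. That search text is the query string
By default, Elasticsearch analyzes the query string with the same analyzer that was used to build the inverted index for the target field (tokenization plus normalization). Only then does the search work correctly

If dogs was converted to dog when the inverted index was built, but the search still looks for dogs, nothing would be found. The dogs in the query has to become dog as well before it can match.

Key point: fields of different types may be full text or exact value

post_date, date: exact value
_all: full text, tokenized and normalized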
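
The dogs example can be replayed with a toy inverted index in Python. This is a sketch under assumptions: the analyzer is a lowercase-and-split function with a one-entry stemming table, and build_index and search are hypothetical helpers.

```python
from collections import defaultdict

STEMS = {"dogs": "dog"}  # assumed one-entry stemming table


def analyze(text):
    """Lowercase, split on whitespace, and stem with the table above."""
    return [STEMS.get(t, t) for t in text.lower().split()]


def build_index(docs):
    """Map each analyzed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query, analyzed=True):
    """Look up query terms, optionally skipping query-side analysis."""
    terms = analyze(query) if analyzed else query.split()
    hits = set()
    for term in terms:
        hits |= index.get(term, set())
    return hits


docs = {1: "I like dogs", 2: "hello you and me"}
index = build_index(docs)
print(search(index, "dogs", analyzed=False))  # raw term "dogs" is not in the index
print(search(index, "dogs"))                  # analyzed to "dog", matches document 1
print(search(index, "hello me"))              # matches document 2
```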

10. Testing an analyzer

GET /_analyze
{
"analyzer": "standard",
"text": "Text to analyze"
}

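11. Summary of sections 1-10

(1) When you insert data directly into Elasticsearch, it automatically creates the index, the type, and the corresponding mapping
(2) The mapping automatically defines the data type of each field
(3) Depending on the data type (for example text versus date), a field is either exact value or full text
(4) An exact value is put into the inverted index whole, as a single term; full text goes through processing first, tokenization and normalization (tense conversion, synonym conversion, case conversion), before it is put into the inverted index
(5) Whether a field is exact value or full text also determines how a search against it behaves, and that behavior stays consistent with how the inverted index was built: a search on an exact value field matches the whole value directly, while the query string for a full text field is tokenized and normalized before it is looked up in the inverted index
(6) You can rely on dynamic mapping and let Elasticsearch create the mapping, data types included; or you can create the mapping for the index and type by hand in advance and configure each field yourself: data type, indexing behavior, analyzer, and so on

A mapping is the metadata of a type in an index. Each type has its own mapping, which determines the data types, how the inverted index is built, and how searches behave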

12. Data types

string (text and keyword in 5.x)
byte, short, integer, long
float, double
boolean
date

Viewing a mapping

GET /index/_mapping/type

13. Modifying a mapping

You can define a mapping by hand when creating an index, or add a mapping for a new field, but you cannot update the mapping of an existing field

14. API syntax

GET /_search
{}

GET /index1,index2/type1,type2/_search
{}

GET /_search
{
"from": 0,
"size": 10
}

Both of the following forms work as well:

GET /_search?from=0&size=10

POST /_search
{
"from":0,
"size":10
}

GET /test_index/test_type/_search
{
"query": {
"match": {
"test_field": "test"
}
}
}

Search requirement: title must contain elasticsearch, content may or may not contain elasticsearch, and author_id must not be 111

GET /website/article/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "elasticsearch"
          }
        }
      ],
      "should": [
        {
          "match": {
            "content": "elasticsearch"
          }
        }
      ],
      "must_not": [
        {
          "match": {
            "author_id": 111
          }
        }
      ]
    }
  }
}

GET /test_index/_search
{
  "query": {
    "bool": {
      "must": { "match": { "name": "tom" }},
      "should": [
        { "match": { "hired": true }},
        { "bool": {
          "must": { "match": { "personality": "good" }},
          "must_not": { "match": { "rude": true }}
        }}
      ],
      "minimum_should_match": 1
    }
  }
}
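
How must, should, must_not, and minimum_should_match combine can be modeled with a small in-memory matcher in Python. This is a toy model under assumptions: clauses are plain (field, value) equality checks instead of analyzed match queries, nested bool clauses are left out, and relevance scoring is ignored (in a real query, matching should clauses also raise the score). The names matches and bool_match are hypothetical.

```python
def matches(doc, clause):
    """A clause is a (field, value) pair that matches on simple equality."""
    field, value = clause
    return doc.get(field) == value


def bool_match(doc, must=(), should=(), must_not=(), minimum_should_match=0):
    """Return True if the document passes all three clause groups."""
    if not all(matches(doc, c) for c in must):
        return False
    if any(matches(doc, c) for c in must_not):
        return False
    should_hits = sum(matches(doc, c) for c in should)
    return should_hits >= minimum_should_match


query = {"must": [("name", "tom")],
         "should": [("hired", True)],
         "minimum_should_match": 1}

people = [{"name": "tom", "hired": True},
          {"name": "tom", "hired": False},
          {"name": "jack", "hired": True}]

print([bool_match(p, **query) for p in people])  # [True, False, False]
```

With minimum_should_match left at 0, the second document would also pass, because should clauses are optional whenever a must clause is present.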