Elasticsearch Beginner's Tutorial: From Concepts to Practice

Article link: https://blog.youkuaiyun.com/WeiLanooo/article/details/102394930

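This article is a beginner's tutorial on Elasticsearch. It covers basic concepts such as inverted indexes and analyzers, hands-on operations such as creating, deleting, and listing indices, importing data with Logstash, and using Elasticsearch from Python, as well as configuring the IK Chinese analyzer and implementing suggest queries, to help readers get a solid, working understanding of Elasticsearch.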


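Learning objectives

1.Know what an inverted index is
	An inverted index is a data structure that maps each term to the documents containing it

2.Understand how a search engine relies on inverted indexes, analysis (normalization), and relevance ranking

3.Be able to send HTTP requests from the terminal with the curl command

4.Be able to create, delete, and list indices in Elasticsearch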
	curl -X PUT localhost:9200/articles -d ''
	curl -X DELETE localhost:9200/articles
	curl localhost:9200/_cat/indices
	
5.Be able to use the IK Chinese analyzer in Elasticsearch
	"analyzer": "ik_max_word",

6.Be able to create and view type mappings in Elasticsearch
	curl -X PUT localhost:9200/articles/_mapping/article -d ''
	curl -X GET localhost:9200/articles/_mapping/article
	
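7.Know how to change the type mapping of an Elasticsearch index
	Zero-downtime approach
	1.Create a new index with the updated mapping.
	2.Copy the data from the old index into the new one.
	3.Delete the old index.
	4.Give the new index an alias with the old index's name.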

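8.Be able to create, delete, update, and retrieve documents in Elasticsearch
	Create: the format is PUT /index/type (table)/id
	An update is really a delete followed by an add

9.Be able to import data from MySQL into Elasticsearch with Logstash
	1.Download the MySQL connector (JDBC driver) for Java
	2.Write the import script
	3.Run the script

10.Be able to run search queries with Elasticsearch
	Query string: q=title:python%20web
	Advanced queries: move the query from the query string into the request body.

11.Be able to work with Elasticsearch from Python
	1.Import the client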
	from elasticsearch5 import Elasticsearch
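	2.Configure the Elasticsearch cluster addresses
	3.Create the ES client object
	4.Build the query body as a dictionary
	5.Run the query

12.Be able to implement spelling-correction suggestions with Elasticsearch
13.Be able to implement auto-completion suggestions with Elasticsearch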

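1.Introduction to Elasticsearch

What is Elasticsearch?

Besides being a search engine, Elasticsearch is also a document-oriented database.

Elasticsearch features

1.Elasticsearch is a search engine built on the Lucene library
2.Elasticsearch is a distributed, multi-tenant full-text search engine
3.Elasticsearch is written in Java

Sites that use Elasticsearch

1.Wikipedia
2.Stack Overflow
3.GitHub

Notes:

1.Elasticsearch's major versions include 2.x, 5.x, and 6.x; the Heima Toutiao project uses version 5.6.
2.You communicate with Elasticsearch through its RESTful API on port 9200
3.All documents in Elasticsearch are JSON documents.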

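2.Inverted indexes

Inverted index

An inverted index is a very important data structure.

A simple analogy:

The table of contents at the front of a book is a simple forward index
The index pages at the back are an inverted index

The back-of-book index found in some books:
	lists the words, the chapters in which each one appears, and the page numbers and positions.

Indexes in Elasticsearch

Forward index: maps a document id to the document's content and terms
Inverted index: maps a term to the ids of the documents that contain it

An Elasticsearch inverted index has two parts

1.Term dictionary:

Records every term that occurs in the documents, and maps each term to its posting list
The term dictionary is fairly large; textbooks usually describe a B+ tree implementation, while Lucene indexes its term dictionary with finite state transducers (FSTs)

2.Posting list:

Records the set of documents that contain a term; it is made up of postings
Each posting holds:
  1.Document id
  2.Term frequency (TF) - the number of times the term occurs in the document, used for relevance scoring
  3.Position - the term's position in the document's token sequence, used for phrase queries.
  4.Offset - the start and end character positions of the term, used for highlighting.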
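
The structure above can be sketched in a few lines of Python. This is a toy model only: the function name `build_inverted_index`, the regex tokenizer, and the sample documents are assumptions made for illustration, not how Lucene stores its index.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a posting list of {doc_id, tf, positions, offsets}."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        postings = {}
        # Toy tokenizer: lowercase runs of word characters; position counts tokens.
        for position, match in enumerate(re.finditer(r"\w+", text.lower())):
            entry = postings.setdefault(
                match.group(),
                {"doc_id": doc_id, "tf": 0, "positions": [], "offsets": []},
            )
            entry["tf"] += 1
            entry["positions"].append(position)
            entry["offsets"].append((match.start(), match.end()))
        for term, entry in postings.items():
            index[term].append(entry)
    return dict(index)


docs = {1: "Python web framework", 2: "Web search with Python and more Python"}
index = build_inverted_index(docs)
print(sorted(index))      # the term dictionary
print(index["python"])    # the posting list for "python"
```

The keys of `index` play the role of the term dictionary, and each value is a posting list with one posting per document.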

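3.Tokenizing with an Analyzer

Splitting a whole document into terms is the job of an Analyzer.
An analyzer is made up of three parts:

1.character filters

Filter the raw characters, for example by stripping HTML tags

2.tokenizer

Splits the text into terms according to a rule, for example on whitespace

3.token filter

Post-processes the terms, for example by converting uppercase to lowercase.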
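
The three stages can be imitated in plain Python to show the order in which they run. The function names below are hypothetical and each stage is deliberately simplistic; real analyzers are configured in Elasticsearch, not written like this.

```python
import re


def strip_html(text):
    """Character filter: remove HTML tags before tokenizing."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text):
    """Tokenizer: split on runs of whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter: normalize each token to lowercase."""
    return [token.lower() for token in tokens]


def analyze(text):
    """Run the stages in order: character filter, tokenizer, token filter."""
    return lowercase_filter(whitespace_tokenizer(strip_html(text)))


print(analyze("<p>Looking for <b>Python</b> Work</p>"))
# ['looking', 'for', 'python', 'work']
```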

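Elasticsearch's built-in analyzers

1.standard analyzer - the default; splits on word boundaries and lowercases
2.simple analyzer - splits on non-letter characters (symbols are dropped) and lowercases
3.stop analyzer - lowercases and removes stop words (the, a, is)
4.whitespace analyzer - splits on whitespace; does not lowercase
5.keyword analyzer - no tokenizing; the input is emitted as a single term
6.pattern analyzer - splits on a regular expression, \W+ (non-word characters) by default
7.language - analyzers for more than 30 common languages
8.custom analyzer - a user-defined analyzer

Comparing how each analyzer tokenizes the same text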

curl -X GET 127.0.0.1:9200/_analyze?pretty -d '
{
  "analyzer": "keyword",
  "text": "Looking for work or have a Python related position that you are trying to hire for? "
}'

Run the request with each of the following analyzers and compare the results:

1.standard
2.stop
3.keyword

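4.Relevance ranking

By default, search results are sorted by relevance in descending order, with the most relevant documents first.
Elasticsearch's classic similarity algorithm is term frequency/inverse document frequency (TF/IDF). Since 5.0 the default is BM25, which is built on the same three factors below.

Term frequency

The more often the term occurs, the higher the relevance. A field in which the term appears 5 times is more relevant than one in which it appears only once.

Inverse document frequency

The more documents the term occurs in, the lower its weight. A term that appears in most documents weighs less than one that appears in only a few.

Field-length norm

The longer the field, the lower the relevance. A term found in a short field weighs more than the same term found in a long field.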
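
As a rough numeric illustration, the three factors can be written with the classic Lucene TF/IDF formulas described in Elasticsearch: The Definitive Guide. The `weight` function that multiplies them is a simplification assumed here; real scoring also involves query normalization and boosts, and BM25 uses different formulas.

```python
import math


def tf(freq):
    """Term frequency: grows with the number of occurrences, but sub-linearly."""
    return math.sqrt(freq)


def idf(num_docs, doc_freq):
    """Inverse document frequency: terms found in many documents weigh less."""
    return 1 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_terms):
    """Field-length norm: a match in a short field weighs more."""
    return 1 / math.sqrt(num_terms)


def weight(freq, num_docs, doc_freq, num_terms):
    """Simplified weight of one term in one field (illustration only)."""
    return tf(freq) * idf(num_docs, doc_freq) * field_norm(num_terms)


# Change one factor at a time and compare.
print(weight(5, 1000, 10, 100), weight(1, 1000, 10, 100))    # more occurrences
print(weight(1, 1000, 10, 100), weight(1, 1000, 500, 100))   # rarer term
print(weight(1, 1000, 10, 10), weight(1, 1000, 10, 100))     # shorter field
```

In each printed pair the first value is the larger one, matching the three rules above.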

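5.Elasticsearch indexes

As a verb

The act of storing data in Elasticsearch is called indexing

As a noun

An Elasticsearch cluster can contain multiple indices (roughly, databases)

A comparison between relational databases and Elasticsearch (the analogy is not exact)

Databases ->  Indices
Tables    ->  Types
Rows      ->  Documents
Columns   ->  Fields

Summary

An Elasticsearch cluster can contain multiple indices (databases),
each index can contain multiple types (tables),
each type holds multiple documents (rows),
and each document has multiple fields (columns).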

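6.Elasticsearch clusters

Elasticsearch hides as much of the complexity of distributed systems as it can.

Here are some of the operations it performs automatically in the background

1.Partitioning documents into containers called shards, which can be stored on one or more nodes
2.Balancing these shards across the nodes of the cluster to spread the indexing and search load
3.Replicating each shard for redundancy, so that hardware failures do not cause data loss
4.Routing requests from any node in the cluster to the nodes that hold the relevant data
5.Integrating new nodes seamlessly as the cluster grows, and redistributing shards to recover when nodes are lost

Create an index (database):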

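PUT /blogs
{
   "settings" : {
      "number_of_shards" : 3,
      "number_of_replicas" : 1
   }
}

number_of_shards: number of primary shards in the index; cannot be changed after creation
number_of_replicas: number of replica copies of each primary shard; affects throughput to some extent and can be changed later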
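
The primary shard count is fixed because a document's shard is derived from it: shard = hash(routing) % number_of_primary_shards, where routing defaults to the document id. A minimal sketch of that rule follows; `pick_shard` is a hypothetical helper, and `zlib.crc32` is only a stand-in for the hash function Elasticsearch actually uses.

```python
import zlib


def pick_shard(routing, number_of_primary_shards):
    """shard = hash(routing) % number_of_primary_shards."""
    # zlib.crc32 is a stand-in hash, assumed here for illustration.
    digest = zlib.crc32(str(routing).encode("utf-8"))
    return digest % number_of_primary_shards


doc_ids = range(150000, 150010)
with_3 = [pick_shard(doc_id, 3) for doc_id in doc_ids]
with_4 = [pick_shard(doc_id, 4) for doc_id in doc_ids]

# Count how many documents would land on a different shard with 4 primaries.
moved = sum(a != b for a, b in zip(with_3, with_4))
print(with_3)
print(with_4)
print(moved)
```

If the number of primary shards could change, documents that were already indexed would typically map to a different shard and could no longer be found by id.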

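Check the cluster status

curl -X GET 127.0.0.1:9200/_cluster/health?pretty

The status field indicates whether the cluster as a whole is working properly. Its three colors mean

green: all primary and replica shards are active.
yellow: all primary shards are active, but not all replica shards are.
red: at least one primary shard is not active.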

7.IK Chinese analyzer

Download (pick the release that matches your Elasticsearch version):

https://github.com/medcl/elasticsearch-analysis-ik/releases

Install:

sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install file:///home/python/elasticsearch-analysis-ik-5.6.16.zip

Restart elasticsearch
```
sudo systemctl restart elasticsearch
```

Test

curl -X GET 127.0.0.1:9200/_analyze?pretty -d '
{
  "analyzer": "standard",
  "text": "我是&中国人"
}'

curl -X GET 127.0.0.1:9200/_analyze?pretty -d '
{
  "analyzer": "ik_max_word",
  "text": "我是&中国人"
}'

8.Creating, deleting, and listing indices

List all indices
```
curl 127.0.0.1:9200/_cat/indices
```

Create the article index for the Toutiao project (create the database)

curl -X PUT 127.0.0.1:9200/articles -H 'Content-Type: application/json' -d'
{
   "settings" : {
        "index": {
            "number_of_shards" : 3,
            "number_of_replicas" : 1
        }
   }
}'

Delete the project's article index
```
curl -X DELETE 127.0.0.1:9200/articles
```

9.Creating a type mapping (creating a table)

Article type for the Toutiao project (create the table):

curl -X PUT 127.0.0.1:9200/articles/_mapping/article -H 'Content-Type: application/json' -d'
{
     "_all": {
          "analyzer": "ik_max_word"
      },
      "properties": {
          "article_id": {
              "type": "long",
              "include_in_all": "false"
          },
          "user_id": {
              "type": "long",
              "include_in_all": "false"
          },
          "title": {
              "type": "text",
              "analyzer": "ik_max_word",
              "include_in_all": "true",
              "boost": 2
          },
          "content": {
              "type": "text",
              "analyzer": "ik_max_word",
              "include_in_all": "true"
          },
          "status": {
              "type": "integer",
              "include_in_all": "false"
          },
          "create_time": {
              "type": "date",
              "include_in_all": "false"
          }
      }
}'

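Field reference

_all:
	Concatenates the values of all other fields into one big string, separated by spaces, which is then analyzed and indexed
analyzer:
	The analyzer to use; if omitted, the default standard analyzer is used.
include_in_all:
	Controls whether the field is included in _all. Defaults to true.
boost:
	Raises the field's weight when the relevance score is computed at query time. Here, for example, title counts twice as much as the other fields.

Simple field types

Strings: text, keyword (a single string type in Elasticsearch 2.x)
Integers: byte, short, integer, long
Floating point: float, double
Boolean: boolean
Dates: date

View a type (view the table structure)

Format: host/index/_mapping/type?pretty
curl 127.0.0.1:9200/articles/_mapping/article?pretty

Modify a type by adding a field (add a table column)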

curl -X PUT 127.0.0.1:9200/articles/_mapping/article -H 'Content-Type:application/json' -d '
{
  "properties": {
    "new_tag": {
      "type": "text"
    }
  }
}'

The type of an existing field cannot be changed; for example, changing the type of status to byte returns an error

curl -X PUT 127.0.0.1:9200/articles/_mapping/article -H 'Content-Type:application/json' -d '
{
  "properties": {
    "status": {
      "type": "byte"
    }
  }
}'

# If a field really has to change, the only way is to drop the index and build it again.

10.Reindexing (changing a field type)

Steps

1.Create a new index named articles_v2

curl -X PUT 127.0.0.1:9200/articles_v2 -H 'Content-Type: application/json' -d '{
   "settings" : {
      "index": {
          "number_of_shards" : 3,
          "number_of_replicas" : 1
       }
   }
}'

2.Create the new type and mapping (the new table structure)

curl -X PUT 127.0.0.1:9200/articles_v2/_mapping/article -H 'Content-Type: application/json' -d'
{
     "_all": {
          "analyzer": "ik_max_word"
      },
      "properties": {
          "article_id": {
              "type": "long",
              "include_in_all": "false"
          },
          "user_id": {
               "type": "long",
              "include_in_all": "false"
          },
          "title": {
              "type": "text",
              "analyzer": "ik_max_word",
              "include_in_all": "true",
              "boost": 2
          },
          "content": {
              "type": "text",
              "analyzer": "ik_max_word",
              "include_in_all": "true"
          },
          "status": {
              "type": "byte",
              "include_in_all": "false"
          },
          "create_time": {
              "type": "date",
              "include_in_all": "false"
          }
      }
}'

3.Reindex the data (copy it from the old table into the new one)

curl -X POST 127.0.0.1:9200/_reindex -H 'Content-Type:application/json' -d '
{
  "source": {
    "index": "articles"
  },
  "dest": {
    "index": "articles_v2"
  }
}'

# Copies the data from articles into articles_v2

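4.Alias the index. Giving the new index an alias with the old index's name lets the application keep working without downtime.

# 4.1 Delete the old index
curl -X DELETE 127.0.0.1:9200/articles
# 4.2 Add the alias
curl -X PUT 127.0.0.1:9200/articles_v2/_alias/articles
# 4.3 View the new index's mapping through the old index name
curl 127.0.0.1:9200/articles/_mapping/article?pretty

Look up index aliases

# 1.See which index an alias points to
curl 127.0.0.1:9200/*/_alias/articles
# 2.See which aliases point to an index
curl 127.0.0.1:9200/articles_v2/_alias/*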

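11.Indexing documents (saving document data) (create, read, update, delete)

Format

1.With a custom id

PUT /{index}/{type}/{id}
{
  "field": "value"
}

2.With an auto-generated document id (note the POST method)

POST /{index}/{type}
{
  "field": "value"
}

Add an article document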

curl -X PUT 127.0.0.1:9200/articles/article/150000 -H 'Content-Type:application/json' -d '
{
  "article_id": 150000,
  "user_id": 1,
  "title": "python是世界上最好的语言",
  "content": "确实如此",
  "status": 2,
  "create_time": "2019-04-03"
}'

Get the document with a given id

curl 127.0.0.1:9200/articles/article/150000?pretty

Use the _source parameter to select the returned fields

curl 127.0.0.1:9200/articles/article/150000?_source=title,content\&pretty

Check whether a document exists

curl -I 127.0.0.1:9200/articles/article/150000

# Status code returned: 200 if it exists, 404 if it does not

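Update a document

Documents in Elasticsearch are immutable; they cannot be modified in place.
To change one, index it again with the same API used to add it. This replaces the entire content rather than updating individual fields.
Internally the old document is deleted and a new one is added; watch the _version value in the response.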
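
The overwrite behaviour can be mimicked with a small in-memory model. `ToyIndex` is a hypothetical stand-in written for this illustration, not the Elasticsearch client; only the field names `_version`, `_source`, and `created` are borrowed from real responses.

```python
class ToyIndex:
    """In-memory stand-in for an index (illustration only)."""

    def __init__(self):
        self._docs = {}

    def index(self, doc_id, source):
        """Store a full document, replacing any previous one and bumping _version."""
        previous = self._docs.get(doc_id)
        version = previous["_version"] + 1 if previous else 1
        self._docs[doc_id] = {"_version": version, "_source": dict(source)}
        return {"_id": doc_id, "_version": version, "created": previous is None}

    def get(self, doc_id):
        """Return the stored document with its current version."""
        return self._docs[doc_id]


toy = ToyIndex()
first = toy.index(150000, {"title": "python", "status": 2})
second = toy.index(150000, {"title": "python web"})
print(first)
print(second)
print(toy.get(150000))
```

After the second call the status field is gone, because the whole document was replaced, and `_version` has gone from 1 to 2.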

Delete a document

curl -X DELETE 127.0.0.1:9200/articles/article/150000

Get multiple documents

curl -X GET 127.0.0.1:9200/_mget?pretty -d '
{
  "docs": [
    {
      "_index": "articles",
      "_type": "article",
      "_id": 150000
    },
    {
      "_index": "articles",
      "_type": "article",
      "_id": 150001
    }
  ]
}'
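
When the ids come from application code, the _mget body can be generated instead of written by hand. A small sketch; the helper name `build_mget_body` and its default arguments are assumptions for this example.

```python
import json


def build_mget_body(ids, index="articles", doc_type="article"):
    """Build the request body for _mget from a list of document ids."""
    return {
        "docs": [
            {"_index": index, "_type": doc_type, "_id": doc_id}
            for doc_id in ids
        ]
    }


body = build_mget_body([150000, 150001])
print(json.dumps(body, indent=2))
```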

12.Importing data with Logstash

Unpack the connector

tar -zxvf mysql-connector-java-8.0.13.tar.gz

Write the script

# vi /home/python/logstash_mysql.conf
input{
     jdbc {
         jdbc_driver_library => "/home/python/mysql-connector-java-8.0.13/mysql-connector-java-8.0.13.jar"
         jdbc_driver_class => "com.mysql.jdbc.Driver"
         jdbc_connection_string => "jdbc:mysql://127.0.0.1:3306/toutiao?tinyInt1isBit=false"
         jdbc_user => "root"
         jdbc_password => "mysql"
         jdbc_paging_enabled => "true"
         jdbc_page_size => "1000"
         jdbc_default_timezone =>"Asia/Shanghai"
         statement => "select a.article_id as article_id,a.user_id as user_id, a.title as title, a.status as status, a.create_time as create_time,  b.content as content from news_article_basic as a inner join news_article_content as b on a.article_id=b.article_id"
         use_column_value => "true"
         tracking_column => "article_id"
         clean_run => true
     }
}
output{
      elasticsearch {
         hosts => "127.0.0.1:9200"
         index => "articles"
         document_id => "%{article_id}"
         document_type => "article"
      }
      stdout {
         codec => json_lines
     }
}

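Run the script

sudo /usr/share/logstash/bin/logstash -f /home/python/logstash_mysql.conf

Notes

1.Do not run this from an SSH tool; run it directly inside the virtual machine.
2.Once data starts streaming, press ctrl + c repeatedly to stop; do not load all the data. Otherwise the machine becomes very sluggish, because the database holds a large number of articles.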

13.Basic queries

By document id

curl -X GET 127.0.0.1:9200/articles/article/1
curl -X GET 127.0.0.1:9200/articles/article/1?_source=title,user_id
curl -X GET 127.0.0.1:9200/articles/article/1?_source=false

Match all (returns 10 hits by default)

curl -X GET 127.0.0.1:9200/articles/article/_search?_source=title,user_id\&pretty

Pagination: from is the starting offset, size is the number of hits per page

curl -X GET 127.0.0.1:9200/articles/article/_search?_source=title,user_id\&from=1\&size=2\&pretty
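
Because from is a zero-based offset, from=1 with size=2 in the request above skips the first hit and returns the second and third. Applications usually think in page numbers, so a small conversion helper is handy; `page_to_from_size` is a hypothetical name used for this sketch.

```python
def page_to_from_size(page, per_page):
    """Translate a 1-based page number into the from/size pair used by _search."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    return {"from": (page - 1) * per_page, "size": per_page}


print(page_to_from_size(1, 2))  # {'from': 0, 'size': 2}
print(page_to_from_size(3, 2))  # {'from': 4, 'size': 2}
```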

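Full-text search with the q parameter

# Search the content field of all documents for documents related to the keywords "python web". %20 stands for a space
q=content:python%20web
# Search the title field of all documents for documents related to the keywords "python web".
q=title:python%20web
# Search the _all field of all documents for documents related to the keywords "python web".
q=_all:python%20web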

curl -X GET 127.0.0.1:9200/articles/article/_search?_source=title,user_id\&pretty\&q=_all:python%20web
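
The %20 escaping does not have to be done by hand. A minimal sketch using the standard library's `urllib.parse.quote`; the helper name `build_search_url` and its parameters are assumptions for this example.

```python
from urllib.parse import quote


def build_search_url(field, keywords, source_fields, host="127.0.0.1:9200"):
    """Build a query-string search URL for the articles index."""
    # Keep ":" readable; spaces and non-ASCII characters are percent-encoded.
    q = quote("{}:{}".format(field, keywords), safe=":")
    source = ",".join(source_fields)
    return "http://{}/articles/article/_search?_source={}&q={}&pretty".format(
        host, source, q
    )


url = build_search_url("_all", "python web", ["title", "user_id"])
print(url)
# http://127.0.0.1:9200/articles/article/_search?_source=title,user_id&q=_all:python%20web&pretty
```

The same call also percent-encodes Chinese keywords, which a shell command line would otherwise pass through unescaped.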

14.Advanced queries

Instead of a query string, send the query as a JSON request body.
Full-text search: match

1.Search the title field

curl -X GET 127.0.0.1:9200/articles/article/_search -d'
{
    "query" : {
        "match" : {
            "title" : "python web"
        }
    }
}'

2.Search the title field, starting from hit 0 and returning 5 documents. The returned fields are article_id and title.

curl -X GET 127.0.0.1:9200/articles/article/_search?pretty -d'
{
    "from": 0,
    "size": 5,
    "_source": ["article_id","title"],
    "query" : {
        "match" : {
            "title" : "python web"
        }
    }
}'

3.Search the _all field, starting from hit 0 and returning 5 documents. The returned fields are article_id and title.

curl -X GET 127.0.0.1:9200/articles/article/_search?pretty -d'
{
    "from": 0,
    "size": 5,
    "_source": ["article_id","title"],
    "query" : {
        "match" : {
            "_all" : "python web 编程"
        }
    }
}'

Phrase search: match_phrase

# Search the content field for the phrase "python web"; return 5 documents with the fields article_id and content
curl -X GET 127.0.0.1:9200/articles/article/_search?pretty -d'
{
    "size": 5,
    "_source": ["article_id","content"],
    "query" : {
        "match_phrase" : {
            "content" : "python web"
        }
    }
}'

Exact match: term

# Get documents whose user_id is 1; return 5 documents with the fields article_id, title, and user_id
curl -X GET 127.0.0.1:9200/articles/article/_search?pretty -d'
{
    "size": 5,
    "_source": ["article_id","title", "user_id"],
    "query" : {
        "term" : {
            "user_id" : 1
        }
    }
}'

Range search: range

# Get documents whose article_id is between 3 and 5 inclusive; return 5 documents.
# The fields are article_id, title, and user_id
curl -X GET 127.0.0.1:9200/articles/article/_search?pretty -d'
{
    "size": 5,
    "_source": ["article_id","title", "user_id"],
    "query" : {
        "range" : {
            "article_id": { 
                "gte": 3,
                "lte": 5
            }
        }
    }
}'

Highlighted search: highlight

curl -X GET 127.0.0.1:9200/articles/article/_search?pretty -d '
{
    "size":2,
    "_source": ["article_id", "title", "user_id"],
    "query": {
        "match": {
             "title": "python web 编程"
         }
     },
     "highlight":{
          "fields": {
              "title": {}
          }
     }
}'

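Compound queries

# must: a document must match these clauses to be included.
# must_not: a document must not match these clauses to be included.
# should: if a document matches any of these clauses its _score increases; otherwise there is no effect. They are mainly used to refine each document's relevance score
# filter: must match, but runs in non-scoring filter mode. These clauses do not contribute to the score; they only include or exclude documents.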

curl -X GET 127.0.0.1:9200/articles/article/_search?pretty -d '
{
  "_source": ["title", "user_id"],
  "query": {
      "bool": {
          "must": {
              "match": {
                  "title": "python web"
              }
          },
          "filter": {
              "term": {
                  "user_id": 2
              }
          }
      }
  }
}'

Sorting

curl -X GET 127.0.0.1:9200/articles/article/_search?pretty -d'
{
    "size": 5,
    "_source": ["article_id","title"],
    "query" : {
        "match" : {
            "_all" : "python web"
        }
    },
    "sort": [
        { "create_time":  { "order": "desc" }},
        { "_score": { "order": "desc" }}
    ]
}'

boost: raise a field's weight to tune the ranking

curl -X GET 127.0.0.1:9200/articles/article/_search?pretty -d'
{
    "size": 5,
    "_source": ["article_id","title"],
    "query" : {
        "match" : {
            "title" : {
                "query": "python web",
                "boost": 4
            }
        }
    }
}'

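15.Using Elasticsearch from Python

Install (the elasticsearch5 package provides the module imported below)
```
pip install elasticsearch5
```
File: d01_elasticsearch_flask.py

1.Import the modules

# Client for Elasticsearch 5.x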
from elasticsearch5 import Elasticsearch
from flask import Flask, request, current_app, jsonify

2.Configure the addresses of the Elasticsearch cluster nodes

app = Flask(__name__)

class Config(object):
    # Addresses of the Elasticsearch cluster nodes
    ES = [
        '127.0.0.1:9200'
    ]

app.config.from_object(Config)

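3.Create the Elasticsearch client object

# Create the Elasticsearch client object
app.es = Elasticsearch(
    app.config.get('ES'),
    # Sniff the cluster nodes before starting
    sniff_on_start=True,
    # Refresh the node list when a connection to a node fails
    sniff_on_connection_fail=True,
    # Refresh the node list every 60 seconds
    sniffer_timeout=60
)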

4.Build the query dictionary and run the query

@app.route('/')
def index():
    # Full-text search for q among articles whose status is 2
    q = request.args.get('q')
    query = {
        'query': {
            'bool': {
                'must': [
                    {'match': {'_all': q}}
                ],
                'filter': [
                    {'term': {'status': 2}}
                ]
            }
        }
    }
    ret = current_app.es.search(index='articles', doc_type='article', body=query)
    return jsonify(ret)
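
The view above passes q straight into the query, so a request without a q parameter would send a null match value. One way to guard against that, and to keep the query testable without a running cluster, is to build the body in a separate function. The name `build_article_query` and its pagination defaults are assumptions for this sketch.

```python
def build_article_query(q, status=2, page=1, per_page=10):
    """Build the search body: full-text match on _all, filtered by status."""
    if not q or not q.strip():
        raise ValueError("search keyword q is required")
    return {
        "from": (page - 1) * per_page,
        "size": per_page,
        "_source": ["article_id", "title"],
        "query": {
            "bool": {
                "must": [{"match": {"_all": q.strip()}}],
                "filter": [{"term": {"status": status}}],
            }
        },
    }


body = build_article_query("python web", page=2, per_page=5)
print(body)
```

The view could then call `current_app.es.search(index='articles', doc_type='article', body=body)` and return a 400 response when `ValueError` is raised.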

16.Search code in the Toutiao project: walkthrough & demo

17.Suggest queries

1.Spelling correction

# When the user types a misspelled keyword such as phtyon web, ES can suggest the correct spelling python web based on the indexed data
curl 127.0.0.1:9200/articles/article/_search?pretty -d '
{
    "from": 0,
    "size": 10,
    "_source": false,
    "suggest": {
        "text": "phtyon web",
        "word-phrase": {
            "phrase": {
                "field": "_all",
                "size": 1
            }
        }
    }
}'

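# word-phrase is the name under which the suggestions are returned.
# phrase selects the phrase suggester, field is the field to draw suggestions from, and size: 1 returns a single suggestion.

2.Auto-completion

# Auto-completion uses the completion suggester, which needs a field of type completion, so the existing article index cannot be used; a separate index is needed

2.1 Create the auto-completion index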

curl -X PUT 127.0.0.1:9200/completions -H 'Content-Type: application/json' -d'
{
   "settings" : {
       "index": {
           "number_of_shards" : 3,
           "number_of_replicas" : 1
       }
   }
}'

2.2 Create the mapping type for auto-completion

curl -X PUT 127.0.0.1:9200/completions/_mapping/words -H 'Content-Type: application/json' -d'
{
     "words": {
          "properties": {
              "suggest": {
                  "type": "completion",
                  "analyzer": "ik_max_word"
              }
          }
     }
}'
# The suggestion field must be of type completion.

2.3 Import the initial auto-completion data with Logstash

Create the script:

# vi /home/python/logstash_mysql_completion.conf
input{
     jdbc {
         jdbc_driver_library => "/home/python/mysql-connector-java-8.0.13/mysql-connector-java-8.0.13.jar"
         jdbc_driver_class => "com.mysql.jdbc.Driver"
         jdbc_connection_string => "jdbc:mysql://127.0.0.1:3306/toutiao?tinyInt1isBit=false"
         jdbc_user => "root"
         jdbc_password => "mysql"
         jdbc_paging_enabled => "true"
         jdbc_page_size => "1000"
         jdbc_default_timezone =>"Asia/Shanghai"
         statement => "select title as suggest from news_article_basic"
         clean_run => true
     }
}
output{
      elasticsearch {
         hosts => "127.0.0.1:9200"
         index => "completions"
         document_type => "words"
      }
}

Run the script:

sudo /usr/share/logstash/bin/logstash -f /home/python/logstash_mysql_completion.conf

2.4 Auto-completion suggest query

curl 127.0.0.1:9200/completions/words/_search?pretty -d '
{
    "suggest": {
        "title-suggest" : {
            "prefix" : "pyth", 
            "completion" : { 
                "field" : "suggest" 
            }
        }
    }
}'

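18.Implementing suggest queries in Toutiao

Requirement: auto-completion and spelling correction while searching.

Approach:
  1.First query the completions index with the keyword to get completion suggestions
  2.If no completions come back, the keyword may be misspelled, so run a spelling-correction suggest query against the articles index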
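
The two-step approach can be sketched as follows. This is not the project's actual code: the function names are hypothetical, `es` is assumed to be a client whose `search(index, doc_type, body)` method returns the suggest response as a dictionary, and `FakeES` returns canned responses so the flow can run without a cluster.

```python
def extract_options(response, name):
    """Collect the suggestion texts returned under the given suggester name."""
    entries = response.get("suggest", {}).get(name, [])
    return [option["text"] for entry in entries for option in entry["options"]]


def suggest(es, keyword):
    """Try completion first; fall back to spelling correction if nothing matches."""
    completion_body = {
        "suggest": {
            "title-suggest": {"prefix": keyword, "completion": {"field": "suggest"}}
        }
    }
    resp = es.search(index="completions", doc_type="words", body=completion_body)
    options = extract_options(resp, "title-suggest")
    if options:
        return options

    phrase_body = {
        "_source": False,
        "suggest": {
            "text": keyword,
            "word-phrase": {"phrase": {"field": "_all", "size": 1}},
        },
    }
    resp = es.search(index="articles", doc_type="article", body=phrase_body)
    return extract_options(resp, "word-phrase")


class FakeES:
    """Hypothetical stand-in for the client, returning canned responses."""

    def search(self, index, doc_type, body):
        if index == "completions":
            return {"suggest": {"title-suggest": [{"options": []}]}}
        return {"suggest": {"word-phrase": [{"options": [{"text": "python web"}]}]}}


print(suggest(FakeES(), "phtyon web"))  # ['python web']
```

With a real client, `suggest(current_app.es, keyword)` would follow the same path, using the completions and articles indices created earlier.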