Notes on How ElasticSearch Works

行走江湖

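This article explains how ElasticSearch works: near-real-time search, the distributed directory, the concept of partitions, the pros and cons of term-based versus document-based partitioning, the effect of the number of shards, facet search, aliases, and more, with practical usage examples. It also introduces the routing mechanism and what it is for.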

For the design principles, the slides by ElasticSearch's author Shay Banon are all you need:

The Road to a Distributed, (Near) Real Time, Search Engine

Shay Banon - @kimchy
https://speakerdeck.com/kimchy/the-road-to-a-distributed-search-engine

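Here is a brief summary. The talk first explains how Lucene implements near-real-time search: the IndexWriter#getReader method quickly returns a reader over the updated index. Updating a document does not hurt near-real-time behavior, because an update is split into two steps that are both fast. The first step, deletion, only marks the document as deleted in the on-disk segment (it is not physically removed). The second step indexes the new document directly in an in-memory segment. In other words, an update is a delete followed by a re-index, so both newly added and updated documents end up in memory, and that is what makes near-real-time search possible.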
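The two-step update described above can be sketched with a toy model. This is only an illustration under simplifying assumptions: the class `ToyIndex` and its fields are hypothetical names, and plain dictionaries stand in for Lucene's real segment structures.

```python
class ToyIndex:
    """Toy model of delete-then-re-add updates; not Lucene's real data structures."""

    def __init__(self):
        self.disk_segment = {}    # doc_id -> text, already flushed to disk
        self.deleted = set()      # doc_ids marked as deleted in the disk segment
        self.memory_segment = {}  # doc_id -> text, recently indexed in memory

    def add(self, doc_id, text):
        # New documents are indexed in memory first.
        self.memory_segment[doc_id] = text

    def flush(self):
        # Move in-memory documents to disk (segment merging is heavily simplified).
        for doc_id in self.memory_segment:
            self.deleted.discard(doc_id)
        self.disk_segment.update(self.memory_segment)
        self.memory_segment = {}

    def update(self, doc_id, text):
        # Step 1: only mark the old on-disk copy as deleted; nothing is rewritten.
        if doc_id in self.disk_segment:
            self.deleted.add(doc_id)
        # Step 2: index the new version in the in-memory segment.
        self.memory_segment[doc_id] = text

    def search(self, word):
        # A reader sees live on-disk documents plus everything in memory.
        hits = {d for d, t in self.disk_segment.items()
                if d not in self.deleted and word in t.split()}
        hits |= {d for d, t in self.memory_segment.items() if word in t.split()}
        return sorted(hits)


index = ToyIndex()
index.add("1", "old title")
index.flush()
index.update("1", "new title")
print(index.search("old"))  # the stale on-disk copy is hidden by the delete marker
print(index.search("new"))  # the new version is visible without another flush
```

Neither step rewrites the on-disk segment, which is why the update stays cheap in this model.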

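The talk then discusses the drawbacks of a Distributed Directory (as in Compass and Solr), such as the inability to write in parallel. To solve this, it introduces the concept of a Partition and weighs the pros and cons of Term Based versus Document Based partitioning. Having chosen Document Based Partitioning, it adds Replicated Lucene Shards to improve availability and search performance. It also replaces pull replication with Push Replication in order to support real-time search and hot backup / hot recovery.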
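To make the two partitioning schemes concrete, here is a minimal sketch over a toy inverted index. The sample documents, the helper names, and the use of `zlib.crc32` as a hash are all assumptions for illustration; the slides do not prescribe any of them.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 2
DOCS = {
    "d1": "fast search engine",
    "d2": "distributed search",
    "d3": "fast replication",
}


def bucket(key):
    # Stable stand-in hash, chosen only so that the example is deterministic.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def document_based(docs):
    # Each partition holds a complete inverted index for its own subset of documents.
    parts = [defaultdict(set) for _ in range(NUM_PARTITIONS)]
    for doc_id, text in docs.items():
        for term in text.split():
            parts[bucket(doc_id)][term].add(doc_id)
    return parts


def term_based(docs):
    # Each partition holds the full posting list for its own subset of terms.
    parts = [defaultdict(set) for _ in range(NUM_PARTITIONS)]
    for doc_id, text in docs.items():
        for term in text.split():
            parts[bucket(term)][term].add(doc_id)
    return parts


def search_document_based(parts, term):
    # The query fans out to every partition and the results are merged.
    hits = set()
    for part in parts:
        hits |= part.get(term, set())
    return sorted(hits)


def search_term_based(parts, term):
    # The query goes only to the partition that owns the term.
    return sorted(parts[bucket(term)].get(term, set()))


print(search_document_based(document_based(DOCS), "fast"))
print(search_term_based(term_based(DOCS), "fast"))
```

Both layouts return the same hits. What differs is the cost: with document-based partitioning, indexing one document touches a single partition while a query visits all of them; with term-based partitioning, a single-term query visits one partition while indexing one document may touch many.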

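The later slides show practical examples of ElasticSearch in use: adding and removing nodes, adding and removing shards, multi-tenancy, and so on.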

################################

One important concept not covered above is routing. Before introducing routing, we need to understand the relationship between indices, documents, and shards.

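The number of shards must be specified when an index is created and cannot be changed afterwards. Every document is assigned to one specific shard, and the method used to make that assignment is routing.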

Here is the official definition of routing:

When you index a document, it is stored on a single primary shard. That shard is chosen by hashing the routing value. By default, the routing value is derived from the ID of the document or, if the document has a specified parent document, from the ID of the parent document (to ensure that child and parent documents are stored on the same shard).

This value can be overridden by specifying a routing value at index time, or a routing field in the mapping.

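In short, routing hashes a document to a specific primary shard, which avoids searching the primary shards one by one and speeds up lookup. The hash distribution could potentially be uneven, and I am not yet sure how ElasticSearch deals with that.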
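The routing rule quoted above can be sketched as "hash the routing value, then take it modulo the number of primary shards". The function names below are hypothetical, and `zlib.crc32` is only a stand-in for whatever hash function ElasticSearch actually uses.

```python
import zlib
from collections import Counter


def routing_value(doc_id, parent_id=None, routing=None):
    # An explicit routing value wins; otherwise use the parent ID if there is
    # one, and fall back to the document's own ID.
    if routing is not None:
        return routing
    return parent_id if parent_id is not None else doc_id


def pick_shard(value, num_primary_shards):
    # Stand-in hash (an assumption); the real hash function is not shown here.
    return zlib.crc32(value.encode("utf-8")) % num_primary_shards


# A child document is routed by its parent's ID, so both land on the same shard.
parent_shard = pick_shard(routing_value("post-1"), 5)
child_shard = pick_shard(routing_value("comment-9", parent_id="post-1"), 5)
print(parent_shard == child_shard)

# Rough look at how evenly 10,000 sequential IDs spread over 5 shards.
spread = Counter(pick_shard(routing_value(f"doc-{i}"), 5) for i in range(10_000))
print(sorted(spread.items()))
```

Because a lookup by ID can recompute the same hash, a get request goes straight to one shard instead of asking all of them. The counter at the end is one simple way to check the distribution question empirically for a given hash and set of IDs.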

################################

Incidentally, there is another slide deck that covers ElasticSearch use cases:

elasticsearch.
Big Data, Search, and Analytics

Shay Banon (@kimchy)
https://speakerdeck.com/kimchy/elasticsearch-big-data-search-analytics

It discusses the effect of the number of shards, facet search, aliases, and so on.

################################

Below are some glossary entries, excerpted from http://www.elasticsearch.org/guide/appendix/glossary.html

cluster:

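A cluster consists of one or more nodes that share the same cluster name. Each cluster has a single master node, which is chosen automatically and is automatically replaced if it fails.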

node:

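A node is a running instance of elasticsearch. Multiple nodes can be started on a single server for testing purposes, but usually there is one node per server.

At startup, a node uses broadcast (or multicast, if specified) to discover an existing cluster and tries to join it.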

index:

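An index is a bit like a "database" in a relational database; it contains the schema mappings for its different types.

An index is a logical namespace that has one or more primary shards and can have zero or more replica shards.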

shard:

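A shard is a single Lucene instance, a low-level worker unit that is managed automatically by elasticsearch. An index is a logical namespace that contains primary and replica shards.

Other than defining the number of primary shards and replica shards, you never need to refer to shards directly; your code only has to deal with the index.

Elasticsearch distributes all shards across the cluster and automatically reallocates them when nodes are added or removed.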

primary shard:

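Each document is stored in a single primary shard. When a document is indexed, it is indexed first on the primary shard and then on all replica shards of that primary.

By default, an index has 5 primary shards. You can specify fewer or more primary shards to scale the number of documents that your index can handle.

Once an index has been created, the number of primary shards in it cannot be changed.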
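One way to see why the primary shard count is fixed: if a document's shard is computed as a hash modulo the number of primary shards, changing that number sends most existing documents to a different shard, so lookups would miss them. The sketch below assumes that modulo scheme and uses `zlib.crc32` as a stand-in hash; the function names are hypothetical.

```python
import zlib


def pick_shard(doc_id, num_primary_shards):
    # Stand-in hash (an assumption), reduced modulo the number of primary shards.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def moved_fraction(doc_ids, old_count, new_count):
    # Fraction of documents whose computed shard changes with the shard count.
    moved = sum(
        1 for d in doc_ids
        if pick_shard(d, old_count) != pick_shard(d, new_count)
    )
    return moved / len(doc_ids)


doc_ids = [f"doc-{i}" for i in range(10_000)]
print(moved_fraction(doc_ids, 5, 5))  # same shard count: nothing moves
print(moved_fraction(doc_ids, 5, 6))  # one extra shard: most documents move
```

With a well-mixed hash, going from 5 to 6 shards leaves only about one document in six on its original shard, so the whole index would effectively have to be rebuilt.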

replica shard:

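Each primary shard has zero or more replica shards. A replica is a copy of the primary and serves two purposes:

a) Better failover: a replica can be promoted to primary if the primary fails

b) Better performance: get and search requests can be handled by either the primary or a replica

By default, each primary has one replica, but the number of replicas of an index can be changed dynamically. A replica is never started on the same node as its primary.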
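The placement rule in the last sentence can be illustrated with a toy allocator. This is a simplified sketch, not ElasticSearch's real allocation algorithm: the function `allocate` and the round-robin strategy are assumptions made for the example.

```python
def allocate(num_primaries, num_replicas, nodes):
    """Toy allocator: every copy of a shard goes to a different node."""
    placement = {}  # (shard, copy) -> node name; copy 0 is the primary
    for shard in range(num_primaries):
        for copy in range(num_replicas + 1):
            if copy >= len(nodes):
                # Not enough nodes to keep the copies apart: leave it unassigned.
                placement[(shard, copy)] = None
            else:
                # Offsetting by the copy number keeps copies on distinct nodes.
                placement[(shard, copy)] = nodes[(shard + copy) % len(nodes)]
    return placement


three_nodes = allocate(num_primaries=5, num_replicas=1, nodes=["n1", "n2", "n3"])
for shard in range(5):
    print(shard, three_nodes[(shard, 0)], three_nodes[(shard, 1)])

# On a single node the replica has nowhere to go and stays unassigned.
single_node = allocate(num_primaries=1, num_replicas=1, nodes=["n1"])
print(single_node)
```

Keeping copies on separate nodes is what makes the failover purpose in a) work: losing one node never takes out both the primary and its replica.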

routing:

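When a document is indexed, it is stored on a single primary shard. That shard is chosen by hashing the routing value. By default, the routing value is derived from the ID of the document or, if the document has a specified parent document, from the ID of the parent document (to ensure that child and parent documents are stored on the same shard).

This value can be specified at index time, or given through a routing field in the mapping.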