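Elasticsearch's elected master node manages how shards are allocated across the data nodes. As long as there are enough data nodes, it automatically assigns shards (both primary and replica) to them. In some special situations, however, shards can remain unassigned. If the unassigned shards are replica shards, the whole cluster is in the Yellow state. Provided you have enough replica copies, Yellow does not affect the overall availability of the cluster (although search performance may drop), and in many cases the cluster recovers on its own without any manual intervention, for example when a data node is automatically restarted during patching or system maintenance.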
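If a shard stays unassigned for a long time, however, it deserves special attention and usually requires manual intervention. One example is a corrupted index file on a node. Use the GET /_cat/shards API to see which shards are unassigned; the log file on the affected node will contain entries like the ones below. This kind of corruption can have several causes, and deleting the corrupted files directly resolves the problem. Note that the failed copy in this log is a replica (the [R] flag), which can then be recovered from its primary.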
[2014-12-22 16:03:27,347][WARN ][index.engine.internal ] [ES-10-data] [ppub-v5][22] failed engine [corrupted preexisting index]
[2014-12-22 16:03:27,347][WARN ][indices.cluster ] [ES-10-data] [ppub-v5][22] failed to start shard
org.apache.lucene.index.CorruptIndexException: [ppub-v5][22]
Corrupted index [corrupted_Ccq5Hd42SuGbQWk_aB9CFw] caused by: CorruptIndexException[codec footer mismatch: actual footer=817092787 vs expected footer=-1071082520 (resource: MMapIndexInput(path="D:\IndexStorage\Data\search-cluster\nodes\0\indices\ppub-v5\22\index\_48jet.fdt"))]
at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:353)
at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:338)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyInitializingShard(IndicesClusterStateService.java:727)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewOrUpdatedShards(IndicesClusterStateService.java:580)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:184)
at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:444)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:153)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[2014-12-22 16:03:27,347][WARN ][cluster.action.shard ] [ES-10-data] [ppub-v5][22] sending failed shard for [ppub-v5][22], node[pK7CwHokQKCSk8TEVlCsWA], [R], s[INITIALIZING], indexUUID [4QlLH670Q_Kd6rb1WXhgaQ], reason [Failed to start shard, message [CorruptIndexException[[ppub-v5][22]
Corrupted index [corrupted_Ccq5Hd42SuGbQWk_aB9CFw] caused by: CorruptIndexException[codec footer mismatch: actual footer=817092787 vs expected footer=-1071082520 (resource: MMapIndexInput(path="D:\IndexStorage\Data\search-cluster\nodes\0\indices\ppub-v5\22\index\_48jet.fdt"))]]]]
[2014-12-22 16:03:27,347][WARN ][cluster.action.shard ] [ES-10-data] [ppub-v5][22] sending failed shard for [ppub-v5][22], node[pK7CwHokQKCSk8TEVlCsWA], [R], s[INITIALIZING], indexUUID [4QlLH670Q_Kd6rb1WXhgaQ], reason [engine failure, message [corrupted preexisting index][CorruptIndexException[[ppub-v5][22] Corrupted index [corrupted_Ccq5Hd42SuGbQWk_aB9CFw] caused by: CorruptIndexException[codec footer mismatch: actual footer=817092787 vs expected footer=-1071082520 (resource: MMapIndexInput(path="D:\IndexStorage\Data\search-cluster\nodes\0\indices\ppub-v5\22\index\_48jet.fdt"))]]]]
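As a rough sketch of the GET /_cat/shards check, the following Python snippet lists the shards that are currently unassigned. It assumes the cluster is reachable over plain HTTP at a base URL you supply (for example http://localhost:9200, which is an assumption and not taken from the cluster above), and it uses the `h` parameter to limit the output to four columns. The function names are hypothetical.

```python
import urllib.request


def parse_unassigned(cat_shards_text):
    """Return (index, shard, prirep) tuples for rows whose state is UNASSIGNED.

    Expects the plain-text output of
    GET /_cat/shards?h=index,shard,prirep,state
    which is four whitespace-separated columns per line.
    """
    unassigned = []
    for line in cat_shards_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        index, shard, prirep, state = fields[:4]
        if state == "UNASSIGNED":
            unassigned.append((index, shard, prirep))
    return unassigned


def fetch_unassigned(base_url):
    """Query a live cluster; base_url is an assumption, e.g. http://localhost:9200."""
    url = base_url.rstrip("/") + "/_cat/shards?h=index,shard,prirep,state"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_unassigned(resp.read().decode("utf-8"))
```

In the `prirep` column, `p` marks a primary and `r` a replica, so a result such as `("ppub-v5", "22", "r")` would correspond to the replica shard that failed in the log above.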
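When you run an Elasticsearch cluster in production, there is usually a monitoring mechanism that automatically watches the cluster state, for example by calling the GET /_cluster/health API. If the returned status is Red, someone needs to respond immediately; if it is Yellow and stays that way for a long time, someone also needs to look into it promptly. The Elasticsearch log files are the first place to look when analyzing and tracking down cluster problems. For slightly more complex problems you need to check the log files on the master nodes, the client nodes, and the data nodes.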
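The monitoring rule described here (respond to Red at once, look into Yellow only when it persists) could be sketched roughly as follows. This is a minimal illustration, not a production monitor: the 30-minute grace period, the action names, and the function names are all assumptions, and the base URL is a placeholder you would supply.

```python
import json
import urllib.request


def classify(health, yellow_since, now, yellow_grace_seconds=1800):
    """Decide what to do with one /_cluster/health response.

    Returns (action, yellow_since), where action is one of
    "page", "investigate", "wait" or "ok" (hypothetical names),
    and yellow_since is the time Yellow was first observed, or None.
    """
    status = health["status"]
    if status == "red":
        return "page", None  # Red: someone must respond immediately
    if status == "yellow":
        started = now if yellow_since is None else yellow_since
        if now - started >= yellow_grace_seconds:
            return "investigate", started  # Yellow for too long
        return "wait", started  # may still recover on its own
    return "ok", None


def fetch_health(base_url):
    """Call GET /_cluster/health; base_url is an assumption."""
    url = base_url.rstrip("/") + "/_cluster/health"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A monitor would call `fetch_health` on a schedule, pass the result to `classify` together with the current time (for instance from `time.time()`), and carry the returned `yellow_since` value into the next call.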
Elasticsearch cluster: resolving unassigned shards