Troubleshooting an ES cluster in yellow status

License: CC 4.0 BY-SA

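Background:

In our project, the full-text search endpoint was taking more than 30 s to respond. Tracing the endpoint logic showed that most of the time was spent in the ES query, so we turned to the ES cluster. Running the DSL generated by the request directly in Kibana confirmed that the response time was indeed too long, so we started checking the health of the cluster.

Analyzing the cluster through the ES APIs gave the following results:

1. The cluster health status was yellow, with a large number of replica shards unassigned;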

{
  "cluster_name" : "cdb*",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : ***,
  "number_of_data_nodes" : ***,
  "active_primary_shards" : ***,
  "active_shards" : ***,
  "relocating_shards" : ***,
  "initializing_shards" : ***,
  "unassigned_shards" : 214, // note this value
  "delayed_unassigned_shards" : ***,
  "number_of_pending_tasks" : ***,
  "number_of_in_flight_fetch" : ***,
  "task_max_waiting_in_queue_millis" : ***
}
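
As a rough illustration, the fields that matter in this response can be summarized with a few lines of Python. This is a minimal sketch: `summarize_health` is a hypothetical helper, and every sample value except `status` and `unassigned_shards` is made up, because the original output masks them.

```python
def summarize_health(health: dict) -> str:
    """Return a one-line summary of a GET _cluster/health response body."""
    status = health.get("status", "unknown")
    unassigned = health.get("unassigned_shards", 0)
    # Shard copies the cluster knows about: active, initializing and unassigned.
    total = (
        health.get("active_shards", 0)
        + health.get("initializing_shards", 0)
        + unassigned
    )
    # Share of shard copies that are not assigned (0.0 when there are no shards).
    pct = 100.0 * unassigned / total if total else 0.0
    return f"status={status} unassigned={unassigned}/{total} ({pct:.1f}%)"


# Hypothetical sample: only status and unassigned_shards come from the article.
sample = {
    "status": "yellow",
    "active_shards": 786,
    "initializing_shards": 0,
    "unassigned_shards": 214,
}
print(summarize_health(sample))
```

A yellow status with a non-zero `unassigned_shards` means all primaries are assigned but some replicas are not, which matches the finding above.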

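2. One node became unreachable for an unknown reason, which triggered shard recovery in the cluster: (1) all lost replica shards are reallocated to the remaining healthy nodes, and (2) the cluster rebalances;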

{
	"unassigned_info": {
		"reason": "NODE_LEFT",
		"at": "2020-11-20T03:12:16",
		"details": "node_left ***",
		"last_allocation_status": "no_attempt"
	}
}
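
To see how many shards share this reason, the unassigned shards can be grouped by `unassigned.reason`. The sketch below assumes the plain-text output of `GET _cat/shards?h=index,shard,prirep,state,unassigned.reason`; the helper name and the sample rows are hypothetical.

```python
from collections import Counter


def count_unassigned_reasons(cat_shards_text: str) -> Counter:
    """Count UNASSIGNED shards per reason.

    Expects the plain-text body of
    GET _cat/shards?h=index,shard,prirep,state,unassigned.reason
    """
    reasons = Counter()
    for line in cat_shards_text.splitlines():
        cols = line.split()
        # Columns: index, shard, prirep, state, [unassigned.reason]
        if len(cols) >= 4 and cols[3] == "UNASSIGNED":
            reasons[cols[4] if len(cols) > 4 else "UNKNOWN"] += 1
    return reasons


# Hypothetical rows for illustration only.
sample_rows = """\
index_execution 2 p STARTED
index_execution 2 r UNASSIGNED NODE_LEFT
index_execution 3 r UNASSIGNED NODE_LEFT
"""
print(count_unassigned_reasons(sample_rows))
```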

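3. The shard recovery concurrency limits (outgoing on the source node, incoming on the target node) were left at their defaults, so the recovery slots were fully used and recovery was too slow.

(cluster.routing.allocation.node_concurrent_incoming_recoveries=2, cluster.routing.allocation.node_concurrent_outgoing_recoveries=2)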
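
If these defaults turn out to be the bottleneck, one option is to raise both limits through the cluster settings API. The sketch below only builds the request body for `PUT _cluster/settings`. The value 4 and the choice of `persistent` over `transient` are assumptions for illustration, not something the original post reports applying, and a higher value puts more recovery load on the nodes.

```python
import json


def recovery_settings_body(concurrent_recoveries: int) -> dict:
    """Build a PUT _cluster/settings body that raises recovery concurrency.

    cluster.routing.allocation.node_concurrent_recoveries is the shortcut
    that sets both the incoming and the outgoing limit.
    """
    if concurrent_recoveries < 1:
        raise ValueError("concurrent_recoveries must be at least 1")
    return {
        "persistent": {
            "cluster.routing.allocation.node_concurrent_recoveries": concurrent_recoveries
        }
    }


# 4 is an assumed example value; tune it against the load your nodes can take.
body = recovery_settings_body(4)
print("PUT _cluster/settings")
print(json.dumps(body, indent=2))
```

The printed request can be pasted into the Kibana console.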

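Problem description:

The per-node decisions returned by GET _cluster/allocation/explain show recoveries being throttled at the limit of 2, both incoming and outgoing: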
{
	"node_id": "***",
	"node_name": "mastersha",
	"transport_address": "***",
	"node_decision": "throttled",
	"deciders": [{
		"decider": "throttling",
		"decision": "THROTTLE",
		"explanation": "reached the limit of incoming shard recoveries [2], cluster setting [cluster.routing.allocation.node_concurrent_incoming_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
	}]
}

{
	"node_id": "***",
	"node_name": "master",
	"transport_address": ***,
	"node_decision": "no",
	"store": {
		"matching_sync_id": true
	},
	"deciders": [{
			"decider": "same_shard",
			"decision": "NO",
			"explanation": "the shard cannot be allocated to the same node on which a copy of the shard already exists [[index_execution][2], node[***], [P], s[STARTED], a[id=***]]"
		},
		{
			"decider": "throttling",
			"decision": "THROTTLE",
			"explanation": "reached the limit of outgoing shard recoveries [2] on the node [***] which holds the primary, cluster setting [cluster.routing.allocation.node_concurrent_outgoing_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
		}
	]
}
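
When many nodes are listed, it helps to keep only the deciders that did not answer YES. A minimal sketch, assuming each node entry has the shape shown above; `blocking_deciders` is a hypothetical helper and the sample is a trimmed copy of the second entry.

```python
def blocking_deciders(node_decision: dict) -> list:
    """Return (decider, decision) pairs that are not YES for one node entry."""
    return [
        (d["decider"], d["decision"])
        for d in node_decision.get("deciders", [])
        if d.get("decision") != "YES"
    ]


# Trimmed copy of the second node entry (explanations omitted).
node = {
    "node_name": "master",
    "node_decision": "no",
    "deciders": [
        {"decider": "same_shard", "decision": "NO"},
        {"decider": "throttling", "decision": "THROTTLE"},
    ],
}
for name, decision in blocking_deciders(node):
    print(f"{node['node_name']}: {name} -> {decision}")
```

A THROTTLE decision is temporary and clears once a recovery slot frees up, whereas the same_shard NO is permanent for that node, since a replica is never placed on the node that already holds a copy of the same shard.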

Note:

Some of the requests used for the ES analysis (run from the Kibana console):

GET _cat/health
GET _cluster/health
GET _cat/nodes
GET _cluster/health?level=indices
GET _cluster/health?level=shards
GET _cluster/allocation/explain
GET _cat/indices
GET _cluster/state
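
The same requests can also be issued from a script instead of Kibana. The sketch below uses only the Python standard library and assumes an unauthenticated cluster; the base URL in the usage comment is a placeholder, and `build_url` and `fetch` are hypothetical helpers.

```python
import urllib.request

# The diagnostic endpoints listed above.
DIAGNOSTIC_PATHS = [
    "_cat/health",
    "_cluster/health",
    "_cat/nodes",
    "_cluster/health?level=indices",
    "_cluster/health?level=shards",
    "_cluster/allocation/explain",
    "_cat/indices",
    "_cluster/state",
]


def build_url(base_url: str, path: str) -> str:
    """Join the cluster base URL and an API path with exactly one slash."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def fetch(base_url: str, path: str, timeout: float = 10.0) -> str:
    """Send a GET request and return the response body as text."""
    with urllib.request.urlopen(build_url(base_url, path), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Example (needs a reachable cluster; the address is an assumption):
#   for path in DIAGNOSTIC_PATHS:
#       print(fetch("http://localhost:9200", path))
```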