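Symptom
While the cluster was starting up, its status was red even though no primary shard was in the Unassigned state.
Contradiction: the official documentation says that red means at least one primary shard is unassigned, yet this cluster was red with no unassigned primary shard. The documentation reads: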
One or more primary shards are unassigned, so some data is unavailable.
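Investigation
TransportClusterHealthAction#clusterHealth returns the cluster health status. ClusterStateHealth holds the health information for the cluster; its constructor derives the cluster status from the status of every index (red, yellow, green) and takes the worst one. Index health is kept in ClusterIndexHealth and is derived from the state of the index's shards. For each shard, if the primary is active, the status is green when every copy is active and yellow otherwise; if the primary is not active, the status comes from getInactivePrimaryHealth. A shard counts as active when it is in the STARTED or RELOCATING state. The relevant code: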
// Excerpt from the ClusterShardHealth constructor: count the copies of one shard by state
int computeActiveShards = 0;
int computeRelocatingShards = 0;
int computeInitializingShards = 0;
int computeUnassignedShards = 0;
for (ShardRouting shardRouting : shardRoutingTable) {
    if (shardRouting.active()) {
        computeActiveShards++;
        if (shardRouting.relocating()) {
            // the shard is relocating, the one it is relocating to will be in initializing state, so we don't count it
            computeRelocatingShards++;
        }
    } else if (shardRouting.initializing()) {
        computeInitializingShards++;
    } else if (shardRouting.unassigned()) {
        computeUnassignedShards++;
    }
}
// The status depends first on whether the primary is active, not on whether it is unassigned
ClusterHealthStatus computeStatus;
final ShardRouting primaryRouting = shardRoutingTable.primaryShard();
if (primaryRouting.active()) {
    if (computeActiveShards == shardRoutingTable.size()) {
        computeStatus = ClusterHealthStatus.GREEN;
    } else {
        computeStatus = ClusterHealthStatus.YELLOW;
    }
} else {
    computeStatus = getInactivePrimaryHealth(primaryRouting);
}
// ShardRouting#active: only STARTED and RELOCATING count as active
public boolean active() {
    return started() || relocating();
}
// ShardRoutingState: the four possible shard states
/**
 * The shard is not assigned to any node.
 */
UNASSIGNED((byte) 1),
/**
 * The shard is initializing (probably recovering from either a peer shard
 * or gateway).
 */
INITIALIZING((byte) 2),
/**
 * The shard is started.
 */
STARTED((byte) 3),
/**
 * The shard is in the process being relocated.
 */
RELOCATING((byte) 4);
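The per-shard decision in the excerpts above can be condensed into a small standalone model. This is only a sketch, not the Elasticsearch classes: ShardHealthSketch, State and Health are hypothetical names, and the inactive-primary branch is assumed to always return RED, whereas the real getInactivePrimaryHealth may decide differently in some cases depending on the version.

```java
/** Simplified, hypothetical model of the per-shard health decision; not Elasticsearch code. */
class ShardHealthSketch {
    enum State { UNASSIGNED, INITIALIZING, STARTED, RELOCATING }
    enum Health { GREEN, YELLOW, RED }

    /** Mirrors ShardRouting#active: only STARTED and RELOCATING are active. */
    static boolean active(State state) {
        return state == State.STARTED || state == State.RELOCATING;
    }

    /** Health of one shard given the state of its primary and of its replicas. */
    static Health health(State primary, State... replicas) {
        if (!active(primary)) {
            // Assumption: stands in for getInactivePrimaryHealth and always reports RED.
            return Health.RED;
        }
        for (State replica : replicas) {
            if (!active(replica)) {
                return Health.YELLOW; // primary is fine, at least one copy is not active
            }
        }
        return Health.GREEN;
    }

    public static void main(String[] args) {
        // Primary still recovering at startup: nothing is UNASSIGNED, yet the shard is red.
        System.out.println(health(State.INITIALIZING, State.INITIALIZING)); // prints RED
        System.out.println(health(State.STARTED, State.INITIALIZING));      // prints YELLOW
        System.out.println(health(State.STARTED, State.RELOCATING));        // prints GREEN
    }
}
```

The first call is the case behind the symptom: the primary is assigned but not yet active, so the check for an active primary fails without any shard being unassigned.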
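Conclusion
The cluster is not red (that is, it is yellow or green) only when every primary shard is in the STARTED or RELOCATING state. The cluster in this case was red because at least one primary shard was in the INITIALIZING state, which is neither unassigned nor active.
Note: while a shard on the source node is RELOCATING, the same shard on the target node is INITIALIZING. A shard can be INITIALIZING because it is recovering from another node (relocation or replica copy), from a snapshot, or from local storage.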
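The "worst status wins" aggregation described in the investigation explains why a single such shard turns the whole cluster red. Here is a minimal sketch of that step, assuming the statuses are ordered from best to worst. ClusterStatusSketch and its members are hypothetical names, not the ClusterStateHealth implementation.

```java
import java.util.List;

/** Hypothetical sketch of taking the worst status across indices; not Elasticsearch code. */
class ClusterStatusSketch {
    // Declared from best to worst so that compareTo can rank them.
    enum Health { GREEN, YELLOW, RED }

    /** Returns the worst status in the list; with nothing to report, the result stays GREEN. */
    static Health worst(List<Health> indexStatuses) {
        Health result = Health.GREEN;
        for (Health status : indexStatuses) {
            if (status.compareTo(result) > 0) {
                result = status;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // One index with an initializing primary (RED) outweighs any number of healthy indices.
        System.out.println(worst(List.of(Health.GREEN, Health.GREEN, Health.RED))); // prints RED
        System.out.println(worst(List.of(Health.GREEN, Health.YELLOW)));            // prints YELLOW
    }
}
```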