数据库重启
GaussDB数据库集群状态有以下三种情况:
- Normal:表示数据库可用,且数据有冗余备份。所有进程都在运行,主备关系正常。
- Degraded:表示数据库可用,但数据没有冗余备份。
- Unavailable:表示数据库不可用。
查询集群中所有实例和服务的状态:
gs_om -t status --detail
重启所有节点上的实例和服务:
# 方法一:gs_om命令
gs_om -t stop
gs_om -t start
# 方法二:cm_ctl命令
cm_ctl stop
cm_ctl start
⭐️ 停止单个备机实例,该备机节点上的cm_server、etcd和数据库实例会停止服务,集群状态降级为Degraded。
cm_ctl命令重启单个节点:
[omm@gaussdb001 ~]$ cm_ctl stop -n 2
cm_ctl: stop the node: 2.
cm_ctl: stop node, nodeid: 2
..............
cm_ctl: stop node successfully.
cm_ctl: stopping the ETCD instance.
cm_ctl: stop the ETCD instance in this node, nodeid: 2.
.
cm_ctl: The ETCD instance stopped successfully in node: 2.
[omm@gaussdb001 ~]$ cm_ctl start -n 2
cm_ctl: start ETCD in node, nodeid: 2
.
cm_ctl: the ETCD instance in this node starts successfully done.
cm_ctl: checking the ETCD cluster status
.
cm_ctl: check ETCD cluster finished.
cm_ctl: start the node:2.
.........
cm_ctl: start node successfully.
gs_om命令重启单个节点:
[omm@gaussdb001 ~]$ gs_om -t stop -h 22.69.77.32
Stopping node.
=========================================
the Stop cmd is source /home/omm/gauss_env_file ; cm_ctl stop -n 2 -m fast.
Successfully stopped node.
=========================================
End stop node.
[omm@gaussdb001 ~]$ gs_om -t start -h 22.69.77.32
Starting node.
======================================================================
Successfully started node.
======================================================================
End start node.
Successfully started node.
实例主备切换
检查备机数据目录:
gs_om -t status --detai1
检查主备有无日志追赶(状态为Streaming表示同步正常):
select * from pg_get_senders_catchup_time;
计划内切换,在目标备库节点执行:
gs_ctl switchover -D /gsdata/dn/dn_6002
主库发生故障时,可以在目标备库节点执行:
gs_ctl failover -D /gsdata/dn/dn_6002
在非目标备库节点执行切换会报错:
[gs_ctl]: can't create lock file "/gsdata/dn/dn_6002/pg_ctl.lock" : No such file or directory
切换示例:
[omm@gaussdb002 ~]$ gs_ctl switchover -D /gsdata/dn/dn_6002
[2024-12-25 10:00:18.315][1278722][][gs_ctl]: gs_ctl switchover ,datadir is /gsdata/dn/dn_6002
[2024-12-25 10:00:18.315][1278722][][gs_ctl]: switchover term (1)
[2024-12-25 10:00:18.326][1278722][][gs_ctl]: waiting for server to switchover.........
[2024-12-25 10:00:24.394][1278722][][gs_ctl]: done
[2024-12-25 10:00:24.394][1278722][][gs_ctl]: switchover completed (/gsdata/dn/dn_6002)
查看切换日志:
[omm@gaussdb002 ~]$ tail -n50 $GAUSSLOG/bin/gs_ctl/gs_ctl-2024-12-16_163928-current.log
...
[2024-12-25 10:00:18]
[2024-12-25 10:00:18.315][1278722][][gs_ctl]: gs_ctl switchover ,datadir is /gsdata/dn/dn_6002
[2024-12-25 10:00:18.315][1278722][][gs_ctl]: switchover term (1)
[2024-12-25 10:00:18.326][1278722][][gs_ctl]: waiting for server to switchover.........
[2024-12-25 10:00:24.394][1278722][][gs_ctl]: done
[2024-12-25 10:00:24.394][1278722][][gs_ctl]: switchover completed (/gsdata/dn/dn_6002)
🦁 出现双主状态后的处理流程:
#(1)检查数据库实例状态,显示两个实例的状态都为Primary:
gs_om -t status --detai1
#(2)确定降为备机的节点,在节点上执行命令关闭服务:
gs_ctl stop -D /gsdata/dn/dn_6001
#(3)以Standby模式启动上面关闭的备节点:
gs_ctl start -D /gsdata/dn/dn_6001 -M standby
#(4)检查数据库实例状态,确认实例状态恢复:
gs_om -t status --detai1
重置实例状态
发生过主备切换后,集群的数据均衡状态会从balanced: Yes
变为balanced: No
(不影响数据库正常访问),需要手动重置。
[ Cluster State ]
cluster_state : Normal
redistributing : No
balanced : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state | node node_ip instance state | node node_ip instance state
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 22.69.77.30 22.69.77.30 6001 /gsdata/dn/dn_6001 P Standby Normal | 2 22.69.77.32 22.69.77.32 6002 /gsdata/dn/dn_6002 S Primary Normal | 3 22.69.77.28 22.69.77.28 6003 /gsdata/dn/dn_6003 S Standby Normal
登录数据库任意主机,手动恢复实例状态:
gs_om -t switch --reset --time-out=300
该操作会重置主备实例角色(即切回到原来的主库),因此操作前需要保证主备实例状态正常。
示例:
[omm@gaussdb001 ~]$ gs_om -t switch --reset --time-out=300
Operating: Switch reset.
cm_ctl: cmserver is rebalancing the cluster automatically.
....
cm_ctl: switchover successfully.
Operation succeeded: Switch reset.
[omm@gaussdb001 ~]$ gs_om -t status --detail
...
[ Cluster State ]
cluster_state : Normal
redistributing : No
balanced : Yes
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state | node node_ip instance state | node node_ip instance state
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 22.69.77.30 22.69.77.30 6001 /gsdata/dn/dn_6001 P Primary Normal | 2 22.69.77.32 22.69.77.32 6002 /gsdata/dn/dn_6002 S Standby Normal | 3 22.69.77.28 22.69.77.28 6003 /gsdata/dn/dn_6003 S Standby Normal
模拟实例重启
🐯 重启备机服务器,集群状态和被重启备机的状态变化过程如下:
- (1) cluster_state: Normal, 主库: Primary Normal, 备机1: Standby Normal, 备机2: Standby Normal
- (2) cluster_state: Degraded, 主库: Primary Normal, 备机1: Down Unknown, 备机2: Standby Normal
- (3) cluster_state: Degraded, 主库: Primary Normal, 备机1: Pending Starting, 备机2: Standby Normal
- (4) cluster_state: Normal, 主库: Primary Normal, 备机1: Standby Normal, 备机2: Standby Normal
即重启备机过程中不会发生主备自动切换,备机重启后会自动恢复。
🦁 重启主库服务器,集群状态和主备机的状态变化过程如下:
- (1) cluster_state: Normal, 主库: Primary Normal, 备机1: Standby Normal, 备机2: Standby Normal
- (2) cluster_state: Unavailable, 主库: Down Unknown, 备机1: Standby Need repair(Disconnected), 备机2: Standby Need repair(Disconnected)
- (3) cluster_state: Degraded, 主库: Down Unknown, 备机1: Primary Normal, 备机2: Standby Normal
- (4) cluster_state: Degraded, 主库: Down Starting, 备机1: Primary Normal, 备机2: Standby Normal
- (5) cluster_state: Normal, 主库: Standby Normal, 备机1: Primary Normal, 备机2: Standby Normal
即重启主库服务器过程中会发生主备自动切换,某个备库会自动升级为主库。主库重启成功后会以备库身份加入集群,并且集群的balanced状态会变为No。
执行下面的命令来重置所有实例的状态和角色:
gs_om -t switch --reset --time-out=300