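The previous section covered service management and backup/restore for a DSC cluster. This section walks through how a DSC cluster handles a failure.
First, look at the state of the database instances in the example environment: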
SQL> select * from v$instance;
LINEID NAME INSTANCE_NAME INSTANCE_NUMBER HOST_NAME
---------- ---- ------------- --------------- ---------
SVR_VERSION DB_VERSION
-------------------------- -------------------
START_TIME
----------------------------------------------------------------------------------------------------
STATUS$ MODE$ OGUID DSC_SEQNO DSC_ROLE
------- ------ ----------- ----------- ------------
1 DSC0 DSC0 1 dcs0
DM Database Server x64 V8 DB Version: 0x7000a
2021-05-01 23:05:12
OPEN NORMAL 0 0 Control node
used time: 76.219(ms). Execute id is 4.
The instance named DSC0 is in a normal state (OPEN) and is currently the cluster's control node. Now check the other instance:
SQL> select * from v$instance;
LINEID NAME INSTANCE_NAME INSTANCE_NUMBER HOST_NAME SVR_VERSION DB_VERSION
---------- ---- ------------- --------------- --------- -------------------------- -------------------
START_TIME STATUS$
---------------------------------------------------------------------------------------------------- -------
MODE$ OGUID DSC_SEQNO DSC_ROLE
------ ----------- ----------- -----------
1 DSC1 DSC1 2 dcs1 DM Database Server x64 V8 DB Version: 0x7000a
2021-05-01 23:04:54 OPEN
NORMAL 0 1 Normal node
used time: 148.055(ms). Execute id is 1.
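The instance named DSC1 is also in a normal state and is currently a normal node. Next, simulate a failure: force-kill the DSC0 instance process with the system kill command, then confirm that the process no longer exists. The procedure is shown in the figure below: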

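Confirm the instance state in DISQL. The DISQL session on DSC0 has lost its connection, and querying the instance from DISQL on DSC1 shows that DSC1 has switched to the control node role: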
SQL> select * from v$instance;
LINEID NAME INSTANCE_NAME INSTANCE_NUMBER HOST_NAME SVR_VERSION
---------- ---- ------------- --------------- --------- --------------------------
DB_VERSION
-------------------
START_TIME
----------------------------------------------------------------------------------------------------
STATUS$ MODE$ OGUID DSC_SEQNO DSC_ROLE
------- ------ ----------- ----------- ------------
1 DSC1 DSC1 2 dcs1 DM Database Server x64 V8
DB Version: 0x7000a
2021-05-01 23:04:54
OPEN NORMAL 0 1 Control node
used time: 1.152(ms). Execute id is 2.
To analyze the process in detail, look at the records in the cluster (DMCSS) log file:
cat dm_CSS0_202105.log
2021-05-01 23:50:16.096 [INFO] dmcss P0000000707 T0000000000000000766 [DB]: Detected EP DSC0[0] break in PROCESS_OPEN
2021-05-01 23:50:16.099 [INFO] dmcss P0000000707 T0000000000000000766 [DB]: Set EP DSC0[0] as break EP
2021-05-01 23:50:16.105 [INFO] dmcss P0000000707 T0000000000000000766 [DB]: status change from (OPEN, STARTUP) to (PROCESS_EP_CRASH, CSS_SUB_STATUS_EP_CRASH_STEP1)
2021-05-01 23:50:16.105 [INFO] dmcss P0000000707 T0000000000000000766 [DB]: Control Node[0], break ep[0], recover ep[255], n_ok_ep[2]
2021-05-01 23:50:17.738 [INFO] dmcss P0000000707 T0000000000000000766 [DB]: Set EP DSC1[1] as Control node
2021-05-01 23:50:17.740 [INFO] dmcss P0000000707 T0000000000000000766 [DB]: CSS set cmd EP_CRASH_STEP1, dest_ep DSC1 seqno = 1, cmd_seq = 49
2021-05-01 23:50:17.748 [INFO] dmcss P0000000707 T0000000000000000766 [DB]: status change from (PROCESS_EP_CRASH, CSS_SUB_STATUS_EP_CRASH_STEP1) to (PROCESS_EP_CRASH, CSS_SUB_STATUS_EP_CRASH_WAIT_STEP1)
2021-05-01 23:50:17.748 [INFO] dmcss P0000000707 T0000000000000000766 [DB]: Control Node[1], break ep[0], recover ep[255], n_ok_ep[1]
2021-05-01 23:50:18.754 [INFO] dmcss P0000000707 T0000000000000000766 [DB]: CSS set cmd NONE, dest_ep DSC1 seqno = 1, cmd_seq
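
The failover sequence in the log excerpt above can also be extracted programmatically. The following is a minimal sketch, not part of any DM tooling: it assumes the log line layout matches the excerpt shown here (timestamp, level, process name, process and thread IDs, group tag, message), and the names `parse_line` and `summarize_failover` are hypothetical helpers introduced only for illustration. It finds the moment DMCSS detects the broken EP and the moment a new control node is set, and reports the time between the two.

```python
import re
from datetime import datetime

# Assumed line layout, taken from the excerpt above:
# <timestamp> [LEVEL] <process> P<pid> T<tid> [GROUP]: <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<level>\w+)\] (?P<proc>\w+) P\d+ T\d+ "
    r"\[(?P<group>\w+)\]: (?P<msg>.*)$"
)
BREAK_RE = re.compile(r"Detected EP (?P<ep>\w+)\[\d+\] break")
CONTROL_RE = re.compile(r"Set EP (?P<ep>\w+)\[\d+\] as Control node")


def parse_line(line):
    """Split one log line into time, level and message; None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "time": datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S.%f"),
        "level": m["level"],
        "msg": m["msg"],
    }


def summarize_failover(lines):
    """Return the failed EP, the new control node, and the seconds between the two events."""
    failed_ep = new_control = None
    t_break = t_control = None
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue  # skip truncated or unrelated lines
        m = BREAK_RE.search(rec["msg"])
        if m and failed_ep is None:
            failed_ep, t_break = m["ep"], rec["time"]
        m = CONTROL_RE.search(rec["msg"])
        if m and new_control is None:
            new_control, t_control = m["ep"], rec["time"]
    seconds = None
    if t_break is not None and t_control is not None:
        seconds = (t_control - t_break).total_seconds()
    return {"failed_ep": failed_ep, "new_control": new_control, "seconds": seconds}


# Three lines copied from the log excerpt above.
SAMPLE = [
    "2021-05-01 23:50:16.096 [INFO] dmcss P0000000707 T0000000000000000766 [DB]: Detected EP DSC0[0] break in PROCESS_OPEN",
    "2021-05-01 23:50:16.099 [INFO] dmcss P0000000707 T0000000000000000766 [DB]: Set EP DSC0[0] as break EP",
    "2021-05-01 23:50:17.738 [INFO] dmcss P0000000707 T0000000000000000766 [DB]: Set EP DSC1[1] as Control node",
]

if __name__ == "__main__":
    print(summarize_failover(SAMPLE))
```

On the excerpt above, this reports DSC0 as the failed EP and DSC1 as the new control node, with roughly 1.6 seconds between detection (23:50:16.096) and the control node switch (23:50:17.738). To run it against a real file, pass the lines of `dm_CSS0_202105.log` to `summarize_failover` instead of `SAMPLE`.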

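This article examines failure handling in a DSC cluster: how instance states change, the detailed flow recorded in the log files, and the specific steps by which an instance rejoins the cluster. It describes how the DMCSS control node detects and handles a failed instance so that the cluster keeps running stably.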