-
Timeline Walkthrough
At 12:20:40, node 2 detected that a disk had become unavailable; after 200 s it evicted one voting disk. Because 2 voting disks still survived, the 2-node cluster stayed up.
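The survival logic here is simple majority voting: with N configured voting files, CSS stays up as long as at least floor(N/2) + 1 of them remain online. A minimal sketch of that rule (the helper names are illustrative, not Oracle code):

```python
def min_voting_files(total: int) -> int:
    """Strict majority of the configured voting files: floor(N/2) + 1."""
    return total // 2 + 1

def cluster_survives(total: int, online: int) -> bool:
    """The cluster stays up while a majority of voting files is reachable."""
    return online >= min_voting_files(total)

# 3 voting files configured, 1 evicted -> 2 online, minimum required is 2
print(min_voting_files(3))       # 2, matching CRS-1672's "minimum ... required 2"
print(cluster_survives(3, 2))    # True: the 2-node cluster keeps running
print(cluster_survives(3, 1))    # False: further loss would mean eviction
```

This is why CRS-1672 below warns that the count "has fallen to the minimum": losing one more voting file would drop the cluster below majority.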
2024-12-07 12:23:41.623 [OCSSD(3998440)]CRS-1719: Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread_0 not scheduled for 180263 msecs.
2024-12-07 12:23:42.624 [OCSSD(3998440)]CRS-1719: Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread_0 not scheduled for 181708 msecs.
2024-12-07 12:24:00.793 [OCSSD(3998440)]CRS-1604: CSSD voting file is offline: /dev/rhdisk4; details at (:CSSNM00058:) in /u01/app/grid/diag/crs/pnewapsdb2/crs/trace/ocssd.trc.
2024-12-07 12:24:00.794 [OCSSD(3998440)]CRS-1672: The number of voting files currently available 2 has fallen to the minimum number of voting files required 2. Further reduction in voting files will result in eviction and loss of functionality
At 12:21:00, node 2's lgwr could no longer write redo logs, and node 2 stopped serving clients:
Sat Dec 07 12:22:15 2024
At 12:23:17, because its asynchronous I/O could not complete, node 2's instance left a number of processes lingering:
*** 2024-12-07 12:23:17.085 |
At 12:23:29 on node 1, the pre-reserved SCN range was exhausted and lgwr began showing the "wait for scn ack" wait event; node 1 stopped serving clients. At around 12:23:29 a Cluster Group Services (CGS) reconfiguration was also initiated.
From 12:23:29 to 12:28:31, node 1's wait event remained lgwr's "wait for scn ack".
Node 1's process-heartbeat health checks kept succeeding because of node 2's lingering processes:
*** 2024-12-07 12:26:48.912
==============================
LGWR (ospid: 4850844) has not moved for 214 sec (1733545608.1733545394)
kjgcr_ServiceGCR: KJGCR_METRICS: Local metric heartbeat check, id 0 succeed
Hang Manager: Health: System:15, Cluster:0. Hangs: Local:3, Global:0
Hang Manager: Current no of hangs 58, no of impacted sessions - hangs:16 deadlocks:0
On node 1, under the control of the hidden parameter _cgs_send_timeout, an IPC send timeout fired once the dispatched action had been outstanding for more than 300 s.
At that point, at 12:28:31, node 1's lmon marked node 2 as failed and initiated instance member recovery (IMR):
Sat Dec 07 12:28:31 2024
Because of the _imr_dd_slave_wait_time parameter, the dispatched instance-member-recovery action times out after 30 s.
As the logs show, at 12:29:00 the clusterware's attempt to restart the node 2 instance failed (because of the lingering async-I/O processes):
2024-12-07 12:29:00.046 [ORAAGENT(6816362)]CRS-5017: The resource action "ora.zjywdbdg.db start" encountered the following error:
2024-12-07 12:29:00.046+ORA-01034: ORACLE not available
ORA-27101: shared memory realm does not exist
IBM AIX RISC System/6000 Error: 2: No such file or directory
Additional information: 2820
Additional information: 2069113247
Process ID: 0
Session ID: 0
Serial number: 0
. For details refer to "(:CLSN00107:)" in "/u01/app/grid/diag/crs/pnewapsdb2/crs/trace/crsd_oraagent_oracle.trc".
Because the instance-member-recovery action failed, node 1 escalated the recovery action to a member kill, i.e. killing the group member, and evicted node 2 at 12:29:01:
Sat Dec 07 12:29:01 2024
Evicting instance 2 from cluster
Waiting for instances to leave: 2
This behavior is governed by the _imr_evicted_member_kill_wait parameter and carries a 20 s wait.
At 12:29:21, node 1's clusterware issued the member-kill soft restart against node 2 (a soft restart only restarts the database):
2024-12-07 12:29:21.305 [OCSSD(3932964)]CRS-1663: Member kill issued by PID 5702064 for 1 members, group DBAPSDBDG. Details at (:CSSGM00044:) in /u01/app/grid/diag/crs/pnewapsdb1/crs/trace/ocssd.trc.
Node 2's clusterware processes were healthy. During the soft restart, it found the disks unavailable and hit a timeout; the timeout is governed by _asm_hbeatiowait, and after 120 s it reported the ASM communication as failed.
At 12:31:21, node 2 detected that ASM was unavailable:
2024-12-07 12:31:21.814 :GIPCXCPT:10025: gipcInternalDissociate: obj 114c3fdb0 [00000000a416e5d0] { gipcEndpoint : localAddr 'gipcha://pnewapsdb1:nm2_pnewapsdb/a55e-90d1-96eb-14d1', remoteAddr 'gipcha://pnewapsdb2:507b-76ef-0239-3995', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 3998440, readyRef 0, ready 1, wobj 114c490f0, sendp 0 status 0flags 0x20138606, flags-2 0x50, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)
2024-12-07 12:31:21.814 :GIPCXCPT:10025: gipcDissociateF [clssnmDiscHelper : clssnm.c : 4248]: EXCEPTION[ ret gipcretFail (1) ] failed to dissociate obj 114c3fdb0 [00000000a416e5d0] { gipcEndpoint : localAddr 'gipcha://pnewapsdb1:nm2_pnewapsdb/a55e-90d1-96eb-14d1', remoteAddr 'gipcha://pnewapsdb2:507b-76ef-0239-3995', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 3998440, readyRef 0, ready 1, wobj 114c490f0, sendp 0 status 0flags 0x20138606, flags-2 0x50, usrFlags 0x0 }, flags 0x0
Node 2 then cleaned up the crsd process following normal cluster-shutdown logic:
2024-12-07 12:31:22.840 [OCSSD(3998440)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/grid/diag/crs/pnewapsdb2/crs/trace/ocssd.trc
2024-12-07 12:31:22.846 [OCSSD(3998440)]CRS-1608: This node was evicted by node 1, pnewapsdb1; details at (:CSSNM00005:) in /u01/app/grid/diag/crs/pnewapsdb2/crs/trace/ocssd.trc.
2024-12-07 12:31:22.872 [OCSSD(3998440)]CRS-1652: Starting clean up of CRSD resources.
Cleaning up ASM in the crsd process carries a 60 s delay:
2024-12-07 12:31:22.956 : AGFW:2828: [ INFO] {2:58926:27524} Started implicit monitor for [ora.net1.network pnewapsdb2 1] interval=60000 delay=60000
At 12:32:21 on node 2, the CSS daemon stopped responding; the server would reboot 59 s later:
2024-12-07 12:31:37.060 [OHASD(3474006)]CRS-2771: Maximum restart attempts reached for resource 'ora.crsd'; will not restart.
2024-12-07 12:32:21.435 [CSSDMONITOR(4260976)]CRS-1661: The CSS daemon is not responding. If this persists, a reboot will occur in 59107 milliseconds; details are at (:CLSN00111:) in /u01/app/grid/diag/crs/pnewapsdb2/crs/trace/ohasd_cssdmonitor_root.trc.
At 12:33:23, node 1 reconfigured the cluster; the reconfiguration completed 3 s later and node 1 resumed serving clients:
Sat Dec 07 12:33:23 2024
……
Sat Dec 07 12:33:26 2024
Reconfiguration complete (total time 3.0 secs)
Node 2 then finished rebooting and resumed serving clients at 12:41:22:
Sat Dec 07 12:41:22 2024
CJQ0 started with pid=87, OS id=8389058
Completed: ALTER DATABASE OPEN /* db agent *//* {2:37088:2} */
-
Parameter Table
| Parameter | Default | Description |
| --- | --- | --- |
| _cgs_send_timeout | 300 | CGS send timeout value |
| _imr_dd_slave_wait_time | 30 | IMR max time LMON will wait for slave response in seconds |
| _imr_evicted_member_kill_wait | 20 | IMR evicted member kill wait time in seconds |
| _asm_hbeatiowait | 120 | number of secs to wait for PST Async Hbeat IO return |
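Chaining these defaults end to end explains the logged timestamps. A minimal sketch that walks the timeline forward from the 12:23:29 CGS reconfiguration, assuming each stage starts when the previous one expires; the computed times land within a couple of seconds of the logged 12:28:31, 12:29:01, 12:29:21, and 12:31:21 entries:

```python
from datetime import datetime, timedelta

# Hidden-parameter defaults from the table above (all in seconds)
timeouts = [
    ("_cgs_send_timeout: IPC send timeout, node 2 marked failed", 300),
    ("_imr_dd_slave_wait_time: IMR slave wait expires", 30),
    ("_imr_evicted_member_kill_wait: member kill issued", 20),
    ("_asm_hbeatiowait: ASM heartbeat I/O timeout on node 2", 120),
]

# CGS reconfiguration starts on node 1 at 12:23:29
t = datetime(2024, 12, 7, 12, 23, 29)
for label, secs in timeouts:
    t += timedelta(seconds=secs)
    print(f"{t:%H:%M:%S}  {label}")
```

The small residual drift comes from each stage starting slightly after the previous timeout fires, not exactly at it.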
-
Summary
Node 2's clusterware detected the failure of one voting disk at 12:20:40; after evicting it, 2 voting disks survived, so the cluster stack still considered itself healthy. Node 2's database had been unable to write to disk since 12:21:00 and stopped serving clients; at 12:23:17 the instance was killed, but because its processes had asynchronous I/O in flight, a large number of processes lingered.
Node 2's lingering processes distorted the timeouts of various inter-process communications. Node 1 was blocked because generating SCNs required coordination with node 2, yet its process-heartbeat health checks kept passing. Node 1 then initiated a cluster reconfiguration; that process still had to reach node 2's lingering processes and therefore timed out. After the 300 s response timeout, node 1 initiated instance member recovery, which itself times out after 30 s; on timeout the action escalated to a soft restart, i.e. restarting the database instance through the clusterware, issued after a further 20 s wait.
After that 20 s, node 2 received the member-kill message. While accessing the ASM instance, the clusterware's heartbeat I/O kept failing and hit the 120 s timeout, so CRSD had to clean up the ASM service. CRSD's monitoring carries a 60 s delay; once crsd was judged unusable, cssd declared itself unresponsive and initiated a node reboot, announced to occur 59 s later.
Node 2 rebooted; node 1 initiated a reconfiguration that completed in 3 s, after which node 1 resumed serving clients.
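Putting the summary in numbers, each node's client-facing outage can be read straight off the logged timestamps: roughly ten minutes for node 1 and about twenty for node 2. A quick calculation (timestamps taken from the logs above, all on the same day):

```python
from datetime import datetime

FMT = "%H:%M:%S"

def outage_minutes(down: str, up: str) -> float:
    """Minutes between losing and regaining client service (same day)."""
    delta = datetime.strptime(up, FMT) - datetime.strptime(down, FMT)
    return delta.total_seconds() / 60

# Node 1: lgwr 'wait for scn ack' begins 12:23:29, reconfiguration done 12:33:26
print(f"node 1 outage: {outage_minutes('12:23:29', '12:33:26'):.1f} min")
# Node 2: lgwr stalls at 12:21:00, ALTER DATABASE OPEN completes 12:41:22
print(f"node 2 outage: {outage_minutes('12:21:00', '12:41:22'):.1f} min")
```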