-
Timeline Walkthrough
At 12:20:40, node 2 detected that a disk had become unavailable; after 200 s it evicted one voting disk. Because 2 voting disks still survived, the 2-node cluster stayed up.
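The survival logic here is simple majority voting: with N configured voting files, CSS stays up as long as at least floor(N/2) + 1 of them remain online. A minimal sketch of that rule (the helper names are illustrative, not Oracle code):

```python
def min_voting_files(total: int) -> int:
    """Strict majority of the configured voting files: floor(N/2) + 1."""
    return total // 2 + 1

def cluster_survives(total: int, online: int) -> bool:
    """The cluster stays up while a majority of voting files is reachable."""
    return online >= min_voting_files(total)

# 3 voting files configured, 1 evicted -> 2 online, minimum required is 2
print(min_voting_files(3))       # 2, matching CRS-1672's "minimum ... required 2"
print(cluster_survives(3, 2))    # True: the 2-node cluster keeps running
print(cluster_survives(3, 1))    # False: further loss would mean eviction
```

This is why CRS-1672 below warns that the count "has fallen to the minimum": losing one more voting file would drop the cluster below majority.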
2024-12-07 12:23:41.623 [OCSSD(3998440)]CRS-1719: Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread_0 not scheduled for 180263 msecs.
2024-12-07 12:23:42.624 [OCSSD(3998440)]CRS-1719: Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread_0 not scheduled for 181708 msecs.
2024-12-07 12:24:00.793 [OCSSD(3998440)]CRS-1604: CSSD voting file is offline: /dev/rhdisk4; details at (:CSSNM00058:) in /u01/app/grid/diag/crs/pnewapsdb2/crs/trace/ocssd.trc.
2024-12-07 12:24:00.794 [OCSSD(3998440)]CRS-1672: The number of voting files currently available 2 has fallen to the minimum number of voting files required 2. Further reduction in voting files will result in eviction and loss of functionality
At 12:21:00, node 2's lgwr could no longer write redo logs, and node 2 stopped serving clients:
Sat Dec 07 12:22:15 2024
At 12:23:17, because its asynchronous I/O could not complete, node 2's instance left a number of processes lingering:
*** 2024-12-07 12:23:17.085 |
At 12:23:29 on node 1, the pre-reserved SCN range was exhausted and lgwr began showing the "wait for scn ack" wait event; node 1 stopped serving clients. At around 12:23:29 a Cluster Group Services (CGS) reconfiguration was also initiated.
From 12:23:29 to 12:28:31, node 1's wait event remained lgwr's "wait for scn ack".
Node 1's process-heartbeat health checks kept succeeding because of node 2's lingering processes:
*** 2024-12-07 12:26:48.912
==============================
LGWR (ospid: 4850844) has not moved for 214 sec (1733545608.1733545394)
kjgcr_ServiceGCR: KJGCR_METRICS: Local metric heartbeat check, id 0 succeed
Hang Manager: Health: System:15, Cluster:0. Hangs: Local:3, Global:0
Hang Manager: Current no of hangs 58, no of impacted sessions - hangs:16 deadlocks:0
On node 1, under the control of the hidden parameter _cgs_send_timeout, an IPC send timeout fired once the dispatched action had been outstanding for more than 300 s.
At that point, at 12:28:31, node 1's lmon marked node 2 as failed and initiated instance member recovery (IMR):
Sat Dec 07 12:28:31 2024
Because of the _imr_dd_slave_wait_time parameter, the dispatched instance-member-recovery action times out after 30 s.
As the logs show, at 12:29:00 the clusterware's attempt to restart the node 2 instance failed (because of the lingering async-I/O processes):
2024-12-07 12:29:00.046 [ORAAGENT(6816362)]CRS-5017: The resource action "ora.zjywdbdg.db start" encountered the following error:
2024-12-07 12:29:00.046+ORA-01034: ORACLE not available
ORA-27101: shared memory realm does not exist
IBM AIX RISC System/6000 Error: 2: No such file or directory
Additional information: 2820
Additional information: 2069113247
Process ID: 0
Session ID: 0
Serial number: 0
. For details refer to "(:CLSN00107:)" in "/u01/app/grid/diag/crs/pnewapsdb2/crs/trace/crsd_oraagent_oracle.trc".
Because the instance-member-recovery action failed, node 1 escalated the recovery action to a member kill, i.e. killing the group member, and evicted node 2 at 12:29:01:
Sat Dec 07 12:29:01 2024
Evicting instance 2 from cluster
Waiting for instances to leave: 2
This behavior is governed by the _imr_evicted_member_kill_wait parameter and carries a 20 s wait.
At 12:29:21, node 1's clusterware issued the member-kill soft restart against node 2 (a soft restart only restarts the database):
2024-12-07 12:29:21.305 [OCSSD(3932964)]CRS-1663: Member kill issued by PID 5702064 for 1 members, group DBAPSDBDG. Details at (:CSSGM00044:) in /u01/app/grid/diag/crs/pnewapsdb1/crs/trace/ocssd.trc.
Node 2's clusterware processes were healthy. During the soft restart, it found the disks unavailable and hit a timeout; the timeout is governed by _asm_hbeatiowait, and after 120 s it reported the ASM communication as failed.
At 12:31:21, node 2 detected that ASM was unavailable:
2024-12-07 12:31:21.814 :GIPCXCPT:10025: gipcInternalDissociate: obj 114c3fdb0 [00000000a416e5d0] { gipcEndpoint : localAddr 'gipcha://pnewapsdb1:nm2_pnewapsdb/a55e-90d1-96eb-14d1', remoteAddr 'gipcha://pnewapsdb2:507b-76ef-0239-3995', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 3998440, readyRef 0, ready 1, wobj 114c490f0, sendp 0 status 0flags 0x20138606, flags-2 0x50, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)
2024-12-07 12:31:21.814 :GIPCXCPT:10025: gipcDissociateF [clssnmDiscHelper : clssnm.c : 4248]: EXCEPTION[ ret gipcretFail (1) ] failed to dissociate obj 114c3fdb0 [00000000a416e5d0] { gipcEndpoint : localAddr 'gipcha://pnewapsdb1:nm2_pnewapsdb/a55e-90d1-96eb-14d1', remoteAddr 'gipcha://pnewapsdb2:507b-76ef-0239-3995', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 3998440, readyRef 0, ready 1, wobj 114c490f0, sendp 0 status 0flags 0x20138606, flags-2 0x50, usrFlags 0x0 }, flags 0x0
Node 2 then cleaned up the crsd process following normal cluster-shutdown logic:
2024-12-07 12:31:22.840 [OCSSD(3998440)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/grid/diag/crs/pnewapsdb2/crs/trace/ocssd.trc
2024-12-07 12:31:22.846 [OCSSD(3998440)]CRS-1608: This node was evicted by node 1, pnewapsdb1; details at (:CSSNM00005:) in /u01/app/grid/diag/crs/pnewapsdb2/crs/trace/ocssd.trc.
2024-12-07 12:31:22.872 [OCSSD(3998440)]CRS-1652: Starting clean up of CRSD resources.
Cleaning up ASM in the crsd process carries a 60 s delay:
2024-12-07 12:31:22.956 : AGFW:2828: [ INFO] {2:58926:27524} Started implicit monitor for [ora.net1.network pnewapsdb2 1] interval=60000 delay=60000
At 12:32:21 on node 2, the CSS daemon stopped responding; the server would reboot 59 s later:
2024-12-07 12:31:37.060 [OHASD(3474006)]CRS-2771: Maximum restart attempts reached for resource 'ora.crsd'; will not restart.
2024-12-07 12:32:21.435 [CSSDMONITOR(4260976)]CRS-1661: The CSS daemon is not responding. If this persists, a reboot will occur in 59107 milliseconds; details are at (:CLSN00111:) in /u01/app/grid/diag/crs/pnewapsdb2/crs/trace/ohasd_cssdmonitor_root.trc.
At 12:33:23, node 1 reconfigured the cluster; the reconfiguration completed 3 s later and node 1 resumed serving clients:
Sat Dec 07 12:33:23 2024
……
Sat Dec 07 12:33:26 2024
Reconfiguration complete (total time 3.0 secs)
Node 2 then finished rebooting and resumed serving clients at 12:41:22:
Sat Dec 07 12:41:22 2024
CJQ0 started with pid=87, OS id=8389058
Completed: ALTER DATABASE OPEN /* db agent *//* {2:37088:2} */
-
Parameter Table
| Parameter | Default | Description |
| --- | --- | --- |
| _cgs_send_timeout | 300 | CGS send timeout value |
| _imr_dd_slave_wait_time | 30 | IMR max time LMON will wait for slave response in seconds |
| _imr_evicted_member_kill_wait | 20 | IMR evicted member kill wait time in seconds |
| _asm_hbeatiowait | 120 | number of secs to wait for PST Async Hbeat IO return |
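Chaining these defaults end to end explains the logged timestamps. A minimal sketch that walks the timeline forward from the 12:23:29 CGS reconfiguration, assuming each stage starts when the previous one expires; the computed times land within a couple of seconds of the logged 12:28:31, 12:29:01, 12:29:21, and 12:31:21 entries:

```python
from datetime import datetime, timedelta

# Hidden-parameter defaults from the table above (all in seconds)
timeouts = [
    ("_cgs_send_timeout: IPC send timeout, node 2 marked failed", 300),
    ("_imr_dd_slave_wait_time: IMR slave wait expires", 30),
    ("_imr_evicted_member_kill_wait: member kill issued", 20),
    ("_asm_hbeatiowait: ASM heartbeat I/O timeout on node 2", 120),
]

# CGS reconfiguration starts on node 1 at 12:23:29
t = datetime(2024, 12, 7, 12, 23, 29)
for label, secs in timeouts:
    t += timedelta(seconds=secs)
    print(f"{t:%H:%M:%S}  {label}")
```

The small residual drift comes from each stage starting slightly after the previous timeout fires, not exactly at it.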
-
Summary
Node 2's clusterware detected the failure of one voting disk at 12:20:40; after evicting it, 2 voting disks survived, so the cluster stack still considered itself healthy. Node 2's database had been unable to write to disk since 12:21:00 and stopped serving clients; at 12:23:17 the instance was killed, but because its processes had asynchronous I/O in flight, a large number of processes lingered.
Node 2's lingering processes distorted the timeouts of various inter-process communications. Node 1 was blocked because generating SCNs required coordination with node 2, yet its process-heartbeat health checks kept passing. Node 1 then initiated a cluster reconfiguration; that process still had to reach node 2's lingering processes and therefore timed out. After the 300 s response timeout, node 1 initiated instance member recovery, which itself times out after 30 s; on timeout the action escalated to a soft restart, i.e. restarting the database instance through the clusterware, issued after a further 20 s wait.
After that 20 s, node 2 received the member-kill message. While accessing the ASM instance, the clusterware's heartbeat I/O kept failing and hit the 120 s timeout, so CRSD had to clean up the ASM service. CRSD's monitoring carries a 60 s delay; once crsd was judged unusable, cssd declared itself unresponsive and initiated a node reboot, announced to occur 59 s later.
Node 2 rebooted; node 1 initiated a reconfiguration that completed in 3 s, after which node 1 resumed serving clients.
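Putting the summary in numbers, each node's client-facing outage can be read straight off the logged timestamps: roughly ten minutes for node 1 and about twenty for node 2. A quick calculation (timestamps taken from the logs above, all on the same day):

```python
from datetime import datetime

FMT = "%H:%M:%S"

def outage_minutes(down: str, up: str) -> float:
    """Minutes between losing and regaining client service (same day)."""
    delta = datetime.strptime(up, FMT) - datetime.strptime(down, FMT)
    return delta.total_seconds() / 60

# Node 1: lgwr 'wait for scn ack' begins 12:23:29, reconfiguration done 12:33:26
print(f"node 1 outage: {outage_minutes('12:23:29', '12:33:26'):.1f} min")
# Node 2: lgwr stalls at 12:21:00, ALTER DATABASE OPEN completes 12:41:22
print(f"node 2 outage: {outage_minutes('12:21:00', '12:41:22'):.1f} min")
```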