RAC unable to serve requests for roughly 10 minutes because of a disk link failure

  • Timeline walkthrough

At 12:20:40, node 2 detected that a disk had become unavailable; after about 200s it evicted one voting disk. Since two voting disks still survived, the clusterware on node 2 stayed up.

2024-12-07 12:23:41.623 [OCSSD(3998440)]CRS-1719: Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread_0 not scheduled for 180263 msecs.

2024-12-07 12:23:42.624 [OCSSD(3998440)]CRS-1719: Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread_0 not scheduled for 181708 msecs.

2024-12-07 12:24:00.793 [OCSSD(3998440)]CRS-1604: CSSD voting file is offline: /dev/rhdisk4; details at (:CSSNM00058:) in /u01/app/grid/diag/crs/pnewapsdb2/crs/trace/ocssd.trc.

2024-12-07 12:24:00.794 [OCSSD(3998440)]CRS-1672: The number of voting files currently available 2 has fallen to the minimum number of voting files required 2. Further reduction in voting files will result in eviction and loss of functionality
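
Judging from the /dev/rhdisk4 path, the voting files sit on ASM disks. A minimal sketch for checking their state from the ASM instance (the VOTING_FILE column exists in 11.2 and later; column availability varies by version):

```sql
-- Run as SYSASM/SYSDBA on the ASM instance (sketch only; adjust for your version)
SELECT path,
       mount_status,      -- CACHED/OPENED when healthy, MISSING/CLOSED when lost
       header_status,
       mode_status,
       voting_file        -- 'Y' marks disks that carry a voting file
  FROM v$asm_disk
 WHERE voting_file = 'Y';
```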

At 12:21:00, LGWR on node 2 could no longer write the redo logs, and node 2 stopped serving requests.

Sat Dec 07 12:22:15 2024
LGWR (ospid: 6030322) waits for event 'log file parallel write' for 62 secs.
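
A stall like this can also be observed from another session while the instance is still reachable. A minimal sketch against gv$session/gv$process (during a complete I/O hang even this query may block):

```sql
-- Show what the log writer is currently waiting on, per instance (sketch)
SELECT s.inst_id,
       s.sid,
       p.spid,               -- OS PID, matches the "ospid" shown in the alert log
       s.event,
       s.seconds_in_wait
  FROM gv$session s
  JOIN gv$process p
    ON p.addr = s.paddr
   AND p.inst_id = s.inst_id
 WHERE s.program LIKE '%LGWR%';
```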


 

At 12:23:17, because asynchronous I/O could not complete, instance termination on node 2 left a number of processes behind.

*** 2024-12-07 12:23:17.085
Instance termination failed to kill one or more processes
ksuitm_check: OS PID=16974736 is still alive
ksuitm_check: OS PID=9045200 is still alive
ksuitm_check: OS PID=10618056 is still alive

……


 

At 12:23:29, node 1 exhausted its pre-reserved SCNs, LGWR began showing the 'wait for scn ack' wait event, and node 1 also stopped serving requests. At around the same time (12:23:29), a Cluster Group Service (CGS) reconfiguration was initiated.


Sat Dec 07 12:24:36 2024
LGWR (ospid: 4850844) waits for event 'wait for scn ack' for 77 secs.

From 12:23:29 to 12:28:31, node 1's wait event remained LGWR's 'wait for scn ack'.

Node 1's process-heartbeat health check kept reporting success, because node 2's leftover processes were still alive.

*** 2024-12-07 12:26:48.912

==============================

LGWR (ospid: 4850844) has not moved for 214 sec (1733545608.1733545394)

kjgcr_ServiceGCR: KJGCR_METRICS: Local metric heartbeat check, id 0 succeed

Hang Manager: Health: System:15, Cluster:0. Hangs: Local:3, Global:0

Hang Manager: Current no of hangs 58, no of impacted sessions - hangs:16 deadlocks:0
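
The hang counts reported by Hang Manager can be cross-checked from another session by listing who is blocked on whom. A minimal sketch against gv$session (BLOCKING_SESSION is populated in 11g and later):

```sql
-- List currently blocked sessions and which instance/session is blocking them (sketch)
SELECT inst_id,
       sid,
       event,
       blocking_instance,
       blocking_session,
       seconds_in_wait
  FROM gv$session
 WHERE blocking_session IS NOT NULL
 ORDER BY seconds_in_wait DESC;
```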

On node 1 this phase is governed by the hidden parameter _cgs_send_timeout: once the issued action had been outstanding for more than 300s, an IPC send timeout was triggered.

At that point, 12:28:31, LMON on node 1 marked node 2 as bad and initiated instance member recovery (IMR).

Sat Dec 07 12:28:31 2024
LMD1 (ospid: 5767608) has detected no messaging activity from instance 2
LMD1 (ospid: 5767608) issues an IMR to resolve the situation
Please check LMD1 trace file for more detail.
Sat Dec 07 12:28:31 2024

LMON (ospid: 5702064) drops the IMR request from LMD1 (ospid: 5767608) because IMR is in progress and inst 2 is marked bad.

Controlled by the _imr_dd_slave_wait_time parameter, the issued instance member recovery action times out after 30s, as can be seen in the logs.

On node 2, at 12:29:00 the clusterware tried to restart the node-2 instance and failed (because of the processes left behind by asynchronous I/O):

2024-12-07 12:29:00.046 [ORAAGENT(6816362)]CRS-5017: The resource action "ora.zjywdbdg.db start" encountered the following error:

2024-12-07 12:29:00.046+ORA-01034: ORACLE not available

ORA-27101: shared memory realm does not exist

IBM AIX RISC System/6000 Error: 2: No such file or directory

Additional information: 2820

Additional information: 2069113247

Process ID: 0

Session ID: 0 Serial number: 0

. For details refer to "(:CLSN00107:)" in "/u01/app/grid/diag/crs/pnewapsdb2/crs/trace/crsd_oraagent_oracle.trc".

Because the instance member recovery action failed, node 1's recovery escalated to a member kill, i.e. killing the group member. At 12:29:01 it evicted instance 2.

Sat Dec 07 12:29:01 2024

Evicting instance 2 from cluster

Waiting for instances to leave: 2

This behavior is controlled by the _imr_evicted_member_kill_wait parameter, which imposes a 20s wait.

At 12:29:21, node 1's clusterware issued the member-kill soft restart against node 2 (a soft restart only restarts the database instance):

2024-12-07 12:29:21.305 [OCSSD(3932964)]CRS-1663: Member kill issued by PID 5702064 for 1 members, group DBAPSDBDG. Details at (:CSSGM00044:) in /u01/app/grid/diag/crs/pnewapsdb1/crs/trace/ocssd.trc.
2024-12-07 12:31:22.368 [OCSSD(3932964)]CRS-1607: Node pnewapsdb2 is being evicted in cluster incarnation 577740118; details at (:CSSNM00007:) in /u01/app/grid/diag/crs/pnewapsdb1/crs/trace/ocssd.trc.


 

The clusterware processes on node 2 were still healthy. During the soft restart, the disk was found to be unavailable and the wait timed out; the timeout is controlled by _asm_hbeatiowait, and after 120s an ASM communication failure is returned.

At 12:31:21, node 2 detected that ASM was unavailable:

2024-12-07 12:31:21.814 :GIPCXCPT:10025:  gipcInternalDissociate: obj 114c3fdb0 [00000000a416e5d0] { gipcEndpoint : localAddr 'gipcha://pnewapsdb1:nm2_pnewapsdb/a55e-90d1-96eb-14d1', remoteAddr 'gipcha://pnewapsdb2:507b-76ef-0239-3995', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 3998440, readyRef 0, ready 1, wobj 114c490f0, sendp 0 status 0flags 0x20138606, flags-2 0x50, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)

2024-12-07 12:31:21.814 :GIPCXCPT:10025:  gipcDissociateF [clssnmDiscHelper : clssnm.c : 4248]: EXCEPTION[ ret gipcretFail (1) ]  failed to dissociate obj 114c3fdb0 [00000000a416e5d0] { gipcEndpoint : localAddr 'gipcha://pnewapsdb1:nm2_pnewapsdb/a55e-90d1-96eb-14d1', remoteAddr 'gipcha://pnewapsdb2:507b-76ef-0239-3995', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 3998440, readyRef 0, ready 1, wobj 114c490f0, sendp 0 status 0flags 0x20138606, flags-2 0x50, usrFlags 0x0 }, flags 0x0

Node 2 then cleaned up the crsd resources following the normal cluster shutdown logic:

2024-12-07 12:31:22.840 [OCSSD(3998440)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/grid/diag/crs/pnewapsdb2/crs/trace/ocssd.trc

2024-12-07 12:31:22.846 [OCSSD(3998440)]CRS-1608: This node was evicted by node 1, pnewapsdb1; details at (:CSSNM00005:) in /u01/app/grid/diag/crs/pnewapsdb2/crs/trace/ocssd.trc.

2024-12-07 12:31:22.872 [OCSSD(3998440)]CRS-1652: Starting clean up of CRSD resources.

The crsd process has a 60s delay when cleaning up ASM:

2024-12-07 12:31:22.956 :    AGFW:2828: [     INFO] {2:58926:27524} Started implicit monitor for [ora.net1.network pnewapsdb2 1] interval=60000 delay=60000

At 12:32:21, cssd on node 2 was judged unresponsive, and the server was scheduled to reboot 59s later:

2024-12-07 12:31:37.060 [OHASD(3474006)]CRS-2771: Maximum restart attempts reached for resource 'ora.crsd'; will not restart.

2024-12-07 12:32:21.435 [CSSDMONITOR(4260976)]CRS-1661: The CSS daemon is not responding. If this persists, a reboot will occur in 59107 milliseconds; details are at (:CLSN00111:) in /u01/app/grid/diag/crs/pnewapsdb2/crs/trace/ohasd_cssdmonitor_root.trc.

At 12:33:23, node 1 started a cluster reconfiguration; it completed 3s later and node 1 resumed serving requests.

Sat Dec 07 12:33:23 2024
Increasing priority of 2 RS

Reconfiguration started (old inc 32, new inc 36)
List of instances (total 1) :
 1
Dead instances (total 1) :
 2
My inst 1  
publish big name space -  dead or down/up instance detected, invalidate domain 0

……

Sat Dec 07 12:33:26 2024

Reconfiguration complete (total time 3.0 secs)


Node 2 subsequently finished rebooting and resumed serving requests at 12:41:22.

Sat Dec 07 12:41:22 2024

CJQ0 started with pid=87, OS id=8389058

Completed: ALTER DATABASE OPEN /* db agent *//* {2:37088:2} */
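
Once node 2 is back, a quick sanity check that both instances are open again can be run against gv$instance (a minimal sketch):

```sql
-- Confirm both RAC instances are OPEN and ACTIVE again (sketch)
SELECT inst_id,
       instance_name,
       host_name,
       status,            -- expect OPEN
       database_status    -- expect ACTIVE
  FROM gv$instance
 ORDER BY inst_id;
```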

  • Parameter table

| Parameter | Default | Description |
| --- | --- | --- |
| _cgs_send_timeout | 300 | CGS send timeout value |
| _imr_dd_slave_wait_time | 30 | IMR max time LMON will wait for slave response in seconds |
| _imr_evicted_member_kill_wait | 20 | IMR evicted member kill wait time in seconds |
| _asm_hbeatiowait | 120 | number of secs to wait for PST Async Hbeat IO return |
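
All four are hidden (underscore) parameters and do not show up in v$parameter. A rough sketch for reading their current values from the x$ tables as SYSDBA (defaults vary by version, _asm_hbeatiowait belongs to the ASM instance, and changes should only be made under Oracle Support guidance):

```sql
-- Read hidden parameter values from the x$ tables (run as SYSDBA; sketch only)
-- _asm_hbeatiowait must be queried on the ASM instance rather than the database.
SELECT i.ksppinm   AS parameter,
       v.ksppstvl  AS current_value,
       i.ksppdesc  AS description
  FROM x$ksppi  i,
       x$ksppcv v
 WHERE i.indx = v.indx
   AND i.ksppinm IN ('_cgs_send_timeout',
                     '_imr_dd_slave_wait_time',
                     '_imr_evicted_member_kill_wait',
                     '_asm_hbeatiowait');
```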

  • Summary

At 12:20:40, the clusterware on node 2 detected that one voting disk had failed; after evicting it, two disks still survived, so the cluster stack considered itself healthy. The database on node 2, however, had been unable to write to disk since 12:21:00 and stopped serving requests; at 12:23:17 the instance was killed, but because its processes had asynchronous I/O outstanding, a large number of them were left behind.

Node 2's leftover processes dragged every inter-process communication into its timeout. Node 1 was blocked because SCN generation had to be coordinated with node 2, yet its process-heartbeat health check kept reporting success. Node 1 then initiated a cluster reconfiguration, which still had to reach node 2's leftover processes and therefore timed out. After that timeout (300s), node 1 initiated instance member recovery, which itself times out after 30s; once it did, the recovery escalated to a soft restart, i.e. restarting the database instance through the clusterware, an action that is issued after a further 20s wait.

20s later, node 2 received the member kill. While the clusterware accessed the ASM instance, the heartbeat I/O kept failing and hit the 120s timeout, after which CRSD had to clean up the ASM resources. CRSD's monitoring carries a 60s delay; once crsd was judged unusable, cssd was deemed unresponsive and a node reboot was initiated, announced to occur 59s later.

Node 2 rebooted; node 1 initiated a reconfiguration that completed 3s later, and node 1 then resumed serving requests.
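
Summing the timeouts against the timestamps quoted above shows, approximately, where node 1's roughly 10-minute outage (12:23:29 to 12:33:26, about 597s) went:

| Phase | Timeout | Window |
| --- | --- | --- |
| _cgs_send_timeout (IPC send timeout) | 300s | 12:23:29 → 12:28:31 |
| _imr_dd_slave_wait_time (IMR wait) | 30s | 12:28:31 → 12:29:01 |
| _imr_evicted_member_kill_wait | 20s | 12:29:01 → 12:29:21 |
| _asm_hbeatiowait (ASM heartbeat I/O) | 120s | 12:29:21 → 12:31:21 |
| crsd cleanup / implicit monitor delay | ~60s | 12:31:22 → 12:32:21 |
| cssd reboot countdown | ~59s | 12:32:21 → 12:33:20 |
| Cluster reconfiguration on node 1 | 3s | 12:33:23 → 12:33:26 |

300 + 30 + 20 + 120 + ~60 + ~59 + 3 ≈ 592s, which accounts for essentially the whole observed outage on node 1.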
