节点失败,VIP不能切换分析

本文记录了一次数据库高可用演练过程中出现的RAC VIP切换延迟问题。演练中发现,相较于node1,node2重启后的VIP切换耗时显著增加,经过8分钟才完成。通过分析crsd.log和ocssd.log等日志文件,定位到问题是由Oracle CSS/CRS未能及时识别node2的状态变化导致。
部署运行你感兴趣的模型镜像

高可用对于核心的数据库是一个保障,我们的核心库中,每个月都要进行高可用演练,在这次演练的时候RAC的VIP切换出现了点怪异的问题,目前还没有找到可以信服的原因。
下面看下这个切换过程的不正常:
我先大致描述下这个演练的过程:1,node1 reboot -q>check the node1-vip's failover 2,node2 reboot -q>check the node2-vip's failover
当我们reboot -q node1的时候node1-vip很快切换到node2,但是当我们reboot -q node2的时候node2的node2-vip足足花了将近8分钟时间,直到node2 reboot完成后
node2-vip才在node1上startup sucessfully。
note:the time for reboot -q on node2 is 00:00左右,我们使用VCS,所以timecount是600s。
node1's crsd.log:

2012-06-21 00:04:49.873: [  OCRMSG][11227]prom_rpc: CLSC recv failure..ret code 7
2012-06-21 00:04:49.873: [  OCRMSG][11227]prom_rpc: possible OCR retry scenario
2012-06-21 00:04:49.873: [ COMMCRS][11740]clscsendx: (113900410) Physical connection (1138ffb90) not active
2012-06-21 00:04:49.873: [  OCRMSG][11740]prom_rpc: CLSC send failure..ret code 11
2012-06-21 00:04:49.873: [  OCRMSG][11740]prom_rpc: possible OCR retry scenario
2012-06-21 00:04:49.874: [ COMMCRS][12255]clscsendx: (113900410) Connection not active
2012-06-21 00:04:49.874: [  OCRMSG][12255]prom_rpc: CLSC send failure..ret code 6
2012-06-21 00:04:49.874: [  OCRMSG][12255]prom_rpc: possible OCR retry scenario
2012-06-21 00:04:49.874: [ COMMCRS][12514]clscsendx: (113900410) Connection not active
2012-06-21 00:04:49.874: [  OCRMSG][12514]prom_rpc: CLSC send failure..ret code 6
2012-06-21 00:04:49.874: [  OCRMSG][12514]prom_rpc: possible OCR retry scenario
2012-06-21 00:07:31.242: [ CRSCOMM][12777]32CLEANUP: Searching for connections to failed node node2
2012-06-21 00:07:31.242: [  CRSEVT][12777]32Processing member leave for node2, incarnation: 4
2012-06-21 00:07:31.243: [    CRSD][12777]32SM: recovery in process: 8
2012-06-21 00:07:31.243: [  CRSEVT][12777]32Do failover for: node2
2012-06-21 00:07:31.247: [ COMMCRS][12777]clscsendx: (113900410) Connection not active
2012-06-21 00:07:31.247: [  OCRMSG][12777]prom_rpc: CLSC send failure..ret code 6
2012-06-21 00:07:31.247: [  OCRMSG][12777]prom_rpc: possible OCR retry scenario
2012-06-21 00:07:31.320: [  OCRMAS][3366]th_master:13: I AM THE NEW OCR MASTER at incar 4. Node Number 1
2012-06-21 00:07:31.322: [  OCRRAW][3366]proprioo: for disk 0 (/crsdata/ocr1), id match (1), my id set (725219174,2133226085) total id sets (1), 1st set (725219174,2133226085), 2nd set (0,0) my votes (1), total votes (2)
2012-06-21 00:07:31.322: [  OCRRAW][3366]proprioo: for disk 1 (/crsdata/ocr2), id match (1), my id set (725219174,2133226085) total id sets (1), 1st set (725219174,2133226085), 2nd set (0,0) my votes (1), total votes (2)
2012-06-21 00:07:31.324: [  OCRMAS][3366]th_master: Deleted ver keys from cache (master)
2012-06-21 00:07:31.713: [  CRSRES][12777]32 startup = 0
2012-06-21 00:07:31.720: [  CRSRES][12777]32 startup = 0
2012-06-21 00:07:31.726: [  CRSRES][12777]32 startup = 0
2012-06-21 00:07:31.732: [  CRSRES][12777]32 startup = 0
2012-06-21 00:07:31.737: [  CRSRES][12777]32 startup = 0
2012-06-21 00:07:31.743: [  CRSRES][12777]32 startup = 0
2012-06-21 00:07:31.749: [  CRSRES][12777]32 startup = 0
2012-06-21 00:07:31.829: [  CRSRES][13293]32startRunnable: setting CLI values
2012-06-21 00:07:31.918: [  CRSRES][13550]32startRunnable: setting CLI values
2012-06-21 00:07:31.955: [  CRSRES][13293]32Attempting to start `ora.node2.vip` on member `node1`
2012-06-21 00:07:31.955: [  CRSRES][13550]32Attempting to start `ora.node1.gsd` on member `node1`
2012-06-21 00:07:32.045: [  CRSRES][13807]32startRunnable: setting CLI values
2012-06-21 00:07:32.071: [  CRSRES][13807]32Attempting to start `ora.rac.db` on member `node1`>>注意这里的时间和vip启动
node1的alert:
2012-06-21 00:07:31.229
[cssd(589846)]CRS-1601:CSSD Reconfiguration complete. Active nodes are node1 .
2012-06-21 00:07:31.323
[crsd(405814)]CRS-1005:The OCR upgrade was completed. Version has changed from 169870080 to 169870080. Details in /u01/oracle/app/oracle/product/10.2.0/crs/log/node1/crsd/crsd.log.
2012-06-21 00:07:32.318
[cssd(589846)]CRS-1601:CSSD Reconfiguration complete. Active nodes are node1 node2 .
从以上alert可以看到在00:07:32,reconfiguration完成了,两个active nodes。
再来看看node1 ocssd.log:
[    CSSD]2012-06-21 00:00:36.287 [1029] >TRACE:   clssnm_skgxncheck: CSS daemon failed on node 2>>>>这个时间我们reboot -q on node2
[    CSSD]2012-06-21 00:04:43.355 [3085] >WARNING: clssnmeventhndlr: Receive failure with node 2 (node2), state 3, con(111aa2c70), probe(0), rc=11
[    CSSD]2012-06-21 00:04:43.355 [3085] >TRACE:   clssnmDiscHelper: node2, node(2) connection failed, con (111aa2c70), probe(0)
[    CSSD]2012-06-21 00:04:57.225 [4370] >WARNING: clssnmPollingThread: node node2 (2) at 50% heartbeat fatal, eviction in 299.398 seconds>>>>我们使用VCS,默认的misscount是600s,所以这里
大概过了300s,节点node2还没有被eviction。
[    CSSD]2012-06-21 00:07:20.343 [1029] >TRACE:   clssnm_skgxncheck: completing eviction for bounced node 2
[    CSSD]2012-06-21 00:07:20.343 [1029] >TRACE:   clssnmDiscHelper: node2, node(2) connection failed, con (0), probe(0)
[    CSSD]2012-06-21 00:07:21.189 [4884] >TRACE:   clssnmDoSyncUpdate: Initiating sync 4
[    CSSD]2012-06-21 00:07:21.189 [4884] >TRACE:   clssnmDoSyncUpdate: diskTimeout set to (597000)ms
[    CSSD]2012-06-21 00:07:21.189 [4884] >TRACE:   clssnmSetupAckWait: Ack message type (11) 
[    CSSD]2012-06-21 00:07:21.189 [4884] >TRACE:   clssnmSetupAckWait: node(1) is ALIVE
[    CSSD]2012-06-21 00:07:21.189 [4884] >TRACE:   clssnmSendSync: syncSeqNo(4)
[    CSSD]2012-06-21 00:07:21.189 [4884] >TRACE:   clssnmWaitForAcks: Ack message type(11), ackCount(1)
[    CSSD]2012-06-21 00:07:21.189 [3085] >TRACE:   clssnmHandleSync: diskTimeout set to (597000)ms
[    CSSD]2012-06-21 00:07:21.189 [3085] >TRACE:   clssnmHandleSync: Acknowledging sync: src[1] srcName[node1] seq[1] sync[4]
[    CSSD]2012-06-21 00:07:21.189 [4884] >TRACE:   clssnmWaitForAcks: done, msg type(11)
[    CSSD]2012-06-21 00:07:21.189 [4884] >TRACE:   clssnmDoSyncUpdate: Terminating node 2, node2, misstime(444564) state(5)
[    CSSD]2012-06-21 00:07:21.189 [4884] >TRACE:   clssnmSetupAckWait: Ack message type (13) 
[    CSSD]2012-06-21 00:07:21.189 [1] >USER:    NMEVENT_SUSPEND [00][00][00][06]
[    CSSD]2012-06-21 00:07:21.189 [4884] >TRACE:   clssnmSetupAckWait: node(1) is ACTIVE
[    CSSD]2012-06-21 00:07:21.189 [4884] >TRACE:   clssnmSendVote: syncSeqNo(4)
[    CSSD]2012-06-21 00:07:21.189 [4884] >TRACE:   clssnmWaitForAcks: Ack message type(13), ackCount(1)
[    CSSD]2012-06-21 00:07:21.189 [3085] >TRACE:   clssnmSendVoteInfo: node(1) syncSeqNo(4)
[    CSSD]2012-06-21 00:07:21.189 [4884] >TRACE:   clssnmWaitForAcks: done, msg type(13)
[    CSSD]2012-06-21 00:07:21.190 [4884] >TRACE:   clssnmCheckDskInfo: Checking disk info...
[    CSSD]2012-06-21 00:07:21.226 [4884] >TRACE:   clssnmCheckDskInfo: node(2) timeout(443932) state_network(5) state_disk(3) misstime(444564)
[    CSSD]2012-06-21 00:07:22.226 [4884] >TRACE:   clssnmCheckDskInfo: node(2) timeout(444932) state_network(5) state_disk(3) misstime(445566)
[    CSSD]2012-06-21 00:07:23.226 [4884] >TRACE:   clssnmCheckDskInfo: node(2) timeout(445932) state_network(5) state_disk(3) misstime(446568)
[    CSSD]2012-06-21 00:07:24.226 [4884] >TRACE:   clssnmCheckDskInfo: node(2) timeout(446932) state_network(5) state_disk(3) misstime(447569)
[    CSSD]2012-06-21 00:07:25.226 [4884] >TRACE:   clssnmCheckDskInfo: node(2) timeout(447932) state_network(5) state_disk(3) misstime(448569)
[    CSSD]2012-06-21 00:07:26.226 [4884] >TRACE:   clssnmCheckDskInfo: node(2) timeout(448932) state_network(5) state_disk(3) misstime(449571)
[    CSSD]2012-06-21 00:07:27.226 [4884] >TRACE:   clssnmCheckDskInfo: node(2) timeout(449932) state_network(5) state_disk(3) misstime(450572)
[    CSSD]2012-06-21 00:07:28.226 [4884] >TRACE:   clssnmCheckDskInfo: node(2) timeout(450932) state_network(5) state_disk(3) misstime(451572)
[    CSSD]2012-06-21 00:07:29.226 [4884] >TRACE:   clssnmCheckDskInfo: node(2) timeout(451932) state_network(5) state_disk(3) misstime(452575)
[    CSSD]2012-06-21 00:07:30.226 [4884] >TRACE:   clssnmCheckDskInfo: node(2) timeout(452932) state_network(5) state_disk(3) misstime(453575)
[    CSSD]2012-06-21 00:07:30.639 [1543] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(1) wrtcnt(1) LATS(775519945) Disk lastSeqNo(1)
[    CSSD]2012-06-21 00:07:30.666 [1286] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(1) wrtcnt(2) LATS(775519972) Disk lastSeqNo(2)
[    CSSD]2012-06-21 00:07:30.715 [1800] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(1) wrtcnt(3) LATS(775520021) Disk lastSeqNo(3)
[    CSSD]2012-06-21 00:07:31.226 [4884] >TRACE:   clssnmCheckDskInfo: node 2, node2, state 5 with leader 2 has smaller cluster size 0; my cluster size 1 with leader 1
[    CSSD]2012-06-21 00:07:31.226 [4884] >TRACE:   clssnmEvict: Start
[    CSSD]2012-06-21 00:07:31.226 [4884] >TRACE:   clssnmWaitOnEvictions: Start
[    CSSD]2012-06-21 00:07:31.226 [4884] >TRACE:   clssnmWaitOnEvictions: Node(0) down, LATS(0),timeout(775520533)
[    CSSD]2012-06-21 00:07:31.226 [4884] >TRACE:   clssnmSetupAckWait: Ack message type (15) 
[    CSSD]2012-06-21 00:07:31.226 [4884] >TRACE:   clssnmSetupAckWait: node(1) is ACTIVE
[    CSSD]2012-06-21 00:07:31.226 [4884] >TRACE:   clssnmSendUpdate: syncSeqNo(4)
[    CSSD]2012-06-21 00:07:31.227 [4884] >TRACE:   clssnmWaitForAcks: Ack message type(15), ackCount(1)
[    CSSD]2012-06-21 00:07:31.227 [3085] >TRACE:   clssnmUpdateNodeState: node 0, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)
[    CSSD]2012-06-21 00:07:31.227 [3085] >TRACE:   clssnmUpdateNodeState: node 1, state (3/3) unique (1340205072/1340205072) prevConuni(0) birth (3/3) (old/new)
[    CSSD]2012-06-21 00:07:31.227 [3085] >TRACE:   clssnmUpdateNodeState: node 2, state (0/0) unique (1339991520/1339991520) prevConuni(1339991520) birth (1/0) (old/new)
[    CSSD]2012-06-21 00:07:31.227 [3085] >USER:    clssnmHandleUpdate: SYNC(4) from node(1) completed
[    CSSD]2012-06-21 00:07:31.227 [3085] >USER:    clssnmHandleUpdate: NODE 1 (node1) IS ACTIVE MEMBER OF CLUSTER
[    CSSD]2012-06-21 00:07:31.227 [3085] >TRACE:   clssnmHandleUpdate: diskTimeout set to (597000)ms
[    CSSD]2012-06-21 00:07:31.227 [4884] >TRACE:   clssnmWaitForAcks: done, msg type(15)
[    CSSD]2012-06-21 00:07:31.227 [4884] >TRACE:   clssnmDoSyncUpdate: Sync Complete!
[    CSSD]2012-06-21 00:07:31.227 [5142] >TRACE:   clssgmReconfigThread:  started for reconfig (4)
[    CSSD]2012-06-21 00:07:31.227 [5142] >USER:    NMEVENT_RECONFIG [00][00][00][02]
[    CSSD]2012-06-21 00:07:31.227 [5142] >TRACE:   clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_node1 type 3
[    CSSD]2012-06-21 00:07:31.227 [5142] >TRACE:   clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_node2 type 3
[    CSSD]2012-06-21 00:07:31.227 [5142] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(0) grock(_ORA_CRS_MEMBER_node2) birth(1/1)
[    CSSD]2012-06-21 00:07:31.227 [5142] >TRACE:   clssgmCleanupGrocks: cleaning up grock crs_version type 2
[    CSSD]2012-06-21 00:07:31.227 [5142] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(crs_version) birth(1/1)
[    CSSD]2012-06-21 00:07:31.227 [5142] >TRACE:   clssgmCleanupGrocks: cleaning up grock ORA_CLSRD_1_rac type 2
[    CSSD]2012-06-21 00:07:31.227 [5142] >TRACE:   clssgmCleanupGrocks: cleaning up grock ORA_CLSRD_1_rac type 3
[    CSSD]2012-06-21 00:07:31.227 [5142] >TRACE:   clssgmCleanupGrocks: cleaning up grock ORA_CLSRD_2_rac type 3
[    CSSD]2012-06-21 00:07:31.227 [5142] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(0) grock(ORA_CLSRD_2_rac) birth(1/1)
[    CSSD]2012-06-21 00:07:31.227 [5142] >TRACE:   clssgmCleanupGrocks: cleaning up grock ORA_CLSRD_2_rac type 2
[    CSSD]2012-06-21 00:07:31.227 [5142] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(0) grock(ORA_CLSRD_2_rac) birth(1/1)
[    CSSD]2012-06-21 00:07:31.227 [5142] >TRACE:   clssgmCleanupGrocks: cleaning up grock DBrac type 2
[    CSSD]2012-06-21 00:07:31.227 [5142] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(DBrac) birth(1/1)
[    CSSD]2012-06-21 00:07:31.227 [5142] >TRACE:   clssgmCleanupGrocks: cleaning up grock DGrac type 2
[    CSSD]2012-06-21 00:07:31.227 [5142] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(DGrac) birth(1/1)
[    CSSD]2012-06-21 00:07:31.228 [5142] >TRACE:   clssgmCleanupGrocks: cleaning up grock DAALL_DB type 2
[    CSSD]2012-06-21 00:07:31.228 [5142] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(DAALL_DB) birth(1/1)
[    CSSD]2012-06-21 00:07:31.228 [5142] >TRACE:   clssgmCleanupGrocks: cleaning up grock CRSDMAIN type 2
[    CSSD]2012-06-21 00:07:31.228 [5142] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(2) grock(CRSDMAIN) birth(1/1)
[    CSSD]2012-06-21 00:07:31.228 [5142] >TRACE:   clssgmCleanupGrocks: cleaning up grock EVMDMAIN type 2
[    CSSD]2012-06-21 00:07:31.228 [5142] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(2) grock(EVMDMAIN) birth(1/1)
[    CSSD]2012-06-21 00:07:31.228 [5142] >TRACE:   clssgmCleanupGrocks: cleaning up grock IGracALL type 2
[    CSSD]2012-06-21 00:07:31.228 [5142] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(2) grock(IGracALL) birth(1/1)
[    CSSD]2012-06-21 00:07:31.228 [5142] >TRACE:   clssgmCleanupGrocks: cleaning up grock ocr_crs type 2
[    CSSD]2012-06-21 00:07:31.228 [5142] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(2) grock(ocr_crs) birth(1/1)
[    CSSD]2012-06-21 00:07:31.228 [5142] >TRACE:   clssgmCleanupGrocks: cleaning up grock IGracrac1 type 2
[    CSSD]2012-06-21 00:07:31.228 [5142] >TRACE:   clssgmCleanupGrocks: cleaning up grock IGracrac2 type 2
[    CSSD]2012-06-21 00:07:31.228 [5142] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(2) grock(IGracrac2) birth(1/1)
[    CSSD]2012-06-21 00:07:31.228 [5142] >TRACE:   clssgmCleanupGrocks: cleaning up grock #CSS_CLSSOMON type 2
[    CSSD]2012-06-21 00:07:31.228 [5142] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(2) grock(#CSS_CLSSOMON) birth(1/1)
[    CSSD]2012-06-21 00:07:31.228 [5142] >TRACE:   clssgmEstablishConnections: 1 nodes in cluster incarn 4
[    CSSD]2012-06-21 00:07:31.228 [4113] >TRACE:   clssgmPeerDeactivate: node 2 (node2), death 4, state 0x80000000 connstate 0xf
[    CSSD]2012-06-21 00:07:31.228 [4113] >TRACE:   clssgmPeerListener: connects done (1/1)
[    CSSD]2012-06-21 00:07:31.228 [5142] >TRACE:   clssgmEstablishMasterNode: MASTER for 4 is node(1) birth(3)
[    CSSD]2012-06-21 00:07:31.228 [5142] >TRACE:   clssgmChangeMasterNode: requeued 0 RPCs
[    CSSD]2012-06-21 00:07:31.228 [5142] >TRACE:   clssgmMasterCMSync: Synchronizing group/lock status
[    CSSD]2012-06-21 00:07:31.229 [5142] >TRACE:   clssgmMasterSendDBDone: group/lock status synchronization complete
[    CSSD]CLSS-3000: reconfiguration successful, incarnation 4 with 1 nodes
[    CSSD]CLSS-3001: local node number 1, master node number 1
[    CSSD]2012-06-21 00:07:31.229 [5142] >TRACE:   clssgmReconfigThread:  completed for reconfig(4), with status(1)
[    CSSD]2012-06-21 00:07:31.818 [3085] >TRACE:   clssnmConnComplete: MSGSRC 2, type 5, node 2, flags 0x0001, con 112c2e470, probe 0
[    CSSD]2012-06-21 00:07:31.818 [3085] >TRACE:   clssnmConnComplete: connecting to node 2 con(112c2e470), msg state 1, birth 0, probcon(0), ninfcon(0), prevConuni(0), node state 0
[    CSSD]2012-06-21 00:07:31.818 [3085] >TRACE:   clssnmsendConnAck: node 2, node state 0
[    CSSD]2012-06-21 00:07:31.818 [3085] >TRACE:   clssnmSendConnAck: connected to node 2 , ninfcon (112c2e470), state (0)
[    CSSD]2012-06-21 00:07:31.818 [3085] >TRACE:   clssnmConnComplete: connecting to node 2 (con 112c2e470), ninfcon (112c2e470), state (0)
[    CSSD]2012-06-21 00:07:32.203 [4884] >TRACE:   clssnmDoSyncUpdate: Initiating sync 5
[    CSSD]2012-06-21 00:07:32.203 [4884] >TRACE:   clssnmDoSyncUpdate: diskTimeout set to (597000)ms
[    CSSD]2012-06-21 00:07:32.203 [4884] >TRACE:   clssnmSetupAckWait: Ack message type (11) 
[    CSSD]2012-06-21 00:07:32.203 [4884] >TRACE:   clssnmSetupAckWait: node(1) is ALIVE
[    CSSD]2012-06-21 00:07:32.203 [4884] >TRACE:   clssnmSetupAckWait: node(2) is ALIVE
[    CSSD]2012-06-21 00:07:32.203 [4884] >TRACE:   clssnmSendSync: syncSeqNo(5)
[    CSSD]2012-06-21 00:07:32.203 [3085] >TRACE:   clssnmHandleSync: diskTimeout set to (597000)ms
[    CSSD]2012-06-21 00:07:32.203 [4884] >TRACE:   clssnmWaitForAcks: Ack message type(11), ackCount(2)
[    CSSD]2012-06-21 00:07:32.203 [3085] >TRACE:   clssnmHandleSync: Acknowledging sync: src[1] srcName[node1] seq[5] sync[5]
[    CSSD]2012-06-21 00:07:32.203 [1] >USER:    NMEVENT_SUSPEND [00][00][00][02]
[    CSSD]2012-06-21 00:07:32.204 [4884] >TRACE:   clssnmWaitForAcks: done, msg type(11)
[    CSSD]2012-06-21 00:07:32.204 [4884] >TRACE:   clssnmDoSyncUpdate: node(2) is transitioning from joining state to active state
[    CSSD]2012-06-21 00:07:32.204 [4884] >TRACE:   clssnmSetupAckWait: Ack message type (13) 
[    CSSD]2012-06-21 00:07:32.204 [4884] >TRACE:   clssnmSetupAckWait: node(1) is ACTIVE
[    CSSD]2012-06-21 00:07:32.204 [4884] >TRACE:   clssnmSetupAckWait: node(2) is ACTIVE
[    CSSD]2012-06-21 00:07:32.204 [4884] >TRACE:   clssnmSendVote: syncSeqNo(5)
[    CSSD]2012-06-21 00:07:32.204 [3085] >TRACE:   clssnmSendVoteInfo: node(1) syncSeqNo(5)
[    CSSD]2012-06-21 00:07:32.204 [4884] >TRACE:   clssnmWaitForAcks: Ack message type(13), ackCount(2)
[    CSSD]2012-06-21 00:07:32.204 [4884] >TRACE:   clssnmWaitForAcks: done, msg type(13)
[    CSSD]2012-06-21 00:07:32.204 [4884] >TRACE:   clssnmCheckDskInfo: Checking disk info...
[    CSSD]2012-06-21 00:07:32.204 [4884] >TRACE:   clssnmEvict: Start
[    CSSD]2012-06-21 00:07:32.204 [4884] >TRACE:   clssnmWaitOnEvictions: Start
[    CSSD]2012-06-21 00:07:32.204 [4884] >TRACE:   clssnmWaitOnEvictions: Node(0) down, LATS(0),timeout(775521510)
[    CSSD]2012-06-21 00:07:32.204 [4884] >TRACE:   clssnmSetupAckWait: Ack message type (15) 
[    CSSD]2012-06-21 00:07:32.204 [4884] >TRACE:   clssnmSetupAckWait: node(1) is ACTIVE
[    CSSD]2012-06-21 00:07:32.204 [4884] >TRACE:   clssnmSetupAckWait: node(2) is ACTIVE
从以上可以看出当node2 reoot -q的时候并没有立即对node2做 [ CRSEVT]Do failover for: node2,而是当node2已经起来的时候足足过了7分多钟以后,node2-vip才切换成功。
node2 crsd.log:
2012-06-21 00:07:17.032: [ default][1][ENTER]32
Oracle Database 10g CRS Release 10.2.0.3.0 Production Copyright 1996, 2004, Oracle.  All rights reserved
2012-06-21 00:07:17.038: [ default][1]32CRS Daemon Starting
2012-06-21 00:07:17.039: [ CRSMAIN][1]32Checking the OCR device
2012-06-21 00:07:17.042: [ CRSMAIN][1]32Connecting to the CSS Daemon
2012-06-21 00:07:17.383: [ COMMCRS][259]clsc_connect: (1109a9e90) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node2_crs))
2012-06-21 00:07:17.383: [ CSSCLNT][1]clsssInitNative: connect failed, rc 9
2012-06-21 00:07:17.384: [  CRSRTI][1]32CSS is not ready. Received status 3 from CSS. Waiting for good status .. 
2012-06-21 00:07:18.717: [ COMMCRS][261]clsc_connect: (1109a9e90) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node2_crs))
2012-06-21 00:07:18.717: [ CSSCLNT][1]clsssInitNative: connect failed, rc 9
2012-06-21 00:07:18.717: [  CRSRTI][1]32CSS is not ready. Received status 3 from CSS. Waiting for good status .. 
2012-06-21 00:07:20.047: [ COMMCRS][263]clsc_connect: (1109a9e90) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node2_crs))
……
2012-06-21 00:07:29.361: [  CRSRTI][1]32CSS is not ready. Received status 3 from CSS. Waiting for good status .. 
2012-06-21 00:07:30.691: [ COMMCRS][279]clsc_connect: (1109a9e90) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node2_crs))
2012-06-21 00:07:30.691: [ CSSCLNT][1]clsssInitNative: connect failed, rc 9
2012-06-21 00:07:30.691: [  CRSRTI][1]32CSS is not ready. Received status 3 from CSS. Waiting for good status .. 
2012-06-21 00:07:32.699: [    CRSD][1]32Daemon Version: 10.2.0.3.0 Active Version: 10.2.0.3.0
2012-06-21 00:07:32.699: [    CRSD][1]32Active Version and Software Version are same
2012-06-21 00:07:32.700: [ CRSMAIN][1]32Initializing OCR
2012-06-21 00:07:32.701: [  OCRRAW][1]proprioo: for disk 0 (/crsdata/ocr1), id match (1), my id set (725219174,2133226085) total id sets (1), 1st set (725219174,2133226085), 2nd set (0,0) my votes (1), total votes (2)
2012-06-21 00:07:32.701: [  OCRRAW][1]proprioo: for disk 1 (/crsdata/ocr2), id match (1), my id set (725219174,2133226085) total id sets (1), 1st set (725219174,2133226085), 2nd set (0,0) my votes (1), total votes (2)
2012-06-21 00:07:32.826: [    CRSD][1]32ENV Logging level for Module: allcomp  0
……
2012-06-21 00:07:32.848: [    CRSD][1]32ENV Logging level for Module: OCRMAS  0
2012-06-21 00:07:32.848: [ CRSMAIN][1]32Filename is /u01/oracle/app/oracle/product/10.2.0/crs/crs/init/node2.pid
2012-06-21 00:07:32.889: [ CRSMAIN][1]32Using Authorizer location: /u01/oracle/app/oracle/product/10.2.0/crs/crs/auth/
[  clsdmt][7992]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=node2DBG_CRSD))
2012-06-21 00:07:32.901: [ CRSMAIN][1]32Initializing RTI
2012-06-21 00:07:32.990: [CRSTIMER][8506]32Timer Thread Starting.
2012-06-21 00:07:33.012: [  CRSRES][1]32Parameter SECURITY = 1, running in USER Mode
2012-06-21 00:07:33.012: [ CRSMAIN][1]32Initializing EVMMgr
2012-06-21 00:07:33.213: [ COMMCRS][8763]clsc_connect: (1158d9410) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=SYSTEM.evm.acceptor.auth))
2012-06-21 00:07:35.148: [  CRSRES][1]32Skipping locked resource ora.node2.vip
2012-06-21 00:07:35.149: [  CRSRES][1]32Skipping locked resource ora.node1.gsd
2012-06-21 00:07:35.191: [ CRSMAIN][1]32CRSD locked during state recovery, please wait.
2012-06-21 00:07:35.254: [ CRSMAIN][1]32CRSD recovered, unlocked.
2012-06-21 00:07:35.288: [ CRSMAIN][1]32QS socket on: (ADDRESS=(PROTOCOL=ipc)(KEY=ora_crsqs))
2012-06-21 00:07:35.312: [ CRSMAIN][1]32CRSD UI socket on: (ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_UI_SOCKET))
2012-06-21 00:07:35.324: [ CRSMAIN][1]32E2E socket on: (ADDRESS=(PROTOCOL=tcp)(HOST=node2_priv)(PORT=49896))
2012-06-21 00:07:35.324: [ CRSMAIN][1]32Starting Threads
2012-06-21 00:07:35.324: [ CRSMAIN][10049]32Starting runCommandServer for (UI = 1, E2E = 0). 0
2012-06-21 00:07:35.325: [ CRSMAIN][10306]32Starting runCommandServer for (UI = 1, E2E = 0). 1
2012-06-21 00:07:35.325: [ CRSMAIN][1]32CRS Daemon Started.
2012-06-21 00:07:35.389: [  CRSRTI][1]32...lock denied for _ORA_CRS_MEMBER_node2
2012-06-21 00:07:38.986: [    CRSD][10049]32SM: rE2Ec: 4
2012-06-21 00:07:39.078: [  CRSRES][11077]32startRunnable: setting CLI values
2012-06-21 00:07:43.030: [    CRSD][11077]32SM:dE2Ec: all E2E cmds done. 0
2012-06-21 00:07:43.034: [    CRSD][10049]32SM: rE2Ec: 4
2012-06-21 00:07:43.045: [  CRSRES][11079]32startRunnable: setting CLI values
2012-06-21 00:07:46.235: [    CRSD][11079]32SM:dE2Ec: all E2E cmds done. 0
2012-06-21 00:08:14.044: [  CRSRES][1]32 startup = 1
2012-06-21 00:08:14.050: [  CRSRES][1]32 startup = 1
2012-06-21 00:08:14.066: [  CRSRES][1]32 startup = 1
2012-06-21 00:08:14.084: [  CRSEVT][11084]32Processing RecoveryDone
2012-06-21 00:08:14.088: [  CRSRES][1]32 startup = 1
2012-06-21 00:08:14.208: [  CRSRES][11085]32startRunnable: setting CLI values
2012-06-21 00:08:14.229: [  CRSRES][11085]32Attempting to start `ora.rac.rac2.inst` on member `node2`
2012-06-21 00:08:43.134: [  CRSAPP][11085]32StartResource error for ora.rac.rac2.inst error code = 1
2012-06-21 00:08:51.023: [  CRSRES][11085]32Start of `ora.rac.rac2.inst` on member `node2` failed.
2012-06-21 00:08:51.135: [  CRSRES][11097]32startRunnable: setting CLI values
2012-06-21 00:08:51.168: [  CRSRES][11610]32startRunnable: setting CLI values
2012-06-21 00:08:51.169: [  CRSRES][11097]32Attempting to start `ora.node2.gsd` on member `node2`
2012-06-21 00:08:51.200: [  CRSRES][11610]32Attempting to start `ora.node2.ons` on member `node2`
2012-06-21 00:08:53.647: [  CRSRES][11610]32Start of `ora.node2.ons` on member `node2` succeeded.
2012-06-21 00:09:30.893: [  CRSAPP][11097]32StartResource error for ora.node2.gsd error code = 1
2012-06-21 00:09:32.552: [  CRSRES][11097]32Start of `ora.node2.gsd` on member `node2` failed.
2012-06-21 00:10:06.054: [  CRSRES][11614]32startRunnable: setting CLI values
2012-06-21 00:10:06.064: [  CRSRES][11614]32Attempting to start `ora.rac.rac2.inst` on member `node2`
2012-06-21 00:10:07.215: [  OCRUTL][7478]u_freem: mem passed is null
2012-06-21 00:10:08.703: [  CRSRES][11614]32Start of `ora.rac.rac2.inst` on member `node2` succeeded.>>>node2 数据库启动成功。
node2 ocssd.log:
[    CSSD]2012-06-21 00:07:19.681 >USER:    Oracle Database 10g CSS Release 10.2.0.3.0 Production Copyright 1996, 2004 Oracle.  All rights reserved.
[    CSSD]2012-06-21 00:07:19.681 >USER:    CSS daemon log for node node2, number 2, in cluster crs
[    CSSD]2012-06-21 00:07:19.698 [1] >TRACE:   clssscmain: local-only set to false
[  clsdmt]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=node2DBG_CSSD))
[    CSSD]2012-06-21 00:07:19.726 [1] >TRACE:   clssnmReadNodeInfo: added node 1 (node1) to cluster
[    CSSD]2012-06-21 00:07:19.727 [1] >TRACE:   clssnmReadNodeInfo: added node 2 (node2) to cluster
[    CSSD]2012-06-21 00:07:19.751 [1029] >TRACE:   clssnm_skgxninit: initialized skgxn version (2/0/Veritas Cluster Server MM)
[    CSSD]2012-06-21 00:07:29.753 [1] >TRACE:   clssnmNMInitialize: misscount set to (600), impending reconfig threshold set to (596000)
[    CSSD]2012-06-21 00:07:29.753 [1] >TRACE:   clssnmNMInitialize: diskShortTimeout set to (597000)ms
[    CSSD]2012-06-21 00:07:29.753 [1] >TRACE:   clssnmNMInitialize: diskLongTimeout set to (597000)ms
[    CSSD]2012-06-21 00:07:29.754 [1] >TRACE:   clssnmDiskStateChange: state from 1 to 2 disk (0//crsdata/vote1)
[    CSSD]2012-06-21 00:07:29.755 [1] >TRACE:   clssnmDiskStateChange: state from 1 to 2 disk (1//crsdata/vote2)
[    CSSD]2012-06-21 00:07:29.755 [1286] >TRACE:   clssnmvDPT: spawned for disk 0 (/crsdata/vote1)
[    CSSD]2012-06-21 00:07:29.755 [1543] >TRACE:   clssnmvDPT: spawned for disk 1 (/crsdata/vote2)
[    CSSD]2012-06-21 00:07:29.755 [1] >TRACE:   clssnmDiskStateChange: state from 1 to 2 disk (2//crsdata/vote3)
[    CSSD]2012-06-21 00:07:29.755 [1800] >TRACE:   clssnmvDPT: spawned for disk 2 (/crsdata/vote3)
[    CSSD]2012-06-21 00:07:31.760 [1286] >TRACE:   clssnmDiskStateChange: state from 2 to 4 disk (0//crsdata/vote1)
[    CSSD]2012-06-21 00:07:31.787 [1543] >TRACE:   clssnmDiskStateChange: state from 2 to 4 disk (1//crsdata/vote2)
[    CSSD]2012-06-21 00:07:31.787 [2057] >TRACE:   clssnmvKillBlockThread: spawned for disk 0 (/crsdata/vote1) initial sleep interval (1000)ms
[    CSSD]2012-06-21 00:07:31.787 [1800] >TRACE:   clssnmDiskStateChange: state from 2 to 4 disk (2//crsdata/vote3)
[    CSSD]2012-06-21 00:07:31.787 [2314] >TRACE:   clssnmvKillBlockThread: spawned for disk 1 (/crsdata/vote2) initial sleep interval (1000)ms
[    CSSD]2012-06-21 00:07:31.787 [2571] >TRACE:   clssnmvKillBlockThread: spawned for disk 2 (/crsdata/vote3) initial sleep interval (1000)ms
[    CSSD]2012-06-21 00:07:31.789 [1543] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(4) wrtcnt(3315) LATS(3042964870) Disk lastSeqNo(3315)
[    CSSD]2012-06-21 00:07:31.795 [1] >TRACE:   clssnmFatalInit: fatal mode enabled
[    CSSD]2012-06-21 00:07:31.795 [1286] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(4) wrtcnt(3315) LATS(3042964875) Disk lastSeqNo(3315)
[    CSSD]2012-06-21 00:07:31.800 [1800] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(4) wrtcnt(3315) LATS(3042964880) Disk lastSeqNo(3315)
[    CSSD]2012-06-21 00:07:31.813 [3085] >TRACE:   clssnmClusterListener: Listening on (ADDRESS=(PROTOCOL=tcp)(HOST=node2_priv)(PORT=49895))
[    CSSD]2012-06-21 00:07:31.813 [3085] >TRACE:   clssnmClusterListener: skgxn node  (0) down
[    CSSD]2012-06-21 00:07:31.815 [3085] >TRACE:   clssnmconnect: connecting to node(1), con(111aa2c70), flags 0x0003
[    CSSD]2012-06-21 00:07:31.816 [3085] >TRACE:   clssnmConnComplete: MSGSRC 1, type 6, node 1, flags 0x0003, con 111aa2c70, probe 0
[    CSSD]2012-06-21 00:07:31.816 [3085] >TRACE:   clssnmConnComplete: connecting to node 1 con(111aa2c70), msg state 3, birth 0, probcon(0), ninfcon(111aa2c70), prevConuni(0), node state 0
[    CSSD]2012-06-21 00:07:31.816 [3085] >TRACE:   clssnmConnComplete: connected to node 1 (con 111aa2c70), ninfcon (111aa2c70), state (0), flag (1039)
[    CSSD]2012-06-21 00:07:31.869 [3342] >TRACE:   clssgmclientlsnr: listening on (ADDRESS=(PROTOCOL=ipc)(KEY=Oracle_CSS_LclLstnr_crs_2))
[    CSSD]2012-06-21 00:07:31.869 [3342] >TRACE:   clssgmclientlsnr: listening on (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node2_crs))
[    CSSD]2012-06-21 00:07:31.881 [4113] >TRACE:   clssgmPeerListener: Listening on (ADDRESS=(PROTOCOL=tcp)(DEV=26)(HOST=10.70.206.69)(PORT=32888))
[    CSSD]2012-06-21 00:07:32.201 [3085] >TRACE:   clssnmHandleSync: diskTimeout set to (597000)ms
[    CSSD]2012-06-21 00:07:32.201 [4884] >TRACE:   clssnmRcfgMgrThread: initial lastleader(1) unique(1340205072)
[    CSSD]2012-06-21 00:07:32.201 [3085] >TRACE:   clssnmHandleSync: Acknowledging sync: src[1] srcName[node1] seq[5] sync[5]
[    CSSD]2012-06-21 00:07:32.202 [3085] >TRACE:   clssnmSendVoteInfo: node(1) syncSeqNo(5)
[    CSSD]2012-06-21 00:07:32.203 [3085] >TRACE:   clssnmUpdateNodeState: node 0, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)
[    CSSD]2012-06-21 00:07:32.203 [3085] >TRACE:   clssnmUpdateNodeState: node 1, state (4/3) unique (1340205072/1340205072) prevConuni(0) birth (0/3) (old/new)
[    CSSD]2012-06-21 00:07:32.203 [3085] >TRACE:   clssnmUpdateNodeState: node 2, state (1/3) unique (1340208439/1340208439) prevConuni(0) birth (0/5) (old/new)
[    CSSD]2012-06-21 00:07:32.203 [3085] >USER:    clssnmHandleUpdate: SYNC(5) from node(1) completed
[    CSSD]2012-06-21 00:07:32.203 [3085] >USER:    clssnmHandleUpdate: NODE 1 (node1) IS ACTIVE MEMBER OF CLUSTER
[    CSSD]2012-06-21 00:07:32.203 [3085] >USER:    clssnmHandleUpdate: NODE 2 (node2) IS ACTIVE MEMBER OF CLUSTER
[    CSSD]2012-06-21 00:07:32.203 [3085] >TRACE:   clssnmHandleUpdate: diskTimeout set to (597000)ms
[    CSSD]2012-06-21 00:07:32.283 [1] >USER:    NMEVENT_SUSPEND [00][00][00][00]
[    CSSD]2012-06-21 00:07:32.284 [5141] >TRACE:   clssgmReconfigThread:  started for reconfig (5)
[    CSSD]2012-06-21 00:07:32.284 [5141] >USER:    NMEVENT_RECONFIG [00][00][00][06]
[    CSSD]2012-06-21 00:07:32.284 [5141] >TRACE:   clssgmEstablishConnections: 2 nodes in cluster incarn 5
[    CSSD]2012-06-21 00:07:32.285 [4113] >TRACE:   clssgmInitialRecv: (112d34e10) accepted a new connection from node 1 born at 3 active (2, 2), vers (10,3,1,2)
[    CSSD]2012-06-21 00:07:32.285 [4113] >TRACE:   clssgmInitialRecv: conns done (2/2)
[    CSSD]2012-06-21 00:07:32.285 [5141] >TRACE:   clssgmEstablishMasterNode: MASTER for 5 is node(1) birth(3)
[    CSSD]2012-06-21 00:07:32.285 [5141] >TRACE:   clssgmChangeMasterNode: requeued 0 RPCs
[    CSSD]2012-06-21 00:07:32.316 [4113] >TRACE:   clssgmHandleDBDone(): src/dest (1/65535) size(72) incarn 5
[    CSSD]CLSS-3000: reconfiguration successful, incarnation 5 with 2 nodes
[    CSSD]CLSS-3001: local node number 2, master node number 1
[    CSSD]2012-06-21 00:07:32.316 [5141] >TRACE:   clssgmReconfigThread:  completed for reconfig(5), with status(1)>>>>完成了群集重新配置。
从node2 alert里面也可以看到:
2012-06-21 00:07:31.784
[cssd(446818)]CRS-1605:CSSD voting file is online: /crsdata/vote1. Details in /u01/oracle/app/oracle/product/10.2.0/crs/log/node2/cssd/ocssd.log.
2012-06-21 00:07:31.787
[cssd(446818)]CRS-1605:CSSD voting file is online: /crsdata/vote2. Details in /u01/oracle/app/oracle/product/10.2.0/crs/log/node2/cssd/ocssd.log.
2012-06-21 00:07:31.787
[cssd(446818)]CRS-1605:CSSD voting file is online: /crsdata/vote3. Details in /u01/oracle/app/oracle/product/10.2.0/crs/log/node2/cssd/ocssd.log.
2012-06-21 00:07:32.316
[cssd(446818)]CRS-1601:CSSD Reconfiguration complete. Active nodes are node1 node2 .
2012-06-21 00:07:32.791
[crsd(267178)]CRS-1012:The OCR service started on node node2.
2012-06-21 00:07:32.862
[evmd(577634)]CRS-1401:EVMD started on node node2.
2012-06-21 00:07:35.325
[crsd(267178)]CRS-1201:CRSD started on node node2.
我们大概来理一下,当在node2 reboot -q的时候,node2-vip并未立即切换到node1,而是当node2启动,群集reconfigration完成后在node1启动了node2-vip,这个过程足足花费了8分钟多点。

目前这个问题还在分析中,再继续更新。
这个问题已经有个了结论,遇到了bug:7001987
看下面的描述:

An induced node panic is not being recognised by Oracle CSS/CRS until 10 
minutes after the node has gone down. The CRS reources for the failed node 
are reminaing online until CSSD has performed the node eviction itself after 
waiting for 10 minutes (default behaviour when Sun Cluster is in place) 
rather than as soon as the Sun Cluster notification regarding the panic has 
occurred
我们用clustereware是VCS,其实和上面的情况是类似的。

您可能感兴趣的与本文相关的镜像

Stable-Diffusion-3.5

Stable-Diffusion-3.5

图片生成
Stable-Diffusion

Stable Diffusion 3.5 (SD 3.5) 是由 Stability AI 推出的新一代文本到图像生成模型,相比 3.0 版本,它提升了图像质量、运行速度和硬件效率

<think> 嗯,用户这次问的是Oracle 19c RAC中第二节点VIP显示"FAILED OVER,STABLE"状态的问题。从描述看用户应该是个DBA,遇到的是生产环境的高可用性问题,语气比较急切。 我注意到用户引用了几段配置信息,其中引用[1]显示了集群节点配置,引用[2]有个时间戳的报错记录,引用[3]是补丁检查记录,引用[4]则是监听配置。这些线索都很关键。 这个"FAILED OVER,STABLE"状态很有意思。字面理解是VIP已经故障转移但处于稳定状态,实际上意味着: - VIP资源已从故障节点切换到其他节点 - 当前状态稳定但并不在预期节点 可能的原因链条需要系统性地梳理。先从网络层开始,因为70%的RAC问题都是网络引起的。用户提供的配置里看到节点1和节点2都配置了VIP,但引用[2]报错指向ora.node2.vip,结合引用[4]的监听配置,很可能存在IP冲突或路由问题。 操作系统层面需要检查三个关键点:主机名解析是否同步、网络接口状态是否正常、ARP表是否有异常条目。特别是那个"HUB"参数值得注意,在Oracle集群配置里通常表示节点类型。 集群资源层要重点看crsctl的输出,用户有提供实际状态但提到"FAILED OVER",说明资源已经触发转移。这里需要区分是VIP资源本身故障还是被节点驱逐导致的转移。 深层原因可能有几种经典情况:比如网络闪断触发节点驱逐、VIP被意外绑定在其他机器、网卡冗余配置冲突等。用户提到补丁检查(引用[3]),虽然当时通过,但可能暗示之前打过补丁,需要确认是否引入了新问题。 排查方案设计需要阶梯式推进: 第一步必须立即验证客户端连接是否受影响,这是业务最关心的。 然后从底层网络到上层资源逐层排查,特别要注意第二节点和其他节点的配置差异。 最后修复时要考虑资源回切的风险,避免业务中断。 预防措施那块其实更重要,用户问但作为专业DBA应该关注。比如建议配置持续ping监控、开启更多集群日志级别等。不过这些在首次回复里点到为止就好,用户当前最需要的是解决眼前故障。 用户引用的配置片段里有些细节值得玩味: - 监听配置用的192.168.200.21这个IP,需要确认是否与VIP在同一网段 - 报错日志里的时间戳2022-11-30与用户提问时间不符,可能是历史问题 - 补丁检查通过说明不是基础文件冲突 整体来看,用户提供的技术线索足够聚焦,应该是个经验丰富的DBA。回复时需要保持专业深度,但要把排查步骤拆解得足够清晰,方便直接执行。对于"STABLE"这种特殊状态要特别解释,避免误解为正常状态。 </think>Oracle 19c RAC 第二节点 VIP 显示 `FAILED OVER, STABLE` 表明该 VIP 资源已被**故障转移(Failed Over)**到集群中的其他节点(通常是第一节点),且转移后的状态是**稳定(STABLE)**的。这意味着 VIP 服务当前在另一个节点上正常运行,但不在其预期的第二节点上。以下是原因分析和解决方案: --- ### **主要原因分析** 1. **节点驱逐或隔离** - 第二节点可能因网络闪断、`ocssd.bin` 进程崩溃、心跳超时等问题被集群驱逐。 - 触发原因包括: - 网络问题(网卡/交换机故障、IP冲突)。 - 资源争用(CPU/内存耗尽)。 - 存储问题(ASM 磁盘组不可达)。 - 操作系统故障(内核崩溃)。 2. **VIP 资源故障** - 第二节点 VIP 配置错误(如子网掩码、网关错误)。 - ARP 冲突(同一子网存在重复 IP)。 - 网卡绑定(bonding)配置异常。 3. **集群资源管理异常** - `crsd`/`ohasd` 服务异常。 - 补丁冲突或 Bug(需检查 OPatch 记录 [^3])。 4. **监听器问题** - 监听器未能注册到 VIP(参考 `local_listener` 配置 [^4])。 --- ### **解决方案** #### **1. 检查节点状态与日志** ```bash # 检查集群状态 crsctl status cluster -all # 查看节点驱逐记录 ocrcheck -local grep -i "eviction" /u01/app/grid/diag/crs/$(hostname)/crs/trace/crsd.trc grep -i "failover" /u01/app/grid/diag/crs/$(hostname)/crs/trace/ohasd.trc ``` #### **2. 验证网络配置** - **检查 VIP 是否冲突**(在第二节点执行): ```bash arping -I bond0 192.168.200.22 # 替换为第二节点 VIP ``` 若收到多个响应,说明存在 IP 冲突。 - **验证网卡绑定**: ```bash cat /proc/net/bonding/bond0 # 检查绑定状态和从属网卡 ``` #### **3. 恢复 VIP 资源** ```bash # 尝试将 VIP 资源移回第二节点 crsctl relocate resource ora.<节点2_VIP名称>.vip -n raclhr-19c-n2 ``` 若失败,强制重启资源: ```bash crsctl stop resource ora.<节点2_VIP名称>.vip -f crsctl start resource ora.<节点2_VIP名称>.vip -n raclhr-19c-n2 ``` #### **4. 修复监听器配置** ```sql -- 确认 local_listener 配置正确(参考引用[4]) ALTER SYSTEM SET local_listener='(ADDRESS=(PROTOCOL=TCP)(HOST=节点2_VIP)(PORT=1551))' SCOPE=BOTH SID='+ASM2'; ``` #### **5. 补丁与 Bug 检查** - 检查已知 Bug(如 Doc ID 2675702.1): ```bash opatch lspatches # 列出已安装补丁 ``` - 若存在冲突补丁(引用[3]中已检查无冲突),考虑回滚或应用新补丁。 #### **6. 重启集群服务(谨慎操作)** ```bash # 在第二节点以 root 执行 crsctl stop crs -f crsctl start crs ``` --- ### **关键预防措施** 1. **网络冗余**:配置网卡绑定(bonding)和交换机堆叠。 2. **资源监控**:使用 `crsctl status res -t -w "NAME co ora.*.vip"` 实时监控 VIP 状态。 3. **定期验证**:通过 `cluvfy comp nodecon -n all` 检查集群网络健康性。 > **注意**:若故障由底层硬件(如网卡损坏)引起,需更换硬件后重新配置 VIP。 --- ### 相关问题 1. Oracle RAC 中如何诊断节点驱逐(Node Eviction)的根本原因? 2. 如何配置 Oracle RAC网络冗余(bonding/Teaming)以提高可用性? 3. Oracle 19c RAC 中常见的 VIP 故障有哪些修复方法? 4. 如何验证 Oracle 集群的补丁兼容性以避免冲突? 5. 在 RAC 环境中,监听器无法注册到 VIP 应如何排查? [^1]: 集群节点配置 [^2]: VIP 资源执行环境变量 [^3]: 补丁冲突检查日志 [^4]: 监听器配置参数
评论
成就一亿技术人!
拼手气红包6.0元
还能输入1000个字符
 
红包 添加红包
表情包 插入表情
 条评论被折叠 查看
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值