rac宕机分析

最新推荐文章于 2023-05-25 10:03:35 发布

转载最新推荐文章于 2023-05-25 10:03:35 发布 · 696 阅读

首先，在47分的时候实例4发起驱逐实例2的事件，实例2首先crash。
这是实例4的log
Wed Jan 11 09:47:07 GMT+08:00 2012Evicting instance 2 from cluster
Wed Jan 11 09:47:14 GMT+08:00 2012Reconfiguration started (old inc 12, new inc 16)
List of nodes:
0 2 3
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
* domain 0 valid = 0 according to instance 0
Wed Jan 11 09:47:14 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out

这是实例3的log
Wed Jan 11 09:47:07 GMT+08:00 2012Evicting instance 2 from cluster
Wed Jan 11 09:47:14 GMT+08:00 2012Reconfiguration started (old inc 12, new inc 16)
List of nodes:
0 2 3
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
* domain 0 valid = 0 according to instance 0
Wed Jan 11 09:47:14 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
这是实例1的log
Wed Jan 11 09:47:07 GMT+08:00 2012Evicting instance 2 from cluster
Wed Jan 11 09:47:14 GMT+08:00 2012Reconfiguration started (old inc 12, new inc 16)
List of nodes:
0 2 3
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
* domain 0 not valid according to instance 2
* domain 0 not valid according to instance 3
Wed Jan 11 09:47:14 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out

这是实例2的log
Wed Jan 11 09:47:11 GMT+08:00 2012Received an instance abort message from instance 4 (reason 0x1)
Wed Jan 11 09:47:11 GMT+08:00 2012Received an instance abort message from instance 4 (reason 0x1)
Wed Jan 11 09:47:11 GMT+08:00 2012Please check instance 4 alert and LMON trace files for detail.
Wed Jan 11 09:47:11 GMT+08:00 2012Please check instance 4 alert and LMON trace files for detail.
Wed Jan 11 09:47:11 GMT+08:00 2012LMD0: terminating instance due to error 481
Instance terminated by LMD0, pid = 4325382
Wed Jan 11 09:51:33 GMT+08:00 2012Starting ORACLE instance (normal)

然后实例2在51分开始启动,52分完成。
Wed Jan 11 09:52:40 GMT+08:00 2012Reconfiguration started (old inc 20, new inc 22)
List of nodes:
0 1 2 3
Global Resource Directory frozen
Communication channels reestablished
* domain 0 valid = 1 according to instance 0
Wed Jan 11 09:52:40 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out

实例1在51分的时候因为自己的asm异常而导致crash，具体log如下
Wed Jan 11 09:51:13 GMT+08:00 2012Errors in file /u01/app/oracle/admin/zhpt/bdump/zhpt1_asmb_6947490.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Wed Jan 11 09:51:21 GMT+08:00 2012ASMB: terminating instance due to error 15064
Instance terminated by ASMB, pid = 6947490

在实例4上可以看到当时2没有完全起来，而1已经crash的状态
Wed Jan 11 09:48:46 GMT+08:00 2012 Submitted all GCS remote-cache requests
Fix write in gcs resources
Wed Jan 11 09:51:27 GMT+08:00 2012Reconfiguration started (old inc 16, new inc 18)
List of nodes:
2 3
Nested reconfiguration detected.
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
* domain 0 valid = 0 according to instance 2
从上面还可以看出1的crash在crs中没有信息，说明它的crash和crs无关，而2的crash是crs发起的，日志是不一样的。

实例3的日志
Wed Jan 11 09:49:12 GMT+08:00 2012Completed redo scan
196588 redo blocks read, 29797 data blocks need recovery
Fix write in gcs resources
Wed Jan 11 09:51:27 GMT+08:00 2012Reconfiguration started (old inc 16, new inc 18)
List of nodes:
2 3
Nested reconfiguration detected.
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
* domain 0 not valid according to instance 3
Wed Jan 11 09:51:27 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out

实例2的日志没有同样的信息是因为它当时正在open中。

下面是实例4在2和1先后好了以后记录的log

Wed Jan 11 09:51:29 GMT+08:00 2012 Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Wed Jan 11 09:51:43 GMT+08:00 2012Reconfiguration started (old inc 18, new inc 20)
List of nodes:
1 2 3
Global Resource Directory frozen
Communication channels reestablished
* domain 0 valid = 0 according to instance 2
Wed Jan 11 09:51:43 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out

Wed Jan 11 09:51:45 GMT+08:00 2012 Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Wed Jan 11 09:52:40 GMT+08:00 2012Reconfiguration started (old inc 20, new inc 22)
List of nodes:
0 1 2 3
Global Resource Directory frozen
Communication channels reestablished
* domain 0 valid = 1 according to instance 1
Wed Jan 11 09:52:40 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out

下面是实例3在2和1先后好了以后记录的log

Wed Jan 11 09:51:32 GMT+08:00 2012Completed redo scan
390680 redo blocks read, 58663 data blocks need recovery
Wed Jan 11 09:51:43 GMT+08:00 2012Reconfiguration started (old inc 18, new inc 20)
List of nodes:
1 2 3
Global Resource Directory frozen
Communication channels reestablished
* domain 0 valid = 0 according to instance 1
Wed Jan 11 09:51:43 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out

Wed Jan 11 09:52:02 GMT+08:00 2012Completed redo application
Wed Jan 11 09:52:11 GMT+08:00 2012Completed instance recovery at
Thread 1: logseq 10747, block 1136082, scn 12622806928332
Thread 2: logseq 8082, block 1925780, scn 12622806926550
58633 data blocks read, 58725 data blocks written, 390680 redo blocks read
Wed Jan 11 09:52:40 GMT+08:00 2012Reconfiguration started (old inc 20, new inc 22)
List of nodes:
0 1 2 3
Global Resource Directory frozen
Communication channels reestablished
* domain 0 valid = 1 according to instance 1
Wed Jan 11 09:52:40 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Wed Jan 11 09:52:40 GMT+08:00 2012 LMS 3: 0 GCS shadows cancelled, 0 closed

实例2crash的原因
在实例2上，有一条日志
*** 2012-01-11 09:45:21.918
kjxggpoll: change poll time to 50 ms

然后各个节点开始

实例1crash的原因

database上的alert里面提示看asm的log

Wed Jan 11 09:51:13 GMT+08:00 2012Errors in file /u01/app/oracle/admin/zhpt/bdump/zhpt1_asmb_6947490.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Wed Jan 11 09:51:21 GMT+08:00 2012ASMB: terminating instance due to error 15064
Instance terminated by ASMB, pid = 6947490

database的zhpt1_asmb_6947490.trc里面提示
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
ORACLE_HOME = /u01/app/oracle/product/10.2.0/db_1
System name:    AIX
Node name:      jtzhptdb1
Release:        1
Version:        6
Machine:        00F6EEBB4C00
Instance name: zhpt1
Redo thread mounted by this instance: 1
Oracle process number: 43
Unix process pid: 6947490, image: oracle@jtzhptdb1 (ASMB)

*** 2012-01-11 09:51:13.858
*** SERVICE NAME:(SYS$BACKGROUND) 2012-01-11 09:51:10.117
*** SESSION ID:(1598.1) 2012-01-11 09:51:10.117
error 15064 detected in background process
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
ksuitm: waiting up to [5] seconds before killing DIAG(17957082)

asm的log里面提示

Wed Jan 11 09:50:55 GMT+08:00 2012LMD0 (ospid: 8192988) is not heartbeating for 216 seconds.
LMON detects unhealthy receivers.
Please check LMON and DIAG trace files for detail.
Wed Jan 11 09:51:00 GMT+08:00 2012LMON (ospid: 8520498) is terminating the instance.
LMON: terminating instance due to error 481
Termination issued to instance processes. Waiting for the processes to exit
Wed Jan 11 09:51:14 GMT+08:00 2012Instance termination failed to kill one or more processes
Instance terminated by LMON, pid = 8520498
Wed Jan 11 09:52:10 GMT+08:00 2012Starting ORACLE instance (normal)

lmon的log
Repeat 3 times
*** 2012-01-11 09:50:55.139
kjfmrcvrchk: receiver LMD[0] has no heartbeat for 216 sec (1326246437.1326246653.0).
kjfmrcvrchk: Dumping callstack of lmd0
Submitting asynchronized dump request [20]
kjfmrcvrchk: receivers are not healthy. kill instance.
ksuitm: waiting up to [5] seconds before killing DIAG(8258490)
*** 2012-01-11 09:51:14.825
Instance termination failed to kill one or more processes
ksuitm_check: OS PID=10616892 is still alive
*** 2012-01-11 09:51:14.825
Process diagnostic dump for oracle@jtzhptdb1 (LCK0), OS id=10616892,
pid: 15, proc_ser: 1, sid: 432, sess_ser: 1
-------------------------------------------------------------------------------
loadavg : 2.12 2.51 3.42
swap info: free_mem = 6406.91M rsv = 192.00M
alloc = 15371.82M avail = 49152.00M swap_free = 33780.18M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
Short stack dump: ORA-00072: process "Unix process pid: 10616892, image: oracle@jtzhptdb1 (LCK0)" is not active

lmod的log
*** 2012-01-11 09:49:01.111
KJM_HISTORY: RCVR STALL OP(12) context 8 elapsed 99907602 us
KJM HIST LMD0:
12:8:99907602 7:0 6:1 10:28 11:1 15:0 12:78144 7:0 6:1 10:31
11:0 15:0 12:78149 7:1 6:1 10:22 2:33 1:0 11:1 15:0
12:78147 7:1 6:0 10:37 11:1 15:1 12:78148 7:2 6:0 10:41
11:1 15:1 12:78154 7:1 6:0 10:38 11:0 15:1 12:78148 7:1
6:0 10:34 11:0 15:1 12:78150 7:2 6:0 10:62 11:2 15:1
12:78161 7:0 6:0 10:30 11:0 15:0 12:78149 7:1 6:0 10:76
11:1 15:2 12:78164 7:0
----------------------------------------
SO: 70000001709cab0, type: 4, owner: 70000001754ca80, flag: INIT/-/-/0x00
(session) sid: 441 trans: 0, creator: 70000001754ca80, flag: (51) USR/- BSY/-/-/-/-/-
            DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
            txn branch: 0
            oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
last wait for 'ges remote message' wait_time=1 min 17 sec, seconds since wait started=2616
          waittime=40, loop=0, p3=0
          blocking sess=0x0 seq=11868
Dumping Session Wait History
   for 'ges remote message' count=1 wait_time=1 min 17 sec
          waittime=40, loop=0, p3=0
   for 'ges remote message' count=1 wait_time=0.078142 sec
          waittime=40, loop=0, p3=0
   for 'ges remote message' count=1 wait_time=0.078146 sec
          waittime=40, loop=0, p3=0
   for 'ges remote message' count=1 wait_time=0.078143 sec
          waittime=40, loop=0, p3=0
   for 'ges remote message' count=1 wait_time=0.078145 sec
          waittime=40, loop=0, p3=0
   for 'ges remote message' count=1 wait_time=0.078149 sec
          waittime=40, loop=0, p3=0
   for 'ges remote message' count=1 wait_time=0.078145 sec
          waittime=40, loop=0, p3=0
   for 'ges remote message' count=1 wait_time=0.078147 sec
          waittime=40, loop=0, p3=0
   for 'ges remote message' count=1 wait_time=0.078152 sec
          waittime=40, loop=0, p3=0
   for 'ges remote message' count=1 wait_time=0.078147 sec
          waittime=40, loop=0, p3=0