一般来讲,RAC安装的grid、oracle用户都是本地(/etc/group, /etc/passwd)管理的方式。但是也有一些环境使用了外部管理的方式,比如ldap,或者vas。 下面介绍一个由于ldap性能问题导致的RAC实例终止的情况:
1. 首先现象是:LMHB因为错误29770 而终止实例:
Wed Sep 20 23:08:34 2017
LMON (ospid: 5008) waits for event 'CSS operation: data query' for 8 secs.
Wed Sep 20 23:08:38 2017
Thread 1 advanced to log sequence 66354 (LGWR switch)
Current log# 4 seq# 66354 mem# 0: +LOG1/taorac09/onlinelog/group_4.260.864051729
Current log# 4 seq# 66354 mem# 1: +LOG2/taorac09/onlinelog/group_4.260.864051743
Errors in file /oracle/diag/rdbms/taorac09/r09ta1/trace/r09ta1_lmhb_5027.trc (incident=480165):
ORA-29770: global enqueue process LMON (OSID 5008) is hung for more than 70 seconds
Incident details in: /oracle/diag/rdbms/taorac09/r09ta1/incident/incdir_480165/r09ta1_lmhb_5027_i480165.trc
ERROR: Some process(s) is not making progress.
LMHB (ospid: 5027) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for anomalous behavior
ERROR: Some process(s) is not making progress.
Wed Sep 20 23:08:39 2017
System state dump requested by (instance=1, osid=5027 (LMHB)), summary=[abnormal instance termination].
System State dumped to trace file /oracle/diag/rdbms/taorac09/r09ta1/trace/r09ta1_diag_4996.trc
LMHB (ospid: 5027): terminating the instance due to error 29770 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2. 从systemstate dump中可以看到,LMON一直在等待"CSS operation: data query" 并且耗费了大量的时间:
=== LMON (ospid: 5008) Heartbeat Report
==================================================
LMON (ospid: 5008) has no heartbeats for 99 sec. (threshold 70 sec) <<<<<<<<<<<<<<<<<<
: waiting for event 'CSS operation: data query' for 13 secs with wait_id 173783400.
===[ Wait Chain ]===
Wait chain is empty.
==============================
Dumping PROCESS LMON (ospid: 5008) States
==============================
===[ System Load State ]===
CPU Total 32 Core 16 Socket 2
Load normal: Cur 4881 Highmark 40960 (19.06 160.00)
===[ Latch State ]===
Not in Latch Get
===[ Session State Object ]===
----------------------------------------
SO: 0x1fc1583260, type: 4, owner: 0x1fd125efe0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x1fd125efe0, name=session, file=ksu.h LINE:12624, pg=0
(session) sid: 1046 ser: 1 trans: (nil), creator: 0x1fd125efe0
flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
flags2: (0x409) -/-/INC
DID: , short-term DID:
txn branch: (nil)
oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYS
ksuxds FALSE at location: 0
service name: SYS$BACKGROUND
Current Wait Stack:
0: waiting for 'CSS operation: data query'
function_id=0x3, =0x0, =0x0
wait_id=173783400 seq_num=23067 snap_id=1
wait times: snap=13.550468 sec, exc=13.550468 sec, total=13.550468 sec <<<<<<<<<<<<<<<<<<
wait times: max=infinite, heur=1 min 15 sec
wait counts: calls=0 os=0
in_wait=1 iflags=0x5a0
Wait State:
fixed_waits=0 flags=0x22 boundary=(nil)/-1
Session Wait History:
elapsed time of 0.000004 sec since current wait
0: waited for 'CSS operation: data query'
function_id=0x3, =0x0, =0x0
wait_id=173783399 seq_num=23066 snap_id=1
wait times: snap=15.610047 sec, exc=15.610047 sec, total=15.610047 sec <<<<<<<<<<<<<
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000005 sec of elapsed time
1: waited for 'CSS operation: data query'
function_id=0x3, =0x0, =0x0
wait_id=173783398 seq_num=23065 snap_id=1
wait times: snap=15.606784 sec, exc=15.606784 sec, total=15.606784 sec <<<<<<<<<<<<<<
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000004 sec of elapsed time
2: waited for 'CSS operation: data query'
function_id=0x3, =0x0, =0x0
wait_id=173783397 seq_num=23064 snap_id=1
wait times: snap=15.602135 sec, exc=15.602135 sec, total=15.602135 sec <<<<<<<<<<<<<<<
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000004 sec of elapsed time
3: waited for 'CSS operation: data query'
function_id=0x3, =0x0, =0x0
wait_id=173783396 seq_num=23063 snap_id=1
wait times: snap=15.602871 sec, exc=15.602871 sec, total=15.602871 sec <<<<<<<<<<<<<<<
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000006 sec of elapsed time
3. 同时也观察到了一些其他进程的报错:"ORA-29701: unable to connect to Cluster Synchronization Service"
Wed Sep 20 23:05:18 2017
Errors in file /oracle/diag/rdbms/taorac09/r09ta1/trace/r09ta1_ora_10843.trc:
ORA-01114: IO error writing block to file (block # )
ORA-01114: IO error writing block to file 20001 (block # 722305)
ORA-29701: unable to connect to Cluster Synchronization Service
ORA-29701: unable to connect to Cluster Synchronization Service
Wed Sep 20 23:05:30 2017
LMON (ospid: 5008) waits for event 'CSS operation: data query' for 11 secs.
Wed Sep 20 23:05:34 2017
Errors in file /oracle/diag/rdbms/taorac09/r09ta1/trace/r09ta1_ora_846.trc:
ORA-01114: IO error writing block to file (block # )
ORA-01114: IO error writing block to file 20002 (block # 78977)
ORA-29701: unable to connect to Cluster Synchronization Service
ORA-29701: unable to connect to Cluster Synchronization Service
4. 从这些进程的trace中,可以看到
2017-09-20 23:08:26.233: [ GIPCMUX] gipcmodMuxCallbackRecv: EXCEPTION[ ret gipcretAuthFail (22) ] error during recv on endp 0x196b5570
2017-09-20 23:08:26.233: [GIPCXCPT] gipcmodMuxCallbackRecv: internal receive request failed req 0x196d6f30 [0000000000000035] { gipcReceiveRequest : peerName '', data (nil), len 0, olen 0, off 0, parentEndp 0x196b73d0, ret gipcretAuthFail (22), objFlags 0x0, reqFlags 0x4 }, ret gipcretAuthFail (22) <<<<<<<<看起来好像是和Authentication有关系
5.CSSD Log也发现了同样的错误:
2017-09-20 22:55:35.755: [GIPCXCPT][3824294208] gipcmodClsaAuthStart: failuring during clsaauthmsg ret clsaretPROTERR (4), endp 0x2ac1e4392650 [000000001fa9c9f9] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_talracsdw1a_)(GIPCID=fc049e15-66ed076f-28437))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_talracsdw1a_)(GIPCID=66ed076f-fc049e15-15531))', numPend 4, numReady 1, numDone 3, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 15531, flags 0x603710, usrFlags 0x14000 }
2017-09-20 22:55:35.755: [GIPCXCPT][3824294208] gipcmodMuxTransferAccept: internal accept request failed endp 0x1af74a40, child 0x2ac1e4392650, ret gipcretAuthFail (22)
2017-09-20 22:55:35.755: [ GIPCMUX][3824294208] gipcmodMuxTransferAccept: EXCEPTION[ ret gipcretAuthFail (22) ] error during accept on endp 0x1af74a40 <<<<<<<<<<<<
6. 而从/etc/nsswitch.conf可以看到使用了ldap,同时在/etc/group, /etc/passwd中又没有grid、os用户的定义:
group: files ldap <<<<<<
passwd: compat
shadow: compat
所以说,在用户或者组使用外部验证方式的时候(VAS、LDAP), 我们需要保证这些服务的性能。而如果性能不能达到要求,那么最好还是建议把这些用户或者组迁移到本地(/etc/group, /etc/passwd)