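This morning the business staff came over to say that none of the applications could connect to the database. A quick check showed that the VIP, Listener, and Service resources on both nodes were all offline. I then went through the relevant logs (both nodes showed the same thing):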
CRSD.log
2013-11-22 01:54:49.446: [ CRSAPP][11190]32CheckResource error for ora.yyjcdb1.vip error code = 1
2013-11-22 01:54:49.473: [ CRSRES][11190]32In stateChanged, ora.yyjcdb1.vip target is ONLINE
2013-11-22 01:54:49.473: [ CRSRES][11190]32ora.yyjcdb1.vip on yyjcdb1 went OFFLINE unexpectedly
【The VIP goes down first】
2013-11-22 01:54:49.478: [ CRSRES][11190]32StopResource: setting CLI values
2013-11-22 01:54:49.504: [ CRSRES][11190]32Attempting to stop `ora.yyjcdb1.vip` on member `yyjcdb1`
2013-11-22 01:54:50.100: [ CRSRES][11452]32In stateChanged, ora.YYJCDB.YYJCDB_SRV.YYJCDB1.srv target is ONLINE
2013-11-22 01:54:50.100: [ CRSRES][11452]32ora.YYJCDB.YYJCDB_SRV.YYJCDB1.srv on yyjcdb1 went OFFLINE unexpectedly 【The service goes down next】
2013-11-22 01:54:50.101: [ CRSRES][11452]32StopResource: setting CLI values
2013-11-22 01:54:50.134: [ CRSRES][11452]32Attempting to stop `ora.YYJCDB.YYJCDB_SRV.YYJCDB1.srv` on member `yyjcdb1`
2013-11-22 01:54:50.234: [ CRSRES][11190]32Stop of `ora.yyjcdb1.vip` on member `yyjcdb1` succeeded.
2013-11-22 01:54:50.238: [ CRSRES][11190]32ora.yyjcdb1.vip RESTART_COUNT=0 RESTART_ATTEMPTS=0
2013-11-22 01:54:50.258: [ CRSRES][11190]32ora.yyjcdb1.vip failed on yyjcdb1 relocating.
2013-11-22 01:54:50.319: [ CRSRES][11190]32StopResource: setting CLI values
2013-11-22 01:54:50.341: [ CRSRES][11190]32Attempting to stop `ora.yyjcdb1.LISTENER_YYJCDB1.lsnr` on member `yyjcdb1` 【The listener is then taken down】
2013-11-22 01:54:50.522: [ CRSRES][11452]32Stop of `ora.YYJCDB.YYJCDB_SRV.YYJCDB1.srv` on member `yyjcdb1` succeeded.
2013-11-22 01:54:50.523: [ CRSRES][11452]32ora.YYJCDB.YYJCDB_SRV.YYJCDB1.srv RESTART_COUNT=0 RESTART_ATTEMPTS=0
2013-11-22 01:54:50.531: [ CRSRES][11452]32ora.YYJCDB.YYJCDB_SRV.YYJCDB1.srv failed on yyjcdb1 relocating.
2013-11-22 01:54:50.586: [ CRSRES][11452]32Cannot relocate ora.YYJCDB.YYJCDB_SRV.YYJCDB1.srv Stopping dependents
2013-11-22 01:54:50.597: [ CRSRES][11452]32StopResource: setting CLI values
2013-11-22 01:56:06.820: [ CRSRES][11190]32Stop of `ora.yyjcdb1.LISTENER_YYJCDB1.lsnr` on member `yyjcdb1` succeeded.
2013-11-22 01:56:06.881: [ CRSRES][11190]32Attempting to start `ora.yyjcdb1.vip` on member `yyjcdb2`
2013-11-22 01:56:08.580: [ CRSRES][11190]32Start of `ora.yyjcdb1.vip` on member `yyjcdb2` succeeded.
2013-11-22 01:56:12.938: [ CRSRES][11024]32startRunnable: setting CLI values
When the VIP goes down, the service and the listener go down with it, and so can any other resource that depends on the VIP.
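The order in which resources failed can be pulled out of crsd.log mechanically instead of by eye. The following is a minimal sketch, assuming the line layout seen in the excerpt above; the function name `offline_events` and the regular expression are my own illustration, not part of any Oracle tool. The pattern tolerates missing spaces, which pasted log excerpts often suffer from.

```python
import re

# Matches lines such as
#   "... [ CRSRES][11190]32ora.yyjcdb1.vip on yyjcdb1 went OFFLINE unexpectedly"
# \s* (rather than \s+) around "went" tolerates spaces lost when logs are pasted.
OFFLINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}):"
    r".*?\]\d*(?P<res>ora\.\S+?)\s+on\s+(?P<node>\S+?)\s*went\s*OFFLINE unexpectedly"
)


def offline_events(lines):
    """Return (timestamp, resource, node) for every unexpected OFFLINE event, in log order."""
    events = []
    for line in lines:
        m = OFFLINE_RE.search(line)
        if m:
            events.append((m.group("ts"), m.group("res"), m.group("node")))
    return events


# Sample lines copied from the crsd.log excerpt in this post (spacing as pasted).
sample = [
    "2013-11-22 01:54:49.473: [ CRSRES][11190]32ora.yyjcdb1.vip on yyjcdb1went OFFLINE unexpectedly",
    "2013-11-22 01:54:49.478: [ CRSRES][11190]32StopResource: setting CLIvalues",
    "2013-11-22 01:54:50.100: [ CRSRES][11452]32ora.YYJCDB.YYJCDB_SRV.YYJCDB1.srv on yyjcdb1 wentOFFLINE unexpectedly",
]

for ts, res, node in offline_events(sample):
    print(ts, res, node)
```

Run against the full log of each node, this makes it easy to confirm that the VIP is the first resource to fail and that the service follows it.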
Ora.rac01.vip.log:
2013-11-22 01:54:49.443: [ RACG][1] [8650936][1][ora.yyjcdb1.vip]:Invalid parameters, or failed to bring up VIP (host=yyjcdb1)
2013-11-22 01:54:49.443: [ RACG][1] [8650936][1][ora.yyjcdb1.vip]:clsrcexecut: env ORACLE_CONFIG_HOME=/oracle/product/10.2.0/crs
2013-11-22 01:54:49.443: [ RACG][1] [8650936][1][ora.yyjcdb1.vip]:clsrcexecut: cmd = /oracle/product/10.2.0/crs/bin/racgeut -e _USR_ORA_DEBUG=054 /oracle/product/10.2.0/crs/bin/racgvip check yyjcdb1
2013-11-22 01:54:49.443: [ RACG][1] [8650936][1][ora.yyjcdb1.vip]: clsrcexecut: rc = 1, time = 4.305s
2013-11-22 01:54:49.443: [ RACG][1] [8650936][1][ora.yyjcdb1.vip]: end for resource = ora.yyjcdb1.vip, action = check, status = 1, time = 4.484s
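The VIP log contains nothing beyond the error "Invalid parameters, or failed to bring up VIP".
Log entries like these show up when the gateway is misconfigured. In this case, however, pinging the gateway revealed no problem with its configuration.
Relationship between the gateway and the VIP: http://blog.youkuaiyun.com/ballontt/article/details/17525577
I then found the following bug (6608472) on MOS: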
Oracle Server - Enterprise Edition - Version: 10.2.0.3 and later [Release: 10.2 and later]
IBM AIX on POWER Systems (64-bit)
IBM AIX Based Systems (64-bit)
Symptoms
If the Etherchannel and VLAN devices are used for the public network, racgvip script may fail and cause the VIP to offline as a result.
Cause
Currently, the racgvip script issues the following command to check if the public network is up:
$ENTSTAT -d $_IF | $GREP -iEq '.*lan.*state.*:.*operational.*|.*link.*status.*:.*up.*'
However, on some of the new AIX network devices, the output from the following command is different, so the above grep check on the entstat output fails:
entstat -d <interface name>
Bug 6608472 addresses this problem.
The new fix is in 10.2.0.3 patch 6851901 (MLR #16).
Although the bug 6608472 does not say the fix is in the patch 6851901, this patch (6851901) has the new racgvip that now issues the following to check the health of the public network:
$ENTSTAT -d $_IF | $GREP -iEq '.*lan.*state.*:.*operational.*|.*link.*status.*:.*up.*|.*port.*operational.*state.*:.*up.*'
The fix for the bug 6608472 is also included in 10.2.0.4
Solution
To resolve this issue, apply patch 6851901.
Again, this patch provides the new racgvip script that can handle the new / current AIX interface types.
The fix for the bug 6608472 is also in 10.2.0.4, so upgrading the CRS to 10.2.0.4 will resolve the problem produced by the bug 6608472.
References
BUG:6608472 - RACGVIP IN RAC FOR AIX FAILS EVEN THOUGH THE PUBLIC INTERFACE IS UP
If the Etherchannel and VLAN devices are used for the public network, racgvip script may fail and cause the VIP to offline as a result.
Key point: If the Etherchannel and VLAN devices are used for the public network, racgvip script may fail and cause the VIP to offline as a result.
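To see why the extra alternative in the patched script matters, the two grep expressions quoted in the MOS note can be compared side by side. This is a rough sketch only: the patterns are copied from the note, but the helper `link_is_up` and both sample status lines are hypothetical, invented for illustration. Real `entstat -d` output varies by adapter type and AIX level and should be checked on the actual host.

```python
import re

# Patterns copied from the MOS note: the old racgvip check and the patched one.
OLD_PATTERN = r".*lan.*state.*:.*operational.*|.*link.*status.*:.*up.*"
NEW_PATTERN = OLD_PATTERN + r"|.*port.*operational.*state.*:.*up.*"


def link_is_up(entstat_output, pattern):
    """Approximate `grep -iEq`: True if any single line matches the pattern."""
    rx = re.compile(pattern, re.IGNORECASE)
    return any(rx.search(line) for line in entstat_output.splitlines())


# HYPOTHETICAL status lines, not captured from a real system.
classic_adapter = "Link Status : Up"
newer_adapter = "Port Operational State: Up"

print(link_is_up(classic_adapter, OLD_PATTERN))  # old check passes on this wording
print(link_is_up(newer_adapter, OLD_PATTERN))    # old check misses this wording
print(link_is_up(newer_adapter, NEW_PATTERN))    # patched check accepts it
```

If the interface reports its state in a wording that none of the alternatives cover, the check fails even though the link is healthy, which is the failure mode the bug describes.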
Check the device type of the network adapter used by the public network:
ballontt@RAC01:/home/oracle$ entstat -d en4
-------------------------------------------------------------
ETHERNET STATISTICS (en4) :
Device Type: EtherChannel
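It is indeed an EtherChannel device. This looks a lot like Bug 6608472, but MOS says the bug was fixed in 10.2.0.4, and our version is 10.2.0.5. The log information above is therefore not enough to pin down the cause of the failure.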
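When several interfaces or nodes need the same check, the device type can be extracted from saved `entstat -d` output with a few lines of code. A minimal sketch, assuming the "Device Type:" line looks like the one shown above; `device_type` is a hypothetical helper name, and it simply returns the first such line it finds.

```python
import re


def device_type(entstat_output):
    """Return the value of the first 'Device Type:' line, or None if absent."""
    for line in entstat_output.splitlines():
        # \s* between the words tolerates output pasted as "DeviceType:".
        m = re.match(r"\s*Device\s*Type\s*:\s*(.+?)\s*$", line, re.IGNORECASE)
        if m:
            return m.group(1)
    return None


# Sample text taken from the entstat excerpt in this post.
sample = """ETHERNET STATISTICS (en4) :
Device Type: EtherChannel
"""

print(device_type(sample))
```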
To narrow it down further, the only option is to enable debug, so that the next time the problem occurs more detailed VIP information is recorded under the racg log directory.
Enable debug:
crsctl debug log res "ora.rac01.vip:1"
(1 is the logging level)
Disable debug:
crsctl debug log res "ora.rac01.vip:0"
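Follow-up
At 00:25 this morning, both nodes of the production database hit the same errors again at the same moment, and the application connections were dropped as well. The alert log shows the sessions being terminated. At the same time, network monitoring recorded a momentary network interruption. So one cause of this error is a problem on the network.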