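11g RAC: CRS-5818 and CRS-2757 on servers with a direct-connected private interconnect
A customer's two-node 11g R2 RAC on Linux 6.x had to have its servers shut down for some reason. On restart, node 2 failed to boot because of a motherboard fault. Starting node 1 on its own produced the following errors: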
2017-07-28 15:48:39.632
[/u01/app/11.2.0.3/grid/bin/orarootagent.bin(5125)]CRS-5818:Aborted command 'start' for resource 'ora.cluster_interconnect.haip'. Details at (:CRSAGF00113:) {0:0:2} in /u01/app/11.2.0.3/grid/log/racnode1/agent/ohasd/orarootagent_root/orarootagent_root.log.
2017-07-28 15:48:39.658
[ohasd(4954)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.cluster_interconnect.haip'. Details at (:CRSPE00111:) {0:0:2} in /u01/app/11.2.0.3/grid/log/racnode1/ohasd/ohasd.log.
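A search of MOS turned up no article that matched this error exactly. The two closest were:
12.1.0.2 Grid Infrastructure Stack not Start due to .pid file issues (Doc ID 2028511.1)
CRSD & HAIP Resources Remain In OFFLINE as Private Network Interface is Partially Up (Doc ID 1529721.1)
Doc ID 2028511.1 is about ownership and permission problems on the .pid files, and it applies to 12c, so it does not fit this case.
Doc ID 1529721.1 does cover 11g, but it describes a bug that was already fixed in PSU 6. The customer's PSU level is already quite recent, so this article should not apply either. Reading through it, however, I found that the commands it uses were worth trying as a test.
The article states: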
The clusterware failed to start the HAIP resource because the network interface for the cluster interconnect was not active.
This can be seen from the output of 'ifconfig' for the cluster interconnect interface, which shows that the interface is missing the 'UP' and the 'RUNNING' flags:
eth2 Link encap:Ethernet HWaddr 00:21:5A:9B:02:90
inet addr:10.20.11.1 Bcast:10.20.11.255 Mask:255.255.255.0
BROADCAST MULTICAST MTU:9000 Metric:1
RX packets:1 errors:18 dropped:0 overruns:0 frame:0
TX packets:253 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:80 (80.0 b) TX bytes:33638 (32.8 KiB)
The corresponding gipcd log file (<GI home>/log/<nodename>/gipcd/gipcd.log) reports "Returning NETDATA: 0 interfaces":
2013-01-30 15:38:04.998: [ CLSINET][1101314368] Returning NETDATA: 0 interfaces ===> that is a problem
2013-01-30 15:38:04.999: [GIPCDMON][1101314368] gipcdMonitorCssCheck: found node racnode1
2013-01-30 15:38:10.000: [ CLSINET][1101314368] Returning NETDATA: 0 interfaces
2013-01-30 15:38:10.001: [GIPCDMON][1101314368] gipcdMonitorCssCheck: found node racnode1
... repeat ...
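The gipcd.log symptom quoted above can be checked mechanically. The following is a minimal sketch, assuming log lines in the same format as the excerpt; the function name `zero_interface_timestamps` and the regular expression are illustrative and are not part of any Oracle tooling.

```python
import re

# Matches gipcd.log lines shaped like the excerpt above, for example:
# 2013-01-30 15:38:04.998: [ CLSINET][1101314368] Returning NETDATA: 0 interfaces
NETDATA_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): .*"
    r"Returning NETDATA: (?P<count>\d+) interfaces"
)


def zero_interface_timestamps(log_text):
    """Return the timestamps of lines where gipcd reported 0 interfaces."""
    hits = []
    for line in log_text.splitlines():
        match = NETDATA_RE.match(line.strip())
        if match and int(match.group("count")) == 0:
            hits.append(match.group("ts"))
    return hits
```

Passing the contents of the gipcd.log file to this function returns one timestamp per "0 interfaces" report; a long, regularly spaced list suggests the interconnect interface has been unavailable the whole time.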
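Running ifconfig -a showed that the interface carrying my private IP was not RUNNING, and ethtool likewise returned no for the link. In other words, the private network interface had a problem.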
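The flag check can be sketched as a small parser. This assumes the older net-tools ifconfig layout shown in the quoted article, where the flags sit on the line that carries `MTU:`; newer ifconfig versions print `flags=...<UP,...>` instead and are not handled here. The helper names `interface_flags` and `is_usable` are hypothetical.

```python
def interface_flags(ifconfig_block):
    """Return the flag words from one old-style ifconfig interface stanza."""
    for line in ifconfig_block.splitlines():
        words = line.split()
        # The flags line is the one carrying MTU:<n>, for example
        # "UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1".
        if any(word.startswith("MTU:") for word in words):
            # Flags are the bare words; key:value pairs are skipped.
            return {word for word in words if ":" not in word}
    return set()


def is_usable(ifconfig_block):
    """True only when both the UP and RUNNING flags are present."""
    flags = interface_flags(ifconfig_block)
    return "UP" in flags and "RUNNING" in flags
```

Fed the eth2 stanza from the article, `is_usable` returns False because only BROADCAST and MULTICAST are set, which is the same condition observed on the private interface here.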
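Putting the CRS errors and the NIC state together, I guessed that this RAC's heartbeat (private interconnect) was cabled directly between the two servers.
With a direct-connected heartbeat cable, if one server fails, the clusterware on the other server cannot start: once the peer is powered off, the surviving node's private NIC has no link, so the interface is never RUNNING and the HAIP resource times out on start.
A visit to the machine room with the customer confirmed exactly that. We plugged node 1's private network into a switch (normally the VLAN should be configured on that switch first), ethtool then returned yes, and node 1 was brought up with the usual commands.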
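The before-and-after link check can be scripted in the same way. This is a minimal sketch that reads the text printed by ethtool for an interface and looks for its "Link detected" line; the function name `link_detected` is hypothetical, and collecting the output (for example by running ethtool against the private interface as root) is left to the reader.

```python
def link_detected(ethtool_output):
    """Return True/False from the 'Link detected' line, or None if absent."""
    for line in ethtool_output.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key == "Link detected":
            return value.strip() == "yes"
    return None
```

A result of False before recabling and True after the private NIC is plugged into the switch matches the sequence described above; None means the output did not contain a link status line at all.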