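Last Saturday at midnight, just as I was about to go to bed, the phone rang. A call at that hour is never good news. I didn't recognize the number, and when I picked up it turned out to be an HP engineer we had contracted, who was on site at a customer's handling a failure. The customer runs a two-node RAC on two HP midrange servers. Because of something done on the customer's side, the second node could no longer enter multi-user mode; most likely someone had been careless in the system and deleted some operating system files, so the machine could only boot into maintenance mode. The second node therefore had to be reinstalled. The HP engineer cloned the system from the other node onto the second node, then changed the IP address, host name and so on. Once Service Guard was configured, HA came up, but starting CRS on the second node failed with an scls_scr_create error ("Failure at scls_scr_create with code 1", "Unable to make user dir").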
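After a long struggle with no progress, I thought about rebooting so that the system would bring CRS up by itself. But after talking with the HP engineer I learned that on this host CRS is started by hand after boot, so a reboot would have been pointless. On Unix and Linux the CRS start/stop scripts live in the init.d directory. I am not very familiar with HP-UX and had to ask to find out that on HP-UX this directory is /sbin/init.d, not /etc/init.d. From that directory, CRS is started with the ./init.crs script, used as follows: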
# ./init.crs xxx <-- pass any bogus argument to make it print the usage
Usage: ./init.crs {stop|start|enable|disable}
# ./init.crs start
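Since the init.d location differs between platforms, a small helper can look for init.crs in the two directories mentioned above. This is only a sketch: the function name find_init_crs and its optional root-prefix argument are hypothetical names of mine, used here so the logic can be tried against a scratch directory.

```shell
#!/bin/sh
# Sketch: locate the init.crs control script in the two init.d locations
# named in the text. The optional first argument is a root prefix
# (hypothetical, only there so the function can be tried on a scratch tree).
find_init_crs() {
    root=${1:-}
    for dir in /sbin/init.d /etc/init.d; do
        if [ -f "$root$dir/init.crs" ]; then
            printf '%s\n' "$root$dir/init.crs"
            return 0
        fi
    done
    # Nothing found in either location.
    return 1
}
```

Called without arguments, as in `find_init_crs || echo "init.crs not found"`, it checks the real /sbin/init.d and /etc/init.d.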
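This time the error message was actually useful: the log shows that init.cssd cannot create the file /var/opt/oracle/scls_scr/rqtmsdb2/root/cssrun.
Checking: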
# cd /var/opt/oracle/scls_scr/rqtmsdb2/root/
sh: /var/opt/oracle/scls_scr/rqtmsdb2/root/: not found.
Huh, the directory does not exist!
# cd /var/opt/oracle/scls_scr/
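One look at ls -l made it obvious: the only directory there was rqtmsdb1.
Because this system was cloned from the first node, the directory that should be named rqtmsdb2 still carried the name rqtmsdb1. No wonder!
The fix was to rename it with mv rqtmsdb1 rqtmsdb2.
Then I started CRS again.
This time it started normally!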
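The check I did by hand can be sketched as a small function that compares the per-node directory under scls_scr with the expected node name. It is an illustration only: check_scls_dir is a hypothetical name, and the caller supplies both the base directory and the node name.

```shell
#!/bin/sh
# Sketch: report whether <base>/<node> exists, and if not, list the
# directories that are there instead (for example one left over from the
# node the system was cloned from).
check_scls_dir() {
    base=$1
    node=$2
    if [ -d "$base/$node" ]; then
        echo "OK: $base/$node exists"
        return 0
    fi
    echo "MISSING: $base/$node"
    for d in "$base"/*; do
        # ${d##*/} strips the leading path and keeps the directory name.
        [ -d "$d" ] && echo "found instead: ${d##*/}"
    done
    return 1
}
```

On the cloned node, `check_scls_dir /var/opt/oracle/scls_scr rqtmsdb2` would have reported the stale rqtmsdb1 directory straight away.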
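Then I went back to check the first node. The HP engineer had told me that nothing had been touched on this node, and I believed him; after all, cloning a system should not require any change on the source node. Reality, however, was cruel!
I typed the commands: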
# cd /sbin/init.d
#
# ./init.crs start
Startup will be queued to init within 30 seconds.
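The d.bin processes never showed up and nothing happened, so I went back to the operating system log. It contained messages saying that Cluster Ready Services was waiting on dependencies, each pointing to a diagnostics file under /tmp. One of those files said that the listening endpoint could not be bound on the private IP (HOST=rqtmsdb1-priv). I checked /etc/hosts and found no private IP entry for this node, only the second node's private IP. After comparing with the second node's /etc/hosts, I added the first node's private IP: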
192.168.0.1 rqtmsdb1-priv
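Not checking /etc/hosts at the very beginning was a real mistake! Whatever you are told, verify it yourself! Once again /etc/hosts tripped me up in a RAC environment!!! Earlier, while configuring RAC at another customer site, an engineer had removed the system's default localhost entry, and it cost me a whole day to find out that the missing localhost entry was the cause!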
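To avoid falling into the same hole again, the /etc/hosts comparison can be turned into a quick check. The following is a sketch under my own assumptions: missing_hosts is a hypothetical helper, and the list of names to check has to be supplied by whoever runs it.

```shell
#!/bin/sh
# Sketch: print every required host name that has no entry in a hosts file.
# Usage: missing_hosts <hosts-file> <name>...
missing_hosts() {
    file=$1
    shift
    for name in "$@"; do
        # Strip comments, skip the address column, then look for an exact
        # match among the names and aliases on each line.
        if ! sed 's/#.*//' "$file" | awk -v n="$name" '
                { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                END { exit !found }'; then
            printf '%s\n' "$name"
        fi
    done
}
```

For example, `missing_hosts /etc/hosts localhost rqtmsdb1-priv` prints nothing when both entries are present, and prints the missing name otherwise.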
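I started CRS again, and this time it came up normally! I thought everything was fine and I could go to bed, but I had not expected the VIP to cause trouble next.
Starting the cluster with crs_start -all reported that it could not start: the VIP would not come up, and everything after it failed. This error is easy to deal with, since I had solved it before. First I enabled debugging for the VIP resource (crsctl debug log res "ora.rqtmsdb1.vip:5"),
then started the node applications, VIP included, on their own (srvctl start nodeapps -n rqtmsdb1).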
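The debug output showed that no default gateway was configured. Checking the IP configuration, I then found that the IP address was configured on lan2. Only when I asked did I learn that lan0 had been failing often, so this time they had moved to lan2. Why not say so earlier, damn it!!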
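Because the VIP check gives up with "Default gateway is not defined", it is worth confirming that a default route exists before starting the VIP. A minimal sketch, assuming routing-table text in the style of netstat -rn, where the default route's first column is "default" or "0.0.0.0" and the gateway is in the second column; default_gw is a hypothetical name:

```shell
#!/bin/sh
# Sketch: read routing-table text on stdin and print the default gateway.
# Exits non-zero when no default route is present.
default_gw() {
    awk '$1 == "default" || $1 == "0.0.0.0" { print $2; found = 1; exit }
         END { exit !found }'
}
```

It could be used as `netstat -rn | default_gw || echo "no default gateway defined"`.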
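When a VIP starts it pings the default gateway, and if the gateway is unreachable the VIP will not come up. After the HP engineer had configured the default gateway, I moved the VIP from lan0 to lan2.
First delete the existing interface definitions: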
su - oracle
oifcfg delif -global
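Then I reconfigured the public and interconnect interfaces with oifcfg setif and pointed both nodes' VIPs at lan2 with srvctl modify nodeapps.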
Error reported on the second node when CRS was first started:
Attempting to start CRS stack
Failure at scls_scr_create with code 1
Internal Error Information:
Category: 1234
Operation: scls_scr_create
Location: mkdir
Other: Unable to make user dir
Dep: 2
Output of ./init.crs start on the second node:
/sbin/init.d/init.cssd[537]: /var/opt/oracle/scls_scr/rqtmsdb2/root/cssrun: Cannot create the specified file.
Startup will be queued to init within 30 seconds.
Contents of /var/opt/oracle/scls_scr on the second node:
# ls -l
total 0
drwxr-xr-x 4 root sys 96 Dec 31 2010 rqtmsdb1
Renaming the directory and checking its contents:
# mv rqtmsdb1 rqtmsdb2
# ls -l
total 0
drwxr-xr-x 4 root sys 96 Dec 31 2010 rqtmsdb2
# cd rq*
# ls -l
total 16
drwxr-xr-x 2 orarac sys 96 Dec 31 2010 orarac
drwxr-xr-x 2 root sys 8192 Nov 17 09:55 root
# cd root
# ls -l
total 48
-rw-rw-rw- 1 root root 8 Nov 17 15:33 crsdboot
-rw-r--r-- 1 root sys 7 Dec 31 2010 crsstart
-rw-rw-rw- 1 root sys 6 Nov 17 15:33 cssrun
-rw-r--r-- 1 root sys 0 Nov 17 15:33 noclsmon
-rw-rw-rw- 1 root root 0 Nov 17 15:33 nooprocd
Starting CRS on the second node after the rename, then stopping and starting it again with crsctl:
# cd /sbin/init.d
#
# ./init.crs start
Startup will be queued to init within 30 seconds.
# ps -ef|grep d.bin
root 18734 22410 1 02:22:49 pts/ta 0:00 grep d.bin
# ps -ef|grep d.bin
root 2059 1 0 22:03:36 ? 0:00 /ora_soft/oracle/product/crs/bin/crsd.bin reboot
orarac 18782 2057 0 02:23:09 ? 0:00 /ora_soft/oracle/product/crs/bin/evmd.bin
orarac 19013 19012 0 02:23:14 ? 0:00 /ora_soft/oracle/product/crs/bin/ocssd.bin
# /ora_soft/oracle/product/crs/bin/crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
# /ora_soft/oracle/product/crs/bin/crlctl stop crs
sh: /ora_soft/oracle/product/crs/bin/crlctl: not found.
# /ora_soft/oracle/product/crs/bin/crsctl stop crs
Stopping resources.
Successfully stopped CRS resources
Stopping CSSD.
Shutting down CSS daemon.
Shutdown request successfully issued.
# ps -ef|grep d.bin
root 21987 22410 0 02:24:53 pts/ta 0:00 grep d.bin
# /ora_soft/oracle/product/crs/bin/crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly
# ps -ef|grep d.bin
root 23992 22410 0 02:32:59 pts/ta 0:00 grep d.bin
# ps -ef|grep d.bin
root 23995 22410 0 02:33:05 pts/ta 0:00 grep d.bin
# ps -ef|grep d.bin
root 21829 1 0 02:24:44 ? 0:00 /ora_soft/oracle/product/crs/bin/crsd.bin reboot
orarac 24152 21817 0 02:33:18 ? 0:00 /ora_soft/oracle/product/crs/bin/evmd.bin
orarac 24299 24298 0 02:33:21 ? 0:00 /ora_soft/oracle/product/crs/bin/ocssd.bin
root 24577 22410 0 02:33:31 pts/ta 0:00 grep d.bin
# /ora_soft/oracle/product/crs/bin/crsctl status
Unknown parameter: status
# /ora_soft/oracle/product/crs/bin/crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
#
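Rather than eyeballing repeated ps -ef|grep d.bin runs, the same check can be written as a filter that names the daemons still missing. This is a sketch of my own: missing_crs_daemons is a hypothetical name, and it only looks for the three daemons that appear in the ps output above (crsd.bin, evmd.bin, ocssd.bin).

```shell
#!/bin/sh
# Sketch: read `ps -ef` output on stdin and print each of the three CRS
# daemons that does not appear in it.
missing_crs_daemons() {
    ps_out=$(cat)
    for d in crsd.bin evmd.bin ocssd.bin; do
        # Drop the grep command itself, which also contains "d.bin".
        if ! printf '%s\n' "$ps_out" | grep -v 'grep' | grep -q "/$d"; then
            printf '%s\n' "$d"
        fi
    done
}
```

Running `ps -ef | missing_crs_daemons` prints nothing once all three daemons are up.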
System log on the first node, and the diagnostics file it points to:
Nov 18 03:26:00 rqtmsdb1 syslog: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.2104.
Nov 18 03:26:00 rqtmsdb1 syslog: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.2116.
Nov 18 03:26:00 rqtmsdb1 syslog: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.2154.
Nov 18 03:34:16 rqtmsdb1 syslog: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.2154.
# cat /tmp/crsctl.2104
Failed 3 to bind listening endpoint:(ADDRESS=(PROTOCOL=tcp)(HOST=rqtmsdb1-priv))
#
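The syslog messages only name the diagnostics files, so the real error has to be fetched from each of them. A small sketch that pulls the distinct file names out of such log lines; crs_diag_files is a hypothetical name, and the pattern assumes the exact "Diagnostics in <path>." wording shown above:

```shell
#!/bin/sh
# Sketch: read syslog lines on stdin and print each distinct diagnostics
# file mentioned in "Diagnostics in <path>." messages, without the
# trailing period.
crs_diag_files() {
    sed -n 's|.*Diagnostics in \(/[^ ]*[^ .]\)\.*$|\1|p' | sort -u
}
```

Piping the system log through it, for instance `crs_diag_files < syslog.log`, yields the list of files to cat (the log file name here is a placeholder for wherever the host keeps its syslog).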
VIP debug output on the first node:
# /ora_soft/oracle/product/crs/bin/crsctl debug log res "ora.rqtmsdb1.vip:5"
# /ora_soft/oracle/product/crs/bin/srvctl start nodeapps -n rqtmsdb1
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:29 EAT 2012 [ 25193 ] Checking interface existance
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:29 EAT 2012 [ 25193 ] Calling getifbyip
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:29 EAT 2012 [ 25193 ] getifbyip: started for 172.16.7.22
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:29 EAT 2012 [ 25193 ] Completed getifbyip
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:29 EAT 2012 [ 25193 ] switched to standby : start/check operation
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] Completed with initial interface test
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] Broadcast = 172.16.7.255
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] Interface tests
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] checkIf: start for if=lan0
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] checkIf: get default gw
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] defaultgw: started
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] defaultgw: completed with
rqtmsdb1:ora.rqtmsdb1.vip:checkIf: Default gateway is not defined (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Interface lan0 checked failed (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] checkIf: end for if=lan0
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] DEBUG: FAIL_WHEN_ALL_LINK_DOWN = 1 and IF_USING =
rqtmsdb1:ora.rqtmsdb1.vip:Invalid parameters, or failed to bring up VIP (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25341 ] Checking interface existance
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25341 ] Calling getifbyip
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25341 ] getifbyip: started for 172.16.7.22
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25341 ] Completed getifbyip
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25341 ] switched to standby : start/check operation
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] Completed with initial interface test
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] Broadcast = 172.16.7.255
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] Performing CRS_STAT testing
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] Completed CRS_STAT testing
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] Interface tests
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] checkIf: start for if=lan0
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] checkIf: get default gw
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] defaultgw: started
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] defaultgw: completed with
rqtmsdb1:ora.rqtmsdb1.vip:checkIf: Default gateway is not defined (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Interface lan0 checked failed (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] checkIf: end for if=lan0
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] DEBUG: FAIL_WHEN_ALL_LINK_DOWN = 1 and IF_USING =
rqtmsdb1:ora.rqtmsdb1.vip:Invalid parameters, or failed to bring up VIP (host=rqtmsdb1)
CRS-1006: No more members to consider
CRS-0215: Could not start resource 'ora.rqtmsdb1.vip'.
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:48 EAT 2012 [ 25801 ] Checking interface existance
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:48 EAT 2012 [ 25801 ] Calling getifbyip
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:48 EAT 2012 [ 25801 ] getifbyip: started for 172.16.7.22
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:48 EAT 2012 [ 25801 ] Completed getifbyip
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:48 EAT 2012 [ 25801 ] switched to standby : start/check operation
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] Completed with initial interface test
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] Broadcast = 172.16.7.255
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] Interface tests
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] checkIf: start for if=lan0
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] checkIf: get default gw
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] defaultgw: started
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] defaultgw: completed with
rqtmsdb1:ora.rqtmsdb1.vip:checkIf: Default gateway is not defined (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Interface lan0 checked failed (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] checkIf: end for if=lan0
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] DEBUG: FAIL_WHEN_ALL_LINK_DOWN = 1 and IF_USING =
rqtmsdb1:ora.rqtmsdb1.vip:Invalid parameters, or failed to bring up VIP (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25949 ] Checking interface existance
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25949 ] Calling getifbyip
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25949 ] getifbyip: started for 172.16.7.22
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25949 ] Completed getifbyip
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25949 ] switched to standby : start/check operation
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] Completed with initial interface test
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] Broadcast = 172.16.7.255
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] Performing CRS_STAT testing
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] Completed CRS_STAT testing
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] Interface tests
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] checkIf: start for if=lan0
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] checkIf: get default gw
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] defaultgw: started
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] defaultgw: completed with
rqtmsdb1:ora.rqtmsdb1.vip:checkIf: Default gateway is not defined (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Interface lan0 checked failed (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] checkIf: end for if=lan0
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] DEBUG: FAIL_WHEN_ALL_LINK_DOWN = 1 and IF_USING =
rqtmsdb1:ora.rqtmsdb1.vip:Invalid parameters, or failed to bring up VIP (host=rqtmsdb1)
CRS-0215: Could not start resource 'ora.rqtmsdb1.LISTENER_RQTMSDB1.lsnr'.
#
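Most of that debug output is timestamped trace; the few lines that matter are easy to miss. A rough filter, assuming the trace lines always carry a "[ pid ] " marker as they do above and that the interesting lines contain "not defined", "failed" or start with a CRS- code; vip_failures is a hypothetical name:

```shell
#!/bin/sh
# Sketch: read the debug output on stdin and keep only the lines that report
# a problem, dropping timestamped trace lines and repeated messages.
vip_failures() {
    grep -E 'not defined|failed|^CRS-[0-9]+' | grep -v '\] ' | awk '!seen[$0]++'
}
```

Applied to the output above, it reduces four nearly identical attempts to the gateway and interface complaints plus the CRS-1006 and CRS-0215 errors.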
Reconfiguring the interfaces and the VIPs:
$ oifcfg setif -global lan2/172.16.7.0:public
$ oifcfg setif -global lan3/192.168.0.0:cluster_interconnect
# /ora_soft/oracle/product/crs/bin/srvctl modify nodeapps -n rqtmsdb2 -A 172.16.7.23/255.255.255.0/lan2
# /ora_soft/oracle/product/crs/bin/srvctl modify nodeapps -n rqtmsdb1 -A 172.16.7.22/255.255.255.0/lan2
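The two srvctl modify nodeapps commands differ only in node name and address, so at three in the morning it is safer to generate them than to retype them. A dry-run sketch that only prints the command for review; vip_modify_cmd is a hypothetical helper, and it prints the bare srvctl name rather than the full path used above:

```shell
#!/bin/sh
# Sketch: print (not run) the srvctl command that points a node's VIP at a
# given address, netmask and interface, in the -A address/netmask/interface
# form used above.
vip_modify_cmd() {
    node=$1
    ip=$2
    mask=$3
    ifname=$4
    printf 'srvctl modify nodeapps -n %s -A %s/%s/%s\n' "$node" "$ip" "$mask" "$ifname"
}
```

For example, `vip_modify_cmd rqtmsdb1 172.16.7.22 255.255.255.0 lan2` prints the second command above, minus the path.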
With the changes in place I ran crs_start -all once more, and RAC started successfully. Job done, time for bed!
http://blog.chinaunix.net/uid-26896647-id-3417998.html