10gR2 CRS case study: CRS would not start after reboot - stuck at /etc/init.d/init.cssd startcheck

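This article records a reboot problem encountered on SuSE Linux 9.3 after a 10g R2 CRS installation, and how it was resolved. The main symptom was that CRS would not start after a reboot; the log files showed that the cause was related to OCR permissions. The problem was finally resolved by changing the ownership of the raw devices that back the OCR.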

Preface

I had recently done a 10gR2 CRS installation on SuSE Linux 9.3 (2.6.5.7-244 kernel) and noticed that after a reboot of the RAC nodes, the CRS would not come up!

The CSS daemon was stuck at the /etc/init.d/init.cssd startcheck command:

raclinux1:/tmp # ps -ef | grep css
root      6929     1  0 13:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd fatal
root      6960  6928  0 13:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root      6963  6929  0 13:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root      7064  6935  0 13:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd startcheck

Debugging..

To debug this further, I went to the $ORA_CRS_HOME/log/<nodename>/client directory and checked the latest files there:

raclinux1:/opt/oracle/product/10.2.0/crs/log/raclinux1/client # ls -ltr
total 435
-rw-r-----  1 root   root 2561 May 18 23:20 ocrconfig_8870.log
-rw-r--r--  1 root   root  195 May 18 23:22 clscfg_8924.log
-rw-r-----  1 root   root  172 May 18 23:29 ocr_15307_3.log
-rw-r-----  1 root   root  172 May 18 23:29 ocr_15319_3.log
-rw-r-----  1 root   root  172 May 18 23:29 ocr_15447_3.log
...
...
...
drwxr-x---  2 oracle dba  3472 May 19 08:10 .
drwxr-xr-t  8 root   dba   232 May 19 13:50 ..
-rw-r--r--  1 root   root 2946 May 19 14:11 clsc.log
-rw-r--r--  1 root   root 7702 May 19 14:11 css.log

I paged through clsc.log and css.log with more and saw the following errors:

$ more clsc.log
...
...
...
2008-05-19 14:11:29.912: [ COMMCRS][1094672672]clsc_connect: (0x81c74b8) no listener at (ADDRESS=(PROTOCOL=IPC)(KEY=CRSD_UI_SOCKET))

2008-05-19 14:11:31.582: [ COMMCRS][1094672672]clsc_connect: (0x817e3f0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=SYSTEM.evm.acceptor.auth))

2008-05-19 14:11:31.583: [ default][1094672672]Terminating clsd session

$ more css.log
...
...
...
2008-05-19 02:42:48.307: [  OCROSD][1094672672]utopen:7:failed to open OCR file/disk /var/opt/oracle/ocr1 /var/opt/oracle/ocr2, errno=19, os err string=No such device
2008-05-19 02:42:48.308: [  OCRRAW][1094672672]proprinit: Could not open raw device
2008-05-19 02:42:48.308: [ default][1094672672]a_init:7!: Backend init unsuccessful : [26]
2008-05-19 02:42:48.308: [ CSSCLNT][1094672672]clsssinit: Unable to access OCR device in OCR init.

2008-05-19 02:43:41.982: [  OCROSD][1094672672]utopen:7:failed to open OCR file/disk /var/opt/oracle/ocr1 /var/opt/oracle/ocr2, errno=19, os err string=No such device
2008-05-19 02:43:41.983: [  OCRRAW][1094672672]proprinit: Could not open raw device
2008-05-19 02:43:41.983: [ default][1094672672]a_init:7!: Backend init unsuccessful : [26]
2008-05-19 02:43:41.983: [ CSSCLNT][1094672672]clsssinit: Unable to access OCR device in OCR init.

2008-05-19 02:46:40.204: [ CSSCLNT][1094672672]clsssInitNative: connect failed, rc 9

2008-05-19 14:11:28.217: [ CSSCLNT][1094672672]clsssInitNative: connect failed, rc 9

2008-05-19 14:11:37.186: [ CSSCLNT][1094672672]clsssInitNative: connect failed, rc 9
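
When a log like css.log repeats the same failure many times, it can help to collapse it into a count per OS error. The following is only a rough sketch: the function name summarize_os_errors is made up for this example, and it assumes the "errno=N, os err string=TEXT" layout seen in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper (not part of CRS): count each distinct
# "errno=N, os err string=TEXT" pair found in a log file.
summarize_os_errors() {
    # Keep only the errno number and the error text, then count duplicates.
    sed -n 's/.*errno=\([0-9][0-9]*\), os err string=\(.*\)$/\1 \2/p' "$1" |
        sort | uniq -c |
        awk '{ n = $1; $1 = ""; sub(/^ /, ""); print n "x errno " $0 }'
}

# Example call (path taken from the listing above):
#   summarize_os_errors /opt/oracle/product/10.2.0/crs/log/raclinux1/client/css.log
```

Because the pattern only looks at the tail of each line, it should work whether or not the long utopen line has been wrapped by the terminal.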

So the logs pointed to the OCR not being available, which the /tmp/crsctl.<PID> files confirmed:

raclinux1:/tmp # ls -ltr crsctl*
-rw-r--r--  1 oracle dba 148 May 19 02:44 crsctl.6826
-rw-r--r--  1 oracle dba 148 May 19 02:44 crsctl.6679
-rw-r--r--  1 oracle dba 148 May 19 02:44 crsctl.6673
-rw-r--r--  1 oracle dba 148 May 19 02:49 crsctl.7784
-rw-r--r--  1 oracle dba 148 May 19 02:49 crsctl.7890
-rw-r--r--  1 oracle dba 148 May 19 02:49 crsctl.7794
-rw-r--r--  1 oracle dba 148 May 19 13:55 crsctl.7034
-rw-r--r--  1 oracle dba 148 May 19 13:55 crsctl.6886
-rw-r--r--  1 oracle dba 148 May 19 13:55 crsctl.6883
-rw-r--r--  1 oracle dba 148 May 19 14:18 crsctl.6960
-rw-r--r--  1 oracle dba 148 May 19 14:18 crsctl.7064
-rw-r--r--  1 oracle dba 148 May 19 14:18 crsctl.6963

raclinux1:/tmp # more crsctl.6963
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [Permission denied] [13]
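
With a dozen of these small files lying around, it is quicker to see which distinct messages they contain than to page through each one. A minimal sketch, assuming the files follow the /tmp/crsctl.<PID> naming shown above (the function name crsctl_messages is invented for this example):

```shell
#!/bin/sh
# Hypothetical helper: list the distinct messages found in crsctl.<PID>
# files under a directory (default /tmp), most frequent first.
crsctl_messages() {
    dir=${1:-/tmp}
    for f in "$dir"/crsctl.*; do
        # Skip the unexpanded glob when no file matches.
        [ -f "$f" ] && cat "$f"
    done | sort | uniq -c | sort -rn
}

# Example call:
#   crsctl_messages /tmp
```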

Permission issue!

Duh! So it was a permissions issue on the OCR disk (for now), which could later extend to the voting and ASM disks:

raclinux1:/tmp # ls -ltr /dev/raw/raw*
crw-rw-r--  1 root disk 162,  9 Nov 18  2005 /dev/raw/raw9
crw-rw-r--  1 root disk 162,  8 Nov 18  2005 /dev/raw/raw8
crw-rw-r--  1 root disk 162,  7 Nov 18  2005 /dev/raw/raw7
crw-rw-r--  1 root disk 162,  6 Nov 18  2005 /dev/raw/raw6
crw-rw-r--  1 root disk 162,  5 Nov 18  2005 /dev/raw/raw5
crw-rw-r--  1 root disk 162,  4 Nov 18  2005 /dev/raw/raw4
crw-rw-r--  1 root disk 162,  3 Nov 18  2005 /dev/raw/raw3
crw-rw-r--  1 root disk 162,  2 Nov 18  2005 /dev/raw/raw2
crw-rw-r--  1 root disk 162, 15 Nov 18  2005 /dev/raw/raw15
crw-rw-r--  1 root disk 162, 14 Nov 18  2005 /dev/raw/raw14
crw-rw-r--  1 root disk 162, 13 Nov 18  2005 /dev/raw/raw13
crw-rw-r--  1 root disk 162, 12 Nov 18  2005 /dev/raw/raw12
crw-rw-r--  1 root disk 162, 11 Nov 18  2005 /dev/raw/raw11
crw-rw-r--  1 root disk 162, 10 Nov 18  2005 /dev/raw/raw10
crw-rw-r--  1 root disk 162,  1 Nov 18  2005 /dev/raw/raw1

I enabled read and write permission on the raw devices with chmod +rw /dev/raw/raw*, but even after that the latest /tmp/crsctl.<PID> files were showing this message:

raclinux1:/tmp # more crsctl.6960
Failure -2 opening file handle for (vote1)
Failure 1 checking the CSS voting disk 'vote1'.
Failure -2 opening file handle for (vote2)
Failure 1 checking the CSS voting disk 'vote2'.
Failure -2 opening file handle for (vote3)
Failure 1 checking the CSS voting disk 'vote3'.
Not able to read adequate number of voting disks
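
One possible reason the chmod made no difference: when chmod is given +rw without a who prefix (u, g, o or a), it leaves alone any bits that are set in the umask. The sketch below reproduces that on a scratch file. The umask of 022 is an assumption about root's shell, not something recorded at the time, and mode_after is a name invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the octal mode a scratch file ends up with
# after applying a chmod expression, starting from a given mode, under a
# given umask. Uses GNU stat (-c), as found on Linux.
mode_after() {
    start=$1; expr=$2; mask=$3
    (
        umask "$mask"
        f=$(mktemp) || exit 1
        chmod "$start" "$f"
        chmod "$expr" "$f"
        stat -c '%a' "$f"
        rm -f "$f"
    )
}

# 664 is the crw-rw-r-- mode shown in the listing of /dev/raw/raw* above.
mode_after 664 +rw 022     # prints 664: write for "other" is masked out
mode_after 664 a+rw 022    # prints 666: an explicit "a" ignores the umask
```

Under that assumption the devices would have stayed at crw-rw-r-- root:disk, so a user that neither owns them nor belongs to the disk group would still have had no write access.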

At this point, I just chowned /dev/raw/raw* to oracle:dba like this:

raclinux1:/tmp # chown oracle:dba /dev/raw/raw*
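
To confirm that the chown took effect on every device, something along these lines could be used. It is a sketch only: wrong_owner is a made-up name, and it relies on GNU stat.

```shell
#!/bin/sh
# Hypothetical helper: print every path whose owner:group differs from
# the expected value; prints nothing when all paths match.
wrong_owner() {
    expected=$1; shift
    for dev in "$@"; do
        # Skip paths that stat cannot read (stat reports the error itself).
        actual=$(stat -c '%U:%G' "$dev") || continue
        [ "$actual" = "$expected" ] || echo "$dev $actual (expected $expected)"
    done
}

# Example call:
#   wrong_owner oracle:dba /dev/raw/raw*
```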

After a minute or two, the CSS came up:

raclinux1:/tmp # ps -ef | grep css
root      6929     1  0 13:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd fatal
root     10900  6929  0 14:39 ?        00:00:00 /bin/sh /etc/init.d/init.cssd daemon
oracle   10980 10900  0 14:40 ?        00:00:00 /bin/su -l oracle -c /bin/sh -c 'ulimit -c unlimited; cd /opt/oracle/product/10.2.0/crs/log/raclinux1/cssd;  /opt/oracle/product/10.2.0/crs/bin/ocssd  || exit $?'
oracle   10981 10980  0 14:40 ?        00:00:00 /bin/sh -c ulimit -c unlimited; cd /opt/oracle/product/10.2.0/crs/log/raclinux1/cssd;  /opt/oracle/product/10.2.0/crs/bin/ocssd  || exit $?
oracle   11007 10981  2 14:40 ?        00:00:00 /opt/oracle/product/10.2.0/crs/bin/ocssd.bin
root     12013  7414  0 14:40 pts/2    00:00:00 grep css
raclinux1:/tmp #

The CRS components came up fine automatically:

raclinux1:/opt/oracle/product/10.2.0/crs/bin # ./crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
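
Since the daemons took a minute or two to come up, a small polling loop can save re-running the check by hand. This is a sketch under the assumption that the check prints one "appears healthy" line per daemon, as in the output above; wait_for_crs and its arguments are invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: run a check command repeatedly until it reports
# three "appears healthy" lines (CSS, CRS, EVM) or the attempts run out.
wait_for_crs() {
    check_cmd=$1        # e.g. "/opt/oracle/product/10.2.0/crs/bin/crsctl check crs"
    tries=${2:-30}      # maximum number of attempts
    delay=${3:-10}      # seconds to sleep between attempts
    i=0
    while [ "$i" -lt "$tries" ]; do
        # $check_cmd is left unquoted on purpose so it splits into words.
        healthy=$($check_cmd 2>/dev/null | grep -c 'appears healthy')
        [ "$healthy" -eq 3 ] && return 0
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example call:
#   wait_for_crs "/opt/oracle/product/10.2.0/crs/bin/crsctl check crs" 30 10 && echo "CRS is up"
```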

The ASM and RAC instances also came up fine:

raclinux1:/opt/oracle/product/10.2.0/crs/bin # ps -ef |grep smon
oracle   12257     1  0 14:41 ?        00:00:00 asm_smon_+ASM1
oracle   13100     1  0 14:41 ?        00:00:02 ora_smon_o10g1
root     32282  7414  0 14:55 pts/2    00:00:00 grep smon

For the long term..

To make this change permanent, I put it in the /etc/init.d/boot.local file, along with the modprobe hangcheck-timer command:

raclinux1:/opt/oracle/product/10.2.0/crs/bin # more /etc/init.d/boot.local
#! /bin/sh
#
# Copyright (c) 2002 SuSE Linux AG Nuernberg, Germany.  All rights reserved.
#
# Author: Werner Fink <werner@suse.de>, 1996
#         Burchard Steinbild, 1996
#
# /etc/init.d/boot.local
#
# script with local commands to be executed from init on system startup
#
# Here you should add things, that should happen directly after booting
# before we're going to the first run level.
#
chown oracle:dba /dev/raw/raw*
modprobe hangcheck-timer hangcheck_tick=30 hangcheck_margin=180
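
The blanket chown on /dev/raw/raw* works, but a slightly more defensive variant would touch only the devices actually in use and complain about any that are missing at boot. The following is a sketch, not what was deployed here: set_device_access is a made-up name, and the device list and the 660 mode in the example call are placeholders, since which raw devices back the OCR, voting and ASM disks is not listed above.

```shell
#!/bin/sh
# Hypothetical helper: set owner and mode on an explicit list of devices.
# Returns non-zero if any device is missing or could not be changed.
set_device_access() {
    owner=$1; mode=$2; shift 2
    rc=0
    for dev in "$@"; do
        if [ ! -e "$dev" ]; then
            echo "skipping $dev: not present" >&2
            rc=1
            continue
        fi
        chown "$owner" "$dev" && chmod "$mode" "$dev" || rc=1
    done
    return $rc
}

# Example call (device names and mode are assumptions for illustration):
#   set_device_access oracle:dba 660 /dev/raw/raw1 /dev/raw/raw2
```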

Conclusion

If something as simple as the permissions on the OCR devices is wrong, it can hold up the CRS daemons and, with them, the ASM and database instances. A workaround in /etc/init.d/boot.local may be needed to get around the situation.
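
The error messages you encountered:

```
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Start failed, or completed with errors.
```

indicate that when you tried to start the CSSD (Cluster Synchronization Services Daemon) resource with `crsctl start resource ora.cssd`, the command **could not communicate with CRS (Cluster Ready Services)**. This usually means that **Oracle High Availability Services (OHAS or CRS) itself is not running**, or that the cluster infrastructure has a serious problem.

---

### Problem analysis

- `ora.cssd` is a core resource in Oracle RAC (Real Application Clusters), responsible for node membership and synchronization between nodes.
- Under normal conditions `cssd` is started automatically by OHASD (Oracle High Availability Services Daemon); **it should not be started by hand with `crsctl start resource`**.
- The root cause of `Cannot communicate with Cluster Ready Services` is usually one of the following:
  - CRS has not been started (that is, OHASD is not running)
  - A cluster process has crashed or been shut down
  - Insufficient privileges (the operation should be run as root)
  - The clusterware is damaged or misconfigured
  - A network problem is preventing communication between nodes (for example, the private interconnect is down)

---

### Correct solution: start the whole cluster stack (as root)

You should start OHASD **as the root user** rather than trying to start a single resource as an ordinary user.

#### 1. Switch to root and start OHASD

```bash
# Switch to the root user
sudo su - root

# Start OHASD (Oracle High Availability Services)
/etc/init.d/ohasd start

# Or, on newer versions:
initctl start oracle-ohasd

# Or with systemctl (if the system is managed by systemd)
systemctl start ohasd
```

> Note: the exact command depends on your operating system and Oracle version (11g/12c/19c).

#### 2. Check the CRS status with crsctl

```bash
# Switch back to the grid user
su - grid

# Check the CRS status
crsctl check cluster
crsctl check crs
```

The expected output reports the services as "online" or "active".

#### 3. If CRS still does not respond, restart the whole clusterware

```bash
# Run as root
crsctl stop crs    # stop CRS
crsctl start crs   # start CRS (automatically starts cssd, crsd, evmd, etc.)
```

This starts all cluster services, including `ora.cssd`.

---

### Additional checks

1. **Confirm that you are running the command as the right user**
   - `crsctl start resource ...` may need root privileges, especially when CRS is not running.
   - The `grid` user can only control resource state; it cannot start the underlying daemons.

2. **Check the logs to locate the problem**

   ```bash
   # OHASD log
   $GRID_HOME/log/<hostname>/ohasd/ohasd.log

   # CRS daemon log
   $GRID_HOME/log/<hostname>/crsd/crsd.log
   ```

   Look for permission errors, network failures, or unavailable disk resources.

3. **Confirm that autostart is enabled for the cluster**

   ```bash
   crsctl config crs
   ```

   It should report "enabled".

4. **Check the CSS timeout settings**

   ```bash
   crsctl get css disktimeout
   crsctl get css misscount
   ```

   If these values are abnormal, nodes may be evicted or unable to join the cluster.

---

### What not to do

**Do not try to start the `ora.cssd` resource directly**, especially when CRS is not running. `cssd` depends on OHASD and other cluster services, so starting it on its own is very unlikely to succeed.

---

### Summary of steps (suggested order)

```bash
# 1. Switch to root
sudo su - root

# 2. Stop CRS (if it is partially running)
crsctl stop crs

# 3. Start CRS
crsctl start crs

# 4. Switch back to the grid user
su - grid

# 5. Check the cluster status
crsctl check cluster
crsctl check crs
crsctl status resource -t
```

At this point `ora.cssd` should already have been started automatically.

---

### Common related symptoms

| Error | Cause |
|------|------|
| CRS-4535: Cannot communicate with CRS | OHASD is not running |
| CRS-4000: Command Start failed | A service the target resource depends on is not ready |
| `crsctl check crs` reports not active | The OHASD process has not started |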