Recovering a RAC environment where the cssd service fails to start after the multipath software is replaced

Operating system: CentOS 6.8
Database version: 11.2.0.4.0

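After the Huawei multipath software was replaced with the device-mapper-multipath package that ships with CentOS, the RAC cluster startup hung at the CSSD service, whose state stayed at starting.

In 11gR2 the CRS stack depends on ASM because the OCR is stored in ASM, so if the ASM disks cannot be read properly, CRS cannot work either. Here CSSD cannot even discover its voting files: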

Cluster alert log:

2022-03-26 12:08:02.469: 
[ohasd(28938)]CRS-2112:The OLR service started on node RAC01.
2022-03-26 12:08:02.486: 
[ohasd(28938)]CRS-1301:Oracle High Availability Service started on node RAC01.
2022-03-26 12:08:02.493: 
[ohasd(28938)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
2022-03-26 12:08:06.212: 
[/u01/app/11.2.0/grid/bin/orarootagent.bin(29065)]CRS-2302:Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running). 
2022-03-26 12:08:10.405: 
[ohasd(28938)]CRS-2302:Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running). 
2022-03-26 12:08:10.412: 
[gpnpd(29482)]CRS-2328:GPNPD started on node RAC01. 
2022-03-26 12:08:12.745: 
[cssd(29564)]CRS-1713:CSSD daemon is started in clustered mode
2022-03-26 12:08:14.621: 
[ohasd(28938)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2022-03-26 12:08:14.621: 
[ohasd(28938)]CRS-2769:Unable to failover resource 'ora.diskmon'.
2022-03-26 12:08:17.884: 
[cssd(29564)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/RAC01/cssd/ocssd.log
2022-03-26 12:08:32.900: 
[cssd(29564)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/RAC01/cssd/ocssd.log
2022-03-26 12:08:47.916: 
[cssd(29564)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/RAC01/cssd/ocssd.log
2022-03-26 12:09:02.932: 
[cssd(29564)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/RAC01/cssd/ocssd.log
2022-03-26 12:09:17.949: 
[cssd(29564)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/RAC01/cssd/ocssd.log
2022-03-26 12:09:32.965: 
[cssd(29564)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/RAC01/cssd/ocssd.log
2022-03-26 12:09:47.981: 
[cssd(29564)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/RAC01/cssd/ocssd.log
2022-03-26 12:10:02.998: 
[cssd(29564)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/RAC01/cssd/ocssd.log
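
The repeated CRS-1714 entries above can be condensed when reading a long alert log. The following is a minimal sketch and not part of the original troubleshooting: `summarize_crs1714` is a hypothetical helper name, and it assumes the two-line layout shown above, where each message follows a line that holds only its timestamp.

```shell
#!/bin/bash
# Summarize CRS-1714 (voting file discovery retry) messages in a Clusterware
# alert log read from stdin. Assumes each message line is preceded by a line
# that starts with its timestamp, as in the excerpt above.
summarize_crs1714() {
    awk '
        # Timestamp line: remember it (without the trailing colon).
        /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] / {
            ts = $1 " " $2
            sub(/:$/, "", ts)
            next
        }
        # Message line: count it and track the first and last timestamps.
        /CRS-1714:/ {
            n++
            if (n == 1) first = ts
            last = ts
        }
        END { printf "retries=%d first=%s last=%s\n", n, first, last }
    '
}
```

It could be run as `summarize_crs1714 < "$ALERT_LOG"`, where `ALERT_LOG` is a placeholder for the path to the node's alert log. Fed the excerpt above, it prints `retries=8 first=2022-03-26 12:08:17.884 last=2022-03-26 12:10:02.998`.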

ocssd.log:


2022-03-26 12:09:47.981: [    CLSF][3717953280]checksum failed for disk:/dev/asm-datadisk01-new:
2022-03-26 12:09:47.981: [    CLSF][3717953280]Error: obj 2147483658 blk 0 name 'check_kfbh' num1 1289612970 num2 2751807285
2022-03-26 12:09:47.981: [    CLSF][3717953280]bh: ptr 0x7f5dc8138e00 size 512
2022-03-26 12
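
The `checksum failed for disk` entries in ocssd.log name the device that CSSD tried and rejected. As a rough sketch (the helper name `list_checksum_failures` is made up for illustration, and the pattern assumes the exact message format shown above), the affected devices can be listed once each:

```shell
#!/bin/bash
# Print each device that ocssd.log reports a checksum failure for, once.
# Reads the log on stdin and relies on the message format
# "checksum failed for disk:<device>:" seen in the excerpt above.
list_checksum_failures() {
    sed -n 's/.*checksum failed for disk:\([^:]*\):.*/\1/p' | sort -u
}
```

For the excerpt above, `list_checksum_failures < ocssd.log` would print `/dev/asm-datadisk01-new`, which is the device to examine after the multipath change.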
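### Fixing multipath and UDEV configuration when an Oracle 11g RAC cluster fails to start

When an Oracle 11g RAC cluster runs into multipath or UDEV configuration problems after its storage has been re-presented, check the following areas in turn.

#### 1. **Multipath configuration**

Multipath configuration keeps the shared storage devices consistent. If it is wrong, ASM may fail to recognize its disks or cluster resources may become unavailable.

- Check that `multipathd` is running and enabled at boot. CentOS 6 has no `systemctl`, so use `service` and `chkconfig`:

```bash
service multipathd status
chkconfig multipathd on
```

- Edit `/etc/multipath.conf` to enable user-friendly names and a more robust path grouping policy[^3]:

```conf
defaults {
    user_friendly_names yes
    path_grouping_policy multibus
    failback immediate
    no_path_retry fail
}
```

- Reload the multipath maps to apply the change:

```bash
multipath -r
```

#### 2. **UDEV rule verification**

UDEV is the Linux service that manages device nodes dynamically. For RAC, correct UDEV rules prevent the instability caused by disk names changing.

- Create or edit a UDEV rule file (for example `/etc/udev/rules.d/99-oracleasm.rules`) that defines fixed device names:

```bash
KERNEL=="sd*", BUS=="scsi", PROGRAM=="/sbin/scsi_id --whitelisted --replace-whitespace --device /dev/$name", RESULT=="SATA_DISK_SERIAL_NUMBER", NAME="oradisk%n", OWNER="oracle", GROUP="dba", MODE="0660"
```

Replace `SATA_DISK_SERIAL_NUMBER` and the other placeholders with the actual hardware serial number or another unique identifier.

- Load the new rules and check the result:

```bash
udevadm control --reload-rules && udevadm trigger
ls -l /dev/oradisk*
```

#### 3. **Clusterware log analysis**

The Clusterware logs are the main way to pin down the specific cause of the failure. In 11.2 the files to look at are:

- `$GRID_HOME/log/<hostname>/crs/`
- `$GRID_HOME/log/<hostname>/alert<hostname>.log`

Search for keywords such as "ORA-" and "CRS-", or for specific disk access errors.

#### 4. **ASM disk discovery path**

Make sure the ASM instances on all nodes can discover the same set of disks. A consistent environment, including the intended discovery path, can be recorded in `.bash_profile`[^4]:

```bash
export ORACLE_BASE=/u01/app/grid
export ORACLE_HOME=/u01/app/11.2.0/grid
export GRID_HOME=/u01/app/11.2.0/grid
export ORACLE_SID=+ASM1   # use +ASM2 on the second node
export ASM_DISKGROUP=data_diskgroup_name
export ASM_DISKSTRING='/dev/mapper/*'
```

`ASM_DISKSTRING` should point at the consistent device paths generated by multipath, not at the raw SCSI devices. ASM itself takes its discovery string from the `asm_diskstring` instance parameter rather than from the shell environment, so the exported value only records the intended setting.

#### 5. **Database binary permission correction**

Even after the underlying storage problem is fixed, the permissions on some key executables may still need to be corrected by hand before the service processes load. The case mentioned earlier is a typical example[^1]:

```bash
chmod 6751 "$ORACLE_HOME/bin/oracle"
srvctl stop database -d <dbname>
srvctl start database -d <dbname>
```

Once these steps are done, start the components one at a time and watch their state until the whole cluster is back to normal.

---

### Example script: automating part of the basic checks

The following simplified Bash script helps diagnose some common conditions quickly:

```bash
#!/bin/bash
# Check multipath status (CentOS 6 uses service, not systemctl)
echo "Checking Multipath Service..."
service multipathd status >/dev/null 2>&1 || echo "[ERROR] Multipath service not running."

# List all mapped devices
printf '\nListing all mapped devices:\n'
multipath -ll

# Verify that the UDEV rules were applied
printf '\nVerifying UDEV rules applied correctly:\n'
ls -lh /dev/oradisk*

# Show the last lines of the Clusterware alert log
printf '\nDisplay last few lines from CRS log for potential errors:\n'
tail -n 20 "$GRID_HOME"/log/"$(hostname)"/alert*.log
```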
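
The UDEV rule in step 2 sets `OWNER`, `GROUP` and `MODE` on the device nodes, and a wrong value there is easy to miss. Below is a hedged sketch of a check for this. The function name `check_node_perms` and its argument order are illustrative assumptions, and it relies on GNU `stat`.

```shell
#!/bin/bash
# Compare the owner, group and mode of device nodes against expected values.
# Usage: check_node_perms OWNER:GROUP MODE PATH...
# MODE is written the way GNU stat prints it with %a, e.g. 660.
# Returns non-zero if any path is missing or does not match.
check_node_perms() {
    want_owner="$1"
    want_mode="$2"
    shift 2
    rc=0
    for path in "$@"; do
        if [ ! -e "$path" ]; then
            echo "MISSING $path"
            rc=1
            continue
        fi
        # -L follows symlinks so the underlying device node is examined.
        actual="$(stat -L -c '%U:%G %a' "$path")"
        if [ "$actual" = "$want_owner $want_mode" ]; then
            echo "OK $path $actual"
        else
            echo "MISMATCH $path $actual (expected $want_owner $want_mode)"
            rc=1
        fi
    done
    return "$rc"
}
```

With the values from the rule above, a call might look like `check_node_perms oracle:dba 660 /dev/oradisk*`. Adjust the owner, group and path pattern to whatever the local rules actually create.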