CRSD unable to start because the CRS disk group was dismounted

This post records the troubleshooting of an Oracle RAC issue in which CRS shut down on its own on node node1. By analyzing the EM alert, the CRS process state, and the logs, the problem was traced to the CRS disk group not being mounted, and concrete resolution steps are given.


Environment: 11.2.0.3 RAC primary + RAC standby

Problem: CRS shut down on its own on node1 of the production RAC standby.

-- EM alert
Message=Clusterware has problems on the master agent host CRS-4638: Oracle High Availability Services is online 
CRS-4535: Cannot communicate with Cluster Ready Services CRS-4529: Cluster Synchronization Services is online 
CRS-4533: Event Manager is online 


Logging in to node1 of that database shows the CRS daemon is down.

[grid@node1 ~]$ crsctl check cluster -all
**************************************************************
node1:
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
node2:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************

$ crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.

[grid@node1 ~]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services  <----------
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

ps -ef | grep crs   -- likewise shows no crsd process

Trying to start CRS fails:
[root@node1 ~]# /app/grid/bin/crsctl start crs
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.

[root@node1 ~]# /app/grid/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services <<---- CRS is still down
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online


As the grid user, check the crsd log.

$ cd $ORACLE_HOME/log/node1/crsd
$ vim crsd.log 
-------------------
2013-12-10 15:47:19.902: [  OCRASM][33715952]ASM Error Stack :
2013-12-10 15:47:19.934: [  OCRASM][33715952]proprasmo: kgfoCheckMount returned [6]
2013-12-10 15:47:19.934: [  OCRASM][33715952]proprasmo: The ASM disk group crs is not found or not mounted <-------
2013-12-10 15:47:19.934: [  OCRRAW][33715952]proprioo: Failed to open [+crs]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2013-12-10 15:47:19.934: [  OCRRAW][33715952]proprioo: No OCR/OLR devices are usable
2013-12-10 15:47:19.934: [  OCRASM][33715952]proprasmcl: asmhandle is NULL
2013-12-10 15:47:19.935: [    GIPC][33715952] gipcCheckInitialization: possible incompatible non-threaded init from [prom.c : 690], original from [clsss.c : 5343]
2013-12-10 15:47:19.935: [ default][33715952]clsvactversion:4: Retrieving Active Version from local storage.
2013-12-10 15:47:19.937: [  OCRRAW][33715952]proprrepauto: The local OCR configuration matches with the configuration published by OCR Cache Writer. No repair required.
2013-12-10 15:47:19.938: [  OCRRAW][33715952]proprinit: Could not open raw device
2013-12-10 15:47:19.938: [  OCRASM][33715952]proprasmcl: asmhandle is NULL
2013-12-10 15:47:19.939: [  OCRAPI][33715952]a_init:16!: Backend init unsuccessful : [26]
2013-12-10 15:47:19.939: [  CRSOCR][33715952] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage <-------
2013-12-10 15:47:19.939: [    CRSD][33715952] Created alert : (:CRSD00111:) :  Could not init OCR, error: PROC-26: Error while accessing the physical storage
2013-12-10 15:47:19.939: [    CRSD][33715952][PANIC] CRSD exiting: Could not init OCR, code: 26
2013-12-10 15:47:19.939: [    CRSD][33715952] Done.
---------------------


The log shows that the CRS disk group is the problem.
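Scanning crsd.log for these telltale lines can be automated. A small sketch follows; the patterns are copied from the log excerpt above and are by no means an exhaustive list of crsd failure signatures.

```python
# Pull likely root-cause lines out of a crsd.log dump.
# Patterns taken from the excerpt above; extend as needed.

ROOT_CAUSE_PATTERNS = (
    "not found or not mounted",       # ASM disk group unavailable
    "No OCR/OLR devices are usable",  # no usable OCR location
    "PROC-26",                        # physical storage access error
)

def find_root_cause(log_text: str) -> list:
    return [line for line in log_text.splitlines()
            if any(p in line for p in ROOT_CAUSE_PATTERNS)]

sample_crsd = """\
2013-12-10 15:47:19.934: [  OCRASM][33715952]proprasmo: The ASM disk group crs is not found or not mounted
2013-12-10 15:47:19.934: [  OCRRAW][33715952]proprioo: No OCR/OLR devices are usable
2013-12-10 15:47:19.939: [  CRSOCR][33715952] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage"""

for hit in find_root_cause(sample_crsd):
    print(hit)
```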

Indeed, the CRS disk group is not mounted:
su - grid 
SQL> set linesize 200
SQL>  select GROUP_NUMBER,NAME,TYPE,ALLOCATION_UNIT_SIZE,STATE from v$asm_diskgroup;

GROUP_NUMBER NAME                           TYPE   ALLOCATION_UNIT_SIZE STATE
------------ ------------------------------ ------ -------------------- -----------
           0 CRS                                                      0 DISMOUNTED<--------
           2 DATA1                          EXTERN              4194304 MOUNTED

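A dismounted group can also be flagged programmatically from this query's output. The sketch below assumes only the fixed-width data rows shown above (header and separator lines stripped), with NAME in the second column and STATE last.

```python
# Flag dismounted disk groups given data rows from the
# v$asm_diskgroup query above.

def dismounted_groups(rows: str) -> list:
    names = []
    for line in rows.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[-1] == "DISMOUNTED":
            names.append(parts[1])    # NAME is the second column
    return names

rows = """\
           0 CRS                                                      0 DISMOUNTED
           2 DATA1                          EXTERN              4194304 MOUNTED"""

print(dismounted_groups(rows))   # -> ['CRS']
```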

Checking the ASM instance alert log shows that the CRS disk group was forcibly dismounted.
SQL> show parameter dump
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
background_core_dump                 string      partial
background_dump_dest                 string      /app/gridbase/diag/asm/+asm/+A
                                                 SM1/trace                                          
cd /app/gridbase/diag/asm/+asm/+ASM1/trace
$ vim alert_+ASM1.log 
-------------------------------------------
Tue Dec 10 11:13:57 2013
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 1.
Tue Dec 10 11:13:57 2013
NOTE: process _b000_+asm1 (15822) initiating offline of disk 0.3916226472 (CRS_0000) with mask 0x7e in group 1
NOTE: process _b000_+asm1 (15822) initiating offline of disk 1.3916226471 (CRS_0001) with mask 0x7e in group 1
NOTE: process _b000_+asm1 (15822) initiating offline of disk 2.3916226470 (CRS_0002) with mask 0x7e in group 1
NOTE: checking PST: grp = 1
GMON checking disk modes for group 1 at 12 for pid 37, osid 15822
ERROR: no read quorum in group: required 2, found 0 disks
NOTE: checking PST for grp 1 done.
NOTE: initiating PST update: grp = 1, dsk = 0/0xe96cdfa8, mask = 0x6a, op = clear
NOTE: initiating PST update: grp = 1, dsk = 1/0xe96cdfa7, mask = 0x6a, op = clear
NOTE: initiating PST update: grp = 1, dsk = 2/0xe96cdfa6, mask = 0x6a, op = clear
GMON updating disk modes for group 1 at 13 for pid 37, osid 15822
ERROR: no read quorum in group: required 2, found 0 disks
Tue Dec 10 11:13:57 2013
NOTE: cache dismounting (not clean) group 1/0x165C2F6D (CRS) 
WARNING: Offline for disk CRS_0000 in mode 0x7f failed.
WARNING: Offline for disk CRS_0001 in mode 0x7f failed.
NOTE: messaging CKPT to quiesce pins Unix process pid: 15824, image: oracle@node1 (B001)
WARNING: Offline for disk CRS_0002 in mode 0x7f failed.
Tue Dec 10 11:13:57 2013
NOTE: halting all I/Os to diskgroup 1 (CRS)
Tue Dec 10 11:13:57 2013
NOTE: LGWR doing non-clean dismount of group 1 (CRS)
NOTE: LGWR sync ABA=3.42 last written ABA 3.42
Tue Dec 10 11:13:57 2013
kjbdomdet send to inst 2
detach from dom 1, sending detach message to inst 2
Tue Dec 10 11:13:57 2013
List of instances:
 1 2
Dirty detach reconfiguration started (new ddet inc 1, cluster inc 4)
 Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 1 invalid = TRUE 
Tue Dec 10 11:13:57 2013
NOTE: No asm libraries found in the system
 520 GCS resources traversed, 0 cancelled
Dirty Detach Reconfiguration complete
Tue Dec 10 11:13:57 2013
WARNING: dirty detached from domain 1
NOTE: cache dismounted group 1/0x165C2F6D (CRS) 
SQL> alter diskgroup CRS dismount force /* ASM SERVER:375140205 */          <------------ CRS forcibly dismounted ---
Tue Dec 10 11:13:57 2013
NOTE: cache deleting context for group CRS 1/0x165c2f6d
GMON dismounting group 1 at 14 for pid 41, osid 15824
NOTE: Disk CRS_0000 in mode 0x7f marked for de-assignment
NOTE: Disk CRS_0001 in mode 0x7f marked for de-assignment
NOTE: Disk CRS_0002 in mode 0x7f marked for de-assignment
NOTE:Waiting for all pending writes to complete before de-registering: grpnum 1
Tue Dec 10 11:14:27 2013
NOTE:Waiting for all pending writes to complete before de-registering: grpnum 1
Tue Dec 10 11:14:29 2013
ASM Health Checker found 1 new failures
Tue Dec 10 11:14:57 2013
SUCCESS: diskgroup CRS was dismounted
SUCCESS: alter diskgroup CRS dismount force /* ASM SERVER:375140205 */
SUCCESS: ASM-initiated MANDATORY DISMOUNT of group CRS
--------------------------------------                     

Mount the CRS disk group:
su - grid
sqlplus / as sysasm  -- !!! must be sysasm
SQL> alter diskgroup crs mount;

SQL> select GROUP_NUMBER,NAME,TYPE,ALLOCATION_UNIT_SIZE,STATE from v$asm_diskgroup;
GROUP_NUMBER NAME                           TYPE   ALLOCATION_UNIT_SIZE STATE
------------ ------------------------------ ------ -------------------- -----------
           1 CRS                            NORMAL              4194304 MOUNTED
           2 DATA1                          EXTERN              4194304 MOUNTED

Start CRS.
The usual crsctl start crs command still fails:
# /app/grid/bin/crsctl start crs
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.

This command starts it successfully:
[root@node1 ~]# /app/grid/bin/crsctl start res ora.crsd -init
CRS-2672: Attempting to start 'ora.crsd' on 'node1'
CRS-2676: Start of 'ora.crsd' on 'node1' succeeded


# /app/grid/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online   <------------------ CRS is back up
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online


$ crsctl check cluster -all
**************************************************************
node1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
node2:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************

Resolution roadmap:
crsd_log-->asm_instance_alert_log-->mount crs diskgroup -->start crs

Although CRS on node1 is back to normal, the reason the CRS disk group was forcibly dismounted has not yet been found; it will be posted here once identified.