Learn more about the 11gR2 diskmon

This article describes the role of the Diskmon process in Oracle Clusterware and how its behaviour differs between Exadata and non-Exadata environments. Starting with 11.2.0.3, the Diskmon process is no longer started by default in non-Exadata environments. Sample Diskmon logs and solutions to common problems are also provided.


The figure below shows the dependency relationships among all resources/daemons managed by ohasd:

Diskmon

 

Master diskmon

• Monitors CELLSRVs and Network using Heartbeats
• Propagates Cell/Network state to ASM/RDBMS processes (dskm)
• Maintains a cluster-wide global view of Cells with other DISKMONs in the cluster
• Accepts fencing requests from CRS and delivers them to the Cells
• Accepts intradatabase IORM plans from RDBMS and sends them to the Cells
• Provides communication with the cells

The Diskmon daemon monitors the cell servers in Exadata, so it is only useful in an Exadata environment. In releases 11.2.0.1 and 11.2.0.2, however, the daemon was started by default even in non-Exadata environments. This was refined in 11.2.0.3: on non-Exadata systems diskmon no longer starts.

11.2.0.3 Grid Infrastructure diskmon Will be Offline by Default in Non-Exadata Environment

What is being announced? As the Grid Infrastructure daemon diskmon.bin is used for Exadata fencing, starting from 11.2.0.3 the resource ora.diskmon will be offline in non-Exadata environments. This is an expected behaviour change.

Prior to 11.2.0.3:

ps -ef | grep diskmon.bin
grid      3361  3166  0 22:57 ?        00:00:00 /ocw/grid/bin/diskmon.bin -d -f

On 11.2.0.3:

ps -ef | grep diskmon.bin
>> no more diskmon.bin

Some diskmon log samples:
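Before reading through the failure logs, this expected state is easy to confirm on your own cluster. A minimal sketch (only the ps check runs without the Grid Infrastructure tools on the PATH):

```shell
# Count running diskmon.bin processes; on 11.2.0.3+ non-Exadata this prints 0.
# The [d] bracket trick stops grep from matching its own command line.
ps -ef | grep -c '[d]iskmon\.bin' || true

# With Grid Infrastructure tools available, query the resource state itself
# (expected OFFLINE on non-Exadata 11.2.0.3+):
#   crsctl stat res ora.diskmon -init
```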
[ CSSD]2009-07-27 10:27:36.419 [20] >TRACE: kgzf_dskm_conn4: unable to connect to master
diskmon in 60174 msec

[ CSSD]2009-07-27 10:27:36.419 [20] >TRACE: kgzf_send_main1: connection to master diskmon
timed out

[ CSSD]2009-07-27 10:27:36.421 [22] >TRACE: KGZF: Fatal diskmon condition, IO fencing is
not available. For additional error info look at the master diskmon log file (diskmon.log)

[ CSSD]2009-07-27 10:27:36.421 [22] >ERROR: ASSERT clsssc.c 2471
[ CSSD]2009-07-27 10:27:36.421 [22] >ERROR: clssscSAGEInitFenceCompl: Fence completion
failed, rc 56859

It appears that the diskmon process newly registered with Oracle Clusterware is unable to communicate properly.

setsid: failed with -1/1
dskm_getenv_oracle_user: calling getpwnam_r for user oracle
dskm_getenv_oracle_user: info for user oracle complete
dskm_set_user: unable to change ownership for the log directory
/optware/oracle/11.1.0.7/crs/log/shplab01/diskmon to user oracle, id 1101, errno 1
07/27/09 10:27:37: Master Diskmon starting

The tusc trace (the HP-UX system-call tracer) of the CSSD process shows the following:
...
1248953770.528145 [/optware/ora][20944]{2992772}
unlink("/var/spool/sockets/pwgr/client20944") ERR#2 ENOENT
1248953770.612485 [/optware/ora][20944]{2992772}
unlink("/tmp/.oracle_master_diskmon") ERR#1 EPERM
1248953770.649479 [/optware/ora][20944]{2992772}
unlink("/tmp/.oracle_master_diskmon") ERR#1 EPERM
1248953770.656719 [/optware/ora][20944]{2992772}
unlink("/var/spool/sockets/pwgr/client20944") ERR#1 EPERM
...

There is a permission error on the file /tmp/.oracle_master_diskmon (unlink fails with EPERM).
Solution

The resolution is to change the ownership of the file /tmp/.oracle_master_diskmon, which should be owned by the oracle user.
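A minimal sketch of the fix, run against a stand-in path so it is safe to try anywhere. On an affected node the real target is /tmp/.oracle_master_diskmon, the chown must run as root, and the owner oracle:oinstall is an assumption (use the actual Grid/Oracle software owner on your system):

```shell
# Stand-in file; substitute /tmp/.oracle_master_diskmon on the affected node.
sock=/tmp/.oracle_master_diskmon.demo
touch "$sock"

# On the real file, as root, hand ownership back to the Oracle software owner
# so diskmon can unlink and recreate its socket on startup:
#   chown oracle:oinstall /tmp/.oracle_master_diskmon

# Verify owner and mode before restarting the stack:
stat -c '%U %a' "$sock"

rm -f "$sock"
```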

diskmon.log
============
2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512] SKGXP:[386927568.6]{0}:
(14036 -> 12265) SKGXP_CHECK_HEART_BEAT_RESP_EXPIRE: NO PATH to Monitor
entry: 0x17161490
2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512] SKGXP:[386927568.7]{0}:
  Subnet: 0
2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512] SKGXP:[386927568.8]{0}:
   Remote endpoint [192.168.10.3/44538] is DOWN
2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512] SKGXP:[386927568.9]{0}:
   Local endpoint [192.168.10.1/45530] is UP
2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512]
SKGXP:[386927568.10]{0}: SKGXP_DO_HEART_BEAT_RESP: Matching Monitor Entry Not
Found
2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512]
SKGXP:[386927568.11]{0}:   SKGXPGPID Internet address 192.168.10.3 RDS port
number 44538
2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512] dskm_hb_thrd_main11:
got status change
2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512]
dskm_ant_rsc_monitor_start: rscnam: o/192.168.10.3 rsc: 0x171609c0 state:
UNREACHABLE reconn_attempts: 0 last_reconn_ts: 1322773921
2011-12-01 22:14:49.649: [ DISKMON][14036:1093384512]
dskm_node_guids_are_offline: query SM done. retcode = 56891(REACHABLE)
2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512] dskm_oss_get_net_info5:
oss_get_net_info for device o/192.168.10.3 returned skgxpid
040302010001894cb5afca0419ed706ae92f000008000000000000000000000001030000c0a80a
03000000000000000000000000adfa00000000000016000000 and the following 1 ip
adresess. known_reid: Yes
2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512]     192.168.10.1
2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512]
dskm_ant_rsc_monitor_start6.5:Cell does support TCP monitor, and does support
SM Query, cell incarnation is 1, guid num is 2
2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512] GUID-0 =
0x0021280001a0af15
2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512] GUID-1 =
0x0021280001a0af16
2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512]
dskm_ant_rsc_monitor_start2: Connected to Same_Inc OSS device: o/192.168.10.3
numIP: 1
2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512]     192.168.10.1

2011-12-01 22:15:07.501: [ DISKMON][14036:1108523328] dskm_slave_thrd_main3:
peer disconnected
2011-12-01 22:15:07.501: [ DISKMON][14036:1108523328] dskm_slave_thrd_main5:
client +ASM1/ASM/15374 disconnected, reid
cid=14e1b2b4de58ef1eff5487b58dccc906,icin=188142332,nmn=1,lnid=188142332,gid=7
,gin=1,gmn=0,umemid=0,opid=8,opsn=1,lvl=process hdr=0x       0

2011-12-01 22:15:08.440: [ CSSCLNT]clsssRecvMsg: got a disconnect from the
server while waiting for message type 1
2011-12-01 22:15:08.440: [ CSSCLNT]clssgsGroupGetStatus:  communications
failed (0/3/-1)

2011-12-01 22:15:08.440: [ CSSCLNT]clssgsGroupGetStatus: returning 8

2011-12-01 22:15:08.440: [ DISKMON][14036:1102219584] CRITICAL: Diskmon
exiting: dskm_rac_thrd_main10: Diskmon is shutting down due to CSSD ABORT
event
2011-12-01 22:15:08.440: [ DISKMON][14036:1102219584] SHUTDOWN FORCE due to
CSSD ABORT
2011-12-01 22:15:08.440: [ DISKMON][14036:1102219584] dskm_rac_thrd_main:
exiting
2011-12-01 22:15:08.754: [ DISKMON][14036:1104320832] dskm_slave_thrd_main5:
client orarootagent/13701 disconnected, reid
cid=DUMMY,icin=-1,nmn=-1,lnid=-1,gid=-1,gin=-1,gmn=-1,umemid=-1,opid=-1,opsn=-
1,lvl=process hdr=0xfece0100
2011-12-01 22:15:09.988: [ DISKMON][14036:1191118288] dskm_cleanup_thrds:
cleaning up the rac event handling thread tid 1102219584
[ DISKMON][13016]

I/O Fencing and SKGXP HA monitoring daemon -- Version 1.2.0.0
Process 13016 started on 2011-12-01 at 22:15:39.863

2011-12-01 22:15:39.867: [ DISKMON][13016] dskm main: starting up
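When triaging a diskmon.log like the one above, filtering for endpoint state changes and shutdown causes usually surfaces the timeline quickly. A sketch (the patterns match the 11.2 log lines shown here and may need adjusting for other releases):

```shell
# Pull heartbeat/endpoint state transitions and abort events out of the log.
grep -E '(Remote|Local) endpoint .* is (UP|DOWN)|state: +UNREACHABLE|CSSD ABORT' diskmon.log
```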

ocssd.log
==========
2011-12-01 22:15:04.223: [    CSSD][1127139648]clssgmmkLocalKillThread: Time
up. Timeout 60500 Start time 369099698 End time 369160198 Current time
369160198
2011-12-01 22:15:04.223: [    CSSD][1127139648]clssgmmkLocalKillResults:
Replying to kill request from remote node 2 kill id 1 Success map 0x00000000
Fail map 0x00000000
2011-12-01 22:15:04.224: [GIPCHAUP][1094015296] gipchaUpperProcessDisconnect:
processing DISCONNECT for hendp 0x2aa5550 [00000000000092e5] { gipchaEndpoint
: port 'nm2_gts-cluster/af9c-724c-2e3f-3946', peer
'gts1db02:205f-3cac-025e-c962', srcCid 00000000-000092e5,  dstCid
00000000-000009d9, numSend 0, maxSend 100, usrFlags 0x4000, flags 0x204 }
2011-12-01 22:15:04.224: [    CSSD][1122408768]clssnmeventhndlr:
Disconnecting endp 0x932d ninf 0x1c3a2c0
2011-12-01 22:15:04.224: [    CSSD][1122408768]clssnmDiscHelper: gts1db02,
node(2) connection failed, endp (0x932d), probe(0x3000000000), ninf->endp
0x932d
2011-12-01 22:15:04.224: [    CSSD][1122408768]clssnmDiscHelper: node 2 clean
up, endp (0x932d), init state 3, cur state 3
2011-12-01 22:15:04.224: [GIPCXCPT][1122408768] gipcInternalDissociate: obj
0x2e99290 [000000000000932d] { gipcEndpoint : localAddr
'gipcha://gts1db01:nm2_gts-cluster/af9c-724c-2e3f-394', remoteAddr
'gipcha://gts1db02:205f-3cac-025e-c96', numPend 0, numReady 0, numDone 0,
numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x13860e, usrFlags
0x0 } not associated with any container, ret gipcretFail (1)
2011-12-01 22:15:04.224: [GIPCXCPT][1122408768] gipcDissociateF
[clssnmDiscHelper : clssnm.c : 3284]: EXCEPTION[ ret gipcretFail (1) ]  
failed to dissociate obj 0x2e99290 [000000000000932d] { gipcEndpoint :
localAddr 'gipcha://gts1db01:nm2_gts-cluster/af9c-724c-2e3f-394', remoteAddr
'gipcha://gts1db02:205f-3cac-025e-c96', numPend 0, numReady 0, numDone 0,
numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x13860e, usrFlags
0x0 }, flags 0x0
2011-12-01 22:15:04.224: [GIPCXCPT][1122408768] gipcInternalDissociate: obj
0x2e99290 [000000000000932d] { gipcEndpoint : localAddr
'gipcha://gts1db01:nm2_gts-cluster/af9c-724c-2e3f-394', remoteAddr
'gipcha://gts1db02:205f-3cac-025e-c96', numPend 0, numReady 0, numDone 0,
numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x13860e, usrFlags
0x0 } not associated with any container, ret gipcretFail (1)
2011-12-01 22:15:04.224: [GIPCXCPT][1122408768] gipcDissociateF
[clssnmDiscHelper : clssnm.c : 3430]: EXCEPTION[ ret gipcretFail (1) ]  
failed to dissociate obj 0x2e99290 [000000000000932d] { gipcEndpoint :
localAddr 'gipcha://gts1db01:nm2_gts-cluster/af9c-724c-2e3f-394', remoteAddr
'gipcha://gts1db02:205f-3cac-025e-c96', numPend 0, numReady 0, numDone 0,
numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x13860e, usrFlags
0x0 }, flags 0x0
2011-12-01 22:15:04.224: [    CSSD][1122408768]clssnmDiscEndp: gipcDestroy
0x932d
2011-12-01 22:15:04.603: [    
CSSD][1104976192](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by
node gts1db02, number 2, sync 188142334, stamp 393990918
2011-12-01 22:15:04.603: [    
CSSD][1104976192]###################################
2011-12-01 22:15:04.603: [    CSSD][1104976192]clssscExit: CSSD aborting from
thread clssnmvKillBlockThread
2011-12-01 22:15:04.603: [    
CSSD][1104976192]###################################
2011-12-01 22:15:04.603: [    CSSD][1104976192](:CSSSC00012:)clssscExit: A
fatal error occurred and the CSS daemon is terminating abnormally
gts1db01, number 1, has experienced a failure in thread number 10 and is
shutting down
2011-12-01 22:15:04.603: [    CSSD][1104976192]clssscExit: Starting CRSD
cleanup

2011-12-01 22:15:04.737: [    CSSD][1103399232]clssgmDiscEndpcl: gipcDestroy
0xa2ea2f7
2011-12-01 22:15:04.925: [    
CSSD][1112947008](:CSSNM00058:)clssnmvDiskCheck: No I/O completions for
3472942430 ms for voting file o/192.168.10.5/DBFS_DG_CD_04_gts1cel03)
2011-12-01 22:15:04.925: [    CSSD][1112947008]clssnmvDiskAvailabilityChange:
voting file o/192.168.10.5/DBFS_DG_CD_04_gts1cel03 now offline
2011-12-01 22:15:04.925: [    
CSSD][1112947008](:CSSNM00058:)clssnmvDiskCheck: No I/O completions for
3472942450 ms for voting file o/192.168.10.4/DBFS_DG_CD_02_gts1cel02)
2011-12-01 22:15:04.925: [    CSSD][1112947008]clssnmvDiskAvailabilityChange:
voting file o/192.168.10.4/DBFS_DG_CD_02_gts1cel02 now offline
2011-12-01 22:15:04.925: [    
CSSD][1112947008](:CSSNM00058:)clssnmvDiskCheck: No I/O completions for
3472942480 ms for voting file o/192.168.10.3/DBFS_DG_CD_02_gts1cel01)
2011-12-01 22:15:04.926: [    CSSD][1112947008]clssnmvDiskAvailabilityChange:
voting file o/192.168.10.3/DBFS_DG_CD_02_gts1cel01 now offline
2011-12-01 22:15:04.926: [    
CSSD][1112947008](:CSSNM00018:)clssnmvDiskCheck: Aborting, 0 of 3 configured
voting disks available, need 2
2011-12-01 22:15:04.926: [    CSSD][1112947008]clssscExit: abort already set
1
2011-12-01 22:15:04.926: [   SKGFD][1109793088]Lib :OSS:: closing handle
0x2538e70 for disk :o/192.168.10.5/DBFS_DG_CD_04_gts1cel03:

2011-12-01 22:15:04.926: [   SKGFD][1098676544]Lib :OSS:: closing handle
0x2aaaac0d7cb0 for disk :o/192.168.10.3/DBFS_DG_CD_02_gts1cel01:

Heartbeat timeout logic may fail to detect dead cells if diskmon
has been running for over 40 days.
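Notably, the "No I/O completions" durations reported in the ocssd.log above (3472942430 ms and similar) work out to roughly 40 days, which lines up with this bug's trigger window. This is an observation from the log values, not a statement from the bug note:

```shell
# Convert the logged millisecond duration to days.
awk 'BEGIN { printf "%.1f days\n", 3472942430 / 86400000 }'   # prints 40.2 days
```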

Rediscovery Notes:
 If diskmon has been running for over 40 days and DB processes start to hang
 after a cell death, you may have hit this bug.

All nodes may hang because one of the heartbeat threads in diskmon
gets stuck trying to notify the instance(s) that it has reconnected
to the cell. However, when this occurs, insufficient diagnostic
data is collected to confirm why the hang occurred.
This fix is a diagnostic enhancement for this scenario.

If diskmon/DSKM processes are hung/stuck this fix may help collect
additional useful diagnostics.

PROBLEM:
--------
Diskmon logs fill up quickly, causing their disk/volume (/u00) to become full

DIAGNOSTIC ANALYSIS:
--------------------
1)  The following are consistently and repeatedly logged:

2010-12-17 03:22:25.848: [ DISKMON][17796:1089268032] dskm_rac_ini13: calling
clssgsqgrp
2010-12-17 03:22:25.849: [ DISKMON][17796:1089268032] dskm_rac_ini80: called
clssgsqgrp:
2010-12-17 03:22:25.849: [ DISKMON][17796:1089268032] dskm_dump_group_priv:
vers: 0 flags: 0x0 confInc: 0 My confInc: 0
2010-12-17 03:22:25.849: [ DISKMON][17796:1089268032] dskm_dump_group_priv:
CSS Msg Hdr: vers: 0 type: UNKNOWN (0) chunks: NO MORE CHUNKS (0) transport:
UNKNOWN (0) mSize: 0
2010-12-17 03:22:25.849: [ DISKMON][17796:1089268032] dskm_dump_group_priv:
Group Private Data is not of type DSKM_MSG_SS_REQ. Not proceeding with msg
dump
2010-12-17 03:22:25.849: [ DISKMON][17796:1089268032] dskm_rac_ini15: Found
my member number 1 to be busy. Waiting (attempts: 598) for OCSSD to clean up
previous incarnation

2) Core files are generated, and the many stack dumps of diskmon further
enlarge the diskmon.log

   The following is frequently seen in the diskmon.log

2010-12-17 03:22:28.855: [ DISKMON][17796:1089268032] dskm_rac_ini16: OCSSD
has notified that another diskmon is currently running in this node.
This might be a duplicate startup. If not consult OCSSD log for additional
information.
2010-12-17 03:22:28.855: [ DISKMON][17796] INCIDENT : EXCEPTION (SIGNAL: 6)
in [gsignal()]
2010-12-17 03:22:28.855: [ DISKMON][17796] Thread 1089268032 got exception 6
2010-12-17 03:22:28.855: [ DISKMON][17796] Stack dump for thread 1089268032
[ DISKMON][17796]
....
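When diskmon.log plus its core files and stack dumps are flooding a volume, measuring and then truncating them is the immediate mitigation; the root cause (the stuck member-number cleanup) still needs to be resolved on the OCSSD side. A housekeeping sketch, assuming a standard 11.2 Grid home layout with GRID_HOME set:

```shell
# The directory layout below is an assumption about a standard Grid home.
d="$GRID_HOME/log/$(hostname -s)/diskmon"

du -sh "$d" 2>/dev/null || true    # total space used by diskmon logs/dumps
ls -lS "$d" 2>/dev/null | head     # largest files first

# Truncate the live log in place so diskmon's open file handle stays valid:
#   : > "$d/diskmon.log"
```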