Troubleshooting 11.2 Clusterware Node Evictions (Reboots) (Doc ID 1050693.1)

本文提供了解决Oracle数据库集群节点驱逐问题的参考指南,包括关键进程角色、故障排查步骤、常见原因及故障收集文件。适用于Oracle数据库企业版11.2.0.1到11.2.0.2版本。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

In this Document

 Purpose
 Scope
 Details
 NODE EVICTION OVERVIEW
 1.0 - PROCESS ROLES FOR REBOOTS
 2.0 - DETERMINING WHICH PROCESS IS RESPONSIBLE FOR A REBOOT
 3.0 - TROUBLESHOOTING OCSSD EVICTIONS
 3.1 - COMMON CAUSES OF OCSSD EVICTIONS
 3.2 - FILES TO REVIEW AND GATHER FOR OCSSD EVICTIONS
 4.0 - TROUBLESHOOTING CSSDAGENT OR CSSDMONITOR EVICTIONS
 4.1 - COMMON CAUSES OF CSSDAGENT OR CSSDMONITOR EVICTIONS
 4.2 - FILES TO REVIEW AND GATHER FOR CSSDAGENT OR CSSDMONITOR EVICTIONS
 References

APPLIES TO:

Oracle Database - Enterprise Edition - Version 11.2.0.1 to 11.2.0.2 [Release 11.2]
Information in this document applies to any platform.

PURPOSE

This document is to provide a reference for troubleshooting 11.2 Clusterware node evictions.  For clusterware node evictions prior to 11.2, see Note: 265769.1

SCOPE

This document is intended for DBA's and support analysts experiencing clusterware node evictions (reboots).

DETAILS

NODE EVICTION OVERVIEW

The Oracle Clusterware is designed to perform a node eviction by removing one or more nodes from the cluster if some critical problem is detected.  A critical problem could be a node not responding via a network heartbeat, a node not responding via a disk heartbeat, a hung or severely degraded machine, or a hung ocssd.bin process.  The purpose of this node eviction is to maintain the overall health of the cluster by removing bad members.  

Starting in 11.2.0.2 RAC (or if you are on Exadata), a node eviction may not actually reboot the machine.  This is called a rebootless restart.  In this case we restart most of the clusterware stack to see if that fixes the unhealthy node. 

1.0 - PROCESS ROLES FOR REBOOTS


OCSSD (aka CSS daemon) - This process is spawned by the cssdagent process. It runs in both
vendor clusterware and non-vendor clusterware environments.  OCSSD's primary job is internode health monitoring and RDBMS instance endpoint discovery. The health monitoring includes a network heartbeat and a disk heartbeat (to the voting files).  OCSSD can also evict a node after escalation of a member kill from a client (such as a database LMON process). This is a multi-threaded process that runs at an elevated priority and runs as the Oracle user.

Startup sequence: INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdagent --> ocssd --> ocssd.bin

CSSDAGENT - This process is spawned by OHASD and is responsible for spawning the OCSSD process, monitoring for node hangs (via oprocd functionality), and monitoring to the OCSSD process for hangs (via oclsomon functionality), and monitoring vendor clusterware (via vmon functionality).  This is a multi-threaded process that runs at an elevated priority and runs as the root user.

Startup sequence: INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdagent

CSSDMONITOR - This proccess also monitors for node hangs (via oprocd functionality), monitors the OCSSD process for hangs (via oclsomon functionality), and monitors vendor clusterware (via vmon functionality). This is a multi-threaded process that runs at an elevated priority and runs as the root user.

Startup sequence: INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdmonitor

2.0 - DETERMINING WHICH PROCESS IS RESPONSIBLE FOR A REBOOT

Important files to review:

  • Clusterware alert log in <GRID_HOME>/log/<nodename>
  • The cssdagent log(s) in <GRID_HOME>/log/<nodename>/agent/ohasd/oracssdagent_root
  • The cssdmonitor log(s) in <GRID_HOME>/log/<nodename>/agent/ohasd/oracssdmonitor_root
  • The ocssd log(s) in <GRID_HOME>/log/<nodename>/cssd
  • The lastgasp log(s) in /etc/oracle/lastgasp or /var/opt/oracle/lastgasp
  • IPD/OS or OS Watcher data
  • 'opatch lsinventory -detail' output for the GRID home
  • *Messages files:

* Messages file locations:

  • Linux: /var/log/messages
  • Sun: /var/adm/messages
  • HP-UX: /var/adm/syslog/syslog.log
  • IBM: /bin/errpt -a > messages.out
Please refer to the following document which provides information on collecting together most of the above files:

      Document 1513912.1 - TFA Collector - Tool for Enhanced Diagnostic Gathering


11.2 Clusterware evictions should, in most cases, have some kind of meaningful error in the clusterware alert log.  This can be used to determine which process is responsible for the reboot.  Example message from a clusterware alert log:

[ohasd(11243)]CRS-8011:reboot advisory message from host: sta00129, component:  cssagent, with timestamp: L-2009-05-05-10:03:25.340
[ohasd(11243)]CRS-8013:reboot advisory message text: Rebooting after limit 28500 exceeded; disk timeout 27630, network timeout 28500, last heartbeat from CSSD at epoch seconds 1241543005.340, 4294967295 milliseconds ago based on invariant clock value of 93235653


This particular eviction happened when we had hit the network timeout.  CSSD exited and the cssdagent took action to evict. The cssdagent knows the information in the error message from local heartbeats made from CSSD.  

If no message is in the evicted node's clusterware alert log, check the lastgasp logs on the local node and/or the clusterware alert logs of other nodes. 

3.0 - TROUBLESHOOTING OCSSD EVICTIONS


If you have encountered an OCSSD eviction review common causes in section 3.1 below. 

3.1 - COMMON CAUSES OF OCSSD EVICTIONS

  • Network failure or latency between nodes. It would take 30 consecutive missed checkins (by default - determined by the CSS misscount) to cause a node eviction. 
  • Problems writing to or reading from the CSS voting disk.  If the node cannot perform a disk heartbeat to the majority of its voting files, then the node will be evicted.
  • A member kill escalation.  For example, database LMON process may request CSS to remove an instance from the cluster via the instance eviction mechanism.  If this times out it could escalate to a node kill. 
  • An unexpected failure or hang of the OCSSD process, this can be caused by any of the above issues or something else.
  • An Oracle bug.

3.2 - FILES TO REVIEW AND GATHER FOR OCSSD EVICTIONS


All files from section 2.0 from all cluster nodes.  More data may be required.

Example of an eviction due to loss of voting disk:

CSS log:

2012-03-27 22:05:48.693: [ CSSD][1100548416](:CSSNM00018:)clssnmvDiskCheck: Aborting, 0 of 3 configured voting disks available, need 2
2012-03-27 22:05:48.693: [ CSSD][1100548416]###################################
2012-03-27 22:05:48.693: [ CSSD][1100548416]clssscExit: CSSD aborting from thread clssnmvDiskPingMonitorThread



OS messages:

Mar 27 22:03:58 choldbr132p kernel: Error:Mpx:All paths to Symm 000190104720 vol 0c71 are dead.
Mar 27 22:03:58 choldbr132p kernel: Error:Mpx:Symm 000190104720 vol 0c71 is dead.
Mar 27 22:03:58 choldbr132p kernel: Buffer I/O error on device sdbig, logical block 0
...

 

4.0 - TROUBLESHOOTING CSSDAGENT OR CSSDMONITOR EVICTIONS


If you have encountered a CSSDAGENT or CSSDMONITOR eviction review common causes in section 4.1 below.

4.1 - COMMON CAUSES OF CSSDAGENT OR CSSDMONITOR EVICTIONS

  • An OS scheduler problem.  For example, if the OS is getting locked up in a driver or hardware or there is excessive amounts of load on the machine (at or near 100% cpu utilization), thus preventing the scheduler from behaving reasonably.
  • A thread(s) within the CSS daemon hung.
  • An Oracle bug.

4.2 - FILES TO REVIEW AND GATHER FOR CSSDAGENT OR CSSDMONITOR EVICTIONS


All files from section 2.0 from all cluster nodes. More data may be required.

 

Database - RAC/Scalability Community
To discuss this topic further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle SupportDatabase - RAC/Scalability Community

REFERENCES

NOTE:265769.1  - Troubleshooting 10g and 11.1 Clusterware Reboots
NOTE:301137.1  - OSWatcher Black Box (Includes: [Video])


NOTE:1053147.1  - 11gR2 Clusterware and Grid Home - What You Need to Know
NOTE:736752.1  - Introducing Cluster Health Monitor (IPD/OS)
NOTE:1513912.1  - TFA Collector - Tool for Enhanced Diagnostic Gathering 
内容概要:本文档是关于基于Tecnomatix的废旧智能手机拆解产线建模与虚拟调试的毕业设计任务书。研究内容主要包括:分析废旧智能手机拆解工艺流程;学习并使用Tecnomatix软件搭建拆解产线的三维模型,包括设备、输送装置等;进行虚拟调试以模拟各种故障情况,并对结果进行分析提出优化建议。研究周期为16周,涵盖了文献调研、拆解实验、软件学习、建模、调试和论文撰写等阶段。文中还提供了Python代码来模拟部分关键流程,如拆解顺序分析、产线布局设计、虚拟调试过程、故障模拟与分析等,并实现了结果的可视化展示。 适合人群:本任务书适用于机械工程、工业自动化及相关专业的本科毕业生,尤其是那些对智能制造、生产线优化及虚拟调试感兴趣的学生。 使用场景及目标:①帮助学生掌握Tecnomatix软件的应用技能;②通过实际项目锻炼学生的系统建模和虚拟调试能力;③培养学生解决复杂工程问题的能力,提高其对废旧电子产品回收再利用的认识和技术水平;④为后续的研究或工作打下坚实的基础,比如从事智能工厂规划、生产线设计与优化等工作。 其他说明:虽然文中提供了部分Python代码用于模拟关键流程,但完整的产线建模仍需借助Tecnomatix商业软件完成。此外,为了更好地理解和应用这些内容,建议学生具备一定的编程基础(如Python),并熟悉相关领域的基础知识。
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值