Top 5 Issues That Cause Node Reboots or Unexpected Recycle of CRS [ID 1367153.1]
Modified: 26-OCT-2011    Type: BULLETIN    Status: PUBLISHED
In this Document
Purpose
Scope and Application
Top 5 Issues That Cause Node Reboots or Unexpected Recycle of CRS
Issue #1: The node rebooted, but the log files do not show any error or cause.
Issue #2: The node rebooted because it was evicted due to missing network heartbeats.
Issue #3: The node rebooted after a problem with storage.
Issue #4: The node rebooted after an ASM or database instance hung or was evicted.
Issue #5: CRS recycled automatically, but the node did not reboot
References
Applies to:
Oracle Server - Enterprise Edition - Version: 10.1.0.2 to 11.2.0.3 - Release: 10.1 to 11.2
Information in this document applies to any platform.
Purpose
This note is a short summary of the top issues that cause node reboots or an unexpected recycle of CRS.
Scope and Application
All users with node reboot issues.
Top 5 Issues That Cause Node Reboots or Unexpected Recycle of CRS
Issue #1: The node rebooted, but the log files do not show any error or cause.
Cause:
If the node was rebooted by one of the Oracle processes but the log files show no error or cause, the likely culprit is one of the oprocd, cssdmonitor, or cssdagent processes. This happens when the node hangs for a while or one or more critical CRS processes cannot get scheduled on a CPU. Because those processes run in real time, they are unlikely to be starved of CPU; the problem is more likely memory starvation or low free memory. Typically the kernel was swapping pages heavily or was busy scanning memory to identify pages to free up. An OS scheduling bug is also possible.
Solution:
1) Set diagwait to 13 if the CRS version is 11.1 or lower.
2) If the platform is AIX, tune the AIX VM parameters as suggested in note xxx.
3) If the platform is Linux, set up hugepages and set the kernel parameters vm.min_free_kbytes to reserve 512MB and vm.swappiness to 100 (see the sketch after this list).
Note that memory_target cannot be set when using hugepages.
4) Check whether a large amount of memory is allocated to the IO buffer cache. Ask the OS vendor for ways to reduce the size of the IO buffer cache or to increase the rate at which memory is reclaimed from it.
5) Increase the amount of physical memory.
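A minimal sketch of steps 1 and 3, assuming a Linux node with 2MB hugepages; the hugepages count below is a placeholder that must be sized to fit the SGA:

    # Step 1 (CRS 11.1 or lower): run as root, with the CRS stack down on all nodes
    $CRS_HOME/bin/crsctl set css diagwait 13 -force

    # Step 3 (Linux): append to /etc/sysctl.conf, then apply with "sysctl -p"
    vm.min_free_kbytes = 524288    # reserve 512MB (524288 KB) of free memory
    vm.swappiness = 100
    vm.nr_hugepages = 2048         # placeholder: 2048 x 2MB = 4GB; size to the SGA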
Issue #2: The node rebooted because it was evicted due to missing network heartbeats.
This is due to missing network heartbeats or a split-brain condition. In a two-node environment, repeated reboots of node 2 normally mean that node 2 is being evicted because of split brain. The ocssd.log shows missing network heartbeats or a split-brain message before the node is rebooted.
Cause: The network communication between the nodes over the private interconnect failed. The failure can be uni-directional or bi-directional.
Solution: Fix the network problem. Make sure that all network components, such as the switch and NIC cards, are working. Make sure that ssh works over the private interconnect (see the sketch below). Note that the network often works again after the node is rebooted.
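A minimal connectivity check, assuming the private interconnect hostnames are node1-priv and node2-priv and the private NIC is eth1 (all placeholders; substitute your own):

    # Run from node 1; repeat in the opposite direction from node 2
    ping -c 3 node2-priv              # basic reachability over the interconnect
    ssh node2-priv hostname           # confirms ssh works over the private network
    /sbin/ifconfig eth1               # look for errors/drops on the private NIC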
Issue #3: The node rebooted after a problem with storage.
The ocssd.log file shows that the node rebooted because it could not access a majority of the voting disks.
Cause: CRS must be able to access a majority of the voting disks. If it cannot, CRS cannot ensure the integrity of the cluster, so it reboots the node.
Solution: Fix the problem with the voting disks. Make sure the voting disks are available and accessible by the oracle or grid user, or whichever user owns the CRS or GI home. If a voting disk is not in ASM, use "dd if=<voting disk> of=/dev/null bs=1024 count=10240" to test its accessibility (see the sketch below).
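A minimal accessibility test, assuming a non-ASM voting disk at /dev/raw/raw1 (a placeholder path; substitute the path reported by crsctl):

    # List the configured voting disks
    $GRID_HOME/bin/crsctl query css votedisk

    # Test read access as the user who owns the CRS or GI home
    dd if=/dev/raw/raw1 of=/dev/null bs=1024 count=10240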
Issue #4: The node rebooted after an ASM or database instance hung or was evicted.
The ocssd.log of the surviving node shows a member kill request escalated to a node kill request.
Cause: Starting with 11.1, if a database or ASM instance cannot be evicted at the database level, CRS gets involved and tries to kill the problem instance; this is a member kill request. If CRS cannot kill the problem instance, the member kill request is escalated to a node kill request and CRS reboots the node.
Solution: Find out why the ASM or database instance could not be evicted at the database level (an eviction initiated by lmon, lmd, or lms). One common cause is that the instance was hanging and did not respond to the remote instance's request to die. Another is that one or more instance processes could not be killed, for example because a process is in uninterruptible IO sleep (see the sketch below).
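A minimal check for unkillable processes, assuming Linux; processes in uninterruptible sleep show state "D" and ignore all signals, including kill -9:

    # List processes stuck in uninterruptible sleep, with the kernel function they wait in
    ps -eo pid,stat,wchan:30,cmd | awk 'NR==1 || $2 ~ /^D/'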
Issue #5: CRS recycled automatically, but the node did not reboot
Cause: Starting with 11.2.0.2, if CRS needs to reboot the node for any of the reasons listed here, CRS first tries to recycle itself before rebooting the node. Only when it cannot recycle itself successfully does CRS reboot the node in order to recycle itself forcibly.
Solution: Check which of the reasons for a node reboot listed here applies and follow the corresponding solution. The clusterware alert log records the restart (see the sketch below).
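A minimal way to confirm that CRS recycled without a node reboot, assuming the 11.2 default log location and the hostname racnode1 (placeholders):

    # The clusterware alert log records stack restarts
    tail -50 $GRID_HOME/log/racnode1/alertracnode1.log

    # If the OS uptime predates the CRS restart, the node itself did not reboot
    uptime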
References
NOTE:1050693.1 - Troubleshooting 11.2 Clusterware Node Evictions (Reboots)
NOTE:265769.1 - Troubleshooting 10g and 11.1 Clusterware Reboots
NOTE:452326.1 - Linux Kernel Lowmem Pressure Issues and Kernel Structures