oprocd failures

Reposted.

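Like the Linux hangcheck-timer, oprocd has a probing mechanism: it checks the state of the node at regular intervals to guard against problematic I/O. If oprocd does not return a result in time, it triggers an OS reboot.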

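However, when the system's CPU load is high, oprocd may fail to return a result in time, and the node then reboots for no obvious reason. Because oprocd's default response time is short (0.5 s), the system often reboots before it has had a chance to write any log messages, which can make it very hard to determine what caused the reboot. In that situation we need to change oprocd's response time and make it longer.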
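The effect of the response time can be illustrated with a little arithmetic. The sketch below is only an illustration, not oprocd's actual logic: the function name `would_reboot`, the sample delays, and the 5000 ms alternative are all hypothetical values chosen for the example. It lists which wake-up delays would exceed a given margin and therefore count as a hang.

```python
def would_reboot(delays_ms, margin_ms):
    """Return the wake-up delays (in ms) that exceed the allowed margin."""
    return [d for d in delays_ms if d > margin_ms]


# Hypothetical wake-up delays seen under heavy CPU load, in milliseconds.
observed = [20, 480, 750, 1200, 90]

# Delays that a 0.5 s margin (the default mentioned above) would not tolerate.
print(would_reboot(observed, 500))

# The same delays checked against a longer, purely illustrative margin.
print(would_reboot(observed, 5000))
```

With a short margin, delays caused only by a busy scheduler are indistinguishable from a real freeze, which is why lengthening the response time reduces spurious reboots.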

First, let's look at the official description of oprocd:

OPROCD is a process monitor that runs on hardware platforms supporting other third-party cluster managers and is present only on hardware platforms other than Linux. Its function is to create threads for the various processors on the system and to check if the processors are hanging. Every second, the OPROCD thread wakes up and checks the processors on the system, and then goes to sleep for about 500 ms and tries again. If it does not receive any response after n seconds, it reboots the node. On Linux environments, the hangcheck timer module performs the same work that OPROCD does on other hardware platforms.

Next, let's see how others describe oprocd and how its response-time delay is defined and set:

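Oracle Clusterware 10.2.0.4 and later on Linux introduce a new process, the Oracle Clusterware Process Monitor Daemon (OPROCD), to monitor the state of the system and the health of each node in the cluster, just as is already done on UNIX systems that do not use third-party cluster software. Let's take a closer look at what OPROCD actually is.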

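In 10.2.0.4 on Linux, OPROCD runs alongside hangcheck-timer, but it has no connection to or dependency on the hangcheck-timer module. It is spawned by the init.cssd process and runs as root. The OPROCD process is locked in memory and monitors the node it runs on, detecting hardware or driver freezes and providing I/O fencing (which is different from the fencing that SCSI provides). If a machine stays frozen long enough to be evicted from the cluster, it has to force itself to reboot. Otherwise, after the cluster has reorganized the lock resources of the failed node, that node could still issue stale I/O against the shared data files. To provide this protection, OPROCD performs its check and then stops running (sleeps); if it is not woken up within the expected time, OPROCD reboots the local node.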
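The check, sleep, and late-wake-up cycle described above can be sketched roughly as follows. This is a minimal illustration of the general technique, not Oracle's implementation: the function `watchdog`, its parameters, and the `on_hang` callback are hypothetical names, and the sketch only reports the hang where the real daemon would reboot the node.

```python
import time


def watchdog(interval_s, margin_s, cycles,
             clock=time.monotonic, sleep=time.sleep, on_hang=None):
    """Sleep for interval_s per cycle and measure how late each wake-up is.

    Returns the lateness (in seconds) that exceeded margin_s, or None if
    every cycle woke up within the margin.
    """
    for _ in range(cycles):
        start = clock()
        sleep(interval_s)
        # Time spent beyond the requested sleep; a frozen machine wakes late.
        late_by = clock() - start - interval_s
        if late_by > margin_s:
            if on_hang is not None:
                on_hang(late_by)  # the real daemon would reboot the node here
            return late_by
    return None


# On a healthy machine, short sleeps should wake up well within the margin.
print(watchdog(0.01, 0.5, cycles=3))
```

The clock and sleep functions are passed in as parameters so that a freeze can be simulated without actually stalling the machine, for example by substituting a fake sleep that advances a fake clock by more than the requested interval.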