Oracle host generates a large number of defunct processes


weixin_40410849


Article link: https://blog.youkuaiyun.com/weixin_40410849/article/details/79642836


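This article examines database connection failures caused by exhausted OS resources, analyzes the causes of ORA-27300 and related errors, and gives concrete steps for investigating zombie processes and system limits.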


=====DB update====
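1) The content I pasted for you comes from the ps output at 03:00:32. These non-database processes were in the defunct state at the same moment as the database processes, from the very first sample.
It is not the case that the database processes went defunct first and other defunct processes appeared afterwards.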

filename=cnsz081498_ps_18.03.19.0300.dat

zzz ***Mon Mar 19 03:00:32 CST 2018

osoracle 10006 1 19 2.5 0.0 0 0 exit Z 03:00:29 00:00:00 [bash]
palog 5039 1 19 0.0 0.0 0 0 exit Z 03:00:07 00:00:00 [sshd]
grid 3701 1 19 0.0 0.0 0 0 exit Z 03:00:00 00:00:00 [asm_m000_+asm4]
osoracle 10408 1 19 2.5 0.0 0 0 exit Z 03:00:29 00:00:00 [bash]
osoracle 10403 1 19 0.0 0.0 0 0 exit Z 03:00:29 00:00:00 [flock]
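
To confirm that zombies appear for several OS users at once, the ps snapshot can be tallied per user. The following is a minimal sketch, not part of the original analysis. It assumes the OSWatcher ps line layout is USER PID PPID PRI %CPU %MEM VSZ RSS WCHAN S START TIME COMMAND, which is inferred from the sample lines above; the function name `count_zombies_by_user` is hypothetical.

```python
from collections import Counter


def count_zombies_by_user(ps_lines):
    """Count processes in state Z per user in OSWatcher-style ps lines."""
    counts = Counter()
    for line in ps_lines:
        fields = line.split()
        # Assumed layout (inferred from the samples):
        # USER PID PPID PRI %CPU %MEM VSZ RSS WCHAN S START TIME COMMAND
        if len(fields) >= 13 and fields[9] == "Z":
            counts[fields[0]] += 1
    return counts


# Lines copied from the 03:00:32 snapshot quoted above.
sample = """\
osoracle 10006 1 19 2.5 0.0 0 0 exit Z 03:00:29 00:00:00 [bash]
palog 5039 1 19 0.0 0.0 0 0 exit Z 03:00:07 00:00:00 [sshd]
grid 3701 1 19 0.0 0.0 0 0 exit Z 03:00:00 00:00:00 [asm_m000_+asm4]
osoracle 10408 1 19 2.5 0.0 0 0 exit Z 03:00:29 00:00:00 [bash]
osoracle 10403 1 19 0.0 0.0 0 0 exit Z 03:00:29 00:00:00 [flock]
""".splitlines()

counts = count_zombies_by_user(sample)
for user, n in counts.most_common():
    print(user, n)
```

Running the same function over consecutive snapshot files would show how quickly the per-user zombie count grows.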

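2) Regarding your question: "At 4:00 OS commands all worked normally; only the database could not be connected to."

The database connections failed because the OS ran short of resources and reported ORA-27300: OS system dependent operation:fork failed with status: 11: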

Mon Mar 19 04:06:15 2018
Process startup failed, error stack:
Mon Mar 19 04:06:15 2018
Errors in file /test1/app/oracle/rdbms/diag/rdbms/eginv/eginv/trace/eginv_psp0_16375.trc:
ORA-27300: OS system dependent operation:fork failed with status: 11
ORA-27301: OS failure message: Resource temporarily unavailable
ORA-27302: failure occurred at: skgpspawn3
Mon Mar 19 04:06:16 2018
Process m000 died, see its trace file
Mon Mar 19 04:06:16 2018
Process startup failed, error stack:
Mon Mar 19 04:06:16 2018
Errors in file /test1/app/oracle/rdbms/diag/rdbms/eginv/eginv/trace/eginv_psp0_16375.trc:
ORA-27300: OS system dependent operation:fork failed with status: 11
ORA-27301: OS failure message: Resource temporarily unavailable
ORA-27302: failure occurred at: skgpspawn3
Process m001 died, see its trace file

The ORA-27300, ORA-27301 and ORA-27302 errors and STATUS 11 are described in detail in the following document; they are OS system call errors:

Troubleshooting ORA-27300 ORA-27301 ORA-27302 Errors ( Doc ID 579365.1 )

STATUS 11 - EAGAIN No more processes
Executing a fork and the system’s process table is full, or the user is not allowed to create more process.

<=======system’s process table is full, or the user is not allowed to create more process.

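It is expected that OS commands still run while the database cannot be connected to, because the actual problem is only that the database user osoracle can no longer fork new processes.
OS resources themselves are idle: OSW shows that plenty of CPU and memory are free.
It is also expected that GI commands run normally, because running those commands does not require creating a new process, whereas connecting to the database requires creating a new database server process.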
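
As an illustration of the error mapping above, the following sketch translates an OS status code into its errno name and message, and reads the RLIMIT_NPROC limit of the current process. It uses only the Python standard library; the helper names `describe_errno` and `nproc_limit` are hypothetical, and the value 11 corresponds to EAGAIN on Linux specifically.

```python
import errno
import os
import resource


def describe_errno(code):
    """Return (symbolic name, OS message) for an errno value."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def nproc_limit():
    """Return the (soft, hard) RLIMIT_NPROC of the current process."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


# On Linux, errno.EAGAIN is 11, the status reported with ORA-27300.
name, message = describe_errno(errno.EAGAIN)
print(errno.EAGAIN, name, message)

soft, hard = nproc_limit()
# resource.RLIM_INFINITY means "unlimited".
print("nproc soft:", soft, "hard:", hard)
```

Run as the affected OS user from a freshly started shell, this shows the limit that new sessions receive, which may differ from the limit held by long-running processes.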

3) Regarding the irpkm2 database: first of all, it also produced defunct processes from the very first moment:

filename=cnsz081498_ps_18.03.19.0300.dat
zzz ***Mon Mar 19 03:00:32 CST 2018

osoracle 8640 1 19 0.6 0.0 0 0 exit Z 03:00:28 00:00:00 [oracle_8640_irp]
osoracle 8504 1 19 0.6 0.0 0 0 exit Z 03:00:28 00:00:00 [oracle_8504_irp]
osoracle 8380 1 19 0.3 0.0 0 0 exit Z 03:00:28 00:00:00 [oracle_8380_irp]
osoracle 5920 1 19 0.2 0.0 0 0 exit Z 03:00:22 00:00:00 [oracle_5920_irp]
osoracle 5914 1 19 1.8 0.0 0 0 exit Z 03:00:21 00:00:00 [oracle_5914_irp]
osoracle 5847 1 19 0.1 0.0 0 0 exit Z 03:00:21 00:00:00 [oracle_5847_irp]
osoracle 5836 1 19 0.2 0.0 0 0 exit Z 03:00:21 00:00:00 [oracle_5836_irp]
osoracle 5824 1 19 0.4 0.0 0 0 exit Z 03:00:20 00:00:00 [oracle_5824_irp]
osoracle 5820 1 19 0.1 0.0 0 0 exit Z 03:00:20 00:00:00 [oracle_5820_irp]
osoracle 4637 1 19 0.0 0.0 0 0 exit Z 03:00:05 00:00:00 [oracle_4637_irp]
osoracle 4615 1 19 1.3 0.0 0 0 exit Z 03:00:05 00:00:00 [oracle_4615_irp]
osoracle 4420 1 19 0.1 0.0 0 0 exit Z 03:00:02 00:00:00 [oracle_4420_irp]
osoracle 4409 1 19 0.0 0.0 0 0 exit Z 03:00:02 00:00:00 [oracle_4409_irp]
osoracle 4407 1 19 0.0 0.0 0 0 exit Z 03:00:02 00:00:00 [oracle_4407_irp]
osoracle 4405 1 19 0.0 0.0 0 0 exit Z 03:00:02 00:00:00 [ora_m000_irpkm2]
osoracle 4383 1 19 0.0 0.0 0 0 exit Z 03:00:01 00:00:00 [oracle_4383_irp]
osoracle 3811 1 19 0.0 0.0 0 0 exit Z 03:00:00 00:00:00 [oracle_3811_irp]
osoracle 11498 1 19 4.0 0.0 0 0 exit Z 03:00:30 00:00:00 [ora_m000_irpkm2]
osoracle 10334 1 19 27.0 0.0 0 0 exit Z 03:00:29 00:00:00 [ora_m001_irpkm2]
osoracle 10301 1 19 1.0 0.0 0 0 exit Z 03:00:29 00:00:00 [ora_m000_irpkm2]

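However, it could still be connected to at 4:06, because the irpkm2 listener was restarted on December 11, whereas the other listeners that had problems were started on July 6.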

osoracle 29815 1 19 0.0 0.0 102676 13516 ep_pol S Jul 06 00:38:01 /test1/app/oracle/rdbms/12c/12.1.0.2.160119/bin/tnslsnr phwww -inherit
osoracle 29013 1 19 0.0 0.0 103184 14084 ep_pol S Dec 11 00:24:53 /test1/app/oracle/rdbms/12c/12.1.0.2.160119/bin/tnslsnr irpkm2 -inherit
osoracle 17773 1 19 0.3 0.0 103492 14248 ep_pol S Jul 06 18:45:58 /test1/app/oracle/rdbms/12c/12.1.0.2.160119/bin/tnslsnr pens2 -inherit
osoracle 17689 1 19 0.0 0.0 104992 15796 ep_pol S Jul 06 02:59:44 /test1/app/oracle/rdbms/12c/12.1.0.2.160119/bin/tnslsnr invest -inherit
osoracle 17214 1 19 0.0 0.0 103064 13888 ep_pol S Jul 06 00:40:27 /test1/app/oracle/rdbms/12c/12.1.0.2.160119/bin/tnslsnr rsas -inherit
osoracle 17205 1 19 0.0 0.0 102964 13776 ep_pol S Jul 06 01:02:28 /test1/app/oracle/rdbms/12c/12.1.0.2.160119/bin/tnslsnr omcp -inherit
osoracle 16471 1 19 0.0 0.0 104164 14864 ep_pol S Jul 06 00:34:04 /test1/app/oracle/rdbms/12c/12.1.0.2.160119/bin/tnslsnr pare -inherit
osoracle 16371 1 19 0.0 0.0 104952 15732 ep_pol S Jul 06 02:20:29 /test1/app/oracle/rdbms/12c/12.1.0.2.160119/bin/tnslsnr eginv -inherit
grid 29703 1 19 0.0 0.0 168068 13904 ep_pol S Jul 06 00:06:31 /oracle_grid/12.1.0/grid/bin/tnslsnr LISTENER_SCAN1 -no_crs_notify -inherit
grid 25070 1 19 0.0 0.0 168044 13676 ep_pol S Jul 06 00:06:43 /oracle_grid/12.1.0/grid/bin/tnslsnr LISTENER -no_crs_notify -inherit

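This is probably because you changed the nproc limit for the osoracle user at some point between July 6 and December 11.

The /etc/security/limits.conf you uploaded shows that nproc is already set to 655360, yet at the OS level no more osoracle processes could be created once the osoracle process count reached a little over 15,000.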
* soft nproc 655360
* hard nproc 655360
osoracle soft nproc 655360
osoracle hard nproc 655360

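This value was changed from 15000 to 655360 at some point between July 6 and December 11. A listener process started on July 6, however, inherited the limit in effect at that time, which is recorded in /proc/<pid>/limits.
Processes forked by that listener are subject to that recorded limit rather than to the new value of 655360.
A listener restarted after the limits were changed, and the new processes it forks, are subject to the new limits.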
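
The limit a running listener actually holds can be checked by reading its limits file under /proc. The sketch below parses the "Max processes" row; the function names are hypothetical, and the sample text is an illustrative excerpt in the Linux /proc limits format, not output captured from the affected host.

```python
def parse_max_processes(limits_text):
    """Return (soft, hard) from the 'Max processes' row, or None if absent.

    Values are kept as strings because they may be 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max processes"):
            soft, hard = line.split()[2:4]
            return soft, hard
    return None


def read_max_processes(pid):
    """Read the nproc limit held by a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as f:
        return parse_max_processes(f.read())


# Illustrative excerpt only; the numbers are hypothetical.
sample_limits = """\
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max processes             15000                15000                processes
Max open files            65536                65536                files
"""

limits = parse_max_processes(sample_limits)
print(limits)
```

Calling `read_max_processes` with the PID of each tnslsnr process would show which listeners still carry the old value and therefore need a restart.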

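4) We have actually been discussing two problems: defunct processes appearing from 3:00, and the database becoming unreachable from 4:06.
The 4:06 connection problem was caused by the accumulation of defunct processes: once the osoracle process count at the OS level passed 15,000, the listener processes, still running under the old nproc limit, could no longer fork new processes.
The problem we need to focus on and resolve is the defunct processes that started appearing at 3:00.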

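5) Some of my findings:
These processes exited and became zombies only a few seconds after being created. For example, the OSW ps output at 03:00:32 captured a process created at 03:00:29 that had already become a zombie: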

zzz ***Mon Mar 19 03:00:32 CST 2018
osoracle 9899 1 19 1.5 0.0 0 0 exit Z 03:00:29 00:00:00 [oracle_9899_phw] <=======status 'exit': this process was newly created at 03:00:29 and was already exiting.
<======start time 03:00:29: the process was created at 03:00:29 and exited 1-2 seconds later.

Moreover, every defunct process in the 03:00:32 sample was created between 02:59:47 and 03:00:29. The earliest one is:
osoracle 40955 1 19 8.0 0.0 0 0 exit Z 02:59:47 00:00:03 [oracle_40955_pa]
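
The creation-time window of the zombies in a snapshot can be extracted mechanically. This is a minimal sketch under the same assumed column layout as the ps samples (state in the 10th field, start time in the 11th as HH:MM:SS); the function name `zombie_start_range` is hypothetical, and lines whose start field is a date such as "Jul 06" are skipped.

```python
from datetime import datetime


def zombie_start_range(ps_lines):
    """Return (earliest, latest) start time of state-Z processes, or None."""
    starts = []
    for line in ps_lines:
        fields = line.split()
        if len(fields) >= 13 and fields[9] == "Z":
            try:
                starts.append(datetime.strptime(fields[10], "%H:%M:%S").time())
            except ValueError:
                # Start field is not a same-day HH:MM:SS value; skip it.
                continue
    return (min(starts), max(starts)) if starts else None


# Two lines quoted from the 03:00:32 snapshot.
sample = [
    "osoracle 9899 1 19 1.5 0.0 0 0 exit Z 03:00:29 00:00:00 [oracle_9899_phw]",
    "osoracle 40955 1 19 8.0 0.0 0 0 exit Z 02:59:47 00:00:03 [oracle_40955_pa]",
]

window = zombie_start_range(sample)
print(window)
```

Comparing the earliest start time against cron schedules and job logs is one way to test the first hypothesis below.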

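There are two possibilities:
1) Some scheduled script started at 02:59:47 and created processes that exit after a few seconds, which triggered an OS bug that prevents process resources from being cleaned up.
2) Or, processes connecting to the database briefly and then exiting is normal application behavior, and the problem is only that some OS resource cannot be cleaned up or some system call cannot complete normally,
so that every process that subsequently tries to exit hangs and becomes a zombie.
Judging by how fast zombies were accumulating at the time, it appears that every exiting process became a zombie, rather than this being caused by some particular operation.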