Oracle Database Performance Tuning: FileSystemCache on AIX (2019-07-07, Troubleshooting)

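This post examines performance bottlenecks during Oracle backups, analyzes how the AIX memory parameters maxclient%, maxperm%, and maxpin% affect an Oracle database, and shows how tuning them resolves the backup problems.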

Reposted from https://blog.youkuaiyun.com/msdnchina/article/details/43610167

Reference: 白鳝, "Oracle优化日记：一个金牌DBA的故事" (Oracle Tuning Diary: The Story of a Gold-Medal DBA), PDF edition.

$ ps gv |sort +6b -nr |head -10
6357032 - A 0:39 560140 51948 75248 32768 187414 62252 0.0 4.0 ora_ar
15990982 - A 9:05 105024 9580 68900 32768 187414 62252 0.0 3.0 ora_cj
18546788 - A 0:00 1 4652 66904 32768 187414 62252 0.0 3.0 ora_w0
20906094 - A 4:21 226826 6728 66460 32768 187414 62252 0.0 3.0 ora_sm
10092588 - A 0:00 1 4176 66428 32768 187414 62252 0.0 3.0 ora_w0
16973992 - A 0:00 0 4176 66428 32768 187414 62252 0.0 3.0 ora_w0
30933238 - A 7:02 208737 8100 66280 32768 187414 62252 0.0 3.0 ora_mm
28442714 - A 0:00 0 3548 65800 32768 187414 62252 0.0 3.0 ora_w0
6291592 - A 0:25 29467 6044 65212 32768 187414 62252 0.0 3.0 ora_q0
19661020 - A 0:22 21732 5660 64472 32768 187414 62252 0.0 3.0 ora_re

See page 165:

FileSystemCache can be viewed in nmon: start nmon and press m. Here is an example:

Note the display in the upper-right corner:

FileSystemCache (numperm) 66.8%

Process 28.3%

System 4.8%

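Free 0.2% (the four figures above add up to roughly 100%)

In other words, this display shows how all physical memory is distributed.

Note in particular:

FileSystemCache (numperm) is actually occupying 66.8% of physical memory.

But what is the largest share of physical memory that FileSystemCache can occupy? The answer is 90%, the default value of the parameters described below.

The operating system parameters that govern FileSystemCache can be viewed with the command vmo -a -F (on AIX 6.1). The main ones are: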

maxclient%=20

maxperm%=20

maxpin%=80

minperm%=5

strict_maxperm=0

strict_maxclient=1

The parameter values above come from field practice.
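To make these percentages concrete, here is a minimal sketch that parses a name=value listing like the one above and converts maxperm% and maxpin% into gigabytes. The 64 GB RAM size and the helper names parse_tunables and pct_of_ram_gb are assumptions for illustration only. The kernel works in page frames rather than raw installed RAM, so the figures are approximate.

```python
def parse_tunables(text):
    """Parse 'name=value' lines (spaces around '=' are tolerated) into a dict of ints."""
    tunables = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        tunables[name.strip()] = int(value.strip())
    return tunables


def pct_of_ram_gb(percent, ram_gb):
    """Convert a percentage tunable into an approximate amount of RAM in GB."""
    return ram_gb * percent / 100.0


# The listing from the text above.
listing = """
maxclient%=20
maxperm%=20
maxpin%=80
minperm%=5
strict_maxperm=0
strict_maxclient=1
"""

RAM_GB = 64  # assumption: installed memory, chosen only for illustration

tunables = parse_tunables(listing)
cache_limit_gb = pct_of_ram_gb(tunables["maxperm%"], RAM_GB)
pin_limit_gb = pct_of_ram_gb(tunables["maxpin%"], RAM_GB)

print(f"file cache target ceiling (maxperm%): about {cache_limit_gb:.1f} GB")
print(f"pinned memory ceiling (maxpin%): about {pin_limit_gb:.1f} GB")
```

With strict_maxperm=0, maxperm% acts as a soft target rather than a hard cap, so the first figure describes where page stealing starts to favor file pages, not an absolute limit.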

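Note that all the official Oracle documentation and all the official IBM documentation require maxclient% and maxperm% to be left at their default value (90%). In field practice, keeping the defaults causes serious problems. I have long suspected that the advice to keep the defaults is an American "conspiracy" rather than a technical requirement: domestic customers pay for IBM minicomputers but cannot use their physical memory effectively, because the file system cache is allowed to occupy it.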

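The failed connections returned ORA-12500 to the front end. At the system level the errors were "internal limit restriction exceeded" and error 12: Not enough space. On AIX, error 12 usually means insufficient memory, yet this system had 64 GB of RAM with plenty still free, and swap usage was under 2%.
At first I suspected a limit on connections or file handles. But raising the system parameter maxuproc (the maximum number of processes per user) from 2000 to 5000 (the system was actually running fewer than 1000 processes at the time) and the per-user file handle limit ofiles (descriptors) from 4000 to 5000 had no effect.
Given error type 12, I still placed the problem in memory. The SGA was 45 GB, about 70% of total memory. Adding the roughly 10% consumed by the system itself gives about 80% of memory that cannot be paged out, and the OS parameter maxpin% happens to default to 80%. After maxpin% was raised to 90%, the problem was resolved and all connections worked normally.
Normally, system memory is divided into computational memory and non-computational memory (such as file system cache). The SGA and the system's own memory are both computational. In earlier AIX versions (such as AIX 5.2) this memory could be paged out when memory ran short, but in later versions, from AIX 5.3 on, the introduction of large pages made it resident and non-pageable (whether that is progress or regression is hard to say). So when this memory reaches maxpin%, memory requests fail; if it goes beyond, the system hangs and no connection can get in. Our case was lucky: it only oscillated around the threshold and did not bring the system down.
Changing maxpin% is one fix, but maxperm%, maxclient%, and minperm% must also be set correctly, and other memory use taken into account. Besides the system and the SGA there is process memory (part of it pageable, part of it system-level and non-pageable) and the file system cache. Setting lru_file_repage=0 ensures that, as long as the file cache is above minperm%, non-computational memory is paged out first.
As for why the system was fine right after go-live and failed only after running for a while: the growing number of connections probably increased system overhead and memory use until the threshold was gradually reached.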
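The arithmetic behind this diagnosis can be checked with a short sketch. The 64 GB, 45 GB, and 10% figures come from the case above; the function name pinned_fraction and the simple model "SGA plus system overhead, compared against maxpin%" are simplifying assumptions for illustration, not a description of how AIX accounts for pinned pages internally.

```python
def pinned_fraction(sga_gb, ram_gb, system_fraction):
    """Approximate share of RAM that is non-pageable: SGA plus system overhead."""
    return sga_gb / ram_gb + system_fraction


RAM_GB = 64             # total memory in the case
SGA_GB = 45             # SGA size in the case
SYSTEM_FRACTION = 0.10  # the author's estimate of system overhead

used = pinned_fraction(SGA_GB, RAM_GB, SYSTEM_FRACTION)
print(f"estimated non-pageable share: {used:.1%}")

for maxpin_pct in (80, 90):
    headroom = maxpin_pct / 100.0 - used
    status = "within the limit" if headroom >= 0 else "over the limit"
    print(f"maxpin%={maxpin_pct}: headroom {headroom:+.1%} ({status})")
```

The estimate lands just above 80%, which matches the description of a system hovering around the threshold, and comfortably below 90%.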

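With the AIX maxclient and maxperm parameters set too high, the application system went down after Oracle started a backup.

Error

The bphdb log shows serious network communication problems after the backup started. A check of the network configuration and the TCP parameters found nothing wrong, so the connection problem was most likely caused by a system performance problem: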

12:29:40.340 [295180] <4> bphdb sync_server: INF - BACKUP START

12:29:40.454 [295180] <4> bphdb sync_server: INF - CONTINUE BACKUP message received.

12:29:40.574 [295180] <2> bphdb get_filelist: INF - Read filename:

12:29:40.574 [295180] <4> bphdb do_backup: INF - Processing /usr/openv/netbackup/script/hot_database_backup.sh

12:29:40.576 [295180] <4> bphdb do_backup: INF - Waiting for the child status.

12:29:40.604 [512146] <4> bphdb do_backup: INF - Child executing /usr/openv/netbackup/script/hot_database_backup.sh

12:38:27.751 [295180] <32> bphdb sighandler: FTL - bphdb lost connection with server

12:38:28.310 [295180] <4> bphdb sighandler: INF - Killing all children in the same process group.

The dbclient log shows:

12:38:41.969 [446546] <16> VxBSASendData: ERR - Could not do a bsa_put().

12:38:41.970 [446546] <2> xbsa_ProcessError: INF - entering

12:38:41.970 [446546] <2> xbsa_ProcessError: INF - leaving

12:38:41.970 [446546] <16> xbsa_SendData: ERR - VxBSASendData: Failed with error: Server Status: Communication with the server has not been initiated or the server status has not been retrieved from the serve

12:38:41.970 [446546] <2> sbterror: INF - entering

12:38:41.970 [446546] <2> sbterror: INF - Error=7501: VxBSASendData: Failed with error: Server Status: Communication with the server has not been initiated or the server status has not been retrieved from the serve.

12:38:41.970 [446546] <2> sbterror: INF - leaving

12:38:42.086 [446546] <2> sbtclose2: INF - entering

12:38:42.086 [446546] <2> int_CloseImage: INF - entering

12:38:42.137 [446546] <2> int_CloseImage: INF - Backup - closing

12:38:42.137 [446546] <2> xbsa_EndData: INF - entering

12:38:42.137 [446546] <4> VxBSAEndData: INF - entering EndData.

12:38:42.171 [446546] <4> finishTarImage: INF - FractionalObjectBytes: 0

12:38:42.171 [446546] <4> finishTarImage: INF - writing LF_END_U_LEN_FILE record

12:38:42.171 [446546] <4> write_LF_END_tarHeader: entering write_LF_END_tarHeader.

12:38:42.171 [446546] <16> writeToServer:

12:38:42.186 [446546] <16> write_LF_END_tarHeader: ERR - failed writing LF_END_U_LEN_FILE record on DATA socket

12:38:42.186 [446546] <16> finishTarImage: ERR - write_LF_END_tarHeader() failed.

12:38:42.187 [446546] <16> VxBSAEndData: ERR - EndData unable to bsa_finishTarImage().

12:38:42.187 [446546] <2> xbsa_ProcessError: INF - entering

12:38:42.187 [446546] <2> xbsa_ProcessError: INF - leaving

12:38:42.187 [446546] <16> xbsa_EndData: ERR - VxBSAEndData: Failed with error: Server Status: Communication with the server has not been initiated or the server status has not been retrieved from the serve

12:38:42.187 [446546] <2> xbsa_EndData: INF - leaving (3)

12:38:42.187 [446546] <16> int_CloseImage: ERR - Failed to process backup file

Environment

Master Server: Windows 2003 x64/NBU 7.0.1

Client: AIX 5.3/NBU client agent 7.0.1

Database agent: Oracle 9.0.8

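Cause

topas showed system IDLE at 50% and the operating system disks at 100% I/O, while I/O on the disk array holding the database was only about 20%. This indicates that system resources were severely short when NBU started the backup.

The system configuration parameters maxclient and maxperm were both set rather high: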

# vmo -a

maxclient% = 80

maxfree = 1088

maxperm = 1613809

maxperm% = 80

maxpin = 1692137

maxpin% = 80

Solution

After maxclient and maxperm were lowered to 20%, the Oracle backup immediately returned to normal, and CPU usage and operating system disk I/O looked normal.

# vmo -a

maxclient% = 20

maxfree = 1088

maxperm = 403452

maxperm% = 20

maxpin = 1692140

maxpin% = 80
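As a cross-check on the two vmo -a listings, the following sketch parses both and reports which tunables changed. The helper name parse_vmo is hypothetical, and the two input strings are copied from the output shown in this case.

```python
def parse_vmo(text):
    """Parse 'name = value' lines as printed by vmo -a into a dict of ints."""
    values = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        values[name.strip()] = int(value.strip())
    return values


before = parse_vmo("""
maxclient% = 80
maxfree = 1088
maxperm = 1613809
maxperm% = 80
maxpin = 1692137
maxpin% = 80
""")

after = parse_vmo("""
maxclient% = 20
maxfree = 1088
maxperm = 403452
maxperm% = 20
maxpin = 1692140
maxpin% = 80
""")

# Keep only the tunables whose value differs between the two listings.
changed = {k: (before[k], after[k]) for k in before if before[k] != after.get(k)}
for name, (old, new) in sorted(changed.items()):
    print(f"{name}: {old} -> {new}")

# maxperm is expressed in pages; it should shrink in step with maxperm%.
ratio = before["maxperm"] / after["maxperm"]
print(f"maxperm page count shrank by a factor of {ratio:.2f}")
```

The maxperm page count drops by a factor of about four, consistent with maxperm% going from 80 to 20, while maxfree and maxpin% are unchanged.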

From Symantec

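Case:
A billing database became slow to respond. The server had 16 GB of memory and used raw devices, yet showed heavy page-in (PI) and page-out (PO) activity.

A check of the memory-related system parameters found the following:
minperm% = 20, maxperm% = 80, maxclient% = 80
Note: these three parameters were at the system defaults. They mean that the file system may use up to 80% * 16 GB = 12.8 GB to cache the files it accesses.
Conclusion: with these defaults, the file system cache can grow to 12.8 GB. After a large number of file cp operations, available memory dropped rapidly, and during the billing runs that followed, heavy page in/page out activity caused a severe performance bottleneck.
Tuning:
Set maxperm% = 30 and maxclient% = 30:
# vmo -p -o maxperm%=30
# vmo -p -o maxclient%=30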
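A quick calculation shows what this change means for the 16 GB server in the case. This is a rough sketch: cache_ceiling_gb is a hypothetical helper, and treating the percentage as a share of installed RAM is an approximation.

```python
def cache_ceiling_gb(ram_gb, maxperm_pct):
    """Approximate file cache ceiling implied by maxperm%, in GB."""
    return ram_gb * maxperm_pct / 100.0


RAM_GB = 16  # memory size from the case

for pct in (80, 30):
    ceiling = cache_ceiling_gb(RAM_GB, pct)
    remaining = RAM_GB - ceiling
    print(f"maxperm%={pct}: cache up to {ceiling:.1f} GB, "
          f"at least {remaining:.1f} GB left for everything else")
```

At 80% the cache can crowd the database and its processes into roughly 3 GB; at 30% they keep about 11 GB.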