一个新生较忽略的错误,引以为戒!
问题症状:系统数据库(节点1主机发生重启,数据库实例暂时未启动),节点2操作系统负载不高,但是业务反应非常缓慢,观察数据库存在较多等待事件:
COUNT(*) EVENT
---------- --------------------------------------------------
1 Streams AQ: qmn coordinator idle wait
1 Streams AQ: waiting for messages in the queue
1 Streams AQ: qmn slave idle wait
1 SQL*Net message to client
1 DIAG idle wait
1 Streams AQ: waiting for time management or cleanup
tasks
2 control file sequential read
3 SQL*Net message from dblink
10 cursor: pin S wait on X
35 log file switch (archiving needed)
46 buffer busy waits
135 row cache lock
376 enq: SQ - contention
13 rows selected.
问题处理过程:
观察以上查询到的等待事件进行,发生等待最多的enq: SQ - contention和row cache lock看起来和序列有关,结合ASH视图查询,发生等待最多的SQL语句:
SQL> select sql_id,count(*) from v$active_session_history where event = 'enq: SQ - contention' and sample_time >= to_date('20150928 15:00:00', 'yyyymmdd hh24:mi:ss') group by sql_id;
SQL_ID COUNT(*)
--------------------------------------- ----------
74whb22kwku28 172933
fzc102gjarjdx 40257
SQL> select sql_text from v$sqlarea where sql_id='74whb22kwku28';
SQL_TEXT
--------------------------------------------------------------------------------
select GETOPERID from dual --->GETOPERID函数也是访问SEQ_MD_DEV_STORAGE.nextval
SQL> select sql_text from v$sqlarea where sql_id='fzc102gjarjdx';
SQL_TEXT
--------------------------------------------------------------------------------
select to_char(SEQ_MD_DEV_STORAGE.nextval) as ID from dual
同时查看了相关序列的信息:
SQL> SET LINE 200
SQL> col SEQUENCE_OWNER for a15
SQL> col SEQUENCE_NAME for a45
SQL> select * from dba_sequences a where sequence_name='SEQ_MD_DEV_STORAGE' and sequence_owner='SXYKCC';
SEQUENCE_OWNER SEQUENCE_NAME MIN_VALUE MAX_VALUE INCREMENT_BY CYCL ORDE CACHE_SIZE LAST_NUMBER
--------------- --------------------------------------------- ---------- ---------- ------------ ---- ---- ---------- -----------
SXYKCC SEQ_MD_DEV_STORAGE 1 1.0000E+16 1 N N 20 781075
序列的Cache看起来比较小,尝试加Cache会被hung住,尝试kill部分会话,kill之后并没有新的会话产生。此时注意到,等待事件情况变化为:
COUNT(*) EVENT
---------- ---------------------------------------------
32 log file switch (archiving needed)
19 buffer busy waits
14 row cache lock
12 cursor: pin S wait on X
2 SQL*Net message from dblink
2 control file sequential read
1 SQL*Net message to client
7 rows selected.
等待事件log file switch (archiving needed)通常为日志空间满,alert告警日志信息,并无明显的不能归档信息:
on Sep 28 15:20:03 CST 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=169
System State dumped to trace file /oracle/admin/pmcpdp/bdump/pmcpdp2_j003_25428018.trc
Mon Sep 28 15:21:33 CST 2015
Error 3135 trapped in 2PC on transaction 18.3.2426591. Cleaning up.
Error stack returned to user:
ORA-03135: connection lost contact
Mon Sep 28 15:22:23 CST 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=233
Mon Sep 28 15:24:28 CST 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=288
System State dumped to trace file /oracle/admin/pmcpdp/udump/pmcpdp2_ora_19792100.trc
Mon Sep 28 15:27:01 CST 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=657
Mon Sep 28 15:29:28 CST 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=735
System State dumped to trace file /oracle/admin/pmcpdp/udump/pmcpdp2_ora_34930770.trc
Mon Sep 28 15:31:40 CST 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=742
进一步检查归档路径,空间使用情况:
SQL> archive log list
Database log mode Archive Mode
Automatic archival Enabled
Archive destination /backup
Oldest online log sequence 2987
Next log sequence to archive 2987
Current log sequence 2994
$ df -k
Filesystem 1024-blocks Free %Used Iused %Iused Mounted on
/dev/hd4 5242880 2967360 44% 11564 2% /
/dev/hd2 10485760 6689804 37% 61859 4% /usr
/dev/hd9var 2097152 1238092 41% 9023 4% /var
/dev/hd3 5242880 4064048 23% 2957 1% /tmp
/dev/fwdump 1572864 1503456 5% 16 1% /var/adm/ras/platform
/dev/hd1 110100480 109831152 1% 385 1% /home
/dev/hd11admin 524288 523864 1% 5 1% /admin
/proc - - - - - /proc
/dev/hd10opt 5242880 4872412 8% 10861 1% /opt
/dev/livedump 524288 513072 3% 49 1% /var/adm/ras/livedump
/dev/lv00 524288 507776 4% 18 1% /var/adm/csd
/dev/oralv 41943040 23617468 44% 85100 2% /oracle
/dev/fslv00 104857600 104584732 1% 11 1% /arch02
NFS server cmspdb1 not responding still trying
--此处hung住
检查/backup挂载点相关信息:
/backup:
dev = /backup
vfs = nfs
nodename = cmspdb1
mount = true
options = cio,rw,bg,hard,nointr,rsize=32768,wsize=32768,proto=tcp,noac,vers=3,timeo=600
可以看到节点2的归档路径,实际是从节点1通过NFS挂载过来的归档路径
问题结论:
因为节点1上的NFS服务,为手动方式启动,重启主机以后,NFS服务未正常启动,所以节点2的归档路径变得不可用,导致归档无法正常进行,阻塞了大量会话。手动启动节点1上的NFS服务后,归档恢复正常,数据库等待事件消失。
来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/30430420/viewspace-1811315/,如需转载,请注明出处,否则将追究法律责任。
转载于:http://blog.itpub.net/30430420/viewspace-1811315/
本文详细记录了一个数据库系统因节点1主机重启,NFS服务未自动恢复,导致节点2归档路径不可用,进而引发一系列等待事件的问题。通过对等待事件的分析,尤其是enq: SQ - contention和row cache lock,发现与序列有关。最终,重启NFS服务解决了问题,恢复了数据库的正常运行。
660

被折叠的 条评论
为什么被折叠?



