数据库故障分析与处理

本文详细记录了一个数据库系统因节点1主机重启,NFS服务未自动恢复,导致节点2归档路径不可用,进而引发一系列等待事件的问题。通过对等待事件的分析,尤其是enq: SQ - contention和row cache lock,发现与序列有关。最终,重启NFS服务解决了问题,恢复了数据库的正常运行。

一个新生较忽略的错误,引以为戒!

问题症状:系统数据库(节点1主机发生重启,数据库实例暂时未启动),节点2操作系统负载不高,但是业务反应非常缓慢,观察数据库存在较多等待事件:

  COUNT(*) EVENT

---------- --------------------------------------------------

         1 Streams AQ: qmn coordinator idle wait

         1 Streams AQ: waiting for messages in the queue

         1 Streams AQ: qmn slave idle wait

         1 SQL*Net message to client

         1 DIAG idle wait

         1 Streams AQ: waiting for time management or cleanup

            tasks

         2 control file sequential read

         3 SQL*Net message from dblink

        10 cursor: pin S wait on X

        35 log file switch (archiving needed)

        46 buffer busy waits

       135 row cache lock

       376 enq: SQ - contention

 

13 rows selected.

 

 

问题处理过程:

观察以上查询到的等待事件进行,发生等待最多的enq: SQ - contentionrow cache lock看起来和序列有关,结合ASH视图查询,发生等待最多的SQL语句:

SQL> select sql_id,count(*)  from v$active_session_history where event  = 'enq: SQ - contention' and sample_time >= to_date('20150928 15:00:00', 'yyyymmdd hh24:mi:ss') group by sql_id;

 

SQL_ID                                    COUNT(*)

--------------------------------------- ----------

74whb22kwku28                               172933

fzc102gjarjdx                                40257

 

SQL> select sql_text from v$sqlarea where sql_id='74whb22kwku28';

 

SQL_TEXT

--------------------------------------------------------------------------------

select GETOPERID from dual  --->GETOPERID函数也是访问SEQ_MD_DEV_STORAGE.nextval

 

SQL> select sql_text from v$sqlarea where sql_id='fzc102gjarjdx';

 

SQL_TEXT

--------------------------------------------------------------------------------

select to_char(SEQ_MD_DEV_STORAGE.nextval) as ID from dual

同时查看了相关序列的信息:

SQL> SET LINE 200

SQL> col SEQUENCE_OWNER for a15

SQL> col SEQUENCE_NAME for a45

SQL> select * from dba_sequences a where sequence_name='SEQ_MD_DEV_STORAGE' and sequence_owner='SXYKCC';

 

SEQUENCE_OWNER  SEQUENCE_NAME                                  MIN_VALUE  MAX_VALUE INCREMENT_BY CYCL ORDE CACHE_SIZE LAST_NUMBER

--------------- --------------------------------------------- ---------- ---------- ------------ ---- ---- ---------- -----------

SXYKCC          SEQ_MD_DEV_STORAGE                                     1 1.0000E+16            1 N    N            20      781075

 

序列的Cache看起来比较小,尝试加Cache会被hung住,尝试kill部分会话,kill之后并没有新的会话产生。此时注意到,等待事件情况变化为:

  COUNT(*) EVENT

---------- ---------------------------------------------

        32 log file switch (archiving needed)

        19 buffer busy waits

        14 row cache lock

        12 cursor: pin S wait on X

         2 SQL*Net message from dblink

         2 control file sequential read

         1 SQL*Net message to client

 

7 rows selected.

 

等待事件log file switch (archiving needed)通常为日志空间满,alert告警日志信息,并无明显的不能归档信息:

on Sep 28 15:20:03 CST 2015

>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=169

System State dumped to trace file /oracle/admin/pmcpdp/bdump/pmcpdp2_j003_25428018.trc

Mon Sep 28 15:21:33 CST 2015

Error 3135 trapped in 2PC on transaction 18.3.2426591. Cleaning up.

Error stack returned to user:

ORA-03135: connection lost contact

Mon Sep 28 15:22:23 CST 2015

>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=233

Mon Sep 28 15:24:28 CST 2015

>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=288

System State dumped to trace file /oracle/admin/pmcpdp/udump/pmcpdp2_ora_19792100.trc

Mon Sep 28 15:27:01 CST 2015

>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=657

Mon Sep 28 15:29:28 CST 2015

>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=735

System State dumped to trace file /oracle/admin/pmcpdp/udump/pmcpdp2_ora_34930770.trc

Mon Sep 28 15:31:40 CST 2015

>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=742

 

 

进一步检查归档路径,空间使用情况:

SQL> archive log list

Database log mode              Archive Mode

Automatic archival             Enabled

Archive destination            /backup

Oldest online log sequence     2987

Next log sequence to archive   2987

Current log sequence           2994

 

$ df -k

Filesystem    1024-blocks      Free %Used    Iused %Iused Mounted on

/dev/hd4          5242880   2967360   44%    11564     2% /

/dev/hd2         10485760   6689804   37%    61859     4% /usr

/dev/hd9var       2097152   1238092   41%     9023     4% /var

/dev/hd3          5242880   4064048   23%     2957     1% /tmp

/dev/fwdump       1572864   1503456    5%       16     1% /var/adm/ras/platform

/dev/hd1        110100480 109831152    1%      385     1% /home

/dev/hd11admin      524288    523864    1%        5     1% /admin

/proc                   -         -    -         -     -  /proc

/dev/hd10opt      5242880   4872412    8%    10861     1% /opt

/dev/livedump      524288    513072    3%       49     1% /var/adm/ras/livedump

/dev/lv00          524288    507776    4%       18     1% /var/adm/csd

/dev/oralv       41943040  23617468   44%    85100     2% /oracle

/dev/fslv00     104857600 104584732    1%       11     1% /arch02

NFS server cmspdb1 not responding still trying

--此处hung

 

检查/backup挂载点相关信息:

/backup:

        dev             = /backup

        vfs             = nfs

        nodename        = cmspdb1

        mount           = true

        options         = cio,rw,bg,hard,nointr,rsize=32768,wsize=32768,proto=tcp,noac,vers=3,timeo=600

可以看到节点2的归档路径,实际是从节点1通过NFS挂载过来的归档路径

 

问题结论:

因为节点1上的NFS服务,为手动方式启动,重启主机以后,NFS服务未正常启动,所以节点2的归档路径变得不可用,导致归档无法正常进行,阻塞了大量会话。手动启动节点1上的NFS服务后,归档恢复正常,数据库等待事件消失。

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/30430420/viewspace-1811315/,如需转载,请注明出处,否则将追究法律责任。

转载于:http://blog.itpub.net/30430420/viewspace-1811315/

评论
成就一亿技术人!
拼手气红包6.0元
还能输入1000个字符  | 博主筛选后可见
 
红包 添加红包
表情包 插入表情
 条评论被折叠 查看
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值