凌晨00:10左右接到监控电话,说库连接异常,随即起床登陆主机一看,库实例已然crash掉啦,查告警日志:
Mon Mar 23 20:45:24 2015
Thread 1 advanced to log sequence 234601 (LGWR switch)
Current log# 3 seq# 234601 mem# 0: /dev/oravg/rgsinas_redo13
Mon Mar 23 20:48:28 2015
KCF: write/open error block=0x840 online=1
file=204 /dev/vg01/rora_data207
error=65535 txt: ''
Automatic datafile offline due to write error on
file 204: /dev/vg01/rora_data207
Mon Mar 23 21:16:30 2015
Errors in file /oracle/admin/jhdb/bdump/jhdb_lgwr_4176.trc:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 4169'
Mon Mar 23 21:16:32 2015
System State dumped to trace file /oracle/admin/jhdb/bdump/jhdb_lgwr_4176.trc
Killing enqueue blocker (pid=4169) on resource CF-00000000-00000000
by killing session 825.1
Mon Mar 23 21:21:42 2015
Errors in file /oracle/admin/jhdb/bdump/jhdb_lgwr_4176.trc:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 4169'
Mon Mar 23 21:21:44 2015
System State dumped to trace file /oracle/admin/jhdb/bdump/jhdb_lgwr_4176.trc
Killing enqueue blocker (pid=4169) on resource CF-00000000-00000000
by terminating the process
LGWR: terminating instance due to error 2103
Mon Mar 23 21:21:55 2015
Instance terminated by LGWR, pid = 4176
随即用dbv检查file 204发现:
Corrupt block relative dba: 0x33000840 (file 204, block 2112)
gsinas01[/oracle]$dbv file='/dev/vg01/rora_data207';
DBVERIFY: Release 10.2.0.4.0 - Production on Fri Mar 27 10:24:43 2015
Copyright (c) 1982, 2007, Oracle. All rights reserved.
DBVERIFY - Verification starting : FILE = /dev/vg01/rora_data207
Page 2106 is influx - most likely media corrupt
Corrupt block relative dba: 0x3300083a (file 204, block 2112)
Fractured block found during dbv:
Data in bad block:
type: 6 format: 2 rdba: 0x3300083a
last change scn: 0x0cd1.9978968d seq: 0x2 flg: 0x06
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0x74ab0602
check value in block header: 0x414f
computed block checksum: 0x73bb
DBVERIFY - Verification complete
Total Pages Examined : 1048575
Total Pages Processed (Data) : 1020
Total Pages Failing (Data) : 0
Total Pages Processed (Index): 1145
Total Pages Failing (Index): 0
Total Pages Processed (Other): 18
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 1046391
Total Pages Marked Corrupt : 1
Total Pages Influx : 1
Highest block SCN : 2574816070 (3281.2574816070)
file为204,blk id为2112;
入库检查对应204 file为/dev/vg01/rora_data207,其状态如下:
select STATUS,name from v$datafile;
ONLINE /dev/oravg/rora_data204
ONLINE /dev/oravg/rora_data205
ONLINE /dev/oravg/rora_data189
ONLINE /dev/oravg/rora_data23
ONLINE /dev/vg01/rora_data206
RECOVER /dev/vg01/rora_data207
ONLINE /dev/vg01/rora_data208
ONLINE /dev/vg01/rora_data209
ONLINE /dev/vg01/rora_data210
ONLINE /dev/vg01/rora_data211
SQL> SELECT segment_name,segment_type FROM dba_extents WHERE file_id=204 AND 2112 BETWEEN block_id AND block_id + blocks -1;
SEGMENT_NAME SEGMENT_TYPE
--------------------------------------------------------------------------------- ------------------
IDX_CONSTRUCTIONCRM_SUCCFLAG
SQL> select count(*) from inas.CONSTRUCTIONCRM;
select count(*) from inas.CONSTRUCTIONCRM
*
ERROR at line 1:
ORA-00376: file 204 cannot be read at this time
ORA-01110: data file 204: '/dev/vg01/rora_data207'
SQL> SELECT owner,segment_name,segment_type FROM dba_extents WHERE file_id=204 AND 2112 BETWEEN block_id AND block_id + blocks -1;
OWNER SEGMENT_NAME SEGMENT_TYPE
------------------------------ --------------------------------------------------------------------------------- ------------------
INAS IDX_CONSTRUCTIONCRM_SUCCFLAG INDEX
所幸的是对应坏块的是个应用索引。
果断进行块恢复:
SQL> host
$ rman target /
Recovery Manager: Release 10.2.0.4.0 - Production on Tue Mar 24 10:00:34 2015
Copyright (c) 1982, 2007, Oracle. All rights reserved.
connected to target database: JHDB (DBID=923949890)
RMAN>
RMAN> RUN {
2> ALLOCATE CHANNEL ch00 TYPE 'SBT_TAPE' parms='ENV=(NB_ORA_CLIENT=gsinas01,NB_ORA_POLICY=OSS_gsinas01_Oracle_Full_Backup)';
3> BLOCKRECOVER DATAFILE 204 BLOCK 2112;
4> }
released channel: ORA_DISK_1
allocated channel: ch00
channel ch00: sid=1491 devtype=SBT_TAPE
channel ch00: Veritas NetBackup for Oracle - Release 7.6 (2014042414)
Starting blockrecover at 24-MAR-15
channel ch00: restoring block(s)
channel ch00: specifying block(s) to restore from backup set
restoring blocks of datafile 00204
channel ch00: reading from backup piece bk_20333_1_874973991
channel ch00: restored block(s) from backup piece 1
piece handle=bk_20333_1_874973991 tag=HOT_DB_BK_LEVEL0
channel ch00: block restore complete, elapsed time: 00:05:55
starting media recovery
media recovery complete, elapsed time: 00:03:46
Finished blockrecover at 24-MAR-15
released channel: ch00
RMAN>
RMAN> backup validate datafile 204;
Starting backup at 24-MAR-15
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=1491 devtype=DISK
channel ORA_DISK_1: starting full datafile backupset
channel ORA_DISK_1: specifying datafile(s) in backupset
input datafile fno=00204 name=/dev/vg01/rora_data207
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:04
Finished backup at 24-MAR-15
RMAN>
RMAN> recover datafile 204;
Starting recover at 24-MAR-15
using channel ORA_DISK_1
starting media recovery
archive log thread 1 sequence 234600 is already on disk as file /arch1/1_234600_704593860.dbf
archive log thread 1 sequence 234601 is already on disk as file /arch1/1_234601_704593860.dbf
archive log thread 1 sequence 234602 is already on disk as file /arch1/1_234602_704593860.dbf
archive log thread 1 sequence 234603 is already on disk as file /arch1/1_234603_704593860.dbf
archive log thread 1 sequence 234604 is already on disk as file /arch1/1_234604_704593860.dbf
archive log thread 1 sequence 234605 is already on disk as file /arch1/1_234605_704593860.dbf
archive log thread 1 sequence 234606 is already on disk as file /arch1/1_234606_704593860.dbf
archive log filename=/arch1/1_234600_704593860.dbf thread=1 sequence=234600
archive log filename=/arch1/1_234601_704593860.dbf thread=1 sequence=234601
archive log filename=/arch1/1_234602_704593860.dbf thread=1 sequence=234602
media recovery complete, elapsed time: 00:00:06
Finished recover at 24-MAR-15
RMAN>
入库查询:
SQL> alter database datafile 204 online;
Database altered.
SQL> select count(*) from inas.CONSTRUCTIONCRM;
COUNT(*)
----------
1499634
STATUS NAME
------- ------------------------------
ONLINE /dev/oravg/rora_data204
ONLINE /dev/oravg/rora_data205
ONLINE /dev/oravg/rora_data189
ONLINE /dev/oravg/rora_data23
ONLINE /dev/vg01/rora_data206
ONLINE /dev/vg01/rora_data207
ONLINE /dev/vg01/rora_data208
ONLINE /dev/vg01/rora_data209
ONLINE /dev/vg01/rora_data210
ONLINE /dev/vg01/rora_data211
再用dbv检查file情况:
$ dbv file='/dev/vg01/rora_data207';
DBVERIFY: Release 10.2.0.4.0 - Production on Tue Mar 24 10:14:30 2015
Copyright (c) 1982, 2007, Oracle. All rights reserved.
DBVERIFY - Verification starting : FILE = /dev/vg01/rora_data207
DBVERIFY - Verification complete
Total Pages Examined : 1048575
Total Pages Processed (Data) : 1020
Total Pages Failing (Data) : 0
Total Pages Processed (Index): 1146
Total Pages Failing (Index): 0
Total Pages Processed (Other): 18
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 1046391
Total Pages Marked Corrupt : 0
Total Pages Influx : 0
Highest block SCN : 2567939499 (3281.2567939499)
$
无报错,都恢复正常。
原本的错误:
ORA-01110: 数据文件 204: '/dev/vg01/rora_data207'
ORA-06512: 在 "INASQM.P_BACK_UP_EWS_QM", line 16
ORA-06512: 在 line 1
Tue Mar 24 08:14:54 2015
Errors in file /oracle/admin/jhdb/bdump/jhdb_j000_10062.trc:
ORA-12012: 自动执行作业 425 出错
ORA-00376: 此时无法读取文件 204
ORA-01110: 数据文件 204: '/dev/vg01/rora_data207'
ORA-06512: 在 "INASQM.P_BACK_UP_EWS_QM", line 16
ORA-06512: 在 line 1
已经终止报错:
Tue Mar 24 10:46:50 2015
Thread 1 advanced to log sequence 234608 (LGWR switch)
Current log# 2 seq# 234608 mem# 0: /dev/oravg/rgsinas_redo12
至此故障已经恢复,库恢复正常,无中断;然而之前的告警报错仍然有问题可循,以下供参考:
1 ORA-00494: enqueue [CF] held for too long (more than 900 seconds); LGWR:Terminating Instance Due to Error 2103 (文档 ID 1089733.1)
Applies to:
Oracle Database - Enterprise Edition - Version 10.2.0.4 and later
Information in this document applies to any platform.
Symptoms
You see the following issue in the alert.log:
KCF: write/open error block=0x9a989 online=1
file=7 /Oratemp/oradata/M6DTH/temp07.dbf
error=27063 txt: 'IBM AIX RISC System/6000 Error: 28: No space left on device
Additional information: -1
Additional information: 24576'
Automatic tempfile offline due to write error on
file 7: /Oratemp/oradata/M6DTH/temp07.dbf
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 2876448'
LGWR: terminating instance due to error 2103
ORA-02103: PCC: inconsistent cursor cache (out-of-range cuc ref)
Some or all of the following function codes may be found in the 'stack trace' portion of the associated trace file.:
------------- Call Stack ---------------
sskgpwwait
skgpwwait
ksliwat
kslwaitns_timed
kskthbwt
kslwait
kcbbpibr
kcfbsy
kcfbof
kcfofa
ksbabs
ksbrdp
opirip
opidrv
sou2o
opimai_real
main
__start()
Cause
Bug 7600269 closed as a duplicate of Bug:7411568
Bug:9273547 closed as a duplicate of Bug:7411568
Trace file/alert log matches that in the bug
1) Starts with space error and file offline
2) Sessions are waiting for CF enqueue
3) Arguments ksliwat kslwaitns_timed kskthbwt kslwait... match that in the call stack of bug
4) There is a wait on kcbzps and occurs with DBWR
Solution
Apply Patch 7411568
2 如果不打补丁用隐含参数也可以避免,但没有测试是否稳妥:
通过隐含参数禁止Oracle主动kill CF锁的持有进程,禁止Oracle去kill所有持有CF锁的进程:
_kill_controlfile_enqueue_blocker=false
也可以通过设置另外一个隐含参数禁止Oracle去kill持有锁的后台进程:
_kill_enqueue_blocker=1(默认为3)
进程请求CF enqueue的time out时间默认是900秒,这是由下面的参数控制的:
_controlfile_enqueue_timeout=900
进程持有CF enqueue的时间最长不能超过120秒:
_controlfile_enqueue_holding_time=120
3 网友惜分飞的不补充说明:
补充说明
1.在9i中使用blockrecover会在Archive destination中生产block的备份文件,类似469_519791_3063_2442393528.bkd(file 469 block 519791),需要注意Archive destination目录结尾需要”/”,不然可能出现直接写入和该目录并列的Archive destination+blockrecover产生文件;10g中不生成该文件;如果该block在9i中未备份,也不会在生产相关文件,而是利用归档恢复.
2.blockrecover需要还原从该block从备份之后的所有归档(如果被备份起来需要还原出来),如果缺少归档可能导致恢复失败,包括基于scn/seq/time的等
3.blockrecover在执行过程中或者执行异常终止,该block依然是坏块,不会对其他block产生影响.在使用blockrecover出现异常终止后的block修复,在后续blog中提供解决方法
4.对于不含事务的坏块,如果数据允许丢失可以通过设置event跳过坏块(特殊block除外),然后重建对象;对于含event的block 坏块,建议使用blockrecover处理或者直接恢复数据文件,如果没有备份,考试使用event跳过事务回滚,然后屏蔽坏块处理
来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/13750068/viewspace-1471545/,如需转载,请注明出处,否则将追究法律责任。
转载于:http://blog.itpub.net/13750068/viewspace-1471545/