一次BT的故障处理

记录一次在我看来比较另类的故障处理。
环境:oracle  10.2.0.3,OS:windows 32bit。
现象:数据库startup后,10-30秒之内出现崩溃,自动关闭。
alert中的报错如下:
 
Thu Mar 21 09:27:24 2013
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 2
Using LOG_ARCHIVE_DEST_1 parameter default value as D:\oracle\product\10.2.0\db_1\RDBMS
Autotune of undo retention is turned on.
IMODE=BR
ILAT =18
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.3.0.
System parameters with non-default values:
  processes                = 150
  sga_max_size             = 1451229184
  __shared_pool_size       = 176160768
  __large_pool_size        = 8388608
  __java_pool_size         = 8388608
  __streams_pool_size      = 0
  nls_language             = SIMPLIFIED CHINESE
  nls_territory            = CHINA
  sga_target               = 1451229184
  control_files            = D:\ORACLE\PRODUCT\10.2.0\ORADATA\REPJSDC\CONTROL01.CTL, D:\ORACLE\PRODUCT\10.2.0\ORADATA\REPJSDC\CONTROL02.CTL, D:\ORACLE\PRODUCT\10.2.0\ORADATA\REPJSDC\CONTROL03.CTL
  db_block_size            = 8192
  __db_cache_size          = 1249902592
  compatible               = 10.2.0.3.0
  db_file_multiblock_read_count= 16
  undo_management          = AUTO
  undo_tablespace          = UNDOTBS1
  remote_login_passwordfile= EXCLUSIVE
  db_domain                =
  job_queue_processes      = 10
  audit_file_dest          = D:\ORACLE\PRODUCT\10.2.0\ADMIN\REPJSDC\ADUMP
  background_dump_dest     = D:\ORACLE\PRODUCT\10.2.0\ADMIN\REPJSDC\BDUMP
  user_dump_dest           = D:\ORACLE\PRODUCT\10.2.0\ADMIN\REPJSDC\UDUMP
  core_dump_dest           = D:\ORACLE\PRODUCT\10.2.0\ADMIN\REPJSDC\CDUMP
  db_name                  = repjsdc
  open_cursors             = 300
  pga_aggregate_target     = 203423744
PMON started with pid=2, OS id=1640
PSP0 started with pid=3, OS id=3588
MMAN started with pid=4, OS id=3988
DBW0 started with pid=5, OS id=2400
LGWR started with pid=6, OS id=1652
CKPT started with pid=7, OS id=3868
SMON started with pid=8, OS id=924
RECO started with pid=9, OS id=3664
CJQ0 started with pid=10, OS id=3916
MMON started with pid=11, OS id=2640
MMNL started with pid=12, OS id=3608
Thu Mar 21 09:27:25 2013
ALTER DATABASE   MOUNT
Thu Mar 21 09:27:29 2013
Setting recovery target incarnation to 1
Thu Mar 21 09:27:29 2013
Successful mount of redo thread 1, with mount id 526771261
Thu Mar 21 09:27:29 2013
Database mounted in Exclusive Mode
Completed: ALTER DATABASE   MOUNT
Thu Mar 21 09:27:30 2013
ALTER DATABASE OPEN
Thu Mar 21 09:27:30 2013
Beginning crash recovery of 1 threads
 parallel recovery started with 7 processes
Thu Mar 21 09:27:30 2013
Started redo scan
Thu Mar 21 09:27:30 2013
Completed redo scan
 383 redo blocks read, 101 data blocks need recovery
Thu Mar 21 09:27:30 2013
Started redo application at
 Thread 1: logseq 20550, block 3
Thu Mar 21 09:27:30 2013
Recovery of Online Redo Log: Thread 1 Group 3 Seq 20550 Reading mem 0
  Mem# 0: D:\ORACLE\PRODUCT\10.2.0\ORADATA\REPJSDC\REDO03.LOG
Thu Mar 21 09:27:30 2013
Completed redo application
Thu Mar 21 09:27:30 2013
Completed crash recovery at
 Thread 1: logseq 20550, block 386, scn 73904193314
 101 data blocks read, 101 data blocks written, 383 redo blocks read
Thu Mar 21 09:27:30 2013
Thread 1 advanced to log sequence 20551
Thread 1 opened at log sequence 20551
  Current log# 1 seq# 20551 mem# 0: D:\ORACLE\PRODUCT\10.2.0\ORADATA\REPJSDC\REDO01.LOG
Successful open of redo thread 1
Thu Mar 21 09:27:30 2013
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Thu Mar 21 09:27:30 2013
SMON: enabling cache recovery
Thu Mar 21 09:27:31 2013
Successfully onlined Undo Tablespace 1.
Thu Mar 21 09:27:31 2013
SMON: enabling tx recovery
Thu Mar 21 09:27:31 2013
Database Characterset is ZHS16GBK
Thu Mar 21 09:27:31 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_smon_924.trc:
ORA-00600: 内部错误代码, 参数: [kcbgcur_9], [4194333], [1], [4294967250], [8192], [], [], []
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
QMNC started with pid=21, OS id=3364
Thu Mar 21 09:27:31 2013
Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 1 out of maximum 100 non-fatal internal errors.
Thu Mar 21 09:27:31 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_smon_924.trc:
ORA-00600: 内部错误代码, 参数: [kcbgcur_9], [4194333], [1], [4294967250], [8192], [], [], []
Thu Mar 21 09:27:32 2013
Completed: ALTER DATABASE OPEN
Thu Mar 21 09:27:32 2013
Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 2 out of maximum 100 non-fatal internal errors.
Thu Mar 21 09:27:33 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_smon_924.trc:
ORA-00600: 内部错误代码, 参数: [kcbgcur_9], [4194333], [1], [4294967250], [8192], [], [], []
Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 3 out of maximum 100 non-fatal internal errors.
Thu Mar 21 09:27:34 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_smon_924.trc:
ORA-00600: 内部错误代码, 参数: [kcbgcur_9], [4194333], [1], [4294967250], [8192], [], [], []
Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 4 out of maximum 100 non-fatal internal errors.
Thu Mar 21 09:27:35 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_smon_924.trc:
ORA-00600: 内部错误代码, 参数: [kcbgcur_9], [4194333], [1], [4294967250], [8192], [], [], []
Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 5 out of maximum 100 non-fatal internal errors.
Thu Mar 21 09:27:46 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_smon_924.trc:
ORA-00600: 内部错误代码, 参数: [kcbgcur_9], [4194333], [1], [4294967250], [8192], [], [], []
Thu Mar 21 09:27:46 2013
Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 6 out of maximum 100 non-fatal internal errors.
Thu Mar 21 09:27:56 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_smon_924.trc:
ORA-00600: 内部错误代码, 参数: [kcbgcur_9], [4194333], [1], [4294967250], [8192], [], [], []
Thu Mar 21 09:27:56 2013
Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 7 out of maximum 100 non-fatal internal errors.
Thu Mar 21 09:27:59 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_smon_924.trc:
ORA-00600: 内部错误代码, 参数: [kcbgcur_9], [4194333], [1], [4294967250], [8192], [], [], []
Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 8 out of maximum 100 non-fatal internal errors.
Thu Mar 21 09:28:10 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_smon_924.trc:
ORA-00600: 内部错误代码, 参数: [kcbgcur_9], [4194333], [1], [4294967250], [8192], [], [], []
Thu Mar 21 09:28:13 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_pmon_1640.trc:
ORA-00474: SMON 进程因错误而终止
Thu Mar 21 09:28:13 2013
PMON: terminating instance due to error 474
Thu Mar 21 09:28:14 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_j000_2212.trc:
ORA-00474: SMON 进程因错误而终止
Thu Mar 21 09:28:14 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_j001_2404.trc:
ORA-00474: SMON 进程因错误而终止
Thu Mar 21 09:28:15 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_reco_3664.trc:
ORA-00474: SMON 进程因错误而终止
Thu Mar 21 09:28:15 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_ckpt_3868.trc:
ORA-00474: SMON 进程因错误而终止
Thu Mar 21 09:28:15 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_mman_3988.trc:
ORA-00474: SMON 进程因错误而终止
Thu Mar 21 09:28:15 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_psp0_3588.trc:
ORA-00474: SMON 进程因错误而终止
Thu Mar 21 09:28:15 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_lgwr_1652.trc:
ORA-00474: SMON 进程因错误而终止
Thu Mar 21 09:28:15 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_dbw0_2400.trc:
ORA-00474: SMON 进程因错误而终止
 
排查思路,之前处理过ORA-600的错误,此处的错误可能与之前的情况有相似点,按照undo表空间,损坏的方式处理,用隐含参数的方式启动数据库:  alter system set "_allow_resetlogs_corruption"=true scope=spfile;之后重建usdo表空间。正常恢复后发现情况并没有解决。
 
此时变的束手无策,接下来查看metalink,上面的处理方式是针对ORA-00600: internal error code, arguments: [4194], [3], [3], [], [], [], [], [],这一报错的,是比较经典的问题,具体解决方式参考盖国强大师的微博http://www.eygle.com/archives/2006/02/howto_resolve_ora_600_4194.html,和杨廷昆的微博http://yangtingkun.itpub.net/post/468/497211的解决方式,但是仔细看完后发现这个问题和我的情况并不是一回事,此时查看metalink给出的几条解释,
SMON crashes the instance on startup with ORA-600 [kcbgcur_9]

You shall get into this issue in following scenario. 

o A rollback segment is dropped. 
o It belongs to locally managed tablespace. 
o The rollback segment is marked as temporary for later cleanup. 

1. The following query is to check if it is a Locally Managed tablespace. 
SELECT EXTENT_MANAGEMENT FROM DBA_TABLESPACES WHERE TABLESPACE_NAME='&RBS'; 



--- Extent Management of the RBS tablespace is LOCAL 
--- RBS = Rollback Segment tablespace 

2. The following query will show the segments that are marked as TEMPORARY in the rollback segment tablespace. 

SELECT SEGMENT_NAME, SEGMENT_TYPE FROM DBA_SEGMENTS WHERE SEGMENT_TYPE='TEMPORARY' 
AND 
TABLESPACE_NAME='&RBS'; 



--- Returns TEMPORARY object / objects..

Cause

o This is due to unpublished Bug:2202310 and applicable to the Oracle versions lesser than 9.0.1.4.0 (or) 9.2.0.1.0

o SMON crashes the instance with ORA-600 [kcbgcur_9] error, while clearing the a rollback segment that is left in TEMPORARY status for later cleanup  in a Locally Managed tablespace.

Solution

o It is very rare to encounter this scenario.


o As this issue is not fixed in 8i releases we shall use the workarounds to avoid bug. 

o Whenever a rollback segment is dropped check the DBA_SEGMENTS view and ensure there is no TEMPORARY segments in the rollback segment tablespace before creating a new rollback segment.


SELECT SEGMENT_NAME, SEGMENT_TYPE FROM DBA_SEGMENTS WHERE SEGMENT_TYPE='TEMPORARY'
AND
TABLESPACE_NAME='&RBS';


o For further assistance please log a tar with oracle support to investigate the issue.

@ To resolve the scenario following actions shall be tried.
@1. Stop SMON doing the TEMP segment cleanup. This is inorder to understand weather we fall into the above bug. 
@event="10061 trace name context forever, level 10"
@2. If this allows the database to open, then drop the problem rollback segment tablespace.
@3. As the tablespace is dropped now, there wont be any conflicting cleanups for SMON.  
@4. Recreate the RBS tablespace and new rollback segments. 
@5. Remove the event from the init.ora


唯一找到这个相近的情况,描述的是8i中出现的bug,很少出现的bug,其中是本地管理表空间,而目前的环境并不是这样,但是给出的启示是smon清理temp segment或者某个segment的时候出现了故障。
 
此时仔细阅读alert报错,发现这样一句话
Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 1 out of maximum 100 non-fatal internal errors.
这个报错也和ora-600有关

When starting the database after a crash or for normal maintenance after a truncate table, the error below will occur 100 times and then the instance will crash. 

The errors are caused because SMON is trying to clean a temp segment in a tablespace by looking at the bitmap index in the tablespace.  The bitmap index is corrupted and this causes SMON to fail when trying to clean up the segment.

From the alert.log:

ORA-00600: internal error code, arguments: [5463], [], [], [], [], [], [], []
Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 53 out of maximum 100 non-fatal internal errors.

这个错误和我们的挺像,但是操作却不一样,我们借鉴下解决方式,

1. set event 10061 in the pfile to turn off  SMON from cleaning temp segments.  This will keep the database from crashing after SMON tries 100 times to clean the segment.

event = '10061 trace name context forever, level 10'

2. >select segment_name, segment_type from dba_segments where 
header_file=98 and segment_type='TEMPORARY'; 

SEGMENT_NAME SEGMENT_TYPE 
-------------------    ----------------------- 
98.110124                TEMPORARY 

98.110124

If there are no results make sure the datafile reference is not using the relative file number.

>select rfile#,name,ts#,file# from v$datafile;

3. Mark the segment as corrupted: 
exec dbms_space_admin.segment_corrupt('',98,110124) 

4. Drop the segment 
exec dbms_space_admin.segment_drop_corrupt('',98,110124) 

5. Rebuild all the bitmap indexes 
exec dbms_space_admin.tablespace_rebuild_bitmaps(''); 

6. Remove event 10061 from pfile and bounce database 

7.  exec dbms_space_admin.tablespace_verify (''); 

 

由于在启动过程中,很有可能破坏了数据库的一致性,建议执行exp后重新建库。

至此,问题解决了,总结一下,其实是一开始就走入了误区,才使得自己走了不少弯路,如果一开始就把入手点定在

Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 53 out of maximum 100 non-fatal internal errors.

这两句上,可能问题早就解决了。
 
问题分析:到了这步,结果变的很明朗了,SMON在实例启动的时候清理temp segment时被调用了53次后强行被关闭。我们手动管理了smon的清理功能。
 
下面是envent 10061的介绍:
Error:  ORA 10061
Text:   disable SMON from cleaning temp segment
-------------------------------------------------------------------------------

Explanation:
        This is NOT an error but is a special EVENT code.
        It should *NOT* be used unless explicitly requested by RD support.

        It allows you to prevent SMON from cleaning up temporary segments.
        If set in:

         7.0.X:
            This will prevent SMON performing cleanup of TEMP segments
            and from coalsecing extents. The cleanup action was performed
            every 3 minutes in 7.0.X releases.

        7.1.X:
            The event behaviour is modified in that TEMP segment cleanup
            is skipped on TIMEOUTS but will be performed if SMON is
            posted. The cleanup only occurs every 12 hours or when posted
            in 7.1.X releases.
            You can POST SMON by performing a failed CREATE TABLE command.

            A new event 10269 is for coalescing extents.

Usage:  
        event="10061 trace name context forever, level 10"

Articles:

@Archived Managing SMON Activity on multiple OPS instances   <45856.1>45856.1>

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/16969562/viewspace-756783/,如需转载,请注明出处,否则将追究法律责任。

转载于:http://blog.itpub.net/16969562/viewspace-756783/

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值