一次BT的故障处理

最新推荐文章于 2024-12-24 20:12:30 发布

转载最新推荐文章于 2024-12-24 20:12:30 发布 · 680 阅读

记录一次在我看来比较另类的故障处理。

环境：oracle 10.2.0.3，OS:windows 32bit。

现象：数据库startup后，10-30秒之内出现崩溃，自动关闭。

alert中的报错如下：

Thu Mar 21 09:27:24 2013
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 2
Using LOG_ARCHIVE_DEST_1 parameter default value as D:\oracle\product\10.2.0\db_1\RDBMS
Autotune of undo retention is turned on.
IMODE=BR
ILAT =18
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.3.0.
System parameters with non-default values:
processes                = 150
sga_max_size             = 1451229184
__shared_pool_size       = 176160768
__large_pool_size        = 8388608
__java_pool_size         = 8388608
__streams_pool_size      = 0
nls_language             = SIMPLIFIED CHINESE
nls_territory            = CHINA
sga_target               = 1451229184
control_files            = D:\ORACLE\PRODUCT\10.2.0\ORADATA\REPJSDC\CONTROL01.CTL, D:\ORACLE\PRODUCT\10.2.0\ORADATA\REPJSDC\CONTROL02.CTL, D:\ORACLE\PRODUCT\10.2.0\ORADATA\REPJSDC\CONTROL03.CTL
db_block_size            = 8192
__db_cache_size          = 1249902592
compatible               = 10.2.0.3.0
db_file_multiblock_read_count= 16
undo_management          = AUTO
undo_tablespace          = UNDOTBS1
remote_login_passwordfile= EXCLUSIVE
db_domain                =
job_queue_processes      = 10
audit_file_dest          = D:\ORACLE\PRODUCT\10.2.0\ADMIN\REPJSDC\ADUMP
background_dump_dest     = D:\ORACLE\PRODUCT\10.2.0\ADMIN\REPJSDC\BDUMP
user_dump_dest           = D:\ORACLE\PRODUCT\10.2.0\ADMIN\REPJSDC\UDUMP
core_dump_dest           = D:\ORACLE\PRODUCT\10.2.0\ADMIN\REPJSDC\CDUMP
db_name                  = repjsdc
open_cursors             = 300
pga_aggregate_target     = 203423744
PMON started with pid=2, OS id=1640
PSP0 started with pid=3, OS id=3588
MMAN started with pid=4, OS id=3988
DBW0 started with pid=5, OS id=2400
LGWR started with pid=6, OS id=1652
CKPT started with pid=7, OS id=3868
SMON started with pid=8, OS id=924
RECO started with pid=9, OS id=3664
CJQ0 started with pid=10, OS id=3916
MMON started with pid=11, OS id=2640
MMNL started with pid=12, OS id=3608
Thu Mar 21 09:27:25 2013
ALTER DATABASE   MOUNT
Thu Mar 21 09:27:29 2013
Setting recovery target incarnation to 1
Thu Mar 21 09:27:29 2013
Successful mount of redo thread 1, with mount id 526771261
Thu Mar 21 09:27:29 2013
Database mounted in Exclusive Mode
Completed: ALTER DATABASE   MOUNT
Thu Mar 21 09:27:30 2013
ALTER DATABASE OPEN
Thu Mar 21 09:27:30 2013
Beginning crash recovery of 1 threads
parallel recovery started with 7 processes
Thu Mar 21 09:27:30 2013
Started redo scan
Thu Mar 21 09:27:30 2013
Completed redo scan
383 redo blocks read, 101 data blocks need recovery
Thu Mar 21 09:27:30 2013
Started redo application at
Thread 1: logseq 20550, block 3
Thu Mar 21 09:27:30 2013
Recovery of Online Redo Log: Thread 1 Group 3 Seq 20550 Reading mem 0
Mem# 0: D:\ORACLE\PRODUCT\10.2.0\ORADATA\REPJSDC\REDO03.LOG
Thu Mar 21 09:27:30 2013
Completed redo application
Thu Mar 21 09:27:30 2013
Completed crash recovery at
Thread 1: logseq 20550, block 386, scn 73904193314
101 data blocks read, 101 data blocks written, 383 redo blocks read
Thu Mar 21 09:27:30 2013
Thread 1 advanced to log sequence 20551
Thread 1 opened at log sequence 20551
Current log# 1 seq# 20551 mem# 0: D:\ORACLE\PRODUCT\10.2.0\ORADATA\REPJSDC\REDO01.LOG
Successful open of redo thread 1
Thu Mar 21 09:27:30 2013
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Thu Mar 21 09:27:30 2013
SMON: enabling cache recovery
Thu Mar 21 09:27:31 2013
Successfully onlined Undo Tablespace 1.
Thu Mar 21 09:27:31 2013
SMON: enabling tx recovery
Thu Mar 21 09:27:31 2013
Database Characterset is ZHS16GBK
Thu Mar 21 09:27:31 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_smon_924.trc:
ORA-00600: 内部错误代码, 参数: [kcbgcur_9], [4194333], [1], [4294967250], [8192], [], [], []

replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
QMNC started with pid=21, OS id=3364
Thu Mar 21 09:27:31 2013
Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 1 out of maximum 100 non-fatal internal errors.
Thu Mar 21 09:27:31 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_smon_924.trc:
ORA-00600: 内部错误代码, 参数: [kcbgcur_9], [4194333], [1], [4294967250], [8192], [], [], []

Thu Mar 21 09:27:32 2013
Completed: ALTER DATABASE OPEN
Thu Mar 21 09:27:32 2013
Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 2 out of maximum 100 non-fatal internal errors.
Thu Mar 21 09:27:33 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_smon_924.trc:
ORA-00600: 内部错误代码, 参数: [kcbgcur_9], [4194333], [1], [4294967250], [8192], [], [], []

Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 3 out of maximum 100 non-fatal internal errors.
Thu Mar 21 09:27:34 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_smon_924.trc:
ORA-00600: 内部错误代码, 参数: [kcbgcur_9], [4194333], [1], [4294967250], [8192], [], [], []

Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 4 out of maximum 100 non-fatal internal errors.
Thu Mar 21 09:27:35 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_smon_924.trc:
ORA-00600: 内部错误代码, 参数: [kcbgcur_9], [4194333], [1], [4294967250], [8192], [], [], []

Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 5 out of maximum 100 non-fatal internal errors.
Thu Mar 21 09:27:46 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_smon_924.trc:
ORA-00600: 内部错误代码, 参数: [kcbgcur_9], [4194333], [1], [4294967250], [8192], [], [], []

Thu Mar 21 09:27:46 2013
Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 6 out of maximum 100 non-fatal internal errors.
Thu Mar 21 09:27:56 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_smon_924.trc:
ORA-00600: 内部错误代码, 参数: [kcbgcur_9], [4194333], [1], [4294967250], [8192], [], [], []

Thu Mar 21 09:27:56 2013
Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 7 out of maximum 100 non-fatal internal errors.
Thu Mar 21 09:27:59 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_smon_924.trc:
ORA-00600: 内部错误代码, 参数: [kcbgcur_9], [4194333], [1], [4294967250], [8192], [], [], []

Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 8 out of maximum 100 non-fatal internal errors.
Thu Mar 21 09:28:10 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_smon_924.trc:
ORA-00600: 内部错误代码, 参数: [kcbgcur_9], [4194333], [1], [4294967250], [8192], [], [], []

Thu Mar 21 09:28:13 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_pmon_1640.trc:
ORA-00474: SMON 进程因错误而终止

Thu Mar 21 09:28:13 2013
PMON: terminating instance due to error 474
Thu Mar 21 09:28:14 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_j000_2212.trc:
ORA-00474: SMON 进程因错误而终止

Thu Mar 21 09:28:14 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_j001_2404.trc:
ORA-00474: SMON 进程因错误而终止

Thu Mar 21 09:28:15 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_reco_3664.trc:
ORA-00474: SMON 进程因错误而终止

Thu Mar 21 09:28:15 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_ckpt_3868.trc:
ORA-00474: SMON 进程因错误而终止

Thu Mar 21 09:28:15 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_mman_3988.trc:
ORA-00474: SMON 进程因错误而终止

Thu Mar 21 09:28:15 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_psp0_3588.trc:
ORA-00474: SMON 进程因错误而终止

Thu Mar 21 09:28:15 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_lgwr_1652.trc:
ORA-00474: SMON 进程因错误而终止

Thu Mar 21 09:28:15 2013
Errors in file d:\oracle\product\10.2.0\admin\repjsdc\bdump\repjsdc_dbw0_2400.trc:
ORA-00474: SMON 进程因错误而终止

排查思路，之前处理过ORA-600的错误，此处的错误可能与之前的情况有相似点，按照undo表空间，损坏的方式处理，用隐含参数的方式启动数据库： alter system set "_allow_resetlogs_corruption"=true scope=spfile;之后重建usdo表空间。正常恢复后发现情况并没有解决。

此时变的束手无策，接下来查看metalink，上面的处理方式是针对ORA-00600: internal error code, arguments: [4194], [3], [3], [], [], [], [], []，这一报错的，是比较经典的问题，具体解决方式参考盖国强大师的微博http://www.eygle.com/archives/2006/02/howto_resolve_ora_600_4194.html，和杨廷昆的微博http://yangtingkun.itpub.net/post/468/497211的解决方式，但是仔细看完后发现这个问题和我的情况并不是一回事，此时查看metalink给出的几条解释，

SMON crashes the instance on startup with ORA-600 [kcbgcur_9]

You shall get into this issue in following scenario.

o A rollback segment is dropped.
o It belongs to locally managed tablespace.
o The rollback segment is marked as temporary for later cleanup.

1. The following query is to check if it is a Locally Managed tablespace.
SELECT EXTENT_MANAGEMENT FROM DBA_TABLESPACES WHERE TABLESPACE_NAME='&RBS';

--- Extent Management of the RBS tablespace is LOCAL
--- RBS = Rollback Segment tablespace

2. The following query will show the segments that are marked as TEMPORARY in the rollback segment tablespace.

SELECT SEGMENT_NAME, SEGMENT_TYPE FROM DBA_SEGMENTS WHERE SEGMENT_TYPE='TEMPORARY'
AND
TABLESPACE_NAME='&RBS';

--- Returns TEMPORARY object / objects..

Cause

o This is due to unpublished Bug:2202310 and applicable to the Oracle versions lesser than 9.0.1.4.0 (or) 9.2.0.1.0

o SMON crashes the instance with ORA-600 [kcbgcur_9] error, while clearing the a rollback segment that is left in TEMPORARY status for later cleanup in a Locally Managed tablespace.

Solution

o It is very rare to encounter this scenario.

o As this issue is not fixed in 8i releases we shall use the workarounds to avoid bug.

o Whenever a rollback segment is dropped check the DBA_SEGMENTS view and ensure there is no TEMPORARY segments in the rollback segment tablespace before creating a new rollback segment.

SELECT SEGMENT_NAME, SEGMENT_TYPE FROM DBA_SEGMENTS WHERE SEGMENT_TYPE='TEMPORARY'
AND
TABLESPACE_NAME='&RBS';

o For further assistance please log a tar with oracle support to investigate the issue.

@ To resolve the scenario following actions shall be tried.
@1. Stop SMON doing the TEMP segment cleanup. This is inorder to understand weather we fall into the above bug.
@event="10061 trace name context forever, level 10"
@2. If this allows the database to open, then drop the problem rollback segment tablespace.
@3. As the tablespace is dropped now, there wont be any conflicting cleanups for SMON.
@4. Recreate the RBS tablespace and new rollback segments.
@5. Remove the event from the init.ora

唯一找到这个相近的情况，描述的是8i中出现的bug，很少出现的bug，其中是本地管理表空间，而目前的环境并不是这样，但是给出的启示是smon清理temp segment或者某个segment的时候出现了故障。

此时仔细阅读alert报错，发现这样一句话

Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 1 out of maximum 100 non-fatal internal errors.

这个报错也和ora-600有关

When starting the database after a crash or for normal maintenance after a truncate table, the error below will occur 100 times and then the instance will crash.

The errors are caused because SMON is trying to clean a temp segment in a tablespace by looking at the bitmap index in the tablespace. The bitmap index is corrupted and this causes SMON to fail when trying to clean up the segment.

From the alert.log:

ORA-00600: internal error code, arguments: [5463], [], [], [], [], [], [], []
Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 53 out of maximum 100 non-fatal internal errors.

这个错误和我们的挺像，但是操作却不一样，我们借鉴下解决方式，

1. set event 10061 in the pfile to turn off SMON from cleaning temp segments. This will keep the database from crashing after SMON tries 100 times to clean the segment.

event = '10061 trace name context forever, level 10'

2. >select segment_name, segment_type from dba_segments where
header_file=98 and segment_type='TEMPORARY';

SEGMENT_NAME SEGMENT_TYPE
------------------- -----------------------
98.110124 TEMPORARY

98.110124

If there are no results make sure the datafile reference is not using the relative file number.

>select rfile#,name,ts#,file# from v$datafile;

3. Mark the segment as corrupted:
exec dbms_space_admin.segment_corrupt('',98,110124)

4. Drop the segment
exec dbms_space_admin.segment_drop_corrupt('',98,110124)

5. Rebuild all the bitmap indexes
exec dbms_space_admin.tablespace_rebuild_bitmaps('');

6. Remove event 10061 from pfile and bounce database

7. exec dbms_space_admin.tablespace_verify ('');

由于在启动过程中，很有可能破坏了数据库的一致性，建议执行exp后重新建库。

至此，问题解决了，总结一下，其实是一开始就走入了误区，才使得自己走了不少弯路，如果一开始就把入手点定在

Non-fatal internal error happenned while SMON was doing temporary segment drop.
SMON encountered 53 out of maximum 100 non-fatal internal errors.

这两句上，可能问题早就解决了。

问题分析：到了这步，结果变的很明朗了，SMON在实例启动的时候清理temp segment时被调用了53次后强行被关闭。我们手动管理了smon的清理功能。

下面是envent 10061的介绍：

Error:  ORA 10061
Text:   disable SMON from cleaning temp segment
-------------------------------------------------------------------------------

Explanation:
        This is NOT an error but is a special EVENT code.
        It should *NOT* be used unless explicitly requested by RD support.

        It allows you to prevent SMON from cleaning up temporary segments.
        If set in:

         7.0.X:
            This will prevent SMON performing cleanup of TEMP segments
            and from coalsecing extents. The cleanup action was performed
            every 3 minutes in 7.0.X releases.

        7.1.X:
            The event behaviour is modified in that TEMP segment cleanup
            is skipped on TIMEOUTS but will be performed if SMON is
            posted. The cleanup only occurs every 12 hours or when posted
            in 7.1.X releases.
            You can POST SMON by performing a failed CREATE TABLE command.

            A new event 10269 is for coalescing extents.

Usage:  
        event="10061 trace name context forever, level 10"

Articles:

@Archived Managing SMON Activity on multiple OPS instances   <45856.1>45856.1>

来自 “ ITPUB博客 ” ，链接：http://blog.itpub.net/16969562/viewspace-756783/，如需转载，请注明出处，否则将追究法律责任。

转载于:http://blog.itpub.net/16969562/viewspace-756783/