今天在一个报表数据库后台发现了这个错误。简单描述一下问题的解决过程。
详细的错误信息为:
Fri Feb 20 08:16:44 2009 Errors in file /u1/oracle/admin/repdb01/bdump/repdb01_j015_5099.trc: ORA-00600: internal error code, arguments: [17069], [0x6A5DEE1E0], [], [], [], [], [], [] Fri Feb 20 08:16:47 2009 Errors in file /u1/oracle/admin/repdb01/bdump/repdb01_j015_5099.trc: ORA-00600: internal error code, arguments: [17069], [0x6A5DEE1E0], [], [], [], [], [], []
进一步检查对应的trace文件:
bash-2.03$ more /u1/oracle/admin/repdb01/bdump/repdb01_j015_5099.trc /u1/oracle/admin/repdb01/bdump/repdb01_j015_5099.trcOracle 9i Enterprise Edition Release 9.2.0.4.0 - 64bit Production With the Partitioning, OLAP and Oracle Data Mining options JServer Release 9.2.0.4.0 - Production ORACLE_HOME = /data/oracle/product/920 System name: SunOS Node name: newreport Release: 5.8 Version: Generic_117350-26 Machine: sun4u Instance name: repdb01 Redo thread mounted by this instance: 1 Oracle process number: 35 Unix process pid: 5099, image: oracle@newreport (J015)
*** SESSION ID:(12.28191) 2009-02-20 08:16:44.060 *** 2009-02-20 08:16:44.060 ksedmp: internal or fatal error ORA-00600: internal error code, arguments: [17069], [0x6A5DEE1E0], [], [], [], [], [], [] Current SQL statement for this session: DECLARE job BINARY_INTEGER := :job; next_date DATE := :mydate; broken BOOLEAN := FALSE; BEGIN P_GENERATE_REPDATA('FR20T000002000000 0000032'); :mydate := next_date; IF broken THEN :b := 1; ELSE :b := 0; END IF; END; ----- Call Stack Trace ----- calling call entry argument values in hex location type point (? means dubious value) -------------------- -------- -------------------- ---------------------------- ksedmp()+328 CALL ksedst()+0 FFFFFFFF7FFF6430 ? 000000000 ? 000000000 ? 00000003E ? FFFFFFFF7FFF6CC8 ? 1031D56C8 ? kgeriv()+208 PTR_CALL 0000000000000000 000000000 ? 000103400 ? 0001035D9 ? 000102C00 ? 1035D9000 ? 1035D9C28 ? kgesiv()+108 CALL kgeriv()+0 1035D9E88 ? 1036C7148 ? 000000258 ? 0000013C8 ? FFFFFFFF7FFF7608 ? 1035DB258 ? kgesic1()+32 CALL kgesiv()+0 1035D9E88 ? 1036C7148 ? 0000042AD ? 000000001 ? FFFFFFFF7FFF7608 ? 004000000 ? kglgob()+1972 CALL kgesic1()+0 1035D9E88 ? 1036C7148 ? 0000042AD ? 000000002 ? 6A5DEE1E0 ? 0000010A0 ? kgldpo()+524 CALL kglgob()+0 000000000 ? 000000000 ? 6A5DEE1E0 ? FFFFFFFF7FFF77A8 ? 000080000 ? 000000010 ? kgldon()+248 CALL kgldpo()+0 000000000 ? 000000000 ? 69A79DA60 ? 000000001 ? 000000002 ? FFFFFFFF7FFF7BEE ? pkldon()+108 CALL kgldon()+0 1035D9E88 ? FFFFFFFF7FFF7DE0 ? 69A79DA60 ? 000000001 ? 000000000 ? FFFFFFFF7FFF7D8E ? pkloud()+204 CALL pkldon()+0 FFFFFFFF7FFFA1A0 ? FFFFFFFF7FFF7DE0 ? 69A79DA60 ? 000000001 ? 000000000 ? FFFFFFFF7FFF7D8E ? phnnrl_name_resolve CALL pkloud()+0 1033FCA90 ? _by_loading()+280 FFFFFFFF7FFF7E3C ? 000000000 ? 000000000 ? 000030000 ? 6A475CE18 ? phngdl_get_defining CALL phnnrl_name_resolve 000000000 ? 000020015 ? _libunit()+124 _by_loading()+0 FFFFFFFF7FFF9830 ? FFFFFFFF7FFF8160 ? 000020015 ? 000000000 ? phnrpls_resolve_pre CALL phngdl_get_defining FFFFFFFF7FFF9830 ? fix_libscope()+12 _libunit()+0 FFFFFFFF7FFF85A8 ? FFFFFFFF7FFF843C ? 000000000 ? 000000000 ? 000000000 ?
无论是从trace文件对应的名称,还是从trace文件中对应的语句都可以确定,引起问题的是一个JOB.检查metalink,Oracle在文档Doc ID: 39616.1中对这个错误的已知bug,进行了汇总,不过这些bug的描述似乎没有和当前十分相符的。查看文档的描述,发现ORA-600错误的第二个参数,这里是0x6A5DEE1E0,代表Library Cache Object Handle.看来问题可能和LATCH有关。
但是根据信息在V$LATCH和V$LATCH_CHILDREN视图中,没有找到有价值的信息。
这个JOB由于失败会自动再次执行,检查JOB运行时的V$LOCK信息:
SQL> SELECT ADDR, TYPE, ID1, ID2, LMODE, REQUEST, BLOCK 2 FROM V$LOCK 3 WHERE SID = 75;
ADDR TY ID1 ID2 LMODE REQUEST BLOCK ---------------- -- ---------- ---------- ---------- ---------- ---------- 0000000690342780 CU -1.703E+09 6 6 0 0 00000006903426F8 JQ 0 63 6 0 0
从V$LOCK 中看不到什么特别有价值的信息,接着检查V$SESSION_WAIT ,看看这个JOB 在等待什么:
SQL> SELECT EVENT, P1TEXT, P1RAW, P2TEXT, P2RAW, STATE 2 FROM V$SESSION_WAIT 3 WHERE SID = 75;
EVENT P1TEXT P1RAW P2TEXT P2RAW STATE ----------------- --------------- ---------------- ------------ ---------------- ------- library cache pin handle address 00000006A5DEE1E0 pin address 00000006B1A971A8 WAITING
这次的信息就明显了,ORA-600错误的第二个参数就是V$SESSION_WAIT视图的P1RAW的值,而且从等待事件上也可以看到,问题就是出现在LIBRARY CACHE PIN的过程中。
重新查看METALINK的信息,这个错误可能发生在一个长时间运行的进程,在其运行过程中,所依赖的对象被编译或者删除了。
检查JOB调用的过程的状态:
SQL> SELECT OWNER, OBJECT_NAME, OBJECT_TYPE, STATUS 2 FROM DBA_OBJECTS 3 WHERE OWNER = 'FUJIANREP' 4 AND OBJECT_NAME = 'P_GENERATE_REPDATA';
OWNER OBJECT_NAME OBJECT_TYPE STATUS ------------------------------ ------------------------------ ------------------ ------- FUJIANREP
果然问题过程处于不正常的状态。
将JOB至于BROKEN状态,避免JOB再次运行:
SQL> EXEC DBMS_JOB.BROKEN(63, TRUE)
PL/SQL procedure successfully completed.
SQL> COMMIT;
Commit complete.
杀掉JOB 对应的PROCESS :
SQL> SELECT SPID FROM V$PROCESS WHERE ADDR IN (SELECT PADDR FROM V$SESSION WHERE SID = 75);
SPID ------------ 14927
SQL> HOST kill -9 14927
下面用重新编译该过程:
SQL> ALTER PROCEDURE P_GENERATE_REPDATA COMPILE; ALTER PROCEDURE P_GENERATE_REPDATA COMPILE * ERROR at line 1: ORA-04021: timeout occurred while waiting to lock object FUJIANREP.P_GENERATE_REPDATA
由于从V$LOCK和V$LATCH无法得到信息,只能看看有没有其他人当前在访问P_GENERATE_REPDATA所依赖的对象:
SQL> SELECT * FROM V$ACCESS 2 WHERE (OWNER, OBJECT) IN 3 (SELECT REFERENCED_OWNER, REFERENCED_NAME 4 FROM DBA_DEPENDENCIES 5 WHERE OWNER = 'FUJIANREP' 6 AND NAME = 'P_GENERATE_REPDATA');
SID OWNER OBJECT TYPE ---------- ------------------------------ ------------------------------ ------------------------ 54 FUJIANREP CAT_BUYER SYNONYM 54 FUJIANREP CAT_CATEGORY SYNONYM 54 FUJIANREP CAT_DOSEAGE_FORM SYNONYM 54 FUJIANREP CAT_DRUG SYNONYM 54 FUJIANREP CAT_ENTERPRISE SYNONYM 54 FUJIANREP CAT_METRIC SYNONYM 54 FUJIANREP CAT_ORG SYNONYM 54 FUJIANREP CAT_PRODUCT SYNONYM 54 FUJIANREP CAT_QUALITY_DEFINE SYNONYM 54 FUJIANREP GOV_CAT_BUYER TABLE 54 FUJIANREP GOV_CAT_ENTERPRISE TABLE 54 FUJIANREP GOV_S_MO_BU TABLE 54 FUJIANREP GOV_S_MO_BU_EN TABLE 54 FUJIANREP GOV_S_MO_BU_PR TABLE 54 FUJIANREP GOV_S_MO_EN TABLE 54 FUJIANREP GOV_S_MO_ME TABLE 54 FUJIANREP GOV_S_MO_ME_CA TABLE 54 FUJIANREP GOV_S_MO_ME_PR TABLE 54 FUJIANREP GOV_S_MO_ORDER TABLE 54 FUJIANREP GOV_S_YE_ORDER TABLE 54 FUJIANREP GRP_HOSPITAL TABLE 54 FUJIANREP GRP_LEVEL TABLE 54 FUJIANREP ORD_ORDER TABLE 54 FUJIANREP ORD_ORDER_ITEM TABLE 54 FUJIANREP ORD_ORDER_ITEM_REP CURSOR 54 FUJIANREP ORD_ORDER_RECEIVE TABLE 54 FUJIANREP ORD_ORDER_RECEIVE_REP SYNONYM 54 FUJIANREP ORD_ORDER_REP CURSOR 54 FUJIANREP ORD_ORDER_RETURN TABLE 54 FUJIANREP ORD_ORDER_RETURN_REP CURSOR 54 FUJIANREP PLT_PLAT CURSOR 54 FUJIANREP USER_TAB_PARTITIONS CURSOR 54 NDMAIN CAT_BUYER TABLE 54 NDMAIN CAT_CATEGORY TABLE 54 NDMAIN CAT_DOSEAGE_FORM TABLE 54 NDMAIN CAT_DRUG TABLE 54 NDMAIN CAT_ENTERPRISE TABLE 54 NDMAIN CAT_METRIC TABLE 54 NDMAIN CAT_ORG TABLE 54 NDMAIN CAT_PRODUCT TABLE 54 NDMAIN CAT_QUALITY_DEFINE TABLE 54 NDMAIN ORD_ORDER VIEW 54 NDMAIN ORD_ORDER_ITEM VIEW 54 NDMAIN ORD_ORDER_RECEIVE VIEW 54 NDMAIN ORD_ORDER_RETURN VIEW 54 NDMAIN PLT_PLAT TABLE 54 PUBLIC USER_TAB_PARTITIONS SYNONYM 54 SYS STANDARD PACKAGE 145 SYS STANDARD PACKAGE 54 SYS SYS_STUB_FOR_PURITY_ANALYSIS PACKAGE 54 SYS USER_TAB_PARTITIONS VIEW
51 rows selected.
对象果然被其他人所访问,看看这个会话在做什么:
SQL> SELECT SID, SERIAL#, USERNAME, PROGRAM, TERMINAL 2 FROM V$SESSION 3 WHERE SID = 54;
SID SERIAL# USERNAME PROGRAM TERMINAL ---------- ---------- ------------------------------ ------------ ---------- 54 26216 FUJIANREP PlSqlDev.exe LIBY
没想到是同事的连接的会话,看看他在干什么:
SQL> SELECT SQL_TEXT FROM V$SQL 2 WHERE ADDRESS IN 3 (SELECT SQL_ADDRESS FROM V$SESSION 4 WHERE SID = 54);
SQL_TEXT --------------------------------------------------------------------- ALTER TABLE GOV_S_MO_EN TRUNCATE PARTITION P200901 居然是TRUNCATE分区操作,难怪会导致过程处于INVALID状态,不过这个操作应该不会持续很长时间的,难道这个操作一直没有完成吗:
SQL> SELECT EVENT, P1TEXT, P1, P2TEXT, P2, P3TEXT, P3, SECONDS_IN_WAIT 2 FROM V$SESSION_WAIT WHERE SID = 54;
EVENT P1TEXT P1 P2TEXT P2 P3TEXT P3 SECONDS_IN_WAIT ------------------------- ------- ---- -------- -------- -------- ---- --------------- db file sequential read file# 1 block# 170158 blocks 1 3995643
这个等待已经发生了几十天了,显然这是一个僵死的会话。
从后台kill掉对应的进程:
SQL> SELECT SPID FROM V$PROCESS 2 WHERE ADDR IN (SELECT PADDR FROM V$SESSION WHERE SID = 54);
SPID ------------ 12974
SQL> HOST kill -9 12974
切换为FUJIANREP用户,再次编译过程:
SQL> ALTER PROCEDURE P_GENERATE_REPDATA COMPILE;
Procedure altered.
至此,问题解决。将JOB重新设置BROKEN即可。
SQL> EXEC DBMS_JOB.BROKEN(63, FALSE)
PL/SQL procedure successfully completed.
SQL> COMMIT;
Commit complete.