问题描述
某联通公司的一个系统主库由于电源故障存储宕机,电源故障恢复后由于存储存在坏块,数据库无法启动。于是根据应急预案切换到备库。在切换过程中出现问题,在没有分析清楚问题的情况下采用强制打开数据库的方式,导致备库中的12个数据文件在备库打开后处于OFFLINE DROP状态,和这12个文件相关的部分表无法访问。采用临时措施将这些表改名后重建这些表暂时维持业务系统的运行。客户希望:
找出DG切换失败的原因
将相关受损的数据表28日前的历史数据抢救出来
针对DG任何进行可用性检查以及切换时的处理步骤提出建议
DG切换故障分析
从日志上看,28日凌晨3:38开始主库和备库的通讯出现了异常:
Thu Apr 28 01:27:53 2011
Media Recovery Waiting for thread 1 seq# 87118 (in transit)
Thu Apr 28 03:38:53 2011
RFS: Possible network disconnect with primary database
Thu Apr 28 04:24:47 2011
RFS: Possible network disconnect with primary database
Thu Apr 28 04:26:35 2011
RFS: Possible network disconnect with primary database
Thu Apr 28 05:37:44 2011
alter database recover managed standby database cancel
Thu Apr 28 05:37:48 2011
MRP0: Background Media Recovery user canceled with status 16037
Recovery interrupted.
MRP0: Background Media Recovery process shutdown
5:37,操作人员进行了故障处理,首先结束恢复管理模式,然后以READ ONLY方式打开了数据库,该过程成功完成。
Thu Apr 28 05:37:44 2011
alter database recover managed standby database cancel
Thu Apr 28 05:37:48 2011
MRP0: Background Media Recovery user canceled with status 16037
Recovery interrupted.
MRP0: Background Media Recovery process shutdown
Thu Apr 28 05:37:51 2011
Managed Standby Recovery Cancelled
Completed: alter database recover managed standby database c
Thu Apr 28 05:37:53 2011
alter database open read only
6:16,数据库被成功关闭。
Thu Apr 28 06:16:09 2011
Shutting down instance: further logons disabled
Shutting down instance (immediate)
License high water mark = 8
Thu Apr 28 06:16:12 2011
ALTER DATABASE CLOSE NORMAL
Thu Apr 28 06:16:12 2011
SMON: disabling cache recovery
Thu Apr 28 06:16:13 2011
Completed: ALTER DATABASE CLOSE NORMAL
Thu Apr 28 06:16:13 2011
ALTER DATABASE DISMOUNT
Completed: ALTER DATABASE DISMOUNT
ARCH: Archiving is disabled
Shutting down archive processes
Archiving is disabled
Thu Apr 28 06:16:13 2011
ARCH shutting down
Thu Apr 28 06:16:13 2011
ARCH shutting down
Thu Apr 28 06:16:13 2011
ARC0: Archival stopped
随后数据库再次以STANDBY DB模式被MOUNT,然后备份了控制文件,随后成功关闭。
Thu Apr 28 06:16:42 2011
Standby Database mounted.
Completed: alter database mount standby database
Thu Apr 28 06:17:15 2011
alter database backup controlfile to trace as '/tmp/myconctltrace.ctl'
Completed: alter database backup controlfile to trace as '/tm
Thu Apr 28 06:17:20 2011
alter database backup controlfile to trace as '/tmp/myconctltrace.ctl.xxxxxx'
Completed: alter database backup controlfile to trace as '/tm
Shutting down instance: further logons disabled
Shutting down instance (immediate)
License high water mark = 3
Thu Apr 28 06:17:30 2011
ALTER DATABASE CLOSE NORMAL
ORA-1109 signalled during: ALTER DATABASE CLOSE NORMAL...
Thu Apr 28 06:17:30 2011
ALTER DATABASE DISMOUNT
Completed: ALTER DATABASE DISMOUNT
6:21:53重建了控制文件。
Thu Apr 28 06:21:53 2011
CREATE CONTROLFILE REUSE DATABASE "ICD1" NORESETLOGS ARCHIVELOG
-- SET STANDBY TO MAXIMIZE PERFORMANCE
MAXLOGFILES 5
MAXLOGMEMBERS 3
MAXDATAFILES 233
MAXINSTANCES 1
MAXLOGHISTORY 11344
LOGFILE
GROUP 1 (
'/dev/vg_index/rora9_redo111',
'/dev/vg_data/rora9_redo121'
) SIZE 280M,
GROUP 2 (
'/dev/vg_index/rora9_redo112',
'/dev/vg_data/rora9_redo122'
) SIZE 280M,
GROUP 3 (
'/dev/vg_data/rora9_redo131',
'/dev/vg_index/rora9_redo132'
) SIZE 280M
-- STANDBY LOGFILE
-- GROUP 4 (
-- '/dev/vg_index/rora9_sgredo111',
-- '/dev/vg_data/rora9_sgredo121'
-- ) SIZE 280M,
-- GROUP 5 (
-- '/dev/vg_index/rora9_sgredo112',
-- '/dev/vg_data/rora9_sgredo122'
-- ) SIZE 280M
DATAFILE
'/dev/vg_data/rora9_system',
'/dev/vg_index/rora9_rbs1',
……
……
'/dev/vg_index/rlv_data40'
CHARACTER SET ZHS16GBK6:27:02进行了switchover。
Thu Apr 28 06:25:30 2011
alter database mount standby database
Thu Apr 28 06:25:34 2011
Successful mount of redo thread 1, with mount id 1176475930
Thu Apr 28 06:25:34 2011
Standby Database mounted.
Completed: alter database mount standby database
Thu Apr 28 06:27:02 2011
alter database commit to switchover to primary
Thu Apr 28 06:27:02 2011
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY
Database not recovered through End-Of-REDO
Database not recovered through End-Of-REDO
Switchover: Media recovery required - standby not in limbo
ORA-16139 signalled during: alter database commit to switchover to primary...
Thu Apr 28 06:27:15 2011
Restarting dead background process QMN0
QMN0 started with pid=10
Thu Apr 28 06:28:02 2011
ALTER DATABASE RECOVER automatic standby database
Thu Apr 28 06:28:02 2011
Media Recovery Start
Thu Apr 28 06:28:02 2011
Media Recovery Log /tongns/arch/1_87118.dbf
Errors with log /tongns/arch/1_87118.dbf
ORA-279 signalled during: ALTER DATABASE RECOVER automatic standby database...
Thu Apr 28 06:28:10 2011
ALTER DATABASE RECOVER CONTINUE DEFAULT
Thu Apr 28 06:28:10 2011
Media Recovery Log /tongns/arch/1_87118.dbf
Errors with log /tongns/arch/1_87118.dbf
ORA-308 signalled during: ALTER DATABASE RECOVER CONTINUE DEFAULT ...
Thu Apr 28 06:28:10 2011
ALTER DATABASE RECOVER CONTINUE DEFAULT
Thu Apr 28 06:28:10 2011
Media Recovery Log /tongns/arch/1_87118.dbf
Errors with log /tongns/arch/1_87118.dbf
ORA-308 signalled during: ALTER DATABASE RECOVER CONTINUE DEFAULT ...
Thu Apr 28 06:28:10 2011
ALTER DATABASE RECOVER CANCEL
Thu Apr 28 06:28:11 2011
Media Recovery Cancelled
Completed: ALTER DATABASE RECOVER CANCEL
由于主库故障,因此SWITCHOVER操作在等待最新的REDO LOG信息时出现了故障,因此SWITCHOVER未能成功。
6:29:14,再次创建控制文件:
Thu Apr 28 06:29:14 2011
CREATE CONTROLFILE REUSE DATABASE "ICD1" NORESETLOGS ARCHIVELOG
-- SET STANDBY TO MAXIMIZE PERFORMANCE
MAXLOGFILES 5
MAXLOGMEMBERS 3
MAXDATAFILES 233
MAXINSTANCES 1
MAXLOGHISTORY 11344
LOGFILE
随后又多次创建控制文件。
6:32:34,以RESETLOGS方式重建控制文件
Thu Apr 28 06:32:34 2011
CREATE CONTROLFILE REUSE DATABASE "ICD1" RESETLOGS ARCHIVELOG
由于多次重建控制文件均报错,在控制文件生成的TRACE中,包含下面的信息:
# Take files offline to match current control file.
ALTER DATABASE DATAFILE '/dev/vg_data/rlv_data21' OFFLINE DROP;
ALTER DATABASE DATAFILE '/dev/vg_data/rlv_data22' OFFLINE DROP;
ALTER DATABASE DATAFILE '/dev/vg_data/rlv_data1' OFFLINE DROP;
ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data23' OFFLINE DROP;
ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data24' OFFLINE DROP;
ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data25' OFFLINE DROP;
ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data26' OFFLINE DROP;
ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data27' OFFLINE DROP;
ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data28' OFFLINE DROP;
ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data29' OFFLINE DROP;
ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data30' OFFLINE DROP;
为了顺利重建控制文件,6:38:27,DBA删除了有问题的文件:
Thu Apr 28 06:38:27 2011
ALTER DATABASE DATAFILE '/dev/vg_data/rlv_data21' OFFLINE DROP
Completed: ALTER DATABASE DATAFILE '/dev/vg_data/rlv_data21'
Thu Apr 28 06:38:32 2011
ALTER DATABASE DATAFILE '/dev/vg_data/rlv_data22' OFFLINE DROP
Completed: ALTER DATABASE DATAFILE '/dev/vg_data/rlv_data22'
Thu Apr 28 06:38:32 2011
ALTER DATABASE DATAFILE '/dev/vg_data/rlv_data1' OFFLINE DROP
Completed: ALTER DATABASE DATAFILE '/dev/vg_data/rlv_data1' O
Thu Apr 28 06:38:32 2011
ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data23' OFFLINE DROP
Completed: ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data23'
Thu Apr 28 06:38:32 2011
ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data24' OFFLINE DROP
Completed: ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data24'
Thu Apr 28 06:38:32 2011
ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data25' OFFLINE DROP
Completed: ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data25'
Thu Apr 28 06:38:32 2011
ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data26' OFFLINE DROP
Completed: ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data26'
Thu Apr 28 06:38:32 2011
ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data27' OFFLINE DROP
Completed: ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data27'
Thu Apr 28 06:38:32 2011
ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data28' OFFLINE DROP
Completed: ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data28'
Thu Apr 28 06:38:32 2011
ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data29' OFFLINE DROP
Completed: ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data29'
Thu Apr 28 06:38:34 2011
ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data30' OFFLINE DROP
Completed: ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data30'
Thu Apr 28 06:39:52 2011
重建控制文件后,数据库可以顺利打开,打开后发现有些表无法访问,经检查发现这些文件均为执行了OFFLINE DROP命令的文件。
-------------------------------------------------------------------------------------------------------------------------------
DG故障分析
由于备库已经OPEN,因此检查OFFLINE状态的文件,从文件头上我们可以看到:
NAME CHECKPOINT_CHANGE# CHECKPOIN STATUS
------------------------------ ------------------ --------- -------
/dev/vg_data/rlv_data21 12036594418511 10-FEB-11 OFFLINE
/dev/vg_data/rlv_data22 12036594418511 10-FEB-11 OFFLINE
/dev/vg_data/rlv_data1 0 OFFLINE
/dev/vg_index/rlv_data23 0 OFFLINE
/dev/vg_index/rlv_data24 0 OFFLINE
/dev/vg_index/rlv_data25 0 OFFLINE
/dev/vg_index/rlv_data26 0 OFFLINE
/dev/vg_index/rlv_data27 0 OFFLINE
/dev/vg_index/rlv_data28 0 OFFLINE
/dev/vg_index/rlv_data29 0 OFFLINE
/dev/vg_index/rlv_data30 0 OFFLINE
检查控制文件:
NAME CHECKPOINT_CHANGE# CHECKPOIN STATUS
------------------------------ ------------------ --------- -------
/dev/vg_data/rlv_data21 0 OFFLINE
/dev/vg_data/rlv_data22 0 OFFLINE
/dev/vg_data/rlv_data1 0 OFFLINE
/dev/vg_index/rlv_data23 0 OFFLINE
/dev/vg_index/rlv_data24 0 OFFLINE
/dev/vg_index/rlv_data25 0 OFFLINE
/dev/vg_index/rlv_data26 0 OFFLINE
/dev/vg_index/rlv_data27 0 OFFLINE
/dev/vg_index/rlv_data28 0 OFFLINE
/dev/vg_index/rlv_data29 0 OFFLINE
/dev/vg_index/rlv_data30 0 OFFLINE
从上述信息看,上述数据文件的状态是不正确的,即使不做OFFLINE DROP,也无法使用。有些文件甚至是空文件。
通过对相关文件在ALERT LOG中的日志信息的分析发现这些文件都是采用手工文件管理方式手工处理的,而不是DG数据文件自动管理产生的。从日志中可以看到:
ALTER SYSTEM SET standby_file_management='MANUAL' SCOPE=BOTH;
Mon Mar 7 20:35:49 2011
alter database datafile '/ora10gsy/product/10.2/dbs/UNNAMED00109' offline drop
Mon Mar 7 20:35:49 2011
ORA-1516 signalled during: alter database datafile '/ora10gsy/product/10.2/d...
Mon Mar 7 20:36:11 2011
alter database datafile '/opt/oracle/product/9.2/dbs/UNNAMED00109' offline drop
Mon Mar 7 20:36:12 2011
Completed: alter database datafile '/opt/oracle/product/9.2/d
Mon Mar 7 20:36:48 2011
alter database rename file '/opt/oracle/product/9.2/dbs/UNNAMED00109' to '/dev/vg_data/rlv_data21'
Mon Mar 7 20:36:48 2011
ORA-1511 signalled during: alter database rename file '/opt/oracle/product/9....
Mon Mar 7 20:37:06 2011
ALTER SYSTEM SET standby_file_management='MANUAL' SCOPE=BOTH;
Mon Mar 7 20:37:09 2011
alter database rename file '/opt/oracle/product/9.2/dbs/UNNAMED00109' to '/dev/vg_data/rlv_data21'
Mon Mar 7 20:37:09 2011
Completed: alter database rename file '/opt/oracle/product/
Mon Mar 7 20:37:17 2011
ALTER SYSTEM SET standby_file_management='AUTO' SCOPE=BOTH;
首先standby_file_management被设置为manual,然后将数据文件进行重新命名。经过和相关操作人员确认,是因为主库创建新文件时,备库的裸设备不存在,从而导致在备库中生成了一个UNNAMEDxxxxxx的文件,为了将这个文件重新迁移到裸设备上,做了上述操作。
经过确认,采用这种方式处理数据文件缺少了一个步骤,就是执行alter database create datafile ... As '....';
从而导致该数据文件一直处于未完成修改的状态,因此DATAGUARD一直未对此文件进行REDO APPLY。并且在DATAGUARD中查看这些文件的状态是RECOVER状态。实际上这些文件大多数是空文件,里面并无任何数据。
DG故障分析结论
由于DATAGUARD中的这些文件本身存在问题,导致这类文件处于非正常状态,在DATAGURAD切换后,无法实现这些文件的切换。由于这些文件中有些文件需要1月份以来的REDO LOG,有些需要4月7日的REDO LOG,由于这些归档日志已经被删除,当时已经无条件进行修复。
针对DG/ADG运维的建议:
尽可能在主库添加数据文件前,在DG上准备好相关裸设备
如果出现忘记准备裸设备的情况,处理过程要严格按照官方要求(参考后面的建议部分)
定期检查DG,及时发现SATTUS为RECOVER状态的文件,及时进行纠正
作者:白鳝