DATAGUARD健康不容忽视-DG故障案例

文章讲述了联通公司系统主库因电源故障引发的DG切换失败案例,详细分析了故障原因,涉及数据文件的OFFLINEDROP问题、控制文件重建及手动管理操作的不当,最后给出了DG运维的改进建议。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

问题描述

某联通公司的一个系统主库由于电源故障存储宕机,电源故障恢复后由于存储存在坏块,数据库无法启动。于是根据应急预案切换到备库。在切换过程中出现问题,在没有分析清楚问题的情况下采用强制打开数据库的方式,导致备库中的12个数据文件在备库打开后处于OFFLINE DROP状态,和这12个文件相关的部分表无法访问。采用临时措施将这些表改名后重建这些表暂时维持业务系统的运行。客户希望:

找出DG切换失败的原因

将相关受损的数据表28日前的历史数据抢救出来

针对DG任何进行可用性检查以及切换时的处理步骤提出建议

DG切换故障分析

从日志上看,28日凌晨3:38开始主库和备库的通讯出现了异常:

Thu Apr 28 01:27:53 2011

Media Recovery Waiting for thread 1 seq# 87118 (in transit)

Thu Apr 28 03:38:53 2011

RFS: Possible network disconnect with primary database

Thu Apr 28 04:24:47 2011

RFS: Possible network disconnect with primary database

Thu Apr 28 04:26:35 2011

RFS: Possible network disconnect with primary database

Thu Apr 28 05:37:44 2011

 alter database recover managed standby database cancel

Thu Apr 28 05:37:48 2011

MRP0: Background Media Recovery user canceled with status 16037

Recovery interrupted.

MRP0: Background Media Recovery process shutdown

5:37,操作人员进行了故障处理,首先结束恢复管理模式,然后以READ ONLY方式打开了数据库,该过程成功完成。

Thu Apr 28 05:37:44 2011

 alter database recover managed standby database cancel

Thu Apr 28 05:37:48 2011

MRP0: Background Media Recovery user canceled with status 16037

Recovery interrupted.

MRP0: Background Media Recovery process shutdown

Thu Apr 28 05:37:51 2011

Managed Standby Recovery Cancelled

Completed:  alter database recover managed standby database c

Thu Apr 28 05:37:53 2011

 alter database open read only

6:16,数据库被成功关闭。

Thu Apr 28 06:16:09 2011

Shutting down instance: further logons disabled

Shutting down instance (immediate)

License high water mark = 8

Thu Apr 28 06:16:12 2011

ALTER DATABASE CLOSE NORMAL

Thu Apr 28 06:16:12 2011

SMON: disabling cache recovery

Thu Apr 28 06:16:13 2011

Completed: ALTER DATABASE CLOSE NORMAL

Thu Apr 28 06:16:13 2011

ALTER DATABASE DISMOUNT

Completed: ALTER DATABASE DISMOUNT

ARCH: Archiving is disabled

Shutting down archive processes

Archiving is disabled

Thu Apr 28 06:16:13 2011

ARCH shutting down

Thu Apr 28 06:16:13 2011

ARCH shutting down

Thu Apr 28 06:16:13 2011

ARC0: Archival stopped

随后数据库再次以STANDBY DB模式被MOUNT,然后备份了控制文件,随后成功关闭。

Thu Apr 28 06:16:42 2011

Standby Database mounted.

Completed: alter database mount standby database

Thu Apr 28 06:17:15 2011

alter database backup controlfile to trace as '/tmp/myconctltrace.ctl'

Completed: alter database backup controlfile to trace as '/tm

Thu Apr 28 06:17:20 2011

alter database backup controlfile to trace as '/tmp/myconctltrace.ctl.xxxxxx'

Completed: alter database backup controlfile to trace as '/tm

Shutting down instance: further logons disabled

Shutting down instance (immediate)

License high water mark = 3

Thu Apr 28 06:17:30 2011

ALTER DATABASE CLOSE NORMAL

ORA-1109 signalled during: ALTER DATABASE CLOSE NORMAL...

Thu Apr 28 06:17:30 2011

ALTER DATABASE DISMOUNT

Completed: ALTER DATABASE DISMOUNT

6:21:53重建了控制文件。

Thu Apr 28 06:21:53 2011

CREATE CONTROLFILE REUSE DATABASE "ICD1" NORESETLOGS  ARCHIVELOG

--  SET STANDBY TO MAXIMIZE PERFORMANCE

    MAXLOGFILES 5

    MAXLOGMEMBERS 3

    MAXDATAFILES 233

    MAXINSTANCES 1

    MAXLOGHISTORY 11344

LOGFILE

  GROUP 1 (

    '/dev/vg_index/rora9_redo111',

    '/dev/vg_data/rora9_redo121'

  ) SIZE 280M,

  GROUP 2 (

    '/dev/vg_index/rora9_redo112',

    '/dev/vg_data/rora9_redo122'

  ) SIZE 280M,

  GROUP 3 (

    '/dev/vg_data/rora9_redo131',

    '/dev/vg_index/rora9_redo132'

  ) SIZE 280M

-- STANDBY LOGFILE

--   GROUP 4 (

--     '/dev/vg_index/rora9_sgredo111',

--     '/dev/vg_data/rora9_sgredo121'

--   ) SIZE 280M,

--   GROUP 5 (

--     '/dev/vg_index/rora9_sgredo112',

--     '/dev/vg_data/rora9_sgredo122'

--   ) SIZE 280M

DATAFILE

  '/dev/vg_data/rora9_system',

  '/dev/vg_index/rora9_rbs1',

   ……

……

  '/dev/vg_index/rlv_data40'

CHARACTER SET ZHS16GBK6:27:02进行了switchover。

Thu Apr 28 06:25:30 2011

 alter database mount standby database

Thu Apr 28 06:25:34 2011

Successful mount of redo thread 1, with mount id 1176475930

Thu Apr 28 06:25:34 2011

Standby Database mounted.

Completed:  alter database mount standby database

Thu Apr 28 06:27:02 2011

alter database commit to switchover to primary

Thu Apr 28 06:27:02 2011

ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY

Database not recovered through End-Of-REDO

Database not recovered through End-Of-REDO

Switchover: Media recovery required - standby not in limbo

ORA-16139 signalled during: alter database commit to switchover to primary...

Thu Apr 28 06:27:15 2011

Restarting dead background process QMN0

QMN0 started with pid=10

Thu Apr 28 06:28:02 2011

ALTER DATABASE RECOVER  automatic standby database  

Thu Apr 28 06:28:02 2011

Media Recovery Start

Thu Apr 28 06:28:02 2011

Media Recovery Log /tongns/arch/1_87118.dbf

Errors with log /tongns/arch/1_87118.dbf

ORA-279 signalled during: ALTER DATABASE RECOVER  automatic standby database...

Thu Apr 28 06:28:10 2011

ALTER DATABASE RECOVER    CONTINUE DEFAULT  

Thu Apr 28 06:28:10 2011

Media Recovery Log /tongns/arch/1_87118.dbf

Errors with log /tongns/arch/1_87118.dbf

ORA-308 signalled during: ALTER DATABASE RECOVER    CONTINUE DEFAULT  ...

Thu Apr 28 06:28:10 2011

ALTER DATABASE RECOVER    CONTINUE DEFAULT  

Thu Apr 28 06:28:10 2011

Media Recovery Log /tongns/arch/1_87118.dbf

Errors with log /tongns/arch/1_87118.dbf

ORA-308 signalled during: ALTER DATABASE RECOVER    CONTINUE DEFAULT  ...

Thu Apr 28 06:28:10 2011

ALTER DATABASE RECOVER CANCEL

Thu Apr 28 06:28:11 2011

Media Recovery Cancelled

Completed: ALTER DATABASE RECOVER CANCEL

由于主库故障,因此SWITCHOVER操作在等待最新的REDO LOG信息时出现了故障,因此SWITCHOVER未能成功。

6:29:14,再次创建控制文件:

Thu Apr 28 06:29:14 2011

CREATE CONTROLFILE REUSE DATABASE "ICD1" NORESETLOGS  ARCHIVELOG

--  SET STANDBY TO MAXIMIZE PERFORMANCE

    MAXLOGFILES 5

    MAXLOGMEMBERS 3

    MAXDATAFILES 233

    MAXINSTANCES 1

    MAXLOGHISTORY 11344

LOGFILE

随后又多次创建控制文件。

6:32:34,以RESETLOGS方式重建控制文件

Thu Apr 28 06:32:34 2011

CREATE CONTROLFILE REUSE DATABASE "ICD1" RESETLOGS  ARCHIVELOG

由于多次重建控制文件均报错,在控制文件生成的TRACE中,包含下面的信息:

# Take files offline to match current control file.

ALTER DATABASE DATAFILE '/dev/vg_data/rlv_data21' OFFLINE DROP;

ALTER DATABASE DATAFILE '/dev/vg_data/rlv_data22' OFFLINE DROP;

ALTER DATABASE DATAFILE '/dev/vg_data/rlv_data1' OFFLINE DROP;

ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data23' OFFLINE DROP;

ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data24' OFFLINE DROP;

ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data25' OFFLINE DROP;

ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data26' OFFLINE DROP;

ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data27' OFFLINE DROP;

ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data28' OFFLINE DROP;

ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data29' OFFLINE DROP;

ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data30' OFFLINE DROP;

为了顺利重建控制文件,6:38:27,DBA删除了有问题的文件:

Thu Apr 28 06:38:27 2011

ALTER DATABASE DATAFILE '/dev/vg_data/rlv_data21' OFFLINE DROP

Completed: ALTER DATABASE DATAFILE '/dev/vg_data/rlv_data21'

Thu Apr 28 06:38:32 2011

ALTER DATABASE DATAFILE '/dev/vg_data/rlv_data22' OFFLINE DROP

Completed: ALTER DATABASE DATAFILE '/dev/vg_data/rlv_data22'

Thu Apr 28 06:38:32 2011

ALTER DATABASE DATAFILE '/dev/vg_data/rlv_data1' OFFLINE DROP

Completed: ALTER DATABASE DATAFILE '/dev/vg_data/rlv_data1' O

Thu Apr 28 06:38:32 2011

ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data23' OFFLINE DROP

Completed: ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data23'

Thu Apr 28 06:38:32 2011

ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data24' OFFLINE DROP

Completed: ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data24'

Thu Apr 28 06:38:32 2011

ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data25' OFFLINE DROP

Completed: ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data25'

Thu Apr 28 06:38:32 2011

ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data26' OFFLINE DROP

Completed: ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data26'

Thu Apr 28 06:38:32 2011

ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data27' OFFLINE DROP

Completed: ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data27'

Thu Apr 28 06:38:32 2011

ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data28' OFFLINE DROP

Completed: ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data28'

Thu Apr 28 06:38:32 2011

ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data29' OFFLINE DROP

Completed: ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data29'

Thu Apr 28 06:38:34 2011

ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data30' OFFLINE DROP

Completed: ALTER DATABASE DATAFILE '/dev/vg_index/rlv_data30'

Thu Apr 28 06:39:52 2011

重建控制文件后,数据库可以顺利打开,打开后发现有些表无法访问,经检查发现这些文件均为执行了OFFLINE DROP命令的文件。

-------------------------------------------------------------------------------------------------------------------------------

DG故障分析

由于备库已经OPEN,因此检查OFFLINE状态的文件,从文件头上我们可以看到:

NAME                           CHECKPOINT_CHANGE# CHECKPOIN STATUS

------------------------------ ------------------ --------- -------

/dev/vg_data/rlv_data21            12036594418511 10-FEB-11 OFFLINE

/dev/vg_data/rlv_data22            12036594418511 10-FEB-11 OFFLINE

/dev/vg_data/rlv_data1                          0           OFFLINE

/dev/vg_index/rlv_data23                        0           OFFLINE

/dev/vg_index/rlv_data24                        0           OFFLINE

/dev/vg_index/rlv_data25                        0           OFFLINE

/dev/vg_index/rlv_data26                        0           OFFLINE

/dev/vg_index/rlv_data27                        0           OFFLINE

/dev/vg_index/rlv_data28                        0           OFFLINE

/dev/vg_index/rlv_data29                        0           OFFLINE

/dev/vg_index/rlv_data30                        0           OFFLINE

检查控制文件:

NAME                           CHECKPOINT_CHANGE# CHECKPOIN STATUS

------------------------------ ------------------ --------- -------

/dev/vg_data/rlv_data21                         0           OFFLINE

/dev/vg_data/rlv_data22                         0           OFFLINE

/dev/vg_data/rlv_data1                          0           OFFLINE

/dev/vg_index/rlv_data23                        0           OFFLINE

/dev/vg_index/rlv_data24                        0           OFFLINE

/dev/vg_index/rlv_data25                        0           OFFLINE

/dev/vg_index/rlv_data26                        0           OFFLINE

/dev/vg_index/rlv_data27                        0           OFFLINE

/dev/vg_index/rlv_data28                        0           OFFLINE

/dev/vg_index/rlv_data29                        0           OFFLINE

/dev/vg_index/rlv_data30                        0           OFFLINE

从上述信息看,上述数据文件的状态是不正确的,即使不做OFFLINE DROP,也无法使用。有些文件甚至是空文件。

通过对相关文件在ALERT LOG中的日志信息的分析发现这些文件都是采用手工文件管理方式手工处理的,而不是DG数据文件自动管理产生的。从日志中可以看到:

  ALTER SYSTEM SET standby_file_management='MANUAL' SCOPE=BOTH;

Mon Mar  7 20:35:49 2011

 alter database datafile '/ora10gsy/product/10.2/dbs/UNNAMED00109' offline drop

Mon Mar  7 20:35:49 2011

ORA-1516 signalled during:  alter database datafile '/ora10gsy/product/10.2/d...

Mon Mar  7 20:36:11 2011

alter database datafile '/opt/oracle/product/9.2/dbs/UNNAMED00109' offline drop

Mon Mar  7 20:36:12 2011

Completed: alter database datafile '/opt/oracle/product/9.2/d

Mon Mar  7 20:36:48 2011

alter database rename file '/opt/oracle/product/9.2/dbs/UNNAMED00109' to  '/dev/vg_data/rlv_data21'

Mon Mar  7 20:36:48 2011

ORA-1511 signalled during: alter database rename file '/opt/oracle/product/9....

Mon Mar  7 20:37:06 2011

ALTER SYSTEM SET standby_file_management='MANUAL' SCOPE=BOTH;

Mon Mar  7 20:37:09 2011

  alter database rename file '/opt/oracle/product/9.2/dbs/UNNAMED00109' to  '/dev/vg_data/rlv_data21'

Mon Mar  7 20:37:09 2011

Completed:   alter database rename file '/opt/oracle/product/

Mon Mar  7 20:37:17 2011

ALTER SYSTEM SET standby_file_management='AUTO' SCOPE=BOTH;

首先standby_file_management被设置为manual,然后将数据文件进行重新命名。经过和相关操作人员确认,是因为主库创建新文件时,备库的裸设备不存在,从而导致在备库中生成了一个UNNAMEDxxxxxx的文件,为了将这个文件重新迁移到裸设备上,做了上述操作。

经过确认,采用这种方式处理数据文件缺少了一个步骤,就是执行alter database create datafile ... As '....';

从而导致该数据文件一直处于未完成修改的状态,因此DATAGUARD一直未对此文件进行REDO APPLY。并且在DATAGUARD中查看这些文件的状态是RECOVER状态。实际上这些文件大多数是空文件,里面并无任何数据。

DG故障分析结论

 

由于DATAGUARD中的这些文件本身存在问题,导致这类文件处于非正常状态,在DATAGURAD切换后,无法实现这些文件的切换。由于这些文件中有些文件需要1月份以来的REDO LOG,有些需要4月7日的REDO LOG,由于这些归档日志已经被删除,当时已经无条件进行修复。

 

针对DG/ADG运维的建议:

尽可能在主库添加数据文件前,在DG上准备好相关裸设备

如果出现忘记准备裸设备的情况,处理过程要严格按照官方要求(参考后面的建议部分)

定期检查DG,及时发现SATTUS为RECOVER状态的文件,及时进行纠正

作者:白鳝

评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值