Disk "rejecting I/O to offline device" Failure Leaves a 4 TB Production Database Inaccessible

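This article records how an Oracle database outage caused by a disk problem was handled. It covers the symptoms, the system environment, the troubleshooting steps, and the fix.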

1. Project Background

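An important project runs on an Oracle database with one primary and one standby built on ADG. The total data volume is about 4 TB, and the system had run normally for nearly five years. Recently the standby's disk performance degraded severely, leaving about 1.4 TB of archived logs unapplied (roughly 350 GB of logs are generated per day, and the standby was 4 days behind). All applications were therefore switched to the primary while the standby was taken out for repair. Only two days after the switch, on Friday evening, the primary went down, which is why this article exists. It aims to offer some troubleshooting ideas and methods to anyone who runs into the same problem, though ideally nobody will ever need it.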
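
As a quick sanity check on the figures above, the backlog follows directly from the daily log volume and the number of days of delay. The sketch below only restates that arithmetic; the function name is illustrative and decimal units (1 TB = 1000 GB) are assumed.

```python
# Figures taken from the text: about 350 GB of archived logs per day, 4 days behind.
DAILY_ARCHIVE_GB = 350
DAYS_BEHIND = 4


def backlog_gb(daily_gb, days_behind):
    """Return the volume of unapplied archived logs in GB."""
    return daily_gb * days_behind


backlog = backlog_gb(DAILY_ARCHIVE_GB, DAYS_BEHIND)
print(f"Unapplied archived logs: {backlog} GB (about {backlog / 1000:.1f} TB)")
```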

2. System Environment

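Two database servers: System x3950 X6, each with 16 disks. The first 4 disks are 300 GB each and the remaining 12 are 4 TB each; every group of 4 disks forms one RAID 10 array.
Operating system: Red Hat 6.5
Database version: Oracle 11.2.0.4
OGG version: OGG 12.2.0.1
Database architecture: one primary and one standby built on ADG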
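
To make the RAID 10 layout above more concrete, here is a minimal sketch of what a four-disk RAID 10 group can tolerate. It assumes the usual arrangement of two mirrored pairs striped together; the disk names `d0` to `d3` and the pairing are hypothetical, since the article does not say how the controller grouped the physical disks.

```python
from itertools import combinations

# Hypothetical layout: one 4-disk RAID 10 group made of two mirrored pairs.
MIRROR_PAIRS = [("d0", "d1"), ("d2", "d3")]


def survives(failed, pairs=MIRROR_PAIRS):
    """The array survives as long as no mirror pair has lost both of its disks."""
    failed = set(failed)
    return all(not set(pair) <= failed for pair in pairs)


def usable_capacity(disk_size, disks_in_group=4):
    """RAID 10 stores every block twice, so half of the raw capacity is usable."""
    return disk_size * disks_in_group // 2


disks = [d for pair in MIRROR_PAIRS for d in pair]
single_failure_ok = all(survives([d]) for d in disks)
double_failures_ok = [c for c in combinations(disks, 2) if survives(c)]

print("Any single disk failure survivable:", single_failure_ok)
print("Survivable two-disk failures:", double_failures_ok)
print("Usable GB for a 4 x 300 GB group:", usable_capacity(300))
```

Under this assumption any single disk failure is survivable, which is why one amber light alone would not normally be expected to take a volume offline.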

3. Fault Description

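1. Running ls in several directories mounted from disk partitions returned nothing and reported an input/output error.
2. /var/log/messages contained the following errors: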
Jan 31 03:47:03 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
Jan 31 03:49:47 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
Jan 31 03:49:47 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
Jan 31 03:49:47 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
Jan 31 03:49:47 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
Jan 31 03:49:47 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
Jan 31 03:49:47 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
Jan 31 03:49:47 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
Jan 31 03:49:47 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
Jan 31 03:49:47 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
Jan 31 03:49:47 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
Jan 31 03:49:47 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
Jan 31 03:49:47 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
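3. The database was inaccessible; 45 of its data files resided on partitions sdc2 and sdc3.
4. The database backups were on partition sdd1, so a timely restore from backup was not possible.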
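
When the log contains many such lines, it can help to summarize them per SCSI address. The following is a minimal sketch, assuming the syslog line format shown in the excerpt above; the function name and the sample list are illustrative, and in practice the lines would be read from /var/log/messages.

```python
import re
from collections import Counter

# Matches syslog lines such as:
# Jan 31 03:47:03 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
OFFLINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"sd (?P<hctl>\d+:\d+:\d+:\d+): rejecting I/O to offline device"
)


def summarize_offline(lines):
    """Count 'offline device' rejections per SCSI address and note the first timestamp."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        match = OFFLINE_RE.match(line)
        if not match:
            continue  # ignore unrelated log lines
        hctl = match["hctl"]
        counts[hctl] += 1
        first_seen.setdefault(hctl, match["ts"])
    return counts, first_seen


# Two lines copied from the excerpt above plus one made-up unrelated line.
sample = [
    "Jan 31 03:47:03 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device",
    "Jan 31 03:49:47 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device",
    "Jan 31 03:50:00 test-db-2 crond[123]: unrelated message",
]
counts, first_seen = summarize_offline(sample)
for hctl, n in counts.items():
    print(f"sd {hctl}: {n} rejections, first at {first_seen[hctl]}")
```

The address reported here (host:channel:target:lun) still has to be mapped to a block device such as sdc or sdd by other means.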

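4. Troubleshooting

1. Determine why the database could not be opened normally.
2. Check whether any disks in the database server were reporting alarms.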

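5. Analysis and Resolution

1. A disk I/O error made the 45 data files on the affected partitions inaccessible, so the database could not be opened directly. After the 45 data files were taken offline and the database was opened, queries against business tables raised a large number of errors about missing data files, so the database still could not serve normal traffic.
2. The server has 16 disks arranged as four-disk RAID 10 groups, which correspond to sda, sdb, sdc, and sdd. One disk in the sdd group showed an amber warning light. Normally a single amber-lit disk should not affect service, so a logical error was suspected. The plan was to umount the affected partitions and check them for bad blocks with fsck (unfortunately, fsck could not detect these partitions at all). The next option considered was to restart the server (power off first, then power on again).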
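Reasons for the restart:
1. The database could not serve normal traffic.
2. The database backups were inaccessible.
3. Only one disk showed an amber light, and a restart would reveal whether any other errors appeared or whether the file system could repair itself.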
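The database server did not shut down cleanly: after nearly an hour it was still stuck on the shutdown screen.
In the end the server was forced off with its power button. The disk with the amber light was then pulled out, and the server was started again.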
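After the server started normally, sdc2 and sdc3 were mounted and accessible again, while sdd still could not be mounted. The next step was to open the database and bring all offline data files back online.
First, recover data file 30 and then bring it online: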
SQL> alter database datafile 30 online;
alter database datafile 30 online
*
ERROR at line 1:
ORA-01113: file 30 needs media recovery
ORA-01110: data file 30: '/test/test2/teststg01.dbf'

SQL> recover datafile 30;
Media recovery complete.
SQL> select file#,status,name from v$datafile;


30 OFFLINE /test/test2/teststg01.dbf
31 RECOVER /test/test2/teststg02.dbf
32 RECOVER /test/test2/teststg03.dbf
33 RECOVER /test/test2/bigdata01.dbf

SQL> alter database datafile 30 online;
Database altered.
Then run the same steps in batch for the remaining files:
SQL> recover datafile 31,32,33,34,35,36,37,38,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,63;

SQL>alter database datafile 31,32,33,34,35,36,37,38,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,63 online;
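
Typing a long file list by hand is error-prone. As a minimal sketch, the batch statements above could be generated from the `select file#,status,name from v$datafile` output. The helper names are hypothetical, the parsing assumes whitespace-separated rows like those shown earlier, and the generated statements should be reviewed before being run in SQL*Plus.

```python
def parse_datafile_rows(text):
    """Parse 'FILE# STATUS NAME' rows as printed by SQL*Plus into tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0].isdigit():
            rows.append((int(parts[0]), parts[1], parts[2]))
    return rows


def recovery_statements(rows):
    """Build one RECOVER and one ONLINE statement for files in RECOVER status."""
    ids = [str(file_no) for file_no, status, _name in rows if status == "RECOVER"]
    if not ids:
        return []
    id_list = ",".join(ids)
    return [
        f"recover datafile {id_list};",
        f"alter database datafile {id_list} online;",
    ]


# Rows copied from the v$datafile output shown earlier in this section.
sample_output = """
30 OFFLINE /test/test2/teststg01.dbf
31 RECOVER /test/test2/teststg02.dbf
32 RECOVER /test/test2/teststg03.dbf
33 RECOVER /test/test2/bigdata01.dbf
"""
statements = recovery_statements(parse_datafile_rows(sample_output))
for statement in statements:
    print(statement)
```

File 30 is skipped here because its status is already OFFLINE rather than RECOVER, meaning it only needs to be brought online.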

At this point all data files were online. A dump backup was then taken and copied to another server.

6. Summary and Reflections

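Once all business data had been exported, it was confirmed that the data from before the failure was intact. The next task is to restore service as quickly as possible; a lot of work remains, and this article will be updated later. Still, the incident exposed many problems, so it is worth summarizing and reflecting on them to avoid a repeat.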

When the standby first developed problems, the following points could have been handled better:

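1. The problem was reported to the company promptly, but it did not get enough attention from the relevant managers (either because the primary was still running, or because their concern lay elsewhere, namely where the budget would come from).
2. The machine room should have been inspected immediately, since both servers were purchased at the same time (this did not happen because access procedures are cumbersome and responsibilities were missing or unclear).
3. The backups on the primary should have been copied to another server promptly.
4. A spare server should have been prepared, ideally with the same configuration. If none is available, a lower-spec machine will do, with the most important business data replicated to it in real time.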
