Hang with log file sync

This post records yesterday's failure on a core database in detail, including the ORA-01555 errors and the exhaustion of the UNDO tablespace. By analyzing the wait events and the system log, the root cause was traced to a loss of physical paths caused by a SAN misconfiguration, a mistaken change that ultimately brought the database down. It also discusses why the UNDO tablespace sizing did not match what actually happened, and the trouble introduced by a business-side schedule change. The post stresses the importance of careful operations and that the first priority in such incidents is restoring service.

Yesterday one of our core databases ran into trouble: several hundred sessions were blocked and the instance eventually went down. Since this is the customer-service database and a very sensitive one, the office got noisy in a hurry.
Our principle for this kind of incident is to restore service first and leave everything else for later. The follow-up analysis took a few detours; here is the whole story.
First, ORA-01555 errors appeared in the alert log. They started around 07:00 and kept appearing until about 14:00:

Tue May 22 07:11:56 2012
ORA-01555 caused by SQL statement below (SQL ID: gd31z0zdh33nj, Query Duration=101 sec, SCN: 0x0b3c.37ef1c98)

Tue May 22 08:45:34 2012
Errors in file /u01/oracle/app/oracle/admin/crmkf/bdump/crmkf1_smon_8004.trc:
ORA-30036: unable to extend segment by 4 in undo tablespace 'UNDOTBS1'

ORA-30036 showed up a little after 08:00, and ORA-01555 kept coming afterwards. Undo really was close to exhausted at that point, and that is where everyone went down the wrong path:
the hang was attributed to undo-related causes and the investigation was built on that assumption, but it led nowhere.
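One quick check that would have confirmed how far gone the undo space was is to break the undo tablespace down by extent status. A minimal sketch (UNDOTBS1 is the tablespace named in the ORA-30036 message above):

-- ACTIVE extents belong to open transactions, UNEXPIRED ones are held to
-- honour undo_retention, EXPIRED ones are reusable; once ACTIVE+UNEXPIRED
-- approaches the tablespace size, ORA-30036 and ORA-01555 are to be expected
select status,
       count(*)                    extents,
       round(sum(bytes)/1024/1024) mb
  from dba_undo_extents
 where tablespace_name = 'UNDOTBS1'
 group by status
 order by status;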
Look at the following entries; these are from the few minutes right before the incident:

ORA-01555 caused by SQL statement below (SQL ID: gd31z0zdh33nj, Query Duration=101 sec, SCN: 0x0b3c.37ef1c98):
ORA-01555 caused by SQL statement below (SQL ID: 18hwx0w69f9pg, Query Duration=107 sec, SCN: 0x0b3c.381970a9):
ORA-01555 caused by SQL statement below (SQL ID: 18hwx0w69f9pg, Query Duration=152 sec, SCN: 0x0b3c.380ff215):
ORA-01555 caused by SQL statement below (SQL ID: 18hwx0w69f9pg, Query Duration=73 sec, SCN: 0x0b3c.3822c0ef):
ORA-01555 caused by SQL statement below (SQL ID: dx2gpf5jk1jnu, Query Duration=113 sec, SCN: 0x0b3c.38e2030f):
ORA-01555 caused by SQL statement below (SQL ID: f1v5vww09yzzd, Query Duration=170 sec, SCN: 0x0b3c.3a53f9b0):
ORA-01555 caused by SQL statement below (SQL ID: 948xj46pvh0j4, Query Duration=28 sec, SCN: 0x0b3c.3a84c77d):
ORA-01555 caused by SQL statement below (SQL ID: gd31z0zdh33nj, Query Duration=186 sec, SCN: 0x0b3c.3a7d6206):
ORA-01555 caused by SQL statement below (SQL ID: gd31z0zdh33nj, Query Duration=226 sec, SCN: 0x0b3c.3a76b4ab):
ORA-01555 caused by SQL statement below (SQL ID: dx2gpf5jk1jnu, Query Duration=29 sec, SCN: 0x0b3c.3b871228):
ORA-01555 caused by SQL statement below (SQL ID: 02pnhfvmwua8r, Query Duration=37 sec, SCN: 0x0b3c.3be2e941):
ORA-01555 caused by SQL statement below (SQL ID: 05m5p2na5f8yd, Query Duration=6 sec, SCN: 0x0b3c.3c17f5fa):
ORA-01555 caused by SQL statement below (SQL ID: 1xj2gpb5x1a38, Query Duration=6 sec, SCN: 0x0b3c.3c24d864):
ORA-01555 caused by SQL statement below (SQL ID: 4qcd9jnyt4j27, Query Duration=66 sec, SCN: 0x0b3c.3c4c2d04):
ORA-01555 caused by SQL statement below (SQL ID: 2xp5nc6rmd42t, Query Duration=15 sec, SCN: 0x0b3c.3c63797d):
ORA-01555 caused by SQL statement below (SQL ID: 8r4xzv0amrkyb, Query Duration=64 sec, SCN: 0x0b3c.3c5c7c15):
ORA-01555 caused by SQL statement below (SQL ID: 8v307x8qzc78y, Query Duration=12 sec, SCN: 0x0b3c.3d084c98):

Even very short transactions failed, which is hard to explain: queries with Duration=6 seconds were also reporting ORA-01555.
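When exactly those errors clustered can also be read from v$undostat, which keeps one row per 10-minute interval; a sketch, assuming 10g or later (ssolderrcnt counts ORA-01555 hits, nospaceerrcnt counts out-of-space errors such as ORA-30036):

select to_char(begin_time,'HH24:MI') begin_time,
       to_char(end_time,'HH24:MI')   end_time,
       undoblks,
       txncount,
       maxquerylen,
       ssolderrcnt,
       nospaceerrcnt
  from v$undostat
 order by begin_time;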
Now look at the database wait events at the time:

col samp_time for a25
col event for a30
select session_id sid,
       to_char(SAMPLE_TIME,'YY-MM-DD:HH:MI:SS') samp_time,  -- HH is the 12-hour field, so 02:52 in the output below is really 14:52
       session_state,
       event,
       p1,
       p2,
       p3,
       wait_time,
       time_waited
from DBA_HIST_ACTIVE_SESS_HISTORY
  where to_char(SAMPLE_TIME,'YY-MM-DD:HH24:MI:SS') between
 '12-05-22:14:45:00' and '12-05-22:14:55:00' --and event like'cursor%' and p1=2057858921
  order by sample_id desc;

SID SAMP_TIME                        SESSION_STATE  EVENT                  P1         P2         P3        TIME_WAITED
---------- ------------------------- ------------- ---------------------- ---------- ---------- ---------- -----------
      5608 12-05-22:02:52:30         WAITING       log file sync                2734          0          0      986328
      5613 12-05-22:02:52:30         WAITING       log file sync                2734          0          0      986320
      5615 12-05-22:02:52:30         WAITING       log file sync                2734          0          0      986324
      5616 12-05-22:02:52:30         WAITING       log file sync                2734          0          0      986314
      5619 12-05-22:02:52:30         WAITING       log file sync                2734          0          0      986327
      5621 12-05-22:02:52:30         WAITING       log file sync                2734          0          0      986321
      5622 12-05-22:02:52:30         WAITING       log file sync                2734          0          0      986311
      5623 12-05-22:02:52:30         WAITING       enq: FB - contention   1178730502          5   42451933      498021
      5625 12-05-22:02:52:30         WAITING       log file sync                2734          0          0      986323
      5627 12-05-22:02:52:30         WAITING       log file sync                2734          0          0      986321
      5631 12-05-22:02:52:30         WAITING       log file sync                2734          0          0      986322
      5633 12-05-22:02:52:30         WAITING       log file sync                2734          0          0      986323
      5634 12-05-22:02:52:30         WAITING       log file sync                2734          0          0      986327
      5637 12-05-22:02:52:30         WAITING       log file sync                2734          0          0      986308
      5638 12-05-22:02:52:30         WAITING       enq: FB - contention   1178730502          5   42451933      498022
      5640 12-05-22:02:52:30         WAITING       log file sync                2734          0          0      986328
      5641 12-05-22:02:52:30         WAITING       log file sync          4294967295          0          0      986328
      5642 12-05-22:02:52:30         WAITING       log file sync                2734          0          0      986329
      5656 12-05-22:02:52:30         WAITING       log file sync                2734          0          0      986315
      5658 12-05-22:02:52:30         WAITING       log file sync                2734          0          0      986278

Only part of the output is reproduced here. Now the aggregated wait statistics:

SQL>  select event,count(*) from
  2   dba_hist_active_sess_history
  3   where to_char(SAMPLE_TIME,'YY-MM-DD:HH24:MI:SS') between
  4   '12-05-22:14:45:00' and '12-05-22:14:55:00' 
  5   group by event
  6   order by 2 desc;

EVENT                              COUNT(*)
---------------------------------  --------
log file sync                         12370
enq: FB - contention                   1018
buffer busy waits                       497
db file sequential read                 290
db file parallel write                  231
direct path read                        165
control file sequential read             92
enq: HW - contention                     79
log file parallel write                  27
control file parallel write              25
read by other session                    13
gcs log flush sync                        2
wait for scn ack                          1

At that point the database was essentially stuck on log file sync and enq: FB - contention, with the log file sync waits running into seconds, which pretty much pins the problem on I/O.
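Before going to the OS log, the same suspicion can be cross-checked from the LGWR side: if the storage is at fault, log file parallel write (LGWR's own write to the online redo logs) should be slow as well, not just the foreground log file sync. A sketch against AWR, under the same Diagnostics Pack assumption as the ASH queries above; the counters are cumulative, so only the deltas between snapshots matter (a negative delta just means the instance restarted):

select sn.end_interval_time,
       e.total_waits
         - lag(e.total_waits) over (order by e.snap_id)        waits,
       round((e.time_waited_micro
         - lag(e.time_waited_micro) over (order by e.snap_id)) / 1000
         / nullif(e.total_waits
             - lag(e.total_waits) over (order by e.snap_id), 0), 1) avg_ms
  from dba_hist_system_event e,
       dba_hist_snapshot     sn
 where e.snap_id         = sn.snap_id
   and e.dbid            = sn.dbid
   and e.instance_number = sn.instance_number
   and e.instance_number = 1
   and e.event_name      = 'log file parallel write'
 order by sn.end_interval_time;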
So we went to the system log and found something substantive: redo writes were being blocked, which is what pushed the database over the edge. Without further ado, here is the log:

May 22 14:51:14 zjddkf02 vmunix: LVM: WARNING: VG 64 0x090000: LV 3: Some I/O requests to this LV are waiting
May 22 14:51:14 zjddkf02 vmunix: LVM: WARNING: VG 64 0x140000: LV 10: Some I/O requests to this LV are waiting
May 22 14:51:18 zjddkf02 vmunix: LVM: WARNING: VG 64 0x090000: LV 1: Some I/O requests to this LV are waiting
May 22 14:51:19 zjddkf02 vmunix: LVM: WARNING: VG 64 0x140000: LV 12: Some I/O requests to this LV are waiting
May 22 14:51:26 zjddkf02 vmunix: LVM: WARNING: VG 64 0x110000: LV 18: Some I/O requests to this LV are waiting
May 22 14:51:26 zjddkf02 vmunix: LVM: VG 64 0x170000: PVLink 1 0x000078 Failed! The PV is not accessible.
May 22 14:51:29 zjddkf02 vmunix: LVM: VG 64 0x120000: PVLink 1 0x000035 Failed! The PV is not accessible.
May 22 14:51:26 zjddkf02 vmunix:        indefinitely for an unavailable PV. These requests will be queued until
May 22 14:51:38 zjddkf02  above message repeats 25 times
May 22 14:51:38 zjddkf02 vmunix: LVM: WARNING: VG 64 0x190000: LV 2: Some I/O requests to this LV are waiting
May 22 14:51:26 zjddkf02 vmunix:        the PV becomes available (or a timeout is specified for the LV).
May 22 14:51:38 zjddkf02  above message repeats 25 times
May 22 14:51:38 zjddkf02 vmunix:        indefinitely for an unavailable PV. These requests will be queued until
May 22 14:51:38 zjddkf02 vmunix:        the PV becomes available (or a timeout is specified for the LV).
May 22 14:51:39 zjddkf02 vmunix: LVM: VG 64 0x090000: PVLink 1 0x000054 Failed! The PV is not accessible.
May 22 14:51:41 zjddkf02 vmunix: LVM: WARNING: VG 64 0x190000: LV 14: Some I/O requests to this LV are waiting
May 22 14:51:47 zjddkf02 vmunix: LVM: VG 64 0x130000: PVLink 1 0x000037 Failed! The PV is not accessible.
May 22 14:51:47 zjddkf02 vmunix: LVM: VG 64 0x130000: PVLink 1 0x000038 Failed! The PV is not accessible.
May 22 14:51:41 zjddkf02 vmunix:        indefinitely for an unavailable PV. These requests will be queued until
May 22 14:51:47 zjddkf02 vmunix: LVM: VG 64 0x130000: PVLink 1 0x00003c Failed! The PV is not accessible.
May 22 14:51:41 zjddkf02 vmunix:        the PV becomes available (or a timeout is specified for the LV).
May 22 14:51:47 zjddkf02 vmunix: LVM: VG 64 0x130000: PVLink 1 0x000044 Failed! The PV is not accessible.
May 22 14:51:49 zjddkf02 cmdisklockd[4459]: Still trying to inquire cluster lock disk /dev/disk/disk274
May 22 14:51:59 zjddkf02 vmunix: LVM: VG 64 0x180000: PVLink 1 0x000083 Failed! The PV is not accessible.
May 22 14:51:59 zjddkf02 vmunix: LVM: VG 64 0x180000: PVLink 1 0x000085 Failed! The PV is not accessible.
...
May 22 14:55:37 zjddkf02  above message repeats 4 times
May 22 14:55:37 zjddkf02 vmunix: LVM: WARNING: VG 64 0x110000: LV 19: Some I/O requests to this LV are waiting
May 22 14:55:38 zjddkf02 vmunix: LVM: VG 64 0x090000: PVLink 1 0x000054 Recovered.
May 22 14:55:38 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x090000: LV 1: All I/O requests to this LV that were
May 22 14:55:37 zjddkf02 vmunix:        indefinitely for an unavailable PV. These requests will be queued until
May 22 14:55:38 zjddkf02 vmunix:        waiting indefinitely for an unavailable PV have now completed.
May 22 14:55:37 zjddkf02 vmunix:        the PV becomes available (or a timeout is specified for the LV).
May 22 14:55:38 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x090000: LV 3: All I/O requests to this LV that were
May 22 14:55:38 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x090000: LV 5: All I/O requests to this LV that were
May 22 14:55:38 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x090000: LV 4: All I/O requests to this LV that were
May 22 14:55:38 zjddkf02 vmunix: LVM: VG 64 0x140000: PVLink 1 0x000045 Recovered.
May 22 14:55:38 zjddkf02 vmunix: LVM: VG 64 0x140000: PVLink 1 0x000046 Recovered.
May 22 14:55:38 zjddkf02 vmunix: LVM: VG 64 0x140000: PVLink 1 0x000047 Recovered.
May 22 14:55:38 zjddkf02 vmunix:        waiting indefinitely for an unavailable PV have now completed.
May 22 14:55:38 zjddkf02  above message repeats 3 times
May 22 14:55:38 zjddkf02 vmunix: LVM: VG 64 0x140000: PVLink 1 0x000049 Recovered.
May 22 14:55:38 zjddkf02 vmunix: LVM: VG 64 0x140000: PVLink 1 0x000050 Recovered.
May 22 14:55:38 zjddkf02 vmunix: LVM: VG 64 0x140000: PVLink 1 0x000053 Recovered.
May 22 14:55:38 zjddkf02 vmunix: LVM: VG 64 0x140000: PVLink 1 0x000051 Failed! The PV is not accessible.
May 22 14:55:38 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x140000: LV 5: All I/O requests to this LV that were
May 22 14:55:38 zjddkf02 vmunix:        waiting indefinitely for an unavailable PV have now completed.
May 22 14:55:38 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x140000: LV 16: All I/O requests to this LV that were
May 22 14:55:38 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x140000: LV 2: All I/O requests to this LV that were
May 22 14:55:38 zjddkf02 vmunix: LVM: WARNING: VG 64 0x140000: LV 12: Some I/O requests to this LV are waiting
May 22 14:55:38 zjddkf02 vmunix:        indefinitely for an unavailable PV. These requests will be queued until
May 22 14:55:38 zjddkf02 vmunix:        the PV becomes available (or a timeout is specified for the LV).
May 22 14:55:38 zjddkf02 vmunix:        waiting indefinitely for an unavailable PV have now completed.
May 22 14:55:38 zjddkf02 vmunix: LVM: WARNING: VG 64 0x190000: LV 14: Some I/O requests to this LV are waiting
May 22 14:55:38 zjddkf02 vmunix: LVM: WARNING: VG 64 0x130000: LV 16: Some I/O requests to this LV are waiting
May 22 14:55:38 zjddkf02 vmunix:        indefinitely for an unavailable PV. These requests will be queued until
May 22 14:55:38 zjddkf02  above message repeats 2 times
May 22 14:55:38 zjddkf02 vmunix:        the PV becomes available (or a timeout is specified for the LV).
May 22 14:55:38 zjddkf02  above message repeats 2 times
May 22 14:55:39 zjddkf02 vmunix: LVM: VG 64 0x150000: PVLink 1 0x000056 Recovered.
May 22 14:55:39 zjddkf02 vmunix: LVM: VG 64 0x150000: PVLink 1 0x000057 Failed! The PV is not accessible.
May 22 14:55:39 zjddkf02 vmunix: LVM: VG 64 0x150000: PVLink 1 0x000058 Failed! The PV is not accessible.
May 22 14:55:39 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x150000: LV 29: All I/O requests to this LV that were
May 22 14:55:39 zjddkf02 vmunix:        waiting indefinitely for an unavailable PV have now completed.
May 22 14:55:39 zjddkf02 vmunix: LVM: VG 64 0x120000: PVLink 1 0x000027 Recovered.
May 22 14:55:39 zjddkf02 vmunix: LVM: VG 64 0x120000: PVLink 1 0x000028 Recovered.
May 22 14:55:39 zjddkf02 vmunix: LVM: VG 64 0x120000: PVLink 1 0x000029 Recovered.
May 22 14:55:39 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x120000: LV 27: All I/O requests to this LV that were
May 22 14:55:39 zjddkf02 vmunix:        waiting indefinitely for an unavailable PV have now completed.
May 22 14:55:39 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x130000: LV 16: All I/O requests to this LV that were
May 22 14:55:39 zjddkf02 vmunix: LVM: VG 64 0x180000: PVLink 1 0x000083 Recovered.
May 22 14:55:39 zjddkf02 vmunix: LVM: VG 64 0x180000: PVLink 1 0x000084 Recovered.
May 22 14:55:39 zjddkf02 vmunix:        waiting indefinitely for an unavailable PV have now completed.

Only part of the messages is shown. The trouble starts at 14:51:14: PVs become inaccessible, i.e. the physical paths were lost, and by 14:55:39 everything has recovered. Painfully, it turned out someone had touched the SAN configuration files.
Human error again...

Now for the UNDO tablespace. Our undo is actually more than sufficient; here is the current configuration:

SQL> show parameter undo

NAME                                 TYPE                             VALUE
------------------------------------ -------------------------------- ------------------------------
_gc_undo_affinity                    boolean                          FALSE
undo_management                      string                           AUTO
undo_retention                       integer                          36000
undo_tablespace                      string                           UNDOTBS1

SQL> SELECT d.undo_size/(1024*1024) "ACTUAL UNDO SIZE [MByte]",
  2  SUBSTR(e.value,1,25) "UNDO RETENTION [Sec]",
  3  (TO_NUMBER(e.value) * TO_NUMBER(f.value) *
  4  g.undo_block_per_sec) / (1024*1024)
  5  "NEEDED UNDO SIZE [MByte]"
  6  FROM (
  7  SELECT SUM(a.bytes) undo_size
  8  FROM v$datafile a,
  9  v$tablespace b,
 10  dba_tablespaces c
 11  WHERE c.contents = 'UNDO'
 12  AND c.status = 'ONLINE'
 13  AND b.name = c.tablespace_name
 14  AND a.ts# = b.ts#
 15  ) d,
 16  v$parameter e,
 17  v$parameter f,
 18  (
 19  SELECT MAX(undoblks/((end_time-begin_time)*3600*24))
 20  undo_block_per_sec
 21  FROM v$undostat
 22  ) g
 23  WHERE e.name = 'undo_retention'
 24  AND f.name = 'db_block_size'
 25  /

ACTUAL UNDO SIZE [MByte] UNDO RETENTION [Sec]                NEEDED UNDO SIZE [MByte]
------------------------ ----------------------------------- ------------------------
                   96000 36000                                                35242.5

The tablespace size needed to satisfy undo_retention works out to 35242.5 MB, while our UNDO tablespace is 96000 MB, so why did ORA-01555 start appearing in the morning?
After talking to the application team, it turned out that the batch jobs and reports that used to run at night had recently been moved to the daytime. It is mildly infuriating that a change of that size can be rescheduled so casually:
daytime is peak business hours with a high transaction rate, and a single report runs for 1-2 hours, so ORA-01555 was pretty much guaranteed.
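The sizing formula above only looks at the undo generation rate; what it does not capture is query length, and ORA-01555 is really about whether the longest-running query can still find the undo it needs. v$undostat shows both sides; a sketch (maxqueryid is available from 10gR2 on, and tuned_undoretention is the retention Oracle actually managed to keep):

-- intervals in which the longest query ran up against the retention that was
-- actually maintained, or in which ORA-01555 was raised; maxqueryid points
-- at the candidate report SQL
select to_char(begin_time,'DD HH24:MI') begin_time,
       maxquerylen,
       maxqueryid,
       tuned_undoretention,
       ssolderrcnt
  from v$undostat
 where ssolderrcnt > 0
    or maxquerylen > tuned_undoretention
 order by begin_time;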

That is the report. IT systems carry risk, so operate with care. Unfortunately the colleague who made the mistaken change walked straight into the minefield. Be careful, careful, and careful again.
