今天同事发过来一个问题,在asm实例中执行查询v$asm_diskgroup的时候,session就hang住了。查询等待事件发现是enq: DD – contention。这个情况还是第一次遇到,不是很清楚,于是让他收集了一些trace发过来。对于hang的问题,我们一般建议做3级的hanganalyze。从hanganalyze的trace file文件中我们可以看到下列的一些信息:
Chains most likely to have caused the hang:
[a] Chain 1 Signature: 'kfk: async disk IO'<='enq: DD - contention'
Chain 1 Signature Hash: 0x92d661bb
[b] Chain 2 Signature: 'kfk: async disk IO'<='enq: DD - contention'
Chain 2 Signature Hash: 0x92d661bb
[c] Chain 3 Signature: 'kfk: async disk IO'<='enq: DD - contention'
Chain 3 Signature Hash: 0x92d661bb
===============================================================================
Non-intersecting chains:
-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
Oracle session identified by:
{
instance: 4 (+asm.+asm4)
os id: 12115
process id: 39, oracle@bi-db04 (TNS V1-V3)
session id: 961
session serial #: 5
}
is waiting for 'enq: DD - contention' with wait info:
{
p1: 'name|mode'=0x44440006
p2: 'disk group'=0x0
p3: 'type'=0x1
time in wait: 24.505926 sec
timeout after: never
wait id: 9
blocking: 0 sessions
current sql: select group_number,name,total_mb,free_mb from v$asm_diskgroup
short stack: ksedsts()+461
wait history:
* time between current wait and wait #1: 0.000236 sec
1. event: 'SQL*Net message to client'
time waited: 0.000003 sec
wait id: 8 p1: 'driver id'=0x62657100
p2: '#bytes'=0x1
* time between wait #1 and #2: 0.001586 sec
2. event: 'SQL*Net message from client'
time waited: 32.863016 sec
wait id: 7 p1: 'driver id'=0x62657100
p2: '#bytes'=0x1
* time between wait #2 and #3: 0.000002 sec
3. event: 'SQL*Net message to client'
time waited: 0.000001 sec
wait id: 6 p1: 'driver id'=0x62657100
p2: '#bytes'=0x1
}
and is blocked by
=> Oracle session identified by:
{
instance: 4 (+asm.+asm4)
os id: 25609
process id: 31, oracle@bi-db04 (TNS V1-V3)
session id: 450
session serial #: 2057
}
which is waiting for 'kfk: async disk IO' with wait info:
{
p1: 'count'=0x1
p2: 'intr'=0x0
p3: 'timeout'=0xffffffff
time in wait: 9412 min 34 sec
heur. time in wait: 9412 min 34 sec
timeout after: never
wait id: 83
blocking: 7 sessions
current sql: select grpnum_kfdsk, number_kfdsk, compound_kfdsk, incarn_kfdsk, mntsts_kfdsk, hdrsts_kfdsk, compat_kfdsk, mode_kfdsk, state_kfdsk, redun_kfdsk, libnam_kfdsk, totmb_kfdsk, usedmb_kfdsk, asmname_kfdsk, failname_kfdsk, label_kfdsk, path_kfdsk, udid_kfdsk, kfkid_kfdsk, crdate_kfdsk, mtdate_kfdsk, timer_kfdsk , dbcompat_k
short stack: ksedsts()+461
wait history:
* time between current wait and wait #1: 0.000029 sec
1. event: 'kfk: async disk IO'
time waited: 0.010643 sec
wait id: 82 p1: 'count'=0x1
p2: 'intr'=0x0
p3: 'timeout'=0xffffffff
* time between wait #1 and #2: 0.000028 sec
2. event: 'kfk: async disk IO'
time waited: 0.004111 sec
wait id: 81 p1: 'count'=0x1
p2: 'intr'=0x0
p3: 'timeout'=0xffffffff
* time between wait #2 and #3: 0.000028 sec
3. event: 'kfk: async disk IO'
time waited: 0.001966 sec
wait id: 80 p1: 'count'=0x1
p2: 'intr'=0x0
p3: 'timeout'=0xffffffff
}
Chain 1 Signature: 'kfk: async disk IO'<='enq: DD - contention'
Chain 1 Signature Hash: 0x92d661bb
这里我们可以看到我们执行查询v$asm_diskgroup的session当前的等待事件是enq: DD – contention,而这个会话被另外一个会话所阻塞。这里我们可以看到是这个session的等待事件是kfk: async disk IO。而且一直处于这个等待上。现在得依靠我们强大的metalink了,我们上去搜索了一番,发现Bug 14002325 : SELECT * FROM V$ASM_DISKGROUP HANGING.它的描述是比较符合我们这个案例的。为什么这么说呢?我们可以把我们的堆栈拿出来和bug的堆栈做对比。这是bug提供的堆栈信息。
Short stack dump:
ksedsts ksdxfstk ksdxcb sspuser __sighandler read kfk_reap_ios_from_subsys
kfk_reap_ios kfk_io1 kfkRequest kfk_transitIO kfdParallelIO kfdDiscoverDeep
kfgDiscoverDeep kfgDiscoverGroup kfgTableCb kfgGrpTableCbInternal
kfgGrpTableCb qerfxFetch opifch2 kpoal8 opiodr ttcpip opitsk opiino opiodr
opidrv sou2o opimai_real ssthrdmain main __libc_start_main _start
因为我们做的hanganalyze的short stack不全,于是我让他又做了一个完整的errorstack。内容如下。
----- Error Stack Dump -----
----- Current SQL Statement for this session (sql_id=b8a2pdbq3p2mj) -----
select grpnum_kfdsk, number_kfdsk, compound_kfdsk, incarn_kfdsk, mntsts_kfdsk, hdrsts_kfdsk, compat_kfdsk, mode_kfdsk, state_kfdsk, redun_kfdsk, libnam_kfdsk, totmb_kfdsk, usedmb_kfdsk, asmname_kfdsk, failname_kfdsk, label_kfdsk, path_kfdsk, udid_kfdsk, kfkid_kfdsk, crdate_kfdsk, mtdate_kfdsk, timer_kfdsk , dbcompat_kfdsk, product_kfdsk , osmb_kfdsk, prefrd_kfdsk , coldmb_kfdsk, hotmb_kfdsk, blksz_kfdsk, flags_kfdsk ,asmhashval_kfdsk from x$kfdsk
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
skdstdst()+36 call kgdsdst() 000000000 ? 000000000 ?
7FFF42C57A68 ? 000000001 ?
000000001 ? 000000003 ?
ksedst1()+98 call skdstdst() 000000000 ? 000000000 ?
7FFF42C57A68 ? 000000001 ?
000000000 ? 000000003 ?
ksedst()+34 call ksedst1() 000000001 ? 000000001 ?
7FFF42C57A68 ? 000000001 ?
000000000 ? 000000003 ?
dbkedDefDump()+2741 call ksedst() 000000001 ? 000000001 ?
7FFF42C57A68 ? 000000001 ?
000000000 ? 000000003 ?
ksedmp()+36 call dbkedDefDump() 000000003 ? 000000000 ?
7FFF42C57A68 ? 000000001 ?
000000000 ? 000000003 ?
ksdxfdmp()+1837 call ksedmp() 000000003 ? 000000000 ?
7FFF42C57A68 ? 000000001 ?
000000000 ? 000000003 ?
ksdxcb()+1876 call ksdxfdmp() 7FFF42C5D360 ? 000000011 ?
000000003 ? 7FFF42C5D2C0 ?
7FFF42C5D220 ? 000000003 ?
sspuser()+112 call ksdxcb() 000000001 ? 000000011 ?
000000001 ? 000000001 ?
7FFF42C5D220 ? 000000003 ?
__sighandler() call sspuser() 000000001 ? 000000011 ?
000000001 ? 000000001 ?
7FFF42C5D220 ? 000000003 ?
io_queue_run()+100 signal __sighandler() 7F194F920000 ? 000000001 ?
000000080 ?
FFFFFFFFFFFFFFFF ?
7FFF42C5EFB0 ? 7FFF42C5EFB0 ?
skgfrliopo()+273 call io_queue_run() 7F194F920000 ? 000000001 ?
000000080 ?
FFFFFFFFFFFFFFFF ?
7FFF42C5EFB0 ? 7FFF42C5EFB0 ?
skgfospo()+180 call skgfrliopo() 7FFF42C5F178 ?
FFFFFFFFFFFFFFFF ?
7FFF42C5EFB0 ? 7FFF42C5EFB0 ?
000000000 ? 0FFFFFFFF ?
skgfrwat()+394 call skgfospo() 7FFF42C5EFB0 ? 7FFF42C5EFB0 ?
7F194CE32038 ? 000000000 ?
000000001 ? 0FFFFFFFF ?
ksfdwtio()+261 call skgfrwat() 7FFF42C5F178 ? 7F194CE32038 ?
000000001 ? 7FFF42C5F1A8 ?
7FFF42C5DFA8 ? 0FFFFFFFF ?
ksfdwat_internal()+ call ksfdwtio() 000000001 ? 000000028 ?
180 07FFFFFFF ? 7FFF42C5F2F8 ?
7FFF42C5DFA8 ? 0FFFFFFFF ?
kfk_reap_ufs_async_ call ksfdwat_internal() 000000001 ? 000000028 ?
io()+96 07FFFFFFF ? 7FFF42C5F2F8 ?
7FFF42C5DFA8 ? 0FFFFFFFF ?
kfk_reap_ios_from_s call kfk_reap_ufs_async_ 000000001 ? 000000028 ?
ubsys()+292 io() 7FFF42C5DFA8 ? 7FFF42C5F2F8 ?
7FFF42C5DFA8 ? 0FFFFFFFF ?
kfk_reap_ios()+2033 call kfk_reap_ios_from_s 000000001 ? 07FFFFFFF ?
ubsys() 000000001 ? 7FFF42C5F2F8 ?
7FFF42C5DFA8 ? 0FFFFFFFF ?
kfk_io1()+1103 call kfk_reap_ios() 7F194CE48938 ? 00000000A ?
0FFFFFFFF ? 7FFF42C5F2F8 ?
7FFF42C5DFA8 ? 0FFFFFFFF ?
kfkRequest()+9 call kfk_io1() 7F194CE48938 ? 00000000A ?
FFFFFFFFFFFFFFFF ?
000000001 ? 7FFF42C5DFA8 ?
00A7B4DD0 ?
kfk_transitIO()+169 call kfkRequest() 7F194CE48938 ? 00000000A ?
1 FFFFFFFFFFFFFFFF ?
000000001 ? 7FFF42C5DFA8 ?
00A7B4DD0 ?
kfdParallelIO()+207 call kfk_transitIO() 7FFF42C603D0 ? 000000000 ?
0 000000000 ? 000000000 ?
7FFF42C60380 ? 00000000A ?
kfdDiscoverDeep()+1 call kfdParallelIO() 00000000A ? 000000000 ?
94 000000000 ? 000000000 ?
000000000 ? 7FFF42C60558 ?
kfgDiscoverDeep()+3 call kfdDiscoverDeep() 0D1B9BF30 ? 000000000 ?
66 000000000 ? 000000000 ?
000000000 ? 7FFF42C60558 ?
kfgDiscoverGroup()+ call kfgDiscoverDeep() 000000000 ? 000000000 ?
864 000000000 ? 000000000 ?
000000000 ? 7FFF42C60558 ?
kfgTableCb()+1562 call kfgDiscoverGroup() 7F194CB7FD88 ? 000000000 ?
000000000 ? 000000000 ?
0D1B9BF30 ? 000000000 ?
kfdDskTableCbIntern call kfgTableCb() 7F194CBA9E58 ? 7FFF00000001 ?
al()+228 7F194CB9AAA0 ? 000000000 ?
000000007 ? 7F1900000000 ?
kfdDskTableCb()+51 call kfdDskTableCbIntern 7F194CBA9E58 ? 7FFF00000001 ?
al() 000000007 ? 7F1900000000 ?
0CAF615A0 ? 000000001 ?
qerfxFetch()+2210 call kfdDskTableCb() 7F194CBA9E58 ? 7FFF00000001 ?
000000007 ? 7F1900000000 ?
0CAF615A0 ? 000000001 ?
opifch2()+3022 call qerfxFetch() 0CAF63260 ? 7F194CBA9AD0 ?
0081D2D48 ? 7FFF42C60D90 ?
000000001 ? 000000001 ?
kpoal8()+2934 call opifch2() 000000089 ? 000000005 ?
7FFF42C613B0 ? 7F194CBA8F90 ?
000000001 ? 000000001 ?
opiodr()+916 call kpoal8() 00000005E ? 00000001C ?
7FFF42C64310 ? 7F194CBA8F90 ?
000000001 ? 100000001 ?
ttcpip()+2242 call opiodr() 00000005E ? 00000001C ?
7FFF42C64310 ? 000000000 ?
008BE8B30 ? 100000001 ?
opitsk()+1668 call ttcpip() 00A7CA8B0 ? 00852A310 ?
7FFF42C64310 ? 000000000 ?
7FFF42C63D68 ? 7FFF42C64504 ?
opiino()+961 call opitsk() 00A7CA8B0 ? 000000001 ?
7FFF42C64310 ? 000000000 ?
7FFF42C63D68 ? 7FFF42C64504 ?
opiodr()+916 call opiino() 00000003C ? 000000004 ?
7FFF42C65A88 ? 000000000 ?
7FFF42C63D68 ? 7FFF42C64504 ?
opidrv()+565 call opiodr() 00000003C ? 000000004 ?
7FFF42C65A88 ? 000000000 ?
008BE85E0 ? 7FFF42C64504 ?
sou2o()+98 call opidrv() 00000003C ? 000000004 ?
7FFF42C65A88 ? 000000000 ?
008BE85E0 ? 7FFF42C64504 ?
opimai_real()+128 call sou2o() 7FFF42C65A60 ? 00000003C ?
000000004 ? 7FFF42C65A88 ?
008BE85E0 ? 7FFF42C64504 ?
ssthrdmain()+252 call opimai_real() 000000002 ? 7FFF42C65C50 ?
000000004 ? 7FFF42C65A88 ?
008BE85E0 ? 7FFF42C64504 ?
main()+196 call ssthrdmain() 000000002 ? 7FFF42C65C50 ?
000000001 ? 000000000 ?
008BE85E0 ? 7FFF42C64504 ?
__libc_start_main() call main() 000000002 ? 7FFF42C65DE8 ?
+230 000000001 ? 000000000 ?
008BE85E0 ? 7FFF42C64504 ?
_start()+36 call __libc_start_main() 0008BA95C ? 000000002 ?
7FFF42C65DD8 ? 0038087C0 ?
008BE85E0 ? 000000002 ?
可以看到堆栈的内容一摸一样,和Bug 14002325 : SELECT * FROM V$ASM_DISKGROUP HANGING.但是遗憾的是这篇文档没有告诉我们形成的原因,也没有告诉我们解决的办法。它只说了这是一个bug。面对这样的情况,我们能做的唯一办法就是规避。那么怎么规避这个问题呢?那么我们需要了解下原理。
首先当我们查询v$asm_disk和v$asm_diskgroup这两个视图的时候,ASM实例会启动pzNN进程,该进程会去和gpnp的守护进程进行通信。这一步主要的目的是从gpnp的profile文件当中获取到discovery string。然后根绝这个discovery string去查活动的disk。我们在执行这个查找discovery string的操作的时候,将会把这些信息记录到RBAL的trace文件里面。据Oracle称是为了出于bebug的目的。要想规避这个问题,我们可以查询v$asm_disk_stat和v$asm_diskgroup_stat这两个视图,查询这两个视图的时候不需要去和gpnp守护进程通信,也不用去查discovery_string。这样我们就规避了这个问题。我们再看一下这两个视图的定义。
SQL> desc v$asm_diskgroup
Name Null? Type
-------------------------------------------------------------------------------------------------- -------- -------
GROUP_NUMBER NUMBER
NAME VARCHAR2(30)
SECTOR_SIZE NUMBER
BLOCK_SIZE NUMBER
ALLOCATION_UNIT_SIZE NUMBER
STATE VARCHAR2(11)
TYPE VARCHAR2(6)
TOTAL_MB NUMBER
FREE_MB NUMBER
HOT_USED_MB NUMBER
COLD_USED_MB NUMBER
REQUIRED_MIRROR_FREE_MB NUMBER
USABLE_FILE_MB NUMBER
OFFLINE_DISKS NUMBER
COMPATIBILITY VARCHAR2(60)
DATABASE_COMPATIBILITY VARCHAR2(60)
VOTING_FILES VARCHAR2(1)
SQL> desc v$asm_diskgroup_stat
Name Null? Type
-------------------------------------------------------------------------------------------------- ------- -------
GROUP_NUMBER NUMBER
NAME VARCHAR2(30)
SECTOR_SIZE NUMBER
BLOCK_SIZE NUMBER
ALLOCATION_UNIT_SIZE NUMBER
STATE VARCHAR2(11)
TYPE VARCHAR2(6)
TOTAL_MB NUMBER
FREE_MB NUMBER
HOT_USED_MB NUMBER
COLD_USED_MB NUMBER
REQUIRED_MIRROR_FREE_MB NUMBER
USABLE_FILE_MB NUMBER
OFFLINE_DISKS NUMBER
COMPATIBILITY VARCHAR2(60)
DATABASE_COMPATIBILITY VARCHAR2(60)
VOTING_FILES VARCHAR2(1)
SQL> select * from v$FIXED_VIEW_DEFINITION where view_name ='GV$ASM_DISKGROUP';
VIEW_NAME VIEW_DEFINITION
----------------- -----------------------------------------------------------------------------------------------------------------------------------------------
GV$ASM_DISKGROUP
select g.inst_id, g.number_kfgrp, g.name_kfgrp, g.sector_kfgrp, g.blksize_kfgrp, g.ausize_kfgrp, decode(g.state_kfgrp, 0, 'INVALID', 1, 'UNKNOWN', 2, 'DISMOUNTED', 3
, 'CREATING', 4, 'MOUNTING', 5, decode(bitand(g.lflags_kfgrp, 1), 1, 'QUIESCING', 'MOUNTED'), 6, 'DISMOUNTING', 7, 'CONNECTED', 8, 'BROKEN', 9, 'CONNECTING'
, 10, 'BREAKING', 11, 'DROPPING', 12, 'DROPPING', 255, 'RESTRICTED'), decode(g.type_kfgrp, 1, 'EXTERN', 2, 'NORMAL', 3, 'HIGH'), g.totmb_kfgrp, g.freemb_kfgrp, g.hotmb
_kfgrp, g.coldmb_kfgrp, g.minspc_kfgrp, g.usable_kfgrp, g.offline_kfgrp, g.compat_kfgrp, g.dbcompat_kfgrp, decode(bitand(g.flags32_kfgrp, 2048), 2048, 'Y', 'N') from x
$kfgrp g where state_kfgrp != 0
SQL> select * from v$FIXED_VIEW_DEFINITION where view_name ='GV$ASM_DISKGROUP_STAT';
VIEW_NAME VIEW_DEFINITION
----------------- ---------------------------------------------------------------------------------------------------------------------------------------------
GV$ASM_DISKGROUP_STAT
select g.inst_id, g.number_kfgrp, g.name_kfgrp, g.sector_kfgrp, g.blksize_kfgrp, g.ausize_kfgrp, decode(g.state_kfgrp, 0, 'INVALID', 1, 'UNKNOWN', 2, 'DISMOUNTED', 3
, 'CREATING', 4, 'MOUNTING', 5, decode(bitand(g.lflags_kfgrp, 1), 1, 'QUIESCING', 'MOUNTED'), 6, 'DISMOUNTING', 7, 'CONNECTED', 8, 'BROKEN', 9, 'CONNECTING'
, 10, 'BREAKING', 11, 'DROPPING', 12, 'DROPPING', 255, 'RESTRICTED'), decode(g.type_kfgrp, 1, 'EXTERN', 2, 'NORMAL', 3, 'HIGH'), g.totmb_kfgrp, g.freemb_kfgrp, g.hotmb
_kfgrp, g.coldmb_kfgrp, g.minspc_kfgrp, g.usable_kfgrp, g.offline_kfgrp, g.compat_kfgrp, g.dbcompat_kfgrp, decode(bitand(g.flags32_kfgrp, 2048), 2048, 'Y', 'N') from x
$kfgrp_stat g where state_kfgrp != 0 and number_kfgrp != 0
可以看到这两个视图的字段是一摸一样的。而它们查询的基表一个是x$kfgrp,另外一个是x$kfgrp_stat.
分享到:
更多