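Background
At a hospital, all database connections at the main campus worked normally, but clients at the branch campus would hang or be disconnected outright whenever they sat idle for a while. The instinctive first suspicion for this kind of symptom is the network layer.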
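Detailed problem description
Database connection failures have many possible causes, and each case has to be analyzed on its own terms. In this case, SQL*Net connection tracing was used to gather evidence. The trace was configured as follows.
Add the following to the sqlnet.ora file on the client: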
TRACE_LEVEL_CLIENT=16
TRACE_FILE_CLIENT=CLIENT
TRACE_TIMESTAMP_CLIENT=ON
trace_directory_client=C:\oracle\trace
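The parameters above mean the following:
TRACE_LEVEL_CLIENT -- sets the client trace level
TRACE_LEVEL_CLIENT accepts values from 0 to 16. The higher the level, the more complete the collected information. The default is 0, which produces no trace output.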
off or 0 for no trace output
user or 4 for user trace information
admin or 10 for administration trace information
support or 16 for Oracle Support Services trace information
TRACE_FILE_CLIENT -- sets the name of the client trace file
TRACE_TIMESTAMP_CLIENT -- whether to write a dd-mon-yyyy hh:mi:ss:mil timestamp on every trace entry
TRACE_DIRECTORY_CLIENT -- sets the directory in which the client trace file is written
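The level names and numbers listed above are interchangeable, which is easy to get wrong when the block is typed by hand on many client machines. The following is a minimal sketch, not part of the original case, that builds the same four sqlnet.ora lines from a level name or number. The function names `normalize_trace_level` and `render_client_trace_settings` are hypothetical helpers introduced only for illustration.

```python
# Symbolic trace levels and their numeric equivalents, as listed above.
TRACE_LEVELS = {"off": 0, "user": 4, "admin": 10, "support": 16}


def normalize_trace_level(value):
    """Return the numeric level for a name such as 'support' or a number 0..16."""
    text = str(value).strip().lower()
    if text in TRACE_LEVELS:
        return TRACE_LEVELS[text]
    level = int(text)  # raises ValueError for anything that is not a number
    if not 0 <= level <= 16:
        raise ValueError(f"trace level must be between 0 and 16, got {level}")
    return level


def render_client_trace_settings(level, directory, file_name="CLIENT"):
    """Build the sqlnet.ora lines that enable client-side tracing."""
    return "\n".join([
        f"TRACE_LEVEL_CLIENT={normalize_trace_level(level)}",
        f"TRACE_FILE_CLIENT={file_name}",
        "TRACE_TIMESTAMP_CLIENT=ON",
        f"TRACE_DIRECTORY_CLIENT={directory}",
    ])


if __name__ == "__main__":
    # Same values as the configuration used in this case.
    print(render_client_trace_settings("support", r"C:\oracle\trace"))
```

Level 16 produces very large trace files, so it is worth setting the level back to 0 (or removing the lines) once the problem has been captured.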
Incident support details
The client trace shows the following:
(3880) [11-3月 -2014 12:21:10:750] ntt2err: soc 808 error - operation=5, ntresnt[0]=517, ntresnt[1]=54, ntresnt[2]=0
(3880) [11-3月 -2014 12:21:10:750] ntt2err: exit
(3880) [11-3月 -2014 12:21:10:750] nttrd: exit
(3880) [11-3月 -2014 12:21:10:750] nsprecv: error exit
(3880) [11-3月 -2014 12:21:10:750] nserror: entry
(3880) [11-3月 -2014 12:21:10:750] nserror: nsres: id=0, op=68, ns=12547, ns2=12560; nt[0]=517, nt[1]=54, nt[2]=0; ora[0]=0, ora[1]=0, ora[2]=0
(3880) [11-3月 -2014 12:21:10:750] nsrdr: error exit
(3880) [11-3月 -2014 12:21:10:750] snsbitts_ts: entry
(3880) [11-3月 -2014 12:21:10:750] snsbitts_ts: acquired the bit
(3880) [11-3月 -2014 12:21:10:750] snsbitts_ts: normal exit
(3880) [11-3月 -2014 12:21:10:750] snsbitcl_ts: entry
(3880) [11-3月 -2014 12:21:10:750] snsbitcl_ts: normal exit
(3880) [11-3月 -2014 12:21:10:750] snsbitts_ts: entry
(3880) [11-3月 -2014 12:21:10:750] snsbitts_ts: acquired the bit
(3880) [11-3月 -2014 12:21:10:750] snsbitts_ts: normal exit
(3880) [11-3月 -2014 12:21:10:750] nsdo: nsctxrnk=0
(3880) [11-3月 -2014 12:21:10:750] snsbitcl_ts: entry
(3880) [11-3月 -2014 12:21:10:750] snsbitcl_ts: normal exit
(3880) [11-3月 -2014 12:21:10:750] nsdo: error exit
(3880) [11-3月 -2014 12:21:10:750] nioqrc: wanted 1 got 0, type 0
(3880) [11-3月 -2014 12:21:10:750] nioqper: error from nioqrc
(3880) [11-3月 -2014 12:21:10:750] nioqper: ns main err code: 12547
(3880) [11-3月 -2014 12:21:10:750] nioqper: ns (2) err code: 12560
(3880) [11-3月 -2014 12:21:10:750] nioqper: nt main err code: 517
(3880) [11-3月 -2014 12:21:10:750] nioqper: nt (2) err code: 54
(3880) [11-3月 -2014 12:21:10:750] nioqper: nt OS err code: 0
(3880) [11-3月 -2014 12:21:10:750] nioqer: entry
(3880) [11-3月 -2014 12:21:10:750] nioqer: incoming err = 12151
(3880) [11-3月 -2014 12:21:10:750] nioqce: entry
(3880) [11-3月 -2014 12:21:10:750] nioqce: exit
(3880) [11-3月 -2014 12:21:10:750] nioqer: returning err = 3135
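The interesting part of a trace like this is the handful of error codes near the end: 12547 corresponds to ORA-12547 (TNS:lost contact), 12560 to ORA-12560 (TNS:protocol adapter error), and the returned 3135 to ORA-03135 (connection lost contact). In a level-16 trace these lines are easy to miss, so the sketch below pulls them out. It assumes the line layout shown above; `extract_error_codes` is a hypothetical helper, not an Oracle tool.

```python
import re

# Matches lines such as "... nioqper: ns main err code: 12547".
ERR_CODE_RE = re.compile(r"nioqper:\s+(.+?) err code:\s+(\d+)")
# Matches lines such as "... nioqer: returning err = 3135".
RETURN_RE = re.compile(r"nioqer:\s+returning err = (\d+)")


def extract_error_codes(trace_lines):
    """Collect the nioqper error codes and the error finally returned to the caller."""
    codes = {}
    for line in trace_lines:
        match = ERR_CODE_RE.search(line)
        if match:
            codes[match.group(1)] = int(match.group(2))
            continue
        match = RETURN_RE.search(line)
        if match:
            codes["returned"] = int(match.group(1))
    return codes


if __name__ == "__main__":
    sample = [
        "(3880) [11-3月 -2014 12:21:10:750] nioqper: error from nioqrc",
        "(3880) [11-3月 -2014 12:21:10:750] nioqper: ns main err code: 12547",
        "(3880) [11-3月 -2014 12:21:10:750] nioqper: ns (2) err code: 12560",
        "(3880) [11-3月 -2014 12:21:10:750] nioqper: nt main err code: 517",
        "(3880) [11-3月 -2014 12:21:10:750] nioqer: returning err = 3135",
    ]
    print(extract_error_codes(sample))
```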
When the connection hangs, the trace shows the following:
(3880) [11-3月 -2014 12:21:10:781] nsprecv: reading from transport...
(3880) [11-3月 -2014 12:21:10:781] nttrd: entry
Before the error appears, the packet expected from the server never reaches the client. The client trace first shows that the connect handshake completed:
(3880) [11-3月 -2014 12:21:10:843] nscon: connect handshake is complete
It then shows the last packet being sent from the client (nspsend):
(3880) [11-3月 -2014 12:21:10:859] nspsend: plen=168, type=6
(3880) [11-3月 -2014 12:21:10:859] nttwr: entry
(3880) [11-3月 -2014 12:21:10:859] nttwr: socket 808 had bytes written=168
(3880) [11-3月 -2014 12:21:10:859] nttwr: exit
(3880) [11-3月 -2014 12:21:10:859] nspsend: packet dump
(3880) [11-3月 -2014 12:21:10:859] nspsend: 00 A8 00 00 06 00 00 00 |........|
(3880) [11-3月 -2014 12:21:10:859] nspsend: 00 00 DE AD BE EF 00 9E |........|
(3880) [11-3月 -2014 12:21:10:859] nspsend: 0A 20 01 00 00 04 00 00 |........|
(3880) [11-3月 -2014 12:21:10:859] nspsend: 04 00 03 00 00 00 00 00 |........|
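The client then waits for a reply (five minutes in the example case; in the excerpt below the timestamps jump from 12:21:10:984 to 12:21:32:406). The trace shows it trying to receive a packet, but the packet never arrives.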
(3880) [11-3月 -2014 12:21:10:984] nsrdr: recving a packet
(3880) [11-3月 -2014 12:21:10:984] nsprecv: entry
(3880) [11-3月 -2014 12:21:10:984] nsprecv: reading from transport...
(3880) [11-3月 -2014 12:21:10:984] nttrd: entry
(3880) [11-3月 -2014 12:21:32:406] ntt2err: entry
(3880) [11-3月 -2014 12:21:32:406] ntt2err: soc 352 error - operation=5, ntresnt[0]=517, ntresnt[1]=54, ntresnt[2]=0
(3880) [11-3月 -2014 12:21:32:406] ntt2err: exit
(3880) [11-3月 -2014 12:21:32:406] nttrd: exit
(3880) [11-3月 -2014 12:21:32:406] nsprecv: error exit
(3880) [11-3月 -2014 12:21:32:406] nserror: entry
(3880) [11-3月 -2014 12:21:32:406] nserror: nsres: id=0, op=68, ns=12547, ns2=12560; nt[0]=517, nt[1]=54, nt[2]=0; ora[0]=0, ora[1]=0, ora[2]=0
(3880) [11-3月 -2014 12:21:32:406] nsrdr: error exit
(3880) [11-3月 -2014 12:21:32:406] snsbitts_ts: entry
(3880) [11-3月 -2014 12:21:32:406] snsbitts_ts: acquired the bit
(3880) [11-3月 -2014 12:21:32:406] snsbitts_ts: normal exit
(3880) [11-3月 -2014 12:21:32:406] snsbitcl_ts: entry
(3880) [11-3月 -2014 12:21:32:406] snsbitcl_ts: normal exit
(3880) [11-3月 -2014 12:21:32:406] snsbitts_ts: entry
(3880) [11-3月 -2014 12:21:32:406] snsbitts_ts: acquired the bit
(3880) [11-3月 -2014 12:21:32:406] snsbitts_ts: normal exit
(3880) [11-3月 -2014 12:21:32:406] nsdo: nsctxrnk=0
(3880) [11-3月 -2014 12:21:32:406] snsbitcl_ts: entry
(3880) [11-3月 -2014 12:21:32:406] snsbitcl_ts: normal exit
(3880) [11-3月 -2014 12:21:32:406] nsdo: error exit
(3880) [11-3月 -2014 12:21:32:406] nioqrc: wanted 1 got 0, type 0
(3880) [11-3月 -2014 12:21:32:406] nioqper: error from nioqrc
(3880) [11-3月 -2014 12:21:32:406] nioqper: ns main err code: 12547
(3880) [11-3月 -2014 12:21:32:406] nioqper: ns (2) err code: 12560
(3880) [11-3月 -2014 12:21:32:406] nioqper: nt main err code: 517
(3880) [11-3月 -2014 12:21:32:406] nioqper: nt (2) err code: 54
(3880) [11-3月 -2014 12:21:32:406] nioqper: nt OS err code: 0
(3880) [11-3月 -2014 12:21:32:406] nioqer: entry
(3880) [11-3月 -2014 12:21:32:406] nioqer: incoming err = 12151
(3880) [11-3月 -2014 12:21:32:406] nioqce: entry
(3880) [11-3月 -2014 12:21:32:406] nioqce: exit
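The telltale sign in the trace above is the jump in timestamps between `nttrd: entry` and `ntt2err: entry`, which is where the client sat waiting on the socket. The sketch below scans a trace for such jumps. It assumes the bracketed timestamp layout shown above, treats the last field as milliseconds, and ignores the date (so it only works within a single day); `find_gaps` and the 5-second threshold are illustrative choices, not part of the original analysis.

```python
import re

# Captures hh:mi:ss:mil from a prefix such as "[11-3月 -2014 12:21:10:984]".
TIME_RE = re.compile(r"\[[^\]]*?(\d{1,2}):(\d{2}):(\d{2}):(\d{3})\]")


def time_of_day_ms(line):
    """Return the line's time of day in milliseconds, or None if it has no timestamp."""
    match = TIME_RE.search(line)
    if match is None:
        return None
    hours, minutes, seconds, millis = (int(group) for group in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def find_gaps(trace_lines, threshold_ms=5000):
    """Return (gap_ms, line_before, line_after) for each pause of at least threshold_ms."""
    gaps = []
    previous_ms = None
    previous_line = None
    for line in trace_lines:
        current_ms = time_of_day_ms(line)
        if current_ms is None:
            continue  # skip lines without a timestamp
        if previous_ms is not None and current_ms - previous_ms >= threshold_ms:
            gaps.append((current_ms - previous_ms, previous_line, line))
        previous_ms, previous_line = current_ms, line
    return gaps


if __name__ == "__main__":
    sample = [
        "(3880) [11-3月 -2014 12:21:10:984] nsprecv: reading from transport...",
        "(3880) [11-3月 -2014 12:21:10:984] nttrd: entry",
        "(3880) [11-3月 -2014 12:21:32:406] ntt2err: entry",
    ]
    for gap_ms, before, after in find_gaps(sample):
        print(f"{gap_ms} ms of silence between:\n  {before}\n  {after}")
```

A long pause that ends in `ntt2err` right after a read on the transport suggests the client was waiting on the network rather than doing work of its own, which matches the symptom described here.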
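Conclusion and solution
The trace information confirms that this is a network-layer problem caused by lost packets. The root cause was a firewall policy: long-lived connections had been disabled on the firewall. Once they were enabled, the problem was resolved.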