IPS阻挡未知archlog导致standbyDB无法同步
Environment:HPUX 11.31 ORACLE 10.2.0.4 PDB 3nodes RAC Standby 2nodes RAC
Symptom:standbyDB无法同步,越来越多的archlog未被apply,且存在gap,standby DB在等待113171这个log,但是这个log传不过来。primary DB
2号节点alertlog报错如下:
ORA-03135: connection lost contact
Tue Jul 24 12:18:06 2012
FAL[server, ARC5]: FAL archival, error 3135 closing archivelog file 'sfc12stb'
FAL[server, ARC5]: FAL archive failed, see trace file.
Tue Jul 24 12:18:06 2012
Errors in file /apps/oracle/admin/sfc12db/bdump/sfc12db2_arc5_16226.trc:
ORA-16055: FAL request rejected
ARCH: FAL archive failed. Archiver continuing Tue Jul 24 12:18:06 2012 ORACLE Instance sfc12db2 - Archival Error. Archiver continuing.
Tue Jul 24 12:19:59 2012
Solution:
根据alertlog中的保存检索MOS,如下
ORA - 03135 : connection lost contact while shipping from Primary Server to Standby server [ID 739522.1] –附录1
Changes There may be a firewall rule changed or the firewall is newly installed.
根据这个信息联系网络部门同事,可网络部门同事严重误导了我,说我的主库和备库之间是纯路由网络没有firewall,这样我就开始不断尝试其他的办法试图解决这个问题。
1, 重启2号节点。
StandbyDB:Alter database recover managed standby database cancel;
alter database recover managed standby database disconnect;
依然无效。问题就是当standby视图获取archive gap 113171时直接被
二号节点拒绝ORA-03135: connection lost contact。
2. 轮流重启了所有节点,且重启的standby DB的os没有效果gap 113171这个log无法传送过来。
3. 尝试手动传输113171这个log。
A.将sequence=113171的这个log从ASM传到filesystem步骤如下:
1.Log ontothe target database that is local to the ASM instance as the sys user.
2.create source directory with in the target database.
SQL>create or replace directory SOURCE_DIR as '+DGARCH/SFC12DB/ARCHIVELOG/2012_07_24';
Directory created.
(In this example +DGARCH/SFC12DB/ARCHIVELOG/is the source directory where the datafile is located and where you wish to copy the file from.)
3.create destination directory with in database.
SQL>createor replace directory ORACLE_DEST as '/tmp';
Directory created.
(In this example /restoreisthe destination directory where the datafile is to be copied to.)
4.Executethe dbms_file_transfer package.
SQL>
BEGIN
dbms_file_transfer.copy_file(source_directory_object =>
' SOURCE_DIR ', source_file_name => 'thread_2_seq_113171.540.789475773',
destination_directory_object =>'ORACLE_DEST',
destination_file_name =>'thread_2_seq_113171.540.789475773');
END;
/
Ps:oracle 11G中ASM的文件是可以直接copy的就不用这么麻烦了。
B.利用FTP传输到standbyDB
问题是我这个archlog 17M
当ftp传输到13.9M时卡住不动。无法传输ok…..,我有试着传输大于17M的文件可以传送。小于17M的也可以。难道是这个log有问题?我又尝试在同一个网段传送这个log居然可以传送(primaryDB 172.16.50.31/32/33 standbyDB 172.16.51.151/152),纠结了。这时已经是凌晨一点多了,距离问题发生已经13个小时,这十个来小时高度紧张,思考疲惫死了,睡觉去明天再找网络部门排查。
4. 一夜没睡好,第二天找网络部门同事一起来看,为什么跨网段后,这个113171 log就无法完整传输?和两位网络部门的同事一起尝试不同网段之间传送这个文件,还是无法完整传输。这事网络部门同事想到我们公司还有一个IPSintrusion prevention system入侵防御系统。检查日志终于找到问题的源头,未知的原因导致IPS将113171这个文件当成DOS攻击,阻止了主库和备库之间的传输。(我有想掐死他们的冲动,昨天问过他们好多次这两个网段间是否有firewall,他们一直肯定的说没有,好吧别给我解释IPS和firewall不是一回事…..)
5. 问题的原因找到网络同事修改IPS策略之后可以完整传送这个log,
在standby DB手动注册这个log
Alter database register logfile'/apps/oracle/1_117472_679867280.dbf';
这个gap消除之后 standbydb 又报了两个gap是其他两个thread的
可是这两个log已经没有(备份策略是备份后删除archivelog只保留最近六个小时的),这样需要从磁带中抓取那两个log
脚本如下:
RUN
{
allocate channel dev0 type'sbt_tape'
parms'SBT_LIBRARY=/opt/omni/lib/libob2oracle8_64bit.so,ENV=(OB2BARTYPE=Oracle8,OB2APPNAME=sfc12db,OB2BARLIST=sfc12rc3_fullDB)';
SETARCHIVELOG DESTINATIONTO'/tmp';
RESTORE ARCHIVELOG SEQUENCE 113854 thread 3 ; --只是恢复需要的log
release channel dev0;
}
附录1.使用From Until语句
restore archivelog from sequence 69346 until sequence 69397 thread 1;
restore archivelog from sequence 75193 until sequence 75263 thread 2;
2.使用Between语句
restore archivelog sequence between 134 and 136 thread 1;
restore archivelog sequence between 56 and 58 thread 2;
对备库进行设置:
alter database recover managed standby database cancel;
alter database recover managed standby database disconnect;
当所有的log都被applied之后
alter database recover managed standby database cancel;
alter database recover managed standby database using current logfile disconnect;---启用real time apply。处理OK
思考:原来的archivelog 都可以正常的传送而不会被挡掉,发生了什么使得IPS将这个log当作了DOS攻击?.......
附录一
ORA - 03135 : connection lost contact while shipping from Primary Server to Standby server [ID 739522.1] | ||
| ||
| Modified 19-AUG-2011 Type PROBLEM Status MODERATED |
|
In this Document
Symptoms
Changes
Cause
Solution
References
This document is being delivered to you via Oracle Support's Rapid Visibility (RaV) process and therefore has not been subject to an independent technical review. |
Applies to:
Oracle Net Services - Version: 10.1.0.2.0 to 11.2.0.2 - Release: 10.1 to 11.2
Information in this document applies to any platform.
Symptoms
customer reported intermittently the archiver process hits error and
following seen in the primary node alert log
Errors in file /opt/oracle/admin/ORCL/bdump/orcl1_arc4_28059.trc:
ORA-03135: connection lost contact
Mon Sep 22 22:24:51 2008
FAL[server, ARC4]: FAL archive failed, see trace file.
Mon Sep 22 22:24:51 2008
Errors in file /opt/oracle/admin/ORCL/bdump/orcl1_arc4_28059.trc:
ORA-16055: FAL request rejected
.
Changes
There may be a firewall rule changed or the firewall is newly installed.
Cause
Mainly ORA-3135 occurs when the connection is broken because of underlying network issues.
In this case, an intermediate firewall between primary and secondary server is altering the data inside the sqlnet packet.
The trace clearly shows that primary sends a data with
[22-SEP-2008 22:24:51:414] nspsend: 53 53 3D 28 50 52 4F 54 |SS=(PROT|
[22-SEP-2008 22:24:51:414] nspsend: 4F 43 4F 4C 3D 54 43 50 |OCOL=TCP|
[22-SEP-2008 22:24:51:414] nspsend: 29 28 48 4F 53 54 3D 70 |)(HOST=
h|
[22-SEP-2008 22:24:51:414] nspsend: 6E 79 6D 65 72 63 75 72 |
ostname1|
[22-SEP-2008 22:24:51:414] nspsend: 79 64 62 30 39 2D 76 69 |
_db011-v|
[22-SEP-2008 22:24:51:414] nspsend: 72 74 29 28 50 4F 52 54 |
ip)(PORT|
[22-SEP-2008 22:24:51:414] nspsend: 3D 31 35 32 31 29 29 61 |=1521))a|
[22-SEP-2008 22:24:51:414] nspsend: 20 4C 0B 05 01 00 0D 00 |.L......|
and the firewall / intermediate network device is changing the hostname to IP address (possibly a
rule set in the firewall).
[22-SEP-2008 22:24:51:586] nsprecv: 53 53 3D 28 50 52 4F 54 |SS=(PROT|
[22-SEP-2008 22:24:51:586] nsprecv: 4F 43 4F 4C 3D 54 43 50 |OCOL=TCP|
[22-SEP-2008 22:24:51:586] nsprecv: 29 28 48 4F 53 54 3D 31 |)(HOST=
1|
[22-SEP-2008 22:24:51:586] nsprecv: 36 31 2E 32 32 31 2E 32 |
11.111.1|
[22-SEP-2008 22:24:51:586] nsprecv: 31 37 2E 35 35 29 28 50 |
23.45)(P|
[22-SEP-2008 22:24:51:586] nsprecv: 4F 52 54 3D 37 31 36 39 |ORT=7171|
[22-SEP-2008 22:24:51:586] nsprecv: 29 29 61 20 4C 0B 05 01 |))a.L...|
Solution
Involve the firewall administration team to check whether the firewall /intermediate network device can be stopped from altering the sqlnet packet.
If this is not the case, then we need to take matching sqlnet server traces at support level to narrow the issue.
Enabling the server trace on the primary and restarting the archiver process will generate sqlnet trace for the archiver process.
Solution to tracing archiver process without getting tons of trace for other background processes and connections is to
1. Enabled Sever tracing on primary
2. Kill the archive process that is archiving log files remote (dont kill the wrong archiver)
3. Disable trace
Do these 3 actions as quickly as possible. Tie the server trace to the new archive process via PID.
Oracle net server trace will have the PID in net trace name
Archive process will be traced untill the process is stopped/killed. So you may wish to kill the process again, once error reproduced.
WARNING !!!Small chance the process wont re-spawn
Then enable the sqlnet tracing on the standby server to get the corresponding process which takes the request from this archiver process.
Once error is hit match the archiver sqlnet trace file to see the network issue if any.
Note: you can easily cross check the sqlnet server trace and archiver process by means of Process ID which is appended to the trace file name in the bdump.
For enabling tracing referNote 395525.1 How to Enable Oracle SQLNet Client , Server , Listener , Kerberos and External procedure Tracing from Net Manager