In a customer's Oracle Active Data Guard environment, the primary database logs the following errors during the busiest period of each day:
Fri Apr 24 17:25:59 2015
ORA-16198: LGWR received timedout error from KSR
LGWR: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16198)
LGWR: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
Error 16198 for archive log file 1 to 'afabdg01'
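To check whether the errors really cluster in the peak period, the alert log can be tallied by hour. The following is only a minimal sketch: it assumes the classic text alert log format shown above, where a timestamp line precedes the messages it belongs to, and the function name `count_ora16198_by_hour` is hypothetical.

```python
import re
from collections import Counter

# Timestamp lines in the text alert log look like "Fri Apr 24 17:25:59 2015".
# The hour is captured so that errors can be grouped by hour of day.
TS_RE = re.compile(
    r"^[A-Z][a-z]{2} [A-Z][a-z]{2}\s+\d{1,2} (\d{2}):\d{2}:\d{2} \d{4}$"
)

def count_ora16198_by_hour(lines):
    """Return a Counter mapping hour of day (0-23) to ORA-16198 occurrences."""
    counts = Counter()
    current_hour = None  # hour taken from the most recent timestamp line
    for raw in lines:
        line = raw.strip()
        match = TS_RE.match(line)
        if match:
            current_hour = int(match.group(1))
        elif line.startswith("ORA-16198") and current_hour is not None:
            counts[current_hour] += 1
    return counts

sample = """Fri Apr 24 17:25:59 2015
ORA-16198: LGWR received timedout error from KSR
LGWR: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16198)
LGWR: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
Error 16198 for archive log file 1 to 'afabdg01'
""".splitlines()

hourly = count_ora16198_by_hour(sample)
print(dict(hourly))
```

In practice the function would be fed the lines of the real alert log file; a count concentrated in one or two hours supports the peak-load explanation.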
The following MOS note is relevant:
Redo Transport Services fails with ORA-16198 when using SYNC (synchronous) mode (Doc ID 808469.1)
Applies to:
Oracle Database - Enterprise Edition - Version 10.2.0.1 and later
Information in this document applies to any platform.
***Checked for relevance on 26-Feb-2014***
This will affect LGWR SYNC transport mode in 10.2.0.x databases and SYNC transport mode in 11.2.0.x databases
Symptoms
Redo Transport Services failed with ORA-16198 from primary database to either the physical standby database or logical standby database using LGWR SYNC mode.
The primary alert log file showed:
ORA-16198: LGWR received timedout error from KSR
LGWR: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16198)
LGWR: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
Fri Feb 6 21:22:26 2009
Errors in file /u01/app/oracle/admin/crthpd01/bdump/crthpd01_lgwr_2793488.trc:
ORA-16198: Timeout incurred on internal channel during remote archival
LGWR: Network asynch I/O wait error 16198 log 2 service '(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=tcp)(HOST=abc)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=xyz_STANDBY_XPT.world)(INSTANCE_NAME=xyz)(SERVER=dedicated)))'
Fri Feb 6 21:22:26 2009
Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED
LGWR: Failed to archive log 2 thread 1 sequence 628 (16198)
Fri Feb 6 21:22:27 2009
If you use Data Guard Broker, then the primary drc log showed:
DG 2009-04-12-12:12:08 0 2 0 RSM detected log transport problem: log transport for database 'xyz_STANDBY' has the following error.
DG 2009-04-12-12:12:08 0 2 0 ORA-16198: Timeout incurred on internal channel during remote archival
DG 2009-04-12-12:12:08 0 2 0 RSM0: HEALTH CHECK ERROR: ORA-16737: the redo transport service for standby database "xyz_STANDBY" has an error
DG 2009-04-12-12:12:08 0 2 678445062 Operation CTL_GET_STATUS cancelled during phase 2, error = ORA-16778
DG 2009-04-12-12:12:08 0 2 678445062 Operation CTL_GET_STATUS cancelled during phase 2, error = ORA-16778
Cause
The NET_TIMEOUT attribute of LOG_ARCHIVE_DEST_2 on the primary is set too low: in this example, LNS could not finish sending the redo block within 10 seconds.
log_archive_dest_2 service="(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=tcp)(HOST=abc)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=xyz_STANDBY_XPT.world)(INSTANCE_NAME=xyz)(SERVER=dedicated)))", LGWR SYNC AFFIRM delay=0 OPTIONAL max_failure=0 max_connections=1 reopen=300 db_unique_name="xyz_STANDBY" register net_timeout=10 valid_for=(online_logfile,primary_role)
Note that LGWR SYNC log transport mode is in use and NET_TIMEOUT is set to 10.
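When several destinations have to be audited, the relevant attributes can be pulled out of the parameter value with a short script. This is a sketch only: it assumes the attribute string looks like the log_archive_dest_2 value in this Cause section, and the helper name `parse_dest_attributes` is hypothetical.

```python
import re

def parse_dest_attributes(value):
    """Extract SYNC, AFFIRM and NET_TIMEOUT from a LOG_ARCHIVE_DEST_n value."""
    # Blank out quoted parts (connect descriptor, db_unique_name) so their
    # contents are never mistaken for transport attributes.
    attrs = re.sub(r'"[^"]*"', '""', value).upper()
    timeout = re.search(r"\bNET_TIMEOUT\s*=\s*(\d+)", attrs)
    return {
        # \b keeps ASYNC and NOAFFIRM from matching SYNC and AFFIRM.
        "sync": re.search(r"\bSYNC\b", attrs) is not None,
        "affirm": re.search(r"\bAFFIRM\b", attrs) is not None,
        "net_timeout": int(timeout.group(1)) if timeout else None,
    }

dest_2 = (
    'service="(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=tcp)(HOST=abc)'
    '(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=xyz_STANDBY_XPT.world)'
    '(INSTANCE_NAME=xyz)(SERVER=dedicated)))", LGWR SYNC AFFIRM delay=0 '
    'OPTIONAL max_failure=0 max_connections=1 reopen=300 '
    'db_unique_name="xyz_STANDBY" register net_timeout=10 '
    'valid_for=(online_logfile,primary_role)'
)

info = parse_dest_attributes(dest_2)
print(info)
```

A destination that comes back with `sync` true and a small `net_timeout` is a candidate for the ORA-16198 symptom described in this note.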
Solution
Increase the NET_TIMEOUT value of LOG_ARCHIVE_DEST_2 on the primary to at least 15 to 20 seconds, depending on your network speed.
If you don't use Data Guard Broker, you can change LOG_ARCHIVE_DEST_2 from SQL*Plus with the ALTER SYSTEM command. For example:
SQL> ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='SERVICE=xyz_STANDBY LGWR SYNC DB_UNIQUE_NAME=xyz_STANDBY NET_TIMEOUT=30 VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE)';
If you use Data Guard Broker, modify the NetTimeout property from DGMGRL or Grid Control instead.
For example, connect to the DGMGRL command-line interface from the primary machine:
DGMGRL> connect sys/
DGMGRL> EDIT DATABASE '' SET PROPERTY NetTimeout = 30;
=======================================================================
Note: If the NET_TIMEOUT attribute has already been set to 30 and you still get ORA-16198, LNS could not finish sending the redo block within 30 seconds.
The slowness may be caused by:
1. Operating system. Keep track of OS resource usage (for example with iostat).
2. Network. Keep track of network traffic (for example with tcpdump).
The purpose here is to figure out whether the slowness is caused by a temporary OS glitch or a temporary network glitch.
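As a rough complement to tcpdump, TCP connect latency from the primary host to the standby listener can be sampled during the peak period. This is only a sketch under stated assumptions: the helper name `tcp_connect_ms` is hypothetical, and connect time is a coarse indicator of network health, not a measurement of redo transport throughput.

```python
import socket
import time

def tcp_connect_ms(host, port, timeout=5.0):
    """Return the time in milliseconds taken to open a TCP connection."""
    start = time.perf_counter()
    # create_connection raises OSError if the host is unreachable or the
    # attempt exceeds the timeout; the caller decides how to record that.
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0
```

For example, `tcp_connect_ms("abc", 1521)` would probe the listener address from the sample connect descriptor. Sampling it at intervals and comparing peak and off-peak values gives a first hint as to whether the network degrades under load.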
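This error occurs because the LGWR process on the primary could not finish sending its data to the standby within the NET_TIMEOUT window (10 seconds in this case). Setting NET_TIMEOUT to 15 or 30 seconds gives LGWR more time to ship the data and makes the error less likely. If the problem persists with NET_TIMEOUT set to 30 seconds, check whether the network between the primary and the standby has a performance problem or a fault. For a standby database reached over a WAN, real-time synchronization with LGWR SYNC is best avoided; asynchronous transport (ARCH or ASYNC) is a better fit.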
From the ITPUB blog: http://blog.itpub.net/23135684/viewspace-1713244/. If you repost this article, please credit the source; otherwise legal action may be taken.