Windows下RAC归档满和ORA-04031问题
(V 1.0)
文章版权所有Jusin Hao(luckyfriends),支持原创,转载请注明。
2012年8月
版本信息
日期 | 版本 | 描述 | 作者/修改人 | 备注 |
2012-8-31 | 1.0 | 创建 | Jusin Hao | |
目 录
3. 问题2:ASM INSTANCE 操作报ORA-04031现象... 12
1. 介绍
1.1. 编写目的
本文档用于记录******安装操作过程。
1.2. 文档说明
本文档包含****的安装操作等内容。
1.3. 定义
1.4. 参考文档
2. 问题1:归档满现象及环境信息:
# localhost name resolution is handled within DNS itself.
# 127.0.0.1 localhost
# ::1 localhost
10.139.5.13 db01
10.139.5.14 db02
10.139.5.15 db01-vip
10.139.5.16 db02-vip
10.139.5.17 scan
10.10.10.1 db01-pri
10.10.10.2 db02-pri
10.139.5.20 dbbackup
10.139.5.21 appbackup
SQL> select inst_id,instance_name from gv$instance;
INST_ID INSTANCE_NAME
---------- ----------------
1 jchr1
2 jchr2
SQL> show parameter db_recovery
NAME TYPE VALUE
------------------------------------ ----------- -----------------
db_recovery_file_dest string
db_recovery_file_dest_size big integer 20G
SQL> archive log list
数据库日志模式 存档模式
自动存档 启用
存档终点 +DATA
最早的联机日志序列 31075
下一个存档日志序列 31076
当前日志序列 31076
SQL>
SQL> select flashback_on from v$database;
FLASHBACK_ON
------------------
NO
脚本delete.bat:
rman target / nocatalog CMDFILE 'D:\d-arch\rmanbackup.txt' LOG 'D:\d-arch\rman_backup_%DATE:~0,4%%DATE:~5,2%%DATE:~8,2%".log'
脚本D:\d-arch\rmanbackup.txt
delete noprompt archivelog ALL COMPLETED BEFORE "sysdate - 6";
查看脚本日志发现脚本是执行成功的;
但是客户经常报归档空间满,由上面参数,我们知道其归档日志是存放在+DATA卷组中的(ASM卷组)
2.1. 问题1:分析:
1、常规的:我们可以手动备份后清除日志:
1)backup format '/install_source/rman_bak/arch_%d_%U' archivelog all delete input; --删除已经备份过的归档
2)delete noprompt archivelog ALL COMPLETED BEFORE "sysdate - 6";
执行上面这两个命令会发现发现
3)delete noprompt archivelog all; --这会清除所有归档日志,包括刚新生成的(慎用)
2、查看ASM卷组状态
C:\>asmcmd
ASMCMD>
ASMCMD> ls
DATA/
OCR/
ASMCMD> cd data
ASMCMD> ls
JCHR/
ASMCMD> cd jchr
ASMCMD> ls
ARCHIVELOG/
CONTROLFILE/
DATAFILE/
ONLINELOG/
PARAMETERFILE/
STANDBYLOG/
TEMPFILE/
spfilejchr.ora
spfilejchr2.ora
ASMCMD> lsdg
State Type Rebal Sector Block AU Total_MB Free_MB Req_mir_free_MB Usable_file_MB Offline_disks Voting_files N
ame
MOUNTED EXTERN N 512 4096 1048576 819197 48329 0 48329 0 N D
ATA/
MOUNTED EXTERN N 512 4096 1048576 10237 9841 0 9841 0 N O
CR/
通过上面的信息,我们可以发现+DATA\jchr\archivelog\下存放的是数据库的归档文件,但是存在很多2012年和2013年前几个月的的归档日志文件,说明以前的归档文件并没有从ASM卷组上物理删除;同时,有这些归档日志长期占用着物理空间,整个ASM卷组的磁盘空间只剩下40多个G(48329M),因此后续的归档日志很容易超出整个空间,导致无法归档。为什么会这样呢?
Oracle控制文件以及Oracle RMAN的的备份恢复的原理,Oracle 控制文件里边记录了数据库的名字,id,创建的时间戳,以及归档记录和备份信息。
Oracle RMAN的备份恢复的所有信息都依赖于:要么是控制文件,要么是恢复目录(catalog),因为所有的备份与恢复信息都会依据备份是的方式存储到这两个位置。
理所当然的是,对这两个东东里的备份集、镜像副本、归档日志等等所有能备份的对象的任意操作,首先会参这些对象的记录的信息,其次是当被记录的对象发生变化时做相应的更新。
--查看已经归档日志的统计:
select name,status,count(*) from gv$archived_log group by name,status order by 1 desc;
从上面的查询可知,当前的两个节点其归档日志只有2个,其余的17630个其NAME都是NULL值。
看看关于视图v$archived_log中NAME列的解释 :
Archived log file name. If set to NULL, either the log file was cleared before it was archived or an RMAN backup command with the "delete input" option was executed to back up archivelog all (RMAN> backup archivelog all delete input;).
上面的这段话表明当前的这些日志文件要么被手动清除,要么被rman的delete input选项清除。
其次status列的D字段也表明了这些个名字为空的归档日志已经被Deleted.也就是说有17630个归档日志已经被删除了。
--如下视图查询,我们可以看到归档日志状态为D且name为null的,两个实例上都是8815个:
select inst_id,name,count(*) from gv$archived_log group by inst_id,name order by 1 desc;
--查看控制文件已归档记录的相关信息:
SQL> select * from gv$controlfile_record_section where type='ARCHIVED LOG';
INST_ID TYPE RECORD_SIZE RECORDS_TOTAL RECORDS_USED FIRST_INDEX LAST_INDEX LAST_RECID
----------- ---------- ----------------------------------------------------------------------------
2 ARCHIVED LOG 584 10952 10710 8822 8579 52415
1 ARCHIVED LOG 584 10952 10710 8822 8579 52415
SQL> select count (*) from v$archived_log;
COUNT(*)
----------
10710
RECORDS_TOTAL:Number of records allocated for the section
列RECORDS_TOTAL表明为当前TYPE分配的可存储的总数,在两个instance上都为10952条
从最近一次切换日志的查询结果可知,被删除的有8815条,如果下次日志切换再增加一条往哪里放呢?
那些已经超出缺省保留期的归档日志被覆盖(在控制文件中的记录),即被重用,用户在控制文件中保存ARCHIVED LOG部分的保留时间由谁来决定呢,参数control_file_record_keep_time,缺省为7天 ,这意味着7天前的归档日志和备份信息可能在控制文件中已经不存在了
SQL> show parameter control
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
control_file_record_keep_time integer 7
control_files string +DATA/jchr/controlfile/current
.261.793900303, +DATA/jchr/con
trolfile/current.260.793900303
control_management_pack_access string DIAGNOSTIC+TUNING
# 下面的查询正好表明为什么2012_10_23和之前的日志为什么没有被删除
# 因为20130505 12:00:04之后的归档日志记录在控制文件中已经被覆盖了,所以使用delete archivelog all时是根本无法清除之前的日志的,即对于rman下的delete archivelog all方式不会删除控制文件中对应的归档日志信息(在控制文件中设置delete状态的归档日志)
SQL> alter session set nls_date_format='yyyymmdd hh24:mi:ss';
会话已更改。
SQL> select min (FIRST_TIME), min (COMPLETION_TIME), max (FIRST_TIME), max (COMP
LETION_TIME) from v$archived_log;
MIN(FIRST_TIME) MIN(COMPLETION_TI MAX(FIRST_TIME) MAX(COMPLETION_TI
----------------- ----------------- ----------------- -----------------
20130505 10:06:16 20130505 12:00:04 20130519 22:18:57 20130519 22:24:57
SQL> select min (FIRST_TIME), min (COMPLETION_TIME), max (FIRST_TIME), max (COMP
LETION_TIME) from gv$archived_log;
MIN(FIRST_TIME) MIN(COMPLETION_TI MAX(FIRST_TIME) MAX(COMPLETION_TI
----------------- ----------------- ----------------- -----------------
20130505 10:06:16 20130505 12:00:04 20130519 22:18:57 20130519 22:24:57
2.2. 问题1:方案:
如下手工依次删除之前废弃的归档日志即可:
ASMCMD> rm -r 2012*
You may delete multiple files and/or directories.
Are you sure? (y/n) y
3. 问题2:ASM INSTANCE 操作报ORA-04031现象
在命令体是否下执行select、crsctl、crs_stat等命令会报错,如下:
1) ORA-04031: unable to allocate 3896 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","kglsim object batch") (DBD ERROR: error possibly near indicator at char 44 in '/* ASMCMD */ select name, parent_index from v$asm_alias where reference_index=33582893')
2) ORA-04031: unable to allocate 4064 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","kglsim heap") (DBD ERROR: OCIStmtExecute)
3) SQL> show parameter share
ORA-00604: error occurred at recursive SQL level 2
ORA-04031: unable to allocate 4064 bytes of shared memory ("sharedpool","select x.inst_id,x.indx+1,ks...","sga heap(1,0)","kglsim heap")
3.1. 问题2:分析
--无法查看ASM的参数文件路径:
SQL> show parameter spfile
ORA-00604: error occurred at recursive SQL level 2
ORA-04031: unable to allocate 4064 bytes of shared memory ("shared
pool","select x.inst_id,x.indx+1,ks...","sga heap(1,0)","kglsim heap")
--备份参数文件
SQL> create pfile='d:\initasm1.ora' from spfile;
File created.
--参数文件内容:
*._library_cache_advice=FALSE
+asm1.asm_diskgroups='DATA'#Manual Mount
+asm2.asm_diskgroups='DATA'#Manual Mount
*.asm_power_limit=1
*.diagnostic_dest='D:\app\Administrator'
*.instance_type='asm'
*.large_pool_size=12M
*.remote_login_passwordfile='EXCLUSIVE'
--尝试手动停止个组件:
D:\app\11.2.0\grid\BIN>srvctl.bat status asm
PRCR-1070 : 无法检查 资源 ora.asm 是否已注册
Cannot communicate with crsd
D:\app\11.2.0\grid\BIN>srvctl.bat status asm
PRCR-1070 : 无法检查 资源 ora.asm 是否已注册
Cannot communicate with crsd
D:\app\11.2.0\grid\BIN>crs_stat.exe -t
Errors in file :
ORA-04031: unable to allocate 3896 bytes of shared memory ("shared pool","begin
dbms_diskgroup.read(:h...","sga heap(1,0)","kglsim object batch")
D:\app\11.2.0\grid\BIN>crs_stat.exe -t -n db01
^C
D:\app\11.2.0\grid\BIN>srvctl.bat status asm -n db01
PRCR-1070 : 无法检查 资源 ora.asm 是否已注册
Cannot communicate with crsd
D:\app\11.2.0\grid\BIN>srvctl.bat status nodeapps -n db01
-n 选项已过时。
PRCR-1035 : 无法查找 null 的 CRS 资源 ora.cluster_vip.type
PRCR-1068 : 无法查询资源
Cannot communicate with crsdPRCR-1070 : 无法检查 资源 ora.net1.network 是否已注册
Cannot communicate with crsdPRCR-1070 : 无法检查 资源 ora.gsd 是否已注册
Cannot communicate with crsdPRCR-1070 : 无法检查 资源 ora.ons 是否已注册
Cannot communicate with crsdPRCR-1070 : 无法检查 资源 ora.eons 是否已注册
Cannot communicate with crsd
D:\app\11.2.0\grid\BIN>
D:\app\11.2.0\grid\BIN>srvctl.bat status nodeapps
PRCR-1035 : 无法查找 1 的 CRS 资源 ora.cluster_vip.type
PRCR-1068 : 无法查询资源
Cannot communicate with crsdPRCR-1070 : 无法检查 资源 ora.net1.network 是否已注册
Cannot communicate with crsdPRCR-1070 : 无法检查 资源 ora.gsd 是否已注册
Cannot communicate with crsdPRCR-1070 : 无法检查 资源 ora.ons 是否已注册
Cannot communicate with crsdPRCR-1070 : 无法检查 资源 ora.eons 是否已注册
Cannot communicate with crsd
D:\app\11.2.0\grid\BIN>srvctl.bat status instance
PRKO-2082 : 缺少必需的选项 -d
D:\app\11.2.0\grid\BIN>srvctl.bat status instance -d jchr
PRKO-3132 : 检查数据库 jchr 的实例的状态需要 -i 选项或 -n 选项
D:\app\11.2.0\grid\BIN>srvctl.bat status instance -d jchr -i jchr1
PRCD-1027 : 无法检索数据库 jchr
PRCR-1070 : 无法检查 资源 ora.jchr.db 是否已注册
Cannot communicate with crsd
D:\app\11.2.0\grid\BIN>srvctl.bat stop instance -d jchr -i jchr1
PRCD-1027 : 无法检索数据库 jchr
PRCR-1070 : 无法检查 资源 ora.jchr.db 是否已注册
Cannot communicate with crsd
D:\app\11.2.0\grid\BIN>crsctl check crs
CRS-4638: Oracle High Availability Services 联机
CRS-4535: 无法与集群就绪服务通信
CRS-4529: 集群同步服务联机
CRS-4533: 事件管理器联机
D:\app\11.2.0\grid\BIN>
D:\app\11.2.0\grid\BIN>crs_stat
Errors in file :
ORA-04031: unable to allocate 3896 bytes of shared memory ("shared pool","begin
dbms_diskgroup.read(:h...","sga heap(1,0)","kglsim object batch")
D:\app\11.2.0\grid\BIN>set ORACLE_SID=jchr1
D:\app\11.2.0\grid\BIN>crs_stat
Errors in file :
ORA-04031: unable to allocate 3896 bytes of shared memory ("shared pool","begin
dbms_diskgroup.read(:h...","sga heap(1,0)","kglsim object batch")
D:\app\11.2.0\grid\BIN>
D:\app\11.2.0\grid\BIN>crs_stat -t
Errors in file :
ORA-04031: unable to allocate 3896 bytes of shared memory ("shared pool","begin
dbms_diskgroup.read(:h...","sga heap(1,0)","kglsim object batch")
--查看asm的alert日志
D:\app\Administrator\diag\asm\+asm\+asm1\trace\ alert_+asm1.log
Thu Sep 13 15:17:26 2012
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 Private 10.0.0.0 configured from GPnP Profile for use as a cluster interconnect
Interface type 1 Public 10.139.5.0 configured from GPnP Profile for use as a public interface
Shared memory segment for instance monitoring created
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as D:\app\11.2.0\grid\RDBMS
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options.
Using parameter settings in client-side pfile D:\APP\11.2.0\GRID\DATABASE\INIT+ASM1.ORA on machine DB01
System parameters with non-default values:
large_pool_size = 12M
instance_type = "asm"
remote_login_passwordfile= "EXCLUSIVE"
asm_power_limit = 1
diagnostic_dest = "D:\APP\ADMINISTRATOR"
Cluster communication is configured to use the following interface(s) for this instance
10.10.10.1
cluster interconnect IPC version:Oracle 11 Winsock2 TCP/IP IPC
IPC Vendor 1 proto 1
Version 1.0
Thu Sep 13 15:17:26 2012
PMON started with pid=2, OS id=916
Thu Sep 13 15:17:26 2012
VKTM started with pid=3, OS id=5972 at elevated priority
VKTM running at (10)millisec precision with DBRM quantum (100)ms
--查看对应的trace :+asm1_ora_3912.trc
Memory Utilization of Subpool 1
================================
Allocation Name Size
___________________________ ____________
"free memory " 87682648
"miscellaneous " 0
"KJC dest ctx " 1456
"kfcgx heap " 48376
"kfr group ctx " 736
"ksv work msg " 2272
---同样实例2上面也无法查看:
D:\app\11.2.0\grid\BIN>set ORACLE_SID=+ASM2
D:\app\11.2.0\grid\BIN>sqlplus / as sysdba
SQL*Plus: Release 11.2.0.1.0 Production on 星期一 5月 20 20:11:38 2013
Copyright (c) 1982, 2010, Oracle. All rights reserved.
连接到:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
SQL> select name, round((bytes/1024/1024),0) MB from v$sgastat where pool='shar
d pool' order by bytes desc;
select name, round((bytes/1024/1024),0) MB from v$sgastat where pool='shared po
l' order by bytes desc
*
第 1 行出现错误:
ORA-00604: error occurred at recursive SQL level 2
ORA-04031: unable to allocate 3896 bytes of shared memory ("shared
pool","select inst_id,'',ksmssnam,k...","sga heap(1,0)","kglsim object batch")
3.2. 问题2:方案
1、手动修改ASM实例的shared pool,为其指定SGA和shared_pool_size大小;
2、方法:介于Oracle11G+ ASM(ocr在asm上)时ASM实例无法单独重启,因此参考RAC_ASM_Practic(ASM实例参数修改).txt 里的方法修改参数;
3.3. 参考:
参考1:ASM Instance generates ORA-04031 ("shared pool","select name_kfgrp, number_kf...","sga heap(1,0)","kglsim heap" [ID 1450745.1]
Cause
This was researched in Bug 12566932
1. The over allocation occurred in "kglsim allocations" as you can see in these errors
ORA-04031: unable to allocate 3896 bytes of shared memory ("shared pool","select name_kfgrp, number_kf...","sga heap(1,0)","kglsim object batch")
ORA-04031: unable to allocate 4064 bytes of shared memory ("shared pool","select name_kfgrp, number_kf...","sga heap(1,0)","kglsim heap")
2. You can also see which compoinent is using the memory by issuing this select statement:
set lines 100
set pages 9999
col mb format 999,999,999
select name, round((bytes/1024/1024),0) MB
from v$sgastat where pool='shared pool'
order by bytes
3. There was memory available at the time of the error which indicates the shared pool is fragmented
Solution
In the Bug the following workaround resolved the errors:
1. Set the following parameter _library_cache_advice=false
alter system set "_library_cache_advice"=false scope=spfile;
2. Bounce the database
如下数据库的shared pool各组件及空闲大小,由于asm实例执行即报错,根本无法返回结果;
SET ORACLE_SID=jchr1
SQL> select name, round((bytes/1024/1024),0) MB from v$sgastat where pool='shared pool' order by bytes
NAME MB
-------------------------- ----------
KQR L PO 14
dbwriter coalesce buffer 16
ges resource 18
KCL name table 18
SQLA 19
dbktb: trace buffer 23
ksunfy : SSO free list 26
event statistics per sess 28
gcs shadows 30
ges big msg buffers 30
FileOpenBlock 30
NAME MB
-------------------------- ----------
ges enqueues 31
gcs resources 41
ASH buffers 50
free memory 2914:
已选择961行。
参考2:How to Start (or stop) 10gR2 or 11gR1 Oracle Clusterware Services Manually in Windows [ID 729512.1]
RAC and Oracle Clusterware Best Practices and Starter Kit (Platform Independent) [ID 810394.1]
1. Try starting crs with "crsctl start crs" command from $CRS_HOME\bin directory
OR
2. Start the following services from Windows 'Services' or by using the Windows command 'net start at the
command prompt:
a. If OPMD is present, then start it: Oracle Process Manager Service. In addition to OPMD, start the
OracleClusterVolumeService if you are using Oracle Cluster File System (OCFS) in your environment.
b. If OPMD is not present then start the following services in the order provided:
OracleObjectService
OracleClusterVolumeService (if using OCFS)
OracleCSService
OracleEVMService
OracleCRService
http://www.oracleonlinux.cn/2012/12/how-to-manaual-start-oracle-rac-on-windows/
http://cywxzyh.blog.163.com/blog/static/163857248201144112236781/
参考3:ASM & Shared Pool (ORA-4031) [ID 437924.1]
参考4:其他参考:
http://zhang41082.itpub.net/post/7167/463093
http://www.jb51.net/article/32344.htm
查询了metalink文档,发现果然是Oracle的bug,metalink文档Bug No. 4431215描述了这个问题。如果在ASM上建立的目录名称和SID前缀相同,则这个目录下的所有目录一旦为空,就会被ASM实例自动删除,不管这个目录是ASM自动建立,还是用户手工建立的。
解决这个问题的办法倒是很多,比如将归档目录设置到日志文件的目录中,或者改变主目录的名称使其和ORACLE_SID有所区别。还有更简单的办法,在归档目录下拷贝一个小文件,使得归档被删除后,归档目录不为空,从而避免路径被清除。
Oracle在10.2.0.4和11.1.0.6中fixed了这个bug。因此升级或打PATCH也是解决这个问题的一个方法。
http://blog.chinaunix.net/uid-22948773-id-2600822.html
来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/14710393/viewspace-1100971/,如需转载,请注明出处,否则将追究法律责任。
转载于:http://blog.itpub.net/14710393/viewspace-1100971/