Windows下RAC归档满和ORA-04031问题-优快云博客

Windows下RAC归档满和ORA-04031问题

(V 1.0)

2012年8月

版本信息

日期	版本	描述	作者/修改人	备注
2012-8-31	1.0	创建	Jusin Hao

目录

2. 问题1：归档满现象及环境信息：... 4

2.1. 问题1：分析：... 5

2.2. 问题1：方案：... 11

3. 问题2：ASM INSTANCE 操作报ORA-04031现象... 12

3.1. 问题2：分析... 12

3.2. 问题2：方案... 17

3.3. 参考：... 17

2. 问题1：归档满现象及环境信息：

# localhost name resolution is handled within DNS itself.

# 127.0.0.1 localhost

# ::1 localhost

10.139.5.13 db01

10.139.5.14 db02

10.139.5.15 db01-vip

10.139.5.16 db02-vip

10.139.5.17 scan

10.10.10.1 db01-pri

10.10.10.2 db02-pri

10.139.5.20 dbbackup

10.139.5.21 appbackup

SQL> select inst_id,instance_name from gv$instance;

INST_ID INSTANCE_NAME

---------- ----------------

1 jchr1

2 jchr2

SQL> show parameter db_recovery

NAME TYPE VALUE

------------------------------------ ----------- -----------------

db_recovery_file_dest string

db_recovery_file_dest_size big integer 20G

SQL> archive log list

数据库日志模式存档模式

自动存档启用

存档终点 +DATA

最早的联机日志序列 31075

下一个存档日志序列 31076

当前日志序列 31076

SQL>

SQL> select flashback_on from v$database;

FLASHBACK_ON

------------------

脚本delete.bat：

rman target / nocatalog CMDFILE 'D:\d-arch\rmanbackup.txt' LOG 'D:\d-arch\rman_backup_%DATE:~0,4%%DATE:~5,2%%DATE:~8,2%".log'

脚本D:\d-arch\rmanbackup.txt

delete noprompt archivelog ALL COMPLETED BEFORE "sysdate - 6";

查看脚本日志发现脚本是执行成功的;

但是客户经常报归档空间满，由上面参数，我们知道其归档日志是存放在+DATA卷组中的（ASM卷组）

2.1. 问题1：分析：

1、常规的：我们可以手动备份后清除日志：

1）backup format '/install_source/rman_bak/arch_%d_%U' archivelog all delete input; --删除已经备份过的归档

2）delete noprompt archivelog ALL COMPLETED BEFORE "sysdate - 6";

执行上面这两个命令会发现发现

3）delete noprompt archivelog all; --这会清除所有归档日志，包括刚新生成的（慎用）

2、查看ASM卷组状态

C:\>asmcmd

ASMCMD>

ASMCMD> ls

DATA/

OCR/

ASMCMD> cd data

ASMCMD> ls

JCHR/

ASMCMD> cd jchr

ASMCMD> ls

ARCHIVELOG/

CONTROLFILE/

DATAFILE/

ONLINELOG/

PARAMETERFILE/

STANDBYLOG/

TEMPFILE/

spfilejchr.ora

spfilejchr2.ora

ASMCMD> lsdg

State Type Rebal Sector Block AU Total_MB Free_MB Req_mir_free_MB Usable_file_MB Offline_disks Voting_files N

ame

MOUNTED EXTERN N 512 4096 1048576 819197 48329 0 48329 0 N D

ATA/

MOUNTED EXTERN N 512 4096 1048576 10237 9841 0 9841 0 N O

CR/

通过上面的信息，我们可以发现+DATA\jchr\archivelog\下存放的是数据库的归档文件，但是存在很多2012年和2013年前几个月的的归档日志文件，说明以前的归档文件并没有从ASM卷组上物理删除；同时，有这些归档日志长期占用着物理空间，整个ASM卷组的磁盘空间只剩下40多个G（48329M），因此后续的归档日志很容易超出整个空间，导致无法归档。为什么会这样呢？
Oracle控制文件以及Oracle RMAN的的备份恢复的原理，Oracle 控制文件里边记录了数据库的名字，id,创建的时间戳，以及归档记录和备份信息。
Oracle RMAN的备份恢复的所有信息都依赖于：要么是控制文件，要么是恢复目录(catalog)，因为所有的备份与恢复信息都会依据备份是的方式存储到这两个位置。
理所当然的是，对这两个东东里的备份集、镜像副本、归档日志等等所有能备份的对象的任意操作，首先会参这些对象的记录的信息，其次是当被记录的对象发生变化时做相应的更新。

--查看已经归档日志的统计：

select name,status,count(*) from gv$archived_log group by name,status order by 1 desc;

从上面的查询可知，当前的两个节点其归档日志只有2个，其余的17630个其NAME都是NULL值。
看看关于视图v$archived_log中NAME列的解释：
Archived log file name. If set to NULL, either the log file was cleared before it was archived or an RMAN backup command with the "delete input" option was executed to back up archivelog all (RMAN> backup archivelog all delete input;).
上面的这段话表明当前的这些日志文件要么被手动清除，要么被rman的delete input选项清除。
其次status列的D字段也表明了这些个名字为空的归档日志已经被Deleted.也就是说有17630个归档日志已经被删除了。

--如下视图查询，我们可以看到归档日志状态为D且name为null的，两个实例上都是8815个：

select inst_id,name,count(*) from gv$archived_log group by inst_id,name order by 1 desc;

--查看控制文件已归档记录的相关信息：

SQL> select * from gv$controlfile_record_section where type='ARCHIVED LOG';

INST_ID TYPE RECORD_SIZE RECORDS_TOTAL RECORDS_USED FIRST_INDEX LAST_INDEX LAST_RECID

----------- ---------- ----------------------------------------------------------------------------

2 ARCHIVED LOG 584 10952 10710 8822 8579 52415

1 ARCHIVED LOG 584 10952 10710 8822 8579 52415

SQL> select count (*) from v$archived_log;

COUNT(*)

----------

10710

RECORDS_TOTAL：Number of records allocated for the section

列RECORDS_TOTAL表明为当前TYPE分配的可存储的总数，在两个instance上都为10952条
从最近一次切换日志的查询结果可知，被删除的有8815条，如果下次日志切换再增加一条往哪里放呢？

那些已经超出缺省保留期的归档日志被覆盖（在控制文件中的记录），即被重用，用户在控制文件中保存ARCHIVED LOG部分的保留时间由谁来决定呢，参数control_file_record_keep_time，缺省为7天 ，这意味着7天前的归档日志和备份信息可能在控制文件中已经不存在了

SQL> show parameter control

NAME TYPE VALUE

------------------------------------ ----------- ------------------------------

control_file_record_keep_time integer 7

control_files string +DATA/jchr/controlfile/current

.261.793900303, +DATA/jchr/con

trolfile/current.260.793900303

control_management_pack_access string DIAGNOSTIC+TUNING

# 下面的查询正好表明为什么2012_10_23和之前的日志为什么没有被删除
# 因为20130505 12:00:04之后的归档日志记录在控制文件中已经被覆盖了，所以使用delete archivelog all时是根本无法清除之前的日志的，即对于rman下的delete archivelog all方式不会删除控制文件中对应的归档日志信息（在控制文件中设置delete状态的归档日志）

SQL> alter session set nls_date_format='yyyymmdd hh24:mi:ss';

会话已更改。

SQL> select min (FIRST_TIME), min (COMPLETION_TIME), max (FIRST_TIME), max (COMP

LETION_TIME) from v$archived_log;

MIN(FIRST_TIME) MIN(COMPLETION_TI MAX(FIRST_TIME) MAX(COMPLETION_TI

----------------- ----------------- ----------------- -----------------

20130505 10:06:16 20130505 12:00:04 20130519 22:18:57 20130519 22:24:57

SQL> select min (FIRST_TIME), min (COMPLETION_TIME), max (FIRST_TIME), max (COMP

LETION_TIME) from gv$archived_log;

MIN(FIRST_TIME) MIN(COMPLETION_TI MAX(FIRST_TIME) MAX(COMPLETION_TI

----------------- ----------------- ----------------- -----------------

20130505 10:06:16 20130505 12:00:04 20130519 22:18:57 20130519 22:24:57

2.2. 问题1：方案：

如下手工依次删除之前废弃的归档日志即可：

ASMCMD> rm -r 2012*

You may delete multiple files and/or directories.

Are you sure? (y/n) y

3. 问题2：ASM INSTANCE 操作报ORA-04031现象

在命令体是否下执行select、crsctl、crs_stat等命令会报错，如下:

1） ORA-04031: unable to allocate 3896 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","kglsim object batch") (DBD ERROR: error possibly near indicator at char 44 in '/* ASMCMD */ select name, parent_index from v$asm_alias where reference_index=33582893')

2） ORA-04031: unable to allocate 4064 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","kglsim heap") (DBD ERROR: OCIStmtExecute)

3） SQL> show parameter share

ORA-00604: error occurred at recursive SQL level 2

ORA-04031: unable to allocate 4064 bytes of shared memory ("sharedpool","select x.inst_id,x.indx+1,ks...","sga heap(1,0)","kglsim heap")

3.1. 问题2：分析

--无法查看ASM的参数文件路径：

SQL> show parameter spfile

ORA-00604: error occurred at recursive SQL level 2

ORA-04031: unable to allocate 4064 bytes of shared memory ("shared

pool","select x.inst_id,x.indx+1,ks...","sga heap(1,0)","kglsim heap")

--备份参数文件

SQL> create pfile='d:\initasm1.ora' from spfile;

File created.

--参数文件内容：

*._library_cache_advice=FALSE

+asm1.asm_diskgroups='DATA'#Manual Mount

+asm2.asm_diskgroups='DATA'#Manual Mount

*.asm_power_limit=1

*.diagnostic_dest='D:\app\Administrator'

*.instance_type='asm'

*.large_pool_size=12M

*.remote_login_passwordfile='EXCLUSIVE'

--尝试手动停止个组件：

D:\app\11.2.0\grid\BIN>srvctl.bat status asm

PRCR-1070 : 无法检查资源 ora.asm 是否已注册

Cannot communicate with crsd

D:\app\11.2.0\grid\BIN>srvctl.bat status asm

PRCR-1070 : 无法检查资源 ora.asm 是否已注册

Cannot communicate with crsd

D:\app\11.2.0\grid\BIN>crs_stat.exe -t

Errors in file :

ORA-04031: unable to allocate 3896 bytes of shared memory ("shared pool","begin

dbms_diskgroup.read(:h...","sga heap(1,0)","kglsim object batch")

D:\app\11.2.0\grid\BIN>crs_stat.exe -t -n db01

D:\app\11.2.0\grid\BIN>srvctl.bat status asm -n db01

PRCR-1070 : 无法检查资源 ora.asm 是否已注册

Cannot communicate with crsd

D:\app\11.2.0\grid\BIN>srvctl.bat status nodeapps -n db01

-n 选项已过时。

PRCR-1035 : 无法查找 null 的 CRS 资源 ora.cluster_vip.type

PRCR-1068 : 无法查询资源

Cannot communicate with crsdPRCR-1070 : 无法检查资源 ora.net1.network 是否已注册

Cannot communicate with crsdPRCR-1070 : 无法检查资源 ora.gsd 是否已注册

Cannot communicate with crsdPRCR-1070 : 无法检查资源 ora.ons 是否已注册

Cannot communicate with crsdPRCR-1070 : 无法检查资源 ora.eons 是否已注册

Cannot communicate with crsd

D:\app\11.2.0\grid\BIN>

D:\app\11.2.0\grid\BIN>srvctl.bat status nodeapps

PRCR-1035 : 无法查找 1 的 CRS 资源 ora.cluster_vip.type

PRCR-1068 : 无法查询资源

Cannot communicate with crsdPRCR-1070 : 无法检查资源 ora.net1.network 是否已注册

Cannot communicate with crsdPRCR-1070 : 无法检查资源 ora.gsd 是否已注册

Cannot communicate with crsdPRCR-1070 : 无法检查资源 ora.ons 是否已注册

Cannot communicate with crsdPRCR-1070 : 无法检查资源 ora.eons 是否已注册

Cannot communicate with crsd

D:\app\11.2.0\grid\BIN>srvctl.bat status instance

PRKO-2082 : 缺少必需的选项 -d

D:\app\11.2.0\grid\BIN>srvctl.bat status instance -d jchr

PRKO-3132 : 检查数据库 jchr 的实例的状态需要 -i 选项或 -n 选项

D:\app\11.2.0\grid\BIN>srvctl.bat status instance -d jchr -i jchr1

PRCD-1027 : 无法检索数据库 jchr

PRCR-1070 : 无法检查资源 ora.jchr.db 是否已注册

Cannot communicate with crsd

D:\app\11.2.0\grid\BIN>srvctl.bat stop instance -d jchr -i jchr1

PRCD-1027 : 无法检索数据库 jchr

PRCR-1070 : 无法检查资源 ora.jchr.db 是否已注册

Cannot communicate with crsd

D:\app\11.2.0\grid\BIN>crsctl check crs

CRS-4638: Oracle High Availability Services 联机

CRS-4535: 无法与集群就绪服务通信

CRS-4529: 集群同步服务联机

CRS-4533: 事件管理器联机

D:\app\11.2.0\grid\BIN>

D:\app\11.2.0\grid\BIN>crs_stat

Errors in file :

ORA-04031: unable to allocate 3896 bytes of shared memory ("shared pool","begin

dbms_diskgroup.read(:h...","sga heap(1,0)","kglsim object batch")

D:\app\11.2.0\grid\BIN>set ORACLE_SID=jchr1

D:\app\11.2.0\grid\BIN>crs_stat

Errors in file :

ORA-04031: unable to allocate 3896 bytes of shared memory ("shared pool","begin

dbms_diskgroup.read(:h...","sga heap(1,0)","kglsim object batch")

D:\app\11.2.0\grid\BIN>

D:\app\11.2.0\grid\BIN>crs_stat -t

Errors in file :

ORA-04031: unable to allocate 3896 bytes of shared memory ("shared pool","begin

dbms_diskgroup.read(:h...","sga heap(1,0)","kglsim object batch")

--查看asm的alert日志

D:\app\Administrator\diag\asm\+asm\+asm1\trace\ alert_+asm1.log

Thu Sep 13 15:17:26 2012

Starting ORACLE instance (normal)

LICENSE_MAX_SESSION = 0

LICENSE_SESSIONS_WARNING = 0

Interface type 1 Private 10.0.0.0 configured from GPnP Profile for use as a cluster interconnect

Interface type 1 Public 10.139.5.0 configured from GPnP Profile for use as a public interface

Shared memory segment for instance monitoring created

Picked latch-free SCN scheme 3

Using LOG_ARCHIVE_DEST_1 parameter default value as D:\app\11.2.0\grid\RDBMS

Autotune of undo retention is turned on.

LICENSE_MAX_USERS = 0

SYS auditing is disabled

Starting up:

Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production

With the Real Application Clusters and Automatic Storage Management options.

Using parameter settings in client-side pfile D:\APP\11.2.0\GRID\DATABASE\INIT+ASM1.ORA on machine DB01

System parameters with non-default values:

large_pool_size = 12M

instance_type = "asm"

remote_login_passwordfile= "EXCLUSIVE"

asm_power_limit = 1

diagnostic_dest = "D:\APP\ADMINISTRATOR"

Cluster communication is configured to use the following interface(s) for this instance

10.10.10.1

cluster interconnect IPC version:Oracle 11 Winsock2 TCP/IP IPC

IPC Vendor 1 proto 1

Version 1.0

Thu Sep 13 15:17:26 2012

PMON started with pid=2, OS id=916

Thu Sep 13 15:17:26 2012

VKTM started with pid=3, OS id=5972 at elevated priority

VKTM running at (10)millisec precision with DBRM quantum (100)ms

--查看对应的trace ：+asm1_ora_3912.trc

Memory Utilization of Subpool 1

================================

Allocation Name Size

___________________________ ____________

"free memory " 87682648

"miscellaneous " 0

"KJC dest ctx " 1456

"kfcgx heap " 48376

"kfr group ctx " 736

"ksv work msg " 2272

---同样实例2上面也无法查看：

D:\app\11.2.0\grid\BIN>set ORACLE_SID=+ASM2

D:\app\11.2.0\grid\BIN>sqlplus / as sysdba

SQL*Plus: Release 11.2.0.1.0 Production on 星期一 5月 20 20:11:38 2013

连接到:

Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production

With the Real Application Clusters and Automatic Storage Management options

SQL> select name, round((bytes/1024/1024),0) MB from v$sgastat where pool='shar

d pool' order by bytes desc;

select name, round((bytes/1024/1024),0) MB from v$sgastat where pool='shared po

l' order by bytes desc

第 1 行出现错误:

ORA-00604: error occurred at recursive SQL level 2

ORA-04031: unable to allocate 3896 bytes of shared memory ("shared

pool","select inst_id,'',ksmssnam,k...","sga heap(1,0)","kglsim object batch")

3.2. 问题2：方案

1、手动修改ASM实例的shared pool，为其指定SGA和shared_pool_size大小；

2、方法：介于Oracle11G+ ASM（ocr在asm上）时ASM实例无法单独重启，因此参考RAC_ASM_Practic(ASM实例参数修改).txt 里的方法修改参数；

3.3. 参考：

参考1：ASM Instance generates ORA-04031 ("shared pool","select name_kfgrp, number_kf...","sga heap(1,0)","kglsim heap" [ID 1450745.1]

Cause

This was researched in Bug 12566932

1. The over allocation occurred in "kglsim allocations" as you can see in these errors

ORA-04031: unable to allocate 3896 bytes of shared memory ("shared pool","select name_kfgrp, number_kf...","sga heap(1,0)","kglsim object batch")

ORA-04031: unable to allocate 4064 bytes of shared memory ("shared pool","select name_kfgrp, number_kf...","sga heap(1,0)","kglsim heap")

2. You can also see which compoinent is using the memory by issuing this select statement:

set lines 100
set pages 9999
col mb format 999,999,999
select name, round((bytes/1024/1024),0) MB
from v$sgastat where pool='shared pool'
order by bytes

3. There was memory available at the time of the error which indicates the shared pool is fragmented

Solution

In the Bug the following workaround resolved the errors:

1. Set the following parameter _library_cache_advice=false

alter system set "_library_cache_advice"=false scope=spfile;

2. Bounce the database

如下数据库的shared pool各组件及空闲大小，由于asm实例执行即报错，根本无法返回结果；

SET ORACLE_SID=jchr1

SQL> select name, round((bytes/1024/1024),0) MB from v$sgastat where pool='shared pool' order by bytes

NAME MB

-------------------------- ----------

KQR L PO 14

dbwriter coalesce buffer 16

ges resource 18

KCL name table 18

SQLA 19

dbktb: trace buffer 23

ksunfy : SSO free list 26

event statistics per sess 28

gcs shadows 30

ges big msg buffers 30

FileOpenBlock 30

NAME MB

-------------------------- ----------

ges enqueues 31

gcs resources 41

ASH buffers 50

free memory 2914:

已选择961行。

参考2：How to Start (or stop) 10gR2 or 11gR1 Oracle Clusterware Services Manually in Windows [ID 729512.1]
RAC and Oracle Clusterware Best Practices and Starter Kit (Platform Independent) [ID 810394.1]

1. Try starting crs with "crsctl start crs" command from $CRS_HOME\bin directory

2. Start the following services from Windows 'Services' or by using the Windows command 'net start at the
        command prompt:
            a. If OPMD is present, then start it: Oracle Process Manager Service. In addition to OPMD, start the
                OracleClusterVolumeService if you are using Oracle Cluster File System (OCFS) in your environment.
             b. If OPMD is not present then start the following services in the order provided:
                    OracleObjectService
                    OracleClusterVolumeService (if using OCFS)
                    OracleCSService
                    OracleEVMService
                    OracleCRService

http://www.oracleonlinux.cn/2012/12/how-to-manaual-start-oracle-rac-on-windows/

http://cywxzyh.blog.163.com/blog/static/163857248201144112236781/

参考3：ASM & Shared Pool (ORA-4031) [ID 437924.1]

参考4：其他参考：

http://zhang41082.itpub.net/post/7167/463093

http://www.jb51.net/article/32344.htm

查询了metalink文档，发现果然是Oracle的bug，metalink文档Bug No. 4431215描述了这个问题。如果在ASM上建立的目录名称和SID前缀相同，则这个目录下的所有目录一旦为空，就会被ASM实例自动删除，不管这个目录是ASM自动建立，还是用户手工建立的。

解决这个问题的办法倒是很多，比如将归档目录设置到日志文件的目录中，或者改变主目录的名称使其和ORACLE_SID有所区别。还有更简单的办法，在归档目录下拷贝一个小文件，使得归档被删除后，归档目录不为空，从而避免路径被清除。

Oracle在10.2.0.4和11.1.0.6中fixed了这个bug。因此升级或打PATCH也是解决这个问题的一个方法。

http://blog.chinaunix.net/uid-22948773-id-2600822.html

来自 “ ITPUB博客 ” ，链接：http://blog.itpub.net/14710393/viewspace-1100971/，如需转载，请注明出处，否则将追究法律责任。

转载于:http://blog.itpub.net/14710393/viewspace-1100971/