root.sh supports re-running through checkpoint files

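This article walks through a root.sh failure during an Oracle 12c Grid Infrastructure (GRID/GI) cluster installation. Log analysis together with the OCRDUMP and KFED tools traced the failure to a misconfigured shared disk, and correcting the udev configuration file resolved it.
Installing cluster GRID/GI generally involves three stages: first, run OUI/runInstaller and enter the cluster configuration; second, copy and compile the cluster files; finally, run the root.sh script as root to configure and start the cluster. Running root.sh is the most critical stage, and many of the SR cases I have handled were installations that failed at this point. In earlier releases, once the problem was fixed you had to deconfigure the existing configuration before running root.sh again. Starting with 11.2.0.2, root.sh can be re-run: after fixing the problem you run root.sh again directly and it continues from the point where it last failed, much like resuming an interrupted download. This feature was enhanced further in 12c. It works by recording installation-stage information in a checkpoint file and in the OCR: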

 

 

 

11.2 checkpoint file location

 

$ORACLE_BASE/Clusterware/ckptGridHA_${nodename}.xml

 

 

 

12c checkpoint file location

 

$ORACLE_BASE/crsdata/$hostname/crsconfig/ckptGridHA_${nodename}.xml
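To make the two layouts concrete, here is a minimal Python sketch that builds the expected checkpoint path from the templates above. The function name `checkpoint_path`, the release labels "11.2" and "12c", and the default of using the node name as the host name are assumptions made for illustration; only the two path templates come from the text.

```python
from pathlib import PurePosixPath
from typing import Optional


def checkpoint_path(oracle_base: str, nodename: str, release: str,
                    hostname: Optional[str] = None) -> PurePosixPath:
    """Return the expected checkpoint file path for a release family.

    Assumption: when no hostname is given, the node name is used for
    the $hostname directory component of the 12c layout.
    """
    base = PurePosixPath(oracle_base)
    filename = f"ckptGridHA_{nodename}.xml"
    if release == "11.2":
        return base / "Clusterware" / filename
    if release == "12c":
        return base / "crsdata" / (hostname or nodename) / "crsconfig" / filename
    raise ValueError(f"unsupported release label: {release!r}")


if __name__ == "__main__":
    print(checkpoint_path("/u01/gridbase", "rac2", "12c"))
```

Checking whether this file exists on a node is a quick way to see whether root.sh has recorded any progress there.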

 

 

 

The following case shows a root.sh failure while installing 12.1.0.2 cluster GRID/GI.

 

 

 

Case study

 

=========

 

While installing the 12.1.0.2 cluster GRID/GI software on Linux, root.sh failed on node 2 with the following error on screen:

 

 

 

OLR initialization - successful

 

2015/12/15 13:16:55 CLSRSC-507: The root script cannot proceed on this node rac2 because either the first-node operations have not completed on node rac1 or there was an error in obtaining the status of the first-node operations.

 

 

 

This error means node 2 could not confirm that the installation on node 1 had completed. How does root.sh check whether node 1 has finished? The log shows:

 

 

 

$GRID_HOME/cfgtoollogs/crsconfig/rootcrs_rac2_2015-12-18_09-41-53PM.log

 

 

 

2015-12-18 21:42:39: Trying to get the value of key: SYSTEM.rootcrs.checkpoints.firstnode in OCR.

 

2015-12-18 21:42:39: setting ORAASM_UPGRADE to 1

 

2015-12-18 21:42:39: Check the existence of key pair with key name: SYSTEM.rootcrs.checkpoints.firstnode in OCR.

 

2015-12-18 21:42:39: setting ORAASM_UPGRADE to 1

 

2015-12-18 21:42:39: Invoking "/u01/gridsoft/12.1.0/bin/cluutil -exec -keyexists -key checkpoints.firstnode"

 

2015-12-18 21:42:39: trace file=/u01/gridbase/crsdata/rac2/crsconfig/cluutil9.log

 

2015-12-18 21:42:39: Running as user grid: /u01/gridsoft/12.1.0/bin/cluutil -exec -keyexists -key checkpoints.firstnode

 

2015-12-18 21:42:39: s_run_as_user2: Running /bin/su grid -c ' echo CLSRSC_START; /u01/gridsoft/12.1.0/bin/cluutil -exec -keyexists -key checkpoints.firstnode '

 

2015-12-18 21:42:39: Removing file /tmp/filexr1WwO

 

2015-12-18 21:42:39: Successfully removed file: /tmp/filexr1WwO

 

2015-12-18 21:42:39: pipe exit code: 256

 

2015-12-18 21:42:39: /bin/su exited with rc=1

 

 

 

2015-12-18 21:42:39: oracle.ops.mgmt.rawdevice.OCRException: PROC-32: Cluster Ready Services on the local node is not running Messaging error [gipcretConnectionRefused] [29]

 

 

 

2015-12-18 21:42:39: Cannot get OCR key with CLUUTIL, try using OCRDUMP.

 

2015-12-18 21:42:39: Check OCR key using ocrdump

 

2015-12-18 21:42:54: ocrdump output: PROT-302: Failed to initialize ocrdump

 

 

 

2015-12-18 21:42:54: The key pair with keyname: SYSTEM.rootcrs.checkpoints.firstnode does not exist in OCR.
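The log above ends by reporting the key as missing even though neither tool could actually read the OCR. The Python sketch below illustrates that distinction as a reading aid; it is not the actual rootcrs code. `KeyState`, `lookup_first_node_key`, and the two probe callables are hypothetical names, and each probe is assumed to return True or False when the OCR was readable and None when the tool itself failed.

```python
from enum import Enum
from typing import Callable, Optional

Probe = Callable[[], Optional[bool]]


class KeyState(Enum):
    PRESENT = "present"
    ABSENT = "absent"
    UNKNOWN = "unknown"  # no probe managed to read the OCR at all


def lookup_first_node_key(via_cluutil: Probe, via_ocrdump: Probe) -> KeyState:
    """Try the primary lookup first, then the fallback.

    A probe result of None means "the tool failed", which is different
    from False, meaning "the OCR was read and the key is not there".
    """
    for probe in (via_cluutil, via_ocrdump):
        result = probe()
        if result is not None:
            return KeyState.PRESENT if result else KeyState.ABSENT
    return KeyState.UNKNOWN


if __name__ == "__main__":
    # Both tools fail, as in the log above: the honest answer is UNKNOWN.
    print(lookup_first_node_key(lambda: None, lambda: None))
```

Read this way, the "does not exist in OCR" message in this case points to a lookup problem on node 2 rather than to an incomplete installation on node 1.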

 

 

 

This shows that node 2 first ran cluutil -exec -keyexists -key checkpoints.firstnode to look up the OCR key SYSTEM.rootcrs.checkpoints.firstnode. When that failed it fell back to OCRDUMP, which failed as well. The OCRDUMP trace shows why:

 

 

 

$GRID_BASE/diag/crs/<node>/crs/trace/ocrdump_13146.trc

 

 

 

2015-12-18 21:42:48.098879 :  OCRASM: ASM Error Stack : ORA-29701: unable to connect to Cluster Synchronization Service

 

2015-12-18 21:42:48.098885 :  OCRASM: proprasmo: ASM instance is down. Proceed to open the file in dirty mode.

 

  CLWAL: clsw_Initialize: Error [32] from procr_init_ext

 

  CLWAL: clsw_Initialize: Error [PROCL-32: Oracle High Availability Services on the local node is not running Messaging error [gipcretConnectionRefused] [29]] from procr_init_ext

 

2015-12-18 21:42:48.101773 :    GPNP: clsgpnpkww_initclswcx: [at clsgpnpkww.c:351] Result: (56) CLSGPNP_OCR_INIT. (:GPNP01201: )Failed to init CLSW-OLR context. CLSW Error (3): CLSW-3: Error in the cluster registry (OCR) layer. [32] [PROCL-32: Oracle High Availability Services on the local node is not running Messaging error [gipcretConnectionRefused] [29]]

 

2015-12-18 21:42:48.112746 :  OCRASM: proprasmo: Error [13] in opening the GPNP profile. Try to get offline profile

 

2015-12-18 21:42:48.220769 :  OCRRAW: kgfo_kge2slos error stack at kgfolclcpi1: AMDU-00210: No disks found in diskgroup OCR_VOTING

 

 

 

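The ORA-29701 (cannot connect to CSS) and PROCL-32 (OHASD not running) errors are expected here because the cluster stack on node 2 has not been started; they can distract from the real problem. The key error is AMDU-00210: No disks found in diskgroup OCR_VOTING. Node 2 could not find the ASM disk, so OCRDUMP failed and the status of the node 1 installation could not be confirmed. The next step is to run kfed on both nodes to check the ASM disk.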
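The triage described above can be sketched in Python: pull the error codes out of the trace lines and set aside the ones that are expected while the stack is down. The set `EXPECTED_WHEN_STACK_DOWN` contains only the two codes named in this case, and the regular expression is an assumed approximation of the `PREFIX-number` style seen in the excerpt, not an official message grammar.

```python
import re

# Codes this article treats as expected noise while the stack is down.
EXPECTED_WHEN_STACK_DOWN = {"ORA-29701", "PROCL-32"}

# Assumed pattern: 3-6 capital letters, a hyphen, 2-5 digits.
ERROR_CODE = re.compile(r"\b([A-Z]{3,6}-\d{2,5})\b")


def triage(trace_lines):
    """Split the error codes found in a trace into (noise, signal)."""
    noise, signal = [], []
    for line in trace_lines:
        for code in ERROR_CODE.findall(line):
            bucket = noise if code in EXPECTED_WHEN_STACK_DOWN else signal
            if code not in bucket:  # keep first occurrence only
                bucket.append(code)
    return noise, signal


SAMPLE = [
    "2015-12-18 21:42:48.098879 :  OCRASM: ASM Error Stack : ORA-29701: unable to connect to Cluster Synchronization Service",
    "  CLWAL: clsw_Initialize: Error [PROCL-32: Oracle High Availability Services on the local node is not running Messaging error [gipcretConnectionRefused] [29]] from procr_init_ext",
    "2015-12-18 21:42:48.220769 :  OCRRAW: kgfo_kge2slos error stack at kgfolclcpi1: AMDU-00210: No disks found in diskgroup OCR_VOTING",
]

if __name__ == "__main__":
    noise, signal = triage(SAMPLE)
    print("expected noise:", noise)
    print("worth investigating:", signal)
```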

 

 

 

Node 1: read disk /dev/raw/raw1

 

$ /u01/gridsoft/12.1.0/bin/kfed read /dev/raw/raw1

 

kfbh.endian:                          1 ; 0x000: 0x01

 

kfbh.hard:                          130 ; 0x001: 0x82

 

kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD <=========raw1 is of type KFBTYP_DISKHEAD, i.e. a valid ASM disk

 

kfbh.datfmt:                          1 ; 0x003: 0x01

 

kfbh.block.blk:                       0 ; 0x004: blk=0

 

kfbh.block.obj:              2147483648 ; 0x008: disk=0

 

kfbh.check:                   420965027 ; 0x00c: 0x19176aa3

 

kfbh.fcn.base:                        0 ; 0x010: 0x00000000

 

kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000

 

 

 

...

 

 

 

kfdhdb.vfstart:                     128 ; 0x0ec: 0x00000080  <=========the vfstart value shows this disk holds a voting file

 

kfdhdb.vfend:                       160 ; 0x0f0: 0x000000a0  <=========the vfend value shows this disk holds a voting file

 

 

 

Node 2: read disk /dev/raw/raw1

 

$ /u01/gridsoft/12.1.0/bin/kfed read /dev/raw/raw1

 

kfbh.endian:                          0 ; 0x000: 0x00

 

kfbh.hard:                            0 ; 0x001: 0x00

 

kfbh.type:                            0 ; 0x002: KFBTYP_INVALID <=========on node 2, raw1 shows the invalid type KFBTYP_INVALID

 

kfbh.datfmt:                          0 ; 0x003: 0x00

 

kfbh.block.blk:                       0 ; 0x004: blk=0

 

kfbh.block.obj:                       0 ; 0x008: file=0

 

kfbh.check:                           0 ; 0x00c: 0x00000000

 

kfbh.fcn.base:                        0 ; 0x010: 0x00000000

 

kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000

 

kfbh.spare1:                          0 ; 0x018: 0x00000000

 

kfbh.spare2:                          0 ; 0x01c: 0x00000000

 

000000000 00000000 00000000 00000000 00000000  [................]

 

 Repeat 255 times

 

KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]

 

 

 

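On node 1, /dev/raw/raw1 shows disk type KFBTYP_DISKHEAD and kfdhdb.vfstart has a value, so on node 1 raw1 is a valid ASM disk and also a voting disk. Node 2, reading what should be the same disk, shows completely different content. A correctly configured shared device raw1 should look identical from node 1 and node 2; since the two nodes see different content in this case, the shared disk configuration is wrong.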
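The comparison above was done by eye. As a rough Python sketch, the saved `kfed read` output from each node can be parsed into fields and compared. `parse_kfed` and `differing_fields` are hypothetical helpers, and the parser assumes the `name: value ; offset: decoded` line layout seen in the excerpts; the sample strings are shortened copies of those excerpts.

```python
def parse_kfed(output):
    """Map each 'struct.field' name to a (raw_value, decoded_value) pair."""
    fields = {}
    for line in output.splitlines():
        name, sep, rest = line.partition(":")
        if not sep or "." not in name:
            continue  # skip hex dumps, KFED-nnnnn messages, blank lines
        raw, _, decoded = rest.partition(";")
        # decoded looks like " 0x002: KFBTYP_DISKHEAD"; keep the part after ':'
        fields[name.strip()] = (raw.strip(), decoded.split(":", 1)[-1].strip())
    return fields


def differing_fields(node_a, node_b):
    """Return the sorted field names whose values differ between two nodes."""
    return sorted(k for k in node_a.keys() & node_b.keys() if node_a[k] != node_b[k])


NODE1_SAMPLE = """\
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfbh.datfmt:                          1 ; 0x003: 0x01
"""

NODE2_SAMPLE = """\
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt:                          0 ; 0x003: 0x00
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]
"""

if __name__ == "__main__":
    node1, node2 = parse_kfed(NODE1_SAMPLE), parse_kfed(NODE2_SAMPLE)
    for field in differing_fields(node1, node2):
        print(field, node1[field], "!=", node2[field])
```

Any difference in the header of a device that is supposed to be shared suggests the two nodes are not reading the same underlying storage.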

 

 

 

In addition, running OCRDUMP manually on node 1 confirms that the key SYSTEM.rootcrs.checkpoints.firstnode exists and that its status is "SUCCESS":

 

su - root

 

ocrdump /tmp/ocrdump1.out

 

more /tmp/ocrdump1.out

 

 

 

[SYSTEM.rootcrs.checkpoints.firstnode]

 

ORATEXT : SUCCESS 
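Instead of paging through the dump with more, the value can be pulled out with a short script. This Python sketch assumes only the layout shown above, a bracketed key name followed by an `ORATEXT : value` line; `ocr_text_value` is a hypothetical helper, not part of any Oracle tooling.

```python
def ocr_text_value(dump_text, key):
    """Return the ORATEXT value recorded under [key], or None if not found."""
    in_section = False
    for raw in dump_text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            # A new bracketed header starts; track whether it is our key.
            in_section = line[1:-1] == key
        elif in_section and line.startswith("ORATEXT"):
            return line.partition(":")[2].strip()
    return None


SAMPLE_DUMP = """\
[SYSTEM.rootcrs.checkpoints.firstnode]

ORATEXT : SUCCESS
"""

if __name__ == "__main__":
    print(ocr_text_value(SAMPLE_DUMP, "SYSTEM.rootcrs.checkpoints.firstnode"))
```

For the dump taken in this case it would be applied to the contents of /tmp/ocrdump1.out.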

 

 

 

Finally, the problem was resolved by correcting the udev configuration file (/etc/udev/rules.d/99-oracle-asmdevices.rules).

Reposted from: https://www.cnblogs.com/liang545621/p/9418070.html
