WARNING: Waited 15 secs for write IO to PST disk 4 in group 3 in alert_asm.log

Summary: this post works through an ASM disk group access timeout on a 2-node cluster running AIX 7.1. By understanding the ASM PST heartbeat timeout mechanism and the interaction between the AIX OS parameter rw_timeout and the ASM hidden parameter _asm_hbeatiowait, normal access from Node2 was restored.


Yesterday a colleague ran into an OCR access timeout on a 2-node RAC cluster running AIX 7.1: the ASM disk group holding the OCR could not be accessed normally from Node2. Node2's alert_asm.log showed the following:

Reference: ASM diskgroup dismount with "Waited 15 secs for write IO to PST" (Doc ID 1581684.1)

Thu Aug 21 17:24:06 2014
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 3 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 4 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 3 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 4 in group 3.
Thu Aug 21 17:24:06 2014
NOTE: process _b000_+asm1 (24903780) initiating offline of disk 0.2095165706 (GRID_0000) with mask 0x7e in group 3
NOTE: process _b000_+asm1 (24903780) initiating offline of disk 1.2095165707 (GRID_0001) with mask 0x7e in group 3
NOTE: process _b000_+asm1 (24903780) initiating offline of disk 2.2095165708 (GRID_0002) with mask 0x7e in group 3
NOTE: process _b000_+asm1 (24903780) initiating offline of disk 3.2095165709 (GRID_0003) with mask 0x7e in group 3
NOTE: process _b000_+asm1 (24903780) initiating offline of disk 4.2095165710 (GRID_0004) with mask 0x7e in group 3
NOTE: checking PST: grp = 3
GMON checking disk modes for group 3 at 10 for pid 35, osid 24903780
ERROR: no read quorum in group: required 3, found 0 disks
NOTE: checking PST for grp 3 done.

....

NOTE: initiating PST update: grp = 3, dsk = 0/0x7ce1b10a, mask = 0x6a, op = clear

 

Node2's crs.log showed the following:

2014-08-21 20:36:04.495: [  OCRRAW][9264]proprior: Retrying buffer read from another mirror for disk group [+GRID] for block at offset [6909952]
2014-08-21 20:36:04.495: [  OCRASM][9264]proprasmres: Total 0 mirrors detected
2014-08-21 20:36:04.495: [  OCRASM][9264]proprasmres: Only 1 mirror found in this disk group.
2014-08-21 20:36:04.495: [  OCRASM][9264]proprasmres: Need to invoke checkdg. Mirror #0 has an invalid buffer.
2014-08-21 20:36:04.595: [  OCRASM][9264]proprasmres: kgfoControl returned error [8]
[  OCRASM][9264]SLOS : SLOS: cat=8, opn=kgfoCkDG01, dep=15032, loc=kgfokge

2014-08-21 20:36:04.595: [  OCRASM][9264]ASM Error Stack : ORA-27091: unable to queue I/O
ORA-15078: ASM diskgroup was forcibly dismounted
ORA-06512: at line 4

2014-08-21 20:36:04.595: [  OCRRAW][9264]proprior: ASM re silver returned [22]
2014-08-21 20:36:04.597: [  OCRRAW][9264]fkce:2: problem [22] reading the tnode 6909952

 

Putting the two logs together, we can see that access to the disk group holding the OCR timed out. The MOS note referenced above explains that on AIX this can be caused by a mismatch between the OS-level storage I/O timeout (rw_timeout) and ASM's own I/O timeout for storage access requests:

Symptoms
Normal or high redundancy diskgroup is dismounted with these WARNING messages.

//ASM alert.log

Mon Jul 01 09:10:47 2013
WARNING: Waited 15 secs for write IO to PST disk 1 in group 6.
WARNING: Waited 15 secs for write IO to PST disk 4 in group 6.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 6.
WARNING: Waited 15 secs for write IO to PST disk 4 in group 6.
....
GMON dismounting group 6 at 72 for pid 44, osid 8782162

Cause
Generally this kind messages comes in ASM alertlog file on below situations,
Delayed ASM PST heart beats on ASM disks in normal or high redundancy diskgroup,
thus the ASM instance dismount the diskgroup. By default, it is 15 seconds.
By the way the heart beat delays are sort of ignored for external redundancy diskgroup.
ASM instance stop issuing more PST heart beat until it succeeds PST revalidation,
but the heart beat delays do not dismount external redundancy diskgroup directly.

The ASM disk could go into unresponsiveness, normally in the following scenarios:

+    Some of the paths of the physical paths of the multipath device are offline or lost
+    During path 'failover' in a multipath set up
+    Server load, or any sort of storage/multipath/OS maintenance
The Doc ID 10109915.8 briefs about Bug 10109915 (this fix introduce this underscore parameter). And the issue is with no OS/Storage tunable timeout mechanism in a case of a Hung NFS Server/Filer. And then _asm_hbeatiowait helps in setting the time out.

Solution
1]    Check with OS and Storage admin that there is disk unresponsiveness.

2]    Possibly keep the disk responsiveness to below 15 seconds.

This will depend on various factors like
+    Operating System
+    Presence of Multipath ( and Multipath Type )
+    Any kernel parameter

So you need to find out, what is the 'maximum' possible disk unresponsiveness for your set up.
For example, on AIX  rw_timeout  setting affects this and defaults to 30 seconds.
Another example is Linux with native multipathing. In such set up, number of physical paths and  polling_interval value in multipath.conf file, will dictate this maximum disk unresponsiveness.
So for your set up ( combination of OS / multipath / storage ), you need to find out this.

3] If you can not keep the disk unresponsiveness to below 15 seconds, then the below parameter can be set in the ASM instance ( on all the Nodes of RAC ):

 _asm_hbeatiowait
Set it to 200.

Run below in asm instance to set desired value for _asm_hbeatiowait

alter system set "_asm_hbeatiowait"=<value> scope=spfile sid='*';
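
For reference, the value actually in effect can be checked before and after the change by querying the X$ views in the ASM instance. A minimal sketch, run on one node as the Grid Infrastructure owner with the ASM environment set (the quoted heredoc keeps the shell from expanding the $ in x$ksppi):

sqlplus -s / as sysasm <<'EOF'
-- current value of _asm_hbeatiowait and whether it is still the default
col name  format a20
col value format a12
select a.ksppinm  name,
       b.ksppstvl value,
       b.ksppstdf is_default
  from x$ksppi a, x$ksppcv b
 where a.indx = b.indx
   and a.ksppinm = '_asm_hbeatiowait';
EOF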

 

on AIX  rw_timeout  setting affects this and defaults to 30 seconds.

That is, on AIX the storage I/O request timeout (rw_timeout) defaults to 30 seconds.
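
To see what a given LUN is actually using, the disk attributes can be listed with lsattr; a quick sketch (hdisk4 here is only an example device name, substitute the hdisks backing the GRID disk group):

# show the I/O timeout attribute on one of the ASM disks
lsattr -El hdisk4 | grep rw_timeout

# if the value ever needs to be changed, chdev is the usual way, for example:
#   chdev -l hdisk4 -a rw_timeout=<seconds>   (device must be idle, or add -P and reboot)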

 

Delayed ASM PST(Partner and Status Table) heart beats on ASM disks in normal or high redundancy diskgroup,thus the ASM instance dismount the diskgroup.By default, it is 15 seconds.By the way the heart beat delays are sort of ignored for external redundancy diskgroup.
By default the delayed ASM PST heartbeat threshold is 15 seconds; once a PST write exceeds that, the disk group is dismounted (as the CRS log above shows). With external redundancy the dismount is avoided, but with normal/high redundancy it does happen, and unfortunately that was exactly our configuration.
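
Whether a disk group is exposed to this dismount behaviour can be confirmed from its redundancy type; a quick check in the ASM instance (run as sysasm):

sqlplus -s / as sysasm <<'EOF'
-- TYPE is EXTERN, NORMAL or HIGH; only NORMAL/HIGH disk groups get dismounted
-- when PST heartbeats are delayed
col name format a20
select name, type, state, offline_disks from v$asm_diskgroup;
EOF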

 

The problem was finally resolved by increasing _asm_hbeatiowait on all nodes and restarting the clusterware services on every node so the new value took effect.

alter system set "_asm_hbeatiowait"=<value greater than 30 seconds> scope=spfile sid='*';
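
A possible end-to-end sequence is sketched below; 120 is only an example value satisfying "greater than 30 seconds" (the MOS note itself suggests 200), and $GRID_HOME is a placeholder for the Grid Infrastructure home on each node:

# 1) on one node, as the grid owner, set the parameter for all ASM instances
sqlplus -s / as sysasm <<'EOF'
alter system set "_asm_hbeatiowait"=120 scope=spfile sid='*';
EOF

# 2) on each node in turn, as root, restart the clusterware stack so the
#    ASM instances pick up the new spfile value
$GRID_HOME/bin/crsctl stop crs
$GRID_HOME/bin/crsctl start crs

# 3) re-run the x$ksppi query shown earlier to verify the new value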

 

 

 

 

 
