Platform: RK3368
OS: Android 6.0
Kernel: 3.10.0
现象
logcat Watchdog系统重启日志:
10-13 17:17:23.106 571 614 W IPCThreadState: Waiting for thread to be free. mExecutingThreadsCount=15 mMaxThreads=15
10-13 17:17:34.195 571 922 I Process : Sending signal. PID: 571 SIG: 3
10-13 17:17:34.196 571 576 I art : Thread[2,tid=576,WaitingInMainSignalCatcherLoop,Thread*=0x55840f6430,peer=0x12c010a0,"Signal Catcher"]: reacting to signal 3
10-13 17:17:34.875 571 625 E SensorsHal: poll() failed (Interrupted system call)
10-13 17:17:36.779 571 656 D WifiConfigStore: Retrieve network priorities after PNO.
10-13 17:17:36.781 571 576 I art : Wrote stack traces to '/data/anr/traces.txt'
10-13 17:17:40.746 571 922 I Watchdog_N: dumpKernelStacks
10-13 17:17:40.859 571 922 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.Watchdog$BinderThreadMonitor on foreground thread (android.fg)
10-13 17:17:40.859 571 922 W Watchdog: foreground thread stack trace:
10-13 17:17:40.860 571 922 W Watchdog: at android.os.Binder.blockUntilThreadAvailable(Native Method)
10-13 17:17:40.860 571 922 W Watchdog: at com.android.server.Watchdog$BinderThreadMonitor.monitor(Watchdog.java:200)
10-13 17:17:40.860 571 922 W Watchdog: at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:172)
10-13 17:17:40.860 571 922 W Watchdog: at android.os.Handler.handleCallback(Handler.java:739)
10-13 17:17:40.860 571 922 W Watchdog: at android.os.Handler.dispatchMessage(Handler.java:95)
10-13 17:17:40.860 571 922 W Watchdog: at android.os.Looper.loop(Looper.java:148)
10-13 17:17:40.860 571 922 W Watchdog: at android.os.HandlerThread.run(HandlerThread.java:61)
10-13 17:17:40.860 571 922 W Watchdog: at com.android.server.ServiceThread.run(ServiceThread.java:46)
10-13 17:17:40.860 571 922 W Watchdog: *** GOODBYE!
10-13 17:17:40.860 571 922 I Process : Sending signal. PID: 571 SIG: 9
10-13 17:17:40.977 199 199 E installd: eof
10-13 17:17:40.977 199 199 E installd: failed to read size
10-13 17:17:40.977 199 199 I installd: closing connection
10-13 17:17:40.978 188 188 I lowmemorykiller: ActivityManager disconnected
10-13 17:17:40.978 188 188 I lowmemorykiller: Closing Activity Manager data connection
10-13 17:17:40.988 189 189 I ServiceManager: service 'lock_settings' died
10-13 17:17:40.988 189 189 I ServiceManager: service 'network_management' died
10-13 17:17:40.988 189 189 I ServiceManager: service 'clipboard' died
10-13 17:17:40.988 189 189 I ServiceManager: service 'batterystats' died
kernel 日志:
[ 1012.641269] [3: watchdog: 922] SysRq : Show Blocked State
[ 1012.641365] [3: watchdog: 922] task PC stack pid father
[ 1012.641535] [3: watchdog: 922] rknand_bg D ffffffc0000856c0 0 153 2 0x00000000
[ 1012.641570] [3: watchdog: 922] Call trace:
[ 1012.641613] [3: watchdog: 922] [<ffffffc0000856c0>] __switch_to+0x80/0x8c
[ 1012.641650] [3: watchdog: 922] [<ffffffc0008347dc>] __schedule+0x400/0x608
[ 1012.641682] [3: watchdog: 922] [<ffffffc000834a48>] schedule+0x64/0x70
[ 1012.641714] [3: watchdog: 922] [<ffffffc000832e34>] schedule_timeout+0x1cc/0x1fc
[ 1012.641770] [3: watchdog: 922] [<ffffffbffc000ca0>] nand_blktrans_bg_thread+0x28c/0x2fc [rk30xxnand_ko]
[ 1012.641807] [3: watchdog: 922] [<ffffffc0000c8024>] kthread+0xac/0xb8
[ 1012.641835] [3: watchdog: 922] rknand D ffffffc0000856c0 0 154 2 0x00000000
[ 1012.641867] [3: watchdog: 922] Call trace:
[ 1012.641897] [3: watchdog: 922] [<ffffffc0000856c0>] __switch_to+0x80/0x8c
[ 1012.641928] [3: watchdog: 922] [<ffffffc0008347dc>] __schedule+0x400/0x608
[ 1012.641959] [3: watchdog: 922] [<ffffffc000834a48>] schedule+0x64/0x70
[ 1012.641989] [3: watchdog: 922] [<ffffffc000832e34>] schedule_timeout+0x1cc/0x1fc
[ 1012.642041] [3: watchdog: 922] [<ffffffbffc001854>] nand_blktrans_thread+0x1a8/0x61c [rk30xxnand_ko]
[ 1012.642074] [3: watchdog: 922] [<ffffffc0000c8024>] kthread+0xac/0xb8
[ 1012.642632] [3: watchdog: 922] Sched Debug Version: v0.10, 3.10.0 #441
...
[ 1012.836914] [1: kworker/u16:0: 6] binder: release 571:604 transaction 20644 in, still active
[ 1012.837026] [1: kworker/u16:0: 6] binder: send failed reply for transaction 20644 to 711:1430
[ 1012.837147] [1: kworker/u16:0: 6] binder: release 571:605 transaction 38506 in, still activell active
调试分析
看日志只能知道是MonitorChecker超时了,BinderThreadMonitor阻塞导致的。此时只看logcat日志是无法定位到更深层次的原因,就要看traces文件日志了;
触发ANR,dump trace文件的场景如下:
- Service Timeout:比如前台服务在20s内未执行完成
- BroadcastQueue Timeout:比如前台广播在10s内未执行完成
- ContentProvider Timeout:内容提供者,在publish过超时10s
- InputDispatching Timeout: 输入事件分发超时5s,包括按键和触摸事件
其实还有一种情况也会生成trace文件,那就是触发Framework的Watchdog的时候。
ANR traces 文件默认是由SystemProperties dalvik.vm.stack-trace-file指定的:
dalvik.vm.stack-trace-file = /data/anr/traces.txt
根据日志10-13 17:17:36.781 571 576 I art : Wrote stack traces to ‘/data/anr/traces.txt’,我们查找trace文件对应时间段的日志,发现多个binder在Native Waiting阻塞,是APP在调用WifiManager.enableNetwork,达到32个:
"Binder_1" prio=5 tid=8 Waiting
| group="main" sCount=1 dsCount=0 obj=0x12ca00a0 self=0x55841284e0
| sysTid=604 nice=0 cgrp=default sched=0/0 handle=0x7f829f4450
| state=S schedstat=( 0 0 0 ) utm=33 stm=17 core=0 HZ=100
| stack=0x7f828f8000-0x7f828fa000 stackSize=1013KB
| held mutexes=
at java.lang.Object.wait!(Native method)
- waiting on <0x091b4b3e> (a java.lang.Object)
at com.android.internal.util.AsyncChannel$SyncMessenger.sendMessageSynchronously(AsyncChannel.java:820)
- locked <0x091b4b3e> (a java.lang.Object)
at com.android.internal.util.AsyncChannel$SyncMessenger.-wrap0(AsyncChannel.java:-1)
at com.android.internal.util.AsyncChannel.sendMessageSynchronously(AsyncChannel.java:654)
at com.android.internal.util.AsyncChannel.sendMessageSynchronously(AsyncChannel.java:699)
at com.android.server.wifi.WifiStateMachine.syncEnableNetwork(WifiStateMachine.java:2328)
at com.android.server.wifi.WifiServiceImpl.enableNetwork(WifiServiceImpl.java:1034)
at android.net.wifi.IWifiManager$Stub.onTransact(IWifiManager.java:141)
at android.os.Binder.execTransact(Binder.java:453)
经过多次测试分析发现一些:
- 每次都是开机16分钟后出现"Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.Watchdog$BinderThreadMonitor on foreground thread (android.fg)";
- 都是出现"W IPCThreadState: Waiting for thread to be free. mExecutingThreadsCount=15 mMaxThreads=15",即Binder线程达到最大值,已经没有线程去处理新的binder消息,导致binder transaction failed.
- 结合trace文件分析,发现确实有多个Binder处于Waiting状态,都是阻塞在com.android.server.wifi.WifiServiceImpl.enableNetwork;
- 发现有个APP会频繁的调用enableNetwork,而network id是-1,问题可能是由此引发的.
代码分析Wifif enableNetwork流程:
frameworks/base/wifi/java/android/net/wifi/WifiManager.java
* @return {@code true} if the operation succeeded
*/
public boolean enableNetwork(int netId, boolean disableOthers) {
final boolean pin = disableOthers && mTargetSdkVersion < Build.VERSION_CODES.LOLLIPOP;
if (pin) {
registerPinningNetworkCallback();
}
boolean success;
try {
success = mService.enableNetwork(netId, disableOthers);
} catch (RemoteException e) {
success = false;
}
if (pin && !success) {
unregisterPinningNetworkCallback();
}
return success;
}
APP 调用WifiManager.enableNetwork,通过AIDL与服务端进行跨进程通信。
frameworks/base/wifi/java/android/net/wifi/IWifiManager.aidl
interface IWifiManager
{
int getSupportedFeatures();
WifiActivityEnergyInfo reportActivityInfo();
List<WifiConfiguration> getConfiguredNetworks();
List<WifiConfiguration> getPrivilegedConfiguredNetworks();
WifiConfiguration getMatchingWifiConfig(in ScanResult scanResult);
int addOrUpdateNetwork(in WifiConfiguration config);
boolean removeNetwork(int netId);
boolean enableNetwork(int netId, boolean disableOthers);
服务端:
frameworks/opt/net/wifi/service/java/com/android/server/wifi/WifiServiceImpl.java
/**
* See {@link android.net.wifi.WifiManager#enableNetwork(int, boolean)}
* @param netId the integer that identifies the network configuration
* to the supplicant
* @param disableOthers if true, disable all other networks.
* @return {@code true} if the operation succeeded
*/
public boolean enableNetwork(int netId, boolean disableOthers) {
enforceChangePermission();
if (mWifiStateMachineChannel != null) {
return mWifiStateMachine.syncEnableNetwork(mWifiStateMachineChannel, netId,
disableOthers);
} else {
Slog.e(TAG, "mWifiStateMachineChannel is not initialized");
return false;
}
}
frameworks/opt/net/wifi/service/java/com/android/server/wifi/WifiStateMachine.java
/**
* Enable a network
*
* @param netId network id of the network
* @param disableOthers true, if all other networks have to be disabled
* @return {@code true} if the operation succeeds, {@code false} otherwise
*/
public boolean syncEnableNetwork(AsyncChannel channel, int netId, boolean disableOthers) {
Log.d(TAG, "syncEnableNetwork "+channel);
Message resultMsg = channel.sendMessageSynchronously(CMD_ENABLE_NETWORK, netId,
disableOthers ? 1 : 0);
boolean result = (resultMsg.arg1 != FAILURE);
resultMsg.recycle();
Log.d(TAG, "syncEnableNetwork "+channel +" result="+result);
return result;
}
class ConnectModeState extends State {
@Override
public void enter() {
connectScanningService();
}
@Override
public boolean processMessage(Message message) {
WifiConfiguration config;
int netId;
boolean ok;
boolean didDisconnect;
String bssid;
String ssid;
NetworkUpdateResult result;
logStateAndMessage(message, getClass().getSimpleName());
switch (message.what) {
case WifiMonitor.ASSOCIATION_REJECTION_EVENT:
......
case CMD_ENABLE_NETWORK:
boolean disableOthers = message.arg2 == 1;
netId = message.arg1;
config = mWifiConfigStore.getWifiConfiguration(netId);
if (config == null) {
loge("No network with id = " + netId);
messageHandlingStatus = MESSAGE_HANDLING_STATUS_FAIL;
break;
}
// Tell autojoin the user did try to select to that network
// However, do NOT persist the choice by bumping the priority of the network
if (disableOthers) {
mWifiAutoJoinController.
updateConfigurationHistory(netId, true, false);
// Set the last selected configuration so as to allow the system to
// stick the last user choice without persisting the choice
mWifiConfigStore.setLastSelectedConfiguration(netId);
// Remember time of last connection attempt
lastConnectAttemptTimestamp = System.currentTimeMillis();
mWifiConnectionStatistics.numWifiManagerJoinAttempt++;
}
// Cancel auto roam requests
autoRoamSetBSSID(netId, "any");
int uid = message.sendingUid;
ok = mWifiConfigStore.enableNetwork(netId, disableOthers, uid);
if (!ok) {
messageHandlingStatus = MESSAGE_HANDLING_STATUS_FAIL;
}
replyToMessage(message, message.what, ok ? SUCCESS : FAILURE);
break;
case CMD_ENABLE_ALL_NETWORKS:
看到问题了,ConnectModeState处理CMD_ENABLE_NETWORK命令时,如果没有找到netId对应的网络,直接break了,而AsyncChannel的客户端还一直在等应答。如果APP频繁的调用enableNetwork,而且network id是-1,就会导致“W IPCThreadState: Waiting for thread to be free. mExecutingThreadsCount=15 mMaxThreads=15”.
修改后的代码:
--- a/service/java/com/android/server/wifi/WifiStateMachine.java
+++ b/service/java/com/android/server/wifi/WifiStateMachine.java
@@ -7356,6 +7358,7 @@ public class WifiStateMachine extends StateMachine implements WifiNative.WifiPno
if (config == null) {
loge("No network with id = " + netId);
messageHandlingStatus = MESSAGE_HANDLING_STATUS_FAIL;
+ replyToMessage(message, message.what, FAILURE);
break;
}
关于AsyncChannel
AsyncChannel为两个Handler之间的异步通道,Handler可能在同一个进程中,也可能在另一个进程中。AysncChannel可以使用两种协议样式。
-
第一种是简单的请求/应答协议,其中服务器不需要知道是哪个客户机发出请求。在简单的请求/应答协议中,客户端/源端向服务器/目的地发送请求。服务器使用replyToMessage方法。
-
第二种是服务器/目的地也需要知道它连接的是哪个客户端。例如,服务器需要将未经请求的消息发送回客户机。或者服务器为每个客户机保留不同的状态。在这个模型中,服务器还将使用连接方法。
frameworks/base/core/java/com/android/internal/util/AsyncChannel.java
private static Message sendMessageSynchronously(Messenger dstMessenger, Message msg) {
SyncMessenger sm = SyncMessenger.obtain();
try {
if (dstMessenger != null && msg != null) {
msg.replyTo = sm.mMessenger;
synchronized (sm.mHandler.mLockObject) {
dstMessenger.send(msg);
sm.mHandler.mLockObject.wait();
}
} else {
sm.mHandler.mResultMsg = null;
}
} catch (InterruptedException e) {
sm.mHandler.mResultMsg = null;
} catch (RemoteException e) {
sm.mHandler.mResultMsg = null;
}
Message resultMsg = sm.mHandler.mResultMsg;
sm.recycle();
return resultMsg;
}

本文分析了一起由Watchdog超时导致的Android系统重启问题,涉及Kernel 3.10.0和Android 6.0。在RK3368平台上,系统在16分钟后因BinderThreadMonitor阻塞而重启,主要发生在调用WifiManager.enableNetwork(-1)时。调试过程中,发现多个Binder线程等待,导致无法处理新消息。通过代码分析,问题出在CMD_ENABLE_NETWORK命令处理时未正确响应AsyncChannel,提出了解决方案。
2万+

被折叠的 条评论
为什么被折叠?



