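Android SWT Mechanism
Android SystemServer Watchdog Timeout (SWT): the Android watchdog timeout mechanism
A watchdog is, literally, a guard dog. Anyone who has worked on low-level embedded systems will know the idea: to keep a program on an MCU from running away because of interference, the MCU includes a dedicated timer circuit called a watchdog. While the MCU is working normally, it sends a signal to the watchdog at regular intervals, which is known as feeding the dog. If the program runs away and the MCU fails to feed the dog within the allowed time, the watchdog triggers a reset signal that restarts the CPU.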
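The feed-the-dog idea can be sketched in a few lines of Java. This is only an illustration under stated assumptions: SimpleWatchdogTimer is a hypothetical class, not an Android or MCU API, and the current time is passed in explicitly so that the behavior is deterministic.

```java
/** Hypothetical feed-the-dog timer for illustration; not an Android API. */
final class SimpleWatchdogTimer {
    private final long timeoutMillis;
    private long lastFedAt;

    SimpleWatchdogTimer(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastFedAt = nowMillis;
    }

    /** Called periodically by the healthy program ("feeding the dog"). */
    void feed(long nowMillis) {
        lastFedAt = nowMillis;
    }

    /** Returns true when no feed arrived within the timeout, i.e. a reset is due. */
    boolean shouldReset(long nowMillis) {
        return nowMillis - lastFedAt >= timeoutMillis;
    }
}
```

A real hardware watchdog counts down on its own clock; here the caller supplies the time, which keeps the sketch easy to reason about and to test.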
The Android framework includes a system service called Watchdog. It acts as a software watchdog that protects important system services. Its source code is located at:
frameworks/base/services/core/java/com/android/server/Watchdog.java
The code in this article is based on Android 10.
The Watchdog singleton
public class Watchdog extends Thread {
    static Watchdog sWatchdog;
    public static Watchdog getInstance() {
        if (sWatchdog == null) {
            sWatchdog = new Watchdog();
        }
        return sWatchdog;
    }
    private Watchdog() {
    }
}
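Watchdog is a Thread and also a singleton: its constructor is private, so outside code can obtain the instance only through getInstance().
Initialization and startup
Watchdog is initialized and started inside the SystemServer process. SystemServer's run() method registers and starts the various Android services, and that includes initializing and starting Watchdog. The code is as follows: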
private void startBootstrapServices() {
    // Start the watchdog as early as possible so we can crash the system server
    // if we deadlock during early boot
    traceBeginAndSlog("StartWatchdog");
    final Watchdog watchdog = Watchdog.getInstance();
    watchdog.start();
    traceEnd();
    ...
    // Complete the watchdog setup with an ActivityManager instance and listen for reboots
    // Do this only after the ActivityManagerService is properly started as a system process
    traceBeginAndSlog("InitWatchdog");
    watchdog.init(mSystemContext, mActivityManagerService);
    traceEnd();
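Note the first comment. In some earlier Android versions, Watchdog was initialized in startOtherServices() in SystemServer, and watchdog.start(), which runs Watchdog's run() method, was called from the systemReady callback registered with ActivityManagerService. It is now started early in system boot instead.
Constructor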
final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();
final HandlerChecker mMonitorChecker;
private Watchdog() {
    super("watchdog");
    // The monitor checker: runs on the foreground thread and checks each service's monitor
    mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
            "foreground thread", DEFAULT_TIMEOUT);
    mHandlerCheckers.add(mMonitorChecker);
    // Add checker for main thread. We only do a quick check since there
    // can be UI running on the thread.
    mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
            "main thread", DEFAULT_TIMEOUT));
    // UI thread Checker
    mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
            "ui thread", DEFAULT_TIMEOUT));
    // IO thread Checker
    mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
            "i/o thread", DEFAULT_TIMEOUT));
    // Display
    mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
            "display thread", DEFAULT_TIMEOUT));
    ......
}
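Calling watchdog.start() runs Watchdog's run() method, which is essentially an infinite loop. On each pass, the loop iterates over the HandlerCheckers and calls scheduleCheckLocked() on each of them: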
public void scheduleCheckLocked() {
    if (mCompleted) {
        // Buffer new monitors in mMonitorQueue so that services registered
        // during a check do not modify the list being iterated
        mMonitors.addAll(mMonitorQueue);
        mMonitorQueue.clear();
    }
    // No monitors and the thread's message queue is idle (polling), or the checker is paused
    if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
            || (mPauseCount > 0)) {
        mCompleted = true;
        return;
    }
    if (!mCompleted) {
        // we already have a check in flight, so no need
        return;
    }
    // Otherwise record the start time and post this checker so that run() performs the check
    mCompleted = false;
    mCurrentMonitor = null;
    mStartTime = SystemClock.uptimeMillis();
    mHandler.postAtFrontOfQueue(this);
}
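mHandler.postAtFrontOfQueue(this) posts a message to the front of the message queue of the monitored thread (the thread that owns mHandler) to detect whether that thread is blocked.
HandlerChecker is an inner class that implements Runnable, so the posted message ends up in its run() method for a further check.
For mMonitorChecker, run() also checks whether each registered monitor is healthy: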
public void run() {
    final int size = mMonitors.size();
    for (int i = 0 ; i < size ; i++) {
        synchronized (Watchdog.this) {
            mCurrentMonitor = mMonitors.get(i);
        }
        mCurrentMonitor.monitor(); // call back into the service to detect whether it is blocked
    }
    synchronized (Watchdog.this) {
        mCompleted = true; // check finished, nothing is blocked
        mCurrentMonitor = null;
    }
}
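The handler check can be approximated outside Android. The following is a minimal sketch, assuming a single-thread ExecutorService as a stand-in for a Handler and its Looper. QueueProbe and QueueProbeDemo are hypothetical names, and a plain executor appends to the back of its queue, whereas postAtFrontOfQueue jumps to the front.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for HandlerChecker: posts itself and flips a flag when it runs. */
final class QueueProbe implements Runnable {
    private final ExecutorService worker;
    private volatile boolean completed = true;

    QueueProbe(ExecutorService worker) {
        this.worker = worker;
    }

    void scheduleCheck() {
        if (!completed) {
            return; // a check is already in flight
        }
        completed = false;
        worker.execute(this);
    }

    @Override
    public void run() {
        completed = true; // the worker thread got around to us, so it is not blocked
    }

    boolean isCompleted() {
        return completed;
    }
}

final class QueueProbeDemo {
    /** Returns {completed while the worker is blocked, completed after it is released}. */
    static boolean[] demo() {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        // Simulate a stuck message: the worker thread blocks until released.
        worker.execute(() -> {
            started.countDown();
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        try {
            started.await();
            QueueProbe probe = new QueueProbe(worker);
            probe.scheduleCheck();
            boolean whileBlocked = probe.isCompleted();
            release.countDown();
            worker.shutdown();
            worker.awaitTermination(5, TimeUnit.SECONDS);
            return new boolean[] { whileBlocked, probe.isCompleted() };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            worker.shutdownNow();
        }
    }
}
```

While the worker is stuck, the probe stays incomplete. A watchdog thread polling isCompleted() against a start time would eventually classify that as a timeout.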
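Take ActivityManagerService.java as an example. To register a monitor() callback with Watchdog, a service first implements the Watchdog.Monitor interface
and then registers itself with Watchdog in its constructor. Note that there are two kinds of checks here. The first is addMonitor: in every check cycle, Watchdog uses mMonitorChecker to call back the monitor() method registered by the service, which enters a synchronized block, to detect whether the service is deadlocked or blocked. The second is addThread: Watchdog periodically posts a message to the system service through a HandlerChecker to detect whether the service's main handler thread is blocked.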
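// In the ActivityManagerService constructor: register with Watchdog
Watchdog.getInstance().addMonitor(this);
Watchdog.getInstance().addThread(mHandler);
// Watchdog.Monitor callback: returns only once the service lock can be acquired
public void monitor() {
    synchronized (this) { }
}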
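To see why an empty synchronized block works as a health check, consider the plain-Java sketch below. FakeService and LockProbeDemo are hypothetical names used only for illustration. The probe thread plays the role of the checker's run() method, and the holder thread plays a service thread that keeps the lock for too long.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical service with the same monitor() shape as in the article. */
final class FakeService {
    void monitor() {
        synchronized (this) { } // returns only once the lock can be taken
    }
}

final class LockProbeDemo {
    /** Returns {monitor() returned while the lock was held, returned after release}. */
    static boolean[] demo() {
        FakeService service = new FakeService();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch probed = new CountDownLatch(1);

        // Simulates a service thread that holds the service lock for a long time.
        Thread holder = new Thread(() -> {
            synchronized (service) {
                held.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        // Simulates the checker calling back into the service.
        Thread probe = new Thread(() -> {
            service.monitor();
            probed.countDown();
        });
        holder.setDaemon(true);
        probe.setDaemon(true);
        try {
            holder.start();
            held.await();
            probe.start();
            boolean whileHeld = probed.await(200, TimeUnit.MILLISECONDS);
            release.countDown();
            boolean afterRelease = probed.await(5, TimeUnit.SECONDS);
            return new boolean[] { whileHeld, afterRelease };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

As long as another thread owns the lock, monitor() cannot return, so the checker never reaches the point where it marks itself completed.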
Obtaining the state
private int evaluateCheckerCompletionLocked() {
    int state = COMPLETED;
    for (int i=0; i<mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        state = Math.max(state, hc.getCompletionStateLocked());
    }
    return state;
}
Each checker returns a different state depending on its situation:
public int getCompletionStateLocked() {
    if (mCompleted) {
        return COMPLETED;
    } else {
        long latency = SystemClock.uptimeMillis() - mStartTime;
        if (latency < mWaitMax/2) {
            return WAITING;
        } else if (latency < mWaitMax) {
            return WAITED_HALF;
        }
    }
    return OVERDUE;
}
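evaluateCheckerCompletionLocked() iterates over the HandlerCheckers and calls getCompletionStateLocked() on each one to obtain its completion state, keeping the most severe state it finds.
The four states and their conditions, assuming the default 60s timeout, are:
- COMPLETED: the monitored message queue is not blocked and the monitored monitors can acquire their locks normally. In this case mCompleted=true.
- WAITING: the monitored message queue has been blocked, or a monitored monitor has been unable to acquire its lock, for between 0 and 30s.
- WAITED_HALF: the monitored message queue has been blocked, or a monitored monitor has been unable to acquire its lock, for between 30 and 60s.
- OVERDUE: the monitored message queue has been blocked, or a monitored monitor has been unable to acquire its lock, for longer than the default 60s timeout.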
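The thresholds above can be checked with a small worked example. The sketch below mirrors the logic of getCompletionStateLocked() and evaluateCheckerCompletionLocked() in a standalone class. CompletionStates is a hypothetical name, and the numeric values of the constants are an assumption: the article does not show them, but they must be ordered by severity for Math.max to pick the worst state.

```java
/** Hypothetical standalone copy of the state logic, for illustration only. */
final class CompletionStates {
    // Assumed values, ordered by severity so that Math.max selects the worst state.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** State of a single checker, given how long its check has been pending. */
    static int stateFor(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    /** Overall state: the most severe state among all checkers. */
    static int worst(int... states) {
        int state = COMPLETED;
        for (int s : states) {
            state = Math.max(state, s);
        }
        return state;
    }
}
```

With a 60s limit, a check pending for 10s is WAITING, one pending for exactly 30s is already WAITED_HALF, and one pending for 60s or more is OVERDUE. A single slow checker is enough to raise the overall state.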
Finally, the state returned by the checkers is handled as follows:
final int waitState = evaluateCheckerCompletionLocked();
if (waitState == COMPLETED) {
    // The monitors have returned; reset
    waitedHalf = false;
    continue;
} else if (waitState == WAITING) {
    // still waiting but within their configured intervals; back off and recheck
    continue;
} else if (waitState == WAITED_HALF) {
    if (!waitedHalf) {
        Slog.i(TAG, "WAITED_HALF");
        // We've waited half the deadlock-detection interval. Pull a stack
        // trace and wait another half.
        ArrayList<Integer> pids = new ArrayList<Integer>();
        pids.add(Process.myPid());
        initialStack = ActivityManagerService.dumpStackTraces(pids,
                null, null, getInterestingNativePids());
        waitedHalf = true;
    }
    continue;
}
// something is overdue!
blockedCheckers = getBlockedCheckersLocked();
subject = describeCheckersLocked(blockedCheckers);
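COMPLETED and WAITING are within the acceptable range. If the state is WAITED_HALF, ActivityManagerService.dumpStackTraces() is called, once per wait cycle, to dump the relevant stack traces. If the state is OVERDUE, the timeout has been reached: getBlockedCheckersLocked() collects the checkers that are currently overdue, and describeCheckersLocked() produces a description of what is blocked.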
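The role of the waitedHalf flag is easier to see when a sequence of polls is replayed. The sketch below is a simplified model of the branch logic only. WatchdogLoopSketch and the action strings are hypothetical, and the real loop also sleeps between polls and performs the actual dumps.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the Watchdog loop's branching; not the AOSP implementation. */
final class WatchdogLoopSketch {
    enum WaitState { COMPLETED, WAITING, WAITED_HALF, OVERDUE }

    /** Replays a sequence of polled states and records the action taken on each pass. */
    static List<String> replay(WaitState... polls) {
        List<String> actions = new ArrayList<>();
        boolean waitedHalf = false;
        for (WaitState state : polls) {
            switch (state) {
                case COMPLETED:
                    waitedHalf = false; // checkers answered, start over
                    actions.add("reset");
                    break;
                case WAITING:
                    actions.add("wait");
                    break;
                case WAITED_HALF:
                    if (!waitedHalf) {
                        waitedHalf = true; // dump stacks only once per wait cycle
                        actions.add("dump-stacks");
                    } else {
                        actions.add("wait");
                    }
                    break;
                default:
                    // OVERDUE: report the blocked checkers and leave the loop
                    actions.add("report-and-restart");
                    return actions;
            }
        }
        return actions;
    }
}
```

Replaying WAITING, WAITED_HALF, WAITED_HALF, OVERDUE yields one stack dump at the halfway point, one plain wait, and then the overdue handling.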
String describeBlockedStateLocked() {
    if (mCurrentMonitor == null) {
        return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
    } else {
        return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
                + " on " + mName + " (" + getThread().getName() + ")";
    }
}
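Next, Watchdog dumps logs and stack traces. Finally, if a restart is needed, it calls
Process.killProcess(Process.myPid());
System.exit(10);
to kill its own process and trigger a restart.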
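Special handling
1. Some tasks in the system are heavy. PKMS (PackageManagerService), for example, has a lot of work to do on first boot and often needs more time. To avoid false alarms,
Watchdog's check of a system thread can be paused temporarily with: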
// Pauses Watchdog monitoring of the currently running thread. Useful before executing
// a long-running operation that could falsely trigger the watchdog.
// Each call needs a matching call to {@link #resumeWatchingCurrentThread}.
public void pauseWatchingCurrentThread(String reason) {
    synchronized (this) {
        for (HandlerChecker hc : mHandlerCheckers) {
            if (Thread.currentThread().equals(hc.getThread())) {
                hc.pauseLocked(reason);
            }
        }
    }
}
// Resumes Watchdog monitoring of the current thread
public void resumeWatchingCurrentThread(String reason) {
    synchronized (this) {
        for (HandlerChecker hc : mHandlerCheckers) {
            if (Thread.currentThread().equals(hc.getThread())) {
                hc.resumeLocked(reason);
            }
        }
    }
}
2. Call the addThread overload that takes a suitable timeout, timeoutMillis. The mHandler inside PKMS sets its timeout this way.
public void addThread(Handler thread, long timeoutMillis) {
    synchronized (this) {
        final String name = thread.getLooper().getThread().getName();
        mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
    }
}
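The requirement in item 1 that every pause has a matching resume suggests a counter rather than a boolean, which is consistent with the mPauseCount > 0 test in scheduleCheckLocked(). The sketch below shows that idea in isolation. PausableChecker and runUnwatched are hypothetical names, and the bodies are an assumption about how such a counter could work, not a copy of pauseLocked() or resumeLocked().

```java
/** Hypothetical pause counter for illustration; not the AOSP HandlerChecker. */
final class PausableChecker {
    private int pauseCount;

    synchronized void pause() {
        pauseCount++;
    }

    synchronized void resume() {
        if (pauseCount > 0) {
            pauseCount--; // ignore an unmatched resume instead of going negative
        }
    }

    /** While paused, a scheduled check would be treated as completed. */
    synchronized boolean isPaused() {
        return pauseCount > 0;
    }

    /** Runs a long task with checking paused, and always restores the previous state. */
    void runUnwatched(Runnable longTask) {
        pause();
        try {
            longTask.run();
        } finally {
            resume();
        }
    }
}
```

Wrapping the long-running work in try/finally keeps the pause and resume calls paired even if the task throws, and the counter lets nested pauses compose.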
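Summary:
1. Watchdog uses HandlerChecker to detect whether a message queue is blocked, and MonitorChecker to detect whether a core system service has held a lock for too long.
2. HandlerChecker uses mHandler.getLooper().getQueue().isPolling() to tell whether the queue is idle; otherwise it posts itself to the front of the queue and counts a timeout if it has not run within the wait limit. BinderThreadMonitor mainly checks whether the number of Binder threads has reached the system maximum, and the other monitors are checked through synchronized(this).
3. After a timeout, the system prints a series of information, including stack traces of the current process and of core native processes, kernel thread stack traces, the blocked threads in the kernel, and backtraces of all CPUs.
4. After a timeout, Watchdog kills its own process, which causes zygote to restart.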