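1. Overview
WatchDog is an important mechanism in Linux systems. Its purpose is to monitor the running state of the system and, when an anomaly such as a deadlock or an infinite loop occurs, to take corrective action in time (a reboot, for example) so that the system returns to normal.
Android has a hardware WatchDog that periodically checks whether critical hardware is working, and a software WatchDog in the framework layer that periodically checks whether critical system services are running normally.
When an application stops responding for longer than a certain time, Android shows an ANR (Application Not Responding) dialog so that the application does not stay unusable indefinitely. The user can choose to force-close it, which kills the application process.
The ANR mechanism targets applications. For the system itself, a prolonged "no response" makes the WatchDog trigger a "suicide" mechanism. Because of this mechanism, it is common to see the system_server process killed by the WatchDog, followed by a restart of the phone.
2. WatchDog Initialization
Android's WatchDog is a singleton thread. system_server creates, starts, and initializes it during boot.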
2.1 startBootstrapServices
frameworks/base/services/java/com/android/server/SystemServer.java
/**
* Starts the small tangle of critical services that are needed to get the system off the
* ground. These services have complex mutual dependencies which is why we initialize them all
* in one place here. Unless your service is also entwined in these dependencies, it should be
* initialized in one of the other functions.
*/
private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
t.traceBegin("startBootstrapServices");
// Start the watchdog as early as possible so we can crash the system server
// if we deadlock during early boot
t.traceBegin("StartWatchdog");
// [2.2] Create the WatchDog thread and start it
final Watchdog watchdog = Watchdog.getInstance();
watchdog.start();
t.traceEnd();
...
// Complete the watchdog setup with an ActivityManager instance and listen for reboots
// Do this only after the ActivityManagerService is properly started as a system process
t.traceBegin("InitWatchdog");
// [2.3] Register the REBOOT broadcast receiver
watchdog.init(mSystemContext, mActivityManagerService);
t.traceEnd();
...
}
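As Android versions have evolved, the WatchDog has been started earlier and earlier in the boot sequence so that anomalies such as deadlocks are caught as soon as possible.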
2.2 getInstance
WatchDog is designed as a singleton that extends Thread; the thread it creates is named "watchdog".
frameworks/base/services/core/java/com/android/server/Watchdog.java
/** This class calls its monitor every minute. Killing this process if they don't return **/
public class Watchdog extends Thread {
static final String TAG = "Watchdog";
...
// Note 1: Do not lower this value below thirty seconds without tightening the invoke-with
// timeout in com.android.internal.os.ZygoteConnection, or wrapped applications
// can trigger the watchdog.
// Note 2: The debug value is already below the wait time in ZygoteConnection. Wrapped
// applications may not work with a debug build. CTS will fail.
// The default timeout is 60s and must not be lowered below 30s
private static final long DEFAULT_TIMEOUT = DB ? 10 * 1000 : 60 * 1000;
private static final long CHECK_INTERVAL = DEFAULT_TIMEOUT / 2;
...
private static Watchdog sWatchdog;
...
/* This handler will be used to post message back onto the main thread */
// [2.2.1] List of all HandlerChecker objects
private final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();
private final HandlerChecker mMonitorChecker;
...
public static Watchdog getInstance() {
if (sWatchdog == null) {
sWatchdog = new Watchdog();
}
return sWatchdog;
}
private Watchdog() {
super("watchdog");
// Initialize handler checkers for each common thread we want to check. Note
// that we are not currently checking the background thread, since it can
// potentially hold longer running operations with no guarantees about the timeliness
// of operations there.
// The shared foreground thread is the main checker. It is where we
// will also dispatch monitor checks and do other work.
// Monitor android.fg
mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
"foreground thread", DEFAULT_TIMEOUT);
mHandlerCheckers.add(mMonitorChecker);
// Add checker for main thread. We only do a quick check since there
// can be UI running on the thread.
// Monitor main
mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
"main thread", DEFAULT_TIMEOUT));
// Add checker for shared UI thread.
// Monitor android.ui
mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
"ui thread", DEFAULT_TIMEOUT));
// And also check IO thread.
// Monitor android.io
mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
"i/o thread", DEFAULT_TIMEOUT));
// And the display thread.
// Monitor android.display
mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
"display thread", DEFAULT_TIMEOUT));
// And the animation thread.
// Monitor android.anim
mHandlerCheckers.add(new HandlerChecker(AnimationThread.getHandler(),
"animation thread", DEFAULT_TIMEOUT));
// And the surface animation thread.
// Monitor android.anim.lf
mHandlerCheckers.add(new HandlerChecker(SurfaceAnimationThread.getHandler(),
"surface animation thread", DEFAULT_TIMEOUT));
// Initialize monitor for Binder threads.
// [2.2.2] Monitor Binder threads
addMonitor(new BinderThreadMonitor());
...
}
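During initialization WatchDog builds a number of HandlerCheckers, which fall roughly into two categories:
- Monitor Checker: checks whether a Monitor object has been holding its lock for too long. Core system services such as AMS, PKMS, and WMS are all Monitor objects;
- Looper Checker: checks whether a thread's message queue has been busy for too long. The message queue used by Watchdog itself and global message queues such as Ui, Io, and Display are all checked;
The two categories have different emphases. The Monitor Checker warns against holding the object lock of a core system service for a long time, because that blocks many methods from running. The Looper Checker warns against hogging a message queue for a long time, because other messages would then never be handled. Either case leaves the system stuck (System Not Responding).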
2.2.1 HandlerChecker
/**
* Used for checking status of handle threads and scheduling monitor callbacks.
*/
public final class HandlerChecker implements Runnable {
// The Handler being monitored
private final Handler mHandler;
// Descriptive name of the monitored thread
private final String mName;
// Maximum wait time
private final long mWaitMax;
// Monitors that are currently checked
private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
// Monitors queued to be added on the next schedule
private final ArrayList<Monitor> mMonitorQueue = new ArrayList<Monitor>();
// true by default; set to false when a check starts
private boolean mCompleted;
// The monitor currently being run
private Monitor mCurrentMonitor;
// Time at which the check started
private long mStartTime;
// Pause count
private int mPauseCount;
HandlerChecker(Handler handler, String name, long waitMaxMillis) {
mHandler = handler;
mName = name;
mWaitMax = waitMaxMillis;
mCompleted = true;
}
void addMonitorLocked(Monitor monitor) {
// We don't want to update mMonitors when the Handler is in the middle of checking
// all monitors. We will update mMonitors on the next schedule if it is safe
mMonitorQueue.add(monitor);
}
...
}
HandlerChecker implements Runnable.
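The split between mMonitors and mMonitorQueue can be illustrated with a minimal, framework-free sketch. The class and method names below (DeferredMonitors, beginRound, runRound) are hypothetical and only mimic the idea: additions are parked in a pending list and merged in only when no check round is in flight.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for HandlerChecker's two monitor lists; not AOSP code.
class DeferredMonitors {
    private final List<Runnable> active = new ArrayList<>();   // plays the role of mMonitors
    private final List<Runnable> pending = new ArrayList<>();  // plays the role of mMonitorQueue
    private boolean completed = true;

    // New monitors are always parked first, never added to 'active' directly.
    void add(Runnable monitor) {
        pending.add(monitor);
    }

    // Start of a check round: merging is only safe when the previous round finished.
    void beginRound() {
        if (completed) {
            active.addAll(pending);
            pending.clear();
        }
        completed = false;
    }

    // Runs every active monitor, then marks the round as finished.
    void runRound() {
        for (Runnable monitor : active) {
            monitor.run();
        }
        completed = true;
    }

    int activeCount() {
        return active.size();
    }
}
```

A monitor added while a round is still running stays in the pending list, so the list being iterated never changes underneath the loop.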
2.2.2 addMonitor
In addition to the fixed HandlerCheckers that WatchDog adds by itself, Watchdog provides two methods:
- addMonitor
- addThread
that let external code add Monitor Checkers and Looper Checkers.
/** Monitor for checking the availability of binder threads. The monitor will block until
* there is a binder thread available to process in coming IPCs to make sure other processes
* can still communicate with the service.
*/
private static final class BinderThreadMonitor implements Watchdog.Monitor {
@Override
public void monitor() {
Binder.blockUntilThreadAvailable();
}
}
public interface Monitor {
void monitor();
}
public void addMonitor(Monitor monitor) {
synchronized (this) {
mMonitorChecker.addMonitorLocked(monitor);
}
}
public void addThread(Handler thread) {
addThread(thread, DEFAULT_TIMEOUT);
}
public void addThread(Handler thread, long timeoutMillis) {
synchronized (this) {
final String name = thread.getLooper().getThread().getName();
mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
}
}
Monitoring Binder threads: the monitor is passed to addMonitor, which adds it to the mMonitorQueue member of the HandlerChecker.
blockUntilThreadAvailable ultimately calls into IPCThreadState and waits until a binder thread is free.
frameworks/native/libs/binder/IPCThreadState.cpp
void IPCThreadState::blockUntilThreadAvailable()
{
pthread_mutex_lock(&mProcess->mThreadCountLock);
while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
// Wait while the number of executing binder threads has reached the process's binder thread limit (31 for SystemServer)
ALOGW("Waiting for thread to be free. mExecutingThreadsCount=%lu mMaxThreads=%lu\n",
static_cast<unsigned long>(mProcess->mExecutingThreadsCount),
static_cast<unsigned long>(mProcess->mMaxThreads));
pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
}
pthread_mutex_unlock(&mProcess->mThreadCountLock);
}
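Here the Binder thread monitor is added to the HandlerChecker of the android.fg thread (mMonitorChecker), which checks whether the binder threads are working normally.
If the thread pool limit is exceeded, a log like the following usually appears: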
IPCThreadState: Waiting for thread to be free. mExecutingThreadsCount=32 mMaxThreads=31
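The wait-until-a-thread-is-free loop can be sketched in plain Java with a lock and a condition variable. This is only an analogy for the native code above: ThreadPoolGate and its methods are hypothetical names, and a timeout is added (the native version waits without one) so that the example always terminates.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the native thread-count bookkeeping; not the binder API.
class ThreadPoolGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition threadFreed = lock.newCondition();
    private final int maxThreads;
    private int executing;

    ThreadPoolGate(int maxThreads) {
        this.maxThreads = maxThreads;
    }

    // A pool thread starts handling a call.
    void begin() {
        lock.lock();
        try {
            executing++;
        } finally {
            lock.unlock();
        }
    }

    // A pool thread finished; wake up anyone waiting for a free thread.
    void end() {
        lock.lock();
        try {
            executing--;
            threadFreed.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Waits while every thread is busy; returns false if none frees up in time.
    boolean awaitFreeThread(long timeoutMillis) {
        long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        lock.lock();
        try {
            while (executing >= maxThreads) {
                if (nanos <= 0L) {
                    return false;
                }
                nanos = threadFreed.awaitNanos(nanos);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            lock.unlock();
        }
    }
}
```

With a limit of 1 and one call in flight, awaitFreeThread(50) times out; after end() it returns immediately.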
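Monitoring Handler threads
Watchdog monitors the following system_server threads. DEFAULT_TIMEOUT is 60s by default and is set to 10s when debugging, which makes potential ANR problems easier to find.
Thread name | Handler | Timeout |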
---|---|---|
main | new Handler(Looper.getMainLooper()) | 60s |
android.fg | FgThread.getHandler | 60s |
android.ui | UiThread.getHandler | 60s |
android.io | IoThread.getHandler | 60s |
android.display | DisplayThread.getHandler | 60s |
android.anim | AnimationThread.getHandler | 60s |
android.anim.lf | SurfaceAnimationThread.getHandler | 60s |
BlobStore | HandlerThread | 60s |
PackageManager(PermissionManagerService) | HandlerThread | 60s |
PowerManagerService | HandlerThread | 60s |
PackageManager | HandlerThread | 10min |
RollbackManagerServiceHandler | HandlerThread | 10min |
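Monitoring synchronization locks
Every system service that Watchdog can monitor implements the Watchdog.Monitor interface and its monitor() method. The monitors run on the android.fg thread. The main classes in the system that implement this interface are: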
- ActivityManagerService
- WindowManagerService
- InputManagerService
- PowerManagerService
- BinderThreadMonitor
- MediaProjectionManagerService
- MediaRouterService
- MediaSessionService
- StorageManagerService
- TvRemoteService
Taking AMS as an example:
frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java
public class ActivityManagerService extends IActivityManager.Stub
implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {
...
// Note: This method is invoked on the main thread but may need to attach various
// handlers to other threads. So take care to be explicit about the looper.
public ActivityManagerService(Context systemContext, ActivityTaskManagerService atm) {
...
Watchdog.getInstance().addMonitor(this);
Watchdog.getInstance().addThread(mHandler);
...
}
...
/** In this method we try to acquire our lock to make sure that we have not deadlocked */
public void monitor() {
synchronized (this) { }
}
...
}
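The monitor() methods implemented by the monitored services are all very simple: they just acquire the lock. If the service is not holding its lock for long periods, the method returns quickly. If the lock cannot be acquired for a long time, a long lock hold or even a deadlock has probably occurred.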
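The lock-probe idea can be tried outside the framework with a short sketch. LockProbe and canAcquire are hypothetical names; the probe thread does what monitor() does (enter and leave an empty synchronized block), and the caller plays the role of the watchdog by waiting for it with a timeout.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper illustrating the "acquire the lock and return" probe.
class LockProbe {
    // Returns true if a probe thread could enter synchronized(lock) within timeoutMillis.
    static boolean canAcquire(Object lock, long timeoutMillis) {
        CountDownLatch acquired = new CountDownLatch(1);
        Thread probe = new Thread(() -> {
            synchronized (lock) { }   // same trick as monitor(): enter and leave
            acquired.countDown();
        }, "lock-probe");
        probe.setDaemon(true);        // a stuck probe must not keep the JVM alive
        probe.start();
        try {
            return acquired.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

If another thread holds the lock for longer than the timeout, canAcquire returns false, which corresponds to the situation the watchdog reports as a blocked monitor.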
2.3 init
private ActivityManagerService mActivity;
...
/**
* Registers a {@link BroadcastReceiver} to listen to reboot broadcasts and trigger reboot.
* Should be called during boot after the ActivityManagerService is up and registered
* as a system service so it can handle registration of a {@link BroadcastReceiver}.
*/
public void init(Context context, ActivityManagerService activity) {
mActivity = activity;
context.registerReceiver(new RebootRequestReceiver(),
new IntentFilter(Intent.ACTION_REBOOT),
android.Manifest.permission.REBOOT, null);
...
}
final class RebootRequestReceiver extends BroadcastReceiver {
@Override
public void onReceive(Context c, Intent intent) {
if (intent.getIntExtra("nowait", 0) != 0) {
rebootSystem("Received ACTION_REBOOT broadcast");
return;
}
Slog.w(TAG, "Unsupported ACTION_REBOOT broadcast: " + intent);
}
}
/**
* Perform a full reboot of the system.
*/
void rebootSystem(String reason) {
Slog.i(TAG, "Rebooting system because: " + reason);
IPowerManager pms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE);
try {
pms.reboot(false, reason, false);
} catch (RemoteException ex) {
}
}
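The reboot is ultimately carried out through PowerManagerService.
3. WatchDog Monitoring Mechanism
WatchDog is a thread; calling its start method makes it enter its run method, which starts the monitoring loop.
3.1 Trigger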
public class Watchdog extends Thread {
...
// These are temporally ordered: larger values as lateness increases
private static final int COMPLETED = 0;
private static final int WAITING = 1;
private static final int WAITED_HALF = 2;
private static final int OVERDUE = 3;
...
@Override
public void run() {
// Set once the first half (30s) of the timeout has been waited out
boolean waitedHalf = false;
while (true) {
final List<HandlerChecker> blockedCheckers;
final String subject;
// Whether a restart is allowed
final boolean allowRestart;
// Whether a debugger has been connected
int debuggerWasConnected = 0;
synchronized (this) {
// 30s
long timeout = CHECK_INTERVAL;
// Make sure we (re)spin the checkers that have become idle within
// this wait-and-check interval
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
// [3.1.1] Call scheduleCheckLocked() on every HandlerChecker
hc.scheduleCheckLocked();
}
if (debuggerWasConnected > 0) {
debuggerWasConnected--;
}
// NOTE: We use uptimeMillis() here because we do not want to increment the time we
// wait while asleep. If the device is asleep then the thing that we are waiting
// to timeout on is asleep as well and won't have a chance to run, causing a false
// positive on when to kill things.
// Start timing; time spent asleep is not counted
long start = SystemClock.uptimeMillis();
// Make sure the full 30s have elapsed before continuing
while (timeout > 0) {
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
try {
// An InterruptedException is caught here and the loop keeps waiting
wait(timeout);
// Note: mHandlerCheckers and mMonitorChecker may have changed after waiting
} catch (InterruptedException e) {
Log.wtf(TAG, e);
}
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
}
boolean fdLimitTriggered = false;
if (mOpenFdMonitor != null) {
fdLimitTriggered = mOpenFdMonitor.monitor();
}
if (!fdLimitTriggered) {
// 【3.1.2】评估 HandlerChecker 状态
final int waitState = evaluateCheckerCompletionLocked();
// waitState is COMPLETED: the check finished normally, so keep checking
if (waitState == COMPLETED) {
// The monitors have returned; reset
waitedHalf = false;
continue;
// Less than 30s elapsed: keep checking
} else if (waitState == WAITING) {
// still waiting but within their configured intervals; back off and recheck
continue;
// Between 30s and 60s: dump some information and keep checking
} else if (waitState == WAITED_HALF) {
if (!waitedHalf) {
Slog.i(TAG, "WAITED_HALF");
// We've waited half the deadlock-detection interval. Pull a stack
// trace and wait another half.
ArrayList<Integer> pids = new ArrayList<>(mInterestingJavaPids);
// Dump the stacks of the interesting Java and native processes such as system_server, phone, and media
ActivityManagerService.dumpStackTraces(pids, null, null,
getInterestingNativePids(), null);
waitedHalf = true;
}
continue;
}
// [3.2] Timed out
// something is overdue!
blockedCheckers = getBlockedCheckersLocked();
subject = describeCheckersLocked(blockedCheckers);
} else {
blockedCheckers = Collections.emptyList();
subject = "Open FD high water mark reached";
}
allowRestart = mAllowRestart;
}
...
}
...
}
...
}
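The inner wait loop of run() recomputes the remaining time after every wakeup, so an early or interrupted wait() cannot shorten the interval. A minimal sketch of that pattern follows; FullIntervalWait is a hypothetical name, and System.nanoTime() is used here as an assumed stand-in for SystemClock.uptimeMillis(), which is not available outside Android.

```java
// Hypothetical illustration of the "wait the whole interval" loop.
class FullIntervalWait {
    // Waits on 'lock' until at least intervalMillis have elapsed and
    // returns the elapsed time in milliseconds.
    static long waitFullInterval(Object lock, long intervalMillis) {
        long start = System.nanoTime();
        long remaining = intervalMillis;
        synchronized (lock) {
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    // Like the listing above: swallow the interrupt and keep waiting.
                }
                long elapsed = (System.nanoTime() - start) / 1_000_000L;
                remaining = intervalMillis - elapsed;
            }
        }
        return (System.nanoTime() - start) / 1_000_000L;
    }
}
```

Calling waitFullInterval(new Object(), 50) returns a value of at least 50, however many times the wait is cut short.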
3.1.1 scheduleCheckLocked
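The HandlerChecker posts itself at the very front of the monitored thread's message queue, so that thread executes HandlerChecker.run(). run() calls monitor() on each registered monitor and sets mCompleted = true once it has finished.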
public final class HandlerChecker implements Runnable {
...
public void scheduleCheckLocked() {
// When no check is in flight, move all monitors from mMonitorQueue to mMonitors
if (mCompleted) {
// Safe to update monitors in queue, Handler is not in the middle of work
mMonitors.addAll(mMonitorQueue);
mMonitorQueue.clear();
}
// If there are no monitors (true for every thread except android.fg) and the looper is polling,
// or the checker is paused (mPauseCount > 0), set mCompleted to true and return
if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
|| (mPauseCount > 0)) {
// Don't schedule until after resume OR
// If the target looper has recently been polling, then
// there is no reason to enqueue our checker on it since that
// is as good as it not being deadlocked. This avoid having
// to do a context switch to check the thread. Note that we
// only do this if we have no monitors since those would need to
// be executed at this point.
mCompleted = true;
return;
}
// A check is already in flight; no need to schedule another one
if (!mCompleted) {
// we already have a check in flight, so no need
return;
}
mCompleted = false;
mCurrentMonitor = null;
// Time at which the check starts
mStartTime = SystemClock.uptimeMillis();
// Insert at the front of the message queue so that run() executes as soon as possible
mHandler.postAtFrontOfQueue(this);
}
...
@Override
public void run() {
// Once we get here, we ensure that mMonitors does not change even if we call
// #addMonitorLocked because we first add the new monitors to mMonitorQueue and
// move them to mMonitors on the next schedule when mCompleted is true, at which
// point we have completed execution of this method.
final int size = mMonitors.size();
for (int i = 0 ; i < size ; i++) {
synchronized (Watchdog.this) {
mCurrentMonitor = mMonitors.get(i);
}
// Call back into the concrete monitor's monitor() method
mCurrentMonitor.monitor();
}
synchronized (Watchdog.this) {
mCompleted = true;
mCurrentMonitor = null;
}
}
...
}
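- Looper Checker: a HandlerChecker for any thread other than android.fg holds no monitor objects, so it only determines whether the message queue is idle. If the thread stays stuck in the message it is currently handling, the Runnable posted by mHandler.postAtFrontOfQueue(this) is not dispatched, run() is delayed, and mCompleted is not set to true.
- Monitor Checker: the HandlerChecker of the android.fg thread. Its handler may be blocked, and the monitor() calls themselves may also block.
Possible problems:
- If other messages keep being inserted with postAtFrontOfQueue(), the WatchDog's check may never get a chance to run;
- Or each monitor takes some time and the total adds up to more than 60s, which causes a WatchDog timeout.
These are atypical WatchDog timeouts.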
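The effect of posting the checker at the front of the queue can be shown with a single-threaded simulation. Everything below is hypothetical (a Deque stands in for the Looper's message queue, and Checker keeps only the mCompleted flag); it is a sketch of the scheduling idea, not of android.os.Handler.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical simulation of scheduling a checker at the front of a message queue.
class FrontOfQueueDemo {
    // Stand-in for a Looper's message queue.
    static final class Queue {
        final Deque<Runnable> tasks = new ArrayDeque<>();

        void post(Runnable r) {
            tasks.addLast(r);
        }

        void postAtFront(Runnable r) {
            tasks.addFirst(r);
        }

        // Runs the next task, as the looper thread would; false if the queue is empty.
        boolean runNext() {
            Runnable r = tasks.pollFirst();
            if (r == null) {
                return false;
            }
            r.run();
            return true;
        }
    }

    static final class Checker implements Runnable {
        boolean completed = true;

        void schedule(Queue queue) {
            if (!completed) {
                return;               // a check is already in flight
            }
            completed = false;
            queue.postAtFront(this);  // jump ahead of the messages already queued
        }

        @Override
        public void run() {
            completed = true;         // reached only when the thread gets to this task
        }
    }
}
```

Because the checker sits at the head of the queue, it completes as soon as the thread finishes its current message. If that message never returns, completed stays false and the elapsed time keeps growing.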
3.1.2 evaluateCheckerCompletionLocked
private int evaluateCheckerCompletionLocked() {
int state = COMPLETED;
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
state = Math.max(state, hc.getCompletionStateLocked());
}
return state;
}
This iterates over mHandlerCheckers and returns the largest (worst) wait state among them.
public final class HandlerChecker implements Runnable {
...
public int getCompletionStateLocked() {
if (mCompleted) {
return COMPLETED;
} else {
long latency = SystemClock.uptimeMillis() - mStartTime;
// mWaitMax defaults to 60s
if (latency < mWaitMax/2) {
return WAITING;
} else if (latency < mWaitMax) {
return WAITED_HALF;
}
}
return OVERDUE;
}
...
}
The result is derived from the mCompleted flag and from how long the check task has been running.
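The thresholds can be checked with a small pure-function version of the same logic. CompletionState, classify, and worst are hypothetical names; the constants and comparisons mirror the listings above.

```java
// Hypothetical pure-function restatement of the state classification.
class CompletionState {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    // Same comparisons as getCompletionStateLocked(), with the inputs passed in.
    static int classify(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    // Worst state across all checkers, as evaluateCheckerCompletionLocked() does with Math.max.
    static int worst(int... states) {
        int state = COMPLETED;
        for (int s : states) {
            state = Math.max(state, s);
        }
        return state;
    }
}
```

With the default 60s limit, a check that has been pending for 29,999 ms is WAITING, one pending for 30,000 ms is WAITED_HALF, and one pending for 60,000 ms is OVERDUE. A single overdue checker makes the overall state OVERDUE.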
3.2 Execution
public class Watchdog extends Thread {
...
@Override
public void run() {
...
while (true) {
...
synchronized (this) {
...
// something is overdue!
// [3.2.1] Get the blocked checkers
blockedCheckers = getBlockedCheckersLocked();
// [3.2.2] Description of the blocked checkers
subject = describeCheckersLocked(blockedCheckers);
...
...
}
allowRestart = mAllowRestart;
}
// If we got here, that means that the system is most likely hung.
// First collect stack traces from all threads of the system process.
// Then kill this process so that the system will restart.
// Event log entry with tag watchdog
EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
// Interesting Java processes
ArrayList<Integer> pids = new ArrayList<>(mInterestingJavaPids);
// Time at which the system hang was detected
long anrTime = SystemClock.uptimeMillis();
StringBuilder report = new StringBuilder();
// Read /proc/pressure/memory
report.append(MemoryPressureUtil.currentPsiState());
// CPU Tracker
ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(false);
StringWriter tracesFileException = new StringWriter();
// Dump stack information for the main Java and native processes
final File stack = ActivityManagerService.dumpStackTraces(
pids, processCpuTracker, new SparseArray<>(), getInterestingNativePids(),
tracesFileException);
// Give some extra time to make sure the stack traces get written.
// The system's been hanging for a minute, another second or two won't hurt much.
// Make sure the dump has been written out
SystemClock.sleep(5000);
processCpuTracker.update();
report.append(processCpuTracker.printCurrentState(anrTime));
report.append(tracesFileException.getBuffer());
// Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
doSysRq('w');
doSysRq('l');
// Try to add the error to the dropbox, but assuming that the ActivityManager
// itself may be deadlocked. (which has happened, causing this statement to
// deadlock and the watchdog as a whole to be ineffective)
Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
public void run() {
// If a watched thread hangs before init() is called, we don't have a
// valid mActivity. So we can't log the error to dropbox.
if (mActivity != null) {
mActivity.addErrorToDropBox(
"watchdog", null, "system_server", null, null, null,
subject, report.toString(), stack, null);
}
FrameworkStatsLog.write(FrameworkStatsLog.SYSTEM_SERVER_WATCHDOG_OCCURRED,
subject);
}
};
dropboxThread.start();
try {
dropboxThread.join(2000); // wait up to 2 seconds for it to return.
} catch (InterruptedException ignored) {}
IActivityController controller;
synchronized (this) {
controller = mController;
}
// Special handling when an IActivityController is set
if (controller != null) {
Slog.i(TAG, "Reporting stuck state to activity controller");
try {
Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
// 1 = keep waiting, -1 = kill system
// [3.2.3] The controller can ask to keep waiting instead of killing the system
int res = controller.systemNotResponding(subject);
if (res >= 0) {
Slog.i(TAG, "Activity controller requested to coninue to wait");
waitedHalf = false;
continue;
}
} catch (RemoteException e) {
}
}
// Only kill the process if the debugger is not attached.
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
if (debuggerWasConnected >= 2) {
Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
} else if (debuggerWasConnected > 0) {
Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
} else if (!allowRestart) {
Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
} else {
// [3.2.4] Print the stack traces of the blocked threads
Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
Slog.w(TAG, "*** GOODBYE!");
// Kill system_server
Process.killProcess(Process.myPid());
System.exit(10);
}
waitedHalf = false;
}
}
After the information has been collected, the system_server process is killed. allowRestart defaults to true; running the am hang command sets it to false, in which case system_server is not killed.
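The order of the checks at the end of run() can be condensed into one decision function. KillDecision, Outcome, and decide are hypothetical names, and the inputs that the real code reads from fields and from Debug.isDebuggerConnected() are passed in as parameters; this is a sketch of the branching only.

```java
// Hypothetical summary of the final branch in Watchdog.run().
class KillDecision {
    enum Outcome { KEEP_WAITING, SKIP_DEBUGGER, SKIP_RESTART_DISABLED, KILL }

    static Outcome decide(boolean hasController, int controllerResult,
                          int debuggerWasConnected, boolean allowRestart) {
        // An activity controller that returns >= 0 asks the watchdog to keep waiting.
        if (hasController && controllerResult >= 0) {
            return Outcome.KEEP_WAITING;
        }
        // A debugger that is, or recently was, attached prevents the kill.
        if (debuggerWasConnected > 0) {
            return Outcome.SKIP_DEBUGGER;
        }
        // "am hang" clears allowRestart.
        if (!allowRestart) {
            return Outcome.SKIP_RESTART_DISABLED;
        }
        return Outcome.KILL;
    }
}
```

Only when no controller objects, no debugger is involved, and restarts are allowed does the function return KILL, which corresponds to the Process.killProcess branch.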
3.2.1 getBlockedCheckersLocked
private ArrayList<HandlerChecker> getBlockedCheckersLocked() {
ArrayList<HandlerChecker> checkers = new ArrayList<HandlerChecker>();
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
if (hc.isOverdueLocked()) {
checkers.add(hc);
}
}
return checkers;
}
public final class HandlerChecker implements Runnable {
...
boolean isOverdueLocked() {
return (!mCompleted) && (SystemClock.uptimeMillis() > mStartTime + mWaitMax);
}
...
}
This iterates over all HandlerCheckers and collects every checker that has not completed (mCompleted is false) and has exceeded its timeout.
3.2.2 describeCheckersLocked
private String describeCheckersLocked(List<HandlerChecker> checkers) {
StringBuilder builder = new StringBuilder(128);
for (int i=0; i<checkers.size(); i++) {
if (builder.length() > 0) {
builder.append(", ");
}
builder.append(checkers.get(i).describeBlockedStateLocked());
}
return builder.toString();
}
public final class HandlerChecker implements Runnable {
...
String describeBlockedStateLocked() {
if (mCurrentMonitor == null) {
return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
} else {
return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
+ " on " + mName + " (" + getThread().getName() + ")";
}
}
...
}
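For every timed-out HandlerChecker returned by getBlockedCheckersLocked, the corresponding message is produced:
- "Blocked in monitor …" is reported when mCurrentMonitor is non-null, which can only happen on the android.fg thread (the Monitor Checker). It means the named monitor's monitor() call has not returned, typically because it cannot acquire its lock;
- "Blocked in handler on …" is reported when mCurrentMonitor is null, which is the case for the other threads (Looper Checkers) and for android.fg before it reaches its monitors. It means the thread has spent too long handling its current message.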
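How the subject string is assembled can be reproduced with a self-contained sketch. BlockedReport and its Checker snapshot class are hypothetical; the overdue test and the message formats follow isOverdueLocked() and describeBlockedStateLocked() as listed above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical snapshot-based version of the blocked-checker report.
class BlockedReport {
    static final class Checker {
        final String name;
        final String thread;
        final String monitorClass;   // null when no monitor is currently running
        final boolean completed;
        final long startMillis;
        final long waitMaxMillis;

        Checker(String name, String thread, String monitorClass,
                boolean completed, long startMillis, long waitMaxMillis) {
            this.name = name;
            this.thread = thread;
            this.monitorClass = monitorClass;
            this.completed = completed;
            this.startMillis = startMillis;
            this.waitMaxMillis = waitMaxMillis;
        }

        boolean isOverdue(long nowMillis) {
            return !completed && nowMillis > startMillis + waitMaxMillis;
        }

        String describe() {
            String where = " on " + name + " (" + thread + ")";
            return monitorClass == null
                    ? "Blocked in handler" + where
                    : "Blocked in monitor " + monitorClass + where;
        }
    }

    // Joins the descriptions of all overdue checkers with ", ".
    static String subject(List<Checker> all, long nowMillis) {
        List<String> parts = new ArrayList<>();
        for (Checker c : all) {
            if (c.isOverdue(nowMillis)) {
                parts.add(c.describe());
            }
        }
        return String.join(", ", parts);
    }
}
```

A checker that has completed, or whose deadline has not passed yet, contributes nothing to the subject.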
3.2.3 systemNotResponding
If an IActivityController has been set, it can choose to keep waiting instead of killing the system. Monkey's implementation serves as an example:
development/cmds/monkey/src/com/android/commands/monkey/Monkey.java
/**
* Application that injects random key events and other actions into the system.
*/
public class Monkey {
...
/** Kill the process after a timeout or crash. */
private boolean mKillProcessAfterError;
...
/**
* Monitor operations happening in the system.
*/
private class ActivityController extends IActivityController.Stub {
...
public int systemNotResponding(String message) {
StrictMode.ThreadPolicy savedPolicy = StrictMode.allowThreadDiskWrites();
Logger.err.println("// WATCHDOG: " + message);
...
return (mKillProcessAfterError) ? -1 : 1;
}
}
...
/**
* Run the command!
*
* @param args The command-line arguments
* @return Returns a posix-style result code. 0 for no error.
*/
private int run(String[] args) {
...
if (!processOptions()) {
return -1;
}
...
if (!getSystemInterfaces()) {
return -3;
}
...
}
...
/**
* Process the command-line options
*
* @return Returns true if options were parsed with no apparent errors.
*/
private boolean processOptions() {
...
try {
String opt;
...
while ((opt = nextOption()) != null) {
...
} else if (opt.equals("--kill-process-after-error")) {
mKillProcessAfterError = true;
...
}
...
}
...
}
...
}
...
/**
* Attach to the required system interfaces.
*
* @return Returns true if all system interfaces were available.
*/
private boolean getSystemInterfaces() {
mAm = ActivityManager.getService();
...
try {
mAm.setActivityController(new ActivityController(), true);
...
} catch (RemoteException e) {
Logger.err.println("** Failed talking with activity manager!");
return false;
}
return true;
}
...
}
If "--kill-process-after-error" is set in a Monkey test, an SWT (software watchdog timeout) may be triggered; otherwise the WatchDog keeps waiting and does not restart.
3.2.4 killProcess
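Once the WatchDog detects a timeout, it kills the system_server process. That causes the Zygote process to kill itself, which in turn makes init restart Zygote. This is what appears as a restart of the phone's framework.
It is usually accompanied by the following log: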
Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in ...
...
Watchdog: *** GOODBYE!
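4. Summary
Watchdog is a thread named "watchdog" that runs in the system_server process:
- When a monitored object is blocked for more than 60s, an SWT is triggered, which kills system_server and restarts the upper layers;
- mHandlerCheckers is the list of all HandlerChecker objects, including the Handlers of the fg, main, ui, i/o, display, and other threads;
- mMonitorChecker records all Monitors that Watchdog is currently monitoring; all of these Monitors run on the fg thread;
- There are two ways to be added to WatchDog monitoring:
  - addThread(): monitors a Handler thread to detect whether message handling is blocked;
  - addMonitor(): monitors a service that implements the Watchdog.Monitor interface. A timeout of this kind can mean that the Handler of the "android.fg" thread is blocked, or that a monitor cannot acquire its lock.
In the following cases system_server is not killed even if Watchdog is triggered:
- Monkey: an IActivityController is set and "--kill-process-after-error" is not, so the SystemNotResponding event can be intercepted;
- debugger: when a debugger is connected, there is no restart;
- hang: after the am hang command has been run, there is no restart.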