A First Look at WatchDog

Latest recommended article published on 2025-02-26 10:44:07

jjx_fight

Tags: android

Article link: https://blog.youkuaiyun.com/jjx_fight/article/details/142652304

Contents

1. Overview
2. WatchDog Initialization
3. WatchDog Monitoring Mechanism
4. Summary

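1. Overview

WatchDog (literally a "watchdog") is an important mechanism in Linux systems. Its purpose is to monitor how the system is running and, when an anomaly such as a deadlock or an infinite loop occurs, to take corrective action promptly (for example a reboot) so that the system returns to normal.

In Android, a hardware WatchDog periodically checks that critical hardware is working, and a software WatchDog in the framework layer periodically checks that critical system services are running normally.

When an application has been unresponsive for longer than a certain time, Android does not leave it unusable indefinitely: it shows an ANR (Application Not Responding) dialog, and the user can choose to force-close the application, which kills its process.

The ANR mechanism applies to applications. For the system itself, a long period of "no response" causes WatchDog to trigger its "suicide" mechanism. Because of this mechanism, it is fairly common to see the system_server process killed by WatchDog, followed by a restart of the phone.

2. WatchDog Initialization

Android's WatchDog is a singleton thread that is initialized and started when system_server boots.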

2.1 startBootstrapServices

frameworks/base/services/java/com/android/server/SystemServer.java

    /**
     * Starts the small tangle of critical services that are needed to get the system off the
     * ground.  These services have complex mutual dependencies which is why we initialize them all
     * in one place here.  Unless your service is also entwined in these dependencies, it should be
     * initialized in one of the other functions.
     */
    private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
        t.traceBegin("startBootstrapServices");

        // Start the watchdog as early as possible so we can crash the system server
        // if we deadlock during early boot
        t.traceBegin("StartWatchdog");
        // [2.2] Create the WatchDog thread and start it
        final Watchdog watchdog = Watchdog.getInstance();
        watchdog.start();
        t.traceEnd();

        ...

        // Complete the watchdog setup with an ActivityManager instance and listen for reboots
        // Do this only after the ActivityManagerService is properly started as a system process
        t.traceBegin("InitWatchdog");
        // [2.3] Register for the REBOOT broadcast
        watchdog.init(mSystemContext, mActivityManagerService);
        t.traceEnd();

        ...
    }

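As Android has evolved, WatchDog has been started earlier and earlier in the boot sequence so that deadlocks and similar problems are caught as soon as possible.

2.2 getInstance

WatchDog is designed as a singleton and extends Thread; the thread it creates is named "watchdog".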

frameworks/base/services/core/java/com/android/server/Watchdog.java

/** This class calls its monitor every minute. Killing this process if they don't return **/
public class Watchdog extends Thread {
    static final String TAG = "Watchdog";

    ...
    // Note 1: Do not lower this value below thirty seconds without tightening the invoke-with
    //         timeout in com.android.internal.os.ZygoteConnection, or wrapped applications
    //         can trigger the watchdog.
    // Note 2: The debug value is already below the wait time in ZygoteConnection. Wrapped
    //         applications may not work with a debug build. CTS will fail.
    // The default timeout is 60s (10s when DB is set); per Note 1 it should not go below 30s
    private static final long DEFAULT_TIMEOUT = DB ? 10 * 1000 : 60 * 1000;
    private static final long CHECK_INTERVAL = DEFAULT_TIMEOUT / 2;
    ...
    private static Watchdog sWatchdog;
    ...
    /* This handler will be used to post message back onto the main thread */
    // [2.2.1] List of all HandlerChecker objects
    private final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();
    private final HandlerChecker mMonitorChecker;
    ...

    public static Watchdog getInstance() {
        if (sWatchdog == null) {
            sWatchdog = new Watchdog();
        }

        return sWatchdog;
    }

    private Watchdog() {
        super("watchdog");
        // Initialize handler checkers for each common thread we want to check.  Note
        // that we are not currently checking the background thread, since it can
        // potentially hold longer running operations with no guarantees about the timeliness
        // of operations there.

        // The shared foreground thread is the main checker.  It is where we
        // will also dispatch monitor checks and do other work.
        // Monitors android.fg
        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
        // Add checker for main thread.  We only do a quick check since there
        // can be UI running on the thread.
        // Monitors main
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));
        // Add checker for shared UI thread.
        // Monitors android.ui
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));
        // And also check IO thread.
        // Monitors android.io
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));
        // And the display thread.
        // Monitors android.display
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));
        // And the animation thread.
        // Monitors android.anim
        mHandlerCheckers.add(new HandlerChecker(AnimationThread.getHandler(),
                "animation thread", DEFAULT_TIMEOUT));
        // And the surface animation thread.
        // Monitors android.anim.lf
        mHandlerCheckers.add(new HandlerChecker(SurfaceAnimationThread.getHandler(),
                "surface animation thread", DEFAULT_TIMEOUT));

        // Initialize monitor for Binder threads.
        // [2.2.2] Monitor the Binder threads
        addMonitor(new BinderThreadMonitor());

        ...
    }

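During initialization WatchDog builds a number of HandlerCheckers, which fall roughly into two categories:

Monitor Checker: checks whether a Monitor object's lock has been held for too long. Core system services such as AMS, PKMS, and WMS are Monitor objects;
Looper Checker: checks whether a thread's message queue has been busy for too long. The message queues that Watchdog registers by default, such as the global Ui, Io, and Display queues, are all checked;

The two kinds of HandlerChecker have different emphases. The Monitor Checker warns against holding a core system service's object lock for too long, since that blocks many methods from running; the Looper Checker warns against monopolizing a message queue for too long, since other messages then go unprocessed. Either condition can hang the system (System Not Responding).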
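
The Looper Checker idea can be tried outside Android. The following is a minimal sketch, not the AOSP implementation: a single-thread `ExecutorService` stands in for a Handler thread (an assumption made so the example runs on a plain JVM), and the names `LooperCheckSketch`, `isResponsive`, and `demo` are hypothetical. A probe task is queued, and the thread counts as stuck if the probe has not run within the timeout. Unlike `postAtFrontOfQueue`, the executor appends the probe at the tail of the queue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a Looper Checker; not the AOSP implementation.
class LooperCheckSketch {
    /** Returns true if the worker thread ran a probe task within timeoutMs. */
    static boolean isResponsive(ExecutorService worker, long timeoutMs) {
        CountDownLatch done = new CountDownLatch(1);
        worker.execute(done::countDown);   // the probe "message"
        try {
            return done.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Returns { responsive while idle, responsive while one task hogs the thread }. */
    static boolean[] demo() {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        boolean idle = isResponsive(worker, 2000);

        CountDownLatch release = new CountDownLatch(1);
        worker.execute(() -> {             // a task that occupies the thread
            try {
                release.await();
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
        });
        boolean busy = isResponsive(worker, 200);

        release.countDown();               // let the hog finish so the worker can exit
        worker.shutdown();
        return new boolean[] { idle, busy };
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println("idle: " + r[0] + ", busy: " + r[1]);
    }
}
```

The probe succeeds on the idle worker and times out while the long-running task occupies it, which is the condition a Looper Checker is meant to surface.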

2.2.1 HandlerChecker

    /**
     * Used for checking status of handle threads and scheduling monitor callbacks.
     */
    public final class HandlerChecker implements Runnable {
        // The Handler being monitored
        private final Handler mHandler;
        // Descriptive name of the monitored thread
        private final String mName;
        // Maximum wait time
        private final long mWaitMax;
        // Monitors that are run on each check
        private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
        // Monitors queued to be merged into mMonitors
        private final ArrayList<Monitor> mMonitorQueue = new ArrayList<Monitor>();
        // true by default; set to false when a check starts
        private boolean mCompleted;
        // The monitor currently being run
        private Monitor mCurrentMonitor;
        // Time at which the check started
        private long mStartTime;
        // Pause count
        private int mPauseCount;

        HandlerChecker(Handler handler, String name, long waitMaxMillis) {
            mHandler = handler;
            mName = name;
            mWaitMax = waitMaxMillis;
            mCompleted = true;
        }

        void addMonitorLocked(Monitor monitor) {
            // We don't want to update mMonitors when the Handler is in the middle of checking
            // all monitors. We will update mMonitors on the next schedule if it is safe
            mMonitorQueue.add(monitor);
        }
        ...
    }

HandlerChecker implements Runnable.

2.2.2 addMonitor

In addition to the fixed HandlerCheckers that WatchDog adds itself, Watchdog provides two methods:

addMonitor
addThread

These let external code add Monitor Checkers and Looper Checkers.

    /** Monitor for checking the availability of binder threads. The monitor will block until
     * there is a binder thread available to process in coming IPCs to make sure other processes
     * can still communicate with the service.
     */
    private static final class BinderThreadMonitor implements Watchdog.Monitor {
        @Override
        public void monitor() {
            Binder.blockUntilThreadAvailable();
        }
    }

    public interface Monitor {
        void monitor();
    }

    public void addMonitor(Monitor monitor) {
        synchronized (this) {
            mMonitorChecker.addMonitorLocked(monitor);
        }
    }

    public void addThread(Handler thread) {
        addThread(thread, DEFAULT_TIMEOUT);
    }

    public void addThread(Handler thread, long timeoutMillis) {
        synchronized (this) {
            final String name = thread.getLooper().getThread().getName();
            mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
        }
    }

To monitor the Binder threads, the BinderThreadMonitor is passed to addMonitor, which appends it to the mMonitorQueue member of the HandlerChecker (mMonitorChecker).
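
The comment in addMonitorLocked says new monitors are queued and only merged "on the next schedule if it is safe". A minimal sketch of that two-list pattern follows; the class `DeferredMonitorList` and its method names are hypothetical, and `Runnable` stands in for `Watchdog.Monitor`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "queue now, merge when idle" pattern.
class DeferredMonitorList {
    private final List<Runnable> active = new ArrayList<>();   // plays the role of mMonitors
    private final List<Runnable> pending = new ArrayList<>();  // plays the role of mMonitorQueue
    private boolean completed = true;                          // plays the role of mCompleted

    synchronized void add(Runnable monitor) {
        pending.add(monitor);          // never touch 'active' directly
    }

    /** Merges pending monitors only if no check is in flight, then marks a check as started. */
    synchronized void beginCheck() {
        if (completed) {
            active.addAll(pending);
            pending.clear();
        }
        completed = false;
    }

    synchronized void finishCheck() { completed = true; }
    synchronized int activeCount() { return active.size(); }
    synchronized int pendingCount() { return pending.size(); }

    public static void main(String[] args) {
        DeferredMonitorList list = new DeferredMonitorList();
        list.add(() -> { });
        list.beginCheck();             // merges the first monitor; a check is now in flight
        list.add(() -> { });           // arrives mid-check, so it stays pending
        System.out.println(list.activeCount() + " active, " + list.pendingCount() + " pending");
        list.finishCheck();
        list.beginCheck();             // safe to merge now
        System.out.println(list.activeCount() + " active, " + list.pendingCount() + " pending");
    }
}
```

Keeping the active list stable while a check is running means the loop over monitors never sees the list change underneath it.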

blockUntilThreadAvailable ultimately calls into IPCThreadState and waits until a binder thread is free.

frameworks/native/libs/binder/IPCThreadState.cpp

void IPCThreadState::blockUntilThreadAvailable()
{
    pthread_mutex_lock(&mProcess->mThreadCountLock);
    while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
        // Wait while the number of executing binder threads has reached the process's
        // binder thread limit (31 for SystemServer)
        ALOGW("Waiting for thread to be free. mExecutingThreadsCount=%lu mMaxThreads=%lu\n",
                static_cast<unsigned long>(mProcess->mExecutingThreadsCount),
                static_cast<unsigned long>(mProcess->mMaxThreads));
        pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
    }
    pthread_mutex_unlock(&mProcess->mThreadCountLock);
}

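Here the BinderThreadMonitor is added to the android.fg thread's HandlerChecker (mMonitorChecker), which checks whether the Binder threads are working normally.

If the thread pool is exhausted, a log like the following usually appears: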

IPCThreadState: Waiting for thread to be free. mExecutingThreadsCount=32 mMaxThreads=31
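
For readers more at home in Java, the wait loop in blockUntilThreadAvailable can be approximated with a lock and a condition variable. This is a sketch under assumptions: `ThreadSlotGate` and its methods are hypothetical names, and a timeout is added (the C++ original waits without one) so that the demo terminates.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical Java analogue of the C++ wait loop; names are illustrative.
class ThreadSlotGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition slotFreed = lock.newCondition();
    private final int maxThreads;
    private int executing;

    ThreadSlotGate(int maxThreads) { this.maxThreads = maxThreads; }

    void onThreadStart() {
        lock.lock();
        try { executing++; } finally { lock.unlock(); }
    }

    void onThreadFinish() {
        lock.lock();
        try { executing--; slotFreed.signalAll(); } finally { lock.unlock(); }
    }

    /** Blocks while executing >= maxThreads; returns false if timeoutMs passes first. */
    boolean awaitFreeSlot(long timeoutMs) {
        long remaining = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        lock.lock();
        try {
            while (executing >= maxThreads) {
                if (remaining <= 0) return false;
                remaining = slotFreed.awaitNanos(remaining);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        ThreadSlotGate gate = new ThreadSlotGate(2);
        gate.onThreadStart();
        gate.onThreadStart();
        System.out.println("slot free while saturated: " + gate.awaitFreeSlot(100));
        gate.onThreadFinish();
        System.out.println("slot free after one finishes: " + gate.awaitFreeSlot(100));
    }
}
```

When every slot stays busy, the caller never gets past the loop, which is how a saturated binder pool turns into a blocked monitor on android.fg.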

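Monitoring Handler threads

Watchdog monitors the following system_server threads. DEFAULT_TIMEOUT is 60s by default and is set to 10s when debugging, which makes potential ANR problems easier to find.

Thread name	Handler	Timeout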
main	new Handler(Looper.getMainLooper())	60s
android.fg	FgThread.getHandler	60s
android.ui	UiThread.getHandler	60s
android.io	IoThread.getHandler	60s
android.display	DisplayThread.getHandler	60s
android.anim	AnimationThread.getHandler	60s
android.anim.lf	SurfaceAnimationThread.getHandler	60s
BlobStore	HandlerThread	60s
PackageManager(PermissionManagerService)	HandlerThread	60s
PowerManagerService	HandlerThread	60s
PackageManager	HandlerThread	10min
RollbackManagerServiceHandler	HandlerThread	10min

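Monitoring synchronization locks

Every system service that Watchdog can monitor implements the Watchdog.Monitor interface and its monitor() method. The monitors run on the android.fg thread. The main classes in the system that implement this interface are: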

ActivityManagerService
WindowManagerService
InputManagerService
PowerManagerService
BinderThreadMonitor
MediaProjectionManagerService
MediaRouterService
MediaSessionService
StorageManagerService
TvRemoteService

Taking AMS as an example:

frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java

public class ActivityManagerService extends IActivityManager.Stub
        implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {
    ...
    // Note: This method is invoked on the main thread but may need to attach various
    // handlers to other threads.  So take care to be explicit about the looper.
    public ActivityManagerService(Context systemContext, ActivityTaskManagerService atm) {
        ...
        Watchdog.getInstance().addMonitor(this);
        Watchdog.getInstance().addThread(mHandler);
        ...
    }

    ...

    /** In this method we try to acquire our lock to make sure that we have not deadlocked */
    public void monitor() {
        synchronized (this) { }
    }

    ...
}

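The monitor() methods implemented by monitored services are all very simple: they just try to take the lock. If the service is not holding its lock for long periods, the method returns quickly; if the lock cannot be acquired for a long time, the lock has probably been held too long or a deadlock has occurred.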
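
The lock-probe idea can be reproduced on a plain JVM. In the sketch below, `LockProbeSketch` and its methods are hypothetical names; a daemon thread runs an empty `synchronized` block (the same shape as the monitor() method above) and the caller checks whether it returned in time.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical lock probe; illustrates the idea only.
class LockProbeSketch {
    /** Same shape as the monitor() methods above: take the lock and release it. */
    static void monitor(Object lock) {
        synchronized (lock) { }
    }

    /** Runs monitor(lock) on a probe thread; true if it returned within timeoutMs. */
    static boolean lockAcquirable(Object lock, long timeoutMs) {
        CountDownLatch done = new CountDownLatch(1);
        Thread probe = new Thread(() -> {
            monitor(lock);
            done.countDown();
        }, "probe");
        probe.setDaemon(true);
        probe.start();
        try {
            return done.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Returns { acquirable when free, acquirable while another thread holds it }. */
    static boolean[] demo() {
        Object lock = new Object();
        boolean free = lockAcquirable(lock, 2000);
        boolean held;
        synchronized (lock) {          // simulate a service holding its lock for too long
            held = lockAcquirable(lock, 200);
        }
        return new boolean[] { free, held };
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println("free lock acquirable: " + r[0]);
        System.out.println("held lock acquirable: " + r[1]);
    }
}
```

The probe cannot tell a long hold from a true deadlock; both simply look like a monitor() call that has not returned.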

2.3 init

    private ActivityManagerService mActivity;
    ...
    /**
     * Registers a {@link BroadcastReceiver} to listen to reboot broadcasts and trigger reboot.
     * Should be called during boot after the ActivityManagerService is up and registered
     * as a system service so it can handle registration of a {@link BroadcastReceiver}.
     */
    public void init(Context context, ActivityManagerService activity) {
        mActivity = activity;
        context.registerReceiver(new RebootRequestReceiver(),
                new IntentFilter(Intent.ACTION_REBOOT),
                android.Manifest.permission.REBOOT, null);
       ...
    }

    final class RebootRequestReceiver extends BroadcastReceiver {
        @Override
        public void onReceive(Context c, Intent intent) {
            if (intent.getIntExtra("nowait", 0) != 0) {
                rebootSystem("Received ACTION_REBOOT broadcast");
                return;
            }
            Slog.w(TAG, "Unsupported ACTION_REBOOT broadcast: " + intent);
        }
    }

    /**
     * Perform a full reboot of the system.
     */
    void rebootSystem(String reason) {
        Slog.i(TAG, "Rebooting system because: " + reason);
        IPowerManager pms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE);
        try {
            pms.reboot(false, reason, false);
        } catch (RemoteException ex) {
        }
    }

The reboot is ultimately carried out by PowerManagerService.

3. WatchDog Monitoring Mechanism

WatchDog is a thread. Calling its start method enters its run method, which begins the monitoring loop.

3.1 Triggering

public class Watchdog extends Thread {
    ...
    // These are temporally ordered: larger values as lateness increases
    private static final int COMPLETED = 0;
    private static final int WAITING = 1;
    private static final int WAITED_HALF = 2;
    private static final int OVERDUE = 3;

    ...

    @Override
    public void run() {
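        // Set once a checker has waited half of its timeout (30s by default)
        boolean waitedHalf = false;
        while (true) {
            final List<HandlerChecker> blockedCheckers;
            final String subject;
            // Whether restarting (killing system_server) is allowed
            final boolean allowRestart;
            // Whether a debugger is, or recently was, connected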
            int debuggerWasConnected = 0;
            synchronized (this) {
                // 30s
                long timeout = CHECK_INTERVAL;
                // Make sure we (re)spin the checkers that have become idle within
                // this wait-and-check interval
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    // [3.1.1] Schedule the check of every HandlerChecker
                    hc.scheduleCheckLocked();
                }

                if (debuggerWasConnected > 0) {
                    debuggerWasConnected--;
                }

                // NOTE: We use uptimeMillis() here because we do not want to increment the time we
                // wait while asleep. If the device is asleep then the thing that we are waiting
                // to timeout on is asleep as well and won't have a chance to run, causing a false
                // positive on when to kill things.
                // Start timing; sleep time is not counted
                long start = SystemClock.uptimeMillis();
                // Keep waiting until the full 30s interval has elapsed
                while (timeout > 0) {
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    try {
                        // An InterruptedException is caught and logged, then the loop keeps waiting
                        wait(timeout);
                        // Note: mHandlerCheckers and mMonitorChecker may have changed after waiting
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                }

                boolean fdLimitTriggered = false;
                if (mOpenFdMonitor != null) {
                    fdLimitTriggered = mOpenFdMonitor.monitor();
                }

                if (!fdLimitTriggered) {
                    // [3.1.2] Evaluate the state of the HandlerCheckers
                    final int waitState = evaluateCheckerCompletionLocked();
                    // COMPLETED: every check finished normally; keep checking
                    if (waitState == COMPLETED) {
                        // The monitors have returned; reset
                        waitedHalf = false;
                        continue;
                    // Less than 30s elapsed: keep checking
                    } else if (waitState == WAITING) {
                        // still waiting but within their configured intervals; back off and recheck
                        continue;
                    // Between 30s and 60s: dump some information and keep checking
                    } else if (waitState == WAITED_HALF) {
                        if (!waitedHalf) {
                            Slog.i(TAG, "WAITED_HALF");
                            // We've waited half the deadlock-detection interval.  Pull a stack
                            // trace and wait another half.
                            ArrayList<Integer> pids = new ArrayList<>(mInterestingJavaPids);
                            // Dump the stacks of the interesting Java and native processes (system_server, phone, media, etc.)
                            ActivityManagerService.dumpStackTraces(pids, null, null,
                                    getInterestingNativePids(), null);
                            waitedHalf = true;
                        }
                        continue;
                    }

                    // [3.2] Timed out
                    // something is overdue!
                    blockedCheckers = getBlockedCheckersLocked();
                    subject = describeCheckersLocked(blockedCheckers);
                } else {
                    blockedCheckers = Collections.emptyList();
                    subject = "Open FD high water mark reached";
                }
                allowRestart = mAllowRestart;
            }

            ...
        }
        ...
    }

    ...
}
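
The inner `while (timeout > 0)` loop above exists because `wait(timeout)` can return early (a notify, a spurious wakeup, an interruption), and the loop then waits again for whatever time remains. A minimal sketch of that pattern on a plain JVM follows. `FullIntervalWait` and its methods are hypothetical names, and `System.nanoTime()` is used as a stand-in for `SystemClock.uptimeMillis()`, which is Android-only.

```java
// Hypothetical sketch of "wait until the whole interval has elapsed".
class FullIntervalWait {
    private static long elapsedMs(long startNanos) {
        return (System.nanoTime() - startNanos) / 1_000_000;
    }

    /** Waits on 'lock' until at least intervalMs has passed; returns the elapsed milliseconds. */
    static long waitFullInterval(Object lock, long intervalMs) {
        long start = System.nanoTime();
        long remaining = intervalMs;
        synchronized (lock) {
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    // Mirrors the original loop: note the interruption and keep waiting.
                }
                remaining = intervalMs - elapsedMs(start);
            }
        }
        return elapsedMs(start);
    }

    /** Wakes the waiter early once and returns how long it actually waited. */
    static long demo() {
        Object lock = new Object();
        Thread notifier = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
            synchronized (lock) { lock.notifyAll(); }
        });
        notifier.start();
        return waitFullInterval(lock, 300);
    }

    public static void main(String[] args) {
        System.out.println("waited at least 300 ms: " + (demo() >= 300));
    }
}
```

Even though the notifier wakes the waiter after about 50 ms, the method does not return until the full 300 ms interval is over.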

3.1.1 scheduleCheckLocked

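This posts the HandlerChecker to the front of the monitored thread's message queue so that HandlerChecker.run() executes there. run() calls the monitor() method of each monitor and sets mCompleted = true when it finishes.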

    public final class HandlerChecker implements Runnable {
        ...
        public void scheduleCheckLocked() {
            // If no check is in flight, move all monitors from mMonitorQueue into mMonitors
            if (mCompleted) {
                // Safe to update monitors in queue, Handler is not in the middle of work
                mMonitors.addAll(mMonitorQueue);
                mMonitorQueue.clear();
            }
            // If there are no monitors (true for every thread except android.fg) and the looper
            // is polling, or if this checker is paused, set mCompleted to true and return
            if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
                    || (mPauseCount > 0)) {
                // Don't schedule until after resume OR
                // If the target looper has recently been polling, then
                // there is no reason to enqueue our checker on it since that
                // is as good as it not being deadlocked.  This avoid having
                // to do a context switch to check the thread. Note that we
                // only do this if we have no monitors since those would need to
                // be executed at this point.
                mCompleted = true;
                return;
            }
            // A check is already in flight; no need to schedule another
            if (!mCompleted) {
                // we already have a check in flight, so no need
                return;
            }

            mCompleted = false;
            mCurrentMonitor = null;
            // Time at which this check starts
            mStartTime = SystemClock.uptimeMillis();
            // Insert at the front of the message queue so that run() executes next
            mHandler.postAtFrontOfQueue(this);
        }

        ...

        @Override
        public void run() {
            // Once we get here, we ensure that mMonitors does not change even if we call
            // #addMonitorLocked because we first add the new monitors to mMonitorQueue and
            // move them to mMonitors on the next schedule when mCompleted is true, at which
            // point we have completed execution of this method.
            final int size = mMonitors.size();
            for (int i = 0 ; i < size ; i++) {
                synchronized (Watchdog.this) {
                    mCurrentMonitor = mMonitors.get(i);
                }
                // Call back into the monitor's monitor() method
                mCurrentMonitor.monitor();
            }

            synchronized (Watchdog.this) {
                mCompleted = true;
                mCurrentMonitor = null;
            }
        }

        ...
    }

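Looper Checker: a HandlerChecker for any thread other than android.fg holds no monitor objects, so it first tests whether the message queue is idle (polling). If the queue is busy, the checker is posted with mHandler.postAtFrontOfQueue(this). The post itself does not block, but run() cannot execute until the message currently being handled returns, so mCompleted stays false for as long as that message is stuck.
Monitor Checker: the HandlerChecker of the android.fg thread. Its handler may be blocked, and the monitor() calls it makes may block as well.

Possible problems:

If other messages keep being inserted with postAtFrontOfQueue(), WatchDog's checker may never get a chance to run;
Or each monitor takes some time, and the accumulated time exceeds 60s, causing a WatchDog timeout.

These are atypical WatchDog timeouts.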
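
The second case is simple arithmetic: the monitors run one after another inside a single run() call, so their durations add up against one deadline. The sketch below illustrates this with made-up durations; `CumulativeMonitorTime` and `monitorAtTimeout` are hypothetical names, and the numbers are examples rather than measurements.

```java
// Hypothetical arithmetic sketch: sequential monitors share one deadline.
class CumulativeMonitorTime {
    /**
     * Returns the index of the monitor that is running when the deadline passes,
     * or -1 if all monitors finish within timeoutMillis.
     */
    static int monitorAtTimeout(long[] monitorMillis, long timeoutMillis) {
        long elapsed = 0;
        for (int i = 0; i < monitorMillis.length; i++) {
            elapsed += monitorMillis[i];
            if (elapsed > timeoutMillis) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Example durations only: no single monitor comes close to 60s.
        long[] durations = { 15_000, 20_000, 18_000, 12_000 };
        int index = monitorAtTimeout(durations, 60_000);
        System.out.println("monitor running at timeout: " + index);
    }
}
```

With these example values the running total is 15s, 35s, 53s, then 65s, so the deadline passes during the fourth monitor even though it took only 12s itself.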

3.1.2 evaluateCheckerCompletionLocked

    private int evaluateCheckerCompletionLocked() {
        int state = COMPLETED;
        for (int i=0; i<mHandlerCheckers.size(); i++) {
            HandlerChecker hc = mHandlerCheckers.get(i);
            state = Math.max(state, hc.getCompletionStateLocked());
        }
        return state;
    }

This iterates over mHandlerCheckers and returns the largest (most delayed) state value among them.

    public final class HandlerChecker implements Runnable {
        ...
        public int getCompletionStateLocked() {
            if (mCompleted) {
                return COMPLETED;
            } else {
                long latency = SystemClock.uptimeMillis() - mStartTime;
                // mWaitMax defaults to 60s
                if (latency < mWaitMax/2) {
                    return WAITING;
                } else if (latency < mWaitMax) {
                    return WAITED_HALF;
                }
            }
            return OVERDUE;
        }
        ...
    }

The result is derived from the mCompleted flag and from how long the check task has been running.
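
The two methods above reduce to a pure function of three inputs plus a maximum over all checkers. The sketch below restates that logic so the boundaries are easy to test; `CompletionStateSketch`, `stateOf`, and `worst` are hypothetical names, while the four constants and the comparisons follow the excerpt.

```java
// Hypothetical restatement of the state logic shown above as pure functions.
class CompletionStateSketch {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** State of one checker given its completion flag, elapsed time, and timeout. */
    static int stateOf(boolean completed, long latencyMs, long waitMaxMs) {
        if (completed) return COMPLETED;
        if (latencyMs < waitMaxMs / 2) return WAITING;
        if (latencyMs < waitMaxMs) return WAITED_HALF;
        return OVERDUE;
    }

    /** The overall state is the most delayed state among all checkers. */
    static int worst(int... states) {
        int state = COMPLETED;
        for (int s : states) {
            state = Math.max(state, s);
        }
        return state;
    }

    public static void main(String[] args) {
        int fg = stateOf(false, 45_000, 60_000);   // WAITED_HALF
        int ui = stateOf(true, 0, 60_000);         // COMPLETED
        int io = stateOf(false, 5_000, 60_000);    // WAITING
        System.out.println("overall state: " + worst(fg, ui, io));
    }
}
```

Because the constants are ordered by lateness, a single slow checker is enough to raise the overall state.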

3.2 Execution

public class Watchdog extends Thread {
    ...
    @Override
    public void run() {
        ...
        while (true) {
            ...
            synchronized (this) {
                ...
                        // something is overdue!
                        // [3.2.1] Get the blocked Checkers
                        blockedCheckers = getBlockedCheckersLocked();
                        // [3.2.2] Build the description of the blocked Checkers
                        subject = describeCheckersLocked(blockedCheckers);
                        ...
                ...
                }
                allowRestart = mAllowRestart;
            }

            // If we got here, that means that the system is most likely hung.
            // First collect stack traces from all threads of the system process.
            // Then kill this process so that the system will restart.
            // Event log entry, tagged "watchdog"
            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
            // The interesting Java processes
            ArrayList<Integer> pids = new ArrayList<>(mInterestingJavaPids);

            // Time of the system hang (used like an ANR timestamp)
            long anrTime = SystemClock.uptimeMillis();
            StringBuilder report = new StringBuilder();
            // Read /proc/pressure/memory
            report.append(MemoryPressureUtil.currentPsiState());
            // CPU Tracker
            ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(false);
            StringWriter tracesFileException = new StringWriter();
            // Dump the stacks of the main Java and native processes
            final File stack = ActivityManagerService.dumpStackTraces(
                    pids, processCpuTracker, new SparseArray<>(), getInterestingNativePids(),
                    tracesFileException);

            // Give some extra time to make sure the stack traces get written.
            // The system's been hanging for a minute, another second or two won't hurt much.
            // Make sure the dump has been written out
            SystemClock.sleep(5000);

            processCpuTracker.update();
            report.append(processCpuTracker.printCurrentState(anrTime));
            report.append(tracesFileException.getBuffer());

            // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
            doSysRq('w');
            doSysRq('l');

            // Try to add the error to the dropbox, but assuming that the ActivityManager
            // itself may be deadlocked.  (which has happened, causing this statement to
            // deadlock and the watchdog as a whole to be ineffective)
            Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                    public void run() {
                        // If a watched thread hangs before init() is called, we don't have a
                        // valid mActivity. So we can't log the error to dropbox.
                        if (mActivity != null) {
                            mActivity.addErrorToDropBox(
                                    "watchdog", null, "system_server", null, null, null,
                                    subject, report.toString(), stack, null);
                        }
                        FrameworkStatsLog.write(FrameworkStatsLog.SYSTEM_SERVER_WATCHDOG_OCCURRED,
                                subject);
                    }
                };
            dropboxThread.start();
            try {
                dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
            } catch (InterruptedException ignored) {}

            IActivityController controller;
            synchronized (this) {
                controller = mController;
            }
            // Special handling when an IActivityController is set
            if (controller != null) {
                Slog.i(TAG, "Reporting stuck state to activity controller");
                try {
                    Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                    // 1 = keep waiting, -1 = kill system
                    // [3.2.3] The controller can ask to keep waiting instead of killing the system
                    int res = controller.systemNotResponding(subject);
                    if (res >= 0) {
                        Slog.i(TAG, "Activity controller requested to coninue to wait");
                        waitedHalf = false;
                        continue;
                    }
                } catch (RemoteException e) {
                }
            }

            // Only kill the process if the debugger is not attached.
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            if (debuggerWasConnected >= 2) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (debuggerWasConnected > 0) {
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                // [3.2.4] Print the stack of each blocked thread
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
                Slog.w(TAG, "*** GOODBYE!");
                // Kill system_server
                Process.killProcess(Process.myPid());
                System.exit(10);
            }

            waitedHalf = false;
        }
    }

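Once the information has been collected, the system_server process is killed. allowRestart defaults to true here; running am hang sets it so that restart is not allowed, in which case system_server is not killed.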
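
The tail of run() is a chain of conditions that decides whether system_server is actually killed. The sketch below restates that chain as a pure function. `KillDecisionSketch`, `Outcome`, and `decide` are hypothetical names, and `controllerResult` is modeled as null when no IActivityController is set or the call failed, which is an assumption made for the sketch.

```java
// Hypothetical restatement of the decision chain at the end of run().
class KillDecisionSketch {
    enum Outcome { KEEP_WAITING, SKIP_DEBUGGER, SKIP_RESTART_DISABLED, KILL }

    /**
     * @param controllerResult     value returned by systemNotResponding, or null if no
     *                             controller is set or the remote call failed
     * @param debuggerWasConnected 2 if a debugger is connected now, 1 if it was recently, else 0
     * @param allowRestart         false when restart has been disabled
     */
    static Outcome decide(Integer controllerResult, int debuggerWasConnected,
            boolean allowRestart) {
        if (controllerResult != null && controllerResult >= 0) {
            return Outcome.KEEP_WAITING;           // controller asked to keep waiting
        }
        if (debuggerWasConnected > 0) {
            return Outcome.SKIP_DEBUGGER;          // never kill under a debugger
        }
        if (!allowRestart) {
            return Outcome.SKIP_RESTART_DISABLED;
        }
        return Outcome.KILL;
    }

    public static void main(String[] args) {
        System.out.println(decide(null, 0, true));
        System.out.println(decide(1, 0, true));
        System.out.println(decide(-1, 2, true));
        System.out.println(decide(null, 0, false));
    }
}
```

Only the last branch reaches Process.killProcess in the original code; every other branch logs a message and goes back to the top of the loop.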

3.2.1 getBlockedCheckersLocked

    private ArrayList<HandlerChecker> getBlockedCheckersLocked() {
        ArrayList<HandlerChecker> checkers = new ArrayList<HandlerChecker>();
        for (int i=0; i<mHandlerCheckers.size(); i++) {
            HandlerChecker hc = mHandlerCheckers.get(i);
            if (hc.isOverdueLocked()) {
                checkers.add(hc);
            }
        }
        return checkers;
    }

    public final class HandlerChecker implements Runnable {
        ...
        boolean isOverdueLocked() {
            return (!mCompleted) && (SystemClock.uptimeMillis() > mStartTime + mWaitMax);
        }
        ...
    }

This iterates over all HandlerCheckers and collects every Checker that has not completed (mCompleted is false) and has exceeded its timeout.

3.2.2 describeCheckersLocked

    private String describeCheckersLocked(List<HandlerChecker> checkers) {
        StringBuilder builder = new StringBuilder(128);
        for (int i=0; i<checkers.size(); i++) {
            if (builder.length() > 0) {
                builder.append(", ");
            }
            builder.append(checkers.get(i).describeBlockedStateLocked());
        }
        return builder.toString();
    }

    public final class HandlerChecker implements Runnable {
        ...
        String describeBlockedStateLocked() {
            if (mCurrentMonitor == null) {
                return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
            } else {
                return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
                        + " on " + mName + " (" + getThread().getName() + ")";
            }
        }
        ...
    }

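For each timed-out HandlerChecker returned by getBlockedCheckersLocked, the corresponding message is produced:

android.fg thread, i.e. the Monitor Checker: outputs "Blocked in monitor ...", meaning a monitor running on android.fg has been unable to acquire its lock. If android.fg is stuck on an ordinary message before any monitor has started (mCurrentMonitor is null), it is reported as "Blocked in handler on ..." like any other thread;
Other threads, i.e. Looper Checkers: output "Blocked in handler on ...", meaning the thread timed out while processing its current message.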
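
To make the resulting log subject concrete, the sketch below rebuilds the two message formats from the excerpt and joins them the way describeCheckersLocked does. `BlockedSubjectSketch` and its methods are hypothetical names, and the checker names and monitor class in `main` are example inputs.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical reconstruction of the subject string format shown above.
class BlockedSubjectSketch {
    /** currentMonitorClass is null when the checker is stuck in the handler itself. */
    static String describe(String name, String threadName, String currentMonitorClass) {
        if (currentMonitorClass == null) {
            return "Blocked in handler on " + name + " (" + threadName + ")";
        }
        return "Blocked in monitor " + currentMonitorClass
                + " on " + name + " (" + threadName + ")";
    }

    /** Joins the per-checker descriptions with ", ". */
    static String subject(List<String> descriptions) {
        return String.join(", ", descriptions);
    }

    public static void main(String[] args) {
        String fg = describe("foreground thread", "android.fg",
                "com.android.server.am.ActivityManagerService");
        String ui = describe("ui thread", "android.ui", null);
        System.out.println(subject(Arrays.asList(fg, ui)));
    }
}
```

Searching a log for "Blocked in monitor" versus "Blocked in handler" is therefore a quick first split between lock contention and a stuck message.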

3.2.3 systemNotResponding

If an IActivityController is set, it can choose to keep waiting instead of killing the system. Taking the implementation in Monkey as an example:

development/cmds/monkey/src/com/android/commands/monkey/Monkey.java

/**
 * Application that injects random key events and other actions into the system.
 */
public class Monkey {
    ...
    /** Kill the process after a timeout or crash. */
    private boolean mKillProcessAfterError;
    ...
    /**
     * Monitor operations happening in the system.
     */
    private class ActivityController extends IActivityController.Stub {
        ...
        public int systemNotResponding(String message) {
            StrictMode.ThreadPolicy savedPolicy = StrictMode.allowThreadDiskWrites();
            Logger.err.println("// WATCHDOG: " + message);
            ...
            return (mKillProcessAfterError) ? -1 : 1;
        }
    }

    ...

    /**
     * Run the command!
     *
     * @param args The command-line arguments
     * @return Returns a posix-style result code. 0 for no error.
     */
    private int run(String[] args) {
        ...
        if (!processOptions()) {
            return -1;
        }

        ...

        if (!getSystemInterfaces()) {
            return -3;
        }
        ...
    }

    ...

    /**
     * Process the command-line options
     *
     * @return Returns true if options were parsed with no apparent errors.
     */
    private boolean processOptions() {
        ...
        try {
            String opt;
            ...
            while ((opt = nextOption()) != null) {
                ...
                } else if (opt.equals("--kill-process-after-error")) {
                    mKillProcessAfterError = true;
                    ...
                }
                ...
            }
            ...
        }
        ...
    }

    ...

    /**
     * Attach to the required system interfaces.
     *
     * @return Returns true if all system interfaces were available.
     */
    private boolean getSystemInterfaces() {
        mAm = ActivityManager.getService();
        ...
        try {
            mAm.setActivityController(new ActivityController(), true);
            ...
        } catch (RemoteException e) {
            Logger.err.println("** Failed talking with activity manager!");
            return false;
        }

        return true;
    }

    ...


}

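If "--kill-process-after-error" is set in a Monkey test, systemNotResponding returns -1 and an SWT (software watchdog timeout) may be triggered; otherwise it returns 1 and Watchdog keeps waiting without restarting.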

3.2.4 killProcess

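After WatchDog detects a timeout it kills the system_server process, which causes the Zygote process to kill itself, which in turn makes init restart Zygote. This is what appears as the phone's framework restarting.

This is usually accompanied by the following log: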

Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in ...

...

Watchdog: *** GOODBYE!

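4. Summary

Watchdog is a thread named "watchdog" running in the system_server process:

When a monitored object is blocked for more than 60s, an SWT is triggered; it kills system_server and triggers a restart of the upper layers;
mHandlerCheckers is the list of all HandlerChecker objects, including the Handlers of the fg, main, ui, i/o, display, and other threads;
mMonitorChecker holds all the Monitors that Watchdog is currently monitoring; all of these Monitors run on the fg thread;
There are two ways to be added to WatchDog monitoring:
- addThread(): monitors a Handler thread for blocked message processing;
- addMonitor(): monitors a service that implements the Watchdog.Monitor interface. This kind of timeout may mean that the "android.fg" thread's Handler is blocked, or that a monitor cannot acquire its lock.

In the following cases, system_server is not killed even if Watchdog is triggered: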