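A First Look at WatchDog

1. Overview

A watchdog is an important mechanism on Linux systems. Its job is to monitor how the system is running and, when an abnormal condition such as a deadlock or an infinite loop occurs, to take a recovery action in time (a reboot, for example) so that the system returns to normal.

Android has a hardware watchdog that periodically checks whether critical hardware is working properly, and a software WatchDog in the framework layer that periodically checks whether critical system services are running properly.

When an application stops responding for longer than a certain time, Android shows an ANR (Application Not Responding) dialog so that the application is not left unusable indefinitely. The user can choose to force-close it, which kills the application's process.

The ANR mechanism targets applications. For the system itself, if system_server stays “unresponsive” for too long, WatchDog triggers a “suicide” mechanism. Because of this, it is common to see phone reboots caused by WatchDog killing the system_server process.

2. WatchDog Initialization

Android's WatchDog is a singleton thread. It is initialized and started while system_server is starting up.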

2.1 startBootstrapServices

frameworks/base/services/java/com/android/server/SystemServer.java

    /**
     * Starts the small tangle of critical services that are needed to get the system off the
     * ground.  These services have complex mutual dependencies which is why we initialize them all
     * in one place here.  Unless your service is also entwined in these dependencies, it should be
     * initialized in one of the other functions.
     */
    private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
        t.traceBegin("startBootstrapServices");

        // Start the watchdog as early as possible so we can crash the system server
        // if we deadlock during early boot
        t.traceBegin("StartWatchdog");
        // [2.2] Create the Watchdog thread and start it
        final Watchdog watchdog = Watchdog.getInstance();
        watchdog.start();
        t.traceEnd();

        ...

        // Complete the watchdog setup with an ActivityManager instance and listen for reboots
        // Do this only after the ActivityManagerService is properly started as a system process
        t.traceBegin("InitWatchdog");
        // [2.3] Register the REBOOT broadcast receiver
        watchdog.init(mSystemContext, mActivityManagerService);
        t.traceEnd();

        ...
    }

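As Android has evolved across releases, WatchDog has been started earlier and earlier in the boot sequence so that problems such as deadlocks are caught as soon as possible.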

2.2 getInstance

WatchDog is designed as a singleton and extends Thread; the thread it creates is named “watchdog”.

frameworks/base/services/core/java/com/android/server/Watchdog.java

/** This class calls its monitor every minute. Killing this process if they don't return **/
public class Watchdog extends Thread {
    static final String TAG = "Watchdog";

    ...
    // Note 1: Do not lower this value below thirty seconds without tightening the invoke-with
    //         timeout in com.android.internal.os.ZygoteConnection, or wrapped applications
    //         can trigger the watchdog.
    // Note 2: The debug value is already below the wait time in ZygoteConnection. Wrapped
    //         applications may not work with a debug build. CTS will fail.
    // The default timeout is 60 s (10 s when DB is true); per Note 1, do not lower it below 30 s
    private static final long DEFAULT_TIMEOUT = DB ? 10 * 1000 : 60 * 1000;
    private static final long CHECK_INTERVAL = DEFAULT_TIMEOUT / 2;
    ...
    private static Watchdog sWatchdog;
    ...
    /* This handler will be used to post message back onto the main thread */
    // [2.2.1] List of all HandlerChecker objects
    private final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();
    private final HandlerChecker mMonitorChecker;
    ...

    public static Watchdog getInstance() {
        if (sWatchdog == null) {
            sWatchdog = new Watchdog();
        }

        return sWatchdog;
    }

    private Watchdog() {
        super("watchdog");
        // Initialize handler checkers for each common thread we want to check.  Note
        // that we are not currently checking the background thread, since it can
        // potentially hold longer running operations with no guarantees about the timeliness
        // of operations there.

        // The shared foreground thread is the main checker.  It is where we
        // will also dispatch monitor checks and do other work.
        // Monitors android.fg
        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
        // Add checker for main thread.  We only do a quick check since there
        // can be UI running on the thread.
        // Monitors main
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));
        // Add checker for shared UI thread.
        // Monitors android.ui
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));
        // And also check IO thread.
        // Monitors android.io
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));
        // And the display thread.
        // Monitors android.display
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));
        // And the animation thread.
        // Monitors android.anim
        mHandlerCheckers.add(new HandlerChecker(AnimationThread.getHandler(),
                "animation thread", DEFAULT_TIMEOUT));
        // And the surface animation thread.
        // Monitors android.anim.lf
        mHandlerCheckers.add(new HandlerChecker(SurfaceAnimationThread.getHandler(),
                "surface animation thread", DEFAULT_TIMEOUT));

        // Initialize monitor for Binder threads.
        // [2.2.2] Monitors the Binder threads
        addMonitor(new BinderThreadMonitor());

        ...
    }

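When WatchDog is initialized it builds a number of HandlerCheckers, which fall roughly into two categories:

  • Monitor Checker: checks whether the lock of a Monitor object has been held for too long. Core system services such as AMS, PKMS and WMS are Monitor objects;
  • Looper Checker: checks whether a thread's message queue has been busy for too long. WatchDog's own message queue and the global message queues such as Ui, Io and Display are all checked;

The two kinds of HandlerChecker have different emphases. The Monitor Checker warns us not to hold the object lock of a core system service for long, because that blocks many methods from running. The Looper Checker warns us not to hog a message queue for long, because other messages would then go unprocessed. Either situation can leave the system stuck (System Not Responding).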

2.2.1 HandlerChecker

    /**
     * Used for checking status of handle threads and scheduling monitor callbacks.
     */
    public final class HandlerChecker implements Runnable {
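        // The Handler being monitored
        private final Handler mHandler;
        // Descriptive name of the monitored thread
        private final String mName;
        // Maximum wait time
        private final long mWaitMax;
        // Monitors that are checked on each pass
        private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
        // Monitors waiting to be moved into mMonitors on the next schedule
        private final ArrayList<Monitor> mMonitorQueue = new ArrayList<Monitor>();
        // Initially true; set to false when a check starts
        private boolean mCompleted;
        // The monitor currently being checked
        private Monitor mCurrentMonitor;
        // Time at which the check started
        private long mStartTime;
        // Number of outstanding pauses
        private int mPauseCount;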

        HandlerChecker(Handler handler, String name, long waitMaxMillis) {
            mHandler = handler;
            mName = name;
            mWaitMax = waitMaxMillis;
            mCompleted = true;
        }

        void addMonitorLocked(Monitor monitor) {
            // We don't want to update mMonitors when the Handler is in the middle of checking
            // all monitors. We will update mMonitors on the next schedule if it is safe
            mMonitorQueue.add(monitor);
        }
        ...
    }

HandlerChecker implements Runnable.
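The comment in addMonitorLocked describes a deferred-add pattern: new monitors go into a pending queue and are only promoted to the active list when no check is in flight. A minimal plain-Java sketch of that idea follows. The class and method names (DeferredMonitorList, beginPass, endPass) are hypothetical and not part of Watchdog; Runnable stands in for Watchdog.Monitor.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the mMonitorQueue -> mMonitors hand-off. */
final class DeferredMonitorList {
    private final List<Runnable> active = new ArrayList<>();   // plays the role of mMonitors
    private final List<Runnable> pending = new ArrayList<>();  // plays the role of mMonitorQueue
    private boolean completed = true;                          // plays the role of mCompleted

    /** New monitors never touch the active list directly. */
    synchronized void add(Runnable monitor) {
        pending.add(monitor);
    }

    /** Starts a pass; pending monitors are promoted only if no pass is in flight. */
    synchronized boolean beginPass() {
        if (!completed) {
            return false;            // a pass is still running, leave both lists alone
        }
        active.addAll(pending);
        pending.clear();
        completed = false;
        return true;
    }

    /** Marks the current pass as finished. */
    synchronized void endPass() {
        completed = true;
    }

    synchronized int activeCount() {
        return active.size();
    }

    public static void main(String[] args) {
        DeferredMonitorList list = new DeferredMonitorList();
        list.add(() -> { });
        list.beginPass();
        list.add(() -> { });         // arrives mid-pass, stays pending
        System.out.println("active during pass: " + list.activeCount());
        list.endPass();
        list.beginPass();
        System.out.println("active on next pass: " + list.activeCount());
    }
}
```

The point of the split is that the list being iterated during a pass never changes size, so the pass needs no extra copying.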

2.2.2 addMonitor

Besides the fixed HandlerCheckers that WatchDog adds itself, Watchdog provides two methods:

  • addMonitor
  • addThread

that let external code add Monitor Checkers and Looper Checkers.

    /** Monitor for checking the availability of binder threads. The monitor will block until
     * there is a binder thread available to process in coming IPCs to make sure other processes
     * can still communicate with the service.
     */
    private static final class BinderThreadMonitor implements Watchdog.Monitor {
        @Override
        public void monitor() {
            Binder.blockUntilThreadAvailable();
        }
    }

    public interface Monitor {
        void monitor();
    }

    public void addMonitor(Monitor monitor) {
        synchronized (this) {
            mMonitorChecker.addMonitorLocked(monitor);
        }
    }

    public void addThread(Handler thread) {
        addThread(thread, DEFAULT_TIMEOUT);
    }

    public void addThread(Handler thread, long timeoutMillis) {
        synchronized (this) {
            final String name = thread.getLooper().getThread().getName();
            mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
        }
    }

To monitor the Binder threads, the monitor is added through addMonitor to the mMonitorQueue member of the HandlerChecker.

blockUntilThreadAvailable ultimately calls into IPCThreadState and waits until a binder thread is free.

frameworks/native/libs/binder/IPCThreadState.cpp

void IPCThreadState::blockUntilThreadAvailable()
{
    pthread_mutex_lock(&mProcess->mThreadCountLock);
    while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
        // All binder threads are busy: the executing count has reached the per-process limit (31 for SystemServer), so wait
        ALOGW("Waiting for thread to be free. mExecutingThreadsCount=%lu mMaxThreads=%lu\n",
                static_cast<unsigned long>(mProcess->mExecutingThreadsCount),
                static_cast<unsigned long>(mProcess->mMaxThreads));
        pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
    }
    pthread_mutex_unlock(&mProcess->mThreadCountLock);
}

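Here the Binder thread monitor is added to the HandlerChecker of the android.fg thread (mMonitorChecker), which checks whether the Binder threads are working properly.

If the thread pool is exhausted, a log like the following usually appears: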

IPCThreadState: Waiting for thread to be free. mExecutingThreadsCount=32 mMaxThreads=31
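The wait loop in blockUntilThreadAvailable is a standard condition-variable pattern. As a rough illustration only, the same shape can be written in plain Java with ReentrantLock and Condition. ThreadSlotGate, begin and end are hypothetical names invented for this sketch; the real accounting lives in native code.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for the executing-thread accounting in IPCThreadState. */
final class ThreadSlotGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition slotFreed = lock.newCondition();
    private final int maxThreads;
    private int executing;

    ThreadSlotGate(int maxThreads) {
        this.maxThreads = maxThreads;
    }

    /** A worker starts handling a call. */
    void begin() {
        lock.lock();
        try { executing++; } finally { lock.unlock(); }
    }

    /** A worker finished; wake anyone waiting for a free slot. */
    void end() {
        lock.lock();
        try { executing--; slotFreed.signalAll(); } finally { lock.unlock(); }
    }

    /** Blocks while every slot is busy, like the while/pthread_cond_wait loop above. */
    void blockUntilThreadAvailable() {
        lock.lock();
        try {
            while (executing >= maxThreads) {
                slotFreed.awaitUninterruptibly();
            }
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadSlotGate gate = new ThreadSlotGate(2);
        gate.begin();
        gate.begin();                         // pool is now saturated
        Thread releaser = new Thread(() -> {
            try { Thread.sleep(100); } catch (InterruptedException ignored) { }
            gate.end();                       // frees one slot
        });
        releaser.start();
        gate.blockUntilThreadAvailable();     // returns once a slot is free
        releaser.join();
        System.out.println("slot available");
    }
}
```

If no slot is ever freed, the call never returns, which is exactly what lets the monitor on android.fg exceed its timeout.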

Monitored Handler threads

The system_server threads monitored by Watchdog are listed below. DEFAULT_TIMEOUT is 60 s by default and is set to 10 s for debugging, which makes potential ANR problems easier to find.

Thread name                                Handler                               Timeout
main                                       new Handler(Looper.getMainLooper())   60s
android.fg                                 FgThread.getHandler                   60s
android.ui                                 UiThread.getHandler                   60s
android.io                                 IoThread.getHandler                   60s
android.display                            DisplayThread.getHandler              60s
android.anim                               AnimationThread.getHandler            60s
android.anim.lf                            SurfaceAnimationThread.getHandler     60s
BlobStore                                  HandlerThread                         60s
PackageManager(PermissionManagerService)   HandlerThread                         60s
PowerManagerService                        HandlerThread                         60s
PackageManager                             HandlerThread                         10min
RollbackManagerServiceHandler              HandlerThread                         10min

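Monitored locks

Every system service that Watchdog can monitor implements the Watchdog.Monitor interface and its monitor() method. The monitors run on the android.fg thread. The main classes in the system that implement this interface are: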

  • ActivityManagerService
  • WindowManagerService
  • InputManagerService
  • PowerManagerService
  • BinderThreadMonitor
  • MediaProjectionManagerService
  • MediaRouterService
  • MediaSessionService
  • StorageManagerService
  • TvRemoteService

Taking AMS as an example:

frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java

public class ActivityManagerService extends IActivityManager.Stub
        implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {
    ...
    // Note: This method is invoked on the main thread but may need to attach various
    // handlers to other threads.  So take care to be explicit about the looper.
    public ActivityManagerService(Context systemContext, ActivityTaskManagerService atm) {
        ...
        Watchdog.getInstance().addMonitor(this);
        Watchdog.getInstance().addThread(mHandler);
        ...
    }

    ...

    /** In this method we try to acquire our lock to make sure that we have not deadlocked */
    public void monitor() {
        synchronized (this) { }
    }

    ...
}

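The monitor() methods implemented by the monitored services are all very simple: they just try to acquire the lock. If the service is not holding its lock for long periods, the method returns quickly. If the lock cannot be acquired for a long time, a long lock hold or even a deadlock has probably occurred.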
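The lock-probe idea can be tried outside Android with ordinary threads. The sketch below is an illustration under assumptions: FakeService and probe are hypothetical, the Monitor interface is redeclared locally so the file compiles alone, and a helper thread with a latch replaces the HandlerChecker machinery.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class LockProbeDemo {
    /** Same shape as Watchdog.Monitor, redeclared so the sketch is self-contained. */
    interface Monitor {
        void monitor();
    }

    /** Hypothetical service that guards its state with its own intrinsic lock. */
    static final class FakeService implements Monitor {
        @Override
        public void monitor() {
            synchronized (this) { }   // returns only once the lock can be acquired
        }
    }

    /** Runs monitor() on a helper thread; true if it returned within timeoutMs. */
    static boolean probe(Monitor m, long timeoutMs) {
        CountDownLatch done = new CountDownLatch(1);
        Thread t = new Thread(() -> { m.monitor(); done.countDown(); }, "probe");
        t.setDaemon(true);
        t.start();
        try {
            return done.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        FakeService service = new FakeService();
        System.out.println("lock free, probe returned: " + probe(service, 200));
        synchronized (service) {      // simulate a long lock hold
            System.out.println("lock held, probe returned: " + probe(service, 200));
        }
    }
}
```

The empty synchronized block does no work of its own; how long it takes to get through it is the measurement.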

2.3 init

    private ActivityManagerService mActivity;
    ...
    /**
     * Registers a {@link BroadcastReceiver} to listen to reboot broadcasts and trigger reboot.
     * Should be called during boot after the ActivityManagerService is up and registered
     * as a system service so it can handle registration of a {@link BroadcastReceiver}.
     */
    public void init(Context context, ActivityManagerService activity) {
        mActivity = activity;
        context.registerReceiver(new RebootRequestReceiver(),
                new IntentFilter(Intent.ACTION_REBOOT),
                android.Manifest.permission.REBOOT, null);
       ...
    }

    final class RebootRequestReceiver extends BroadcastReceiver {
        @Override
        public void onReceive(Context c, Intent intent) {
            if (intent.getIntExtra("nowait", 0) != 0) {
                rebootSystem("Received ACTION_REBOOT broadcast");
                return;
            }
            Slog.w(TAG, "Unsupported ACTION_REBOOT broadcast: " + intent);
        }
    }

    /**
     * Perform a full reboot of the system.
     */
    void rebootSystem(String reason) {
        Slog.i(TAG, "Rebooting system because: " + reason);
        IPowerManager pms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE);
        try {
            pms.reboot(false, reason, false);
        } catch (RemoteException ex) {
        }
    }

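In the end the reboot is carried out by PowerManagerService.

3. WatchDog Monitoring Mechanism

WatchDog is a thread. Calling its start method enters its run method, which begins the monitoring loop.

3.1 Trigger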

public class Watchdog extends Thread {
    ...
    // These are temporally ordered: larger values as lateness increases
    private static final int COMPLETED = 0;
    private static final int WAITING = 1;
    private static final int WAITED_HALF = 2;
    private static final int OVERDUE = 3;

    ...

    @Override
    public void run() {
        // Whether we have already waited half the timeout (30 s)
        boolean waitedHalf = false;
        while (true) {
            final List<HandlerChecker> blockedCheckers;
            final String subject;
            // Whether restarting is allowed
            final boolean allowRestart;
            // Whether a debugger has been connected
            int debuggerWasConnected = 0;
            synchronized (this) {
                // 30 s
                long timeout = CHECK_INTERVAL;
                // Make sure we (re)spin the checkers that have become idle within
                // this wait-and-check interval
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    // [3.1.1] Schedule the check on every HandlerChecker
                    hc.scheduleCheckLocked();
                }

                if (debuggerWasConnected > 0) {
                    debuggerWasConnected--;
                }

                // NOTE: We use uptimeMillis() here because we do not want to increment the time we
                // wait while asleep. If the device is asleep then the thing that we are waiting
                // to timeout on is asleep as well and won't have a chance to run, causing a false
                // positive on when to kill things.
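                // Start timing; time spent asleep is not counted
                long start = SystemClock.uptimeMillis();
                // Make sure a full 30 s has elapsed before moving on
                while (timeout > 0) {
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    try {
                        // An InterruptedException is simply caught and the wait continues
                        wait(timeout);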
                        // Note: mHandlerCheckers and mMonitorChecker may have changed after waiting
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                }

                boolean fdLimitTriggered = false;
                if (mOpenFdMonitor != null) {
                    fdLimitTriggered = mOpenFdMonitor.monitor();
                }

                if (!fdLimitTriggered) {
                    // [3.1.2] Evaluate the state of the HandlerCheckers
                    final int waitState = evaluateCheckerCompletionLocked();
                    // COMPLETED: every check finished normally; keep checking
                    if (waitState == COMPLETED) {
                        // The monitors have returned; reset
                        waitedHalf = false;
                        continue;
                    // Less than 30 s elapsed; keep checking
                    } else if (waitState == WAITING) {
                        // still waiting but within their configured intervals; back off and recheck
                        continue;
                    // Between 30 s and 60 s; dump some information and keep checking
                    } else if (waitState == WAITED_HALF) {
                        if (!waitedHalf) {
                            Slog.i(TAG, "WAITED_HALF");
                            // We've waited half the deadlock-detection interval.  Pull a stack
                            // trace and wait another half.
                            ArrayList<Integer> pids = new ArrayList<>(mInterestingJavaPids);
                            // Dump the stacks of the interesting Java and native processes (system_server, phone, media, ...)
                            ActivityManagerService.dumpStackTraces(pids, null, null,
                                    getInterestingNativePids(), null);
                            waitedHalf = true;
                        }
                        continue;
                    }

                    // [3.2] Timed out
                    // something is overdue!
                    blockedCheckers = getBlockedCheckersLocked();
                    subject = describeCheckersLocked(blockedCheckers);
                } else {
                    blockedCheckers = Collections.emptyList();
                    subject = "Open FD high water mark reached";
                }
                allowRestart = mAllowRestart;
            }

            ...
        }
        ...
    }

    ...
}
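The inner while (timeout > 0) loop guarantees that a full check interval passes even when wait() returns early (a notify, a spurious wakeup, or an interrupt). A small plain-Java sketch of that loop is shown here. FullIntervalWait and wakeEarly are hypothetical names, and System.nanoTime() is used as an assumed stand-in for SystemClock.uptimeMillis(), which is not available outside Android.

```java
final class FullIntervalWait {
    private final Object lock = new Object();

    /** Monotonic milliseconds; assumed stand-in for SystemClock.uptimeMillis(). */
    static long nowMs() {
        return System.nanoTime() / 1_000_000;
    }

    /** Waits until intervalMs has elapsed, re-waiting after early wakeups. Returns elapsed ms. */
    long waitFullInterval(long intervalMs) {
        synchronized (lock) {
            long start = nowMs();
            long timeout = intervalMs;
            while (timeout > 0) {
                try {
                    lock.wait(timeout);
                } catch (InterruptedException e) {
                    // Same policy as the loop above: swallow it and keep waiting.
                }
                // Recompute the remaining time from the clock, not from the wait() call.
                timeout = intervalMs - (nowMs() - start);
            }
            return nowMs() - start;
        }
    }

    /** Wakes the waiter before its interval is over. */
    void wakeEarly() {
        synchronized (lock) {
            lock.notifyAll();
        }
    }

    public static void main(String[] args) {
        FullIntervalWait w = new FullIntervalWait();
        new Thread(() -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) { }
            w.wakeEarly();   // early wakeup, well before the interval ends
        }).start();
        long elapsed = w.waitFullInterval(200);
        System.out.println("waited the full interval: " + (elapsed >= 200));
    }
}
```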

3.1.1 scheduleCheckLocked

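scheduleCheckLocked posts the HandlerChecker to the very front of the monitored thread's message queue. When the thread reaches it, HandlerChecker.run() calls the monitor() method of each monitor and sets mCompleted = true once they have all returned.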

    public final class HandlerChecker implements Runnable {
        ...
        public void scheduleCheckLocked() {
            // If no check is in flight, move all monitors from mMonitorQueue into mMonitors
            if (mCompleted) {
                // Safe to update monitors in queue, Handler is not in the middle of work
                mMonitors.addAll(mMonitorQueue);
                mMonitorQueue.clear();
            }
            // If there are no monitors (true for every checker except android.fg) and the looper
            // is polling, or if this checker is paused, mark it completed and return
            if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
                    || (mPauseCount > 0)) {
                // Don't schedule until after resume OR
                // If the target looper has recently been polling, then
                // there is no reason to enqueue our checker on it since that
                // is as good as it not being deadlocked.  This avoid having
                // to do a context switch to check the thread. Note that we
                // only do this if we have no monitors since those would need to
                // be executed at this point.
                mCompleted = true;
                return;
            }
            // A check is already in progress; no need to repeat it
            if (!mCompleted) {
                // we already have a check in flight, so no need
                return;
            }

            mCompleted = false;
            mCurrentMonitor = null;
            // Time at which the check starts
            mStartTime = SystemClock.uptimeMillis();
            // Post to the front of the message queue so that run() executes as soon as possible
            mHandler.postAtFrontOfQueue(this);
        }

        ...

        @Override
        public void run() {
            // Once we get here, we ensure that mMonitors does not change even if we call
            // #addMonitorLocked because we first add the new monitors to mMonitorQueue and
            // move them to mMonitors on the next schedule when mCompleted is true, at which
            // point we have completed execution of this method.
            final int size = mMonitors.size();
            for (int i = 0 ; i < size ; i++) {
                synchronized (Watchdog.this) {
                    mCurrentMonitor = mMonitors.get(i);
                }
                // Call back into this monitor's monitor() method
                mCurrentMonitor.monitor();
            }

            synchronized (Watchdog.this) {
                mCompleted = true;
                mCurrentMonitor = null;
            }
        }

        ...
    }
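  • Looper Checker: a HandlerChecker for any thread other than android.fg has no monitor objects, so it only checks whether the message queue is idle (polling). If the thread never becomes idle, the Runnable posted by mHandler.postAtFrontOfQueue(this) has to wait for the message currently being processed, so its run() method is delayed and mCompleted is not set to true.
  • Monitor Checker: this is the HandlerChecker of the android.fg thread. Its handler can be blocked, and the call to monitor() can be blocked as well.

Possible problems:

  • If other messages keep being posted with postAtFrontOfQueue(), WatchDog's check may never get a chance to run;
  • Or each monitor takes some time and the total adds up to more than 60 s, causing a WatchDog timeout.

These are atypical WatchDog timeouts.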
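The Looper Checker idea (post a probe to the thread and see whether it gets to run) can be sketched without Android by treating a single-thread executor as the message loop. This is only an approximation under stated assumptions: QueueProbeDemo and Checker are hypothetical, and an ExecutorService has no equivalent of postAtFrontOfQueue, so the probe queues at the back instead of the front.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class QueueProbeDemo {
    /** Minimal checker: posts itself to a worker and records whether it has run. */
    static final class Checker implements Runnable {
        private volatile boolean completed = true;

        /** Call from one scheduling thread only, as Watchdog does. */
        void schedule(ExecutorService worker) {
            if (!completed) {
                return;               // a check is already in flight
            }
            completed = false;
            worker.execute(this);     // runs only after earlier tasks have finished
        }

        @Override
        public void run() {
            completed = true;         // reaching this line proves the queue is moving
        }

        boolean isCompleted() {
            return completed;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        Checker checker = new Checker();
        worker.execute(() -> {        // a long-running "message" hogging the thread
            try { Thread.sleep(300); } catch (InterruptedException ignored) { }
        });
        checker.schedule(worker);
        System.out.println("while busy: completed=" + checker.isCompleted());
        worker.shutdown();
        worker.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("after drain: completed=" + checker.isCompleted());
    }
}
```

A watchdog would compare how long completed has stayed false against the checker's timeout; that part is left out here.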

3.1.2 evaluateCheckerCompletionLocked

    private int evaluateCheckerCompletionLocked() {
        int state = COMPLETED;
        for (int i=0; i<mHandlerCheckers.size(); i++) {
            HandlerChecker hc = mHandlerCheckers.get(i);
            state = Math.max(state, hc.getCompletionStateLocked());
        }
        return state;
    }

This iterates over mHandlerCheckers and returns the largest (that is, the latest) wait state among them.

    public final class HandlerChecker implements Runnable {
        ...
        public int getCompletionStateLocked() {
            if (mCompleted) {
                return COMPLETED;
            } else {
                long latency = SystemClock.uptimeMillis() - mStartTime;
                // mWaitMax defaults to 60 s
                if (latency < mWaitMax/2) {
                    return WAITING;
                } else if (latency < mWaitMax) {
                    return WAITED_HALF;
                }
            }
            return OVERDUE;
        }
        ...
    }

The result is derived from the mCompleted flag and from how long the check task has been running.
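The state computation is small enough to restate as pure functions, which makes the boundaries easy to check. In this sketch the constants mirror the ones shown earlier, while CompletionStates, stateFor and worstState are hypothetical names; the latency is passed in as a parameter instead of being read from the clock.

```java
final class CompletionStates {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Mirrors getCompletionStateLocked(), with latency supplied by the caller. */
    static int stateFor(boolean completed, long latencyMs, long waitMaxMs) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMs < waitMaxMs / 2) {
            return WAITING;
        }
        if (latencyMs < waitMaxMs) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    /** Mirrors evaluateCheckerCompletionLocked(): the worst state wins. */
    static int worstState(int... states) {
        int worst = COMPLETED;
        for (int s : states) {
            worst = Math.max(worst, s);
        }
        return worst;
    }

    public static void main(String[] args) {
        long waitMax = 60_000;
        int fg = stateFor(false, 45_000, waitMax);   // blocked for 45 s
        int ui = stateFor(true, 0, waitMax);         // finished its check
        // prints "fg=2 ui=0 overall=2"
        System.out.println("fg=" + fg + " ui=" + ui + " overall=" + worstState(fg, ui));
    }
}
```

Because the states are ordered by lateness, taking the maximum means a single overdue checker is enough to make the whole evaluation OVERDUE.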

3.2 Execution

public class Watchdog extends Thread {
    ...
    @Override
    public void run() {
        ...
        while (true) {
            ...
            synchronized (this) {
                ...
                        // something is overdue!
                        // [3.2.1] Get the blocked checkers
                        blockedCheckers = getBlockedCheckersLocked();
                        // [3.2.2] Description of the blocked checkers
                        subject = describeCheckersLocked(blockedCheckers);
                        ...
                ...
                }
                allowRestart = mAllowRestart;
            }

            // If we got here, that means that the system is most likely hung.
            // First collect stack traces from all threads of the system process.
            // Then kill this process so that the system will restart.
            // Event log entry with the tag "watchdog"
            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
            // Interesting Java processes
            ArrayList<Integer> pids = new ArrayList<>(mInterestingJavaPids);

            // Time of the system ANR
            long anrTime = SystemClock.uptimeMillis();
            StringBuilder report = new StringBuilder();
            // Reads /proc/pressure/memory
            report.append(MemoryPressureUtil.currentPsiState());
            // CPU tracker
            ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(false);
            StringWriter tracesFileException = new StringWriter();
            // Dump the stack traces of the main Java and native processes
            final File stack = ActivityManagerService.dumpStackTraces(
                    pids, processCpuTracker, new SparseArray<>(), getInterestingNativePids(),
                    tracesFileException);

            // Give some extra time to make sure the stack traces get written.
            // The system's been hanging for a minute, another second or two won't hurt much.
            // Make sure the dump has been written out
            SystemClock.sleep(5000);

            processCpuTracker.update();
            report.append(processCpuTracker.printCurrentState(anrTime));
            report.append(tracesFileException.getBuffer());

            // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
            doSysRq('w');
            doSysRq('l');

            // Try to add the error to the dropbox, but assuming that the ActivityManager
            // itself may be deadlocked.  (which has happened, causing this statement to
            // deadlock and the watchdog as a whole to be ineffective)
            Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                    public void run() {
                        // If a watched thread hangs before init() is called, we don't have a
                        // valid mActivity. So we can't log the error to dropbox.
                        if (mActivity != null) {
                            mActivity.addErrorToDropBox(
                                    "watchdog", null, "system_server", null, null, null,
                                    subject, report.toString(), stack, null);
                        }
                        FrameworkStatsLog.write(FrameworkStatsLog.SYSTEM_SERVER_WATCHDOG_OCCURRED,
                                subject);
                    }
                };
            dropboxThread.start();
            try {
                dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
            } catch (InterruptedException ignored) {}

            IActivityController controller;
            synchronized (this) {
                controller = mController;
            }
            // Special handling when an IActivityController is set
            if (controller != null) {
                Slog.i(TAG, "Reporting stuck state to activity controller");
                try {
                    Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                    // 1 = keep waiting, -1 = kill system
                    // [3.2.3] The controller can ask to keep waiting instead of killing the system
                    int res = controller.systemNotResponding(subject);
                    if (res >= 0) {
                        Slog.i(TAG, "Activity controller requested to coninue to wait");
                        waitedHalf = false;
                        continue;
                    }
                } catch (RemoteException e) {
                }
            }

            // Only kill the process if the debugger is not attached.
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            if (debuggerWasConnected >= 2) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (debuggerWasConnected > 0) {
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                // [3.2.4] Print the stack traces of the blocked threads
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
                Slog.w(TAG, "*** GOODBYE!");
                // Kill system_server
                Process.killProcess(Process.myPid());
                System.exit(10);
            }

            waitedHalf = false;
        }
    }

Once the information has been collected, the system_server process is killed. allowRestart defaults to true; running am hang sets it to disallow restart, in which case system_server is not killed.

3.2.1 getBlockedCheckersLocked

    private ArrayList<HandlerChecker> getBlockedCheckersLocked() {
        ArrayList<HandlerChecker> checkers = new ArrayList<HandlerChecker>();
        for (int i=0; i<mHandlerCheckers.size(); i++) {
            HandlerChecker hc = mHandlerCheckers.get(i);
            if (hc.isOverdueLocked()) {
                checkers.add(hc);
            }
        }
        return checkers;
    }

    public final class HandlerChecker implements Runnable {
        ...
        boolean isOverdueLocked() {
            return (!mCompleted) && (SystemClock.uptimeMillis() > mStartTime + mWaitMax);
        }
        ...
    }

This iterates over all HandlerCheckers and collects every checker that has not completed (mCompleted is false) and has exceeded its timeout.

3.2.2 describeCheckersLocked

    private String describeCheckersLocked(List<HandlerChecker> checkers) {
        StringBuilder builder = new StringBuilder(128);
        for (int i=0; i<checkers.size(); i++) {
            if (builder.length() > 0) {
                builder.append(", ");
            }
            builder.append(checkers.get(i).describeBlockedStateLocked());
        }
        return builder.toString();
    }

    public final class HandlerChecker implements Runnable {
        ...
        String describeBlockedStateLocked() {
            if (mCurrentMonitor == null) {
                return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
            } else {
                return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
                        + " on " + mName + " (" + getThread().getName() + ")";
            }
        }
        ...
    }

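For each timed-out HandlerChecker returned by getBlockedCheckersLocked, the corresponding message is produced:

  • android.fg thread (the Monitor Checker): outputs “Blocked in monitor …” when a monitor cannot acquire its lock in time. If the thread is stuck on its current message before any monitor has run, mCurrentMonitor is still null and the output is “Blocked in handler on …” instead;
  • Other threads (Looper Checkers): output “Blocked in handler on …”, which means the thread has taken too long to process its current message.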

3.2.3 systemNotResponding

If an IActivityController has been set, it can choose to keep waiting instead of killing the system. Taking the implementation in Monkey as an example:

development/cmds/monkey/src/com/android/commands/monkey/Monkey.java

/**
 * Application that injects random key events and other actions into the system.
 */
public class Monkey {
    ...
    /** Kill the process after a timeout or crash. */
    private boolean mKillProcessAfterError;
    ...
    /**
     * Monitor operations happening in the system.
     */
    private class ActivityController extends IActivityController.Stub {
        ...
        public int systemNotResponding(String message) {
            StrictMode.ThreadPolicy savedPolicy = StrictMode.allowThreadDiskWrites();
            Logger.err.println("// WATCHDOG: " + message);
            ...
            return (mKillProcessAfterError) ? -1 : 1;
        }
    }

    ...

    /**
     * Run the command!
     *
     * @param args The command-line arguments
     * @return Returns a posix-style result code. 0 for no error.
     */
    private int run(String[] args) {
        ...
        if (!processOptions()) {
            return -1;
        }

        ...

        if (!getSystemInterfaces()) {
            return -3;
        }
        ...
    }

    ...

    /**
     * Process the command-line options
     *
     * @return Returns true if options were parsed with no apparent errors.
     */
    private boolean processOptions() {
        ...
        try {
            String opt;
            ...
            while ((opt = nextOption()) != null) {
                ...
                } else if (opt.equals("--kill-process-after-error")) {
                    mKillProcessAfterError = true;
                    ...
                }
                ...
            }
            ...
        }
        ...
    }

    ...

    /**
     * Attach to the required system interfaces.
     *
     * @return Returns true if all system interfaces were available.
     */
    private boolean getSystemInterfaces() {
        mAm = ActivityManager.getService();
        ...
        try {
            mAm.setActivityController(new ActivityController(), true);
            ...
        } catch (RemoteException e) {
            Logger.err.println("** Failed talking with activity manager!");
            return false;
        }

        return true;
    }

    ...


}

If “--kill-process-after-error” is set in a Monkey test, an SWT (software watchdog timeout) may be triggered; otherwise the system keeps waiting and does not restart.

3.2.4 killProcess

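After the WatchDog mechanism detects a timeout, it kills the system_server process. That causes the Zygote process to kill itself, which in turn makes init restart Zygote, and the result is what looks like a restart of the phone's framework.

This is usually accompanied by the following log: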

Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in ...

...

Watchdog: *** GOODBYE!

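4. Summary

Watchdog is a thread named “watchdog” that runs in the system_server process:

  • When a monitored object is blocked for longer than its timeout (60 s by default), an SWT is triggered: system_server is killed, which restarts the framework layer;
  • mHandlerCheckers holds the list of all HandlerChecker objects, including the Handlers of the fg, main, ui, i/o, display and other threads;
  • mMonitorChecker holds all the Monitors that Watchdog is currently monitoring; all of these Monitors run on the fg thread;
  • There are two ways to be added to WatchDog monitoring:
    • addThread(): monitors a Handler thread for blocked message processing;
    • addMonitor(): monitors a service that implements the Watchdog.Monitor interface. A timeout of this kind may mean that the “android.fg” thread's Handler is blocked, or that a monitor cannot acquire its lock.

In the following cases system_server is not killed even if Watchdog is triggered:

  • Monkey: an IActivityController is set and “--kill-process-after-error” is not, so the SystemNotResponding event can be intercepted;
  • debugger: a debugger is connected, so there is no restart;
  • hang: the am hang command has been run, so there is no restart.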
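The order in which run() consults the controller, the debugger state and allowRestart can be condensed into one decision function. The sketch below is a simplified model and not the framework code: KillDecision, Action and decide are hypothetical names, and a null controller result stands for "no controller set, or the call threw RemoteException".

```java
final class KillDecision {
    enum Action { KEEP_WAITING, SKIP_KILL, KILL }

    /**
     * @param controllerResult     return value of systemNotResponding(), or null if there is
     *                             no controller or the call failed (modelling assumption)
     * @param debuggerWasConnected the counter kept by run(); any value above 0 blocks the kill
     * @param allowRestart         false after "am hang"
     */
    static Action decide(Integer controllerResult, int debuggerWasConnected,
            boolean allowRestart) {
        if (controllerResult != null && controllerResult >= 0) {
            return Action.KEEP_WAITING;   // controller asked to keep waiting
        }
        if (debuggerWasConnected > 0) {
            return Action.SKIP_KILL;      // debugger is, or recently was, attached
        }
        if (!allowRestart) {
            return Action.SKIP_KILL;      // restart has been disallowed
        }
        return Action.KILL;               // kill system_server
    }

    public static void main(String[] args) {
        System.out.println("Monkey without kill flag: " + decide(1, 0, true));
        System.out.println("Monkey with kill flag:    " + decide(-1, 0, true));
        System.out.println("debugger attached:        " + decide(null, 2, true));
        System.out.println("after am hang:            " + decide(null, 0, false));
        System.out.println("normal device:            " + decide(null, 0, true));
    }
}
```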