The Watchdog Mechanism in Android

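Overview:

Android's Watchdog (literally a "watchdog") is the mechanism Android uses to monitor its key system services, such as AMS, for deadlocks and for operations that take too long. When it detects either condition, it restarts the system.

How it works: Watchdog is a thread that runs in an endless loop. On every pass it checks each HandlerChecker and all of the mMonitors registered on it. A check counts as complete only if every monitor returns within the allotted time (60 s); otherwise it is treated as a timeout. On timeout, Watchdog dumps the stacks of selected processes, writes logs under "/data/anr/", adds an entry to dropbox, and restarts the system. A monitor is the interface through which a system service lets Watchdog check it for deadlock or other abnormal states; each service provides its own implementation.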

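Startup:

Because Watchdog is one of the system's key detection mechanisms, it is created and started in "startBootstrapServices", the first phase in which system_server starts system services. Watchdog needs a Context to write its logs, so once AMS has been initialized the Context is handed to the Watchdog singleton through init().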

/frameworks/base/services/java/com/android/server/SystemServer.java
 
    private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
        ...
        final Watchdog watchdog = Watchdog.getInstance();
        watchdog.start();
        t.traceEnd();
 
        ...

        ActivityTaskManagerService atm = mSystemServiceManager.startService(
            ActivityTaskManagerService.Lifecycle.class).getService();
        mActivityManagerService = ActivityManagerService.Lifecycle.startService(
            mSystemServiceManager, atm);
        mActivityManagerService.setSystemServiceManager(mSystemServiceManager);
        mActivityManagerService.setInstaller(installer);
        mWindowManagerGlobalLock = atm.getGlobalLock();

        ...

 
        t.traceBegin("InitWatchdog");
        watchdog.init(mSystemContext, mActivityManagerService);
        t.traceEnd();
        ...
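
The two-step startup described above (start the thread early, hand over the Context later) can be sketched in plain Java. This is only an illustration under stated assumptions: MiniWatchdog and report() are hypothetical names, and a String stands in for Android's Context. It is not the AOSP implementation.

```java
// Illustrative sketch only; MiniWatchdog is a hypothetical class, not AOSP code.
final class MiniWatchdog extends Thread {
    private static final MiniWatchdog INSTANCE = new MiniWatchdog();

    // Stand-in for android.content.Context; stays null until init() is called.
    private volatile String context;

    private MiniWatchdog() {
        super("watchdog");
        setDaemon(true); // keeps this demo thread from blocking JVM exit
    }

    static MiniWatchdog getInstance() {
        return INSTANCE;
    }

    // Second phase: supply the dependency once it exists.
    void init(String context) {
        this.context = context;
    }

    // Anything that needs the context must tolerate init() not having run yet.
    String report(String subject) {
        String ctx = context;
        return (ctx == null) ? "no context yet: " + subject : ctx + ": " + subject;
    }

    @Override
    public void run() {
        // The monitoring loop is omitted in this sketch.
    }
}
```

A caller would obtain the instance with MiniWatchdog.getInstance(), call start() right away, and call init(...) only after the dependency is available. The null check in report() mirrors the way the real code guards its dropbox write with a check on mActivity.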

Constructor:

frameworks\base\services\core\java\com\android\server\Watchdog.java
private Watchdog() {
        super("watchdog");
        // Initialize handler checkers for each common thread we want to check.  Note
        // that we are not currently checking the background thread, since it can
        // potentially hold longer running operations with no guarantees about the timeliness
        // of operations there.

        // The shared foreground thread is the main checker.  It is where we
        // will also dispatch monitor checks and do other work.
        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
        // Add checker for main thread.  We only do a quick check since there
        // can be UI running on the thread.
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));
        // Add checker for shared UI thread.
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));
        // And also check IO thread.
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));
        // And the display thread.
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));
        // And the animation thread.
        mHandlerCheckers.add(new HandlerChecker(AnimationThread.getHandler(),
                "animation thread", DEFAULT_TIMEOUT));
        // And the surface animation thread.
        mHandlerCheckers.add(new HandlerChecker(SurfaceAnimationThread.getHandler(),
                "surface animation thread", DEFAULT_TIMEOUT));

        // Initialize monitor for Binder threads.
        addMonitor(new BinderThreadMonitor());

        mOpenFdMonitor = OpenFdMonitor.create();

        mInterestingJavaPids.add(Process.myPid());

        // See the notes on DEFAULT_TIMEOUT.
        assert DB ||
                DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;
    }

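By the time SystemServer.java obtains the Watchdog singleton, the constructor has already set up checkers for these important threads by default. Note that FgThread is handled separately: its HandlerChecker (mMonitorChecker) is the one that addMonitor() registers monitors on, as discussed later.

Another important check is the Binder thread monitor, which is itself an example of a monitor. The author gives 16 as the maximum number of Binder threads; the native code below blocks once the number of executing threads reaches mMaxThreads. AMS likewise tries to acquire its synchronization lock inside monitor(); if the lock cannot be acquired, the service is considered deadlocked. This is also covered below.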

frameworks\base\services\core\java\com\android\server\Watchdog.java
private static final class BinderThreadMonitor implements Watchdog.Monitor {
        @Override
        public void monitor() {
            Binder.blockUntilThreadAvailable();
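        }
    }

// blockUntilThreadAvailable() ends up in native code, which checks whether the number of executing Binder threads has reached the limit (16 according to the author); if it has, the call waits, and the monitor times out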
frameworks/native/libs/binder/IPCThreadState.cpp
void IPCThreadState::blockUntilThreadAvailable()
{
    pthread_mutex_lock(&mProcess->mThreadCountLock);
    while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
        ALOGW("Waiting for thread to be free. mExecutingThreadsCount=%lu mMaxThreads=%lu\n",
                static_cast<unsigned long>(mProcess->mExecutingThreadsCount),
                static_cast<unsigned long>(mProcess->mMaxThreads));
        pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
    }
    pthread_mutex_unlock(&mProcess->mThreadCountLock);
}

A monitor detects stalled processing: each service performs whatever check it needs inside its monitor() method.
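
The waiting behaviour of the native blockUntilThreadAvailable() shown above can be mimicked in plain Java with a lock and a condition. This is a minimal sketch, assuming a hypothetical ThreadSlots class with begin() and end() calls to track executing threads; it is not how Binder is implemented.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only; ThreadSlots is a hypothetical stand-in for the Binder thread pool.
final class ThreadSlots {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition slotFreed = lock.newCondition();
    private final int maxThreads;
    private int executing;

    ThreadSlots(int maxThreads) {
        this.maxThreads = maxThreads;
    }

    // A worker starts executing a request.
    void begin() {
        lock.lock();
        try {
            executing++;
        } finally {
            lock.unlock();
        }
    }

    // A worker finishes and wakes up anyone waiting for a free slot.
    void end() {
        lock.lock();
        try {
            executing--;
            slotFreed.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Returns only once fewer than maxThreads workers are executing.
    void blockUntilThreadAvailable() throws InterruptedException {
        lock.lock();
        try {
            while (executing >= maxThreads) {
                slotFreed.await();
            }
        } finally {
            lock.unlock();
        }
    }

    // Usage: {probe was blocked while the pool was full, probe returned after end()}.
    static boolean[] demo() {
        ThreadSlots slots = new ThreadSlots(1);
        slots.begin(); // the only slot is now busy
        Thread probe = new Thread(() -> {
            try {
                slots.blockUntilThreadAvailable();
            } catch (InterruptedException ignored) {
                // demo only
            }
        }, "probe");
        probe.setDaemon(true);
        probe.start();
        try {
            probe.join(200);
            boolean blocked = probe.isAlive();
            slots.end();
            probe.join(2000);
            return new boolean[] { blocked, !probe.isAlive() };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return new boolean[] { false, false };
        }
    }
}
```

If every slot stays busy, a monitor built on this call never returns, which is what lets the watchdog notice an exhausted thread pool.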

Example: the AMS monitor checking for deadlock:

frameworks\base\services\core\java\com\android\server\am\ActivityManagerService.java    
/** In this method we try to acquire our lock to make sure that we have not deadlocked */
    public void monitor() {
        synchronized (this) { }
    }
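
The reason an empty synchronized block works as a deadlock probe is that it cannot return until the lock is free. The following plain-Java sketch illustrates the idea; LockProbe, completesWithin(), and the timings are hypothetical and chosen only for the demo.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only; LockProbe is a hypothetical class, not AOSP code.
final class LockProbe {
    // Same shape as the AMS monitor: returns only once the lock can be acquired.
    static void monitor(Object lock) {
        synchronized (lock) { }
    }

    // Runs monitor() on a helper thread and reports whether it finished in time.
    static boolean completesWithin(Object lock, long timeoutMs) {
        CountDownLatch done = new CountDownLatch(1);
        Thread checker = new Thread(() -> {
            monitor(lock);
            done.countDown();
        }, "checker");
        checker.setDaemon(true);
        checker.start();
        try {
            return done.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Usage: {probe passes while the lock is free, probe is stuck while the lock is held}.
    static boolean[] demo() {
        Object lock = new Object();
        boolean free = completesWithin(lock, 1000);

        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (lock) {
                held.countDown();
                try {
                    release.await(); // simulates a thread that never lets go of the lock
                } catch (InterruptedException ignored) {
                    // demo only
                }
            }
        }, "holder");
        holder.setDaemon(true);
        holder.start();
        try {
            held.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        boolean stuck = !completesWithin(lock, 200);
        release.countDown();
        return new boolean[] { free, stuck };
    }
}
```

The probe itself holds no state; all the information is in how long monitor() takes to return.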

WMS monitor:

frameworks\base\services\core\java\com\android\server\wm\WindowManagerService.java    
// Called by the heartbeat to ensure locks are not held indefinitely (for deadlock detection).
    @Override
    public void monitor() {
        synchronized (mGlobalLock) { }
    }

Input monitor example:

frameworks\base\services\core\java\com\android\server\input\InputManagerService.java   
 // Called by the heartbeat to ensure locks are not held indefinitely (for deadlock detection).
    @Override
    public void monitor() {
        synchronized (mInputFilterLock) { }
        synchronized (mAssociationsLock) { /* Test if blocked by associations lock. */}
        nativeMonitor(mPtr);
    }

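The classes shown above (AMS, WMS, InputManagerService) are examples of classes that implement the Monitor interface.

Watchdog and processes started by AMS:

Every time AMS starts a process it calls handleProcessStartedLocked(), which eventually calls Watchdog.processStarted() to register the process with Watchdog. Only certain processes qualify, as shown below: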

    private static boolean isInterestingJavaProcess(String processName) {
        return processName.equals(StorageManagerService.sMediaStoreAuthorityProcessName)
                || processName.equals("com.android.phone");
    }
 
 
    public void processStarted(String processName, int pid) {
        if (isInterestingJavaProcess(processName)) {
            Slog.i(TAG, "Interesting Java process " + processName + " started. Pid " + pid);
            synchronized (this) {
                mInterestingJavaPids.add(pid);
            }
        }
    }

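As the code shows, only the process named by StorageManagerService.sMediaStoreAuthorityProcessName and "com.android.phone" are added to the list of interesting processes. When a timeout occurs, the stacks of these processes are dumped: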

frameworks\base\services\core\java\com\android\server\Watchdog.java
// at the half-way point (WAITED_HALF)
ArrayList<Integer> pids = new ArrayList<>(mInterestingJavaPids);
initialStack = ActivityManagerService.dumpStackTraces(pids, null, null,
        getInterestingNativePids(), null);

// after the full timeout
ArrayList<Integer> pids = new ArrayList<>(mInterestingJavaPids);
final File finalStack = ActivityManagerService.dumpStackTraces(
        pids, processCpuTracker, new SparseArray<>(), getInterestingNativePids(),
        tracesFileException);

Note: if you are an app developer, you can add a snippet like the following to capture a trace file and then open it in Traceview:

import android.os.Debug;
Debug.startMethodTracing("/sdcard/awesometrace.trace");
 
// run the operation you want to trace
BigInteger fN = Fibonacci.computeRecursivelyWithCache(100000);
 
Debug.stopMethodTracing();
 
// this produces awesometrace.trace under /sdcard; retrieve it with Eclipse DDMS

Reboot broadcast registration:

    final class RebootRequestReceiver extends BroadcastReceiver {
        @Override
        public void onReceive(Context c, Intent intent) {
            if (intent.getIntExtra("nowait", 0) != 0) {
                rebootSystem("Received ACTION_REBOOT broadcast");
                return;
            }
            Slog.w(TAG, "Unsupported ACTION_REBOOT broadcast: " + intent);
        }
    }

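Watchdog.java registers a BroadcastReceiver internally. When it receives the reboot broadcast with a non-zero "nowait" extra, it reboots the system immediately; otherwise it only logs a warning.

addMonitor() and addThread():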

WatchDog.java:  
    public interface Monitor {
        void monitor();
    }
  
    public void addMonitor(Monitor monitor) {
        synchronized (this) {
            mMonitorChecker.addMonitorLocked(monitor);
        }
    }

    public void addThread(Handler thread) {
        addThread(thread, DEFAULT_TIMEOUT);
    }

    public void addThread(Handler thread, long timeoutMillis) {
        synchronized (this) {
            final String name = thread.getLooper().getThread().getName();
            mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
        }
    }


AMS:

    Watchdog.getInstance().addMonitor(this);
    Watchdog.getInstance().addThread(mHandler);

    public void monitor() {
        synchronized (this) { }
    }


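Monitor is simply an interface. addMonitor() adds a monitor to the FgThread checker (mMonitorChecker); when the timeout check runs, it calls back into the monitor() method of the corresponding service. In the example above, AMS checks whether its synchronization lock is deadlocked.

addThread() adds a new HandlerChecker to Watchdog for the given thread.

Once all HandlerCheckers and Monitors have been added, Watchdog iterates over them and checks whether any has timed out.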

@Override
    public void run() {
        boolean waitedHalf = false;
        File initialStack = null;
        while (true) {
            final List<HandlerChecker> blockedCheckers;
            final String subject;
            final boolean allowRestart;
            int debuggerWasConnected = 0;
            synchronized (this) {
                long timeout = CHECK_INTERVAL;
                // Make sure we (re)spin the checkers that have become idle within
                // this wait-and-check interval
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    hc.scheduleCheckLocked();
                }
            ...
            }
        }
    }

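run() contains a while loop, so Watchdog keeps checking the system services for timeouts in an endless loop.

The key functions here are scheduleCheckLocked() and HandlerChecker's own run(). These two are where monitors and handlers are actually checked for timeouts.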

        public void scheduleCheckLocked() {
            if (mCompleted) {
                // Safe to update monitors in queue, Handler is not in the middle of work
                mMonitors.addAll(mMonitorQueue);
                mMonitorQueue.clear();
            }
            if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
                    || (mPauseCount > 0)) {
                // Don't schedule until after resume OR
                // If the target looper has recently been polling, then
                // there is no reason to enqueue our checker on it since that
                // is as good as it not being deadlocked.  This avoid having
                // to do a context switch to check the thread. Note that we
                // only do this if we have no monitors since those would need to
                // be executed at this point.
                mCompleted = true;
                return;
            }
            if (!mCompleted) {
                // we already have a check in flight, so no need
                return;
            }

            mCompleted = false;
            mCurrentMonitor = null;
            mStartTime = SystemClock.uptimeMillis();
            mHandler.postAtFrontOfQueue(this);
        }

        @Override
        public void run() {
            // Once we get here, we ensure that mMonitors does not change even if we call
            // #addMonitorLocked because we first add the new monitors to mMonitorQueue and
            // move them to mMonitors on the next schedule when mCompleted is true, at which
            // point we have completed execution of this method.
            final int size = mMonitors.size();
            for (int i = 0 ; i < size ; i++) {
                synchronized (Watchdog.this) {
                    mCurrentMonitor = mMonitors.get(i);
                }
                mCurrentMonitor.monitor();
            }

            synchronized (Watchdog.this) {
                mCompleted = true;
                mCurrentMonitor = null;
            }
        }
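
The interplay of scheduleCheckLocked() and run() above can be reproduced outside Android. The sketch below is a simplified illustration under stated assumptions: CheckerSketch is a hypothetical class, a single-thread executor stands in for the monitored Handler (so the check is appended to the queue rather than posted at the front), and the polling and pause logic is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch only; CheckerSketch is hypothetical, not the AOSP HandlerChecker.
final class CheckerSketch implements Runnable {
    private final Object lock = new Object();      // plays the role of Watchdog.this
    private final ExecutorService target;           // stand-in for the monitored Handler
    private final List<Runnable> monitors = new ArrayList<>();
    private final List<Runnable> monitorQueue = new ArrayList<>();
    private boolean completed = true;

    CheckerSketch(ExecutorService target) {
        this.target = target;
    }

    // New monitors are parked in a queue so a running check never sees the list change.
    void addMonitor(Runnable monitor) {
        synchronized (lock) {
            monitorQueue.add(monitor);
        }
    }

    void scheduleCheck() {
        synchronized (lock) {
            if (!completed) {
                return; // a check is already in flight
            }
            monitors.addAll(monitorQueue);
            monitorQueue.clear();
            completed = false;
        }
        target.execute(this); // finishes only if the target thread is responsive
    }

    @Override
    public void run() {
        int size;
        synchronized (lock) {
            size = monitors.size();
        }
        for (int i = 0; i < size; i++) {
            Runnable monitor;
            synchronized (lock) {
                monitor = monitors.get(i);
            }
            monitor.run(); // may block, which is exactly what is being detected
        }
        synchronized (lock) {
            completed = true;
        }
    }

    boolean isCompleted() {
        synchronized (lock) {
            return completed;
        }
    }

    // Polls until the check completes or the timeout elapses.
    boolean awaitCompleted(long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (System.nanoTime() < deadline) {
            if (isCompleted()) {
                return true;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return isCompleted();
    }

    // Usage: {check stays incomplete while a monitor blocks, check completes once it returns}.
    static boolean[] demo() {
        ExecutorService target = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "target-thread");
            t.setDaemon(true);
            return t;
        });
        CheckerSketch checker = new CheckerSketch(target);
        CountDownLatch release = new CountDownLatch(1);
        checker.addMonitor(() -> {
            try {
                release.await(); // simulates a monitor stuck on a lock
            } catch (InterruptedException ignored) {
                // demo only
            }
        });
        checker.scheduleCheck();
        boolean blocked = !checker.awaitCompleted(200);
        release.countDown();
        boolean recovered = checker.awaitCompleted(2000);
        target.shutdown();
        return new boolean[] { blocked, recovered };
    }
}
```

The completed flag is the only thing the watchdog thread needs to read: if it is still false after the timeout, either the target thread or one of its monitors is stuck.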

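Back in Watchdog's run() method: note that the 60 s timeout is split into two halves of 30 s each. After a checker has been blocked for 30 s, thread stacks are dumped and a trace file is generated under /data/anr/. The final state is then evaluated by the following code: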

                if (!fdLimitTriggered) {
                    final int waitState = evaluateCheckerCompletionLocked();
                    if (waitState == COMPLETED) {
                        // The monitors have returned; reset
                        waitedHalf = false;
                        continue;
                    } else if (waitState == WAITING) {
                        // still waiting but within their configured intervals; back off and recheck
                        continue;
                    } else if (waitState == WAITED_HALF) {
                        if (!waitedHalf) {
                            Slog.i(TAG, "WAITED_HALF");
                            // We've waited half the deadlock-detection interval.  Pull a stack
                            // trace and wait another half.
                            ArrayList<Integer> pids = new ArrayList<>(mInterestingJavaPids);
                            initialStack = ActivityManagerService.dumpStackTraces(pids,
                                    null, null, getInterestingNativePids(), null);
                            waitedHalf = true;
                        }
                        continue;
                    }

                    // something is overdue!
                    blockedCheckers = getBlockedCheckersLocked();
                    subject = describeCheckersLocked(blockedCheckers);
                } else {
                    blockedCheckers = Collections.emptyList();
                    subject = "Open FD high water mark reached";
                }
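
The article does not show the body of evaluateCheckerCompletionLocked(). A plausible reading, consistent with the two 30 s halves described above, is that each checker derives a state from how long it has been waiting and the watchdog takes the worst state across all checkers. The sketch below works under that assumption; CompletionState, stateOf(), evaluate(), and the numeric values of the constants are hypothetical.

```java
// Illustrative sketch only; the names and constant values here are assumptions.
final class CompletionState {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    // State of a single checker, given when its check started and its timeout.
    static int stateOf(boolean completed, long startMs, long nowMs, long waitMaxMs) {
        if (completed) {
            return COMPLETED;
        }
        long latency = nowMs - startMs;
        if (latency < waitMaxMs / 2) {
            return WAITING;
        }
        if (latency < waitMaxMs) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    // The overall state is the worst state among all checkers.
    static int evaluate(int... states) {
        int worst = COMPLETED;
        for (int state : states) {
            worst = Math.max(worst, state);
        }
        return worst;
    }
}
```

With a 60 s timeout, a checker that has been stuck for 10 s is WAITING, one stuck for 30 s is WAITED_HALF, and one stuck for 60 s is OVERDUE. A single overdue checker is enough to make the overall result OVERDUE.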

Once a checker is overdue, Watchdog logs the blocked checkers and restarts the system; the "watchdog" entries in the system log come from here.

    private ArrayList<HandlerChecker> getBlockedCheckersLocked() {
        ArrayList<HandlerChecker> checkers = new ArrayList<HandlerChecker>();
        for (int i=0; i<mHandlerCheckers.size(); i++) {
            HandlerChecker hc = mHandlerCheckers.get(i);
            if (hc.isOverdueLocked()) {
                checkers.add(hc);
            }
        }
        return checkers;
    }

    private String describeCheckersLocked(List<HandlerChecker> checkers) {
        StringBuilder builder = new StringBuilder(128);
        for (int i=0; i<checkers.size(); i++) {
            if (builder.length() > 0) {
                builder.append(", ");
            }
            builder.append(checkers.get(i).describeBlockedStateLocked());
        }
        return builder.toString();
    }

The code that follows prints the logs and stack traces:

frameworks\base\services\core\java\com\android\server\Watchdog.java
@Override
public void run() {
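    // set to true once the first half of the timeout has elapsed (we are in the second half)
    boolean waitedHalf = false;
    // loop forever
    while (true) {
        final List<HandlerChecker> blockedCheckers;
        // reason for the timeout, used in the log output
        final String subject;
        // whether system_server may be restarted (true by default);
        // can be changed through watchdog.setAllowRestart()
        final boolean allowRestart;
        // debugger connection state; set to 2 if a debugger is connected
        int debuggerWasConnected = 0;
        synchronized (this) {

        // timeout evaluation code (shown earlier); the synchronized block ends inside the elided part
        ...

        // If we got here, the system is most likely hung.
        // First collect stack traces from all threads of the system process.
        // Then kill this process so that the system will restart.
        EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

        // write the ANR log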
        File watchdogTraces;
        String newTracesPath = "traces_SystemServer_WDT"
                    + mTraceDateFormat.format(new Date()) + "_pid"
                    + String.valueOf(Process.myPid());
        File tracesDir = new File(ActivityManagerService.ANR_TRACE_DIR);
        watchdogTraces = new File(tracesDir, newTracesPath);
        try {
                if (watchdogTraces.createNewFile()) {
                    FileUtils.setPermissions(watchdogTraces.getAbsolutePath(),
                            0600, -1, -1); // -rw------- permissions
                    if (initialStack != null) {
                        final long age = System.currentTimeMillis()
                                - initialStack.lastModified();
                        final long FIVE_MINUTES_IN_MILLIS = 1000 * 60 * 5;
                        if (age < FIVE_MINUTES_IN_MILLIS) {
                            appendFile(watchdogTraces, initialStack);
                        }
                    }
                    if (finalStack != null) {
                        appendFile(watchdogTraces, finalStack);
                    }
                } 
        } catch (Exception e) {
            // catch any exception that happens here;
            // why kill the system when it is going to die anyways?
            Slog.e(TAG, "Exception creating Watchdog dump file:", e);
        }

        ArrayList<Integer> pids = new ArrayList<>();
        pids.add(Process.myPid());
        if (mPhonePid > 0) pids.add(mPhonePid);
        
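        // dump the Java and native thread stacks
        final File stack = ActivityManagerService.dumpStackTraces(
                pids, null, null, getInterestingNativePids());

        // sleep for 5 s to make sure the stacks get written to the file
        SystemClock.sleep(5000);

        // have the kernel dump all blocked threads and the CPU information
        doSysRq('w');
        doSysRq('l');

        // Try to add the error to the dropbox, but assume that ActivityManager itself may be
        // deadlocked. When that happens, this statement would deadlock and the watchdog as a whole would be ineffective.
        Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                public void run() {
                    // If a watched thread hangs before Watchdog's init() has run,
                    // there is no valid AMS, so the log cannot be stored in dropbox.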
                    if (mActivity != null) {
                        mActivity.addErrorToDropBox(
                                "watchdog", null, "system_server", null, null, null,
                                subject, null, stack, null);
                    }
                    StatsLog.write(StatsLog.SYSTEM_SERVER_WATCHDOG_OCCURRED, subject);
                }
            };
        dropboxThread.start();
        try {
            // wait up to 2 s for dropboxThread to return
            dropboxThread.join(2000);
        } catch (InterruptedException ignored) {}

        IActivityController controller;
        synchronized (this) {
            controller = mController;
        }
        // if an ActivityController is registered
        if (controller != null) {
            Slog.i(TAG, "Reporting stuck state to activity controller");
            try {
                // disable service dumps because the system process is hung, so the controller does not get stuck while reporting the error
                Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                // 1 = keep waiting, -1 = kill system
                int res = controller.systemNotResponding(subject);
                if (res >= 0) {
                    Slog.i(TAG, "Activity controller requested to coninue to wait");
                    waitedHalf = false;
                    continue;
                }
            } catch (RemoteException e) {
            }
        }

        // only kill the process if no debugger is connected
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        if (debuggerWasConnected >= 2) {
            Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
        } else if (debuggerWasConnected > 0) {
            Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
        } else if (!allowRestart) {
            Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
        } else {
            Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
            WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
            Slog.w(TAG, "*** GOODBYE!");
            // kill the process: Watchdog lives inside system_server,
            // because that is where it was created and started
            Process.killProcess(Process.myPid());
            System.exit(10);
        }

        waitedHalf = false;
    }
}

Finally, Process.killProcess(Process.myPid()) and System.exit(10) terminate the system_server process, which is what restarts the system.
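
One detail of the listing above is worth isolating: the dropbox write runs on its own thread and is joined for at most 2 s, so that a deadlocked ActivityManager cannot stall the watchdog itself. A minimal plain-Java sketch of that pattern follows; BoundedStep and runWithTimeout() are hypothetical names used only for illustration.

```java
// Illustrative sketch only; BoundedStep is a hypothetical helper, not AOSP code.
final class BoundedStep {
    // Runs the step on its own thread and waits at most timeoutMs for it.
    // Returns true if the step finished in time, false if it is still running.
    static boolean runWithTimeout(Runnable step, String threadName, long timeoutMs) {
        Thread worker = new Thread(step, threadName);
        worker.setDaemon(true); // an abandoned step must not keep the process alive
        worker.start();
        try {
            worker.join(timeoutMs);
        } catch (InterruptedException ignored) {
            Thread.currentThread().interrupt();
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        boolean fast = runWithTimeout(() -> { }, "fastStep", 1000);
        boolean slow = runWithTimeout(() -> {
            try {
                Thread.sleep(5000); // simulates a call that hangs
            } catch (InterruptedException ignored) {
                // demo only
            }
        }, "slowStep", 100);
        System.out.println("fast finished: " + fast + ", slow finished: " + slow);
    }
}
```

The caller moves on whether or not the step finished, which matches the listing: after join(2000) returns, the watchdog continues towards killing the process regardless of the dropbox result.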

