Android 7.0 Watchdog Mechanism

   Reposted from https://blog.youkuaiyun.com/fu_kevin0606/article/details/64479489

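    A phone has to be able to receive calls and text messages at any time, so its operating system is expected to work 24 hours a day, 7 days a week. As an operating system for embedded devices, however, Android has to cope with all kinds of software and hardware disturbances: from simple deadlocks or blocked code, to memory corruption caused by out-of-bounds writes, to bit flips caused by faulty hardware, and even CPU electromigration and memory demagnetization in extreme operating environments. Any of these can make system services crash or hang unpredictably.
    There are two ways to attack this problem. The first is to make the software and hardware more reliable under extreme conditions, for example by verifying program termination or by choosing radiation-hardened components. For cost reasons, though, an ordinary phone can hardly be made completely fault-free. The second is to detect the failure promptly and restart the system. Most faults on a phone disappear after a restart and do not affect further use, so the simple remedy is to restart the device as soon as the system is found to be unhealthy. How, then, can we tell whether the system is healthy? Early phone platforms usually added a hardware watchdog to the device: the software has to write a value to the watchdog hardware at regular intervals to show that it is still alive (colloquially, "feeding the dog"); if it fails to do so within the allotted time, the watchdog restarts the device.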
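The "feeding" protocol of a hardware watchdog can be modeled in a few lines. The following is only a sketch in plain Java: the class HardwareWatchdogModel, its methods and the 1000 ms timeout are hypothetical names and values chosen for illustration, and the clock is passed in explicitly so the behavior is easy to follow.

```java
/** Hypothetical model of a hardware watchdog timer; all names are illustrative. */
final class HardwareWatchdogModel {
    private final long timeoutMillis;
    private long lastFedAt;

    HardwareWatchdogModel(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastFedAt = nowMillis;
    }

    /** "Feeding the dog": the software proves it is alive and restarts the countdown. */
    void feed(long nowMillis) {
        lastFedAt = nowMillis;
    }

    /** True once the countdown has run out; real hardware would reset the device here. */
    boolean wouldReset(long nowMillis) {
        return nowMillis - lastFedAt >= timeoutMillis;
    }

    public static void main(String[] args) {
        HardwareWatchdogModel dog = new HardwareWatchdogModel(1_000, 0);
        dog.feed(800);
        System.out.println(dog.wouldReset(1_500)); // false: fed 700 ms ago
        System.out.println(dog.wouldReset(1_800)); // true: 1000 ms without feeding
    }
}
```

The model knows only whether somebody fed it in time, not which part of the system is stuck.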
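    The drawback of a hardware watchdog is that it does only one thing: it monitors the system as a whole. Early phone operating systems were mostly single-tasking, so a hardware watchdog was just about adequate. Android's SystemServer, by contrast, is a very complex process that hosts more than fifty services, which makes it the process most likely to run into trouble, so the threads running inside it need to be monitored individually. Doing that the hardware-watchdog way, with every thread feeding the dog at intervals, would waste CPU time and complicate the program design. Android therefore provides the Watchdog class as a software watchdog for the threads in SystemServer. When it detects a problem, Watchdog kills the SystemServer process.
    When Zygote, the parent process of SystemServer, receives the signal that SystemServer has died, it kills itself. Once Zygote's death is reported to the init process, init kills all of Zygote's child processes and restarts Zygote, which amounts to restarting the whole phone. A SystemServer failure usually has nothing to do with the kernel, so this "soft restart" solves the problem most of the time. It is also faster than a full reboot and less disruptive to the user.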

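Watchdog is initialized and started inside the SystemServer process. In SystemServer's run method the Android services are registered and started, and this includes initializing and starting Watchdog. The code is as follows:

final Watchdog watchdog = Watchdog.getInstance();
watchdog.init(context, mActivityManagerService);

In the second half of startOtherServices in SystemServer, the services are notified through their systemReady interfaces that the system is ready. Watchdog is started in the callback passed to ActivityManagerService's systemReady:

Watchdog.getInstance().start();

The code above is located in frameworks/base/services/java/com/android/server/SystemServer.java.
As mentioned, SystemServer.java obtains the Watchdog through getInstance, which creates the instance on first use. It is implemented as follows: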

public static Watchdog getInstance() {
    if (sWatchdog == null) {
        sWatchdog = new Watchdog(); // create the singleton instance
    }

    return sWatchdog;
}

private Watchdog() {
    super("watchdog");
    // Initialize handler checkers for each common thread we want to check. Note
    // that we are not currently checking the background thread, since it can
    // potentially hold longer running operations with no guarantees about the timeliness
    // of operations there.

    // The shared foreground thread is the main checker. It is where we
    // will also dispatch monitor checks and do other work.
    mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
            "foreground thread", DEFAULT_TIMEOUT);
    mHandlerCheckers.add(mMonitorChecker);
    // Add checker for main thread. We only do a quick check since there
    // can be UI running on the thread.
    mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
            "main thread", DEFAULT_TIMEOUT));
    // Add checker for shared UI thread.
    mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
            "ui thread", DEFAULT_TIMEOUT));
    // And also check IO thread.
    mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
            "i/o thread", DEFAULT_TIMEOUT));
    // And the display thread.
    mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
            "display thread", DEFAULT_TIMEOUT));

    // Initialize monitor for Binder threads.
    addMonitor(new BinderThreadMonitor());
}

The Watchdog constructor adds checkers for the foreground thread, main thread, UI thread, I/O thread and display thread to the mHandlerCheckers list. Finally it creates the monitor for the Binder threads and registers it with mMonitorChecker through addMonitor:

public void addMonitor(Monitor monitor) {
    synchronized (this) {
        if (isAlive()) {
            throw new RuntimeException("Monitors can't be added once the Watchdog is running");
        }
        mMonitorChecker.addMonitor(monitor);
    }
}
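The isAlive() guard in addMonitor works because Watchdog is itself a Thread (its constructor calls super("watchdog")), so registration is only possible before start() is called. A minimal sketch of the same guard outside Android follows; GuardedRegistry, monitorCount and shutdown are hypothetical names, and IllegalStateException is used here in place of the RuntimeException in the listing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

/** Hypothetical stand-in for Watchdog: a Thread that only accepts monitors before start(). */
final class GuardedRegistry extends Thread {
    private final List<Runnable> monitors = new ArrayList<>();
    private final CountDownLatch stop = new CountDownLatch(1);

    GuardedRegistry() {
        super("guarded-registry");
        setDaemon(true);
    }

    synchronized void addMonitor(Runnable monitor) {
        if (isAlive()) {
            throw new IllegalStateException("Monitors can't be added once the thread is running");
        }
        monitors.add(monitor);
    }

    synchronized int monitorCount() {
        return monitors.size();
    }

    @Override
    public void run() {
        try {
            stop.await(); // stay alive until shutdown() is called
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    void shutdown() {
        stop.countDown();
    }

    public static void main(String[] args) {
        GuardedRegistry registry = new GuardedRegistry();
        registry.addMonitor(() -> { });     // accepted: the thread has not been started
        registry.start();
        try {
            registry.addMonitor(() -> { }); // rejected: the thread is running
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        registry.shutdown();
    }
}
```

Freezing the list before the loop starts means the checking thread can iterate over the monitors without worrying about concurrent registration.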
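The code above only starts the watchdog; at this point it does not yet know which system services to watch. To keep the watchdog module independent and extensible, the system services register themselves with it. Watchdog offers two kinds of monitoring: one uses the monitor() callback to check whether a service's critical section is deadlocked or blocked, the other posts a message to check whether a service's main thread is blocked.
Take ActivityManagerService.java as an example. To register a monitor() callback with the watchdog, the class first has to implement the Watchdog.Monitor interface:

public class ActivityManagerService extends ActivityManagerNativeEx
        implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {

It then registers itself with the watchdog in its constructor. Note that there are two checks here. The first is addMonitor: in every check cycle the watchdog uses the foreground thread's HandlerChecker to call the monitor() method registered by the service, which acquires the lock on the service's critical section and releases it immediately, to detect whether the critical section is deadlocked or blocked. The second is addThread: the watchdog periodically posts a message to the service's handler through a HandlerChecker to detect whether the service's main thread is blocked. This is why a watchdog restart reports one of two messages, "Blocked in handler on ..." or "Blocked in monitor ...", depending on the kind of blockage.

Watchdog.getInstance().addMonitor(this);
Watchdog.getInstance().addThread(mHandler);

Finally the class implements the monitor method required by Watchdog.Monitor. While the watchdog is running it calls this method every 30 seconds to lock the critical section once. If the lock cannot be obtained within 60 seconds, the service is considered deadlocked and the system has to be restarted.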

/** In this method we try to acquire our lock to make sure that we have not deadlocked */
public void monitor() {
    synchronized (this) { }
}
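The empty synchronized block is the whole trick: monitor() returns at once when the lock is free and hangs for as long as another thread holds it. The sketch below shows that effect in plain Java, assuming a hypothetical Service class and a helper thread that stands in for the foreground thread. The names probe and holdLock and the timeouts are illustrative only.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class LockProbeDemo {
    /** Hypothetical service that guards its state with its own intrinsic lock. */
    static final class Service {
        /** Same shape as monitor() above: acquire the lock and release it at once. */
        void monitor() {
            synchronized (this) { }
        }
    }

    /** Calls service.monitor() on a helper thread; true if it returned within the timeout. */
    static boolean probe(Service service, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        Thread checker = new Thread(() -> {
            service.monitor();
            done.countDown();
        }, "probe");
        checker.setDaemon(true);
        checker.start();
        return awaitQuietly(done, timeoutMillis);
    }

    /** Starts a thread that holds the lock on target until the returned latch is counted down. */
    static CountDownLatch holdLock(Object target) {
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (target) {
                held.countDown();
                awaitQuietly(release, Long.MAX_VALUE);
            }
        }, "holder");
        holder.setDaemon(true);
        holder.start();
        awaitQuietly(held, Long.MAX_VALUE); // return only once the lock is really held
        return release;
    }

    private static boolean awaitQuietly(CountDownLatch latch, long timeoutMillis) {
        try {
            return latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        Service service = new Service();
        CountDownLatch release = holdLock(service);
        System.out.println("while the lock is held: " + probe(service, 200));
        release.countDown();
        System.out.println("after release: " + probe(service, 5_000));
    }
}
```

The probe never inspects the service's state. It only measures whether the lock can still be obtained, which is why one generic callback is enough for very different services.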

As the analysis above shows, the Watchdog constructor wraps the foreground thread, the main thread and the other common threads in HandlerChecker objects. This class is what actually detects timeouts. There are multiple HandlerChecker instances: every service that registers itself with the watchdog through addThread gets its own.

public void addThread(Handler thread) {
    addThread(thread, DEFAULT_TIMEOUT);
}

public void addThread(Handler thread, long timeoutMillis) {
    synchronized (this) {
        if (isAlive()) {
            throw new RuntimeException("Threads can't be added once the Watchdog is running");
        }
        final String name = thread.getLooper().getThread().getName();
        mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
    }
}

HandlerChecker implements Runnable. Each HandlerChecker runs on the thread it monitors and performs its check there, so the checkers do not interfere with each other.

/**
 * Used for checking status of handle threads and scheduling monitor callbacks.
 */
public final class HandlerChecker implements Runnable {
    private final Handler mHandler;
    private final String mName;
    private final long mWaitMax;
    private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
    private boolean mCompleted;
    private Monitor mCurrentMonitor;
    private long mStartTime;

    HandlerChecker(Handler handler, String name, long waitMaxMillis) {
        mHandler = handler;
        mName = name;
        mWaitMax = waitMaxMillis;
        mCompleted = true;
    }

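Every service registered through addThread has its own HandlerChecker instance, so who checks the services registered through addMonitor()? The answer is the mMonitorChecker seen earlier, that is, the foreground thread's HandlerChecker. Besides checking whether its own thread is blocked, it also calls the monitor() methods registered by the system services to detect whether their critical sections are deadlocked or blocked.
monitor() cannot be called on the watchdog's own thread, because if a monitored service's critical section is occupied, its monitor() method may take some time to return. The watchdog could then no longer guarantee that every check cycle lasts 30 s, so the foreground thread has to perform the check on its behalf.
addMonitor() adds every monitor to mMonitorChecker, the foreground thread's HandlerChecker. The mMonitors list of every other HandlerChecker is empty.
Once the watchdog's main loop is running, it calls scheduleCheckLocked() on every HandlerChecker in turn every 30 seconds. The foreground thread's HandlerChecker has a non-empty mMonitors list and must call each service's monitor() to check for deadlocks, so it is scheduled in every check cycle.
For the other HandlerCheckers, the code first checks whether the thread's message queue is polling, that is, idle. If it is, the previous message has been handled and the thread is waiting for the next one, so the message loop is certainly not blocked and this round is skipped without further checks.
If the message loop is not idle, the service's thread is handling a message and may be blocked. In that case the checker posts itself to the message queue with postAtFrontOfQueue, records the current uptime, and sets mCompleted to false to mark that a message has been posted and is waiting to be handled.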

    public void scheduleCheckLocked() {
        if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
            // If the target looper has recently been polling, then
            // there is no reason to enqueue our checker on it since that
            // is as good as it not being deadlocked. This avoid having
            // to do a context switch to check the thread. Note that we
            // only do this if mCheckReboot is false and we have no
            // monitors, since those would need to be executed at this point.
            mCompleted = true;
            return;
        }

        if (!mCompleted) {
            // we already have a check in flight, so no need
            return;
        }

        mCompleted = false;
        mCurrentMonitor = null;
        mStartTime = SystemClock.uptimeMillis();
        mHandler.postAtFrontOfQueue(this);
    }
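Posting the checker at the front of the queue matters: the check runs as soon as the message currently being handled finishes, instead of waiting behind everything already queued. The following sketch reproduces that ordering in plain Java. MiniLooper is a hypothetical stand-in for Android's Handler and Looper, built on a LinkedBlockingDeque; none of it is Android code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingDeque;

/** Hypothetical single-threaded message loop; a stand-in for Handler/Looper. */
final class MiniLooper {
    private final BlockingDeque<Runnable> queue = new LinkedBlockingDeque<>();

    MiniLooper() {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    queue.takeFirst().run();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "mini-looper");
        worker.setDaemon(true);
        worker.start();
    }

    void post(Runnable r) {
        queue.addLast(r);
    }

    void postAtFrontOfQueue(Runnable r) {
        queue.addFirst(r);
    }

    /** Returns the order in which a busy task, a queued task and a front-posted check ran. */
    static List<String> demoOrder() {
        MiniLooper looper = new MiniLooper();
        List<String> order = Collections.synchronizedList(new ArrayList<>());
        CountDownLatch busyStarted = new CountDownLatch(1);
        CountDownLatch releaseBusy = new CountDownLatch(1);
        CountDownLatch allDone = new CountDownLatch(3);

        looper.post(() -> {
            busyStarted.countDown();
            awaitQuietly(releaseBusy);
            order.add("busy");
            allDone.countDown();
        });
        awaitQuietly(busyStarted); // the loop is now inside the busy task
        looper.post(() -> { order.add("queued"); allDone.countDown(); });
        looper.postAtFrontOfQueue(() -> { order.add("check"); allDone.countDown(); });
        releaseBusy.countDown();
        awaitQuietly(allDone);
        return new ArrayList<>(order);
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println(demoOrder()); // [busy, check, queued]
    }
}
```

The check overtakes the queued task but still has to wait for the busy one, which is the situation the watchdog measures: the time until the check runs is the time the thread spent stuck in its current message.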

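If the thread's message queue is not blocked, postAtFrontOfQueue soon triggers the HandlerChecker's run method. For the foreground thread's HandlerChecker, run calls the monitor method of each monitored service, which locks its critical section and immediately releases it, to check for deadlock or blockage. For the other threads it only needs to set mCompleted to true to show that the message has been handled.

    @Override
    public void run() {
        final int size = mMonitors.size();
        for (int i = 0 ; i < size ; i++) {
            synchronized (Watchdog.this) {
                mCurrentMonitor = mMonitors.get(i);
            }
            mCurrentMonitor.monitor();
        }

        synchronized (Watchdog.this) {
            mCompleted = true;
            mCurrentMonitor = null;
        }
    }
}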
If a service's message loop is blocked, mCompleted stays false. In every check cycle the watchdog calls getCompletionStateLocked on each HandlerChecker in turn to check how long it has been waiting. If any service thread has not responded for 30 s, the watchdog dumps stacks early in preparation for a restart; if it has not responded for 60 s, the restart procedure begins.

public int getCompletionStateLocked() {
    if (mCompleted) {
        return COMPLETED;
    } else {
        long latency = SystemClock.uptimeMillis() - mStartTime;
        if (latency < mWaitMax/2) {
            return WAITING;
        } else if (latency < mWaitMax) {
            return WAITED_HALF;
        }
    }
    return OVERDUE;
}
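The thresholds are easiest to see with concrete numbers. The sketch below restates the same comparison as a pure function so that it runs without Android. The numeric values of the four constants are not shown in this article and are assumed here (0 to 3, so that a larger number means a worse state); the 60 000 ms limit corresponds to the 60 s timeout described in the text.

```java
final class CompletionStateSketch {
    // Assumed values: only their ordering (larger means worse) matters for this sketch.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Same comparison as getCompletionStateLocked, with the inputs passed in explicitly. */
    static int completionState(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    public static void main(String[] args) {
        long waitMax = 60_000;
        long[] samples = {0, 29_999, 30_000, 59_999, 60_000};
        for (long latency : samples) {
            System.out.println(latency + " ms -> " + completionState(false, latency, waitMax));
        }
    }
}
```

Both boundaries are inclusive on the worse side: exactly 30 000 ms already counts as WAITED_HALF and exactly 60 000 ms as OVERDUE.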

Watchdog main loop
After SystemServer calls the watchdog's start method, the watchdog runs a while loop on its own thread so that it performs a check every 30 s:

@Override
public void run() {
    boolean waitedHalf = false;
    while (true) {
        final ArrayList<HandlerChecker> blockedCheckers;
        final String subject;
        final boolean allowRestart;
        int debuggerWasConnected = 0;
        synchronized (this) {
            long timeout = CHECK_INTERVAL;
            // Make sure we (re)spin the checkers that have become idle within
            // this wait-and-check interval
            // (iterates over the HandlerCheckers for the foreground, UI, main and other key threads)
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                HandlerChecker hc = mHandlerCheckers.get(i);
                hc.scheduleCheckLocked();
            }
At the start of each check cycle the timeout timer is reset, and then checks are scheduled for the foreground thread, main thread, UI thread, I/O thread and display thread registered in the Watchdog constructor, as well as for the service threads registered through addThread.
To check whether a thread is blocked, the HandlerChecker posts a message with postAtFrontOfQueue if the thread's Looper is not idle, and the watchdog later checks whether that message has timed out without being handled.
After the messages have been posted, the watchdog sleeps for 30 s. Note that the time is computed with uptimeMillis(), which does not count the time the phone spends asleep. While the phone sleeps the system services sleep too and cannot respond to the watchdog's messages. If sleep time were counted, the watchdog would conclude on wake-up that a long time had passed and would kill the process by mistake.

            // NOTE: We use uptimeMillis() here because we do not want to increment the time we
            // wait while asleep. If the device is asleep then the thing that we are waiting
            // to timeout on is asleep as well and won't have a chance to run, causing a false
            // positive on when to kill things.
            long start = SystemClock.uptimeMillis();
            while (timeout > 0) {
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                try {
                    wait(timeout);
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                // CHECK_INTERVAL defaults to 30 s; this is the first wait. The watchdog
                // waits at most 1 min before it treats a checker as deadlocked.
                timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
            }
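The inner while loop is needed because wait(timeout) may return early, for example after a notify or an interrupt, so the remaining time is recomputed after every wake-up. A small sketch of the same pattern in plain Java follows. FullIntervalWait and waitFullInterval are hypothetical names, and System.nanoTime() is used as the clock only because SystemClock.uptimeMillis() is an Android API; it does not reproduce the sleep-time behavior described above.

```java
final class FullIntervalWait {
    /** Blocks on lock for at least intervalMillis, waiting again after early wake-ups. */
    static long waitFullInterval(Object lock, long intervalMillis) {
        long start = System.nanoTime();
        long remaining = intervalMillis;
        synchronized (lock) {
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    // The listing above logs the exception; this sketch simply keeps waiting.
                }
                remaining = intervalMillis - elapsedMillis(start);
            }
        }
        return elapsedMillis(start);
    }

    private static long elapsedMillis(long startNanos) {
        return (System.nanoTime() - startNanos) / 1_000_000;
    }

    public static void main(String[] args) {
        Object lock = new Object();
        Thread poker = new Thread(() -> {
            for (int i = 0; i < 5; i++) {
                synchronized (lock) {
                    lock.notifyAll(); // wake the waiter early
                }
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    return;
                }
            }
        }, "poker");
        poker.setDaemon(true);
        poker.start();
        System.out.println("waited " + waitFullInterval(lock, 100) + " ms");
    }
}
```

Without the loop, a single early wake-up would shorten the check cycle and could make a healthy thread look slower than it is.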

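When the 30-second wait is over, the watchdog checks whether the messages posted earlier have been handled. evaluateCheckerCompletionLocked iterates over all HandlerCheckers and returns the largest waitState. There are four wait states: COMPLETED means the message has been handled and the thread is not blocked; WAITING means the message has been pending for 0 to 29 seconds and the watchdog keeps waiting; WAITED_HALF means the message has been pending for 30 to 59 seconds, so the thread may be blocked and the stack traces of the system process are dumped through ActivityManagerService.dumpStackTraces in case the timeout does occur; OVERDUE means the message has been pending for 60 s or more, and the watchdog moves on to the next stage, dumping stack information and restarting the phone.

            final int waitState = evaluateCheckerCompletionLocked();
            if (waitState == COMPLETED) {
                // The monitors have returned; reset
                waitedHalf = false; // all services are healthy
                continue;
            } else if (waitState == WAITING) {
                // still waiting but within their configured intervals; back off and recheck
                continue;
            } else if (waitState == WAITED_HALF) {
                if (!waitedHalf) {
                    // We've waited half the deadlock-detection interval. Pull a stack
                    // trace and wait another half.
                    ArrayList<Integer> pids = new ArrayList<Integer>();
                    pids.add(Process.myPid());
                    ActivityManagerService.dumpStackTraces(true, pids, null, null,
                            NATIVE_STACKS_OF_INTEREST);
                    waitedHalf = true;
                }
                continue;
            }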
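evaluateCheckerCompletionLocked itself is not listed in this article. Based on the description that it returns the largest waitState, a plausible shape is a simple maximum over all checkers; the sketch below assumes exactly that, together with assumed constant values 0 to 3. The class WaitStateSketch and the method onCycle are hypothetical; onCycle follows the branches of the fragment above and returns a label for the action taken, to show that the halfway stack dump happens only once per incident.

```java
import java.util.Arrays;
import java.util.List;

final class WaitStateSketch {
    // Assumed values: only their ordering (larger means worse) matters.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    private boolean waitedHalf = false;

    /** Worst (largest) state over all checkers; COMPLETED when the list is empty. */
    static int evaluate(List<Integer> checkerStates) {
        int state = COMPLETED;
        for (int s : checkerStates) {
            state = Math.max(state, s);
        }
        return state;
    }

    /** Follows the branches of the loop for one check cycle and names the action taken. */
    String onCycle(int waitState) {
        if (waitState == COMPLETED) {
            waitedHalf = false;
            return "reset";
        } else if (waitState == WAITING) {
            return "keep waiting";
        } else if (waitState == WAITED_HALF) {
            if (!waitedHalf) {
                waitedHalf = true;
                return "dump stacks";
            }
            return "keep waiting";
        }
        return "overdue";
    }

    public static void main(String[] args) {
        WaitStateSketch loop = new WaitStateSketch();
        List<List<Integer>> cycles = Arrays.asList(
                Arrays.asList(COMPLETED, WAITING),
                Arrays.asList(COMPLETED, WAITED_HALF),
                Arrays.asList(WAITING, WAITED_HALF),
                Arrays.asList(COMPLETED, OVERDUE));
        for (List<Integer> states : cycles) {
            System.out.println(states + " -> " + loop.onCycle(evaluate(states)));
        }
    }
}
```

Because only the maximum is kept, one slow checker is enough to move the whole loop into the next stage, no matter how many other threads are healthy.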

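The watchdog timeout has now occurred, but evaluateCheckerCompletionLocked does not care which service is blocked; it only returns the largest waitState over all services. getBlockedCheckersLocked is therefore called to determine which checkers are blocked, and describeCheckersLocked describes why. This is the description of the blockage that shows up in dropbox. After that, the stacks of the system process and of the kernel are dumped in turn.

            // something is overdue!
            blockedCheckers = getBlockedCheckersLocked(); // collect the checkers that timed out
            subject = describeCheckersLocked(blockedCheckers); // build the description of the blockage
            allowRestart = mAllowRestart; // whether a restart is allowed
        }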
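Neither getBlockedCheckersLocked nor describeCheckersLocked is listed in this article. A rough sketch of what the two steps amount to is a filter followed by a join, shown below under stated assumptions: the Checker class, its fields, and the exact wording and separator of the messages are all hypothetical approximations rather than the framework's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BlockedReportSketch {
    /** Hypothetical, simplified checker: a name, an overdue flag and the monitor it is stuck in. */
    static final class Checker {
        final String name;
        final boolean overdue;
        final String currentMonitor; // null when the handler itself is stuck

        Checker(String name, boolean overdue, String currentMonitor) {
            this.name = name;
            this.overdue = overdue;
            this.currentMonitor = currentMonitor;
        }

        String describeBlockedState() {
            if (currentMonitor == null) {
                return "Blocked in handler on " + name;
            }
            return "Blocked in monitor " + currentMonitor + " on " + name;
        }
    }

    /** Keeps only the checkers that have exceeded their timeout. */
    static List<Checker> getBlockedCheckers(List<Checker> all) {
        List<Checker> blocked = new ArrayList<>();
        for (Checker c : all) {
            if (c.overdue) {
                blocked.add(c);
            }
        }
        return blocked;
    }

    /** Joins the individual descriptions into the one-line subject of the report. */
    static String describeCheckers(List<Checker> blocked) {
        StringBuilder builder = new StringBuilder();
        for (Checker c : blocked) {
            if (builder.length() > 0) {
                builder.append(", ");
            }
            builder.append(c.describeBlockedState());
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        List<Checker> all = Arrays.asList(
                new Checker("main thread", false, null),
                new Checker("foreground thread", true, "com.example.SomeService"),
                new Checker("ui thread", true, null));
        System.out.println(describeCheckers(getBlockedCheckers(all)));
    }
}
```

Recording the monitor that was executing when the timeout hit is what allows the report to name a specific service instead of just the foreground thread.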


        // If we got here, that means that the system is most likely hung.
        // First collect stack traces from all threads of the system process.
        // Then kill this process so that the system will restart.
        EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

        ArrayList<Integer> pids = new ArrayList<Integer>();
        pids.add(Process.myPid());
        if (mPhonePid > 0) pids.add(mPhonePid);
        // Pass !waitedHalf so that just in case we somehow wind up here without having
        // dumped the halfway stacks, we properly re-initialize the trace file.
        final File stack = ActivityManagerService.dumpStackTraces(
                !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);

        // Give some extra time to make sure the stack traces get written.
        // The system's been hanging for a minute, another second or two won't hurt much.
        SystemClock.sleep(2000);

        // Pull our own kernel thread stacks as well if we're configured for that
        if (RECORD_KERNEL_THREADS) {
            dumpKernelStackTraces();
        }

        // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
        doSysRq('w');
        doSysRq('l');

        // Try to add the error to the dropbox, but assuming that the ActivityManager
        // itself may be deadlocked. (which has happened, causing this statement to
        // deadlock and the watchdog as a whole to be ineffective)
        Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
            public void run() {
                mActivity.addErrorToDropBox(
                        "watchdog", null, "system_server", null, null,
                        subject, null, stack, null);
            }
        };
        dropboxThread.start();
        try {
            dropboxThread.join(2000); // wait up to 2 seconds for it to return.
        } catch (InterruptedException ignored) {}

        IActivityController controller;
        synchronized (this) {
            controller = mController;
        }
        if (controller != null) {
            Slog.i(TAG, "Reporting stuck state to activity controller");
            try {
                Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                // 1 = keep waiting, -1 = kill system
                int res = controller.systemNotResponding(subject);
                if (res >= 0) {
                    Slog.i(TAG, "Activity controller requested to coninue to wait");
                    waitedHalf = false;
                    continue;
                }
            } catch (RemoteException e) {
            }
        }

        // Only kill the process if the debugger is not attached.
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        if (debuggerWasConnected >= 2) {
            Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
        } else if (debuggerWasConnected > 0) {
            Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
        } else if (!allowRestart) {
            Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
        } else {
            Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
            for (int i=0; i<blockedCheckers.size(); i++) {
                Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                StackTraceElement[] stackTrace
                        = blockedCheckers.get(i).getThread().getStackTrace();
                for (StackTraceElement element: stackTrace) {
                    Slog.w(TAG, " at " + element);
                }
            }
            Slog.w(TAG, "*** GOODBYE!");
            Process.killProcess(Process.myPid());
            System.exit(10);
        }

        waitedHalf = false;
    }
}

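In this last stage the watchdog writes the error to dropbox and then checks whether the debugger connected through the activity controller can handle this watchdog timeout. If the activity controller does not ask for a restart, the timeout is ignored and the watchdog loop starts over from the beginning. Otherwise, unless a debugger is attached or restarts are not allowed, the watchdog kills SystemServer and the phone restarts.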
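The order of these final checks can be condensed into one small decision function. The sketch below restates the branch order of the listing in plain Java: the enum Action, the class name and the use of a nullable Integer for "no activity controller set, or the call failed" are hypothetical simplifications, and the two debugger branches of the listing are merged because both of them skip the kill.

```java
final class KillDecisionSketch {
    enum Action { KEEP_WAITING, SKIP_DEBUGGER, SKIP_NOT_ALLOWED, KILL }

    /**
     * controllerResult is null when no activity controller is set or the call failed;
     * otherwise it is the controller's answer, where a value >= 0 means "keep waiting".
     */
    static Action decide(Integer controllerResult, int debuggerWasConnected, boolean allowRestart) {
        if (controllerResult != null && controllerResult >= 0) {
            return Action.KEEP_WAITING;
        }
        if (debuggerWasConnected > 0) {
            return Action.SKIP_DEBUGGER;
        }
        if (!allowRestart) {
            return Action.SKIP_NOT_ALLOWED;
        }
        return Action.KILL;
    }

    public static void main(String[] args) {
        System.out.println(decide(null, 0, true));  // KILL
        System.out.println(decide(1, 0, true));     // KEEP_WAITING
        System.out.println(decide(-1, 2, true));    // SKIP_DEBUGGER
        System.out.println(decide(null, 0, false)); // SKIP_NOT_ALLOWED
    }
}
```

The controller is consulted first, so a test harness that answers "keep waiting" overrides everything else, while the kill itself is the fallback when nothing objects.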
