Let's Talk About Android Basics: Watchdog
Tags (space-separated): Android interview knowledge
Sources:
How WatchDog Works
Related files:
/frameworks/base/services/core/java/com/android/server/Watchdog.java
/frameworks/base/services/java/com/android/server/SystemServer.java
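Watchdog sequence diagram:
Background
Android has a hardware watchdog (HW Watchdog) that checks whether the hardware is working properly, and a System Server Watchdog (SWT) that checks whether critical system services are working properly.
The watchdog mechanism is widely used on Linux systems: the system must "feed the dog" within a specified time, otherwise the watchdog times out and takes a recovery action such as forcibly resetting the system.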
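As a rough illustration of the feed-the-dog idea, the sketch below records the time of the last feed and reports a timeout once the deadline has passed. The class FeedWatchdog and its methods are hypothetical and are not part of Android or Linux; timestamps are passed in explicitly so the behavior is easy to check.

```java
// Hypothetical sketch of a "feed the dog" timer; not an Android or Linux API.
final class FeedWatchdog {
    private final long timeoutMillis;
    private long lastFedMillis;

    FeedWatchdog(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastFedMillis = nowMillis;
    }

    // Called periodically by a healthy system to push the deadline back.
    void feed(long nowMillis) {
        lastFedMillis = nowMillis;
    }

    // True once the time since the last feed has reached the timeout.
    boolean isExpired(long nowMillis) {
        return nowMillis - lastFedMillis >= timeoutMillis;
    }
}
```

A real watchdog would act on isExpired() by resetting or restarting something; the sketch only reports the condition.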
Watchdog initialization
On Android, the Watchdog is initialized during boot, from SystemServer.
SystemServer.java
// SystemServer.java
private void startOtherServices() {
    // ...
    // 1. Obtain the Watchdog singleton instance
    final Watchdog watchdog = Watchdog.getInstance();
    // 2. Initialize the watchdog
    watchdog.init(context, mActivityManagerService);
    // 3. Start the watchdog thread
    Watchdog.getInstance().start();
    // ...
}
Creating the Watchdog instance
// Watchdog.java
public static Watchdog getInstance() {
    if (sWatchdog == null) {
        sWatchdog = new Watchdog();
    }

    return sWatchdog;
}

final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();

private Watchdog() {
    super("watchdog");
    // ...
    mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
            "foreground thread", DEFAULT_TIMEOUT);
    mHandlerCheckers.add(mMonitorChecker);
    // Add checker for main thread. We only do a quick check since there
    // can be UI running on the thread.
    mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
            "main thread", DEFAULT_TIMEOUT));
    // Add checker for shared UI thread.
    mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
            "ui thread", DEFAULT_TIMEOUT));
    // And also check IO thread.
    mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
            "i/o thread", DEFAULT_TIMEOUT));
    // And the display thread.
    mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
            "display thread", DEFAULT_TIMEOUT));

    // Initialize monitor for Binder threads.
    addMonitor(new BinderThreadMonitor());
}
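Watchdog creates its instance using the singleton pattern. Watchdog extends Thread, and the thread it creates is named "watchdog". The mHandlerCheckers list holds the HandlerChecker objects for the fg (foreground), main, ui, io, and display threads.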
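The singleton-plus-Thread shape described here can be sketched in plain Java as follows. DemoWatchdog is a hypothetical stand-in, not the AOSP class, and the synchronized keyword on getInstance() is an addition of this sketch rather than something the excerpt shows.

```java
// Hypothetical sketch of the lazy singleton shape; DemoWatchdog is not the AOSP class.
final class DemoWatchdog extends Thread {
    private static DemoWatchdog sInstance;

    // Private constructor: the only way to obtain an instance is getInstance().
    private DemoWatchdog() {
        super("watchdog"); // sets the thread name
    }

    // synchronized is an addition of this sketch to make lazy creation thread-safe.
    static synchronized DemoWatchdog getInstance() {
        if (sInstance == null) {
            sInstance = new DemoWatchdog();
        }
        return sInstance;
    }

    @Override
    public void run() {
        // The real watchdog loops forever here; the sketch does nothing.
    }
}
```

Every call to DemoWatchdog.getInstance() returns the same object, so calling start() on it launches exactly one thread named "watchdog".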
Initializing the watchdog
// Watchdog.java
public void init(Context context, ActivityManagerService activity) {
    mResolver = context.getContentResolver();
    mActivity = activity;

    context.registerReceiver(new RebootRequestReceiver(),
            new IntentFilter(Intent.ACTION_REBOOT),
            android.Manifest.permission.REBOOT, null);
}
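init() calls registerReceiver() to listen for the ACTION_REBOOT broadcast. When such a broadcast arrives with a non-zero "nowait" extra, the receiver reboots the system.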
final class RebootRequestReceiver extends BroadcastReceiver {
    @Override
    public void onReceive(Context c, Intent intent) {
        if (intent.getIntExtra("nowait", 0) != 0) {
            rebootSystem("Received ACTION_REBOOT broadcast");
            return;
        }
        Slog.w(TAG, "Unsupported ACTION_REBOOT broadcast: " + intent);
    }
}

void rebootSystem(String reason) {
    Slog.i(TAG, "Rebooting system because: " + reason);
    IPowerManager pms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE);
    try {
        pms.reboot(false, reason, false);
    } catch (RemoteException ex) {
    }
}
The reboot itself is performed by calling PowerManagerService.reboot().
Starting the watchdog
// Watchdog.java
@Override
public void run() {
    // ...
}
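Watchdog extends Thread, so calling start() causes run() to be invoked on the new thread.
How Watchdog works
Watchdog does its work in run(). Its main job is to detect whether critical threads and services have timed out, to print diagnostic information when they have, and to restart the system process when certain conditions are met.
Let's go through run() in detail: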
// Watchdog.java
@Override
public void run() {
    boolean waitedHalf = false;
    while (true) {
        final ArrayList<HandlerChecker> blockedCheckers;
        final String subject;
        final boolean allowRestart;
        int debuggerWasConnected = 0;
        synchronized (this) {
            // CHECK_INTERVAL = 30 s
            long timeout = CHECK_INTERVAL;
            // Make sure we (re)spin the checkers that have become idle within
            // this wait-and-check interval
            // 1. Schedule every checker, which records its mStartTime
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                HandlerChecker hc = mHandlerCheckers.get(i);
                hc.scheduleCheckLocked();
            }

            if (debuggerWasConnected > 0) {
                debuggerWasConnected--;
            }

            // 2. Wait 30 s
            long start = SystemClock.uptimeMillis();
            while (timeout > 0) {
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                try {
                    wait(timeout);
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
            }
            // 3. Evaluate the state of the checkers
            final int waitState = evaluateCheckerCompletionLocked();
            if (waitState == COMPLETED) {
                // The monitors have returned; reset
                waitedHalf = false;
                continue;
            } else if (waitState == WAITING) {
                // still waiting but within their configured intervals; back off and recheck
                continue;
            } else if (waitState == WAITED_HALF) {
                if (!waitedHalf) {
                    // We've waited half the deadlock-detection interval. Pull a stack
                    // trace and wait another half.
                    ArrayList<Integer> pids = new ArrayList<Integer>();
                    pids.add(Process.myPid());
                    ActivityManagerService.dumpStackTraces(true, pids, null, null,
                            getInterestingNativePids());
                    waitedHalf = true;
                }
                continue;
            }

            // 4. At least one checker is overdue; collect the blocked checkers.
            // something is overdue!
            blockedCheckers = getBlockedCheckersLocked();
            subject = describeCheckersLocked(blockedCheckers);
            allowRestart = mAllowRestart;
        }
1. scheduleCheckLocked()
The inner class HandlerChecker implements the Runnable interface. The source describes the class as follows:
/**
* Used for checking status of handle threads and scheduling monitor callbacks.
*/
public void scheduleCheckLocked() {
    if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
        mCompleted = true;
        return;
    }

    if (!mCompleted) {
        // we already have a check in flight, so no need
        return;
    }

    mCompleted = false;
    mCurrentMonitor = null;
    mStartTime = SystemClock.uptimeMillis();
    mHandler.postAtFrontOfQueue(this);
}

@Override
public void run() {
    final int size = mMonitors.size();
    for (int i = 0 ; i < size ; i++) {
        synchronized (Watchdog.this) {
            mCurrentMonitor = mMonitors.get(i);
        }
        mCurrentMonitor.monitor();
    }

    synchronized (Watchdog.this) {
        mCompleted = true;
        mCurrentMonitor = null;
    }
}
So how does it check the state of the handler threads?
It computes the difference between mStartTime and the current time and compares it with mWaitMax to determine the state of the thread.
The mHandler.postAtFrontOfQueue(this) call at the end of scheduleCheckLocked() inserts the HandlerChecker at the head of the monitored thread's MessageQueue. When that thread's Looper dequeues the message, it calls back the HandlerChecker's run() method.
run() iterates over all registered Monitor objects and calls monitor() on each. If, for whatever reason, the monitored thread cannot finish run() in time (because the message is never dequeued or a monitor() call blocks), mCompleted stays false and the watchdog fires.
If other messages keep being posted with postAtFrontOfQueue(), the HandlerChecker may never get a chance to run. Alternatively, each monitor may take a little time and the total may add up to more than 1 minute. Both are unusual causes of a watchdog timeout.
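The post-at-front mechanism can be modeled without Android. In the sketch below a java.util.Deque stands in for the MessageQueue and the monitor list is omitted. QueueChecker and its method names are hypothetical, and the Deque itself is not thread-safe, so this is a single-threaded model only.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical plain-Java model of HandlerChecker; a Deque stands in for the MessageQueue.
final class QueueChecker implements Runnable {
    private final Deque<Runnable> queue;
    private boolean completed = true;
    private long startTime;

    QueueChecker(Deque<Runnable> queue) {
        this.queue = queue;
    }

    // Mirrors scheduleCheckLocked(): skip if a check is in flight, else post at the front.
    synchronized void scheduleCheck(long nowMillis) {
        if (!completed) {
            return; // a previous check has not run yet
        }
        completed = false;
        startTime = nowMillis;
        queue.addFirst(this); // analogous to postAtFrontOfQueue(this)
    }

    // Runs on the monitored thread when it dequeues the checker.
    @Override
    public synchronized void run() {
        completed = true;
    }

    synchronized boolean isCompleted() {
        return completed;
    }

    // How long the current check has been pending, or 0 if none is pending.
    synchronized long pendingMillis(long nowMillis) {
        return completed ? 0 : nowMillis - startTime;
    }
}
```

If the monitored thread never drains its queue, isCompleted() stays false and pendingMillis() keeps growing, which is the condition the watchdog thread compares against the maximum wait time.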
2. Wait 30 s before continuing
3. evaluateCheckerCompletionLocked()
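The wait loop in run() recomputes the remaining time after every wakeup, so an early return from wait() does not shorten the interval. A minimal plain-Java sketch of that pattern follows. IntervalWait is a hypothetical helper, and it uses System.nanoTime() where the original uses SystemClock.uptimeMillis().

```java
// Hypothetical helper showing the "wait out the full interval" loop from run().
final class IntervalWait {
    // Waits on lock until at least intervalMillis have elapsed; returns the elapsed time.
    static long waitFully(Object lock, long intervalMillis) {
        final long start = System.nanoTime();
        long remaining = intervalMillis;
        synchronized (lock) {
            while (remaining > 0) {
                try {
                    lock.wait(remaining); // may return early (notify or spurious wakeup)
                } catch (InterruptedException e) {
                    // Like the original loop, keep waiting out the interval.
                }
                remaining = intervalMillis - elapsedMillis(start);
            }
        }
        return elapsedMillis(start);
    }

    private static long elapsedMillis(long startNanos) {
        return (System.nanoTime() - startNanos) / 1_000_000L;
    }
}
```

Calling IntervalWait.waitFully(new Object(), 50) returns a value of at least 50, regardless of how many times wait() wakes up early.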
static final int COMPLETED = 0;
static final int WAITING = 1;
static final int WAITED_HALF = 2;
static final int OVERDUE = 3;

private int evaluateCheckerCompletionLocked() {
    int state = COMPLETED;
    for (int i=0; i<mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        state = Math.max(state, hc.getCompletionStateLocked());
    }
    return state;
}

public int getCompletionStateLocked() {
    if (mCompleted) {
        return COMPLETED;
    } else {
        long latency = SystemClock.uptimeMillis() - mStartTime;
        if (latency < mWaitMax/2) {
            return WAITING;
        } else if (latency < mWaitMax) {
            return WAITED_HALF;
        }
    }
    return OVERDUE;
}
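The evaluation rule is straightforward. A checker is in one of four states: COMPLETED (the check has finished), WAITING (pending for less than mWaitMax/2), WAITED_HALF (pending for at least mWaitMax/2 but less than mWaitMax), or OVERDUE (pending for mWaitMax or longer). The overall state is the worst state among all checkers.
When the overall state first reaches WAITED_HALF, run() calls ActivityManagerService.dumpStackTraces() to dump the stack traces of the relevant processes.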
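To make the thresholds concrete, here is a standalone sketch of the same evaluation with the latency passed in as a parameter. CheckerStates, stateOf, and worstOf are hypothetical names; the constants mirror the values shown in the source.

```java
// Hypothetical standalone version of the state evaluation; constants mirror the source.
final class CheckerStates {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    // State of one checker given its completion flag and how long it has been pending.
    static int stateOf(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    // The overall state is the worst (largest) state among all checkers.
    static int worstOf(int... states) {
        int worst = COMPLETED;
        for (int s : states) {
            worst = Math.max(worst, s);
        }
        return worst;
    }
}
```

With a 60 s maximum wait, a check pending for 10 s is WAITING, one pending for 30 s is WAITED_HALF, and one pending for 60 s is OVERDUE.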
4. A checker is overdue
Here is the rest of run():
        ArrayList<Integer> pids = new ArrayList<>();
        pids.add(Process.myPid());
        if (mPhonePid > 0) pids.add(mPhonePid);
        // Dump the stacks a second time; if a dump was already taken at the
        // half-way point, append to it instead of clearing it
        final File stack = ActivityManagerService.dumpStackTraces(
                !waitedHalf, pids, null, null, getInterestingNativePids());

        SystemClock.sleep(2000);
        // Pull our own kernel thread stacks as well if we're configured for that
        if (RECORD_KERNEL_THREADS) {
            // Dump the kernel stacks
            dumpKernelStackTraces();
        }
        // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
        doSysRq('w');
        doSysRq('l');
        // Write the report to DropBox under /data/system/dropbox
        Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                public void run() {
                    mActivity.addErrorToDropBox(
                            "watchdog", null, "system_server", null, null,
                            subject, null, stack, null);
                }
            };
        dropboxThread.start();
        try {
            dropboxThread.join(2000); // wait up to 2 seconds for it to return.
        } catch (InterruptedException ignored) {}
        IActivityController controller;
        synchronized (this) {
            controller = mController;
        }
        if (controller != null) {
            Slog.i(TAG, "Reporting stuck state to activity controller");
            try {
                Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                // 1 = keep waiting, -1 = kill system
                int res = controller.systemNotResponding(subject);
                if (res >= 0) {
                    Slog.i(TAG, "Activity controller requested to coninue to wait");
                    waitedHalf = false;
                    continue;
                }
            } catch (RemoteException e) {
            }
        }
        // Only kill the process if the debugger is not attached.
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        if (debuggerWasConnected >= 2) {
            Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
        } else if (debuggerWasConnected > 0) {
            Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
        } else if (!allowRestart) {
            Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
        } else {
            Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
            // Print the stack trace of every blocked thread
            for (int i=0; i<blockedCheckers.size(); i++) {
                Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                StackTraceElement[] stackTrace
                        = blockedCheckers.get(i).getThread().getStackTrace();
                for (StackTraceElement element: stackTrace) {
                    Slog.w(TAG, "    at " + element);
                }
            }
            Slog.w(TAG, "*** GOODBYE!");
            // Kill the system_server process
            Process.killProcess(Process.myPid());
            System.exit(10);
        }

        waitedHalf = false;
    }
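Killing the system_server process causes the zygote process to kill itself, which in turn makes init restart Zygote. This is what shows up as a framework restart on the phone.
Processes monitored by Watchdog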
// Which native processes to dump into dropbox's stack traces
public static final String[] NATIVE_STACKS_OF_INTEREST = new String[] {
    "/system/bin/audioserver",
    "/system/bin/cameraserver",
    "/system/bin/drmserver",
    "/system/bin/mediadrmserver",
    "/system/bin/mediaserver",
    "/system/bin/sdcard",
    "/system/bin/surfaceflinger",
    "media.extractor", // system/bin/mediaextractor
    "media.codec", // vendor/bin/hw/android.hardware.media.omx@1.0-service
    "com.android.bluetooth", // Bluetooth service
};

public static final List<String> HAL_INTERFACES_OF_INTEREST = Arrays.asList(
    "android.hardware.audio@2.0::IDevicesFactory",
    "android.hardware.bluetooth@1.0::IBluetoothHci",
    "android.hardware.camera.provider@2.4::ICameraProvider",
    "android.hardware.graphics.composer@2.1::IComposer",
    "android.hardware.vr@1.0::IVr",
    "android.hardware.media.omx@1.0::IOmx"
);
Monitoring synchronization locks
Every system service that Watchdog can monitor implements the Watchdog.Monitor interface and its monitor() method, which runs on the android.fg thread. The main classes in the system that implement this interface are:
- ActivityManagerService
- WindowManagerService
- InputManagerService
- PowerManagerService
- NetworkManagementService
- MountService
- NativeDaemonConnector
- BinderThreadMonitor
- MediaProjectionManagerService
- MediaRouterService
- MediaSessionService
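A monitor() implementation typically does nothing more than acquire and release the service's lock, so it returns quickly unless some other thread is holding that lock. The sketch below illustrates the idea in plain Java. DemoMonitor is a local stand-in for Watchdog.Monitor, and LockMonitor and MonitorProbe are hypothetical helpers, not AOSP classes.

```java
// Hypothetical sketch; DemoMonitor is a local stand-in for Watchdog.Monitor.
interface DemoMonitor {
    void monitor();
}

final class LockMonitor implements DemoMonitor {
    private final Object lock;

    LockMonitor(Object lock) {
        this.lock = lock;
    }

    // Returns only once the lock can be acquired; blocks while another thread holds it.
    @Override
    public void monitor() {
        synchronized (lock) {
            // Nothing to do: acquiring the lock is the whole check.
        }
    }
}

final class MonitorProbe {
    // Runs monitor() on a daemon thread and reports whether it finished within
    // timeoutMillis (which must be positive, since join(0) would wait forever).
    static boolean completesWithin(DemoMonitor m, long timeoutMillis) {
        Thread t = new Thread(m::monitor, "monitor-probe");
        t.setDaemon(true);
        t.start();
        try {
            t.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
        return !t.isAlive();
    }
}
```

When the lock is free, completesWithin() returns true almost immediately. While another thread holds the lock, monitor() blocks and completesWithin() returns false, which is the situation the watchdog reports as a blocked monitor.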
Summary
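- Watchdog is a thread named "watchdog" that runs in the system_server process;
- While Watchdog is running, a block that lasts longer than 1 minute triggers the watchdog, which kills system_server and causes the framework layer to restart;
- mHandlerCheckers is the list of all HandlerChecker objects, including the handlers of the foreground, main, ui, i/o, and display threads;
- mMonitorChecker.mMonitors holds all Monitors that Watchdog is currently watching; all of these monitors run on the foreground thread.
There are two ways to put something under Watchdog monitoring:
- addThread(): monitors a Handler thread, with a default timeout of 60 s. This kind of timeout usually means the corresponding handler thread is processing messages slowly;
- addMonitor(): monitors a service that implements the Watchdog.Monitor interface. This kind of timeout can mean that the "android.fg" thread is processing messages slowly, or that monitor() cannot acquire its lock.
In the following cases, system_server is not killed even though the Watchdog has fired:
- monkey: an IActivityController is set and intercepts the systemNotResponding event, as monkey does;
- hang: the am hang command was executed, so there is no restart;
- debugger: a debugger is connected, so there is no restart.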
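The three exceptions above correspond to the branch order at the end of run(). The sketch below condenses that order into one function. KillDecision, Action, and decide are hypothetical names, and mapping the am hang case to the allowRestart flag is an assumption based on the "Restart not allowed" branch in the source.

```java
// Hypothetical condensation of the end of run(); names are illustrative, not AOSP APIs.
final class KillDecision {
    enum Action { KEEP_WAITING, SKIP_KILL, KILL_SYSTEM_SERVER }

    // controllerResult: null if no IActivityController is set, otherwise its return
    // value (>= 0 means keep waiting). debuggerWasConnected > 0 means a debugger is
    // attached now or was attached recently.
    static Action decide(Integer controllerResult, int debuggerWasConnected,
                         boolean allowRestart) {
        if (controllerResult != null && controllerResult >= 0) {
            return Action.KEEP_WAITING;   // e.g. monkey intercepts systemNotResponding
        }
        if (debuggerWasConnected > 0) {
            return Action.SKIP_KILL;      // debugger attached now or recently
        }
        if (!allowRestart) {
            return Action.SKIP_KILL;      // restart disabled (assumed to be the am hang case)
        }
        return Action.KILL_SYSTEM_SERVER;
    }
}
```

Only when no controller asks to keep waiting, no debugger is involved, and restarts are allowed does the function return KILL_SYSTEM_SERVER.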