MTK——Android watchdog with Hang Detect_mtk watchdog-优快云博客

本文链接：https://blog.youkuaiyun.com/qq_53884337/article/details/143576074

MTK 设计hang detect 来弥补 system server watchdog 失效, 而kernel scheduler 还能基本运行环境下的死机问题.

目前默认启用了运行时监控,以及关机流程监控.开机监控默认没有启用.

Android watchdog with Hang Detect run方法的实现：

//in watchdog.run 
if (exceptionHang != null) {
     exceptionHang.WDTMatterJava(300); //watchdog normal kick
 }
 
```

 Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
 WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
 Slog.w(TAG, "*** GOODBYE!");

 if (!Build.IS_USER && isCrashLoopFound()
 && !WatchdogProperties.should_ignore_fatal_count().orElse(false)) {
     breakCrashLoop();
 }
 exceptionHang.WDTMatterJava(330); // watchdog restart

```

 if (halfWatchdog) {
     if (exceptionHang != null) {
         exceptionHang.WDTMatterJava(360); // watchdog 检测到存在task blocked 超过30s(SWT第一次dump)
         exceptionHang.switchFtrace(3);
 }

```

 Slog.e(TAG, "**SWT happen **" + subject);
 if (exceptionHang != null) {
     exceptionHang.switchFtrace(2);
 }
 sfLog = (mSfHang && subject.isEmpty()) ? "surfaceflinger hang." : "";
 EventLog.writeEvent(EventLogTags.WATCHDOG, sfLog.isEmpty() ? subject : sfLog);
 if (exceptionHang != null) {
     exceptionHang.WDTMatterJava(420); //SWT happen，存在blocked 超过60s的task
 }

HANG_NORMAL    exceptionHWT.WDTMatterJava(300)  -> 正常情况下watchdog kick
HANG_WATEDHALF exceptionHWT.WDTMatterJava(360)  -> watchdog已经监测到有task blocked 超过30s，(SWT first dump)
HANG_DUE 420   exceptionHWT.WDTMatterJava(420)  -> watchdog已经监测到有task blocked 超过60s，触发SWT (SWT happen)
HANG_KILL 330  exceptionHWT.WDTMatterJava(330)  -> SWT dump 结束，准备restart(android 上层重启，并非整机)

在驱动中使用这些数值作为Hang Detect的超时时间，也就是，Hang Detect driver认为watchdog应该可以正常循环，watchdog循环中的每个阶段会告知driver其状态，若watchdog循环中发生阻塞，那么driver将以被watchdog告知的状态作为超时时间来等待watchdog从阻塞中恢复，若在超时时间内无法恢复，则到期将触发Hang Detect异常，表示Hang Detect driver监测到Hang的发生。

例如：

exceptionHWT.WDTMatterJava(300)   hang detect driver收到这个意味着watchdog必须要在10*30S(hang detect driver 30s检查一次)内再次kick，否则触发hang detect异常

Hang Detect机制分层介绍：

1）framework

framework层包含带有Hang Detect code的Android watchdog，其run方法中通过ExceptionLog类的WDTMatterJava方法调用到native层，ExceptionLog的java实现在

frameworks/base/core/java/com/mediatek/aee/ExceptionLog.java中。
2）native

native层是ExceptionLog的jni实现，code在

vendor/mediatek/proprietary/frameworks/base/core/java/com/mediatek/aee/jni/com_mediatek_aee_exceptionlog.cpp中；ExceptionLog类的WDTMatterJava方法通过native层操作/dev/RT_Monitor，实现与kernel层的通信。

3）kernel

kernel层中包含Hang Detect的driver实现，最主要的部分是实现Hang Detect的deamon thread，用于监控Hang是否发生，其code在

drivers/misc/mediatek/monitor_hang/hang_detect.c中。这一部分重点介绍Hang Detect的deamon thread的执行体函数hang_detect_thread。

hang detect

目前有开发两个feature：

一个用来监控system server 的watchdog 行为是否异常.
一个用来监控点亮屏幕的操作是否能在规定的时间内完成，因为此feature 存在一些误报的现象，目前还没有开启.

Hang Detect逻辑上并不难，主要有两个方法 : monitor_hang_ioctl、hang_detect_thread

首先，注册了一个device, /dev/RT_Monitor, 上层可以通过ioctl 和它进行沟通.

static long monitor_hang_ioctl(struct file *file, unsigned int cmd,unsigned long arg)
{
     int ret = 0;
     static long long monitor_status;
     void __user *argp = (void __user *)arg;
     char name[TASK_COMM_LEN] = {0};

     if (cmd == HANG_KICK) { //接受来自system server watchdog exceptionHWT.WDTMatterJava（XXX）的kick
          ret = compare_cmdline();
          if (ret) {
              pr_info(" invalid process kick RT_Monitor!");
              return -EFAULT;
          }
         pr_info("hang_detect HANG_KICK ( %d)\n", (int)arg);
         monitor_hang_kick((int)arg);
         return ret;
     }

     if ((cmd == HANG_SET_SF_STATE) && //SFWatchdog 来set/get pending time的cmd
       (!strncmp(current->comm, "surfaceflinger", 10) ||
       !strncmp(current->comm, "SWWatchDog", 10))) {
            if (copy_from_user(&monitor_status, argp, sizeof(long long)))
                 ret = -EFAULT;
            return ret;
     } else if (cmd == HANG_GET_SF_STATE) {
            if (copy_to_user(argp, &monitor_status, sizeof(long long)))
                 ret = -EFAULT;
            return ret;
     }

#ifdef CONFIG_MTK_HANG_DETECT_DB
     if (cmd == HANG_SET_REBOOT) { //监控init关机流程
         reboot_flag = true;
         hang_detect_counter = 5;
         hd_timeout = 5;
         hd_detect_enabled = true;
         pr_info("hang_detect: %s set reboot command.\n", current->comm);
         return ret;
     }
#endif

     if (cmd == HANG_ADD_WHITE_LIST) { //添加、删除hang监控的白名单，只有白名单内的task可以被监控到，目前有system_server，init，Kpoc(关机充电)
         if (copy_from_user(name, argp, TASK_COMM_LEN - 1))
             ret = -EFAULT;
         ret = add_white_list(name);
         pr_info("hang_detect: add white list %s status %d.\n",name, ret);
         return ret;
     }

     if (cmd == HANG_DEL_WHITE_LIST) {
         if (copy_from_user(name, argp, TASK_COMM_LEN - 1))
             ret = -EFAULT;
         ret = del_white_list(name);
         pr_info("hang_detect: del white list %s status %d.\n",
         name, ret);
         return ret;
     }

     return ret;
}

重点关注HANG_KICK当中 monitor_hang_kick 这个方法，这里涉及到hang detect当中两个最主要参数(hd_detect_enabled、hang_detect_counter )的变更

hd_detect_enabled 指示当前hang detect 是否开启， hang_detect_counter 为计时器计数. hang_detect thread 每隔30s (HD_INTER) 将其减一

void monitor_hang_kick(int lParam)
{
     if (reboot_flag) {
         pr_info("[Hang_Detect] in reboot flow.\n");
         return;
     }

     if (lParam == 0) { //hd_detect_enabled = 0 关闭hang detect
         hd_detect_enabled = 0;
         hang_detect_counter = hd_timeout;
         pr_info("[Hang_Detect] hang_detect disabled\n");
     } else if (lParam > 0) {

         /* lParem=0x1000|timeout,only set in aee call when NE in
         * system_server so only change hang_detect_counter when
         * call from AEE
         * Others ioctl, will change
         * hd_detect_enabled & hang_detect_counter
         */
      
         if (lParam & 0x1000) {
             hang_detect_counter = hd_timeout = ((long)(lParam & 0x0fff) + HD_INTER - 1) / (HD_INTER); //AEE kick,只会改变hang_detect_counter，不会涉及hd_detect_enabled 
         } else {
             hd_detect_enabled = 1;
             hang_detect_counter = hd_timeout = ((long)lParam + HD_INTER - 1) / (HD_INTER); // SSWatchdog kick,会改变hang_detect_counter，也会涉及hd_detect_enabled
         } 

         if (hd_timeout < 3) { //hang detect的最小检测时长
             /* hang detect min timeout is 10 (5min) */
             hang_detect_counter = 3;
             hd_timeout = 3;
         }
         pr_info("[Hang_Detect] hang_detect enabled %d\n", hd_timeout);
     }
     reset_hang_info();
}

接下来就是hang detect的主要工作流程：hang_detect_thread

其设计思想非常简单，即上层通过ioctl 访问此device, 设置hang_detect_counter, 即告知hang detect, 预计会在hang_detect_counter * 30s 之内再次来tick 您, 假如我没有在这个时间内tick 您，那么就意味着我已经牺牲了, 您就发动暴动, 重启手机.

对应hang_detect_thread 的实现就这个循环检测和操作hang_detect_counter.

static int hang_detect_thread(void *arg)
{
     /* unsigned long flags; */
     struct sched_param param = {
         .sched_priority = 99
     };
     struct task_struct *hd_thread;

     sched_setscheduler(current, SCHED_FIFO, &param);
     reset_hang_info();
     msleep(120 * 1000);
     pr_debug("[Hang_Detect] hang_detect thread starts.\n");

#ifdef BOOT_UP_HANG
     hd_timeout = 9; //刚启动时的初始count
     hang_detect_counter = 9;
     hd_detect_enabled = true;
#endif

     while (1) {
         pr_info("[Hang_Detect] hang_detect thread counts down %d:%d, status %d.\n", hang_detect_counter, hd_timeout, hd_detect_enabled);
#ifdef BOOT_UP_HANG
         if (hd_detect_enabled)
#else
         if (hd_detect_enabled && check_white_list()) //检查白名单
#endif
         {
             if (hd_hang_poll && (hang_detect_counter == 2)) { //这个时候恢复的希望不大了，提前poll AEE去抓更多咨询
                 hd_hang_trace = true;
                 wake_up(&hang_wait);
             }

             if (hang_detect_counter <= 0) { //此时已经触发hang，hang_detect_counter = 0是第一次dump，-1为第二次dump
#ifdef CONFIG_MTK_HANG_DETECT_DB
                 if (!Hang_first_done) {
                     memset(Hang_Info, 0, MaxHangInfoSize);
                     Hang_Info_Size = 0;
                 }
#endif
                 log_hang_info("[Hang_detect]Dump the %d time process bt.\n",Hang_first_done ? 2 : 1);

                 if (Hang_first_done == true && dump_bt_done != 1) {
                     /* some time dump thread will block in dumping native bt */
                     /* so create new thread to dump enough kernel bt */
                     hd_thread = kthread_create(dump_last_thread,NULL, "hang_detect2");
                     if (hd_thread)
                         wake_up_process(hd_thread);
                     if (dump_bt_done != 1)
                         wait_event_interruptible_timeout(dump_bt_done_wait, dump_bt_done == 1, HZ*10);
                 } else
                     wake_up_dump();


                 if (Hang_first_done == true) { //有时候我们还可以把这里屏蔽掉，保留现场，能抓到更多资讯
#ifdef CONFIG_MTK_HANG_DETECT_DB
                     trigger_hang_db();
#else
                     BUG();                                                   
#endif
                 } else
                     Hang_first_done = true;
             }
             hang_detect_counter--;
         }

         msleep((HD_INTER) * 1000); //30s
     }
     return 0;
}

问题分析

判断是哪个阶段出现的hang问题，是boot up、系统正常运行、关机/重启哪个阶段，不同阶段的分析思路不同:

a. boot up阶段Android watchdog还没有启动，因此需要要查进入kernel到Android watchdog之间哪里block住了；

b. 系统正常运行阶段Android watchdog已经启动，是最常见的情况，要以Android watchdog是否block住为出发点梳理问题；

c. 关机/重启阶段Andorid watchdog会因system server被kill而退出，此时需要分析init流程是否block住；

针对系统正常运行阶段；

解开DB后在SYS_LAST_KMSG中搜索"counts down"：

 [158759.906063]  (4)[267:hang_detect][Hang_Detect] hang_detect thread counts down -1:10, status 1.

-1:10中-1表示counts已经被减到-1，说明即将要触发hang异常，10表示Andoriod watchdog最近一次执行WDTMatterJava的参数是300，因为300S将在driver中转化为计数10，即最近一次Android watchdog执行了：

 exceptionHWT.WDTMatterJava(300);

正常情况Android watchdog应该30S执行一次以上调用，如果在300S（counts为10）内都没有再次执行以上调用，说明Android watchdog被block住或已经退出；

接下来可以查看Android watchdog的调用栈，确认其执行情况，在SYS_HANG_DETECT_RAW中可以搜索"watchdog"，以下是实例中遇到的两种情况举例：

a. Android watchdog的调用栈不存在，说明其已经退出，可能原因就是其父进程system server被kill，这种情况system server通常没有顺利退出（当然也存在system server重启失败的可能，只是概率较小），即处于僵尸态（在SYS_HANG_DETECT_RAW中是K状态），需要检查system server的状态：

 system_server pid 1615 state K, flags 4244804. stack is null.

面对以上情况就应该检查system server处于K状态（僵尸态）的原因，通常都是其下有子线程未退出，以此为线索继续查；

b. Android watchdog的调用栈可以搜索到，例如：

 watchdog D 8891.900953 1286 5016 2121 0x1404040 0x400800 1256 143164363927468
 __switch_to+0x12c/0x148
 __schedule+0x784/0x84c
 schedule+0x74/0x94
 io_schedule+0x1c/0x40
 blk_mq_get_tag+0x168/0x2ec
 blk_mq_get_request+0x160/0x398
 blk_mq_make_request+0x18c/0x5f4
 generic_make_request+0x184/0x368
 submit_bio+0x1ac/0x230
 ext4_submit_bio_read+0x150/0x1c8
 ext4_mpage_readpages+0x7d4/0x814
 ext4_readpages+0x54/0x64
 read_pages+0x68/0x148
 __do_page_cache_readahead+0xd4/0x1cc
 filemap_fault+0x254/0x540
 ext4_filemap_fault+0x34/0x50
 __do_fault+0x80/0xec
 handle_pte_fault+0x83c/0xa94
 __handle_speculative_fault+0x220/0x344
 do_page_fault+0x25c/0x4c0
 do_translation_fault+0x50/0x6c
 do_mem_abort+0x5c/0xfc
 el0_da+0x1c/0x20
 0xffffffffffffffff

那么这个调用栈就不是Android watchdog通常所处的状态，需要结合该调用栈查为什么处于这个流程中，需要注意的是，在SYS_HANG_DETECT_RAW中会打印两次系统处于R和D状态的task的调用栈，间隔30S，正两次打印的目的是拉长时间维度，因此需要确认两次打印中Android watchdog的调用栈是否一致，是否发生主/被动调度，进出cpu时间是否变化等等来判断到底是Android watchdog执行的比较忙（或block多次，内次block时间较短），还是一直被block住了。

搜索：‘counts down’ 'watchdog'

流程分析

1. 确定System Server watchdog tick hang detect 的状态

这个可以从kernel log 中明确的看到.如：

watchdog 看起来正常：

[  198.215932] (1)[1322:watchdog]hang_detect HANG_KICK ( 300)
[  198.215945] (1)[1322:watchdog][Hang_Detect] hang_detect enabled 10

watchdog 在第一次dump backtrace:

[  258.218145] (0)[1322:watchdog]hang_detect HANG_KICK ( 360)
[  258.218171] (0)[1322:watchdog][Hang_Detect] hang_detect enabled 12

watchdog 在做SWT:

[  299.046542] (0)[1322:watchdog]hang_detect HANG_KICK ( 420)
[  299.046572] (0)[1322:watchdog][Hang_Detect] hang_detect enabled 14

当然你也可以从hang detect thread 的log 中看到这个：

[  210.475572] (0)[90:hang_detect][Hang_Detect] init found pid:1.
[  210.475735] (0)[90:hang_detect][Hang_Detect] mmcqd/0 found pid:158.
[  210.475815] (0)[90:hang_detect][Hang_Detect] surfaceflinger found pid:265.
[  210.475887] (0)[90:hang_detect][Hang_Detect] system_server found pid:734.
[  210.475919] (0)[90:hang_detect][Hang_Detect] ndroid.systemui found pid:1071.
[  210.476003] (0)[90:hang_detect][Hang_Detect] debuggerd found pid:4313.
[  210.476027] (0)[90:hang_detect][Hang_Detect] debuggerd64 found pid:4314.
[  210.476056] (0)[90:hang_detect][Hang_Detect] hang_detect thread counts down 10:10.

正常情况下，watchdog thread tick 300s, 对应count=10.
watchdog 检测到hang 30s, 在dump backtrace 前，tick 360s, 对应count=12.
在SWT 发生的情况下，tick 420s, 对应count=14.
在SWT 发生, 抓取完AEE DB 后, tick 330s, 对应count=11.
在 Surfaceflinger/SystemServer 发生Native Exception, tick 600s, 对应count=20.
在 Surfaceflinger/SystemServer 发生Native Exception后, 并且我司AEE开始抓取coredump, tick 1200s, 对应count=40.
在 Surfaceflinger/SystemServer 发生Native Exception, 并且我司AEE抓取完coredump后, tick 570, 对应count=19.
在关机流程中, 进入init reboot/shutdown 过程后, tick 150s, 对应count=5. (注: Q 版本是eng build 90s, user/userdebug build 30s)

2. 确认System Server Watchdog thread 状态

Watchdog 正常时，它是停留在java wait 函数，所以它对应的backtrace 即是

 KERNEL SPACE BACKTRACE, sysTid=1299
 
    [] __switch_to+0x74/0x8c from [] 
    [] __schedule+0x314/0x794 from [] 
    [] schedule+0x24/0x68 from [] 
    [] futex_wait_queue_me+0xcc/0x158 from [] 
    [] futex_wait+0x120/0x20c from [] 
    [] do_futex+0x184/0xa48 from [] 
    [] SyS_futex+0x88/0x19c from [] 
    [] cpu_switch_to+0x48/0x4c from [] 
 
"watchdog" sysTid=1299
 
  #00 pc 0000000000019478  /system/lib64/libc.so (syscall+28)
  #01 pc 00000000000d0c74  /system/lib64/libart.so (art::ConditionVariable::TimedWait(art::Thread*, long, int)+168)
  #02 pc 00000000002a3e98  /system/lib64/libart.so (art::Monitor::Wait(art::Thread*, long, int, bool, art::ThreadState)+860)
  #03 pc 00000000002a4e04  /system/lib64/libart.so (art::Monitor::Wait(art::Thread*, art::mirror::Object*, long, int, bool, art::ThreadState)+244)
  #04 pc 0000000000000974  /data/dalvik-cache/arm64/system@framework@boot.oat

注意:因为在kernel 中，你只能看到kernel backtrace, 而SyS_futex 这个futex 只能表明是上层卡住, 可能是pthread mutex lock, 也可能是wait 也可能是其他，难以直接明了.

如果不是这个backtrace 那么很明显watchdog 已经卡住，这个时候即由它的backtrace 进行推理，以便确认进一步卡住的地方, 我们准备了另外的章节来描述如何解析出user space 的native backtrace.

那如果你没有找到watchdog 线程怎么办呢？很简单，说明此时可能是三种情况，一种是system server 发生SWT 后退出, 因为卡在了kernel 导致system server 某个线程无法退出，而这个线程就是你需要进一步分析的; 另外一种情况是system server SWT 重启后，system server 初始化失败，watchdog 还没有启动就卡死了。这两种根据前后的LOG 很容易就可以判断出来；最后就是问题是发生在关机/重启过程中, system server/zygote 已经被杀掉了, 直接查看init 的backtrace 即可.

3.关键process 状态追查

根据以往分析的经验，我们发现很多时候都是一些关键的process 卡住，导致watchdog 的行为无法达成. 所以我们在hang detect 中打印了关键的Process 的资讯，以便快速追查问题.

[  300.505542] (0)[90:hang_detect][Hang_Detect] init found pid:1.
[  300.505616] (0)[90:hang_detect][Hang_Detect] mmcqd/0 found pid:158.
[  300.505649] (0)[90:hang_detect][Hang_Detect] surfaceflinger found pid:265.
[  300.505678] (0)[90:hang_detect][Hang_Detect] system_server found pid:734.
[  300.505688] (0)[90:hang_detect][Hang_Detect] ndroid.systemui found pid:1071.
[  300.505719] (0)[90:hang_detect][Hang_Detect] debuggerd found pid:4313.
[  300.505726] (0)[90:hang_detect][Hang_Detect] debuggerd64 found pid:4314.
[  300.505735] (0)[90:hang_detect][Hang_Detect] [Hang_Detect] hang_detect thread counts down -1:12, status 1.

init 卡住的话，就意味着任何system property 的设置，或者service 的重启，wifi 的设置等都会卡住.

mmcqd 卡住的话，就意味着emmc 可能卡住.

surfaceflinger 卡住的话，就意味着display driver 和 GPU 可能卡住.

systemui 卡住的话，就意味着keyguard 可能卡住.

debuggerd 卡住的话，Watchdog 抓取backtrace 时就会卡住.

4. Hang Detect KE 调整

为了防止hang detect 的资讯, 被直接其他的kernel log 刷掉, 在N1 版本后, 我们有开辟专门的hang detect buffer 存储相关的资讯, 以后再也不担心被刷掉了. 对应的buffer 资讯在KE 的DB 里面存放在: SYS_HANG_DETECT_RAW

为了方便查看上层的backtrace, 我们也把/data/anr 的资讯抓取到了KE 的DB 里面, 存放在: DATA_ANR_TRACES

当确认到system server 的SWT 流程已经抓上层trace 时, 务必查看此项.