Typical hang_detect Case Studies

Case 1: Kernel suspend flow stuck, triggering a hang detect reboot

A customer reported from the field that a fair number of hang detect thread KEs were showing up; please analyze as soon as possible.

[179942.389198] 0)watchdog        D
[179942.389206] 0) ffffffc000085a84 
[179942.389207] 0)    0  1374    320 0x00000008
[179942.389210] 0)Call trace:
[179942.389221] 0)[] __switch_to+0x74/0x8c
[179942.389234] 0)[] __schedule+0x314/0x794
[179942.389248] 0)[] schedule+0x24/0x68
[179942.389258] 0)[] __refrigerator+0x6c/0xf4
[179942.389269] 0)[] futex_wait_queue_me+0x150/0x158
[179942.389280] 0)[] futex_wait+0x120/0x20c
[179942.389291] 0)[] do_futex+0x184/0xa48
[179942.389301] 0)[] SyS_futex+0x88/0x19c
[179942.389313] 0)android.bg      D
[179942.389320] 0) ffffffc000085a84 
[179942.389322] 0)    0  1447    320 0x00000008
[179942.389325] 0)Call trace:
[179942.389336] 0)[] __switch_to+0x74/0x8c
[179942.389350] 0)[] __schedule+0x314/0x794
[179942.389363] 0)[] schedule+0x24/0x68
[179942.389372] 0)[] __refrigerator+0x6c/0xf4
[179942.389385] 0)[] do_nanosleep+0x108/0x110
[179942.389396] 0)[] hrtimer_nanosleep+0x8c/0x108
[179942.389407] 0)[] SyS_nanosleep+0x90/0xa8

The threads above are parked in __refrigerator, i.e. frozen, so the system is stuck in the pm_suspend flow. Search for the kworker keyword:

[179942.393792] 0)Workqueue: autosleep try_to_suspend
[179942.393795] 0)Call trace:
[179942.393807] 0)[] __switch_to+0x74/0x8c
[179942.393821] 0)[] __schedule+0x314/0x794
[179942.393835] 0)[] schedule+0x24/0x68
[179942.393850] 0)[] synchronize_irq+0x88/0xc4
[179942.393865] 0)[] suspend_device_irqs+0xdc/0xe4
[179942.393878] 0)[] dpm_suspend_noirq+0x2c/0x288
[179942.393889] 0)[] dpm_suspend_end+0x34/0x84
[179942.393905] 0)[] suspend_devices_and_enter+0x150/0x4a4
[179942.393917] 0)[] enter_state+0x15c/0x190
[179942.393929] 0)[] pm_suspend+0x2c/0xa8
[179942.393939] 0)[] try_to_suspend+0x148/0x1a8
[179942.393955] 0)[] process_one_work+0x148/0x468
[179942.393968] 0)[] worker_thread+0x138/0x3c0
[179942.393979] 0)[] kthread+0xb0/0xbc

From this trace, the suspend flow has already reached the tail end of the device operations.

It is executing synchronize_irq, which waits for in-flight IRQs to finish, paving the way for the subsequent disabling of interrupts (arch_suspend_disable_irqs). The first instinct was: if an irq handler had been running that long, the hardware WDT would have rebooted the device long ago, well before hang detect could catch it. So let's look at the synchronize_irq code:

void synchronize_irq(unsigned int irq)
{
    struct irq_desc *desc = irq_to_desc(irq);
    bool inprogress;

    if (!desc)
        return;

    do {
        unsigned long flags;

        /*
         * Wait until we're out of the critical section.  This might
         * give the wrong answer due to the lack of memory barriers.
         */
        while (irqd_irq_inprogress(&desc->irq_data))
            cpu_relax();

        /* Ok, that indicated we're done: double-check carefully. */
        raw_spin_lock_irqsave(&desc->lock, flags);
        inprogress = irqd_irq_inprogress(&desc->irq_data);
        raw_spin_unlock_irqrestore(&desc->lock, flags);

        /* Oops, that failed? */
    } while (inprogress);

    /*
     * We made sure that no hardirq handler is running. Now verify
     * that no threaded handlers are active.
     */
    wait_event(desc->wait_for_threads, !atomic_read(&desc->threads_active));
}

The final wait_event sleeps on wait_for_threads; the corresponding wakeup code runs in the irq thread:

/*
 * Interrupt handler thread
 */
static int irq_thread(void *data)
{
    struct callback_head on_exit_work;
    static const struct sched_param param = {
        .sched_priority = MAX_USER_RT_PRIO/2,
    };
    struct irqaction *action = data;
    struct irq_desc *desc = irq_to_desc(action->irq);
    irqreturn_t (*handler_fn)(struct irq_desc *desc,
            struct irqaction *action);

    if (force_irqthreads && test_bit(IRQTF_FORCED_THREAD,
                    &action->thread_flags))
        handler_fn = irq_forced_thread_fn;
    else
        handler_fn = irq_thread_fn;

    sched_setscheduler(current, SCHED_FIFO, &param);

    init_task_work(&on_exit_work, irq_thread_dtor);
    task_work_add(current, &on_exit_work, false);

    irq_thread_check_affinity(desc, action);

    while (!irq_wait_for_interrupt(action)) {
        irqreturn_t action_ret;

        irq_thread_check_affinity(desc, action);

        action_ret = handler_fn(desc, action);
        if (action_ret == IRQ_HANDLED)
            atomic_inc(&desc->threads_handled);

        wake_threads_waitq(desc);
    }

    /*
     * This is the regular exit path. __free_irq() is stopping the
     * thread via kthread_stop() after calling
     * synchronize_irq(). So neither IRQTF_RUNTHREAD nor the
     * oneshot mask bit can be set. We cannot verify that as we
     * cannot touch the oneshot mask at this point anymore as
     * __setup_irq() might have given out currents thread_mask
     * again.
     */
    task_work_cancel(current, irq_thread_dtor);
    return 0;
}

There are two ways to register an IRQ: request_irq and request_threaded_irq. request_irq can only register the irq handler for the line, while request_threaded_irq registers both an irq handler and an irq thread function. For the latter, the kernel spawns a dedicated thread for the irq (irq_thread) which runs your thread function; interrupts are not kept disabled for it, and it is no longer in the atomic irq path. But synchronize_irq waits both for the irq handler and for the irq thread function to finish.

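As a hedged sketch (the names, flag, and handler bodies below are hypothetical, not taken from the gf516m driver), a threaded registration looks like this; synchronize_irq() has to wait for both functions:

#include <linux/interrupt.h>

/* Hard irq handler: runs in atomic irq context; keep it short. */
static irqreturn_t demo_hardirq(int irq, void *dev_id)
{
    /* ack the hardware, then defer the heavy work to the thread */
    return IRQ_WAKE_THREAD;
}

/* Threaded handler: runs in a dedicated "irq/<nr>-<name>" kthread with
 * interrupts enabled and may sleep. synchronize_irq() waits for this
 * function too, via desc->threads_active / desc->wait_for_threads. */
static irqreturn_t demo_thread_fn(int irq, void *dev_id)
{
    /* slow work, e.g. bus transactions or TEE calls, goes here */
    return IRQ_HANDLED;
}

static int demo_probe(int irq, void *dev)
{
    return request_threaded_irq(irq, demo_hardirq, demo_thread_fn,
                                IRQF_TRIGGER_RISING, "demo-irq", dev);
}
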
So the problem reduces to: someone must have registered an irq thread function that never finished. Since such a function always runs from irq_thread, searching for that shows:

[179942.389537] 0)irq/291-spi0.0  D
[179942.389544] 0) ffffffc000085a84 
[179942.389546] 0)    0  1090      2 0x00000000
[179942.389549] 0)Call trace:
[179942.389561] 0)[] __switch_to+0x74/0x8c
[179942.389575] 0)[] __schedule+0x314/0x794
[179942.389588] 0)[] schedule+0x24/0x68
[179942.389601] 0)[] schedule_timeout+0x128/0x20c
[179942.389614] 0)[] __down_timeout+0x68/0xc8
[179942.389625] 0)[] down_timeout+0x5c/0x84
[179942.389640] 0)[] connection_read_data+0x38/0xac
[179942.389654] 0)[] mc_wait_notification+0xd0/0x148
[179942.389670] 0)[] gf516m_send_cmd_secdrv+0x68/0x1e8
[179942.389681] 0)[] gf516m_irq+0x50/0x638
[179942.389692] 0)[] irq_thread+0x110/0x164
[179942.389703] 0)[] kthread+0xb0/0xbc

gf516m is the fingerprint sensor irq. It is registered via request_threaded_irq, and its thread function talks to the TEE through mc_wait_notification, which got stuck, causing the hang detect reboot.

Goodix engineers were brought in to help, and the flow was changed to the irq → workqueue pattern to work around the situation.

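A minimal sketch of that workaround pattern (the names are hypothetical; the actual gf516m change was made by the vendor): the hard irq handler only schedules work, and the blocking TEE call runs in workqueue context, out of synchronize_irq()'s way:

#include <linux/interrupt.h>
#include <linux/workqueue.h>

static struct work_struct demo_work;

/* Runs in workqueue context; blocking calls (mc_wait_notification()-style
 * TEE communication) are allowed here and no longer stall synchronize_irq()
 * during suspend. */
static void demo_work_fn(struct work_struct *work)
{
    /* ... blocking communication with the TEE ... */
}

/* Hard irq handler: just hand off to the workqueue and return. */
static irqreturn_t demo_irq(int irq, void *dev_id)
{
    schedule_work(&demo_work);
    return IRQ_HANDLED;
}

static int demo_init(int irq, void *dev)
{
    INIT_WORK(&demo_work, demo_work_fn);
    return request_irq(irq, demo_irq, IRQF_TRIGGER_RISING, "demo-irq", dev);
}
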
Root cause
The gf516m driver's IRQ thread got stuck during suspend, causing hang detect to trigger a reboot.

Solution
The vendor modified the driver to run the flow in the irq → workqueue fashion.

Search keywords: 'counts down', 'watchdog', 'kworker', 'irq'

Case 2: Kernel suspend flow stuck inside a driver

The device keeps rebooting while in standby.

[ 3005.572831].(4)[107:hang_detect][Hang_Detect] we should triger HWT ...
[ 3005.573649].(4)[107:hang_detect]BUG: failure at kernel-3.10/drivers/misc/mediatek/aee/aed/monitor_hang.c:437/hang_detect_thread()!
[ 3005.575584].(4)[107:hang_detect]Unable to handle kernel paging request at virtual address 0000dead
[ 3005.576695].(4)[107:hang_detect]pgd = ffffffc00007d000
[ 3005.577332].(4)[107:hang_detect][0000dead] *pgd=00000000ed807003, *pmd=00000000ed808003, *pte=0000000000000000
[ 3005.578356]-(4)[107:hang_detect]Internal error: Oops: 96000046 [#1] PREEMPT SMP
[ 3006.693872]-(4)[107:hang_detect]Modules linked in:
[ 3006.694237]-(4)[107:hang_detect]CPU: 4 PID: 107 Comm: hang_detect Tainted: P W O 3.10.72+ #1
[ 3006.695374]-(4)[107:hang_detect]task: ffffffc0ac4df000 ti: ffffffc0ac65c000 task.ti: ffffffc0ac65c000
[ 3006.696527]-(4)[107:hang_detect]PC is at hang_detect_thread+0x2b8/0x2c0
[ 3006.697345]-(4)[107:hang_detect]LR is at hang_detect_thread+0x2ac/0x2c0

From the log above, this can be identified as a hang detect reboot.

First search for the 'counts' keyword to see whether a log like the following exists:

[Hang_Detect] hang_detect thread counts down -4529:20.

It was not found: SYS_KERNEL_LOG is fairly small and the line had already been flushed out. So we start from the watchdog thread instead. Search for the watchdog keyword:

[ 3005.343913].(4)[107:hang_detect]watchdog D
[ 3005.344252].(4)[107:hang_detect] ffffffc0000864ec 
[ 3005.344849].(4)[107:hang_detect] 0 1533 410 0x00000008
[ 3005.345560].(4)[107:hang_detect]Call trace:
[ 3005.346089].(4)[107:hang_detect][] __switch_to+0x74/0x8c
[ 3005.346947].(4)[107:hang_detect][] __schedule+0x318/0x7b4
[ 3005.347814].(4)[107:hang_detect][] schedule+0x24/0x68
[ 3005.348638].(4)[107:hang_detect][] __refrigerator+0x7c/0x144
[ 3005.349540].(4)[107:hang_detect][] futex_wait_queue_me+0x150/0x158
[ 3005.350515].(4)[107:hang_detect][] futex_wait+0x13c/0x23c
[ 3005.351381].(4)[107:hang_detect][] do_futex+0x184/0xa48
[ 3005.352226].(4)[107:hang_detect][] SyS_futex+0x88/0x19c

Search for the kworker keyword:

[ 3005.513786].(4)[107:hang_detect]kworker/u16:6 D
[ 3005.514121].(4)[107:hang_detect] ffffffc0000864ec 
[ 3005.514718].(4)[107:hang_detect] 0 6910 2 0x00000000
[ 3005.515437].(4)[107:hang_detect]Workqueue: autosleep try_to_suspend
[ 3005.515447].(4)[107:hang_detect]Call trace:
[ 3005.515965].(4)[107:hang_detect][] __switch_to+0x74/0x8c
[ 3005.516823].(4)[107:hang_detect][] __schedule+0x318/0x7b4
[ 3005.517690].(4)[107:hang_detect][] schedule+0x24/0x68
[ 3005.518515].(4)[107:hang_detect][] schedule_timeout+0x168/0x20c
[ 3005.519448].(4)[107:hang_detect][] wait_for_common+0xa0/0x148
[ 3005.520370].(4)[107:hang_detect][] wait_for_completion+0x10/0x1c
[ 3005.521313].(4)[107:hang_detect][] flush_kthread_work+0xe4/0x124
[ 3005.522255].(4)[107:hang_detect][] mc_fastcall+0x5c/0x70
[ 3005.523111].(4)[107:hang_detect][] mc_switch_core+0x60/0x8c
[ 3005.524001].(4)[107:hang_detect][] mc_cpu_offfline.part.1+0x54/0x68
[ 3005.524977].(4)[107:hang_detect][] mobicore_cpu_callback+0x98/0xa0
[ 3005.525943].(4)[107:hang_detect][] notifier_call_chain+0x68/0x11c
[ 3005.526897].(4)[107:hang_detect][] __raw_notifier_call_chain+0x8/0x14
[ 3005.527896].(4)[107:hang_detect][] _cpu_down+0x98/0x2a4
[ 3005.528743].(4)[107:hang_detect][] disable_nonboot_cpus+0xa8/0x168
[ 3005.529706].(4)[107:hang_detect][] suspend_devices_and_enter+0x1bc/0x4e8
[ 3005.530745].(4)[107:hang_detect][] enter_state+0x1e4/0x224
[ 3005.531620].(4)[107:hang_detect][] pm_suspend+0x2c/0xa8
[ 3005.532467].(4)[107:hang_detect][] try_to_suspend+0x1d8/0x240
[ 3005.533384].(4)[107:hang_detect][] process_one_work+0x134/0x470
[ 3005.534313].(4)[107:hang_detect][] worker_thread+0x138/0x3c0
[ 3005.535215].(4)[107:hang_detect][] kthread+0xb0/0xbc

The suspend flow has reached disable_nonboot_cpus(), but it ends up stuck in mc_fastcall():

bool mc_fastcall(void *data)
{
    struct fastcall_work fc_work = {
        KTHREAD_WORK_INIT(fc_work.work, fastcall_work_func),
        .data = data,
    };

    /* Hand the work to the fastcall_worker kthread ... */
    if (!queue_kthread_work(&fastcall_worker, &fc_work.work))
        return false;
    /* ... and block until that kthread has executed it. */
    flush_kthread_work(&fc_work.work);
    return true;
}

From the code, it queues fc_work and then waits for the fastcall_worker thread to finish it. Which thread is fastcall_worker?

int mc_fastcall_init(struct mc_context *context)
{
    int ret = 0;
    ctx = context;

    fastcall_thread = kthread_create(kthread_worker_fn, &fastcall_worker, "mc_fastcall");
    ......
}

So fastcall_worker runs in a thread named mc_fastcall. Its call stack:

[ 3004.761752].(4)[107:hang_detect]mc_fastcall D
[ 3004.762080].(4)[107:hang_detect] ffffffc0000864ec 
[ 3004.762677].(4)[107:hang_detect] 0 218 2 0x00000000
[ 3004.763389].(4)[107:hang_detect]Call trace:
[ 3004.763918].(4)[107:hang_detect][] __switch_to+0x74/0x8c
[ 3004.764775].(4)[107:hang_detect][] __schedule+0x318/0x7b4
[ 3004.765642].(4)[107:hang_detect][] schedule+0x24/0x68
[ 3004.766467].(4)[107:hang_detect][] __refrigerator+0x7c/0x144
[ 3004.767368].(4)[107:hang_detect][] kthread_worker_fn+0xfc/0x19c
[ 3004.768300].(4)[107:hang_detect][] kthread+0xb0/0xbc

During suspend, the vast majority of tasks are driven into __refrigerator, freezing them. Yet the rest of the suspend flow needs this frozen thread to do work; that deadlock is what caused the hang and the reboot.

The mc_fastcall thread must not enter the __refrigerator state, so the code needs to be changed to make mc_fastcall non-freezable.

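For context, a kernel thread only parks in __refrigerator() if it opted into the freezer; kthreads are non-freezable by default. The sketch below (hypothetical, not the mobicore code) shows the opt-in pattern that a suspend-critical worker like mc_fastcall must avoid:

#include <linux/kthread.h>
#include <linux/freezer.h>

/* A freezable worker: set_freezable() opts the kthread into the freezer,
 * and try_to_freeze() is where it parks in __refrigerator() at suspend.
 * A worker that the suspend path itself depends on must NOT make these
 * calls, or suspend ends up waiting on a thread that suspend froze. */
static int freezable_worker_fn(void *data)
{
    set_freezable();
    while (!kthread_should_stop()) {
        try_to_freeze();
        /* ... process queued work ... */
    }
    return 0;
}
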
Root cause

The mobicore driver deadlocks against the freezer during the suspend phase.

Solution

Fix the mobicore driver (keep the mc_fastcall thread non-freezable).

Case 3: Sensor kernel driver stuck, causing a hang detect reboot

A customer reported that while making phone calls, the device occasionally froze; after waiting a few minutes it rebooted by itself.

The log contains an obvious Kernel Panic AEE DB. Unpacking the DB, exp_main shows:

Exception Class: Kernel (KE)
PC is at [] hang_detect_thread+0x278/0x280

Current Executing Process:
[hang_detect, 107][kthreadd, 2]

Backtrace:
[] __do_kernel_fault.part.5+0x70/0x84
[] do_page_fault+0x218/0x364 
[] do_translation_fault+0x40/0x4c    
[] do_mem_abort+0x38/0x9c    
[] el1_da+0x1c/0x88  
[] kthread+0xb0/0xbc 
[] ret_from_fork+0xc/0x30    
[] 0xffffffffffffffff

hang detect called BUG() deliberately: it detected that the Android upper layer was stuck and rebooted the phone. First check the system server tick tracked by hang detect:

[ 7892.901454] (0)[107:hang_detect][Hang_Detect] init found pid:1.
[ 7892.901530] (0)[107:hang_detect][Hang_Detect] mmcqd/0 found pid:172.
[ 7892.901565] (0)[107:hang_detect][Hang_Detect] surfaceflinger found pid:296.
[ 7892.901576] (0)[107:hang_detect][Hang_Detect] debuggerd found pid:312.
[ 7892.901583] (0)[107:hang_detect][Hang_Detect] debuggerd64 found pid:313.
[ 7892.901610] (0)[107:hang_detect][Hang_Detect] system_server found pid:771.
[ 7892.901618] (0)[107:hang_detect][Hang_Detect] ndroid.systemui found pid:1129.
[ 7892.901658] (0)[107:hang_detect][Hang_Detect] hang_detect thread counts down 0:24.
[ 7892.901664] (0)[107:hang_detect][Hang_Detect] dump system_ui all thread bt

The system server tick count is 24, meaning system_server had already gone through a watchdog reboot (SWT); yet its pid is still 771, so system_server failed to restart. The next step is to analyze why it could not restart.

Check system_server's kernel backtrace:

[ 7893.231460] (0)[107:hang_detect]system_server   x
[ 7893.231466] (0)[107:hang_detect] ffffffc000085b94 
[ 7893.231474] (0)[107:hang_detect]    0   771    345 0x00000009
[ 7893.231480] (0)[107:hang_detect]Call trace:
[ 7893.231490] (0)[107:hang_detect][] __switch_to+0x74/0x8c
[ 7893.231499] (0)[107:hang_detect][] __schedule+0x314/0x784
[ 7893.231509] (0)[107:hang_detect][] schedule+0x24/0x68
[ 7893.231520] (0)[107:hang_detect][] do_exit+0x52c/0x8e0
[ 7893.231530] (0)[107:hang_detect][] do_group_exit+0x40/0xb0
[ 7893.231541] (0)[107:hang_detect][] get_signal_to_deliver+0x2a0/0x55c
[ 7893.231552] (0)[107:hang_detect][] do_signal+0x238/0x564
[ 7893.231561] (0)[107:hang_detect][] do_notify_resume+0x58/0x64

[ 7893.231567] (0)[107:hang_detect]Binder_C        D
[ 7893.231573] (0)[107:hang_detect] ffffffc000085b94 
[ 7893.231582] (0)[107:hang_detect]    0  1868    345 0x00000001
[ 7893.231587] (0)[107:hang_detect]Call trace:
[ 7893.231596] (0)[107:hang_detect][] __switch_to+0x74/0x8c
[ 7893.231605] (0)[107:hang_detect][] __schedule+0x314/0x784
[ 7893.231614] (0)[107:hang_detect][] schedule+0x24/0x68
[ 7893.231624] (0)[107:hang_detect][] schedule_preempt_disabled+0x10/0x24
[ 7893.231633] (0)[107:hang_detect][] __mutex_lock_slowpath+0x134/0x23c
[ 7893.231642] (0)[107:hang_detect][] mutex_lock+0x40/0x60
[ 7893.231652] (0)[107:hang_detect][] inv_flush_batch_store+0x40/0x6c
[ 7893.231662] (0)[107:hang_detect][] dev_attr_store+0x14/0x28
[ 7893.231674] (0)[107:hang_detect][] sysfs_write_file+0xd8/0x154
[ 7893.231686] (0)[107:hang_detect][] vfs_write+0xa4/0x194
[ 7893.231696] (0)[107:hang_detect][] SyS_write+0x40/0x8c

The system_server main thread is in state X, meaning it is effectively dead and waiting for its remaining threads to exit. From the backtraces it is clearly waiting for the Binder_C thread, which is blocked on a mutex and can never exit.

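The likely shape of the bug (a hedged sketch inferred from the backtrace; demo_store and demo_lock are hypothetical, not the real inv driver code) is a sysfs store handler serialized by a driver mutex that some other path holds across an operation that never completes:

#include <linux/device.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(demo_lock);

/* Sketch of an inv_flush_batch_store-style handler: it takes the driver
 * mutex before touching the hardware. If any other path holds demo_lock
 * across a call that never returns (e.g. a stalled bus transaction),
 * every writer to this sysfs file blocks in mutex_lock() forever; a
 * thread stuck here also keeps its exiting process parked in do_exit. */
static ssize_t demo_store(struct device *dev, struct device_attribute *attr,
                          const char *buf, size_t count)
{
    mutex_lock(&demo_lock);
    /* ... program the sensor ... */
    mutex_unlock(&demo_lock);
    return count;
}
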
Case 4: eMMC driver stuck, causing hang detect

After the screen has been off for a while (a long time, half an hour or more), pressing the power key no longer lights it up; only a long press of the power key to reboot helps.

Search for the 'counts' keyword to see which stage it is stuck at:

[ 6396.581177] .(0)[170:hang_detect][Hang_Detect] init found pid:1.
[ 6396.582144] .(0)[170:hang_detect][Hang_Detect] mmcqd/0 found pid:183.
[ 6396.583057] .(0)[170:hang_detect][Hang_Detect] surfaceflinger found pid:300.
[ 6396.584009] .(0)[170:hang_detect][Hang_Detect] debuggerd found pid:1627.
[ 6396.584877] .(0)[170:hang_detect][Hang_Detect] debuggerd64 found pid:1628.
[ 6396.585811] .(0)[170:hang_detect][Hang_Detect] system_server found pid:2285.
[ 6396.586727] .(0)[170:hang_detect][Hang_Detect] ndroid.systemui found pid:3523.
[ 6396.587735] .(0)[170:hang_detect][Hang_Detect] hang_detect thread counts down 0:20.
[ 6396.588722] .(0)[170:hang_detect][Hang_Detect] dump system_ui all thread bt

The count is 20, which means the watchdog is at the stage of dumping call stacks. Dumping call stacks has to go through debuggerd, so check debuggerd's call stack to see whether it is stuck. Searching for the debuggerd keyword turns up no debuggerd call stack, so debuggerd is probably not stuck. Back to the watchdog thread's call stack:

[ 6397.822449] .(2)[170:hang_detect]watchdog D
[ 6397.822455] .(2)[170:hang_detect] ffffffc000088390 
[ 6397.822457] .(2)[170:hang_detect] 0 4062 1717 0x00000000
[ 6397.822461] .(2)[170:hang_detect]Call trace:
[ 6397.822469] .(2)[170:hang_detect][] __switch_to+0x74/0x8c
[ 6397.822476] .(2)[170:hang_detect][] __schedule+0x434/0x8ac
[ 6397.822484] .(2)[170:hang_detect][] schedule+0x24/0x74
[ 6397.822491] .(2)[170:hang_detect][] io_schedule+0x64/0xa0
[ 6397.822498] .(2)[170:hang_detect][] bit_wait_io+0x30/0x54
[ 6397.822506] .(2)[170:hang_detect][] __wait_on_bit+0x9c/0xd4
[ 6397.822514] .(2)[170:hang_detect][] wait_on_page_bit_killable+0x78/0x8c
[ 6397.822521] .(2)[170:hang_detect][] __lock_page_or_retry+0xa8/0xd0
[ 6397.822528] .(2)[170:hang_detect][] filemap_fault+0x754/0x814
[ 6397.822535] .(2)[170:hang_detect][] __do_fault+0x34/0xa0
[ 6397.822542] .(2)[170:hang_detect][] do_read_fault.isra.94+0x1d0/0x2a4
[ 6397.822552] .(2)[170:hang_detect][] handle_mm_fault+0x3dc/0xac4
[ 6397.822561] .(2)[170:hang_detect][] do_page_fault+0x278/0x390
[ 6397.822568] .(2)[170:hang_detect][] do_mem_abort+0x38/0x9c

A page fault occurred and ended up waiting on I/O, so the eMMC is probably stuck. Check the mmcqd thread's call stack:

[ 6397.824104] .(2)[170:hang_detect][Hang_Detect] dump mmcqd/0 all thread bt
[ 6397.824113] .(2)[170:hang_detect]mmcqd/0 D
[ 6397.824119] .(2)[170:hang_detect] ffffffc000088390 
[ 6397.824121] .(2)[170:hang_detect] 0 183 2 0x00000000
[ 6397.824125] .(2)[170:hang_detect]Call trace:
[ 6397.824133] .(2)[170:hang_detect][] __switch_to+0x74/0x8c
[ 6397.824141] .(2)[170:hang_detect][] __schedule+0x434/0x8ac
[ 6397.824148] .(2)[170:hang_detect][] schedule+0x24/0x74
[ 6397.824157] .(2)[170:hang_detect][] __mmc_claim_host+0xb8/0x210
[ 6397.824165] .(2)[170:hang_detect][] mmc_get_card+0x10/0x1c
[ 6397.824172] .(2)[170:hang_detect][] mmc_blk_issue_rq+0x210/0x44c
[ 6397.824179] .(2)[170:hang_detect][] mmc_queue_thread+0xa8/0x16c
[ 6397.824186] .(2)[170:hang_detect][] kthread+0xe0/0xf8

This is not mmcqd's normal call stack; it is stuck on the eMMC. Searching for __mmc_claim_host shows other threads stuck at the same place:

[ 6397.821781] .(2)[170:hang_detect]rpmb_Dci D
[ 6397.821788] .(2)[170:hang_detect] ffffffc000088390 
[ 6397.821789] .(2)[170:hang_detect] 0 420 2 0x00000000
[ 6397.821793] .(2)[170:hang_detect]Call trace:
[ 6397.821801] .(2)[170:hang_detect][] __switch_to+0x74/0x8c
[ 6397.821809] .(2)[170:hang_detect][] __schedule+0x434/0x8ac
[ 6397.821817] .(2)[170:hang_detect][] schedule+0x24/0x74
[ 6397.821825] .(2)[170:hang_detect][] __mmc_claim_host+0xb8/0x210
[ 6397.821834] .(2)[170:hang_detect][] emmc_rpmb_req_handle+0x60/0xd8
[ 6397.821841] .(2)[170:hang_detect][] emmc_rpmb_listenDci+0x14c/0x1a0
[ 6397.821848] .(2)[170:hang_detect][] kthread+0xe0/0xf8

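For background on why every MMC user piles up at the same spot: access to the MMC host is serialized by a claim/release protocol, so a single owner that never releases it blocks everyone. A minimal sketch of that pattern (demo_mmc_access is illustrative, not the real driver code):

#include <linux/mmc/core.h>
#include <linux/mmc/host.h>

/* Sketch of the claim/release protocol around an MMC request. Both
 * mmcqd and the rpmb_Dci thread funnel through __mmc_claim_host();
 * if the current owner never calls mmc_release_host() (here, because
 * the card stopped answering after the failed sleep/awake CMD5),
 * every later claimer sleeps in __mmc_claim_host() indefinitely. */
static void demo_mmc_access(struct mmc_host *host)
{
    mmc_claim_host(host);    /* blocks until the host is free */
    /* ... issue commands to the card ... */
    mmc_release_host(host);
}
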
Searching for eMMC error logs shows the eMMC failed when being put into sleep during suspend, and then failed again on resume:

The host sent CMD5 to put the eMMC into sleep mode and hit a timeout error:

[ 4268.440079] .(0)[2339:system_server][msdc][msdc_command_resp_polling]: msdc0 CMD MSDC_INT_CMDTMO Arg
[ 4268.440081] .(0)[2339:system_server][msdc][msdc_command_resp_polling]: msdc0 XXX CMD ARG is R1B, TMO not reset hw
[ 4268.440087] .(0)[2339:system_server][msdc]Sleep_Awake CMD timeout, MSDC_PS 81ff0002
[ 4268.440090] .(0)[2339:system_server][msdc]eMMC sleep CMD5 TMO will reinit

The host sent CMD5 to awaken the eMMC and hit a timeout error:

[ 4268.471086] .(0)[2339:system_server]PM: Some devices failed to suspend, or early wake event detected
[ 4268.471178] .(0)[2339:system_server][msdc][msdc_command_resp_polling]: msdc0 CMD MSDC_INT_CMDTMO Arg
[ 4268.471180] .(0)[2339:system_server][msdc][msdc_command_resp_polling]: msdc0 XXX CMD ARG is R1B, TMO not reset hw
[ 4268.471189] .(0)[2339:system_server][msdc]msdc0 -> XXX CMD Error Resp  PID
[ 4268.471201] .(0)[2339:system_server]mmc0: error -110 during resume (card was removed?)

Root cause

Most likely an eMMC device problem.

Solution

Investigate jointly with the eMMC vendor.

Case 5: ART abort causing a deadlock, leading to hang detect

During testing, hang detect occurred with low probability. SYS_REBOOT_REASON contains 'hang detect time out: 0xa', meaning the device hung without the watchdog noticing.

Live debugging was needed; once the issue reproduced, analysis was done on the spot:

First capture all thread information and check whether anything is stuck in the kernel:

adb shell ps -A -T > ps.log

From the ps log, apart from system_server, which is almost entirely parked on locks, everything else looks normal:

USER PID TID PPID VSZ RSS WCHAN ADDR S CMD 
......
system 1009 1009 542 4671228 246048 futex_wait_queue_me 0 S system_server
system 1009 1014 542 4671228 246048 futex_wait_queue_me 0 S Signal Catcher
system 1009 1015 542 4671228 246048 futex_wait_queue_me 0 S ReferenceQueueD
system 1009 1016 542 4671228 246048 futex_wait_queue_me 0 S FinalizerDaemon
system 1009 1017 542 4671228 246048 futex_wait_queue_me 0 S FinalizerWatchd
system 1009 1018 542 4671228 246048 futex_wait_queue_me 0 S HeapTaskDaemon
system 1009 1019 542 4671228 246048 futex_wait_queue_me 0 S Binder:1009_1
system 1009 1020 542 4671228 246048 futex_wait_queue_me 0 S Binder:1009_2
system 1009 1022 542 4671228 246048 futex_wait_queue_me 0 S android.bg
system 1009 1023 542 4671228 246048 futex_wait_queue_me 0 S ActivityManager
system 1009 1024 542 4671228 246048 futex_wait_queue_me 0 S android.ui
system 1009 1025 542 4671228 246048 futex_wait_queue_me 0 S ThermalGame
system 1009 1026 542 4671228 246048 futex_wait_queue_me 0 S ActivityManager
system 1009 1027 542 4671228 246048 wait_woken 0 S FileObserver
system 1009 1028 542 4671228 246048 futex_wait_queue_me 0 S batterystats-wo
system 1009 1029 542 4671228 246048 futex_wait_queue_me 0 S vivo_firewall_t
system 1009 1030 542 4671228 246048 futex_wait_queue_me 0 S android.io
system 1009 1031 542 4671228 246048 futex_wait_queue_me 0 S rms
system 1009 1032 542 4671228 246048 futex_wait_queue_me 0 S android.fg
system 1009 1033 542 4671228 246048 futex_wait_queue_me 0 S android.display
system 1009 1034 542 4671228 246048 futex_wait_queue_me 0 S CpuTracker
system 1009 1035 542 4671228 246048 futex_wait_queue_me 0 S vivo_stats
system 1009 1036 542 4671228 246048 futex_wait_queue_me 0 S AnrMonitorThrea
system 1009 1037 542 4671228 246048 futex_wait_queue_me 0 S DuraSpeedEventT
system 1009 1039 542 4671228 246048 futex_wait_queue_me 0 S PowerManagerSer
system 1009 1040 542 4671228 246048 futex_wait_queue_me 0 S CallEndWaitThre
system 1009 1041 542 4671228 246048 futex_wait_queue_me 0 S WakelockControl
system 1009 1042 542 4671228 246048 pm_get_wakeup_count 0 S system_server
system 1009 1043 542 4671228 246048 futex_wait_queue_me 0 S system_server
system 1009 1044 542 4671228 246048 futex_wait_queue_me 0 S BatteryStats_wa
system 1009 1045 542 4671228 246048 futex_wait_queue_me 0 S PkmsDBHandlerTh
system 1009 1046 542 4671228 246048 futex_wait_queue_me 0 S PkmsUtilsHandle
system 1009 1047 542 4671228 246048 futex_wait_queue_me 0 S AdbInstallVerif
system 1009 1048 542 4671228 246048 futex_wait_queue_me 0 S PackageManager
system 1009 1117 542 4671228 246048 futex_wait_queue_me 0 S PackageInstalle
system 1009 1120 542 4671228 246048 futex_wait_queue_me 0 S android.anim
system 1009 1125 542 4671228 246048 SyS_epoll_wait 0 S SensorEventAckR
system 1009 1126 542 4671228 246048 binder_thread_read 0 S SensorService
system 1009 1127 542 4671228 246048 futex_wait_queue_me 0 S AccountManagerS
system 1009 1128 542 4671228 246048 futex_wait_queue_me 0 S SettingsProvide
system 1009 1129 542 4671228 246048 futex_wait_queue_me 0 S AlarmManager
system 1009 1130 542 4671228 246048 futex_wait_queue_me 0 S WakeupByWho
system 1009 1131 542 4671228 246048 futex_wait_queue_me 0 S WakeupByWho
system 1009 1133 542 4671228 246048 futex_wait_queue_me 0 S vivo_wmp_hook
system 1009 1136 542 4671228 246048 futex_wait_queue_me 0 S HwBinder:1009_1
system 1009 1137 542 4671228 246048 binder_thread_read 0 S HwBinder:1009_2
system 1009 1138 542 4671228 246048 binder_thread_read 0 S HwBinder:1009_3
system 1009 1139 542 4671228 246048 SyS_epoll_wait 0 S hidl_ssvc_poll
system 1009 1140 542 4671228 246048 futex_wait_queue_me 0 S InputDispatcher
system 1009 1141 542 4671228 246048 futex_wait_queue_me 0 S InputReader
system 1009 1142 542 4671228 246048 futex_wait_queue_me 0 S StorageManagerS
system 1009 1143 542 4671228 246048 unix_stream_read_generic 0 S VoldConnector
system 1009 1144 542 4671228 246048 unix_stream_read_generic 0 S CryptdConnector
system 1009 1145 542 4671228 246048 futex_wait_queue_me 0 S mDatabaseTempHa
......

Since the Signal Catcher thread is also stuck on a lock, the Java call stacks cannot be captured. The only option left is a coredump.

Manually enable the direct coredump feature:

adb shell setprop persist.aee.core.dump enable
adb shell setprop persist.aee.core.direct enable

Then capture the coredump with the aee command:

aeev -p 542

The file db.fatal.00.ManualDump.dbg then appears under /data/vendor/mtklog/aee_exp; pull it out.

Analyze it with SpOfflineDebugSuite:

== Exception Report v2.0 (for reference only) ==
Report guide: MediaTek On-Line > Quick Start > NE/KE Analysis Report Guide > NE Analysis Report
Detailed description: 0: signal 0
Version: alps-mp-o1.mp6/user build
Time of occurrence: Wed Aug 1 15:08:45 CST 2018 @ 2018-08-01 15:08:45.990560
Command line: system_server
pid: 0, ppid: 542 (zygote64)

== Thread info (185 threads in total) ==
Current thread info:
 Thread name: main, tid: 1009
 Thread state: running
 Java call stack:
 ...... dalvik.system.DexFile.openDexFileNative() <native method>
 /system/framework/arm64/boot-core-libart.vdex dalvik.system.DexFile.openDexFile() 
 /system/framework/arm64/boot-core-libart.vdex dalvik.system.DexPathList.makeDexElements() 
 /system/framework/arm64/boot-core-libart.vdex dalvik.system.DexPathList.() 
 /system/framework/arm64/boot-core-libart.vdex dalvik.system.PathClassLoader.() 
 /system/framework/arm64/boot-framework.vdex com.android.internal.os.ClassLoaderFactory.createClassLoader() 
 /system/framework/arm64/boot-framework.vdex com.android.internal.os.ClassLoaderFactory.createClassLoader() 
 /system/framework/arm64/boot-framework.vdex android.app.ApplicationLoaders.getClassLoader() 
 /system/framework/arm64/boot-framework.vdex android.app.ApplicationLoaders.getClassLoader() 
 /system/framework/arm64/boot-framework.vdex android.app.LoadedApk.createOrUpdateClassLoaderLocked() 
 /system/framework/arm64/boot-framework.vdex android.app.LoadedApk.getClassLoader() 
 /system/framework/arm64/boot-framework.vdex android.app.ActivityThread.handleCreateService() 
 /system/framework/arm64/boot-framework.vdex android.app.ActivityThread.-wrap4() 
 /system/framework/arm64/boot-framework.vdex android.app.ActivityThread$H.handleMessage() 
 /system/framework/arm64/boot-framework.vdex android.os.Handler.dispatchMessage() 
 /system/framework/arm64/boot-framework.vdex android.os.Looper.loop() 
 /system/framework/oat/arm64/services.vdex com.android.server.SystemServer.run() 
 /system/framework/oat/arm64/services.vdex com.android.server.SystemServer.main() 
 ...... java.lang.reflect.Method.invoke() <native method>
 /system/framework/arm64/boot-framework.vdex com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run() 
 /system/framework/arm64/boot-framework.vdex com.android.internal.os.ZygoteInit.main() 
 == end of stack ==
 Native call stack:
 libc.so syscall(arg1=98, arg2=0x0000007C28903360, arg3=0, arg4=0x00000000FFFFFFFF, arg5=0, arg6=0, arg7=0) + 28 <bionic/libc/arch-arm64/bionic/syscall.S:41>
 libart.so art::futex(val=-1, timeout=0x0000007FFA8F56E0, uaddr2=0x0000007FFA8F56E0) + 28 <art/runtime/base/mutex-inl.h:43>
 libart.so art::ReaderWriterMutex::HandleSharedLockContention(this=0x0000007C28903340, self=0x0000007C288C4E00, cur_state=-1) + 176 <art/runtime/base/mutex.cc:749>
 libart.so art::ReaderWriterMutex::SharedLock(this=0x0000007C28903340) + 48 <art/runtime/base/mutex-inl.h:139>
 libart.so art::ReaderMutexLock::ReaderMutexLock(mu=0x0000007C28903340) + 48 <art/runtime/base/mutex-inl.h:256>
 libart.so art::ClassLinker::LookupClass(this=0x0000007C28905CC0, self=0x0000007C288C4E00, descriptor=0x0000007C263D059A, hash=1781247089, class_loader=0) + 132 <art/runtime/class_linker.cc:4039>
 libart.so art::ClassLinker::LookupResolvedType(this=0x0000007C28905CC0, dex_file=0x0000007C288A5780, type_idx=440, dex_cache=0x00000000707DDF40, class_loader=0) + 280 <art/runtime/class_linker.cc:7910>
 libart.so art::ClassLinker::LookupResolvedType(class_loader=0) + 116 <art/runtime/class_linker-inl.h:70>
 libart.so art::verifier::MethodVerifier::ResolveClassAndCheckAccess(this=0x0000007FFA8F5D48, class_idx=440) + 220 <art/runtime/verifier/method_verifier.cc:3727>
 libart.so art::verifier::MethodVerifier::GetInstanceField(this=0x0000007FFA8F5D48, obj_type=0x0000007C18C44878, field_idx=7315) + 76 <art/runtime/verifier/method_verifier.cc:4857>
 libart.so art::verifier::MethodVerifier::VerifyISFieldAccess(this=0x0000007FFA8F5D48, inst=0x0000007C25F6DA98, insn_type=0x0000007C18C448F0, is_primitive=true, is_static=false) + 260 <art/runtime/verifier/method_verifier.cc:4963>
 libart.so art::verifier::MethodVerifier::CodeFlowVerifyInstruction(this=0x0000007FFA8F5D48, start_guess=0x0000007FFA8F5B54) + 11264 <art/runtime/verifier/reg_type-inl.h:0>
 libart.so art::verifier::MethodVerifier::CodeFlowVerifyMethod(this=0x0000007FFA8F5D48) + 192 <art/runtime/verifier/method_verifier.cc:1918>
 libart.so art::verifier::MethodVerifier::VerifyCodeFlow(this=0x0000007FFA8F5D48) + 424 <art/runtime/verifier/method_verifier.cc:1642>
 libart.so art::verifier::MethodVerifier::Verify(this=0x0000007FFA8F5D48) + 812 <art/runtime/verifier/method_verifier.cc:892>
 libart.so art::verifier::MethodVerifier::FindLocksAtDexPc(this=0x0000007FFA8F5D48) + 180 <art/runtime/verifier/method_verifier.cc:647>
 libart.so art::verifier::MethodVerifier::FindLocksAtDexPc(m=0x0000000070CB2AF0, dex_pc=50, monitor_enter_dex_pcs=0x0000007FFA8F60C0) + 524 <art/runtime/verifier/method_verifier.cc:615>
 libart.so art::Monitor::VisitLocks(stack_visitor=0x0000007FFA8F6380, callback=art::StackDumpVisitor::DumpLockedObject(), callback_context=0x0000007FFA8F65A0, abort_on_failure=false) + 456 <art/runtime/monitor.cc:1305>
 libart.so art::StackDumpVisitor::VisitFrame(this=0x0000007FFA8F6380) + 644 <art/runtime/thread.cc:1811>
 libart.so art::StackVisitor::WalkStack(this=0x0000007FFA8F6380, include_transitions=false) + 1124 <art/runtime/stack.cc:806>
 libart.so art::Thread::DumpJavaStack(this=0x0000007C288C4E00, os=0x0000007FFA8F65A0) + 300 <art/runtime/thread.cc:1916>
 libart.so art::Thread::DumpStack(this=0x0000007C288C4E00, os=0x0000007FFA8F65A0, dump_native_stack=true, backtrace_map=0x0000007BF6485660, force_dump_stack=false) + 372 <art/runtime/thread.cc:1947>
 libart.so art::DumpCheckpoint::Run(this=0x0000007FFA8F6990, thread=0x0000007C288C4E00) + 884 <art/runtime/thread_list.cc:217>
 libart.so art::ThreadList::RunCheckpoint(this=0x0000007C28909000, checkpoint_function=0x0000007FFA8F6990, callback=0) + 472 <art/runtime/thread_list.cc:350>
 libart.so art::ThreadList::Dump(this=0x0000007FFA8F6AF8, os=0x0000007C0CF73880, dump_native_stack=false) + 288 <art/runtime/thread_list.cc:261>
 libart.so art::AbortState::Dump(this=0x0000007FFA8F6AF8, os=0x0000007C0CF73880) + 220 <art/runtime/runtime.cc:431>
 libart.so art::Dumpable::Dump() + 4 <art/runtime/base/dumpable.h:39>
 libart.so art::operator<< () + 4 <art/runtime/base/dumpable.h:50>
 libart.so art::Runtime::Abort(msg=0x0000007C1CBB5D50) + 196 <art/runtime/runtime.cc:526>
 libart.so std::__1::function<(char const*)>::operator()(__arg=0x0000007FFA8F6B50) + 28 <external/libcxx/include/functional:1915>
 libart.so android::base::LogMessage::~LogMessage() + 1004 <system/core/base/logging.cpp:433>
 libart.so art::AppImageClassLoadersAndDexCachesHelper::Update() + 604 <art/runtime/class_linker.cc:1479>
 libart.so art::ClassLinker::AddImageSpace(this=0x0000007C28905CC0, space=0x0000007C289F9600, class_loader=0x0000007FFA8F6F7C, dex_elements=0x0000007FFA8F71D4, dex_location=0x0000007C0717CE70, out_dex_files=0x0000007FFA8F7100, error_msg=0x0000007FFA8F6F30) + 5020 <art/runtime/class_linker.cc:1987>
 libart.so art::OatFileManager::OpenDexFilesFromOat() + 2228 <art/runtime/oat_file_manager.cc:530>
 libart.so art::DexFile_openDexFileNative(arg1=0x0000007C288D6B20, arg2=0x0000007FFA8F71C4) + 148 <art/runtime/native/dalvik_system_DexFile.cc:278>
 system@framework@boot-core-libart.oat dalvik.system.DexFile.openDexFileNative(arg2=0x0000000014B95C18, arg3=0, arg5=0x0000000013690A98, arg6=0x0000000013690C28) + 264 
 system@framework@boot-core-libart.oat dalvik.system.DexFile.openDexFile(arg2=0x0000000014B95C18, arg3=0, arg4=0, arg5=0x0000000013690A98, arg6=0x0000000013690C28) + 244 
 system@framework@boot-core-libart.oat dalvik.system.DexPathList.makeDexElements(arg2=0x0000000013690B50, arg3=0, arg4=0x0000000013690B38, arg5=0x0000000013690A98) + 816 
 system@framework@boot-core-libart.oat dalvik.system.DexPathList.(arg2=0x0000000013690B18, arg3=0x0000000013690A98, arg4=0x0000000014B95C18, arg5=0x0000000013690A00, arg6=0) + 812 
 system@framework@boot-core-libart.oat dalvik.system.PathClassLoader.(arg2=0x0000000013690A98, arg3=0x0000000014B95C18, arg4=0x0000000013690A00, arg5=0x00000000707F2CC8) + 124 
 system@framework@boot-framework.oat com.android.internal.os.ClassLoaderFactory.createClassLoader(arg2=0x0000000014B95C18, arg3=0x0000000013690A00, arg4=0x00000000707F2CC8, arg5=0, arg6=27) + 760 
 system@framework@boot-framework.oat com.android.internal.os.ClassLoaderFactory.createClassLoader(arg2=0x0000000014B95C18, arg3=0x0000000013690A00, arg4=0x00000000136906C8, arg5=0x00000000707F2CC8, arg6=27, arg7=1, arg8=0, arg9=0x0000000070CB2AF0) + 92 
 system@framework@boot-framework.oat android.app.ApplicationLoaders.getClassLoader(arg2=0x000000007084F420, arg6=0x0000000013690A00, arg7=0x00000000136906C8, arg8=0, arg9=0x0000000070CB2BE0, arg10=0x14B95C187084F420) + 468 
 system@framework@boot-framework.oat android.app.ApplicationLoaders.getClassLoader(arg2=0x000000007084F420, arg3=0x0000000014B95C18, arg6=0x0000000013690A00, arg7=0x00000000136906C8, arg9=0x0000000070C67FD8) + 76 
 system@framework@boot-framework.oat android.app.LoadedApk.createOrUpdateClassLoaderLocked(arg2=0x0000000013690068, arg3=0) + 2412 
 system@framework@boot-framework.oat android.app.LoadedApk.getClassLoader(arg2=0x0000000013690068) + 84 
 system@framework@boot-framework.oat android.app.ActivityThread.handleCreateService(arg2=0x000000001325FCB0, arg3=0x0000000012C00FD0) + 360 
 system@framework@boot-framework.oat android.app.ActivityThread.-wrap4(arg2=0x000000001325FCB0, arg3=0x0000000012C00FD0) + 52 
 system@framework@boot-framework.oat android.app.ActivityThread$H.handleMessage(arg2=0x0000000014D37BB8, arg3=0x0000000012F7C340) + 6528 
 system@framework@boot-framework.oat android.os.Handler.dispatchMessage(arg2=0x0000000014D37BB8, arg3=0x0000000012F7C340) + 180 
 system@framework@boot-framework.oat android.os.Looper.loop() + 1452 
 services.odex com.android.server.SystemServer.run(arg2=0x000000001324CDA8) + 3052 
 services.odex com.android.server.SystemServer.main() + 244 
 libart.so art_quick_invoke_static_stub(arg1=0x00000000978F8310, arg2=0x0000007FFA8F7D78, arg3=4, arg4=0x0000007C288C4E00, arg5=0x0000007FFA8F7D48, arg6=0x0000007C10DCBEC3) + 604 <art/runtime/arch/arm64/quick_entrypoints_arm64.S:1814>
 libart.so art::ArtMethod::Invoke(this=0x00000000978F8310, self=0x0000007C288C4E00, args=0x0000007FFA8F7D78, args_size=4, result=0x0000007FFA8F7D48, shorty=0x0000007C10DCBEC3) + 264 <art/runtime/art_method.cc:369>
 libart.so art::InvokeWithArgArray(arg1=0x0000007FFA8F7E50, arg2=0x00000000978F8310, arg3=0x0000007FFA8F7D60, arg4=0x0000007FFA8F7D48, arg5=0x0000007C10DCBEC3) + 100 <art/runtime/reflection.cc:454>
 libart.so art::InvokeMethod(arg1=0x0000007FFA8F7E50, arg5=1) + 1456 <art/runtime/reflection.cc:667>
 libart.so art::Method_invoke(arg1=0x0000007C288D6B20, arg2=0x0000007FFA8F7EA4) + 48 <art/runtime/native/java_lang_reflect_Method.cc:87>
 system@framework@boot.oat java.lang.reflect.Method.invoke(arg2=0x000000001324B360, arg3=0) + 180 
 system@framework@boot-framework.oat com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(arg2=0x000000001324D460) + 132 
 system@framework@boot-framework.oat com.android.internal.os.ZygoteInit.main() + 2708 
 libart.so art_quick_invoke_static_stub(arg1=0x0000000070CBD220, arg2=0x0000007FFA8F82C0, arg3=4, arg4=0x0000007C288C4E00, arg5=0x0000007FFA8F82A0, arg6=0x0000007C26E5646D) + 604 <art/runtime/arch/arm64/quick_entrypoints_arm64.S:1814>
 libart.so art::ArtMethod::Invoke(this=0x0000000070CBD220, self=0x0000007C288C4E00, args=0x0000007FFA8F82C0, args_size=4, result=0x0000007FFA8F82A0, shorty=0x0000007C26E5646D) + 264 <art/runtime/art_method.cc:369>
 libart.so art::InvokeWithArgArray(arg1=0x0000007FFA8F83A8, arg2=0x0000000070CBD220, arg3=0x0000007FFA8F82A8, arg4=0x0000007FFA8F82A0, arg5=0x0000007C26E5646D) + 100 <art/runtime/reflection.cc:454>
 libart.so art::InvokeWithVarArgs(arg1=0x0000007FFA8F83A8, arg2=0, arg3=0x0000000070CBD220, arg4=0x0000007FFA8F8370) + 412 <art/runtime/reflection.cc:481>
 libart.so art::JNI::CallStaticVoidMethodV(mid=0x0000000070CBD220, args=0x0000007FFA8F8500) + 612 <art/runtime/jni_internal.cc:1731>
 libandroid_runtime.so _JNIEnv::CallStaticVoidMethod(this=0x0000007C288D6B20, clazz=105) + 120 <libnativehelper/include_jni/jni.h:780>
 libandroid_runtime.so android::AndroidRuntime::start() + 964 <frameworks/base/core/jni/AndroidRuntime.cpp:1140>
 app_process64 main(argc=6, argv=0x0000007FFA8F9868) + 1328 <frameworks/base/cmds/app_process/app_main.cpp:0>
 libc.so __libc_init(raw_args=0x0000007FFA8F9860, onexit=0, slingshot=main(), structors=0x0000007FFA8F9848) + 88 <bionic/libc/bionic/libc_init_dynamic.cpp:124>
 app_process64 _start_main() + 80
 == end of stack ==
 Assembly at the stop point:
 line  address  instruction  note
 bionic/libc/arch-arm64/bionic/syscall.S
 41 : 0000007CAC58452C: SVC ; the thread stopped here
 Registers at the time:
 X0: 0000007C28903360, X1: 0000000000000000, X2: 00000000FFFFFFFF, X3: 0000000000000000
 X4: 0000000000000000, X5: 0000000000000000, X6: 0000000000000000, X7: 63656A624F2F676E
 X8: 0000000000000062, X9: 0000000000000000, X10: 000000006A2BAC36, X11: 0000007C283F4950
 X12: 676E616C2F617661, X13: 3B7463656A624F2F, X14: 0000000000000024, X15: 0000000000000009
 X16: 0000007C283F7678, X17: 0000007CAC584510, X18: 0000000000000008, X19: 0000007C28903340
 X20: 00000000FFFFFFFF, X21: 0000007C288C4E00, X22: 0000007C28903360, X23: 0000007C28903370
 X24: 0000007CAE9DCA40, X25: 0000007C28905CC0, X26: 0000007CAE9DCA40, X27: 0000000000001C93
 X28: 0000007C25F6DA78, X29: 0000007FFA8F5740, X30: 0000007C27EAE8A8, SP: 0000007FFA8F56E0
 PC: 0000007CAC58452C

A careful read of the call stack shows that an ART abort had already occurred, and the deadlock then happened while the abort path was dumping call stacks.

Looking closely at the deadlock, the thread stopped while waiting for a lock:

mirror::Class* ClassLinker::LookupClass(Thread* self, const char* descriptor, size_t hash,
                                        ObjPtr<mirror::ClassLoader> class_loader)
{
    ReaderMutexLock mu(self, *Locks::classlinker_classes_lock_); /* stuck here */
    ....
}

So who holds this lock? Searching the code carefully shows that the Update() function earlier in this same call stack took it as a writer:

bool AppImageClassLoadersAndDexCachesHelper::Update(ClassLinker* class_linker, gc::space::ImageSpace* space,
        Handle<mirror::ClassLoader> class_loader, Handle<mirror::ObjectArray<mirror::DexCache>> dex_caches,
        ClassTable::ClassSet* new_class_set, bool* out_forward_dex_cache_array, std::string* out_error_msg)
    REQUIRES(!Locks::dex_lock_)
    REQUIRES_SHARED(Locks::mutator_lock_)
{
    DCHECK(out_forward_dex_cache_array != nullptr);
    DCHECK(out_error_msg != nullptr);
    PointerSize image_pointer_size = class_linker->GetImagePointerSize();
    Thread* const self = Thread::Current();
    gc::Heap* const heap = Runtime::Current()->GetHeap();
    const ImageHeader& header = space->GetImageHeader();
    {
     // Add image classes into the class table for the class loader, and fixup the dex caches and
     // class loader fields.
         WriterMutexLock mu(self, *Locks::classlinker_classes_lock_); /* the write lock is taken here */
......
}

That forms the deadlock: the same thread that holds the write lock is now asking for the read lock. And because the ART abort suspends most threads while the call stacks are dumped, the watchdog thread cannot run either; it cannot kick the kernel hang detect count, so a hang detect reboot naturally follows.

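Reduced to a userspace sketch (POSIX rwlocks standing in for ART's ReaderWriterMutex; an illustration, not ART code), the shape of the deadlock is:

#include <pthread.h>

static pthread_rwlock_t classes_lock = PTHREAD_RWLOCK_INITIALIZER;

void dump_after_abort(void)
{
    /* Update() takes the write lock ... */
    pthread_rwlock_wrlock(&classes_lock);

    /* ... then the abort path dumps call stacks, and LookupClass()
     * on the SAME thread asks for the read lock. The lock is not
     * reentrant, so on typical implementations this never returns. */
    pthread_rwlock_rdlock(&classes_lock);

    pthread_rwlock_unlock(&classes_lock);
}
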
Should the deadlock itself be fixed? No. ART invokes CHECK-style macros in a great many places; whenever a condition fails it aborts and then dumps call stacks, and dumping call stacks takes many locks and deadlocks easily. That is Google's original design.

Solve the abort and the deadlock problem disappears by itself:

bool AppImageClassLoadersAndDexCachesHelper::Update(ClassLinker* class_linker, gc::space::ImageSpace* space,
        Handle<mirror::ClassLoader> class_loader, Handle<mirror::ObjectArray<mirror::DexCache>> dex_caches,
        ClassTable::ClassSet* new_class_set, bool* out_forward_dex_cache_array, std::string* out_error_msg)
    REQUIRES(!Locks::dex_lock_)
    REQUIRES_SHARED(Locks::mutator_lock_)
{ 
......
    {
        WriterMutexLock mu2(self, *Locks::dex_lock_);
        // Make sure to do this after we update the arrays since we store the resolved types array
        // in DexCacheData in RegisterDexFileLocked. We need the array pointer to be the one in the BSS.
        CHECK(!class_linker->FindDexCacheDataLocked(*dex_file).IsValid()); /* the abort happens here */
        class_linker->RegisterDexFileLocked(*dex_file, dex_cache, class_loader.Get());
    }
......
}

Analysis shows that while loading the dex file, class_linker found the file already registered; logically this is probably caused by lppeService being loaded twice.

Root cause

lppeService is occasionally loaded twice, which triggers the abort.

Solution

Remove the obsolete lppeService code so that it is loaded only once.

Conclusion

To sum up this case: if most system_server threads are parked waiting on locks, it is most likely in the middle of a callstack dump, and an abort may have occurred.
