Hard lockup occurs due to an infinite loop encountered in distribute_cfs_runtime()
Environment
Red Hat Enterprise Linux 7.3 (kernel-3.10.0-514.el7.x86_64)
Issue
Hard lockup occurs due to an infinite loop encountered in distribute_cfs_runtime()
[ 1432.242810] Kernel panic - not syncing: Hard LOCKUP
[ 1432.242829] CPU: 25 PID: 0 Comm: swapper/25 Not tainted 3.10.0-514.el7.x86_64 #1
[ 1432.242855] Hardware name: Cisco Systems Inc UCSC-C220-M4S/UCSC-C220-M4S, BIOS C220M4.2.0.3d.0.111120141447 11/11/2014
[ 1432.242891] ffffffff818d9764 b7b1a7a2bef0fc23 ffff88105f445b18 ffffffff81685eac
[ 1432.242921] ffff88105f445b98 ffffffff8167f2b3 0000000000000010 ffff88105f445ba8
[ 1432.242951] ffff88105f445b48 b7b1a7a2bef0fc23 ffff88105f445ba8 ffffffff818d946a
[ 1432.242980] Call Trace:
[ 1432.242991] <NMI> [<ffffffff81685eac>] dump_stack+0x19/0x1b
[ 1432.243018] [<ffffffff8167f2b3>] panic+0xe3/0x1f2
[ 1432.243039] [<ffffffff8108562f>] nmi_panic+0x3f/0x40
[ 1432.243059] [<ffffffff8112f0e6>] watchdog_overflow_callback+0xf6/0x100
[ 1432.243085] [<ffffffff8117465e>] __perf_event_overflow+0x8e/0x1f0
[ 1432.243108] [<ffffffff811752a4>] perf_event_overflow+0x14/0x20
[ 1432.243132] [<ffffffff81009d88>] intel_pmu_handle_irq+0x1f8/0x4e0
[ 1432.243156] [<ffffffff81319d7c>] ? ioremap_page_range+0x27c/0x3e0
[ 1432.243179] [<ffffffff811bedf4>] ? vunmap_page_range+0x1c4/0x310
[ 1432.243202] [<ffffffff811bef51>] ? unmap_kernel_range_noflush+0x11/0x20
[ 1432.243227] [<ffffffff813c93d4>] ? ghes_copy_tofrom_phys+0x124/0x210
[ 1432.243252] [<ffffffff813c9560>] ? ghes_read_estatus+0xa0/0x190
[ 1432.243275] [<ffffffff8168daeb>] perf_event_nmi_handler+0x2b/0x50
[ 1432.243298] [<ffffffff8168ef19>] nmi_handle.isra.0+0x69/0xb0
[ 1432.243320] [<ffffffff8168f093>] do_nmi+0x133/0x410
[ 1432.243339] [<ffffffff8168e353>] end_repeat_nmi+0x1e/0x2e
[ 1432.243360] [<ffffffff8168d812>] ? _raw_spin_lock+0x32/0x50
[ 1432.243381] [<ffffffff8168d812>] ? _raw_spin_lock+0x32/0x50
[ 1432.243402] [<ffffffff8168d812>] ? _raw_spin_lock+0x32/0x50
[ 1432.243422] <<EOE>> <IRQ> [<ffffffff810d100b>] unthrottle_cfs_rq+0x4b/0x170
[ 1432.243453] [<ffffffff810d12e2>] distribute_cfs_runtime+0xf2/0x100
[ 1432.243476] [<ffffffff810d147f>] sched_cfs_period_timer+0xcf/0x160
[ 1432.243499] [<ffffffff810d13b0>] ? sched_cfs_slack_timer+0xc0/0xc0
[ 1432.243523] [<ffffffff810b4862>] __hrtimer_run_queues+0xd2/0x260
[ 1432.243546] [<ffffffff810b4e00>] hrtimer_interrupt+0xb0/0x1e0
[ 1432.243569] [<ffffffff810510d7>] local_apic_timer_interrupt+0x37/0x60
[ 1432.243594] [<ffffffff81698bcf>] smp_apic_timer_interrupt+0x3f/0x60
[ 1432.243617] [<ffffffff8169711d>] apic_timer_interrupt+0x6d/0x80
[ 1432.243638] <EOI> [<ffffffff81513f52>] ? cpuidle_enter_state+0x52/0xc0
[ 1432.243665] [<ffffffff81514099>] cpuidle_idle_call+0xd9/0x210
[ 1432.243688] [<ffffffff8103516e>] arch_cpu_idle+0xe/0x30
[ 1432.243709] [<ffffffff810e7c95>] cpu_startup_entry+0x245/0x290
[ 1432.243732] [<ffffffff8104f12a>] start_secondary+0x1ba/0x230
The following backtrace was observed from the panic task:
crash> bt
PID: 0 TASK: ffff8808fce38fb0 CPU: 25 COMMAND: "swapper/25"
#0 [ffff88105f4459f0] machine_kexec at ffffffff81059cdb
#1 [ffff88105f445a50] __crash_kexec at ffffffff81105182
#2 [ffff88105f445b20] panic at ffffffff8167f2ba
#3 [ffff88105f445ba0] nmi_panic at ffffffff8108562f
#4 [ffff88105f445bb0] watchdog_overflow_callback at ffffffff8112f0e6
#5 [ffff88105f445bc8] __perf_event_overflow at ffffffff8117465e
#6 [ffff88105f445c00] perf_event_overflow at ffffffff811752a4
#7 [ffff88105f445c10] intel_pmu_handle_irq at ffffffff81009d88
#8 [ffff88105f445e48] perf_event_nmi_handler at ffffffff8168daeb
#9 [ffff88105f445e68] nmi_handle at ffffffff8168ef19
#10 [ffff88105f445eb0] do_nmi at ffffffff8168f093
#11 [ffff88105f445ef0] end_repeat_nmi at ffffffff8168e353
[exception RIP: _raw_spin_lock+50]
RIP: ffffffff8168d812 RSP: ffff88105f443e18 RFLAGS: 00000012
RAX: 0000000000007f72 RBX: ffff88105aab1b00 RCX: 0000000000003f34
RDX: 0000000000003f42 RSI: 0000000000003f42 RDI: ffff881058a79d48
RBP: ffff88105f443e18 R8: 0000000000800008 R9: 0000000000000001
R10: 0000000000018695 R11: 0000000000000000 R12: ffff8810537cd400
R13: ffff88105f2d6c40 R14: ffff881058a79c00 R15: ffff881058a79d48
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#12 [ffff88105f443e18] _raw_spin_lock at ffffffff8168d812
#13 [ffff88105f443e20] unthrottle_cfs_rq at ffffffff810d100b
#14 [ffff88105f443e58] distribute_cfs_runtime at ffffffff810d12e2
#15 [ffff88105f443ea0] sched_cfs_period_timer at ffffffff810d147f
#16 [ffff88105f443ed8] __hrtimer_run_queues at ffffffff810b4862
#17 [ffff88105f443f30] hrtimer_interrupt at ffffffff810b4e00
#18 [ffff88105f443f80] local_apic_timer_interrupt at ffffffff810510d7
#19 [ffff88105f443f98] smp_apic_timer_interrupt at ffffffff81698bcf
#20 [ffff88105f443fb0] apic_timer_interrupt at ffffffff8169711d
--- <IRQ stack> ---
#21 [ffff8808fce47da8] apic_timer_interrupt at ffffffff8169711d
[exception RIP: cpuidle_enter_state+82]
RIP: ffffffff81513f52 RSP: ffff8808fce47e50 RFLAGS: 00000206
RAX: 0000014aa48e9400 RBX: 000000000000f8a0 RCX: 0000000000000018
RDX: 0000000225c17d03 RSI: ffff8808fce47fd8 RDI: 0000014aa48e9400
RBP: ffff8808fce47e78 R8: 000000000000608f R9: 0000000000000018
R10: 0000000000018695 R11: 0000000000000000 R12: ffff8808fce47e20
R13: ffff88105f44f8e0 R14: 0000000000000082 R15: ffff88105f44f8e0
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#22 [ffff8808fce47e80] cpuidle_idle_call at ffffffff81514099
#23 [ffff8808fce47ec0] arch_cpu_idle at ffffffff8103516e
#24 [ffff8808fce47ed0] cpu_startup_entry at ffffffff810e7c95
#25 [ffff8808fce47f28] start_secondary at ffffffff8104f12a
Root Cause
This is a known bug, fixed by upstream commit c06f04c70489b9deea3212af8375e2f0c2f0b184 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop"). From that commit's description:
distribute_cfs_runtime() intentionally only hands out enough runtime to bring each cfs_rq to 1 ns of runtime, expecting the cfs_rqs to then take the runtime they need only once they actually get to run. However, if they get to run sufficiently quickly, the period timer is still in distribute_cfs_runtime() and no runtime is available, causing them to throttle. Then distribute has to handle them again, and this can go on until distribute has handed out all of the runtime 1ns at a time, which takes far too long.
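To illustrate the mechanism, below is a minimal user-space sketch of the pre-fix behaviour. It is not kernel code: the runqueue count, the per-runqueue deficit and the amount of runtime left to distribute are assumptions modelled on the vmcore data in the Diagnostic Steps section below, and the "every cfs_rq re-throttles immediately" behaviour is the worst case described above. The point is only to show how many passes the period timer needs when quota is dribbled out a couple of microseconds at a time.

/*
 * Illustrative user-space simulation of the pre-fix behaviour -- NOT kernel
 * code.  Values are assumptions modelled on the vmcore analysed below.
 */
#include <stdio.h>

#define NR_THROTTLED    27          /* assumed number of throttled cfs_rqs */
#define TYPICAL_DEFICIT 2000LL      /* assumed deficit per cfs_rq, ~2 us   */

int main(void)
{
    long long remaining = 338097418LL;      /* ns still to be distributed */
    long long deficit[NR_THROTTLED];        /* models cfs_rq->runtime_remaining */
    long long passes = 0;
    int i;

    for (i = 0; i < NR_THROTTLED; i++)
        deficit[i] = -TYPICAL_DEFICIT;

    /*
     * Models the do_sched_cfs_period_timer() loop "while (throttled &&
     * runtime > 0)"; in this worst case 'throttled' never becomes false,
     * so only running out of runtime can end the loop.
     */
    while (remaining > 0) {
        passes++;
        for (i = 0; i < NR_THROTTLED && remaining > 0; i++) {
            /* distribute_cfs_runtime(): top each cfs_rq up to 1 ns only */
            long long runtime = -deficit[i] + 1;

            if (runtime > remaining)
                runtime = remaining;
            remaining  -= runtime;
            deficit[i] += runtime;

            /*
             * Worst case: the cfs_rq runs immediately, overspends its 1 ns
             * of headroom and throttles again before the period timer is
             * done, so the throttled list never becomes empty.
             */
            deficit[i] -= TYPICAL_DEFICIT;
        }
    }

    printf("runtime fully drained after %lld passes over %d runqueues\n",
           passes, NR_THROTTLED);
    return 0;
}

With these assumed numbers the quota is only drained after several thousand passes over the throttled list, each pass taking the per-runqueue locks, which is how the period timer can stay in this loop long enough to trip the hard-lockup detector.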
Diagnostic Steps
System Information:
crash> sys
KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/3.10.0-514.el7.x86_64/vmlinux
DUMPFILE: /cores/retrace/tasks/702709706/crash/vmcore [PARTIAL DUMP]
CPUS: 32
DATE: Mon Nov 14 22:38:09 2016
UPTIME: 00:23:52
LOAD AVERAGE: 4.69, 1.36, 0.50
TASKS: 397
NODENAME: somehost
RELEASE: 3.10.0-514.el7.x86_64
VERSION: #1 SMP Wed Oct 19 11:24:13 EDT 2016
MACHINE: x86_64 (2400 Mhz)
MEMORY: 63.9 GB
PANIC: "Kernel panic - not syncing: Hard LOCKUP"
Backtrace of panic task:
crash> bt
PID: 0 TASK: ffff8808fce38fb0 CPU: 25 COMMAND: "swapper/25"
#0 [ffff88105f4459f0] machine_kexec at ffffffff81059cdb
#1 [ffff88105f445a50] __crash_kexec at ffffffff81105182
#2 [ffff88105f445b20] panic at ffffffff8167f2ba
#3 [ffff88105f445ba0] nmi_panic at ffffffff8108562f
#4 [ffff88105f445bb0] watchdog_overflow_callback at ffffffff8112f0e6
#5 [ffff88105f445bc8] __perf_event_overflow at ffffffff8117465e
#6 [ffff88105f445c00] perf_event_overflow at ffffffff811752a4
#7 [ffff88105f445c10] intel_pmu_handle_irq at ffffffff81009d88
#8 [ffff88105f445e48] perf_event_nmi_handler at ffffffff8168daeb
#9 [ffff88105f445e68] nmi_handle at ffffffff8168ef19
#10 [ffff88105f445eb0] do_nmi at ffffffff8168f093
#11 [ffff88105f445ef0] end_repeat_nmi at ffffffff8168e353
[exception RIP: _raw_spin_lock+50]
RIP: ffffffff8168d812 RSP: ffff88105f443e18 RFLAGS: 00000012
RAX: 0000000000007f72 RBX: ffff88105aab1b00 RCX: 0000000000003f34
RDX: 0000000000003f42 RSI: 0000000000003f42 RDI: ffff881058a79d48
RBP: ffff88105f443e18 R8: 0000000000800008 R9: 0000000000000001
R10: 0000000000018695 R11: 0000000000000000 R12: ffff8810537cd400
R13: ffff88105f2d6c40 R14: ffff881058a79c00 R15: ffff881058a79d48
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#12 [ffff88105f443e18] _raw_spin_lock at ffffffff8168d812
#13 [ffff88105f443e20] unthrottle_cfs_rq at ffffffff810d100b
#14 [ffff88105f443e58] distribute_cfs_runtime at ffffffff810d12e2
#15 [ffff88105f443ea0] sched_cfs_period_timer at ffffffff810d147f
#16 [ffff88105f443ed8] __hrtimer_run_queues at ffffffff810b4862
#17 [ffff88105f443f30] hrtimer_interrupt at ffffffff810b4e00
#18 [ffff88105f443f80] local_apic_timer_interrupt at ffffffff810510d7
#19 [ffff88105f443f98] smp_apic_timer_interrupt at ffffffff81698bcf
#20 [ffff88105f443fb0] apic_timer_interrupt at ffffffff8169711d
--- <IRQ stack> ---
#21 [ffff8808fce47da8] apic_timer_interrupt at ffffffff8169711d
[exception RIP: cpuidle_enter_state+82]
RIP: ffffffff81513f52 RSP: ffff8808fce47e50 RFLAGS: 00000206
RAX: 0000014aa48e9400 RBX: 000000000000f8a0 RCX: 0000000000000018
RDX: 0000000225c17d03 RSI: ffff8808fce47fd8 RDI: 0000014aa48e9400
RBP: ffff8808fce47e78 R8: 000000000000608f R9: 0000000000000018
R10: 0000000000018695 R11: 0000000000000000 R12: ffff8808fce47e20
R13: ffff88105f44f8e0 R14: 0000000000000082 R15: ffff88105f44f8e0
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#22 [ffff8808fce47e80] cpuidle_idle_call at ffffffff81514099
#23 [ffff8808fce47ec0] arch_cpu_idle at ffffffff8103516e
#24 [ffff8808fce47ed0] cpu_startup_entry at ffffffff810e7c95
#25 [ffff8808fce47f28] start_secondary at ffffffff8104f12a
Analysis:
This unthrottle_cfs_rq() call originated from sched_cfs_period_timer(), which was spinning in the following while loop (do_sched_cfs_period_timer(), where the loop lives, is inlined into sched_cfs_period_timer(), as the disassembly below shows).
kernel/sched/fair.c
3467 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
3468 {
[...]
3519 while (throttled && runtime > 0) {
3520 raw_spin_unlock(&cfs_b->lock);
3521 /* we can't nest cfs_b->lock while distributing bandwidth */
3522 runtime = distribute_cfs_runtime(cfs_b, runtime,
3523 runtime_expires);
3524 raw_spin_lock(&cfs_b->lock);
3525
3526 throttled = !list_empty(&cfs_b->throttled_cfs_rq);
3527 }
[...]
crash> dis -lr ffffffff810d147f | tail
/usr/src/debug/kernel-3.10.0-514.el7/linux-3.10.0-514.el7.x86_64/kernel/sched/fair.c: 3520
0xffffffff810d1469 <sched_cfs_period_timer+185>: mov %r12,%rdi
0xffffffff810d146c <sched_cfs_period_timer+188>: callq 0xffffffff8168d710 <_raw_spin_unlock>
/usr/src/debug/kernel-3.10.0-514.el7/linux-3.10.0-514.el7.x86_64/kernel/sched/fair.c: 3522
0xffffffff810d1471 <sched_cfs_period_timer+193>: mov %r13,%rsi
0xffffffff810d1474 <sched_cfs_period_timer+196>: mov %r15,%rdx
0xffffffff810d1477 <sched_cfs_period_timer+199>: mov %r12,%rdi
0xffffffff810d147a <sched_cfs_period_timer+202>: callq 0xffffffff810d11f0 <distribute_cfs_runtime>
/usr/src/debug/kernel-3.10.0-514.el7/linux-3.10.0-514.el7.x86_64/kernel/sched/fair.c: 3524
3423 static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
3424 u64 remaining, u64 expires)
3425 {
crash> dis -lr ffffffff810d12e2 | head
/usr/src/debug/kernel-3.10.0-514.el7/linux-3.10.0-514.el7.x86_64/kernel/sched/fair.c: 3425
0xffffffff810d11f0 <distribute_cfs_runtime>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffff810d11f5 <distribute_cfs_runtime+5>: push %rbp
0xffffffff810d11f6 <distribute_cfs_runtime+6>: mov %rsp,%rbp
0xffffffff810d11f9 <distribute_cfs_runtime+9>: push %r15 <<<<<<<< runtime_expires
0xffffffff810d11fb <distribute_cfs_runtime+11>: push %r14
0xffffffff810d11fd <distribute_cfs_runtime+13>: mov %rdx,%r14
0xffffffff810d1200 <distribute_cfs_runtime+16>: push %r13 <<<<<<<< runtime
0xffffffff810d1202 <distribute_cfs_runtime+18>: push %r12 <<<<<<<< cfs_b
0xffffffff810d1204 <distribute_cfs_runtime+20>: mov %rsi,%r12
#14 [ffff88105f443e58] distribute_cfs_runtime at ffffffff810d12e2
ffff88105f443e60: ffff8810537cd500 b7b1a7a2bef0fc23
ffff88105f443e70: ffff881058a79d80 ffff881058a79d48
ffff88105f443e80: 00000000bebc2000 ffff881058a79e40
ffff88105f443e90: 0000014d2c58e112 ffff88105f443ed0
ffff88105f443ea0: ffffffff810d147f
#15 [ffff88105f443ea0] sched_cfs_period_timer at ffffffff810d147f
The callee-saved registers pushed by the distribute_cfs_runtime() prologue sit in this frame:
ffff88105f443e78 (5th push, caller's %r12): cfs_b           is ffff881058a79d48
ffff88105f443e80 (4th push, caller's %r13): runtime         is 00000000bebc2000
ffff88105f443e90 (2nd push, caller's %r15): runtime_expires is 0000014d2c58e112
distribute_cfs_runtime() tops up each throttled cfs_rq as long as it still has runtime left to hand out. The next step is to identify which cfs_rq was being unthrottled at the time of the NMI.
crash> dis -lr ffffffff810d12e2 | tail
/usr/src/debug/kernel-3.10.0-514.el7/linux-3.10.0-514.el7.x86_64/kernel/sched/fair.c: 3447
0xffffffff810d12ce <distribute_cfs_runtime+222>: test %rdx,%rdx
/usr/src/debug/kernel-3.10.0-514.el7/linux-3.10.0-514.el7.x86_64/kernel/sched/fair.c: 3443
0xffffffff810d12d1 <distribute_cfs_runtime+225>: mov %rdx,0xd8(%r15)
/usr/src/debug/kernel-3.10.0-514.el7/linux-3.10.0-514.el7.x86_64/kernel/sched/fair.c: 3447
0xffffffff810d12d8 <distribute_cfs_runtime+232>: jle 0xffffffff810d126f <distribute_cfs_runtime+127>
/usr/src/debug/kernel-3.10.0-514.el7/linux-3.10.0-514.el7.x86_64/kernel/sched/fair.c: 3448
0xffffffff810d12da <distribute_cfs_runtime+234>: mov %r15,%rdi
0xffffffff810d12dd <distribute_cfs_runtime+237>: callq 0xffffffff810d0fc0 <unthrottle_cfs_rq>
0xffffffff810d12e2 <distribute_cfs_runtime+242>: jmp 0xffffffff810d126f <distribute_cfs_runtime+127>
3443 cfs_rq->runtime_remaining += runtime;
3444 cfs_rq->runtime_expires = expires;
3445
3446 /* we check whether we're throttled above */
3447 if (cfs_rq->runtime_remaining > 0)
3448 unthrottle_cfs_rq(cfs_rq);
crash> dis -lr ffffffff810d100b | head
/usr/src/debug/kernel-3.10.0-514.el7/linux-3.10.0-514.el7.x86_64/kernel/sched/fair.c: 3379
0xffffffff810d0fc0 <unthrottle_cfs_rq>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffff810d0fc5 <unthrottle_cfs_rq+5>: push %rbp
0xffffffff810d0fc6 <unthrottle_cfs_rq+6>: mov %rsp,%rbp
0xffffffff810d0fc9 <unthrottle_cfs_rq+9>: push %r15
0xffffffff810d0fcb <unthrottle_cfs_rq+11>: push %r14
0xffffffff810d0fcd <unthrottle_cfs_rq+13>: push %r13
0xffffffff810d0fcf <unthrottle_cfs_rq+15>: push %r12
0xffffffff810d0fd1 <unthrottle_cfs_rq+17>: mov %rdi,%r12
0xffffffff810d0fd4 <unthrottle_cfs_rq+20>: push %rbx
3378 void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
3379 {
3380 struct rq *rq = rq_of(cfs_rq);
3381 struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
3382 struct sched_entity *se;
3383 int enqueue = 1;
3384 long task_delta;
#13 [ffff88105f443e20] unthrottle_cfs_rq at ffffffff810d100b
ffff88105f443e28: ffff88105f2d6c40 000000001426f50a
ffff88105f443e38: ffff881058a79e40 0000014d2c58e112
ffff88105f443e48: ffff8810537cd400 ffff88105f443e98
ffff88105f443e58: ffffffff810d12e2
#14 [ffff88105f443e58] distribute_cfs_runtime at ffffffff810d12e2
ffff88105f443e48 (2nd push, caller's %r15) holds the cfs_rq being unthrottled:
cfs_rq is ffff8810537cd400
How much runtime_remaining was remaining and how many runqueues were throttled?
crash> cfs_rq.runtime_remaining,runtime_expires ffff8810537cd400
runtime_remaining = 1
runtime_expires = 1430968131858
crash> cfs_rq.throttled_list ffff8810537cd400 -ox
struct cfs_rq {
[ffff8810537cd500] struct list_head throttled_list;
}
crash> list -H ffff8810537cd500 -o cfs_rq.throttled_list -s cfs_rq.runtime_remaining |grep -c runtime
28
crash> list -H ffff8810537cd500 -o cfs_rq.throttled_list -s cfs_rq.throttled,runtime_remaining
ffff8810537cd600
throttled = 1
runtime_remaining = -1690
ffff8810537cdc00
throttled = 1
runtime_remaining = -1722
ffff8810537ce000
throttled = 1
runtime_remaining = -1616
ffff8810537ce600
throttled = 1
runtime_remaining = -1688
ffff8810537cc800
throttled = 1
runtime_remaining = -1820
ffff8810537cde00
throttled = 1
runtime_remaining = -1757
ffff8810537cf600
throttled = 1
runtime_remaining = -1848
ffff8800369ed400
throttled = 1
runtime_remaining = -2394
ffff8810537cc600
throttled = 1
runtime_remaining = -1758
ffff8810537cd200
throttled = 1
runtime_remaining = -1814
ffff8800369ee800
throttled = 1
runtime_remaining = -2395
ffff8810537cc200
throttled = 1
runtime_remaining = -1759
ffff8810537cf400
throttled = 1
runtime_remaining = -1682
ffff8810537cfa00
throttled = 1
runtime_remaining = -1912
ffff8810537cca00
throttled = 1
runtime_remaining = -1597
ffff8800369ec000
throttled = 1
runtime_remaining = -15057
ffff8800369eee00
throttled = 1
runtime_remaining = -14083
ffff8800369ed000
throttled = 1
runtime_remaining = -4009
ffff8800369ede00
throttled = 1
runtime_remaining = -5090
ffff8800369ed800
throttled = 1
runtime_remaining = -14091
ffff8800369ee000
throttled = 1
runtime_remaining = -3746
ffff8800369ee600
throttled = 1
runtime_remaining = -15879
ffff8800369efe00
throttled = 1
runtime_remaining = -3457
ffff8800369eda00
throttled = 1
runtime_remaining = -3742
ffff8800369eec00
throttled = 1
runtime_remaining = -4032
ffff8800369ee400
throttled = 1
runtime_remaining = -3386
ffff8800369ed600
throttled = 1
runtime_remaining = -3530
ffff881058a79d40
throttled = 48
runtime_remaining = 0
crash> pd 1690 + 1722 + 1616 + 1688 + 1820 + 1757 + 1848 + 2394 + 1758 + 1814 + 2395 + 1759 + 1682 + 1912 + 1597 + 15057 + 14083 + 4009 + 5090 + 14091 + 3746 + 15879 + 3457 + 3742 + 4032 + 3386 + 3530
$1 = 117554
As shown above, the throttled runqueues were short by a total of only 117554 ns. (The last entry in the list output, ffff881058a79d40, is the throttled_cfs_rq list head embedded in struct cfs_bandwidth being printed as if it were a cfs_rq, which is why its values look bogus; it is not included in the sum.)
However, the remaining runtime held in distribute_cfs_runtime()'s local variable was 338097418 ns:
/usr/src/debug/kernel-3.10.0-514.el7/linux-3.10.0-514.el7.x86_64/kernel/sched/fair.c: 3438
0xffffffff810d12be <distribute_cfs_runtime+206>: sub %rcx,%rdx
0xffffffff810d12c1 <distribute_cfs_runtime+209>: cmp %rdx,%r12
0xffffffff810d12c4 <distribute_cfs_runtime+212>: cmovbe %r12,%rdx
/usr/src/debug/kernel-3.10.0-514.el7/linux-3.10.0-514.el7.x86_64/kernel/sched/fair.c: 3441
0xffffffff810d12c8 <distribute_cfs_runtime+216>: sub %rdx,%r12
3438 runtime = -cfs_rq->runtime_remaining + 1;
3439 if (runtime > remaining)
3440 runtime = remaining;
3441 remaining -= runtime;
#13 [ffff88105f443e20] unthrottle_cfs_rq at ffffffff810d100b
ffff88105f443e28: ffff88105f2d6c40 000000001426f50a <<<<<<<< %r12 (remaining runtime)
ffff88105f443e38: ffff881058a79e40 0000014d2c58e112
ffff88105f443e48: ffff8810537cd400 ffff88105f443e98
ffff88105f443e58: ffffffff810d12e2
#14 [ffff88105f443e58] distribute_cfs_runtime at ffffffff810d12e2
crash> pd 0x000000001426f50a
$2 = 338097418
This value is far larger than the combined runtime_remaining deficit of the throttled cfs_rqs, and the throttled_cfs_rq list was still not empty:
crash> cfs_bandwidth.throttled_cfs_rq ffff881058a79d48
throttled_cfs_rq = {
next = 0xffff8810537cd500,
prev = 0xffff8800369ed700
}
crash> list 0xffff8810537cd500 | wc -l
29
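As a rough sanity check on the scale of the problem (simple arithmetic on the values recovered above, not additional data from the vmcore): each throttled cfs_rq is only topped up by -runtime_remaining + 1 ns, so one full pass over the 27 throttled runqueues hands out roughly 117554 + 27 = 117581 ns, while 338097418 ns were still left to distribute. That is 338097418 / 117581, or about 2875 more passes even in the best case, each of them taking per-runqueue locks (note the exception RIP in _raw_spin_lock) while the freshly unthrottled tasks re-throttle and repopulate the list.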
Consequently, the CPU was stuck in the while loop of do_sched_cfs_period_timer(): the local runtime was far larger than the combined runtime_remaining deficit of the throttled runqueues, and the runqueues kept re-throttling as soon as they were topped up, so neither loop condition could become false before the hard-lockup detector fired.
This matches the bug fixed by the upstream commit c06f04c70489b9deea3212af8375e2f0c2f0b184 referenced in the Root Cause section.