Deadlock with PPTP Dual Access

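This article analyzes in detail a deadlock that occurs with PPTP dual access. The problem arises because the `ppp_push` function cannot acquire a lock through `_raw_spin_lock_bh`. Log analysis shows that `ppp_channel_push` is still holding the lock, which produces the deadlock. Comparing kernel versions and applying patches confirmed that the cause is a recursive call inside the kernel. Merging the relevant fix patches resolved the deadlock and restored normal operation of the interface.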

Problem description: a deadlock occurs when dialing with DHCP + PPTP.
Topology:
[Figure: network topology]

The crash log is as follows:

[2021/9/3 17:28:26] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [pptp_wan1:2261]
[2021/9/3 17:28:26] Modules linked in: dhcp_options(O) init_addr(  (null) -   (null)), core_addr(bfc0b000 - bfc0bbac)
[2021/9/3 17:28:26]  nos(O) init_addr(  (null) -   (null)), core_addr(bfbfb000 - bfc01b34)
[2021/9/3 17:28:26]  autodiscover(O) init_addr(  (null) -   (null)), core_addr(bfbf7000 - bfbf74ac)
[2021/9/3 17:28:26]  ddos_ip_fence(O) init_addr(  (null) -   (null)), core_addr(bfbf1000 - bfbf2fa0)
[2021/9/3 17:28:26]  dnsredirect(O) init_addr(  (null) -   (null)), core_addr(bfbeb000 - bfbec9a0)
[2021/9/3 17:28:26]  privilege_ip(O) init_addr(  (null) -   (null)), core_addr(bfbe6000 - bfbe6a44)
[2021/9/3 17:28:26]  url_filter(O) init_addr(  (null) -   (null)), core_addr(bfbcb000 - bfbced84)
[2021/9/3 17:28:26]  mac_filter(O) init_addr(  (null) -   (null)), core_addr(bfbc6000 - bfbc6a98)
[2021/9/3 17:28:26]  mac_group(O) init_addr(  (null) -   (null)), core_addr(bfbbf000 - bfbbf8a8)
[2021/9/3 17:28:26]  bm(O) init_addr(  (null) -   (null)), core_addr(bfbb0000 - bfbb2a34)
[2021/9/3 17:28:26]  kmbase(O) init_addr(  (null) -   (null)), core_addr(bfb92000 - bfb9927c)
[2021/9/3 17:28:26]  phy_check(O) init_addr(  (null) -   (null)), core_addr(bfb8e000 - bfb8e54c)
[2021/9/3 17:28:26]  private_ie(O) init_addr(  (null) -   (null)), core_addr(bfb84000 - bfb87744)
[2021/9/3 17:28:26]  wifibase(O) init_addr(  (null) -   (null)), core_addr(bfb75000 - bfb7b0c8)
[2021/9/3 17:28:26]  wl init_addr(  (null) -   (null)), core_addr(bf20c000 - bf647410)
[2021/9/3 17:28:26]  igs(P) init_addr(  (null) -   (null)), core_addr(bf203000 - bf205ed4)
[2021/9/3 17:28:26]  emf(P) init_addr(  (null) -   (null)), core_addr(bf1fb000 - bf1fd75c)
[2021/9/3 17:28:26]  cfg80211 init_addr(  (null) -   (null)), core_addr(bf1c5000 - bf1e71e4)
[2021/9/3 17:28:26]  hnd init_addr(  (null) -   (null)), core_addr(bf15c000 - bf1904c4)
[2021/9/3 17:28:26]  bcm_pcie_hcd init_addr(  (null) -   (null)), core_addr(bf14f000 - bf153fc0)
[2021/9/3 17:28:26]  wlcsm(P) init_addr(  (null) -   (null)), core_addr(bf14a000 - bf14adac)
[2021/9/3 17:28:26]  kmlib(O) init_addr(  (null) -   (null)), core_addr(bf142000 - bf1448a8)
[2021/9/3 17:28:26]  otp(P) init_addr(  (null) -   (null)), core_addr(bf13e000 - bf13e504)
[2021/9/3 17:28:26]  bcm_thermal init_addr(  (null) -   (null)), core_addr(bf139000 - bf1397b4)
[2021/9/3 17:28:26]  pwrmngtd(P) init_addr(  (null) -   (null)), core_addr(bf135000 - bf135480)
[2021/9/3 17:28:26]  bcmmcast init_addr(  (null) -   (null)), core_addr(bf121000 - bf12a51c)
[2021/9/3 17:28:26]  bcm_enet init_addr(  (null) -   (null)), core_addr(bf0f7000 - bf10cca4)
[2021/9/3 17:28:26]  archer(P) init_addr(  (null) -   (null)), core_addr(bf0c2000 - bf0dad70)
[2021/9/3 17:28:26]  cmdlist(P) init_addr(  (null) -   (null)), core_addr(bf0ad000 - bf0b81dc)
[2021/9/3 17:28:26]  pktflow(P) init_addr(  (null) -   (null)), core_addr(bf06a000 - bf08f3f8)
[2021/9/3 17:28:26]  bcm_ingqos(P) init_addr(  (null) -   (null)), core_addr(bf031000 - bf034ac8)
[2021/9/3 17:28:26]  chipinfo(P) init_addr(  (null) -   (null)), core_addr(bf02d000 - bf02d104)
[2021/9/3 17:28:26]  bcmvlan(P) init_addr(  (null) -   (null)), core_addr(bf010000 - bf01da20)
[2021/9/3 17:28:26]  bcmlibs(P) init_addr(  (null) -   (null)), core_addr(bf008000 - bf00a47c)
[2021/9/3 17:28:26]  gpio(O) init_addr(  (null) -   (null)), core_addr(bf000000 - bf001f18)
[2021/9/3 17:28:26] 
[2021/9/3 17:28:26] CPU: 1 PID: 2261 Comm: pptp_wan1 Tainted: P           O    4.1.52 #17
[2021/9/3 17:28:26] Hardware name: Generic DT based system
[2021/9/3 17:28:26] task: c8c14000 ti: c8d12000 task.ti: c8d12000
[2021/9/3 17:28:26] PC is at _raw_spin_lock_bh+0x48/0x5c
[2021/9/3 17:28:26] LR is at ppp_push+0x5fc/0x674
[2021/9/3 17:28:26] pc : [<c0529554>]    lr : [<c02f3eb0>]    psr: 20000013
[2021/9/3 17:28:26] sp : c8d13c98  ip : 0000002c  fp : c07200f8
[2021/9/3 17:28:26] r10: c740d4c0  r9 : c740d514  r8 : 00000021
[2021/9/3 17:28:26] r7 : c74859b0  r6 : c74859a4  r5 : c740d504  r4 : 00000000
[2021/9/3 17:28:26] r3 : 0000d6bb  r2 : 0000d6bc  r1 : 00000000  r0 : c74859a4
[2021/9/3 17:28:26] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[2021/9/3 17:28:26] Control: 10c5387d  Table: 0670404a  DAC: 00000015
[2021/9/3 17:28:26] CPU: 1 PID: 2261 Comm: pptp_wan1 Tainted: P           O    4.1.52 #17
[2021/9/3 17:28:26] Hardware name: Generic DT based system
[2021/9/3 17:28:26] [<c0026e20>] (unwind_backtrace) from [<c0022bf8>] (show_stack+0x10/0x14)
[2021/9/3 17:28:26] [<c0022bf8>] (show_stack) from [<c0524b18>] (dump_stack+0x8c/0xa0)
[2021/9/3 17:28:26] [<c0524b18>] (dump_stack) from [<c008a52c>] (watchdog_timer_fn+0x1f8/0x258)
[2021/9/3 17:28:26] [<c008a52c>] (watchdog_timer_fn) from [<c00721c0>] (__run_hrtimer+0x44/0xd4)
[2021/9/3 17:28:26] [<c00721c0>] (__run_hrtimer) from [<c0072a00>] (hrtimer_run_queues+0x94/0x118)
[2021/9/3 17:28:26] [<c0072a00>] (hrtimer_run_queues) from [<c0071630>] (update_process_times+0x28/0x64)
[2021/9/3 17:28:26] [<c0071630>] (update_process_times) from [<c007e3f8>] (tick_handle_periodic+0x28/0x88)
[2021/9/3 17:28:26] [<c007e3f8>] (tick_handle_periodic) from [<c031eee4>] (arch_timer_handler_phys+0x28/0x30)
[2021/9/3 17:28:26] [<c031eee4>] (arch_timer_handler_phys) from [<c0067b74>] (handle_percpu_devid_irq+0x6c/0x84)
[2021/9/3 17:28:26] [<c0067b74>] (handle_percpu_devid_irq) from [<c0063be0>] (generic_handle_irq+0x2c/0x3c)
[2021/9/3 17:28:26] [<c0063be0>] (generic_handle_irq) from [<c0063e7c>] (__handle_domain_irq+0x5c/0xb4)
[2021/9/3 17:28:26] [<c0063e7c>] (__handle_domain_irq) from [<c00193e4>] (gic_handle_irq+0x24/0x60)
[2021/9/3 17:28:26] [<c00193e4>] (gic_handle_irq) from [<c0023700>] (__irq_svc+0x40/0x74)
[2021/9/3 17:28:26] Exception stack(0xc8d13c50 to 0xc8d13c98)
[2021/9/3 17:28:26] 3c40:                                     c74859a4 00000000 0000d6bc 0000d6bb
[2021/9/3 17:28:26] 3c60: 00000000 c740d504 c74859a4 c74859b0 00000021 c740d514 c740d4c0 c07200f8
[2021/9/3 17:28:26] 3c80: 0000002c c8d13c98 c02f3eb0 c0529554 20000013 ffffffff
[2021/9/3 17:28:26] [<c0023700>] (__irq_svc) from [<c0529554>] (_raw_spin_lock_bh+0x48/0x5c)
[2021/9/3 17:28:26] [<c0529554>] (_raw_spin_lock_bh) from [<c02f3eb0>] (ppp_push+0x5fc/0x674) // the lock is requested here and never obtained
[2021/9/3 17:28:26] [<c02f3eb0>] (ppp_push) from [<c02f53e4>] (__ppp_xmit_process+0x48/0x62c)
[2021/9/3 17:28:26] [<c02f53e4>] (__ppp_xmit_process) from [<c02f5ae8>] (ppp_xmit_process+0x20/0x34)
[2021/9/3 17:28:26] [<c02f5ae8>] (ppp_xmit_process) from [<c02f5c20>] (ppp_start_xmit+0x124/0x198)
[2021/9/3 17:28:26] [<c02f5c20>] (ppp_start_xmit) from [<c03b23e8>] (dev_hard_start_xmit+0x254/0x31c)
[2021/9/3 17:28:26] [<c03b23e8>] (dev_hard_start_xmit) from [<c03d36a4>] (sch_direct_xmit+0xcc/0x214)
[2021/9/3 17:28:26] [<c03d36a4>] (sch_direct_xmit) from [<c03acc2c>] (__dev_xmit_skb+0x188/0x2e4)
[2021/9/3 17:28:26] [<c03acc2c>] (__dev_xmit_skb) from [<c03b2590>] (__dev_queue_xmit+0xe0/0x344)
[2021/9/3 17:28:26] [<c03b2590>] (__dev_queue_xmit) from [<c0417898>] (ip_finish_output+0x2d0/0xa68)
[2021/9/3 17:28:26] [<c0417898>] (ip_finish_output) from [<c0419808>] (ip_output+0x128/0x134)
[2021/9/3 17:28:26] [<c0419808>] (ip_output) from [<c02fd558>] (pptp_xmit+0x3c0/0x484)
[2021/9/3 17:28:26] [<c02fd558>] (pptp_xmit) from [<c02f5a28>] (ppp_channel_push+0x60/0xf0) // the lock is taken here, then the transmit function is called to send the packet
[2021/9/3 17:28:26] [<c02f5a28>] (ppp_channel_push) from [<c02f5d4c>] (ppp_write+0xb8/0x110)
[2021/9/3 17:28:26] [<c02f5d4c>] (ppp_write) from [<c00ba8f4>] (__vfs_write+0x1c/0xd8)
[2021/9/3 17:28:27] [<c00ba8f4>] (__vfs_write) from [<c00bb128>] (vfs_write+0x90/0x170)
[2021/9/3 17:28:27] [<c00bb128>] (vfs_write) from [<c00bb928>] (SyS_write+0x3c/0x90)
[2021/9/3 17:28:27] [<c00bb928>] (SyS_write) from [<c001f440>] (ret_fast_syscall+0x0/0x3c)
[2021/9/3 17:28:27] Kernel panic - not syncing: softlockup: hung tasks
[2021/9/3 17:28:27] CPU: 1 PID: 2261 Comm: pptp_wan1 Tainted: P           O L  4.1.52 #17

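Log analysis: the log shows that the CPU ended up stuck in ppp_push, trying to take a lock through _raw_spin_lock_bh. The lock was never acquired, so the CPU kept spinning there (the watchdog message reports 22s) until the soft lockup watchdog fired, the kernel panicked, and the device rebooted.
Searching the product or platform code for the ppp_push function finds it in the file xxx/drivers/net/ppp/ppp_generic.c.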

The ppp_push function is implemented as follows:

static void
ppp_push(struct ppp *ppp)
{
    struct list_head *list;
    struct channel *pch;
    struct sk_buff *skb = ppp->xmit_pending;

    if (!skb)
        return;

    list = &ppp->channels;
    if (list_empty(list)) {
        /* nowhere to send the packet, just drop it */
        ppp->xmit_pending = NULL;
        kfree_skb(skb);
        return;
    }

    if ((ppp->flags & SC_MULTILINK) == 0) {
        /* not doing multilink: send it down the first channel */
        list = list->next;
        pch = list_entry(list, struct channel, clist);

        spin_lock_bh(&pch->downl);//this is the lock that can never be acquired, which triggers the watchdog and the reboot
        if (pch->chan) {
            if (pch->chan->ops->start_xmit(pch->chan, skb))//call the channel's transmit function
                ppp->xmit_pending = NULL;
        } else {
            /* channel got unregistered */
            kfree_skb(skb);
            ppp->xmit_pending = NULL;
        }
        spin_unlock_bh(&pch->downl);
        return;
    }

#ifdef CONFIG_PPP_MULTILINK
    /* Multilink: fragment the packet over as many links
       as can take the packet at the moment. */
    if (!ppp_mp_explode(ppp, skb))
        return;
#endif /* CONFIG_PPP_MULTILINK */

    ppp->xmit_pending = NULL;
    kfree_skb(skb);
}

spin_lock_bh(&pch->downl) is implemented as:
static inline void spin_lock_bh(spinlock_t *lock)
{
	raw_spin_lock_bh(&lock->rlock);
}
#define raw_spin_lock_bh(lock)		_raw_spin_lock_bh(lock)

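The logic above makes it clear that the problem occurs because the lock requested through _raw_spin_lock_bh is never obtained. So why can the lock not be taken? One possibility is that some other code path took the lock and did not release it, so the next step is to add debug output and find where the release is missing. A global search for spin_lock_bh(&pch->downl) gives the call sites shown in the figure below. Prints were added at those sites and the problem was reproduced.
[Figure: search results for spin_lock_bh(&pch->downl)]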

The log from the debug build is as follows:

[2021/9/1 14:01:44] [ppp_push->1531]---1--
[2021/9/1 14:01:44] 
[2021/9/1 14:01:44] 
[2021/9/1 14:01:44] [ppp_push->1541]---2--
[2021/9/1 14:01:44] 
[2021/9/1 14:01:44] 
[2021/9/1 14:01:44] [ppp_push->1531]---1--
[2021/9/1 14:01:44] 
[2021/9/1 14:01:44] 
[2021/9/1 14:01:44] [ppp_push->1541]---2--
[2021/9/1 14:01:44] 
[2021/9/1 14:01:44] 
[2021/9/1 14:01:44] [ppp_channel_push->1806]---1--
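
The trace lines above have the shape "[function->line]---N--". The post does not show the print statements themselves, so the following is only a minimal user-space sketch of how such tracing might look, assuming that marker 1 is printed right after the lock is taken and marker 2 right after it is released. The names format_trace, LOCK_TRACE and locked_section_example are hypothetical; in the kernel the output call would be printk rather than fprintf.

```c
#include <stdio.h>

/* Format one trace line in the "[function->line]---step--" shape.
 * Returns the number of characters written, as snprintf does. */
static int format_trace(char *buf, size_t len, const char *func,
                        int line, int step)
{
    return snprintf(buf, len, "[%s->%d]---%d--", func, line, step);
}

/* Hypothetical helper: emit a trace line for the current function and line. */
#define LOCK_TRACE(step)                                          \
    do {                                                          \
        char trace_buf[96];                                       \
        format_trace(trace_buf, sizeof(trace_buf),                \
                     __func__, __LINE__, (step));                 \
        fprintf(stderr, "%s\n", trace_buf);                       \
    } while (0)

/* Example call site: one marker after the lock, one after the unlock. */
static void locked_section_example(void)
{
    /* spin_lock_bh(&pch->downl) would be here */
    LOCK_TRACE(1);
    /* ... work done while holding the lock ... */
    /* spin_unlock_bh(&pch->downl) would be here */
    LOCK_TRACE(2);
}
```

With markers placed this way, a function whose marker 1 appears in the log without a matching marker 2 is one that entered its locked section and never left it, which is the pattern of the last ppp_channel_push line above.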

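This shows where the lock is not released: ppp_channel_push took the lock and did not release it, so the later ppp_push call cannot acquire it. Why was it not released? The next step is to analyze the implementation of ppp_channel_push and check whether it has a logic flaw.
The ppp_channel_push function is as follows: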

static void
ppp_channel_push(struct channel *pch)
{
    struct sk_buff *skb;
    struct ppp *ppp;

    spin_lock_bh(&pch->downl);//the lock is taken here
    if (pch->chan) {
        while (!skb_queue_empty(&pch->file.xq)) {
            skb = skb_dequeue(&pch->file.xq);
            //step-by-step prints confirmed that the packet is sent from here and then ends up in ppp_push, which tries to take this same lock
            if (!pch->chan->ops->start_xmit(pch->chan, skb)) {
                /* put the packet back and try again later */
                skb_queue_head(&pch->file.xq, skb);
                break;
            }
        }
    } else {
        /* channel got deregistered */
        skb_queue_purge(&pch->file.xq);
    }
    spin_unlock_bh(&pch->downl);//the lock is released here, but execution never reached this line
    /* see if there is anything from the attached unit to be sent */
    if (skb_queue_empty(&pch->file.xq)) {
        read_lock_bh(&pch->upl);
        ppp = pch->ppp;
        if (ppp)
            ppp_xmit_process(ppp);
        read_unlock_bh(&pch->upl);
    }
}

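Looking at the logic alone, there is no flaw, and the standard kernel would presumably not make such a basic mistake. More prints confirmed that execution entered the transmit function and never returned: it went through a chain of nested calls that finally reached ppp_push, which tried to take the same lock. Why does this happen? Prints were added to identify which packets trigger it, and they turned out to be the LCP packets that the PPP protocol uses to maintain the link. These packets are perfectly normal.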
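
The call trace in the crash log shows the chain ppp_channel_push, pptp_xmit, ip_output, ..., ppp_start_xmit, ppp_push, all in one task on one CPU. The sketch below is a user-space model of that situation, not kernel code: an atomic_flag stands in for the non-recursive pch->downl lock, and the spin is bounded so the program returns instead of hanging. All names ending in _model, as well as spin_lock_bounded, are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for pch->downl: a non-recursive spinlock. */
static atomic_flag downl_model = ATOMIC_FLAG_INIT;

/* Try to take the lock, giving up after max_spins attempts.
 * A real spin_lock_bh() has no such limit and would spin forever. */
static bool spin_lock_bounded(atomic_flag *lock, int max_spins)
{
    for (int i = 0; i < max_spins; i++) {
        if (!atomic_flag_test_and_set(lock))
            return true;    /* flag was clear: lock acquired */
    }
    return false;           /* still held after max_spins attempts */
}

static void spin_unlock_model(atomic_flag *lock)
{
    atomic_flag_clear(lock);
}

/* Models ppp_push(): needs the channel lock before sending. */
static bool push_model(void)
{
    if (!spin_lock_bounded(&downl_model, 1000))
        return false;       /* the kernel would be stuck at this point */
    spin_unlock_model(&downl_model);
    return true;
}

/* Models ppp_channel_push(): holds the lock while the transmit path
 * loops back into push_model() in the same execution context.
 * Returns true if the nested acquisition succeeded. */
bool channel_push_model(void)
{
    bool nested_ok;

    if (!spin_lock_bounded(&downl_model, 1000))
        return false;
    nested_ok = push_model();   /* start_xmit -> ... -> ppp_push */
    spin_unlock_model(&downl_model);
    return nested_ok;
}
```

Calling push_model() on its own succeeds, while channel_push_model() returns false because the nested attempt can never see the lock released: the only code that could release it is waiting for the nested call to return. Without the spin limit, this is the endless spin that the soft lockup watchdog reported.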
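Since the standard kernel is usually not at fault, other areas such as routing were checked first, and nothing wrong was found. That left the standard kernel as the only suspect, so the next step was to verify that the kernel itself has the problem.
An already-released product xx on the same platform was tested and also has the problem, which strengthened the suspicion. Because it uses the same platform, however, this could not settle the question. Another product xx on a different platform was then tested: it is an rtk platform with kernel version 3.4.13, and it does not have the problem. At this point I started to doubt myself and was no longer sure the standard kernel was to blame. Next, product xxx1 with kernel version 4.4.10 was tested, since that is close to the 4.1.52 of the failing product xxx. xxx1 also has the problem, which suggests that 4.x kernels are affected.
Searching for _raw_spin_lock_bh under the ppp directory on the kernel website turned up this exact problem, together with a fix. Reference: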

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v4.9&id=55454a565836e1cb002d433e901804dea4406a32

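After merging this patch the deadlock was gone, but the ppp1 interface could no longer forward packets. Since the problem is caused by recursion, I searched for the word "recursion" and found further patches. They were merged and tested, and the test passed.
The key to this problem is that the same lock is taken recursively. The patch fixes it by using a per-CPU variable to detect the recursive call.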
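
The per-CPU detection idea can be sketched in user space as follows. This is an illustration under stated assumptions, not the code of the referenced patch: a thread-local counter stands in for the per-CPU variable, and the names xmit_recursion_model, xmit_stats, xmit_guarded and demo_reentry_drops are hypothetical. The guard checks the counter before entering the locked section and drops the packet if the same context is already inside it.

```c
#include <stdbool.h>

/* Stand-in for a per-CPU counter; thread-local storage is the closest
 * user-space analogue. The name is hypothetical. */
static _Thread_local int xmit_recursion_model;

struct xmit_stats {
    int sent;       /* packets that went through the locked section */
    int dropped;    /* packets rejected because of re-entry */
};

/* Transmit with a re-entry guard. loops_back models a packet that the
 * lower layers route back into the same transmit path. */
static void xmit_guarded(struct xmit_stats *st, bool loops_back)
{
    if (xmit_recursion_model) {
        st->dropped++;      /* re-entered: drop instead of locking again */
        return;
    }
    xmit_recursion_model++;
    /* the non-recursive lock would be taken here */
    if (loops_back)
        xmit_guarded(st, loops_back);
    st->sent++;
    /* ...and released here */
    xmit_recursion_model--;
}

/* Runs one looping transmit and returns how many packets were dropped. */
int demo_reentry_drops(void)
{
    struct xmit_stats st = {0, 0};

    xmit_guarded(&st, true);
    return st.dropped;
}
```

In this model the outer transmit completes and the nested one is dropped, so the lock is never requested twice in the same context. The counter returns to zero afterwards, so later transmits are not affected.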
