PERCPU

The latest Linux kernel TC action subsystem introduces a new flag that lets users skip the expensive percpu allocation and fall back to the built-in action stats. This significantly improves rule insertion rate and reduces memory usage, which is especially useful for hardware-offloaded rules.

Background reading on per-CPU variables: https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html

percpu variables are very efficient for some workloads, but they become a burden when percpu allocations have to happen frequently, because every allocation has to take the percpu allocator's single global mutex.
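
For reference, this is the dynamic percpu API the discussion is about. Below is a minimal sketch of a hypothetical module (not code from the patches) that allocates one percpu counter, initializes every CPU's copy, and frees it again; each alloc_percpu() call has to take the allocator's internal global mutex (pcpu_alloc_mutex in mm/percpu.c):

#include <linux/module.h>
#include <linux/init.h>
#include <linux/percpu.h>

static int __percpu *counter;

static int __init percpu_demo_init(void)
{
	int cpu;

	/* One allocation; it serializes on the percpu allocator's global mutex. */
	counter = alloc_percpu(int);
	if (!counter)
		return -ENOMEM;

	/* Each CPU has its own copy; writing it needs no lock. */
	for_each_possible_cpu(cpu)
		*per_cpu_ptr(counter, cpu) = 0;

	return 0;
}

static void __exit percpu_demo_exit(void)
{
	free_percpu(counter);
}

module_init(percpu_demo_init);
module_exit(percpu_demo_exit);
MODULE_LICENSE("GPL");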

Recent patches in the upstream kernel address this percpu problem in tc:

commit d86784fe9b037baf06a154283a4e8cff46b6fe2f
Merge: 21d8bd123ac4 9ae6b78708a7
Author: David S. Miller <davem@davemloft.net>
Date:   Wed Oct 30 18:07:51 2019 -0700

    Merge branch 'Control-action-percpu-counters-allocation-by-netlink-flag'

    Vlad Buslov says:

    ====================
    Control action percpu counters allocation by netlink flag

    Currently, significant fraction of CPU time during TC filter allocation
    is spent in percpu allocator. Moreover, percpu allocator is protected
    with single global mutex which negates any potential to improve its
    performance by means of recent developments in TC filter update API that
    removed rtnl lock for some Qdiscs and classifiers. In order to
    significantly improve filter update rate and reduce memory usage we
    would like to allow users to skip percpu counters allocation for
    specific action if they don't expect high traffic rate hitting the
    action, which is a reasonable expectation for hardware-offloaded setup.
    In that case any potential gains to software fast-path performance
    gained by usage of percpu-allocated counters compared to regular integer
    counters protected by spinlock are not important, but amount of
    additional CPU and memory consumed by them is significant.

    In order to allow configuring action counters allocation type at
    runtime, implement following changes:

    - Implement helper functions to update the action counters and use them
      in affected actions instead of updating counters directly. This steps
      abstracts actions implementation from counter types that are being
      used for particular action instance at runtime.

    - Modify the new helpers to use percpu counters if they were allocated
      during action initialization and use regular counters otherwise.

    - Extend action UAPI TCA_ACT space with TCA_ACT_FLAGS field. Add
      TCA_ACT_FLAGS_NO_PERCPU_STATS action flag and update
      hardware-offloaded actions to not allocate percpu counters when the
      flag is set.

    With this changes users that prefer action update slow-path speed over
    software fast-path speed can dynamically request actions to skip percpu
    counters allocation without affecting other users.

    Now, lets look at actual performance gains provided by this change.
    Simple test is used to measure insertion rate - iproute2 TC is executed
    in parallel by xargs in batch mode, its total execution time is measured
    by shell builtin "time" command. The command runs 20 concurrent tc
    instances, each with its own batch file with 100k rules:

    $ time ls add* | xargs -n 1 -P 20 sudo tc -b

    Two main rule profiles are tested. First is simple L2 flower classifier
    with single gact drop action. The configuration is chosen as worst case
    scenario because with single-action rules pressure on percpu allocator
    is minimized. Example rule:

    filter add dev ens1f0 protocol ip ingress prio 1 handle 1 flower skip_hw
        src_mac e4:11:0:0:0:0 dst_mac e4:12:0:0:0:0 action drop

    Second profile is typical real-world scenario that uses flower
    classifier with some L2-4 fields and two actions (tunnel_key+mirred).
    Example rule:

    filter add dev ens1f0_0 protocol ip ingress prio 1 handle 1 flower
        skip_hw src_mac e4:11:0:0:0:0 dst_mac e4:12:0:0:0:0 src_ip
        192.168.111.1 dst_ip 192.168.111.2 ip_proto udp dst_port 1 src_port
        1 action tunnel_key set id 1 src_ip 2.2.2.2 dst_ip 2.2.2.3 dst_port
        4789 action mirred egress redirect dev vxlan1

     Profile           |        percpu |     no_percpu | X improvement
                       | (k rules/sec) | (k rules/sec) |
    -------------------+---------------+---------------+---------------
     Gact drop         |           203 |           259 |          1.28
     tunnel_key+mirred |            92 |           204 |          2.22

    For simple drop action removing percpu allocation leads to ~25%
    insertion rate improvement. Perf profiles highlights the bottlenecks.

    Perf profile of run with percpu allocation (gact drop):

    + 89.11% 0.48% tc [kernel.vmlinux] [k] entry_SYSCALL_64
    + 88.58% 0.04% tc [kernel.vmlinux] [k] do_syscall_64
    + 87.50% 0.04% tc libc-2.29.so [.] __libc_sendmsg
    + 86.96% 0.04% tc [kernel.vmlinux] [k] __sys_sendmsg
    + 86.85% 0.01% tc [kernel.vmlinux] [k] ___sys_sendmsg
    + 86.60% 0.05% tc [kernel.vmlinux] [k] sock_sendmsg
    + 86.55% 0.12% tc [kernel.vmlinux] [k] netlink_sendmsg
    + 86.04% 0.13% tc [kernel.vmlinux] [k] netlink_unicast
    + 85.42% 0.03% tc [kernel.vmlinux] [k] netlink_rcv_skb
    + 84.68% 0.04% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg
    + 84.56% 0.24% tc [kernel.vmlinux] [k] tc_new_tfilter
    + 75.73% 0.65% tc [cls_flower] [k] fl_change
    + 71.30% 0.03% tc [kernel.vmlinux] [k] tcf_exts_validate
    + 71.27% 0.13% tc [kernel.vmlinux] [k] tcf_action_init
    + 71.06% 0.01% tc [kernel.vmlinux] [k] tcf_action_init_1
    + 70.41% 0.04% tc [act_gact] [k] tcf_gact_init
    + 53.59% 1.21% tc [kernel.vmlinux] [k] __mutex_lock.isra.0
    + 52.34% 0.34% tc [kernel.vmlinux] [k] tcf_idr_create
    - 51.23% 2.17% tc [kernel.vmlinux] [k] pcpu_alloc
      - 49.05% pcpu_alloc
        + 39.35% __mutex_lock.isra.0
          4.99% memset_erms
        + 2.16% pcpu_alloc_area
      + 2.17% __libc_sendmsg
    + 45.89% 44.33% tc [kernel.vmlinux] [k] osq_lock
    + 9.94% 0.04% tc [kernel.vmlinux] [k] tcf_idr_check_alloc
    + 7.76% 0.00% tc [kernel.vmlinux] [k] tcf_idr_insert
    + 6.50% 0.03% tc [kernel.vmlinux] [k] tfilter_notify
    + 6.24% 6.11% tc [kernel.vmlinux] [k] mutex_spin_on_owner
    + 5.73% 5.32% tc [kernel.vmlinux] [k] memset_erms
    + 5.31% 0.18% tc [kernel.vmlinux] [k] tcf_fill_node

    Here bottleneck is clearly in pcpu_alloc() function that takes more than
    half CPU time, which is mostly wasted busy-waiting for internal percpu
    allocator global lock.

    With percpu allocation removed (gact drop):

    + 87.50% 0.51% tc [kernel.vmlinux] [k] entry_SYSCALL_64
    + 86.94% 0.07% tc [kernel.vmlinux] [k] do_syscall_64
    + 85.75% 0.04% tc libc-2.29.so [.] __libc_sendmsg
    + 85.00% 0.07% tc [kernel.vmlinux] [k] __sys_sendmsg
    + 84.84% 0.07% tc [kernel.vmlinux] [k] ___sys_sendmsg
    + 84.59% 0.01% tc [kernel.vmlinux] [k] sock_sendmsg
    + 84.58% 0.14% tc [kernel.vmlinux] [k] netlink_sendmsg
    + 83.95% 0.12% tc [kernel.vmlinux] [k] netlink_unicast
    + 83.34% 0.01% tc [kernel.vmlinux] [k] netlink_rcv_skb
    + 82.39% 0.12% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg
    + 82.16% 0.25% tc [kernel.vmlinux] [k] tc_new_tfilter
    + 75.13% 0.84% tc [cls_flower] [k] fl_change
    + 69.92% 0.05% tc [kernel.vmlinux] [k] tcf_exts_validate
    + 69.87% 0.11% tc [kernel.vmlinux] [k] tcf_action_init
    + 69.61% 0.02% tc [kernel.vmlinux] [k] tcf_action_init_1
    - 68.80% 0.10% tc [act_gact] [k] tcf_gact_init
      - 68.70% tcf_gact_init
        + 36.08% tcf_idr_check_alloc
        + 31.88% tcf_idr_insert
    + 63.72% 0.58% tc [kernel.vmlinux] [k] __mutex_lock.isra.0
    + 58.80% 56.68% tc [kernel.vmlinux] [k] osq_lock
    + 36.08% 0.04% tc [kernel.vmlinux] [k] tcf_idr_check_alloc
    + 31.88% 0.01% tc [kernel.vmlinux] [k] tcf_idr_insert

    The gact actions (like all other actions types) are inserted in single
    idr instance protected by global (per namespace) lock that becomes new
    bottleneck with such simple rule profile and prevents achieving 2x+
    performance increase that can be expected by looking at profiling data
    for insertion action with percpu counter.

    Perf profile of run with percpu allocation (tunnel_key+mirred):

    + 91.95% 0.21% tc [kernel.vmlinux] [k] entry_SYSCALL_64
    + 91.74% 0.06% tc [kernel.vmlinux] [k] do_syscall_64
    + 90.74% 0.01% tc libc-2.29.so [.] __libc_sendmsg
    + 90.52% 0.01% tc [kernel.vmlinux] [k] __sys_sendmsg
    + 90.50% 0.04% tc [kernel.vmlinux] [k] ___sys_sendmsg
    + 90.41% 0.02% tc [kernel.vmlinux] [k] sock_sendmsg
    + 90.38% 0.04% tc [kernel.vmlinux] [k] netlink_sendmsg
    + 90.10% 0.06% tc [kernel.vmlinux] [k] netlink_unicast
    + 89.76% 0.01% tc [kernel.vmlinux] [k] netlink_rcv_skb
    + 89.28% 0.04% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg
    + 89.15% 0.03% tc [kernel.vmlinux] [k] tc_new_tfilter
    + 83.41% 0.33% tc [cls_flower] [k] fl_change
    + 81.17% 0.04% tc [kernel.vmlinux] [k] tcf_exts_validate
    + 81.13% 0.06% tc [kernel.vmlinux] [k] tcf_action_init
    + 81.04% 0.04% tc [kernel.vmlinux] [k] tcf_action_init_1
    - 73.59% 2.16% tc [kernel.vmlinux] [k] pcpu_alloc
      - 71.42% pcpu_alloc
        + 61.41% __mutex_lock.isra.0
          5.02% memset_erms
        + 2.93% pcpu_alloc_area
      + 2.16% __libc_sendmsg
    + 63.58% 0.17% tc [kernel.vmlinux] [k] tcf_idr_create
    + 63.40% 0.60% tc [kernel.vmlinux] [k] __mutex_lock.isra.0
    + 57.85% 56.38% tc [kernel.vmlinux] [k] osq_lock
    + 46.27% 0.13% tc [act_tunnel_key] [k] tunnel_key_init
    + 34.26% 0.02% tc [act_mirred] [k] tcf_mirred_init
    + 10.99% 0.00% tc [kernel.vmlinux] [k] dst_cache_init
    + 5.32% 5.11% tc [kernel.vmlinux] [k] memset_erms

    With two times more actions pressure on percpu allocator doubles, so now
    it takes ~74% of CPU execution time.

    With percpu allocation removed (tunnel_key+mirred):

    + 86.02% 0.50% tc [kernel.vmlinux] [k] entry_SYSCALL_64
    + 85.51% 0.12% tc [kernel.vmlinux] [k] do_syscall_64
    + 84.40% 0.03% tc libc-2.29.so [.] __libc_sendmsg
    + 83.84% 0.03% tc [kernel.vmlinux] [k] __sys_sendmsg
    + 83.72% 0.01% tc [kernel.vmlinux] [k] ___sys_sendmsg
    + 83.56% 0.01% tc [kernel.vmlinux] [k] sock_sendmsg
    + 83.50% 0.08% tc [kernel.vmlinux] [k] netlink_sendmsg
    + 83.02% 0.17% tc [kernel.vmlinux] [k] netlink_unicast
    + 82.48% 0.00% tc [kernel.vmlinux] [k] netlink_rcv_skb
    + 81.89% 0.11% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg
    + 81.71% 0.25% tc [kernel.vmlinux] [k] tc_new_tfilter
    + 73.99% 0.63% tc [cls_flower] [k] fl_change
    + 69.72% 0.00% tc [kernel.vmlinux] [k] tcf_exts_validate
    + 69.72% 0.09% tc [kernel.vmlinux] [k] tcf_action_init
    + 69.53% 0.05% tc [kernel.vmlinux] [k] tcf_action_init_1
    + 53.08% 0.91% tc [kernel.vmlinux] [k] __mutex_lock.isra.0
    + 45.52% 43.99% tc [kernel.vmlinux] [k] osq_lock
    - 36.02% 0.21% tc [act_tunnel_key] [k] tunnel_key_init
      - 35.81% tunnel_key_init
        + 15.95% tcf_idr_check_alloc
        + 13.91% tcf_idr_insert
        - 4.70% dst_cache_init
          + 4.68% pcpu_alloc
    + 33.22% 0.04% tc [kernel.vmlinux] [k] tcf_idr_check_alloc
    + 32.34% 0.05% tc [act_mirred] [k] tcf_mirred_init
    + 28.24% 0.01% tc [kernel.vmlinux] [k] tcf_idr_insert
    + 7.79% 0.05% tc [kernel.vmlinux] [k] idr_alloc_u32
    + 7.67% 7.35% tc [kernel.vmlinux] [k] idr_get_free
    + 6.46% 6.22% tc [kernel.vmlinux] [k] mutex_spin_on_owner
    + 5.11% 0.05% tc [kernel.vmlinux] [k] tfilter_notify

    With percpu allocation removed insertion rate is increased by ~120%.
    Such rule profile scales much better than simple single action because
    both types of actions were competing for single lock in percpu
    allocator, but not for action idr lock, which is per-action. Note that
    percpu allocator is still used by dst_cache in tunnel_key actions and
    consumes 4.68% CPU time. Dst_cache seems like good opportunity for
    further insertion rate optimization but is not addressed by this change.

    Another improvement provided by this change is significantly reduced
    memory usage. The test is implemented by sampling "used memory" value
    from "vmstat -s" command output. Following table includes memory usage
    measurements for same two configurations that were used for measuring
    insertion rate:

     Profile           | Mem per rule | Mem per rule no_percpu | Less memory used
                       |         (KB) |                   (KB) |             (KB)
    -------------------+--------------+------------------------+------------------
     Gact drop         |         3.91 |                   2.51 |              1.4
     tunnel_key+mirred |         6.73 |                   3.91 |              2.8

    Results indicate that memory usage of percpu allocator per action is
    ~1.4 KB. Note that any measurements of percpu allocator memory usage is
    inherently tied to particular setup since memory usage is linear to
    number of cores in system. It is to be expected that on current top of
    the line servers percpu allocator memory usage will be 2-5x more than on
    24 CPUs setup that was used for testing.

    Setup details: 2x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 32GB memory

    Patches applied on top of net-next branch:

    commit 2203cbf2c8b58a1e3bef98c47531d431d11639a0 (net-next)
    Author: Russell King <rmk+kernel@armlinux.org.uk>
    Date:   Tue Oct 15 11:38:39 2019 +0100

    net: sfp: move fwnode parsing into sfp-bus layer

    Changes V1 -> V2:

    - Include memory measurements.
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>
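
The core of the series is that each action now updates its counters through helpers that pick the counter type at run time. Here is a minimal sketch of that pattern, assuming the 2019-era struct tc_action fields (cpu_bstats, tcfa_bstats, tcfa_lock); the helper name is hypothetical and not the exact function introduced by the patches:

#include <net/act_api.h>
#include <net/sch_generic.h>

/* Hypothetical helper illustrating the fallback described in the cover letter:
 * use the lock-free percpu counters if they were allocated at action init time,
 * otherwise fall back to the plain counters protected by the action spinlock.
 */
static void example_update_bstats(struct tc_action *a, const struct sk_buff *skb)
{
	if (likely(a->cpu_bstats)) {
		bstats_cpu_update(this_cpu_ptr(a->cpu_bstats), skb);
		return;
	}

	spin_lock(&a->tcfa_lock);
	bstats_update(&a->tcfa_bstats, skb);
	spin_unlock(&a->tcfa_lock);
}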

 

The corresponding patch on the OVS side is:

commit 292d5bd9bb344527e0da19433cf3e51f8a24058c
Author: Vlad Buslov <vladbu@mellanox.com>
Date:   Mon Nov 4 18:34:49 2019 +0200

    tc: Set 'no_percpu' flag for compatible actions

    Recent changes in Linux kernel TC action subsystem introduced new
    TCA_ACT_FLAGS_NO_PERCPU_STATS flag. The purpose of the flag is to request
    action implementation to skip allocating action stats with expensive percpu
    allocator and use regular built-in action stats instead. Such approach
    significantly improves rule insertion rate and reduce memory usage for
    hardware-offloaded rules that don't need benefits provided by percpu
    allocated stats (improved software TC fast-path performance). Set the flag
    for all compatible actions.

    Modify acinclude.m4 to use OVS-internal pkt_cls.h implementation when
    TCA_ACT_FLAGS is not defined by kernel headers and to manually define
    struct nla_bitfield32 in netlink.h (new file) when it is not defined by
    kernel headers.

    Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
    Reviewed-by: Roi Dayan <roid@mellanox.com>
    Signed-off-by: Simon Horman <simon.horman@netronome.com>
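
On the wire, TCA_ACT_FLAGS is a NLA_BITFIELD32 attribute: the selector says which bits the sender cares about and the value sets them. A hedged userspace sketch of encoding the flag with libmnl follows; the attribute layout comes from the UAPI headers, while the helper name and the choice of libmnl are illustrative assumptions, not OVS code:

#include <libmnl/libmnl.h>
#include <linux/netlink.h>
#include <linux/pkt_cls.h>

/* Hypothetical helper: request "no percpu stats" for the action being built.
 * nlh must point at the netlink message under construction, positioned inside
 * the nested attributes of one action.
 */
static void put_no_percpu_flag(struct nlmsghdr *nlh)
{
	struct nla_bitfield32 flags = {
		.value    = TCA_ACT_FLAGS_NO_PERCPU_STATS,
		.selector = TCA_ACT_FLAGS_NO_PERCPU_STATS,
	};

	mnl_attr_put(nlh, TCA_ACT_FLAGS, sizeof(flags), &flags);
}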

To inspect the .data..percpu section, run:

$ readelf -S vmlinux
...
  [28] .data..percpu     PROGBITS         0000000000000000  01a00000
       00000000001fe9d8  0000000000000000  WA       0     0     4096
...
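
Statically defined per-CPU variables are what ends up in that section; the section size above (0x1fe9d8, roughly 2 MB) is the per-CPU template that gets copied for every possible CPU at boot. A minimal sketch of defining and using such a variable (hypothetical module code, not related to the patches above):

#include <linux/module.h>
#include <linux/init.h>
#include <linux/percpu.h>

/* Lands in .data..percpu; every CPU gets its own copy of the variable. */
static DEFINE_PER_CPU(unsigned long, demo_hits);

static int __init demo_init(void)
{
	/* this_cpu_inc() updates the current CPU's copy without any locking. */
	this_cpu_inc(demo_hits);
	pr_info("hits on this cpu: %lu\n", this_cpu_read(demo_hits));
	return 0;
}

static void __exit demo_exit(void)
{
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");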

 
