Linux Per-CPU Data

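This article discusses the advantages of the per-CPU data mechanism in Linux and how it reduces locking requirements and cache invalidation. It introduces the new per-CPU interface added in the 2.6 kernel, which simplifies creating and manipulating per-CPU data, and explains how to define per-CPU variables at compile time and at runtime.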


Reasons for Using Per-CPU Data

There are a couple of benefits to using per-CPU data. The first is the reduction in locking requirements. Depending on the semantics by which processors access the per-CPU data, you might not need any locking at all. Keep in mind that the "only this processor accesses this data" rule is only a programming convention. You need to ensure that the local processor accesses only its unique data. Nothing stops you from cheating.

Second, per-CPU data greatly reduces cache invalidation. This occurs as processors try to keep their caches in sync. If one processor manipulates data held in another processor's cache, that processor must flush or otherwise update its cache. Constant cache invalidation is called thrashing the cache and wreaks havoc on system performance. The use of per-CPU data keeps cache effects to a minimum because processors ideally access only their own data. The percpu interface cache-aligns all data to ensure that accessing one processor's data does not bring in another processor's data on the same cache line.
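
To make the cache-line point concrete, here is a minimal user-space sketch (not kernel code) of giving each processor's slot its own cache line. The 64-byte line size, the NR_SLOTS count, and the names struct padded_counter, counters, bump(), and on_distinct_lines() are all assumptions made for illustration; the percpu interface performs its own alignment and may do so differently.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache-line size; varies by CPU */
#define NR_SLOTS   4    /* hypothetical number of processors */

/* Aligning the member to a cache line also pads the struct to that size. */
struct padded_counter {
    _Alignas(CACHE_LINE) unsigned long value;
};

static struct padded_counter counters[NR_SLOTS];

/* Increment the slot that belongs to the given (hypothetical) processor. */
static unsigned long bump(int slot)
{
    assert(slot >= 0 && slot < NR_SLOTS);
    return ++counters[slot].value;
}

/* Return nonzero if two slots start on different cache lines. */
static int on_distinct_lines(int a, int b)
{
    uintptr_t line_a = (uintptr_t)&counters[a] / CACHE_LINE;
    uintptr_t line_b = (uintptr_t)&counters[b] / CACHE_LINE;
    return line_a != line_b;
}
```

Because every slot occupies a full line, a write to one slot should not invalidate the line that holds a neighboring slot, at the cost of some wasted memory per slot.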

Consequently, the use of per-CPU data often removes (or at least minimizes) the need for locking. The only safety requirement for the use of per-CPU data is disabling kernel preemption, which is much cheaper than locking, and the interface does so automatically. Per-CPU data can safely be used from either interrupt or process context. Note, however, that you cannot sleep in the middle of accessing per-CPU data (or else you might end up on a different processor).

No one is currently required to use the new per-CPU interface. Doing things manually (with an array as originally discussed) is fine, as long as you disable kernel preemption. The new interface, however, is much easier to use and might gain additional optimizations in the future. If you do decide to use per-CPU data in your kernel code, consider the new interface. One caveat against its use is that it is not backward compatible with earlier kernels.
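
The manual, array-based approach can be sketched as follows. This is a user-space illustration only: get_cpu() and put_cpu() are real kernel routines, but the versions here are simplified stand-ins, and current_cpu, preempt_count, my_percpu, count_event(), and the NR_CPUS value of 4 are assumptions made for the sketch.

```c
#include <assert.h>

#define NR_CPUS 4                 /* assumed CPU count for the sketch */

/* User-space stand-ins (assumptions) for kernel state. */
static int current_cpu;           /* pretend CPU the code is running on */
static int preempt_count;         /* > 0 means "preemption disabled" */

/* Stand-in: disable preemption and return the current CPU number. */
static int get_cpu(void)
{
    preempt_count++;
    return current_cpu;
}

/* Stand-in: re-enable preemption. */
static void put_cpu(void)
{
    assert(preempt_count > 0);
    preempt_count--;
}

/* The manual approach: one array slot per processor. */
static unsigned long my_percpu[NR_CPUS];

static void count_event(void)
{
    int cpu = get_cpu();          /* must not move to another CPU from here ... */
    my_percpu[cpu]++;
    put_cpu();                    /* ... until here */
}
```

The point of the pattern is that the index is read and used inside one preemption-disabled region, so the task cannot be moved to another processor between the two steps.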

The New percpu Interface

The 2.6 kernel introduced a new interface, known as percpu, for creating and manipulating per-CPU data. This interface generalizes the manual, array-based approach. Creation and manipulation of per-CPU data is simplified with this new approach.

The previously discussed method of creating and accessing per-CPU data is still valid and accepted. This new interface, however, grew out of the need for a simpler and more powerful method for manipulating per-CPU data on large symmetrical multiprocessing computers.

The header <linux/percpu.h> declares all the routines. You can find the actual definitions there, in mm/slab.c, and in <asm/percpu.h>.

The percpu_data structure is defined as follows:

struct percpu_data {
        void *ptrs[NR_CPUS];
};
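
As a rough illustration of how a structure shaped like this can back per-processor lookups, the following user-space sketch allocates one object per CPU and hands back the copy for a given CPU. It is one plausible reading, not the kernel's allocator: sketch_alloc_percpu(), sketch_per_cpu_ptr(), and sketch_free_percpu() are hypothetical helpers, calloc() stands in for kernel allocation, and NR_CPUS is assumed to be 4.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS 4   /* assumed CPU count for the sketch */

struct percpu_data {
    void *ptrs[NR_CPUS];
};

/* Hypothetical helper: allocate one zeroed object of `size` bytes per CPU. */
static struct percpu_data *sketch_alloc_percpu(size_t size)
{
    struct percpu_data *pdata = calloc(1, sizeof(*pdata));

    if (!pdata)
        return NULL;
    for (int i = 0; i < NR_CPUS; i++) {
        pdata->ptrs[i] = calloc(1, size);
        if (!pdata->ptrs[i]) {
            while (i--)                 /* undo the copies made so far */
                free(pdata->ptrs[i]);
            free(pdata);
            return NULL;
        }
    }
    return pdata;
}

/* Hypothetical helper: return the copy that belongs to `cpu`. */
static void *sketch_per_cpu_ptr(struct percpu_data *pdata, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    return pdata->ptrs[cpu];
}

/* Hypothetical helper: release every per-CPU copy and the table itself. */
static void sketch_free_percpu(struct percpu_data *pdata)
{
    if (!pdata)
        return;
    for (int i = 0; i < NR_CPUS; i++)
        free(pdata->ptrs[i]);
    free(pdata);
}
```

Each processor's copy is a separate allocation, so a write through one CPU's pointer leaves the other copies untouched.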

Per-CPU Data at Compile-Time

Defining a per-CPU variable at compile-time is quite easy:

DEFINE_PER_CPU(type, name);

This creates an instance of a variable of type type, named name, for each processor on the system. If you need a declaration of the variable elsewhere, to avoid compile warnings, the following macro is your friend:

DECLARE_PER_CPU(type, name);

You can manipulate the variables with the get_cpu_var() and put_cpu_var() routines. A call to get_cpu_var() returns an lvalue for the given variable on the current processor. It also disables preemption, which put_cpu_var() correspondingly enables.

Fetch the variable name with preemption disabled and increment it:

get_cpu_var(name)++;   /* increment name on this processor */

Then release the variable and re-enable kernel preemption:

put_cpu_var(name);     /* done; enable kernel preemption */

You can obtain the value of another processor's per-CPU data, too:

per_cpu(name, cpu)++;  /* increment name on the given processor */

You need to be careful with this approach because per_cpu() neither disables kernel preemption nor provides any sort of locking mechanism. The lockless nature of per-CPU data exists only if the current processor is the only manipulator of the data. If other processors touch other processors' data, you need locks. Be careful. Chapter 8, "Kernel Synchronization Introduction," and Chapter 9, "Kernel Synchronization Methods," discuss locking.
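
A common reason to read another processor's copy is to fold per-CPU counters into one total. The user-space sketch below shows the shape of that loop. The DEFINE_PER_CPU() and per_cpu() macros here are simplified array-based stand-ins, not the kernel's definitions, and hits, record_hit(), total_hits(), and the NR_CPUS value of 4 are assumptions for illustration. In real kernel code, such a loop would need locking if other processors can modify the data concurrently.

```c
#include <assert.h>

#define NR_CPUS 4   /* assumed CPU count for the sketch */

/* Simplified stand-ins (assumptions): one array element per processor. */
#define DEFINE_PER_CPU(type, name) type per_cpu__##name[NR_CPUS]
#define per_cpu(name, cpu)         (per_cpu__##name[(cpu)])

static DEFINE_PER_CPU(unsigned long, hits);

/* Record one event against the given (hypothetical) processor. */
static void record_hit(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    per_cpu(hits, cpu)++;
}

/* Fold every processor's private counter into one total. */
static unsigned long total_hits(void)
{
    unsigned long sum = 0;

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        sum += per_cpu(hits, cpu);
    return sum;
}
```

Updates stay on each processor's own counter, and only the comparatively rare read of the total touches every copy.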

Another subtle note: These compile-time per-CPU examples do not work for modules because the linker actually creates them in a unique executable section (for the curious, .data.percpu). If you need to access per-CPU data from modules, or if you need to create such data dynamically, there is hope.

Per-CPU Data at Runtime

The kernel implements a dynamic allocator, similar to kmalloc(), for creating per-CPU data. This routine creates an instance of the requested memory for each processor on the system. The prototypes are in <linux/percpu.h>:

void *alloc_percpu(type); /* a macro */

void *__alloc_percpu(size_t size, size_t align);

void free_percpu(const void *);

The alloc_percpu() macro allocates one instance of an object of the given type for every processor on the system. It is a wrapper around __alloc_percpu(), which takes the actual number of bytes to allocate as a parameter and the number of bytes on which to align the allocation. The alloc_percpu() macro aligns the allocation on a byte boundary that is the natural alignment of the given type. Such alignment is the usual behavior. For example,

struct rabid_cheetah *cheetah = alloc_percpu(struct rabid_cheetah);

is the same as

struct rabid_cheetah *cheetah = __alloc_percpu(sizeof (struct rabid_cheetah),
                                               __alignof__ (struct rabid_cheetah));

The __alignof__ construct is a gcc feature that returns the required (or recommended, in the case of weird architectures with no alignment requirements) alignment in bytes for a given type or lvalue. Its syntax is just like that of sizeof. For example,

__alignof__ (unsigned long)

would return four on 32-bit x86 (and eight on x86-64). When given an lvalue, the return value is the largest alignment that the lvalue might have. For example, an lvalue inside a structure could have a greater alignment requirement than if an instance of the same type were created outside of the structure, because of structure alignment requirements. Issues of alignment are further discussed in Chapter 19, "Portability."
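
The following user-space sketch shows __alignof__ applied to a type, to an lvalue, and to a structure whose alignment is raised by one of its members. It assumes gcc or a compatible compiler; struct pair, struct wide, pair1, and the three helper functions are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct pair {
    int  x;
    char y;
};

/* A member forced to a stricter alignment raises the struct's alignment. */
struct wide {
    char tag;
    int  v __attribute__((aligned(16)));
};

static struct pair pair1;

/* Alignment of a type: same call shape as sizeof. */
static size_t align_of_ulong(void)
{
    return __alignof__(unsigned long);
}

/* Alignment of an lvalue: here a plain char member. */
static size_t align_of_pair_y(void)
{
    return __alignof__(pair1.y);
}

/* Alignment of a structure containing an over-aligned member. */
static size_t align_of_wide(void)
{
    size_t a = __alignof__(struct wide);

    assert((a & (a - 1)) == 0);   /* alignments are powers of two */
    return a;
}
```

The result for unsigned long depends on the target, which is why the sketch returns it rather than hard-coding a value.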

A corresponding call to free_percpu() frees the given data on all processors.

A call to alloc_percpu() or __alloc_percpu() returns a pointer, which is used to indirectly reference the dynamically created per-CPU data. The kernel provides two macros to make this easy:

get_cpu_ptr(ptr);   /* return a void pointer to this processor's copy of ptr */

put_cpu_ptr(ptr);   /* done; enable kernel preemption */

The get_cpu_ptr() macro returns a pointer to the specific instance of the current processor's data. It also disables kernel preemption, which a call to put_cpu_ptr() then enables.

Let's look at a full example of using these functions. Of course, this example is a bit silly because you would normally allocate the memory once (perhaps in some initialization function), use it in various places, and free it once (perhaps in some shutdown function). Nevertheless, this example should make usage quite clear:

void *percpu_ptr;

unsigned long *foo;

percpu_ptr = alloc_percpu(unsigned long);

if (!percpu_ptr)

        /* error allocating memory .. */

foo = get_cpu_ptr(percpu_ptr);

/* manipulate foo .. */

put_cpu_ptr(percpu_ptr);

Finally, the function per_cpu_ptr() returns a given processor's unique data:

per_cpu_ptr(ptr, cpu);

Again, it does not disable kernel preemption, and if you touch another processor's data, keep in mind that you probably need to implement locking.
