Memory barrier An illustrative example

This article examines the role of memory barriers in multiprocessor systems and how they ensure consistent program execution. Examples illustrate memory-visibility problems between processors, and several kinds of memory barrier instructions are introduced, such as full, acquire, and release barriers. The use of memory barriers in multithreaded programming, and the effect of compiler optimizations on memory access order, are also discussed.



When a program runs on a single-CPU machine, the hardware performs the necessary bookkeeping to ensure that the program executes as if all memory operations were performed in the order specified by the programmer (program order), so memory barriers are not necessary. However, when the memory is shared with multiple devices, such as other CPUs in a multiprocessor system, or memory mapped peripherals, out-of-order access may affect program behavior. For example, a second CPU may see memory changes made by the first CPU in a sequence which differs from program order.

The following two-processor program gives an example of how such out-of-order execution can affect program behavior:

Initially, memory locations x and f both hold the value 0. The program running on processor #1 loops while the value of f is zero, then it prints the value of x. The program running on processor #2 stores the value 42 into x and then stores the value 1 into f. Pseudo-code for the two program fragments is shown below. The steps of the program correspond to individual processor instructions.

Processor #1:

 while (f == 0);
 // Memory fence required here
 print x;

Processor #2:

 x = 42;
 // Memory fence required here
 f = 1;

One might expect the print statement to always print the number "42"; however, if processor #2's store operations are executed out-of-order, it is possible for f to be updated before x, and the print statement might therefore print "0". Similarly, processor #1's load operations may be executed out-of-order, making it possible for x to be read before f is checked, so that the print statement again prints an unexpected value. For most programs neither of these situations is acceptable.

A memory barrier can be inserted before processor #2's assignment to f to ensure that the new value of x is visible to other processors at or prior to the change in the value of f. Another can be inserted before processor #1's access to x to ensure the value of x is not read prior to seeing the change in the value of f.

For another illustrative example (a non-trivial one that arises in actual practice), see double-checked locking.


Memory barrier Low-level architecture-specific primitives


Memory barriers are low-level primitives and part of an architecture's memory model, which, like instruction sets, vary considerably between architectures, so it is not appropriate to generalize about memory barrier behavior. The conventional wisdom is that using memory barriers correctly requires careful study of the architecture manuals for the hardware being programmed. That said, the following paragraph offers a glimpse of some memory barriers which exist in contemporary products.

Some architectures, including the ubiquitous x86/x64, provide several memory barrier instructions including an instruction sometimes called "full fence". A full fence ensures that all load and store operations prior to the fence will have been committed prior to any loads and stores issued following the fence. Other architectures, such as the Itanium, provide separate "acquire" and "release" memory barriers which address the visibility of read-after-write operations from the point of view of a reader (sink) or writer (source) respectively. Some architectures provide separate memory barriers to control ordering between different combinations of system memory and I/O memory. When more than one memory barrier instruction is available it is important to consider that the cost of different instructions may vary considerably.
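In C11 these three flavors can be requested portably through atomic_thread_fence; which machine instruction each maps to (MFENCE, DMB, mf, and so on) is an implementation detail of the target architecture. A minimal sketch — the wrapper names are illustrative, not standard:

```c
#include <stdatomic.h>

/* Full fence: no load or store may cross it in either direction.
 * On x86-64 this typically lowers to MFENCE or a locked instruction. */
static inline void full_fence(void) {
    atomic_thread_fence(memory_order_seq_cst);
}

/* Acquire fence (reader/"sink" side): later loads and stores may not
 * be reordered before it. */
static inline void acquire_fence(void) {
    atomic_thread_fence(memory_order_acquire);
}

/* Release fence (writer/"source" side): earlier loads and stores may
 * not be reordered after it. */
static inline void release_fence(void) {
    atomic_thread_fence(memory_order_release);
}
```

The cost difference the text mentions is visible here: on x86 the acquire and release fences usually compile to nothing (the hardware already provides that ordering), while the full fence emits a real, comparatively expensive instruction.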


Memory barrier Multithreaded programming and memory visibility


Multithreaded programs usually use synchronization primitives provided by a high-level programming environment, such as Java and .NET Framework, or an application programming interface (API) such as POSIX Threads or Windows API. Primitives such as mutexes and semaphores are provided to synchronize access to resources from parallel threads of execution. These primitives are usually implemented with the memory barriers required to provide the expected memory visibility semantics. In such environments explicit use of memory barriers is not generally necessary.

Each API or programming environment in principle has its own high-level memory model that defines its memory visibility semantics. Although programmers do not usually need to use memory barriers in such high level environments, it is important to understand their memory visibility semantics, to the extent possible. Such understanding is not necessarily easy to achieve because memory visibility semantics are not always consistently specified or documented.

Just as programming language semantics are defined at a different level of abstraction than machine language opcodes, a programming environment's memory model is defined at a different level of abstraction than that of a hardware memory model. It is important to understand this distinction and realize that there is not always a simple mapping between low-level hardware memory barrier semantics and the high-level memory visibility semantics of a particular programming environment. As a result, a particular platform's implementation of (say) POSIX Threads may employ stronger barriers than required by the specification. Programs which take advantage of memory visibility as implemented rather than as specified may not be portable.


Memory barrier Out-of-order execution versus compiler reordering optimizations


Memory barrier instructions address reordering effects only at the hardware level. Compilers may also reorder instructions as part of the program optimization process. Although the effects on parallel program behavior can be similar in both cases, in general it is necessary to take separate measures to inhibit compiler reordering optimizations for data that may be shared by multiple threads of execution. Note that such measures are usually necessary only for data which is not protected by synchronization primitives such as those discussed in the prior section.

In C and C++, the volatile keyword was intended to allow C and C++ programs to directly access memory-mapped I/O. Memory-mapped I/O generally requires that the reads and writes specified in source code happen in the exact order specified with no omissions. Omissions or reorderings of reads and writes by the compiler would break the communication between the program and the device accessed by memory-mapped I/O. A C or C++ compiler may not reorder reads from and writes to volatile memory locations, nor may it omit a read from or write to a volatile memory location. The keyword volatile does not guarantee a memory barrier to enforce cache-consistency. Therefore the use of "volatile" alone is not sufficient to use a variable for inter-thread communication on all systems and processors.[1]
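The memory-mapped I/O use case looks like the following sketch. The register layout here is a hypothetical UART invented for illustration; on real hardware `base` would be a fixed physical address rather than a parameter:

```c
#include <stdint.h>

/* Hypothetical UART register layout (an assumption for illustration). */
enum { REG_STATUS = 0, REG_DATA = 1 };
#define TX_READY 0x1u

void uart_putc(volatile uint32_t *base, char c) {
    /* Because base points to volatile memory, the compiler must emit a
     * fresh read of the status register on every iteration; without
     * volatile it could legally hoist the load out of the loop. */
    while ((base[REG_STATUS] & TX_READY) == 0)
        ;
    /* This store may be neither omitted nor reordered with respect to
     * other volatile accesses. */
    base[REG_DATA] = (uint32_t)c;
}
```

Note that volatile constrains only the compiler here: as the paragraph above says, it provides no hardware barrier and no cache-coherence guarantee.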

The C and C++ standards prior to C11 and C++11 do not address multiple threads (or multiple processors),[2] and as such, the usefulness of volatile depends on the compiler and hardware. Although volatile guarantees that volatile reads and volatile writes happen in the exact order specified in the source code, the compiler may generate code (or the CPU may re-order execution) such that a volatile read or write is reordered with regard to non-volatile reads or writes, limiting its usefulness as an inter-thread flag or mutex. Preventing such reordering is compiler-specific; some compilers, such as GCC, will not reorder operations around inline assembly marked volatile with a "memory" clobber, as in: asm volatile ("" : : : "memory"); (see more examples under compiler memory barrier). Moreover, it is not guaranteed that volatile reads and writes will be seen in the same order by other processors or cores, due to caching, cache-coherence protocols, and relaxed memory ordering, meaning that volatile variables alone may not even work as inter-thread flags or mutexes.
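The GCC construct just mentioned can be wrapped as a pure compiler barrier and contrasted with a hardware fence. The macro names below are illustrative, not standard:

```c
#include <stdatomic.h>

/* Compiler barrier (GCC/Clang extension): emits no machine instruction.
 * It only forbids the compiler from reordering memory accesses across
 * this point or keeping their values cached in registers; the CPU may
 * still reorder the resulting loads and stores at run time. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Hardware full fence: constrains the CPU as well as the compiler;
 * strictly stronger, and considerably more expensive. */
#define hw_full_fence() atomic_thread_fence(memory_order_seq_cst)
```

This is the distinction the section heading draws: compiler_barrier() suffices against compiler reordering alone, while shared-memory code on a weakly ordered machine needs the hardware fence as well.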

Some languages and compilers may provide sufficient facilities to implement functions which address both the compiler reordering and machine reordering issues. In Java version 1.5 (also known as version 5), the volatile keyword is now guaranteed to prevent certain hardware and compiler re-orderings, as part of the new Java Memory Model. C++11 standardizes special atomic types and operations with semantics similar to those of volatile in the Java Memory Model.


from:

http://en.potiori.com/Memory_barrier.html
