Memory barrier

This article introduces the basic concept of memory barriers and their role in multiprocessor systems. A memory barrier is an instruction used to enforce the ordering of memory operations in multithreaded environments, preventing the data inconsistencies that out-of-order execution can cause. The article explains in detail how memory barriers avert unpredictable behaviour in multiprocessor environments, and provides concrete examples.
 


A memory barrier, also known as a membar, memory fence, or fence instruction, is a type of barrier instruction that causes a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction.

CPUs employ performance optimizations that can result in out-of-order execution. The reordering of memory operations (loads and stores) normally goes unnoticed within a single thread of execution, but causes unpredictable behaviour in concurrent programs and device drivers unless carefully controlled. The exact nature of an ordering constraint is hardware dependent, and defined by the architecture's memory ordering model. Some architectures provide multiple barriers for enforcing different ordering constraints.

Memory barriers are typically used when implementing low-level machine code that operates on memory shared by multiple devices. Such code includes synchronization primitives and lock-free data structures on multiprocessor systems, and device drivers that communicate with computer hardware.

An illustrative example

When a program runs on a single CPU, the hardware performs the necessary bookkeeping to ensure that programs execute as if all memory operations were performed in the order specified by the programmer (program order); hence, memory barriers are not necessary. However, when memory is shared with multiple devices, such as other CPUs in a multiprocessor system or memory-mapped peripherals, out-of-order access may affect program behavior. For example, a second CPU may see memory changes made by the first CPU in a sequence that differs from program order.

The following two-processor program fragments give a concrete example of how such out-of-order execution can affect program behavior:

Initially, memory locations x and f both hold the value 0. The program running on processor #1 loops while the value of f is zero, then it prints the value of x. The program running on processor #2 stores the value 42 into x and then stores the value 1 into f. Pseudo-code for the two program fragments is shown below. The steps of the program correspond to individual processor instructions.

Processor #1:
 loop:
  load the value in location f, if it is 0 goto loop
 print the value in location x

Processor #2:
 store the value 42 into location x
 store the value 1 into location f

You might expect the print statement to always print the number "42"; however, if processor #2's store operations are executed out-of-order, it is possible for f to be updated before x, and the print statement might therefore print "0". For most programs this situation is not acceptable. A memory barrier can be inserted before processor #2's assignment to f to ensure that the new value of x is visible to other processors at or prior to the change in the value of f.
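To make the fix concrete, the fragment below re-creates the example in C++11, with each processor modeled as a thread. This is a minimal sketch rather than canonical barrier usage for any particular architecture: the std::atomic_thread_fence calls stand in for the architecture-specific barrier instruction described above, and the release/acquire pairing shown is one of several orderings that would work.

 #include <atomic>
 #include <cstdio>
 #include <thread>

 int x = 0;                   // plain memory location x
 std::atomic<int> f{0};       // memory location f, used as the flag

 void processor2() {
     x = 42;                                               // store the value 42 into x
     std::atomic_thread_fence(std::memory_order_release);  // barrier: commit x before f
     f.store(1, std::memory_order_relaxed);                // store the value 1 into f
 }

 void processor1() {
     while (f.load(std::memory_order_relaxed) == 0) {}     // loop while f is 0
     std::atomic_thread_fence(std::memory_order_acquire);  // barrier: pairs with the release fence
     std::printf("%d\n", x);                               // prints 42, never 0
 }

 int main() {
     std::thread t1(processor1), t2(processor2);
     t2.join();
     t1.join();
 }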

For another illustrative example (a non-trivial one that arises in actual practice), see double-checked locking.

Low-level architecture-specific primitives

Memory barriers are low-level primitives which are part of the definition of an architecture's memory model. Like instruction sets, memory models vary considerably between architectures, so it is not appropriate to generalize about memory barrier behavior. The conventional wisdom is that using memory barriers correctly requires careful study of the architecture manuals for the hardware being programmed. That said, the following paragraph offers a glimpse of some memory barriers which exist in contemporary products.

Some architectures, including the ubiquitous x86/x64, provide several memory barrier instructions including an instruction sometimes called "full fence". A full fence ensures that all load and store operations prior to the fence will have been committed prior to any loads and stores issued following the fence. Other architectures, such as the Itanium, provide separate "acquire" and "release" memory barriers which address the visibility of read-after-write operations from the point of view of a reader (sink) or writer (source) respectively. Some architectures provide separate memory barriers to control ordering between different combinations of system memory and I/O memory. When more than one memory barrier instruction is available it is important to consider that the cost of different instructions may vary considerably.
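As a hedged illustration of such instructions, the helpers below wrap the three x86/x64 fence opcodes in GNU-style inline assembly; the function names are ours, and code targeting another architecture would need that architecture's own barrier instructions.

 static inline void store_fence() { asm volatile ("sfence" ::: "memory"); }  // orders stores
 static inline void load_fence()  { asm volatile ("lfence" ::: "memory"); }  // orders loads
 static inline void full_fence()  { asm volatile ("mfence" ::: "memory"); }  // full fence: orders loads and stores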

Multithreaded programming and memory visibility

Multithreaded programs usually use synchronization primitives provided by a high-level programming environment, such as Java and .NET Framework, or an application programming interface (API) such as POSIX Threads or Windows API. Primitives such as mutexes and semaphores are provided to synchronize access to resources from parallel threads of execution. These primitives are usually implemented with the memory barriers required to provide the expected memory visibility semantics. In such environments explicit use of memory barriers is not generally necessary.
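The sketch below suggests why this is so, using a toy C++11 spinlock: the acquire ordering on lock() and the release ordering on unlock() embed the memory barriers that callers would otherwise have to issue by hand. It illustrates the principle only and is not a production-quality lock.

 #include <atomic>

 class SpinLock {
     std::atomic_flag locked = ATOMIC_FLAG_INIT;
 public:
     void lock() {
         // acquire ordering: operations after lock() cannot be moved before it
         while (locked.test_and_set(std::memory_order_acquire)) {}
     }
     void unlock() {
         // release ordering: operations before unlock() cannot be moved after it
         locked.clear(std::memory_order_release);
     }
 };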

Each API or programming environment in principle has its own high-level memory model that defines its memory visibility semantics. Although programmers do not usually need to use memory barriers in such high level environments, it is important to understand their memory visibility semantics, to the extent possible. Such understanding is not necessarily easy to achieve because memory visibility semantics are not always consistently specified or documented.

Just as programming language semantics are defined at a different level of abstraction than machine language opcodes, a programming environment's memory model is defined at a different level of abstraction than that of a hardware memory model. It is important to understand this distinction and realize that there is not always a simple mapping between low-level hardware memory barrier semantics and the high-level memory visibility semantics of a particular programming environment. As a result, a particular platform's implementation of (say) POSIX Threads may employ stronger barriers than required by the specification. Programs which take advantage of memory visibility as-implemented rather than as-specified may not be portable.

Out-of-order execution versus compiler reordering optimizations

Memory barrier instructions only address reordering effects at the hardware level. Compilers may also reorder instructions as part of the program optimization process. Although the effects on parallel program behavior can be similar in both cases, in general it is necessary to take separate measures to inhibit compiler reordering optimizations for data that may be shared by multiple threads of execution. Note that such measures are usually only necessary for data which is not protected by synchronization primitives such as those discussed in the prior section.

In C and C++, the volatile keyword was intended to allow C and C++ programs to directly access memory-mapped I/O. Memory-mapped I/O generally requires that the reads and writes specified in source code happen in the exact order specified with no omissions. Omissions or reorderings of reads and writes by the compiler would break the communication between the program and the device accessed by memory-mapped I/O. A C or C++ compiler may not reorder reads and writes to volatile memory locations, nor may it omit a read or write to a volatile memory location, allowing a pointer to volatile memory to be used for memory-mapped I/O.
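A minimal sketch of this use of volatile is shown below; the device address and register layout are hypothetical, chosen only for illustration.

 #include <cstdint>

 // Hypothetical device register mapped at a fixed address.
 volatile std::uint32_t* const DEVICE_REG =
     reinterpret_cast<volatile std::uint32_t*>(0x40000000);

 void reset_device() {
     *DEVICE_REG = 1;  // the compiler must emit this store...
     *DEVICE_REG = 0;  // ...then this one, in this order, with neither omitted
 }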

The C and C++ standards do not address multiple threads (or multiple processors), and as such, the usefulness of volatile depends on the compiler and hardware. Although volatile guarantees that volatile reads and volatile writes happen in the exact order specified in the source code, the compiler may generate code (or the CPU may reorder execution) such that a volatile read or write is reordered with regard to non-volatile reads or writes, limiting its usefulness as an inter-thread flag or mutex. Preventing such reordering is compiler-specific, but some compilers, like gcc, will not reorder operations around in-line assembly code tagged with volatile and "memory", as in: asm volatile ("" : : : "memory"); (see more examples in compiler memory barrier). Moreover, volatile reads and writes are not guaranteed to be seen in the same order by other processors, due to caching, the cache coherence protocol, and relaxed memory ordering, meaning volatile variables alone may not even work as inter-thread flags or mutexes.
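The contrast can be made explicit with two small helpers (gcc/Clang syntax; the function names are ours): the empty asm statement constrains only the compiler, while the __sync_synchronize() builtin also emits a hardware fence instruction.

 static inline void compiler_barrier() {
     asm volatile ("" : : : "memory");  // no instruction emitted; blocks compiler reordering only
 }

 static inline void hardware_barrier() {
     __sync_synchronize();              // full barrier: constrains both compiler and CPU
 }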

Some languages and compilers may provide sufficient facilities to implement functions which address both the compiler-reordering and machine-reordering issues. As of Java version 1.5 (also known as version 5), the volatile keyword is guaranteed to prevent certain hardware and compiler reorderings, as part of the new Java Memory Model. The proposed C++ memory model does not use volatile; instead, C++0x will include special atomic types and operations with semantics similar to those of volatile in the Java Memory Model.
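Below is a sketch of the atomic-types idiom referred to here, using the syntax eventually standardized as std::atomic: a flag with the default sequentially consistent ordering provides the cross-thread publication guarantee that plain C/C++ volatile does not.

 #include <atomic>

 std::atomic<bool> ready{false};  // analogous to a Java volatile boolean
 int payload = 0;

 void producer() {
     payload = 42;
     ready.store(true);           // sequentially consistent store publishes payload
 }

 void consumer() {
     while (!ready.load()) {}     // sequentially consistent load pairs with the store
     // payload is guaranteed to be 42 here
 }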

External links

Original article: http://en.wikipedia.org/wiki/Memory_barrier
