Barrier Capabilities of Intel LOCK-Prefixed Instructions

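This article describes the memory-ordering model of Intel CPU architectures in detail, including the ordering characteristics of different processor families, such as how P6 and later families use a write-ordered model with store-buffer forwarding. It also examines the role of LOCK-prefixed instructions and why they matter for memory consistency and visibility.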

Besides making a single operation atomic, Intel LOCK-prefixed instructions also provide visibility and ordering guarantees.

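For the single-operation atomicity and visibility of LOCK-prefixed instructions, see the two links below. In essence, both come down to locking the bus or locking the cache, combined with the cache coherence protocol.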

Intel LOCK-prefixed instructions: https://blog.youkuaiyun.com/reliveIT/article/details/90038750

Implementation of memory barriers in HotSpot on the x86 platform: https://blog.youkuaiyun.com/reliveIT/article/details/121945327

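Note in particular that the ordering provided by Intel LOCK-prefixed instructions prevents hardware reordering only; it does not prevent compiler reordering. Compiler reordering is prevented by C++ volatile and by inline assembly whose clobber list contains "memory".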
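The distinction can be sketched as follows. This is a minimal illustration that assumes GCC or Clang on x86-64; the names `compiler_barrier`, `full_barrier`, and `publish_then_read` are hypothetical and chosen only for this example. The first helper emits no instruction at all and constrains only the compiler, while the second uses a locked no-op add on the top of the stack, which constrains both the compiler (through the "memory" clobber) and the processor.

```cpp
#include <atomic>
#include <cassert>

// Compiler-only barrier: generates no machine instruction. The "memory"
// clobber tells the compiler not to move memory accesses across this point.
inline void compiler_barrier() {
#if defined(__GNUC__)
    asm volatile("" ::: "memory");
#else
    std::atomic_signal_fence(std::memory_order_seq_cst);  // portable fallback
#endif
}

// Full barrier: a LOCK-prefixed instruction that changes nothing (adds 0 to
// the word at the top of the stack). The processor may not reorder loads or
// stores across it, and the "memory" clobber also stops the compiler.
inline void full_barrier() {
#if defined(__GNUC__) && defined(__x86_64__)
    asm volatile("lock; addl $0, 0(%%rsp)" ::: "cc", "memory");
#else
    std::atomic_thread_fence(std::memory_order_seq_cst);  // portable fallback
#endif
}

// Single-threaded demonstration of where each barrier would be placed.
int publish_then_read(int value) {
    static int data = 0;
    static int flag = 0;
    data = value;
    compiler_barrier();  // the compiler keeps the store to data before the store to flag
    flag = 1;
    full_barrier();      // earlier stores are ordered before any later load
    assert(flag == 1);
    return data;
}
```

Calling `publish_then_read(7)` simply returns 7; the point of the sketch is the placement of the two barriers, not the result.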

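I have been meaning to write an article called "Let's Talk About volatile" that properly explains how the HotSpot virtual machine implements, on x86, the memory semantics of Java volatile defined by the JSR-133 Java Memory Model: that is, how it prevents both compiler reordering and processor reordering, including reordering between volatile and non-volatile variables. But knowing how something works is one thing, and writing it down clearly for other people is more troublesome and more time-consuming. So I am taking a divide-and-conquer approach and will pull the pieces together when I find the time.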

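In the Intel CPU manuals, the material on how LOCK-prefixed instructions prevent processor reordering is concentrated in Volume 3. This article mainly collects and organizes those passages for later reference.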

11.10  STORE BUFFER

Intel 64 and IA-32 processors temporarily store each write (store) to memory in a store buffer. The store buffer improves processor performance by allowing the processor to continue executing instructions without having to wait until a write to memory and/or to a cache is complete. It also allows writes to be delayed for more efficient use of memory-access bus cycles.

In general, the existence of the store buffer is transparent to software, even in systems that use multiple processors. The processor ensures that write operations are always carried out in program order. It also insures that the contents of the store buffer are always drained to memory in the following situations:

  • When an exception or interrupt is generated.
  • (P6 and more recent processor families only) When a serializing instruction is executed.
  • When an I/O instruction is executed.
  • When a LOCK operation is performed.
  • (P6 and more recent processor families only) When a BINIT operation is performed.
  • (Pentium III, and more recent processor families only) When using an SFENCE instruction to order stores.
  • (Pentium 4 and more recent processor families only) When using an MFENCE instruction to order stores.

The discussion of write ordering in Section 8.2, “Memory Ordering,” gives a detailed description of the operation of the store buffer.

8.2  MEMORY ORDERING

The term memory ordering refers to the order in which the processor issues reads (loads) and writes (stores) through the system bus to system memory. The Intel 64 and IA-32 architectures support several memory-ordering models depending on the implementation of the architecture. For example, the Intel386 processor enforces program ordering (generally referred to as strong ordering), where reads and writes are issued on the system bus in the order they occur in the instruction stream under all circumstances. 

To allow performance optimization of instruction execution, the IA-32 architecture allows departures from strong-ordering model called processor ordering in Pentium 4, Intel Xeon, and P6 family processors. These processor-ordering variations (called here the memory-ordering model) allow performance enhancing operations such as allowing reads to go ahead of buffered writes. The goal of any of these variations is to increase instruction execution speeds, while maintaining memory coherency, even in multiple-processor system.

8.2.1  Memory Ordering in the Intel ® Pentium ®  and Intel486 ™  Processors

The Pentium and Intel486 processors follow the processor-ordered memory model; however, they operate as strongly-ordered processors under most circumstances. Reads and writes always appear in programmed order at the system bus—except for the following situation where processor ordering is exhibited. Read misses are permitted to go ahead of buffered writes on the system bus when all the buffered writes are cache hits and, therefore, are not directed to the same address being accessed by the read miss. 

In the case of I/O operations, both reads and writes always appear in programmed order.

Software intended to operate correctly in processor-ordered processors (such as the Pentium 4, Intel Xeon, and P6 family processors) should not depend on the relatively strong ordering of the Pentium or Intel486 processors. Instead, it should ensure that accesses to shared variables that are intended to control concurrent execution among processors are explicitly required to obey program ordering through the use of appropriate locking or serializing operations (see Section 8.2.5, “Strengthening or Weakening the Memory-Ordering Model”).

8.2.2  Memory Ordering in P6 and More Recent Processor Families

The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium 4, and P6 family processors also use a processor-ordered memory-ordering model that can be further defined as “write ordered with store-buffer forwarding.” This model can be characterized as follows. 

In a single-processor system for memory regions defined as write-back cacheable, the memory-ordering model respects the following principles (Note the memory-ordering principles for single-processor and multiple-processor systems are written from the perspective of software executing on the processor, where the term “processor” refers to a logical processor. For example, a physical processor supporting multiple cores and/or HyperThreading Technology is treated as a multi-processor systems.):

  • Reads are not reordered with other reads.
  • Writes are not reordered with older reads.
  • Writes to memory are not reordered with other writes, with the following exceptions:
    • writes executed with the CLFLUSH instruction;
    • streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); 
    • and string operations (see Section 8.2.4.1).
  • Reads may be reordered with older writes to different locations but not with older writes to the same location. 
  • Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.
  • Reads cannot pass earlier LFENCE and MFENCE instructions.
  • Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.
  • LFENCE instructions cannot pass earlier reads.
  • SFENCE instructions cannot pass earlier writes.
  • MFENCE instructions cannot pass earlier reads or writes.

In a multiple-processor system, the following ordering principles apply:

  • Individual processors use the same ordering principles as in a single-processor system.
  • Writes by a single processor are observed in the same order by all processors.
  • Writes from an individual processor are NOT ordered with respect to the writes from other processors.
  • Memory ordering obeys causality (memory ordering respects transitive visibility).
  • Any two stores are seen in a consistent order by processors other than those performing the stores
  • Locked instructions have a total order.

8.2.3.2   Neither Loads Nor Stores Are Reordered with Like Operations

The Intel-64 memory-ordering model allows neither loads nor stores to be reordered with the same kind of operation. That is, it ensures that loads are seen in program order and that stores are seen in program order. 

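Note: for two unrelated shared variables, x86 does not allow a read to be reordered with an earlier read, or a write to be reordered with an earlier write.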
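A message-passing sketch shows what this guarantee buys in practice. The function name `message_passing` is hypothetical. The release and acquire orderings are there for the C++ compiler and memory model; on x86 they are expected to compile to ordinary loads and stores with no fence instruction, precisely because the hardware already keeps stores in order with stores and loads in order with loads.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// The producer writes a payload and then sets a flag; the consumer waits for
// the flag and then reads the payload. Returns what the consumer observed.
int message_passing(int value) {
    int data = 0;                        // plain, non-atomic payload
    std::atomic<bool> ready{false};
    int observed = -1;

    std::thread producer([&] {
        data = value;                                   // store 1
        ready.store(true, std::memory_order_release);   // store 2, not reordered before store 1
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {  // load 1
            std::this_thread::yield();
        }
        observed = data;                                  // load 2, not reordered before load 1
    });

    producer.join();
    consumer.join();
    assert(ready.load());
    return observed;
}
```

Because the two stores become visible in program order and the two loads are performed in program order, the consumer that sees the flag also sees the payload.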

8.2.3.3   Stores Are Not Reordered With Earlier Loads

The Intel-64 memory-ordering model ensures that a store by a processor may not occur before a previous load by the same processor. 

Note: for two unrelated shared variables, x86 does not allow a write to be reordered with an earlier read.

8.2.3.4   Loads May Be Reordered with Earlier Stores to Different Locations

The Intel-64 memory-ordering model allows a load to be reordered with an earlier store to a different location. However, loads are not reordered with stores to the same location.

Note: for two unrelated shared variables, x86 allows a read to be reordered with an earlier write.
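This is the one reordering that x86 permits, and the classic store-buffering litmus test makes it concrete. The sketch below uses a hypothetical helper, `count_both_zero`: each thread stores to one variable and then loads the other. Without a barrier, both loads may return 0, because each store can still be sitting in its processor's store buffer. With a full fence between the store and the load (on x86, compilers typically emit MFENCE or a locked instruction for it), that outcome is forbidden.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the store-buffering test `iterations` times and returns how many
// times both threads read 0. With use_fence == true the result must be 0.
int count_both_zero(int iterations, bool use_fence) {
    assert(iterations > 0);
    std::atomic<int> x{0}, y{0};
    int both_zero = 0;

    for (int i = 0; i < iterations; ++i) {
        x.store(0, std::memory_order_relaxed);
        y.store(0, std::memory_order_relaxed);
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            x.store(1, std::memory_order_relaxed);            // write x
            if (use_fence) std::atomic_thread_fence(std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_relaxed);           // then read y
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_relaxed);            // write y
            if (use_fence) std::atomic_thread_fence(std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_relaxed);           // then read x
        });
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

With `use_fence` set to false the count may or may not be nonzero on a given run, since thread start-up cost makes the window narrow; the sketch only guarantees the fenced case.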

8.2.3.6   Stores Are Transitively Visible

The memory-ordering model ensures transitive visibility of stores; stores that are causally related appear to all processors to occur in an order consistent with the causality relation. 

8.2.3.8   Locked Instructions Have a Total Order

The memory-ordering model ensures that all processors agree on a single execution order of all locked instructions, including those that are larger than 8 bytes or are not naturally aligned. 

8.2.3.9   Loads and Stores Are Not Reordered with Locked Instructions

The memory-ordering model prevents loads and stores from being reordered with locked instructions that execute earlier or later. The examples in this section illustrate only cases in which a locked instruction is executed before a load or a store. The reader should note that reordering is prevented also if the locked instruction is executed after a load or a store.
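Since loads and stores cannot move across a locked instruction in either direction, a locked read-modify-write is enough to build a lock. The following is a minimal sketch with hypothetical names (`SpinLock`, `locked_count`). It assumes that `std::atomic<bool>::exchange` compiles to XCHG with a memory operand on x86, which is implicitly locked; that is the usual code generation but not something the C++ standard promises.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class SpinLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        // Atomic exchange: a locked read-modify-write on x86. Accesses inside
        // the critical section cannot be reordered before it.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() {
        // A plain store suffices on x86, because stores are not reordered
        // with earlier loads or stores.
        locked_.store(false, std::memory_order_release);
    }
};

// Increments a plain (non-atomic) counter from several threads under the lock.
long locked_count(int threads, int per_thread) {
    assert(threads > 0 && per_thread >= 0);
    SpinLock guard;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) {
                guard.lock();
                ++counter;      // protected by the lock
                guard.unlock();
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

If the lock works as intended, no increment is lost, so `locked_count(4, 10000)` returns 40000.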

22.34  STORE BUFFERS AND MEMORY ORDERING

The Pentium 4, Intel Xeon, and P6 family processors provide a store buffer for temporary storage of writes (stores) to memory (see Section 11.10, “Store Buffer”). Writes stored in the store buffer(s) are always written to memory in program order, with the exception of “fast string” store operations (see Section 8.2.4, “Fast-String Operation and Out-of-Order Stores”).

The Pentium processor has two store buffers, one corresponding to each of the pipelines. Writes in these buffers are always written to memory in the order they were generated by the processor core.

It should be noted that only memory writes are buffered and I/O writes are not. The Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors do not synchronize the completion of memory writes on the bus and instruction execution after a write. An I/O, locked, or serializing instruction needs to be executed to synchronize writes with the next instruction (see Section 8.3, “Serializing Instructions”).

The Pentium 4, Intel Xeon, and P6 family processors use processor ordering to maintain consistency in the order that data is read (loaded) and written (stored) in a program and the order the processor actually carries out the reads and writes. With this type of ordering, reads can be carried out speculatively and in any order, reads can pass buffered writes, and writes to memory are always carried out in program order. (See Section 8.2, “Memory Ordering,” for more information about processor ordering.) The Pentium III processor introduced a new instruction to serialize writes and make them globally visible. Memory ordering issues can arise between a producer and a consumer of data. The SFENCE instruction provides a performance-efficient way of ensuring ordering between routines that produce weakly-ordered results and routines that consume this data.
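The producer/consumer case mentioned above can be sketched with non-temporal stores. This is an illustration under assumptions: the function name `stream_and_publish` is hypothetical, and the x86 path relies on the `_mm_stream_si32` and `_mm_sfence` intrinsics from `<immintrin.h>`, with a plain-store fallback on other targets. Non-temporal stores are weakly ordered, so the producer issues SFENCE before setting the flag that tells the consumer the data is ready.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define HAVE_X86_NT_STORES 1
#endif

// The producer fills a buffer with streaming stores, fences, then publishes.
// Returns the last buffer element as seen by the consumer.
int stream_and_publish(int value) {
    constexpr int kCount = 16;
    alignas(64) int buffer[kCount] = {};
    std::atomic<bool> ready{false};
    int observed = -1;

    std::thread producer([&] {
#ifdef HAVE_X86_NT_STORES
        for (int i = 0; i < kCount; ++i) {
            _mm_stream_si32(&buffer[i], value);  // weakly ordered store (MOVNTI)
        }
        _mm_sfence();  // make the streaming stores globally visible first
#else
        for (int i = 0; i < kCount; ++i) buffer[i] = value;  // fallback path
#endif
        ready.store(true, std::memory_order_release);
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        observed = buffer[kCount - 1];
    });

    producer.join();
    consumer.join();
    assert(ready.load());
    return observed;
}
```

Without the SFENCE, the flag store could become visible before the streaming stores, which is exactly the exception to write ordering listed in Section 8.2.2.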

No re-ordering of reads occurs on the Pentium processor, except under the condition noted in Section 8.2.1, “Memory Ordering in the Intel® Pentium® and Intel486™ Processors,” and in the following paragraph describing the Intel486 processor. 

Specifically, the store buffers are flushed before the IN instruction is executed. No reads (as a result of cache miss) are reordered around previously generated writes sitting in the store buffers. The implication of this is that the store buffers will be flushed or emptied before a subsequent bus cycle is run on the external bus.

On both the Intel486 and Pentium processors, under certain conditions, a memory read will go onto the external bus before the pending memory writes in the buffer even though the writes occurred earlier in the program execution. A memory read will only be reordered in front of all writes pending in the buffers if all writes pending in the buffers are cache hits and the read is a cache miss. Under these conditions, the Intel486 and Pentium processors will not read from an external memory location that needs to be updated by one of the pending writes. 

During a locked bus cycle, the Intel486 processor will always access external memory, it will never look for the location in the on-chip cache. All data pending in the Intel486 processor's store buffers will be written to memory before a locked cycle is allowed to proceed to the external bus. Thus, the locked bus cycle can be used for eliminating the possibility of reordering read cycles on the Intel486 processor. The Pentium processor does check its cache on a read-modify-write access and, if the cache line has been modified, writes the contents back to memory before locking the bus. The P6 family processors write to their cache on a read-modify-write operation (if the access does not split across a cache line) and does not write back to system memory. If the access does split across a cache line, it locks the bus and accesses system memory.

I/O reads are never reordered in front of buffered memory writes on an IA-32 processor. This ensures an update of all memory locations before reading the status from an I/O device.

22.35  BUS LOCKING

The Intel 286 processor performs the bus locking differently than the Intel P6 family, Pentium, Intel486, and Intel386 processors. Programs that use forms of memory locking specific to the Intel 286 processor may not run properly when run on later processors.

A locked instruction is guaranteed to lock only the area of memory defined by the destination operand, but may lock a larger memory area. For example, typical 8086 and Intel 286 configurations lock the entire physical memory space. Programmers should not depend on this.

On the Intel 286 processor, the LOCK prefix is sensitive to IOPL. If the CPL is greater than the IOPL, a general-protection exception (#GP) is generated. On the Intel386 DX, Intel486, and Pentium, and P6 family processors, no check against IOPL is performed.

The Pentium processor automatically asserts the LOCK# signal when acknowledging external interrupts. After signaling an interrupt request, an external interrupt controller may use the data bus to send the interrupt vector to the processor. After receiving the interrupt request signal, the processor asserts LOCK# to insure that no other data appears on the data bus until the interrupt vector is received. This bus locking does not occur on the P6 family processors.
