
17. Out of order execution (PPro, PII and PIII)

--------------------------------------------------------------------------------
http://www.XiaoHui.com   Date: 2000-04-01 14:00

The reorder buffer (ROB) can hold 40 uops. Each uop waits in the ROB until all its operands are ready and there is a vacant execution unit for it. This makes out-of-order execution possible. If one part of the code is delayed because of a cache miss then it won't delay later parts of the code if they are independent of the delayed operations.

Writes to memory cannot execute out of order relative to other writes. There are four write buffers, so if you expect many cache misses on writes, or you are writing to uncached memory, then it is recommended that you schedule four writes at a time and make sure the processor has something else to do before you give it the next four writes. Memory reads and other instructions can execute out of order, except IN, OUT and serializing instructions.
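The recommended scheduling can be sketched as follows (the register names, addresses and intervening instructions are hypothetical, chosen only to illustrate the pattern):

    MOV [EDI], EAX      ; write no. 1, occupies one write buffer
    MOV [EDI+4], EBX    ; write no. 2
    MOV [EDI+8], ECX    ; write no. 3
    MOV [EDI+12], EDX   ; write no. 4, all four write buffers now in use
    ADD ESI, EBP        ; independent work while the buffers drain
    IMUL EBP, 3
    ; ... next group of four writes goes here

The independent instructions between the groups give the four write buffers time to empty before the next writes need them.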

If your code writes to a memory address and soon after reads from the same address, then the read may by mistake be executed before the write because the ROB doesn't know the memory addresses at the time of reordering. This error is detected when the write address is calculated, and then the read operation (which was executed speculatively) has to be re-done. The penalty for this is approximately 3 clocks. The only way to avoid this penalty is to make sure the execution unit has other things to do between a write and a subsequent read from the same memory address.
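A sketch of the problem and its remedy (the registers and the filler instructions are chosen for illustration only):

    MOV [ESI], EAX      ; write
    MOV EBX, [ESI]      ; read from the same address; may be executed
                        ; speculatively before the write address is known,
                        ; and must then be re-done at a cost of ca. 3 clocks

    ; better: give the execution unit something else to do in between
    MOV [ESI], EAX      ; write
    ADD ECX, EDX        ; independent instructions fill the gap
    SHL EDI, 2
    MOV EBX, [ESI]      ; read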

There are several execution units clustered around five ports. Port 0 and 1 are for arithmetic operations etc. Simple move, arithmetic and logic operations can go to either port 0 or 1, whichever is vacant first. Port 0 also handles multiplication, division, integer shifts and rotates, and floating point operations. Port 1 also handles jumps and some MMX and XMM operations. Port 2 handles all reads from memory and a few string and XMM operations, port 3 calculates addresses for memory writes, and port 4 executes all memory write operations. In chapter 29 you'll find a complete list of the uops generated by code instructions with an indication of which ports they go to. Note that all memory write operations require two uops, one for port 3 and one for port 4, while memory read operations use only one uop (port 2).

In most cases each port can receive one new uop per clock cycle. This means that you can execute up to 5 uops in the same clock cycle if they go to five different ports, but since there is a limit of 3 uops per clock earlier in the pipeline you will never execute more than 3 uops per clock on average.

You must make sure that no execution port receives more than one third of the uops if you want to maintain a throughput of 3 uops per clock. Use the table of uops in chapter 29 and count how many uops go to each port. If port 0 and 1 are saturated while port 2 is free then you can improve your code by replacing some MOV register,register or MOV register,immediate instructions with MOV register,memory in order to move some of the load from port 0 and 1 to port 2.
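For example, a load can be moved from port 0 and 1 to port 2 like this (the label TEN is a hypothetical memory location holding the constant):

    MOV EAX, 10        ; immediate move: one uop for port 0 or 1
    ; can be replaced by
    MOV EAX, [TEN]     ; memory move: one uop for port 2, assuming TEN
                       ; is a dword in memory containing the value 10

This only pays off when port 0 and 1 are the bottleneck and port 2 has spare capacity; otherwise the memory operand gains nothing.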

Most uops take only one clock cycle to execute, but multiplications, divisions, and many floating point operations take more:

Floating point addition and subtraction takes 3 clocks, but the execution unit is fully pipelined so that it can receive a new FADD or FSUB in every clock cycle before the preceding ones are finished (provided, of course, that they are independent).
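A sketch of how the pipelined addition unit can be kept busy with independent additions (the memory operands A through D are hypothetical):

    FLD [A]
    FADD [B]      ; addition no. 1 starts
    FLD [C]
    FADD [D]      ; addition no. 2 is independent of no. 1 and can start
                  ; before no. 1 is finished, thanks to the pipelining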

Integer multiplication takes 4 clocks, floating point multiplication 5, and MMX multiplication 3 clocks. Integer and MMX multiplication is pipelined so that it can receive a new instruction every clock cycle. Floating point multiplication is partially pipelined: The execution unit can receive a new FMUL instruction two clocks after the preceding one, so that the maximum throughput is one FMUL per two clock cycles. The holes between the FMUL's cannot be filled by integer multiplications because they use the same circuitry. XMM additions and multiplications take 3 and 4 clocks respectively, and are fully pipelined. But since each logical XMM register is implemented as two physical 64-bit registers, you need two uops for a packed XMM operation, and the throughput will then be one arithmetic XMM instruction every two clock cycles. XMM add and multiply instructions can execute in parallel because they don't use the same execution port.
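The two-clock holes between FMULs can, however, be filled with floating point additions, since FADD uses a different pipeline. A sketch with independent operand chains (the memory operands and register juggling are illustrative):

    FLD [A]
    FMUL [B]      ; multiplication no. 1 starts
    FLD [C]
    FADD [D]      ; independent addition fills the hole after the FMUL
    FXCH          ; bring the first product back to the top of the stack
    FMUL [E]      ; multiplication no. 2, two clocks after the first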

Integer and floating point division takes up to 39 clocks and is not pipelined. This means that the execution unit cannot begin a new division until the previous division is finished. The same applies to square roots and transcendental functions.
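Because the divider is not pipelined, a chain of divisions by the same value serializes badly. Where precision requirements allow, a common workaround (a sketch, not a recommendation from this text) is to compute the reciprocal once and then use the fully pipelined multiplier:

    FLD1
    FDIV [DIVISOR]   ; 1/divisor, up to 39 clocks, but done only once
    FLD [X1]
    FMUL ST, ST(1)   ; X1/divisor computed as a pipelined multiplication
    FLD [X2]
    FMUL ST, ST(2)   ; X2/divisor, can start while the first FMUL runs

Note that multiplying by the reciprocal may differ from a direct FDIV in the last bit of the result.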

Jump instructions, calls, and returns are also not fully pipelined. You cannot execute a new jump in the first clock cycle after a preceding jump, so the maximum throughput for jumps, calls, and returns is one per two clocks.

You should, of course, avoid instructions that generate many uops. The LOOP XX instruction, for example, should be replaced by DEC ECX / JNZ XX.

If you have consecutive POP instructions then you may break them up to reduce the number of uops:

POP ECX / POP EBX / POP EAX ; can be changed to: MOV ECX,[ESP] / MOV EBX,[ESP+4] / MOV EAX,[ESP+8] / ADD ESP,12


The former code generates 6 uops, the latter generates only 4 and decodes faster. Doing the same with PUSH instructions is less advantageous because the split-up code is likely to generate register read stalls unless you have other instructions to put in between or the registers have been renamed recently. Doing it with CALL and RET instructions will interfere with prediction in the return stack buffer. Note also that the ADD ESP instruction can cause an AGI stall in earlier processors.
http://www.xiaohui.com/dev/mmx/mmx_p_17.htm