Avoiding the Cost of Branch Misprediction

This article examines how modern processors gain performance through speculative execution and discusses the performance penalty incurred when a branch is mispredicted. It presents several techniques for eliminating branch instructions, including hoisting invariant branches out of loops, replacing comparisons with SIMD logical operations, and using compiler intrinsics, in order to reduce the processor stalls caused by branch misprediction.

Introduction

by Rajiv Kapoor

Today's modern processors derive much of their performance by executing instructions in parallel and ahead of when their results are actually needed. When a sequential block of code is executed, the processor has full visibility of the next set of instructions it needs to execute, and after evaluating their interdependencies, it executes them. However, when a comparison instruction and a branch based on its result are encountered, the processor does not know which set of instructions to execute after the branch until it knows the result of the comparison instruction. The next set of instructions could be either the ones immediately following the branch target address, if the branch is taken, or the ones following the branch instruction, if the branch is not taken. This causes a stall in the processor's execution pipeline. To avoid such stalls, the processor makes an educated "guess" of the result of the comparison and executes the instructions that it thinks need to come next. These predictions are based on the past history of that particular branch instruction, which is maintained in the processor's branch history buffer. This article discusses problems encountered as a consequence of branch mispredictions and offers the reader approaches for avoiding the performance costs that come with these events.


The Problem With Branches

Predicting the branch based on history results in a correct guess only if the branch has been visited before and behaves in some sort of a consistent pattern. A typical case of unpredictable branch behavior is when the comparison result is dependent on data. When the prediction made by the processor is incorrect, all the speculatively executed instructions are discarded as soon as the correct branch is determined, and the processor execution pipeline restarts with instructions from the correct branch destination. This halt in productive work while the new instructions work their way down the execution pipeline causes a processor stall, which is a major drain on performance.

No Branches Means No Mispredicts

One of the most effective techniques of reducing processor stalls due to mispredicted branches is to eliminate the branches altogether. Removing the branches not only improves runtime performance of the code, it also helps the compiler to optimize the code. When branch removal is not possible, the branch can be coded to help the processor predict it correctly based on the static branch prediction rules. The references below provide more information on static branch prediction rules.
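
For example, one simple way to cooperate with the static rules is to write the frequent case as the fall-through path of a forward conditional branch. The following is a minimal sketch, not taken from the article: it assumes the common static rule that forward conditional branches are predicted not-taken, and the function sum_valid() is a hypothetical example used purely for illustration.

#include <stdio.h>

/* Sketch only: assumes forward conditional branches are statically predicted
 * not-taken, so the frequent case is the fall-through path and the rare case
 * is the taken, forward branch. */
static long sum_valid(const int *data, int size)
{
  long total = 0;
  int i;

  for (i = 0; i < size; i++)
  {
    if (data[i] < 0)   // rare case: forward branch, statically not-taken
      continue;        // cold path
    total += data[i];  // common case falls through
  }
  return total;
}

int main(void)
{
  int data[4] = { 3, 7, -1, 5 };
  printf("%ld\n", sum_valid(data, 4));
  return 0;
}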

Which Branches To Remove

Branches are any decision points in a program. Branches occur at every if and case statement, in every for and do-while loop, and at every goto statement, among others. Some of these, such as the goto statement, are unconditional, are always predicted correctly, and need no optimization. Others, especially branches that rely on data values, can be very difficult to predict. Such data-dependent branches are the best ones to target for removal.

Removing all branches may not be very useful. To make the right decision, consider the correlation between the execution time spent in the code block containing the branches and the rate of misprediction of those branches. Use a tool such as the Intel® VTune™ Performance Analyzer to measure two metrics for the code block: the percentage of execution time spent in it, and the ratio of branches mispredicted to branches retired (also known as the branch mispredict ratio). As a rule of thumb, remove or optimize a branch if the branch mispredict ratio is greater than about 8% and the location of the branch is a hotspot in the execution profile relative to the rest of the code. It is important to consider both metrics together because a branch location may be an execution hotspot simply because it is executed frequently, even though it is never mispredicted.
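
As a quick illustration of this rule of thumb, the following sketch computes the mispredict ratio from the two event counts; the counts here are assumed sample values, while in practice they would be read from a profiler such as the VTune analyzer rather than hard-coded.

#include <stdio.h>

/* Hypothetical illustration of the 8% rule of thumb; the two event counts
 * are assumed sample values, not measurements. */
int main(void)
{
  double branches_retired      = 1.0e9;
  double branches_mispredicted = 9.5e7;
  double mispredict_ratio      = branches_mispredicted / branches_retired;

  printf("branch mispredict ratio = %.1f%%\n", 100.0 * mispredict_ratio);
  if (mispredict_ratio > 0.08)
    printf("candidate for removal, if its location is also a hotspot\n");
  return 0;
}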


Branch Elimination By Example

Let's consider some simple synthetic code snippets to illustrate several techniques of branch removal. The following function shows three code segments, commented as segments 1, 2, and 3. These code segments will be used to illustrate different ways of removing branches in the following sections. To simplify the example, it is assumed that size is a multiple of 4 and that all pointer input parameters to the function are 16-byte aligned.

 
void MyFunc ( int size, int blend, float *src, float *dest, float *src_1, ... )
{
  int j;

  // Code segment 1
  for (j = 0; j < size; j++ )
  {
    if ( blend == 255 ) // invariant branch
      dest[j] = src_1[j];
    else if ( blend == 0 )
      dest[j] = src_2[j];
    else
      dest[j] = ( src_1[j] * blend + src_2[j] * (255-blend) ) / 256;
  }

  // Code segment 2
  for (j = 0; j < size; j++ ) // assume (size % 4) = 0
  {
    if ( src[j] > dest[j] )
    {
      minVal[j] = dest[j];
      maxVal[j] = src[j];
    }
    else
    {
      maxVal[j] = dest[j];
      minVal[j] = src[j];
    }
  }

  // Code segment 3
  for (j = 0; j < size; j++) // assume (size % 4) = 0
    if (src[j] < 0.0)
      src[j] = -1.0 * src[j];
}

 

Code Segment 1

This segment shows a simple case where a busy loop contains invariant branches. This is the kind of loop all C and C++ programmers have written hundreds, if not thousands, of times. Compare it with this segment:

 
void MyFunc ( int size, int blend, float *src, float *dest, float *src_1, ... )
{
  int j;

  // Code segment 1
  if ( blend == 255 ) // invariant branch before loop
    for ( j = 0; j < size; j++ )
      dest[j] = src_1[j];
  else if ( blend == 0 ) // invariant branch before loop
    for ( j = 0; j < size; j++ )
      dest[j] = src_2[j];
  else
    for ( j = 0; j < size; j++ )
      dest[j] = ( src_1[j] * blend + src_2[j] * (255-blend) ) / 256;

  .........
}

 

The same operations are still performed, but none of the loops contain the invariant blend comparison; the only branch left inside each loop is the loop-termination branch, which is predicted correctly on nearly every iteration. This means the loops can be executed by the processor without the performance loss caused by mispredicted branches. This change alone will significantly increase the performance of these loops.

Code Segment 2

This segment shows code that calculates the minimum and maximum values from two arrays. By using Single Instruction Multiple Data (SIMD) operations, this kind of branchy code can be easily converted into logical operations. One way to do this is to use the mask-generating compare instructions in the Streaming SIMD Extensions (SSE) instruction set offered on the Pentium® III and later processors. Here is an implementation using the C intrinsics for the SSE instructions, which are supported by the Intel and Microsoft C/C++ compilers.

 
void MyFunc ( int size, int blend, float *src, float *dest, float *src_1, ... )
{
  int j;
  .........

  // Code segment 2
  __m128 src4, dest4, min4, max4, mask4, max4_s, max4_d, min4_s, min4_d;

  for (j = 0; j < size/4; j++, src+=4, dest+=4, minVal+=4, maxVal+=4)
  {
    src4   = _mm_load_ps(src);            // load the 4 src values
    dest4  = _mm_load_ps(dest);           // load the 4 dest values
    mask4  = _mm_cmpgt_ps(src4, dest4);   // Fs if (src > dest) else 0s
    max4_s = _mm_and_ps(mask4, src4);     // all max vals from src set
    min4_d = _mm_and_ps(mask4, dest4);    // all min vals from dest set
    max4_d = _mm_andnot_ps(mask4, dest4); // all max vals from dest set
    min4_s = _mm_andnot_ps(mask4, src4);  // all min vals from src set
    max4   = _mm_or_ps(max4_s, max4_d);   // all max values set
    min4   = _mm_or_ps(min4_s, min4_d);   // all min values set
    _mm_store_ps(minVal, min4);           // store the 4 mins
    _mm_store_ps(maxVal, max4);           // store the 4 maxs
  }
  .........
}

 

This can also be implemented more efficiently by using the SIMD min and max instructions provided by SSE. The following code shows the implementation using the _mm_min_ps() and _mm_max_ps() intrinsics.

 
void MyFunc ( int size, int blend, float *src, float *dest, float *src_1, ... )
{
  int j;
  .........

  // Code segment 2
  __m128 src4, dest4, min4, max4;

  for (j = 0; j < size/4; j++, src+=4, dest+=4, minVal+=4, maxVal+=4)
  {
    src4  = _mm_load_ps(src);        // load the 4 src values
    dest4 = _mm_load_ps(dest);       // load the 4 dest values
    min4  = _mm_min_ps(src4, dest4); // calculate the 4 mins
    max4  = _mm_max_ps(src4, dest4); // calculate the 4 maxs
    _mm_store_ps(minVal, min4);      // store the 4 mins
    _mm_store_ps(maxVal, max4);      // store the 4 maxs
  }
  .........
}

 

Code Segment 3

This piece of code calculates the absolute values of the elements stored in the array src. The most obvious way to remove the branch here is to use the fabs() intrinsic:

 
{
  .........
  // Code segment 3
  for (j = 0; j < size; j++)
    src[j] = (float) fabs((double)src[j]);
}

 

Using SSE, the code can be optimized further to process four values at a time, which also reduces the branches due to loop execution, since the loop count is now a fourth of the original. Here is an implementation that exploits the fact that calculating the absolute value of a single-precision floating-point number is just a matter of clearing bit 31 of its representation.

 
{
  .........
  // Code segment 3
  __m128 src4;
  __declspec(align(16)) int temp[4] =
        {0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF};
  __m128 const_sign_bit_0 = _mm_load_ps((float *) temp); // mask with bit 31 clear

  for (j = 0; j < size/4; j++, src+=4)
  {
    src4 = _mm_load_ps(src);                   // load the 4 src values
    src4 = _mm_and_ps(src4, const_sign_bit_0); // clear bit 31 (the sign bit)
    _mm_store_ps(src, src4);                   // store the 4 values back
  }
}

 


Conclusion

The Occasional Cost Of Branch Removal

Branch removal may not always be free. Consider the optimized versions of code segment 3 shown in the previous section. In the optimized versions, all src values are processed as opposed to the original code, which only processes the negative values. If the values are non-negative in a majority of the cases, then we will be doing more processing in the optimized code compared to the original code. Furthermore, when using SSE2 or SSE instructions, one must be aware of the time spent in "setup" code as opposed to the real intended operation. Setup code could be loading and/or rearranging of data to make it amenable to SIMD operations that follow the setup. Always try to concatenate as many SIMD operations as possible to amortize the fixed cost of setup.
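
As one way to picture this amortization, the following sketch, which is not from the article, fuses an absolute-value step and a max step into a single SSE pass so that each aligned load is shared by both operations. The function abs_then_max() and its parameters are assumptions made purely for illustration, and the same alignment and size assumptions as in the earlier examples apply.

#include <xmmintrin.h>

/* Sketch only: fuses |src| and max(|src|, dest) into one pass so the loads
 * are amortized over both SIMD operations. Assumes size is a multiple of 4,
 * 16-byte aligned pointers, and a compiler that accepts __declspec(align()). */
void abs_then_max(float *src, float *dest, float *maxVal, int size)
{
  __declspec(align(16)) int temp[4] =
      { 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF };
  __m128 clear_sign = _mm_load_ps((float *)temp); // mask with bit 31 clear
  __m128 src4, abs4, max4;
  int j;

  for (j = 0; j < size/4; j++, src+=4, dest+=4, maxVal+=4)
  {
    src4 = _mm_load_ps(src);                    // one load feeds both steps
    abs4 = _mm_and_ps(src4, clear_sign);        // clear bit 31: |src|
    max4 = _mm_max_ps(abs4, _mm_load_ps(dest)); // max(|src|, dest)
    _mm_store_ps(maxVal, max4);                 // one store per iteration
  }
}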

Resources

For more information on this topic, please see the Optimization manual for the Pentium® 4 and Intel® Xeon® processors, as well as the processor manuals, which contain significant helpful information. They can be downloaded at: http://developer.intel.com/products/processor/manuals/index.htm

Also see Richard Gerber's book, The Software Optimization Cookbook, Intel Press, 2002, which inspired the code in the first example.
