Avoiding the Cost of Branch Misprediction

This article examines how modern processors gain performance through speculative execution and discusses the performance penalty incurred when a branch is mispredicted. It presents several techniques for eliminating branch instructions, including hoisting invariant branches out of loops, replacing comparisons with SIMD logical operations, and using compiler intrinsics, in order to reduce the processor stalls caused by branch misprediction.

Introduction

by Rajiv Kapoor

Today's modern processors derive much of their performance by executing instructions in parallel and ahead of when their results are actually needed. When a sequential block of code is executed, the processor has full visibility of the next set of instructions it needs to execute, and after evaluating their interdependencies, it executes them. However, when a comparison instruction and a branch based on its result are encountered, the processor does not know which set of instructions to execute after the branch until it knows the result of the comparison instruction. The next set of instructions could be either the ones immediately following the branch target address, if the branch is taken, or the ones following the branch instruction, if the branch is not taken. This causes a stall in the processor's execution pipeline. To avoid such stalls, the processor makes an educated "guess" of the result of the comparison and executes the instructions that it thinks need to come next. These predictions are based on the past history of that particular branch instruction, which is maintained in the processor's branch history buffer. This article discusses problems encountered as a consequence of branch mispredictions and offers the reader approaches for avoiding the performance costs that come with these events.


The Problem With Branches

Predicting the branch based on history results in a correct guess only if the branch has been visited before and behaves in some sort of a consistent pattern. A typical case of unpredictable branch behavior is when the comparison result is dependent on data. When the prediction made by the processor is incorrect, all the speculatively executed instructions are discarded as soon as the correct branch is determined, and the processor execution pipeline restarts with instructions from the correct branch destination. This halt in productive work while the new instructions work their way down the execution pipeline causes a processor stall, which is a major drain on performance.

No Branches Means No Mispredicts

One of the most effective techniques of reducing processor stalls due to mispredicted branches is to eliminate the branches altogether. Removing the branches not only improves runtime performance of the code, it also helps the compiler to optimize the code. When branch removal is not possible, the branch can be coded to help the processor predict it correctly based on the static branch prediction rules. The references below provide more information on static branch prediction rules.
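
For example, one simple way to cooperate with the static rules is to write the frequent case as the fall-through path of a forward conditional branch. The following is a minimal sketch, not taken from the article: it assumes the common static rule that forward conditional branches are predicted not-taken, and the function sum_valid() is a hypothetical example used purely for illustration.

#include <stdio.h>

/* Sketch only: assumes forward conditional branches are statically predicted
 * not-taken, so the frequent case is the fall-through path and the rare case
 * is the taken, forward branch. */
static long sum_valid(const int *data, int size)
{
  long total = 0;
  int i;

  for (i = 0; i < size; i++)
  {
    if (data[i] < 0)   // rare case: forward branch, statically not-taken
      continue;        // cold path
    total += data[i];  // common case falls through
  }
  return total;
}

int main(void)
{
  int data[4] = { 3, 7, -1, 5 };
  printf("%ld\n", sum_valid(data, 4));
  return 0;
}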

Which Branches To Remove

Branches are any decision points in a program. Branches occur at every if and case statement, in every for and do-while loop, and at every goto statement, among others. Some of these, such as the goto statement, are unconditional, are always predicted correctly, and need no optimization. Others, especially branches that rely on data values, can be very difficult to predict. Such data-dependent branches are the best ones to target for removal.

Removing all branches may not be very useful. To make the right decision, consider the correlation between the execution time spent in the code block containing the branches and the rate of misprediction of those branches. Use a tool such as the Intel® VTune™ Performance Analyzer to measure two metrics for the code block: the percentage of execution time spent in it, and the ratio of branches mispredicted to branches retired (also known as the branch mispredict ratio). As a rule of thumb, remove or optimize a branch if the branch mispredict ratio is greater than about 8% and the location of the branch is a hotspot in the execution profile relative to the rest of the code. It is important to consider both metrics together because a branch location may be an execution hotspot simply because it is executed frequently, even though it is never mispredicted.
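
As a quick illustration of this rule of thumb, the following sketch computes the mispredict ratio from the two event counts; the counts here are assumed sample values, while in practice they would be read from a profiler such as the VTune analyzer rather than hard-coded.

#include <stdio.h>

/* Hypothetical illustration of the 8% rule of thumb; the two event counts
 * are assumed sample values, not measurements. */
int main(void)
{
  double branches_retired      = 1.0e9;
  double branches_mispredicted = 9.5e7;
  double mispredict_ratio      = branches_mispredicted / branches_retired;

  printf("branch mispredict ratio = %.1f%%\n", 100.0 * mispredict_ratio);
  if (mispredict_ratio > 0.08)
    printf("candidate for removal, if its location is also a hotspot\n");
  return 0;
}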


Branch Elimination By Example

Let's consider some simple synthetic code snippets to illustrate several techniques of branch removal. The following function shows three code segments, commented as segments 1, 2, and 3. These code segments will be used to illustrate different ways of removing branches in the following sections. To simplify the example, it is assumed that size is a multiple of 4 and that all pointer input parameters to the function are 16-byte aligned.

 
void MyFunc ( int size, int blend, float *src, float *dest, float *src_1, ... )
{
  int j;

  // Code segment 1
  for (j = 0; j < size; j++ )
  {
    if ( blend == 255 ) // invariant branch
      dest[j] = src_1[j];
    else if ( blend == 0 )
      dest[j] = src_2[j];
    else
      dest[j] = ( src_1[j] * blend + src_2[j] * (255-blend) ) / 256;
  }

  // Code segment 2
  for (j = 0; j < size; j++ ) // assume (size % 4) = 0
  {
    if ( src[j] > dest[j] )
    {
      minVal[j] = dest[j];
      maxVal[j] = src[j];
    }
    else
    {
      maxVal[j] = dest[j];
      minVal[j] = src[j];
    }
  }

  // Code segment 3
  for (j = 0; j < size; j++) // assume (size % 4) = 0
    if (src[j] < 0.0)
      src[j] = -1.0 * src[j];
}

 

Code Segment 1

This segment shows a simple case where a busy loop contains invariant branches. This is the kind of loop all C and C++ programmers have written hundreds, if not thousands, of times. Compare it with this segment:

 
void MyFunc ( int size, int blend, float *src, float *dest, float *src_1, ... )
{
  int j;

  // Code segment 1
  if ( blend == 255 ) // invariant branch before loop
    for ( j = 0; j < size; j++ )
      dest[j] = src_1[j];
  else if ( blend == 0 ) // invariant branch before loop
    for ( j = 0; j < size; j++ )
      dest[j] = src_2[j];
  else
    for ( j = 0; j < size; j++ )
      dest[j] = ( src_1[j] * blend + src_2[j] * (255-blend) ) / 256;

  .........
}

 

The same operations are still performed, but none of the loops contain the invariant blend comparison; the only branch left inside each loop is the loop-termination branch, which is predicted correctly on nearly every iteration. This means the loops can be executed by the processor without the performance loss caused by mispredicted branches. This change alone will significantly increase the performance of these loops.

Code Segment 2

This segment shows code that calculates the minimum and maximum values from two arrays. By using Single Instruction Multiple Data (SIMD) operations, this kind of branchy code can be easily converted into logical operations. One way to do this is to use the mask-generating compare instructions in the Streaming SIMD Extensions (SSE) instruction set offered on the Pentium® III and later processors. Here is an implementation using the C intrinsics for the SSE instructions, which are supported by the Intel and Microsoft C/C++ compilers.

 
void MyFunc ( int size, int blend, float *src, float *dest, float *src_1, ... )
{
  int j;
  .........

  // Code segment 2
  __m128 src4, dest4, min4, max4, mask4, max4_s, max4_d, min4_s, min4_d;

  for (j = 0; j < size/4; j++, src+=4, dest+=4, minVal+=4, maxVal+=4)
  {
    src4   = _mm_load_ps(src);            // load the 4 src values
    dest4  = _mm_load_ps(dest);           // load the 4 dest values
    mask4  = _mm_cmpgt_ps(src4, dest4);   // Fs if (src > dest) else 0s
    max4_s = _mm_and_ps(mask4, src4);     // all max vals from src set
    min4_d = _mm_and_ps(mask4, dest4);    // all min vals from dest set
    max4_d = _mm_andnot_ps(mask4, dest4); // all max vals from dest set
    min4_s = _mm_andnot_ps(mask4, src4);  // all min vals from src set
    max4   = _mm_or_ps(max4_s, max4_d);   // all max values set
    min4   = _mm_or_ps(min4_s, min4_d);   // all min values set
    _mm_store_ps(minVal, min4);           // store the 4 mins
    _mm_store_ps(maxVal, max4);           // store the 4 maxs
  }
  .........
}

 

This can also be implemented more efficiently by using the SIMD min and max instructions provided by SSE. The following code shows the implementation using the _mm_min_ps() and _mm_max_ps() intrinsics.

 
void MyFunc ( int size, int blend, float *src, float *dest, float *src_1, ... )
{
  int j;
  .........

  // Code segment 2
  __m128 src4, dest4, min4, max4;

  for (j = 0; j < size/4; j++, src+=4, dest+=4, minVal+=4, maxVal+=4)
  {
    src4  = _mm_load_ps(src);        // load the 4 src values
    dest4 = _mm_load_ps(dest);       // load the 4 dest values
    min4  = _mm_min_ps(src4, dest4); // calculate the 4 mins
    max4  = _mm_max_ps(src4, dest4); // calculate the 4 maxs
    _mm_store_ps(minVal, min4);      // store the 4 mins
    _mm_store_ps(maxVal, max4);      // store the 4 maxs
  }
  .........
}

 

Code Segment 3

This piece of code calculates the absolute values of the elements stored in the array src. The most obvious way to remove the branch here is to use the fabs() intrinsic:

 
{
  .........
  // Code segment 3
  for (j = 0; j < size; j++)
    src[j] = (float) fabs((double)src[j]);
}

 

Using SSE, the code can be optimized further to process four values at a time, which also reduces the branches due to loop execution, since the loop count is now a fourth of the original. Here is an implementation that exploits the fact that calculating the absolute value of a single-precision floating-point number is just a matter of clearing bit 31 of its representation.

 
{
  .........
  // Code segment 3
  __m128 src4;
  __declspec(align(16)) int temp[4] =
        {0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF};
  __m128 const_sign_bit_0 = _mm_load_ps((float *) temp); // mask with bit 31 clear

  for (j = 0; j < size/4; j++, src+=4)
  {
    src4 = _mm_load_ps(src);                   // load the 4 src values
    src4 = _mm_and_ps(src4, const_sign_bit_0); // clear bit 31 (the sign bit)
    _mm_store_ps(src, src4);                   // store the 4 values back
  }
}

 


Conclusion

The Occasional Cost Of Branch Removal

Branch removal may not always be free. Consider the optimized versions of code segment 3 shown in the previous section. In the optimized versions, all src values are processed as opposed to the original code, which only processes the negative values. If the values are non-negative in a majority of the cases, then we will be doing more processing in the optimized code compared to the original code. Furthermore, when using SSE2 or SSE instructions, one must be aware of the time spent in "setup" code as opposed to the real intended operation. Setup code could be loading and/or rearranging of data to make it amenable to SIMD operations that follow the setup. Always try to concatenate as many SIMD operations as possible to amortize the fixed cost of setup.
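
As one way to picture this amortization, the following sketch, which is not from the article, fuses an absolute-value step and a max step into a single SSE pass so that each aligned load is shared by both operations. The function abs_then_max() and its parameters are assumptions made purely for illustration, and the same alignment and size assumptions as in the earlier examples apply.

#include <xmmintrin.h>

/* Sketch only: fuses |src| and max(|src|, dest) into one pass so the loads
 * are amortized over both SIMD operations. Assumes size is a multiple of 4,
 * 16-byte aligned pointers, and a compiler that accepts __declspec(align()). */
void abs_then_max(float *src, float *dest, float *maxVal, int size)
{
  __declspec(align(16)) int temp[4] =
      { 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF };
  __m128 clear_sign = _mm_load_ps((float *)temp); // mask with bit 31 clear
  __m128 src4, abs4, max4;
  int j;

  for (j = 0; j < size/4; j++, src+=4, dest+=4, maxVal+=4)
  {
    src4 = _mm_load_ps(src);                    // one load feeds both steps
    abs4 = _mm_and_ps(src4, clear_sign);        // clear bit 31: |src|
    max4 = _mm_max_ps(abs4, _mm_load_ps(dest)); // max(|src|, dest)
    _mm_store_ps(maxVal, max4);                 // one store per iteration
  }
}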

Resources

For more information on this topic, please see the Optimization manual for the Pentium® 4 and Intel® Xeon® processors, as well as the processor manuals, which contain significant helpful information. They can be downloaded at: http://developer.intel.com/products/processor/manuals/index.htm

Also see Richard Gerber's book, The Software Optimization Cookbook, Intel Press, 2002, which inspired the code in the first example.
