Note: Before starting any optimization, use a profiler such as gprof (compile and link with -pg) to determine which functions consume the largest share of the run time. Note that gprof does not work well with multithreading enabled, so do the profiling on a single-threaded build.
Increasing IPC (Instructions Per Cycle)
The main idea behind increasing IPC is to expose several independent instructions that can execute together. With the -O3 optimization level, the compiler can schedule these instructions to issue in the same cycle, which increases the IPC. One common application is loop unrolling: executing several iterations of a loop at once. This only works when an iteration does not depend on the result of the previous iteration.
Example:
before optimization:
for ( int n = 0; n < n_limit; n++ ) {
double& ai = C[n];
for ( int i = 0; i < i_limit; i++ ) {
double a = A[i];
double b = B[i];
double mul = a * b;
double acc = ai + mul;
ai = acc;
}
}
after optimization:
for ( int n = 0; n < n_limit; n++ ) {
double tmp = 0.0;
int mod = i_limit % 7;
for (int i = 0 ; i < i_limit - mod - 6; i+=7){
double tmp0 = A[i] * B[i];
double tmp1 = A[i + 1] * B[i + 1];
double tmp2 = A[i + 2] * B[i + 2];
double tmp3 = A[i + 3] * B[i + 3];
double tmp4 = A[i + 4] * B[i + 4];
double tmp5 = A[i + 5] * B[i + 5];
double tmp6 = A[i + 6] * B[i + 6];
tmp += tmp0 + tmp1 + tmp2 + tmp3 + tmp4 + tmp5 + tmp6;
}
for (int i = i_limit - mod; i < i_limit; i++){
tmp += A[i] * B[i];
}
C[n] = tmp;
}
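Note that in the version above the seven products are independent, but they are all summed into the single accumulator tmp. A further refinement is to keep several independent accumulators so that the additions can also overlap. This is only a sketch, assuming the same A, B, C and i_limit as above and that reassociating the floating-point sum is acceptable for your accuracy requirements:
for ( int n = 0; n < n_limit; n++ ) {
    // four independent accumulators break the dependency chain on the additions
    double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;
    int i = 0;
    for ( ; i + 3 < i_limit; i += 4 ) {
        acc0 += A[i]     * B[i];
        acc1 += A[i + 1] * B[i + 1];
        acc2 += A[i + 2] * B[i + 2];
        acc3 += A[i + 3] * B[i + 3];
    }
    for ( ; i < i_limit; i++ ) {  // remainder iterations
        acc0 += A[i] * B[i];
    }
    C[n] = acc0 + acc1 + acc2 + acc3;
}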
Loop Reordering
This is a very effective way of optimizing software. The main idea is to make the innermost loop iterate over elements that are adjacent in memory. We do this to improve spatial locality (and, depending on the access pattern, temporal locality), which raises the cache hit rate.
Example:
before optimization:
for ( int n = 0; n < n_limit; n++ ) {
for ( int i = 0; i < i_limit; i++ ) {
sum += A[i][n];
}
}
after optimization:
for ( int i = 0; i < i_limit; i++ ) {
for ( int n = 0; n < n_limit; n++ ) {
sum += A[i][n];
}
}
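The same principle applies to denser nested loops such as matrix multiplication. The sketch below uses placeholder row-major 2D arrays A, B, C of size N x N (C assumed zero-initialized); it is not code from this project, just an illustration of the reordering:
// In the usual i-j-k order the innermost loop reads B[k][j] with a stride
// of N elements. Reordering to i-k-j makes the innermost loop stream
// through both B[k][...] and C[i][...] contiguously.
for ( int i = 0; i < N; i++ )
    for ( int k = 0; k < N; k++ )
        for ( int j = 0; j < N; j++ )
            C[i][j] += A[i][k] * B[k][j];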
Loop Tiling
This method splits a loop into blocks (tiles) whose working set fits in the cache, so that accesses within a tile keep hitting in the cache. It is only effective if the tile size is chosen correctly.
Example
before optimization:
for ( int n = 0; n < n_limit; n++ ) {
double& ai = C[n];
for ( int i = 0; i < i_limit; i++ ) {
double a = A[i];
double b = B[i];
double mul = a * b;
double acc = ai + mul;
ai = acc;
}
}
after optimization (tiling on n):
#define TILE_SIZE 4
for (int nn = 0; nn < n_limit; nn += TILE_SIZE){
for ( int n = nn; n < n_limit && n < nn + TILE_SIZE; n++ ) {
double& ai = C[n];
for ( int i = 0; i < i_limit; i++ ) {
double a = A[i];
double b = B[i];
double mul = a * b;
double acc = ai + mul;
ai = acc;
}
}
}
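The benefit of tiling is easier to see on a two-dimensional access pattern. The following is a sketch of tiling matrix multiplication, with placeholder row-major arrays A, B, C of size N x N, where N is assumed to be a multiple of TILE_SIZE to keep the sketch short:
#define TILE_SIZE 32
// Tiling j and k keeps a TILE_SIZE x TILE_SIZE block of B resident in cache
// while it is reused across the whole i loop.
for ( int jj = 0; jj < N; jj += TILE_SIZE )
    for ( int kk = 0; kk < N; kk += TILE_SIZE )
        for ( int i = 0; i < N; i++ )
            for ( int k = kk; k < kk + TILE_SIZE; k++ )
                for ( int j = jj; j < jj + TILE_SIZE; j++ )
                    C[i][j] += A[i][k] * B[k][j];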
Multithreading
Apply multithreading to a loop so that its iterations can execute in parallel. The library we use here is OpenMP (compile with -fopenmp when using gcc). Note that the number of threads should match the number of cores on your processor.
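As a reference, here is a minimal sketch of pinning the thread count to the number of available processors with the OpenMP runtime; setting the OMP_NUM_THREADS environment variable achieves the same thing without code changes:
#include <omp.h>
#include <cstdio>

int main() {
    // use one thread per available processor/core
    omp_set_num_threads( omp_get_num_procs() );

    #pragma omp parallel
    {
        #pragma omp single
        std::printf( "running with %d threads\n", omp_get_num_threads() );
    }
    return 0;
}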
Example
before optimization:
for ( int nn = 0; nn < out.size.x; nn+=BLOCK_SIZE ) {
for ( int b = 0; b < out.size.b; b++ ) {
for ( int n = nn; n < nn + BLOCK_SIZE && n < out.size.x; n++ ) {
for ( int i = 0; i < grads_out.size.x; i++ ) {
grads_out(i, 0, 0, b) += act_grad(n, 0, 0, b) * weights( i, n, 0);
}
}
}
}
after optimization:
#pragma omp parallel for
for ( int nn = 0; nn < out.size.x; nn+=BLOCK_SIZE ) {
tensor_t<double> tmp_grads_out (grads_out.size);
tmp_grads_out.clear();
for ( int b = 0; b < out.size.b; b++ ) {
for ( int n = nn; n < nn + BLOCK_SIZE && n < out.size.x; n++ ) {
for ( int i = 0; i < grads_out.size.x; i++ ) {
tmp_grads_out(i, 0, 0, b) += act_grad(n, 0, 0, b) * weights( i, n, 0);
}
}
}
#pragma omp critical
{
for ( int b = 0; b < out.size.b; b++ ) {
for ( int i = 0; i < grads_out.size.x; i++ ) {
grads_out(i, 0, 0, b) += tmp_grads_out(i, 0, 0, b);
}
}
}
}
This lets each thread accumulate its results into a local buffer. However, the per-thread results then have to be combined, which is why the critical section is needed. This combining step usually reduces the overall speedup.
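For scalar accumulations, the combining can be delegated to OpenMP's reduction clause, which avoids an explicit critical section. A minimal sketch, reusing the A, B and i_limit names from the earlier examples:
// each thread keeps a private copy of sum; OpenMP combines them at the end
double sum = 0.0;
#pragma omp parallel for reduction(+ : sum)
for ( int i = 0; i < i_limit; i++ ) {
    sum += A[i] * B[i];
}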
A better way to optimize the above loop:
for ( int nn = 0; nn < out.size.x; nn+=BLOCK_SIZE ) {
#pragma omp parallel for
for ( int b = 0; b < out.size.b; b++ ) {
for ( int n = nn; n < nn + BLOCK_SIZE && n < out.size.x; n++ ) {
for ( int i = 0; i < grads_out.size.x; i++ ) {
grads_out(i, 0, 0, b) += act_grad(n, 0, 0, b) * weights( i, n, 0);
}
}
}
}
This does not require a critical section, since each thread writes to a different portion of grads_out.
Vectorization
gcc already attempts to vectorize loops automatically when the -O3 flag is enabled, so often no extra work is needed. However, not every loop actually gets vectorized. Adding the compiler flag -fopt-info-vec-all produces feedback on every vectorization attempt, which lets us find the loops that were missed and rewrite them so they can be vectorized. Alternatively, #pragma omp simd can be placed in front of a loop to ask the compiler to vectorize it (this requires -fopenmp or -fopenmp-simd).
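For illustration, a minimal sketch of the pragma applied to a simple element-wise loop; the array names are placeholders and the arrays are assumed not to alias:
// ask the compiler to vectorize this loop (compile with -fopenmp-simd or -fopenmp)
#pragma omp simd
for ( int i = 0; i < i_limit; i++ ) {
    C[i] = A[i] * B[i];
}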
Some typical messages from the feedback:
complicated access pattern: try reordering the loops or rewriting the loop body to simplify the access pattern.
relevant stmt not supported: <stmt>: reported when, for example, a type cast or an if-else makes the loop unvectorizable; try to rewrite the logic so that those statements do not appear inside the innermost loop.
Example
Before optimization:
for (unsigned n = a + 1; n <= b - 1; n++)
{
float d_an = path_res * static_cast<float>(n - a);
float d_nb = path_res * static_cast<float>(b - n);
float h = elevation_path[n] + (d_an * d_nb) / (2.0f * r_e) - (h_a * d_nb + h_b * d_an) / d_ab;
float v = h * std::sqrt((2.0f * d_ab) / (wavelength * d_an * d_nb));
diff_path[n] = v;
}
After optimization:
const unsigned it = (b - a) - 1;  // number of interior points between a and b
for (unsigned n = 0; n < it; n++)
{
// n now starts at 0; the original index is a + 1 + n
float d_an = path_res * static_cast<float>(n + 1);
float d_nb = path_res * static_cast<float>(it - n);
float h = elevation_path[a + 1 + n] + (d_an * d_nb) / (2.0f * r_e) - (h_a * d_nb + h_b * d_an) / d_ab;
float v = h * std::sqrt((2.0f * d_ab) / (wavelength * d_an * d_nb));
diff_path[a + 1 + n] = v;
}