Note: Before starting any optimization, use a profiler such as gprof (compile and link with -pg) to determine which functions consume the largest share of the run time. Note that gprof does not work well with multithreading enabled, so do the profiling on a single-threaded build.
Increasing IPC (Instructions Per Cycle)
The main idea behind increasing IPC is to expose several independent instructions that can execute together. With the -O3 optimization level, the compiler can schedule these instructions to issue in the same cycle, which increases the IPC. One common application is loop unrolling: executing several iterations of a loop at once. This only works when an iteration does not depend on the result of the previous iteration.
Example:
before optimization:
for ( int n = 0; n < n_limit; n++ ) {
double& ai = C[n];
for ( int i = 0; i < i_limit; i++ ) {
double a = A[i];
double b = B[i];
double mul = a * b;
double acc = ai + mul;
ai = acc;
}
}
after optimization:
for ( int n = 0; n < n_limit; n++ ) {
double tmp = 0.0;
int mod = i_limit % 7;
for (int i = 0 ; i < i_limit - mod - 6; i+=7){
double tmp0 = A[i] * B[i];
double tmp1 = A[i + 1] * B[i + 1];
double tmp2 = A[i + 2] * B[i + 2];
double tmp3 = A[i + 3] * B[i + 3];
double tmp4 = A[i + 4] * B[i + 4];
double tmp5 = A[i + 5] * B[i + 5];
double tmp6 = A[i + 6] * B[i + 6];
tmp += tmp0 + tmp1 + tmp2 + tmp3 + tmp4 + tmp5 + tmp6;
}
for (int i = i_limit - mod; i < i_limit; i++){
tmp += A[i] * B[i];
}
C[n] = tmp;
}
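Note that in the version above the seven products are independent, but they are all summed into the single accumulator tmp. A further refinement is to keep several independent accumulators so that the additions can also overlap. This is only a sketch, assuming the same A, B, C and i_limit as above and that reassociating the floating-point sum is acceptable for your accuracy requirements:
for ( int n = 0; n < n_limit; n++ ) {
    // four independent accumulators break the dependency chain on the additions
    double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;
    int i = 0;
    for ( ; i + 3 < i_limit; i += 4 ) {
        acc0 += A[i]     * B[i];
        acc1 += A[i + 1] * B[i + 1];
        acc2 += A[i + 2] * B[i + 2];
        acc3 += A[i + 3] * B[i + 3];
    }
    for ( ; i < i_limit; i++ ) {  // remainder iterations
        acc0 += A[i] * B[i];
    }
    C[n] = acc0 + acc1 + acc2 + acc3;
}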
Loop Reordering
This is a very effective way of optimizing software. The main idea is to make the innermost loop iterate over elements that are adjacent in memory. We do this to improve spatial locality (and, depending on the access pattern, temporal locality), which raises the cache hit rate.
Example:
before optimization:
for ( int n = 0; n < n_limit; n++ ) {
for ( int i = 0; i < i_limit; i++ ) {
sum += A[i][n];
}
}
after optimization:
for ( int i = 0; i < i_limit; i++ ) {
for ( int n = 0; n < n_limit; n++ ) {
sum += A[i][n];
}
}
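The same principle applies to denser nested loops such as matrix multiplication. The sketch below uses placeholder row-major 2D arrays A, B, C of size N x N (C assumed zero-initialized); it is not code from this project, just an illustration of the reordering:
// In the usual i-j-k order the innermost loop reads B[k][j] with a stride
// of N elements. Reordering to i-k-j makes the innermost loop stream
// through both B[k][...] and C[i][...] contiguously.
for ( int i = 0; i < N; i++ )
    for ( int k = 0; k < N; k++ )
        for ( int j = 0; j < N; j++ )
            C[i][j] += A[i][k] * B[k][j];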
Loop Tiling
This method splits a loop into blocks (tiles) whose working set fits in the cache, so that accesses within a tile keep hitting in the cache. It is only effective if the tile size is chosen correctly.
Example
before optimization:
for ( int n = 0; n < n_limit; n++ ) {
double& ai = C[n];
for ( int i = 0; i < i_limit; i++ ) {
double a = A[i];
double b = B[i];
double mul = a * b;
double acc = ai + mul;
ai = acc;
}
}
after optimization (tiling on n):
#define TILE_SIZE 4
for (int nn = 0; nn < n_limit; nn += TILE_SIZE){
for ( int n = nn; n < n_limit && n < nn + TILE_SIZE; n++ ) {
double& ai = C[n];
for ( int i = 0; i < i_limit; i++ ) {
double a = A[i];
double b = B[i];
double mul = a * b;
double acc = ai + mul;
ai = acc;
}
}
}
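The benefit of tiling is easier to see on a two-dimensional access pattern. The following is a sketch of tiling matrix multiplication, with placeholder row-major arrays A, B, C of size N x N, where N is assumed to be a multiple of TILE_SIZE to keep the sketch short:
#define TILE_SIZE 32
// Tiling j and k keeps a TILE_SIZE x TILE_SIZE block of B resident in cache
// while it is reused across the whole i loop.
for ( int jj = 0; jj < N; jj += TILE_SIZE )
    for ( int kk = 0; kk < N; kk += TILE_SIZE )
        for ( int i = 0; i < N; i++ )
            for ( int k = kk; k < kk + TILE_SIZE; k++ )
                for ( int j = jj; j < jj + TILE_SIZE; j++ )
                    C[i][j] += A[i][k] * B[k][j];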
Multithreading
Apply multithreading to a loop so that its iterations can execute in parallel. The library we use here is OpenMP (compile with -fopenmp when using gcc). Note that the number of threads should match the number of cores on your processor.
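As a reference, here is a minimal sketch of pinning the thread count to the number of available processors with the OpenMP runtime; setting the OMP_NUM_THREADS environment variable achieves the same thing without code changes:
#include <omp.h>
#include <cstdio>

int main() {
    // use one thread per available processor/core
    omp_set_num_threads( omp_get_num_procs() );

    #pragma omp parallel
    {
        #pragma omp single
        std::printf( "running with %d threads\n", omp_get_num_threads() );
    }
    return 0;
}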
Example
before optimization:
for ( int nn = 0; nn < out.size.x; nn+=BLOCK_SIZE ) {
for ( int b = 0; b < out.size.b; b++ ) {
for ( int n = nn; n < nn + BLOCK_SIZE && n < out.size.x; n++ ) {
for ( int i = 0; i < grads_out.size.x; i++ ) {
grads_out(i, 0, 0, b) += act_grad(n, 0, 0, b) * weights( i, n, 0);
}
}
}
}
after optimization:
#pragma omp parallel for
for ( int nn = 0; nn < out.size.x; nn+=BLOCK_SIZE ) {
tensor_t<double> tmp_grads_out (grads_out.size);
tmp_grads_out.clear();
for ( int b = 0; b < out.size.b; b++ ) {
for ( int n = nn; n < nn + BLOCK_SIZE && n < out.size.x; n++ ) {
for ( int i = 0; i < grads_out.size.x; i++ ) {
tmp_grads_out(i, 0, 0, b) += act_grad(n, 0, 0, b) * weights( i, n, 0);
}
}
}
#pragma omp critical
{
for ( int b = 0; b < out.size.b; b++ ) {
for ( int i = 0; i < grads_out.size.x; i++ ) {
grads_out(i, 0, 0, b) += tmp_grads_out(i, 0, 0, b);
}
}
}
}
This lets each thread accumulate its results into a local buffer. However, the per-thread results then have to be combined, which is why the critical section is needed. This combining step usually reduces the overall speedup.
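For scalar accumulations, the combining can be delegated to OpenMP's reduction clause, which avoids an explicit critical section. A minimal sketch, reusing the A, B and i_limit names from the earlier examples:
// each thread keeps a private copy of sum; OpenMP combines them at the end
double sum = 0.0;
#pragma omp parallel for reduction(+ : sum)
for ( int i = 0; i < i_limit; i++ ) {
    sum += A[i] * B[i];
}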
A better way to optimize the above loop:
for ( int nn = 0; nn < out.size.x; nn+=BLOCK_SIZE ) {
#pragma omp parallel for
for ( int b = 0; b < out.size.b; b++ ) {
for ( int n = nn; n < nn + BLOCK_SIZE && n < out.size.x; n++ ) {
for ( int i = 0; i < grads_out.size.x; i++ ) {
grads_out(i, 0, 0, b) += act_grad(n, 0, 0, b) * weights( i, n, 0);
}
}
}
}
This does not require a critical section, since each thread writes to a different portion of grads_out.
Vectorization
gcc already attempts to vectorize loops automatically when the -O3 flag is enabled, so often no extra work is needed. However, not every loop actually gets vectorized. Adding the compiler flag -fopt-info-vec-all produces feedback on every vectorization attempt, which lets us find the loops that were missed and rewrite them so they can be vectorized. Alternatively, #pragma omp simd can be placed in front of a loop to ask the compiler to vectorize it (this requires -fopenmp or -fopenmp-simd).
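For illustration, a minimal sketch of the pragma applied to a simple element-wise loop; the array names are placeholders and the arrays are assumed not to alias:
// ask the compiler to vectorize this loop (compile with -fopenmp-simd or -fopenmp)
#pragma omp simd
for ( int i = 0; i < i_limit; i++ ) {
    C[i] = A[i] * B[i];
}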
Some typical messages from the feedback:
complicated access pattern: try reordering the loops or rewriting the loop body to simplify the access pattern.
relevant stmt not supported: <stmt>: reported when, for example, a type cast or an if-else makes the loop unvectorizable; try to rewrite the logic so that those statements do not appear inside the innermost loop.
Example
Before optimization:
for (unsigned n = a + 1; n <= b - 1; n++)
{
float d_an = path_res * static_cast<float>(n - a);
float d_nb = path_res * static_cast<float>(b - n);
float h = elevation_path[n] + (d_an * d_nb) / (2.0f * r_e) - (h_a * d_nb + h_b * d_an) / d_ab;
float v = h * std::sqrt((2.0f * d_ab) / (wavelength * d_an * d_nb));
diff_path[n] = v;
}
After optimization:
const unsigned it = (b - a) - 1;  // number of interior points between a and b
for (unsigned n = 0; n < it; n++)
{
// n now starts at 0; the original index is a + 1 + n
float d_an = path_res * static_cast<float>(n + 1);
float d_nb = path_res * static_cast<float>(it - n);
float h = elevation_path[a + 1 + n] + (d_an * d_nb) / (2.0f * r_e) - (h_a * d_nb + h_b * d_an) / d_ab;
float v = h * std::sqrt((2.0f * d_ab) / (wavelength * d_an * d_nb));
diff_path[a + 1 + n] = v;
}