Udacity的CUDA编程课程中介绍了CUDA实现矩阵转置的六种方式,本文介绍其中的一种方式
如果矩阵为N*N的方阵。该方式让每个线程处理一个矩阵元素,总共需要N*N个线程。首先,声明两个常量并配置blocks,threads:
const int N=1024;
const int K=32;
dim3 blocks(N/K,N/K);
dim3 threads(K,K);
内核函数:
__global__ void
transpose_parallel_per_element_tiled(float in[], float out[])
{
// (i,j) locations of the tile corners for input & output matrices:
int in_corner_i = blockIdx.x * blockDim.x, in_corner_j = blockIdx.y * blockDim.y;
int out_corner_i = in_corner_j, out_corner_j = in_corner_i;
int x = threadIdx.x, y = threadIdx.y;
__shared__ float tile[K][K];
// coalesced read from global mem, TRANSPOSED write into shared mem:
tile[y][x] = in[(in_corner_i + x) + (in_corner_j + y)*N];
__syncthreads();
// read from shared mem, coalesced write to global mem:
out[(out_corner_i + x) + (out_corner_j + y)*N] = tile[x][y];
}
内核涉及两个输入参数,in代表输入矩