CUDA Matrix Transpose (Shared-Memory Tile)

License: CC 4.0 BY-SA

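Udacity's CUDA programming course presents six ways to implement matrix transpose in CUDA. This article covers one of them: the version in which each thread block stages a K×K tile in shared memory.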

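Assume the matrix is an N×N square matrix. This approach assigns one thread to each matrix element, so N*N threads are needed in total. The matrix is split into K×K tiles, with one thread block per tile, which requires N to be a multiple of K. First, declare the two constants and configure the blocks and threads: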

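const int N=1024;       // the matrix is N x N
const int K=32;         // each tile is K x K, so K*K = 1024 threads per block
dim3 blocks(N/K,N/K);   // one block per tile: (N/K) x (N/K) tiles
dim3 threads(K,K);      // one thread per element of a tile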
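
The configuration above relies on two conditions that are easy to check at compile time: N must be a multiple of K, and K*K must not exceed the 1024-thread limit of a block. The following is a minimal sketch of those checks. The helper names `blocks_per_dim`, `total_threads`, `make_blocks`, and `make_threads` are hypothetical and do not come from the course code.

```cuda
#include <cuda_runtime.h>

const int N = 1024;  // the matrix is N x N
const int K = 32;    // each tile (and thread block) is K x K

// Hypothetical helpers for illustration, not part of the course code.
constexpr int blocks_per_dim(int n, int k) { return n / k; }

constexpr long long total_threads(int n, int k) {
    // (blocks per dimension)^2 blocks, each holding k*k threads
    return 1LL * blocks_per_dim(n, k) * blocks_per_dim(n, k) * k * k;
}

static_assert(N % K == 0, "N must be a multiple of K, or edge elements are skipped");
static_assert(K * K <= 1024, "a block may hold at most 1024 threads");
static_assert(total_threads(N, K) == 1LL * N * N, "exactly one thread per element");

// Build the same launch configuration as in the text.
inline dim3 make_blocks()  { return dim3(blocks_per_dim(N, K), blocks_per_dim(N, K)); }
inline dim3 make_threads() { return dim3(K, K); }
```

With N=1024 and K=32 this gives a 32×32 grid of blocks, each with 32×32 threads, for 1,048,576 threads in total.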

The kernel function:

__global__ void 
transpose_parallel_per_element_tiled(float in[], float out[])
{
	// (i,j) locations of the tile corners for input & output matrices:
	int in_corner_i  = blockIdx.x * blockDim.x, in_corner_j  = blockIdx.y * blockDim.y;
	int out_corner_i = in_corner_j, out_corner_j = in_corner_i;

	int x = threadIdx.x, y = threadIdx.y;

	__shared__ float tile[K][K];

	// coalesced read from global mem, straight (untransposed) write into shared mem:
	tile[y][x] = in[(in_corner_i + x) + (in_corner_j + y)*N];
	__syncthreads();	// the whole tile must be loaded before any thread reads from it
	// TRANSPOSED read from shared mem, coalesced write to global mem:
	out[(out_corner_i + x) + (out_corner_j + y)*N] = tile[x][y];
}
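
To see the kernel work end to end, a host-side check can launch it once and compare the result with the definition of a transpose, out[i][j] == in[j][i]. The following is a rough sketch under a few assumptions: the helper `count_transpose_errors` is hypothetical (not from the course code), the input is filled with the element index so that every value is unique, and error handling is kept minimal.

```cuda
#include <vector>
#include <cuda_runtime.h>

const int N = 1024;  // the matrix is N x N
const int K = 32;    // each tile is K x K

__global__ void transpose_parallel_per_element_tiled(float in[], float out[])
{
    // (i,j) locations of the tile corners for input & output matrices
    int in_corner_i  = blockIdx.x * blockDim.x, in_corner_j  = blockIdx.y * blockDim.y;
    int out_corner_i = in_corner_j, out_corner_j = in_corner_i;
    int x = threadIdx.x, y = threadIdx.y;

    __shared__ float tile[K][K];
    // coalesced read from global memory into the tile
    tile[y][x] = in[(in_corner_i + x) + (in_corner_j + y) * N];
    __syncthreads();
    // transposed read from the tile, coalesced write to global memory
    out[(out_corner_i + x) + (out_corner_j + y) * N] = tile[x][y];
}

// Hypothetical helper: runs the kernel once and returns the number of
// elements where out[i][j] != in[j][i], or -1 if a CUDA call failed.
int count_transpose_errors()
{
    const size_t count = size_t(N) * N;
    const size_t bytes = count * sizeof(float);
    std::vector<float> h_in(count), h_out(count);
    for (size_t idx = 0; idx < count; ++idx) h_in[idx] = float(idx);  // unique values

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 blocks(N / K, N / K);
    dim3 threads(K, K);
    transpose_parallel_per_element_tiled<<<blocks, threads>>>(d_in, d_out);
    cudaError_t err = cudaGetLastError();  // catches launch-configuration errors

    // This blocking copy on the default stream waits for the kernel to finish.
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int errors = 0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            if (h_out[size_t(i) * N + j] != h_in[size_t(j) * N + i]) ++errors;
    return errors;
}
```

Called from a `main` function and built with `nvcc`, `count_transpose_errors()` should return 0 if the kernel transposes the matrix correctly. The kernel only copies values, so the exact float comparison is safe here.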

The kernel takes two parameters; in is the input matrix