CUDA convolution code study

1. Step 1: Declare the variables used in the rest of the code: the mask height and width, the image width and height, the number of image channels, the host input/output buffers, the device input/output buffers, and the device-side mask. The code is as follows:

	const int maskRows = 5;
	const int maskColumns = 5;
	int imageChannels;
	int imageWidth;
	int imageHeight;
	Image_t* inputImage;
	Image_t* outputImage;
	float *hostInputImageData;
	float *hostOutputImageData;
	float *deviceInputImageData;
	float *deviceOutputImageData;
	float *deviceMaskData;
	// 5x5 averaging (box) filter: every weight is 1/25 = 0.04
	float hostMaskData[maskRows * maskColumns] = {
			0.04, 0.04, 0.04, 0.04, 0.04,
			0.04, 0.04, 0.04, 0.04, 0.04,
			0.04, 0.04, 0.04, 0.04, 0.04,
			0.04, 0.04, 0.04, 0.04, 0.04,
			0.04, 0.04, 0.04, 0.04, 0.04 };

2. Step 2:
The code works on PPM image data, so the PPM_import function is used to parse the input file:
inputImage = PPM_import("/home/zsm/CUDA_Image_Convolution/computer_programming.ppm");
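The image dimensions and the host data pointers used in the later steps come from this parsed image. The accessor names below (Image_getWidth, Image_getHeight, Image_getChannels, Image_getData, Image_new) are assumptions about the Image_t helper that ships with this code, not something shown in the excerpt; treat this as a sketch of the idea:

	// Read the image attributes needed for buffer sizing and kernel arguments (assumed helper API)
	imageWidth = Image_getWidth(inputImage);
	imageHeight = Image_getHeight(inputImage);
	imageChannels = Image_getChannels(inputImage);

	// Allocate an output image of the same shape and grab the raw float buffers (assumed helper API)
	outputImage = Image_new(imageWidth, imageHeight, imageChannels);
	hostInputImageData = Image_getData(inputImage);
	hostOutputImageData = Image_getData(outputImage);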
3. Step 3:
Errors can be checked with CUDA's built-in error handling: declare a cudaError_t cudaStatus and inspect the status returned by each API call.
Device memory is allocated with cudaMalloc.
cudaMemcpy copies the host data to the device, and later copies the device data back to the host.
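A minimal sketch of this step, continuing from the declarations in step 1 (buffer sizes follow from the image attributes; only the first call's status is checked here for brevity):

	cudaError_t cudaStatus;
	size_t imageBytes = imageWidth * imageHeight * imageChannels * sizeof(float);
	size_t maskBytes = maskRows * maskColumns * sizeof(float);

	// Allocate device buffers for the input image, output image and mask
	cudaStatus = cudaMalloc((void **) &deviceInputImageData, imageBytes);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(cudaStatus));
	}
	cudaMalloc((void **) &deviceOutputImageData, imageBytes);
	cudaMalloc((void **) &deviceMaskData, maskBytes);

	// Copy the input image and the mask from host to device
	cudaMemcpy(deviceInputImageData, hostInputImageData, imageBytes, cudaMemcpyHostToDevice);
	cudaMemcpy(deviceMaskData, hostMaskData, maskBytes, cudaMemcpyHostToDevice);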
4. Step 4:
Configure the grid and block. The ceil on the grid dimensions makes sure that partial tiles at the right and bottom edges of the image still get a block; the kernel launch with these dimensions is sketched after the code.
#define TILE_WIDTH 16
dim3 dimGrid(ceil((float)imageWidth/TILE_WIDTH), ceil((float)imageHeight/TILE_WIDTH));
dim3 dimBlock(TILE_WIDTH,TILE_WIDTH,1);
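With these dimensions the kernel is launched as below, and, as noted in step 3, cudaMemcpy then brings the result back to the host. A minimal sketch (the arguments follow the kernel signature shown in the walkthrough; error checks omitted for brevity):

	convolution<<<dimGrid, dimBlock>>>(deviceInputImageData, deviceMaskData,
			deviceOutputImageData, imageChannels, imageWidth, imageHeight);
	cudaDeviceSynchronize();

	// Copy the filtered image back from device to host
	cudaMemcpy(hostOutputImageData, deviceOutputImageData,
			imageWidth * imageHeight * imageChannels * sizeof(float),
			cudaMemcpyDeviceToHost);

	// Release the device buffers once the result is back on the host
	cudaFree(deviceInputImageData);
	cudaFree(deviceOutputImageData);
	cudaFree(deviceMaskData);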
Kernel walkthrough:
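The kernel relies on a few macros that are not shown in this excerpt. Given the 5×5 mask and the w = 20 shared tile discussed below, they presumably look like the following (TILE_WIDTH is the same value defined in step 4; clamp keeps the result inside the [0, 1] range of the float image data; treat these definitions as assumptions rather than the original ones):

#define Mask_width 5
#define Mask_radius (Mask_width / 2)            // 2
#define w (TILE_WIDTH + Mask_width - 1)         // 20: output tile plus the halo
#define clamp(x) (fminf(fmaxf((x), 0.0f), 1.0f))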

__global__ void convolution(float *I, const float *__restrict__ M, float *P,
		int channels, int width, int height) {
	__shared__ float N_ds[w][w];
	int k;
	for (k = 0; k < channels; k++) {
		// First batch loading
		int dest = threadIdx.y * TILE_WIDTH + threadIdx.x;
		int destY = dest / w;
		int destX = dest % w;
		int srcY = blockIdx.y * TILE_WIDTH + destY - Mask_radius;
		int srcX = blockIdx.x * TILE_WIDTH + destX - Mask_radius;
		int src = (srcY * width + srcX) * channels + k;
		if (srcY >= 0 && srcY < height && srcX >= 0 && srcX < width) {
			N_ds[destY][destX] = I[src];
		} else {
			N_ds[destY][destX] = 0;
		}

		// Second batch loading
		dest = threadIdx.y * TILE_WIDTH + threadIdx.x + TILE_WIDTH * TILE_WIDTH;
		destY = dest / w;
		destX = dest % w;
		srcY = blockIdx.y * TILE_WIDTH + destY - Mask_radius;
		srcX = blockIdx.x * TILE_WIDTH + destX - Mask_radius;
		src = (srcY * width + srcX) * channels + k;
		if (destY < w) {
			if (srcY >= 0 && srcY < height && srcX >= 0 && srcX < width) {
				N_ds[destY][destX] = I[src];
			} else {
				N_ds[destY][destX] = 0;
			}
		}
		__syncthreads();

		float accum = 0;
		int y, x;
		for (y = 0; y < Mask_width; y++) {
			for (x = 0; x < Mask_width; x++) {
				accum += N_ds[threadIdx.y + y][threadIdx.x + x]
						* M[y * Mask_width + x];
			}
		}
		y = blockIdx.y * TILE_WIDTH + threadIdx.y;
		x = blockIdx.x * TILE_WIDTH + threadIdx.x;
		if (y < height && x < width)
			P[(y * width + x) * channels + k] = clamp(accum);
		__syncthreads();
	}
}

The kernel is built around shared memory, which gives a large performance boost because each input pixel inside a tile is read from global memory once and then reused from on-chip shared memory by all threads of the block.
It first sets up a shared-memory tile, __shared__ float N_ds[w][w], where w = TILE_WIDTH + Mask_width - 1 = 16 + 5 - 1 = 20: the 16×16 output tile plus a halo of Mask_radius = 2 pixels on every side.
The loading happens in two batches.
1) First batch of loads
// First batch loading
int dest = threadIdx.y * TILE_WIDTH + threadIdx.x;
// row destY inside the shared tile
int destY = dest / w;
// column destX inside the shared tile
int destX = dest % w;
int srcY = blockIdx.y * TILE_WIDTH + destY - Mask_radius;
int srcX = blockIdx.x * TILE_WIDTH + destX - Mask_radius;
int src = (srcY * width + srcX) * channels + k;
// pixels that fall outside the image are zero-padded
if (srcY >= 0 && srcY < height && srcX >= 0 && srcX < width) {
N_ds[destY][destX] = I[src];
} else {
N_ds[destY][destX] = 0;
}
2) Second batch of loads. The first batch fills only TILE_WIDTH * TILE_WIDTH = 256 of the w * w = 400 shared cells, so each thread loads one more element at an offset of 256; the destY < w test discards the surplus indices beyond the tile (the small check after this listing confirms that the two batches together write every cell exactly once).
// Second batch loading
dest = threadIdx.y * TILE_WIDTH + threadIdx.x + TILE_WIDTH * TILE_WIDTH;
destY = dest / w;
destX = dest % w;
srcY = blockIdx.y * TILE_WIDTH + destY - Mask_radius;
srcX = blockIdx.x * TILE_WIDTH + destX - Mask_radius;
src = (srcY * width + srcX) * channels + k;
if (destY < w) {
if (srcY >= 0 && srcY < height && srcX >= 0 && srcX < width) {
N_ds[destY][destX] = I[src];
} else {
N_ds[destY][destX] = 0;
}
}
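To see why two batches are enough, here is a small host-side check (an illustration added for these notes, not part of the original code) that replays the dest → (destY, destX) mapping for a 16×16 block and a 20×20 tile: the first batch covers dest 0..255, the second covers 256..399, and the destY < w test throws away the surplus indices 400..511.

#include <stdio.h>

#define TILE_WIDTH 16
#define Mask_width 5
#define w (TILE_WIDTH + Mask_width - 1) // 20

int main(void) {
	int covered[w][w] = { { 0 } };

	for (int ty = 0; ty < TILE_WIDTH; ty++) {
		for (int tx = 0; tx < TILE_WIDTH; tx++) {
			// First batch: dest runs over 0..TILE_WIDTH*TILE_WIDTH-1
			int dest = ty * TILE_WIDTH + tx;
			covered[dest / w][dest % w]++;

			// Second batch: offset by TILE_WIDTH*TILE_WIDTH, skip indices past the tile
			dest += TILE_WIDTH * TILE_WIDTH;
			if (dest / w < w)
				covered[dest / w][dest % w]++;
		}
	}

	int ok = 1;
	for (int y = 0; y < w; y++)
		for (int x = 0; x < w; x++)
			if (covered[y][x] != 1)
				ok = 0;

	printf("every shared-memory cell written exactly once: %s\n", ok ? "yes" : "no");
	return 0;
}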
