GPU-Puzzles is a great project for learning CUDA programming. Through a series of small problems it teaches the core ideas of GPU programming and CUDA-style parallelism: how kernels are written and launched, how to allocate shared memory, and how to implement operations such as convolution and matrix multiplication.
https://github.com/srush/GPU-Puzzles
This article continues my previous post and takes a closer look at a few of the harder puzzles, explaining their implementations and the reasoning behind the optimizations.
Puzzle 11 - 1D Convolution
Implement a kernel that computes a 1D convolution between a and b and stores it in out. You need to handle the general case. You only need 2 global reads and 1 global write per thread.
This puzzle teaches 1D convolution while constraining the number of global memory accesses per thread. The natural first idea is to stage both the input vector and the convolution kernel in shared memory. Since the shared tile is normally sized to the number of threads per block, but the convolution also multiplies a few elements past the block boundary, we additionally need to store a few halo elements at the edge of the tile. To cache the kernel b in shared memory and still stay within 2 global reads per thread, the solution uses a trick: the threads whose indices are not needed for loading the kernel spend their second read on the extra boundary elements instead (see the index-mapping sketch just before the kernel code below).
def conv_spec(a, b):
    out = np.zeros(*a.shape)
    len = b.shape[0]
    for i in range(a.shape[0]):
        out[i] = sum([a[i + j] * b[j] for j in range(len) if i + j < a.shape[0]])
    return out
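As a quick sanity check, here is what the reference spec produces on a small hand-picked input (the arrays below are my own example, not part of the puzzle):

import numpy as np

a = np.array([0, 1, 2, 3, 4, 5], dtype=np.float32)
b = np.array([0, 1, 2], dtype=np.float32)

# out[i] = sum_j a[i + j] * b[j], truncated at the right edge of a
print(conv_spec(a, b))  # [ 5.  8. 11. 14.  5.  0.]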
MAX_CONV = 4
TPB = 8
TPB_MAX_CONV = TPB + MAX_CONV
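Before reading the kernel, it may help to simulate the load pattern of a single block on the host. The sketch below is my own illustration (not part of the puzzle, and it omits the a_size bounds check): for block 0 with TPB = 8 and a kernel of length 4, it prints which element each thread's second global read fetches and which shared slot it fills.

# Host-side illustration of the "second read" trick (assumption: block 0, TPB = 8, b_size = 4).
TPB, b_size = 8, 4
for local_i in range(TPB):
    i = local_i  # global index within block 0
    if local_i < b_size:
        print(f"thread {local_i}: reads b[{local_i}] into shared_b[{local_i}]")
    else:
        local_j = local_i - b_size
        if local_j < b_size:
            print(f"thread {local_i}: reads a[{i - b_size + TPB}] into shared_a[{local_j + TPB}]")
# Threads 4..7 load a[8..11] into shared_a[8..11], i.e. the 4 halo elements
# that block 0 needs from the next block's tile.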
def conv_test(cuda):
    def call(out, a, b, a_size, b_size) -> None:
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 17 lines)
        # Shared tile of a (TPB elements plus MAX_CONV halo slots) and the kernel b.
        shared_a = cuda.shared.array(TPB_MAX_CONV, numba.float32)
        shared_b = cuda.shared.array(MAX_CONV, numba.float32)
        # First global read: every thread copies its own element of a.
        if i < a_size:
            shared_a[local_i] = a[i]
        # Second global read: threads 0..b_size-1 cache the kernel b; the rest
        # reuse their read to fetch the halo elements past this block's tile.
        if local_i < b_size:
            shared_b[local_i] = b[local_i]
        else:
            local_j = local_i - b_size
            if i - b_size + TPB < a_size and local_j < b_size:
                shared_a[local_j + TPB] = a[i - b_size + TPB]
        cuda.syncthreads()
        total = 0.0