1. Introduction to kernel functions
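- Kernel functions are the core of CUDA programming. CUDA C/C++ source goes in a **.cu** file, which must be compiled with nvcc for the CUDA syntax to be recognized. Apart from that, a .cu file can be written like an ordinary C++ file: CUDA C++ is an extension of C++, so host code can use the usual C++ features (within the language standard that nvcc and the host compiler support).
- A kernel is declared with __global__ and is called (launched) from the host; a device function is declared with __device__ and is called from device code; a host function is declared with __host__ (the default when no qualifier is given) and is called from the host; shared-memory variables are declared with __shared__.
- Code inside a kernel runs on the GPU, so any user-defined function called from inside a kernel must itself be declared with __device__.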
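- A function that needs to be callable from both the host and the device is declared with __host__ __device__. (__global__ cannot be combined with __device__.)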
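The qualifiers above can be sketched in one small program. This is a minimal illustration, not a reference implementation: the names `square`, `scaled_square`, `square_kernel`, and `run_square_demo` are made up for this example, and the launch assumes a CUDA-capable device is present.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Callable from host code and from device code.
__host__ __device__ inline float square(float x) { return x * x; }

// Callable from device code only (kernels or other device functions).
__device__ float scaled_square(float x, float s) { return square(x * s); }

// Kernel: launched from the host, runs on the GPU, must return void.
__global__ void square_kernel(const float* in, float* out, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = scaled_square(in[i], s);
}

// Host driver (hypothetical helper): true if GPU and CPU results agree.
bool run_square_demo() {
    const int n = 8;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }

    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    square_kernel<<<1, n>>>(d_in, d_out, n, 2.0f);
    // This blocking copy is ordered after the kernel, so it also waits for it.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    // square() is called here on the host; the kernel called it on the device.
    for (int i = 0; i < n; ++i)
        if (h_out[i] != square(h_in[i] * 2.0f)) return false;
    return true;
}
```

Calling `run_square_demo()` from `main` and compiling the file with nvcc is enough to try it. The inputs are small integers, so the float comparison is exact.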
2. Launching a kernel
function<<<gridDim, blockDim, sharedMemorySize, stream>>>(args...)
int nthread = gridDim.x * gridDim.y * gridDim.z * blockDim.x * blockDim.y * blockDim.z;
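As a rough sketch of the four launch parameters and the thread-count formula, the kernel below lets every thread add 1 to a counter, so the final value equals the number of threads launched. The names `count_threads` and `launch_and_count` and the grid and block sizes are assumptions chosen for illustration. The third launch parameter is taken here to be the number of bytes of dynamic shared memory per block, which the kernel reads through an `extern __shared__` array.

```cuda
#include <cuda_runtime.h>

__global__ void count_threads(int* counter) {
    // Dynamic shared memory; its size comes from the 3rd launch parameter.
    extern __shared__ int block_count[];
    int t = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;
    if (t == 0) block_count[0] = 0;
    __syncthreads();
    atomicAdd(&block_count[0], 1);                   // count threads in this block
    __syncthreads();
    if (t == 0) atomicAdd(counter, block_count[0]);  // add the block total
}

// Host driver (hypothetical helper): returns the counted threads, or -1 on error.
int launch_and_count() {
    const dim3 grid(2, 3, 1);    // gridDim: 6 blocks
    const dim3 block(4, 2, 2);   // blockDim: 16 threads per block
    const size_t shared_bytes = sizeof(int);

    int* d_counter = nullptr;
    if (cudaMalloc(&d_counter, sizeof(int)) != cudaSuccess) return -1;
    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) { cudaFree(d_counter); return -1; }

    int h_counter = -1;
    cudaMemsetAsync(d_counter, 0, sizeof(int), stream);
    count_threads<<<grid, block, shared_bytes, stream>>>(d_counter);
    cudaMemcpyAsync(&h_counter, d_counter, sizeof(int),
                    cudaMemcpyDeviceToHost, stream);
    cudaError_t err = cudaStreamSynchronize(stream);  // wait for all three steps

    cudaStreamDestroy(stream);
    cudaFree(d_counter);
    return err == cudaSuccess ? h_counter : -1;
}
```

With these sizes the formula gives 2 * 3 * 1 * 4 * 2 * 2 = 96 threads, which is what `launch_and_count()` should return on a working device.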
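- Only functions declared with __global__ can be launched with the <<<>>> syntax.
- Kernel arguments are passed by value; references cannot be used as kernel parameters. Classes and structs can be passed (they are copied), a kernel can be a template, and its return type must be void.
- For example: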
__global__ void kernel(const float* pdata, int ndata)
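must be launched as
kernel<<<gridDim, blockDim, bytesSharedMemorySize, stream>>>(pdata, ndata)
- A kernel launch is asynchronous: control returns to the host immediately, before the kernel has finished.
- The thread layout is determined mainly by blockDim and gridDim.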
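The points in this list (by-value arguments, struct parameters, template kernels, asynchronous launch) can be illustrated together. The sketch below is hedged: `Affine`, `affine_kernel`, and `run_affine_demo` are hypothetical names, and 256 threads per block is an arbitrary but common choice.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Plain struct; it is copied into the kernel's parameters (pass by value).
struct Affine {
    float scale;
    float offset;
};

// A kernel can be a template; the return type is still void.
template <typename T>
__global__ void affine_kernel(const T* in, T* out, int n, Affine p) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = static_cast<T>(in[i] * p.scale + p.offset);
}

// Host driver (hypothetical helper): true if every element equals 2*i + 1.
bool run_affine_demo() {
    const int n = 1000;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    const int block = 256;
    const int grid = (n + block - 1) / block;  // enough blocks to cover n
    affine_kernel<float><<<grid, block>>>(d_in, d_out, n, Affine{2.0f, 1.0f});

    // The launch has already returned; the kernel may still be running.
    cudaError_t launch_err = cudaGetLastError();      // bad launch configuration?
    cudaError_t sync_err = cudaDeviceSynchronize();   // wait for completion

    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (launch_err != cudaSuccess || sync_err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (h_out[i] != 2.0f * h_in[i] + 1.0f) return false;
    return true;
}
```

Checking `cudaGetLastError()` right after the launch and the result of `cudaDeviceSynchronize()` afterwards is one common way to separate launch errors from errors that occur while the kernel runs.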
3. Thread indexing
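- Inside a kernel, a thread locates itself with the built-in variables threadIdx, blockIdx, blockDim, and gridDim. They pair up as index and size: blockDim is the size of a block (how many threads it contains) and threadIdx is a thread's index within its block; gridDim is the size of the grid (how many blocks it contains) and blockIdx is a block's index within the grid.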
| Index | Shape |
|---|---|
| threadIdx | < blockDim |
| blockIdx | < gridDim |
int idx = threadIdx.x + blockIdx.x * blockDim.x;
| dims | indices |
|---|---|
| gridDim.z | blockIdx.z |
| gridDim.y | blockIdx.y |
| gridDim.x | blockIdx.x |
| blockDim.z | threadIdx.z |
| blockDim.y | threadIdx.y |
| blockDim.x | threadIdx.x |
int threadidx = ((((blockIdx.z * gridDim.y + blockIdx.y) * gridDim.x + blockIdx.x) * blockDim.z + threadIdx.z) * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;
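Equivalently, in pseudocode, with dims and indices holding the two table columns read top to bottom (gridDim.z and blockIdx.z first, blockDim.x and threadIdx.x last):
position = 0
for i in range(6):
    position *= dims[i]
    position += indices[i]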
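The nested formula and the loop can be compared directly on the device. In the sketch below (`flat_index_loop`, `flatten_kernel`, and `run_flatten_demo` are made-up names, and the 2x2x2 grid of 3x2x2 blocks is an arbitrary choice), each thread computes its flat index both ways, records any disagreement, and writes the index into the slot it points to. If the flattening is correct, every slot `i` ends up holding `i`.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Loop form: dims and indices listed from slowest- to fastest-varying.
__device__ unsigned int flat_index_loop() {
    const unsigned int dims[6] = {gridDim.z,  gridDim.y,  gridDim.x,
                                  blockDim.z, blockDim.y, blockDim.x};
    const unsigned int indices[6] = {blockIdx.z,  blockIdx.y,  blockIdx.x,
                                     threadIdx.z, threadIdx.y, threadIdx.x};
    unsigned int position = 0;
    for (int i = 0; i < 6; ++i) position = position * dims[i] + indices[i];
    return position;
}

__global__ void flatten_kernel(int* out, int* mismatches) {
    // Nested (closed) form of the same computation.
    unsigned int closed =
        ((((blockIdx.z * gridDim.y + blockIdx.y) * gridDim.x + blockIdx.x)
              * blockDim.z + threadIdx.z) * blockDim.y + threadIdx.y)
            * blockDim.x + threadIdx.x;
    if (closed != flat_index_loop()) atomicAdd(mismatches, 1);
    out[closed] = static_cast<int>(closed);
}

// Host driver (hypothetical helper): true if both forms agree and the
// flat indices cover 0 .. total-1 exactly once.
bool run_flatten_demo() {
    const dim3 grid(2, 2, 2);
    const dim3 block(3, 2, 2);
    const int total = grid.x * grid.y * grid.z * block.x * block.y * block.z;

    int* d_out = nullptr;
    int* d_mismatches = nullptr;
    if (cudaMalloc(&d_out, total * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&d_mismatches, sizeof(int)) != cudaSuccess) {
        cudaFree(d_out);
        return false;
    }
    cudaMemset(d_out, 0xFF, total * sizeof(int));  // every int becomes -1
    cudaMemset(d_mismatches, 0, sizeof(int));

    flatten_kernel<<<grid, block>>>(d_out, d_mismatches);

    std::vector<int> h_out(total);
    int h_mismatches = -1;
    cudaMemcpy(h_out.data(), d_out, total * sizeof(int), cudaMemcpyDeviceToHost);
    cudaError_t err = cudaMemcpy(&h_mismatches, d_mismatches, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    cudaFree(d_mismatches);
    if (err != cudaSuccess || h_mismatches != 0) return false;

    for (int i = 0; i < total; ++i)
        if (h_out[i] != i) return false;  // a slot was missed or miswritten
    return true;
}
```

Pre-filling the output with -1 means a slot that no thread wrote to is detected, so the check covers both "no two threads collide" and "no index is skipped".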
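For a 1D launch with m blocks of n threads each, the values are:

| dims | indices |
|---|---|
| gridDim.z = 1 | blockIdx.z = 0 |
| gridDim.y = 1 | blockIdx.y = 0 |
| gridDim.x = m | blockIdx.x = 0 to m-1 |
| blockDim.z = 1 | threadIdx.z = 0 |
| blockDim.y = 1 | threadIdx.y = 0 |
| blockDim.x = n | threadIdx.x = 0 to n-1 |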
Substituting these values (shown in parentheses) into the general formula:
int threadidx = ((((blockIdx.z(0) * gridDim.y(1) + blockIdx.y(0)) * gridDim.x(m) + blockIdx.x(0..m-1)) * blockDim.z(1) + threadIdx.z(0)) * blockDim.y(1) + threadIdx.y(0)) * blockDim.x(n) + threadIdx.x(0..n-1)
= blockIdx.x * blockDim.x + threadIdx.x
