CUDA extensions to the C language:
storage class qualifiers:
functions:
1. `__global__` – device kernels, callable from host (and from device on CC 3.5 or higher)
2. `__device__` – device functions (only callable from device)
3. `__host__` – host functions (only callable from host)
   - default if not specified
   - can be combined with `__device__`
data:
- `__shared__` Memory shared by a block of threads executing on a multiprocessor
- `__constant__` Special memory for constants (cached)
CUDA data types:
• C primitives:
– char, int, float, double, …
• Short vectors:
– int2, int3, int4, uchar2, uchar4, float2, float3, float4, …
– no built-in vector math (although a utility header, cutil_math.h, defines some
common operations)
• Special type used to represent dimensions
– dim3
• Support for user-defined structures, e.g.:
struct particle
{
float3 position, velocity, acceleration;
float mass;
};
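A hedged sketch of how such a structure might be used in a kernel (the kernel name and the time step `dt` are illustrative, not from the original slides). Note that because there is no built-in vector math, each `float3` component is updated explicitly:

```cuda
// Hypothetical Euler integration step over an array of particles.
__global__ void integrate(struct particle *p, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // float3 has no built-in operators, so update components one by one
    p[i].velocity.x += p[i].acceleration.x * dt;
    p[i].velocity.y += p[i].acceleration.y * dt;
    p[i].velocity.z += p[i].acceleration.z * dt;
    p[i].position.x += p[i].velocity.x * dt;
    p[i].position.y += p[i].velocity.y * dt;
    p[i].position.z += p[i].velocity.z * dt;
}
```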
In short, CUDA mainly extends C with the short vector types and the dim3 dimension type.
The most important CUDA built-in kernel variables are:
Built-in kernel variables
dim3 gridDim – number of blocks in the grid
dim3 blockDim – number of threads per block
uint3 blockIdx – index of the current block within the grid
uint3 threadIdx – index of the current thread within the block
Basic restrictions on CUDA functions:
CUDA kernels: limitations
• No recursion in `__global__` functions
• Recursion is allowed in `__device__` functions on cards with CC 2.x or higher
• No variable argument lists
• No dynamic memory allocation inside kernels (device-side malloc/free is available on CC 2.x or higher)
• Function pointers to `__device__` functions in device code are only supported on CC 2.x or higher
• No static variables inside kernels (except `__shared__`)
Note: we can select different code paths depending on the GPU's compute capability (CC).
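The `__shared__` exception in the list above means a statically sized shared array is allowed inside a kernel. A minimal sketch (the kernel name and the fixed block size of 256 are illustrative); it reverses the elements handled by each block:

```cuda
// Hypothetical kernel: static __shared__ allocation inside a kernel is legal.
__global__ void reverseBlock(float *data)
{
    __shared__ float tmp[256];            // statically sized shared array
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tmp[threadIdx.x] = data[i];
    __syncthreads();                      // wait for the whole block to finish writing
    data[i] = tmp[blockDim.x - 1 - threadIdx.x];
}
```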
Launching kernels
• Launchable kernels must be declared as `__global__ void`:
__global__ void myKernel(paramList);
• Kernel calls must specify the device execution environment:
– grid definition – number of blocks in the grid
– block definition – number of threads per block
– optionally, the amount of shared memory per block (more on that later)
• Kernel launch syntax:
myKernel<<<GridDef, BlockDef>>>(paramList);
GridDef and BlockDef can be specified as dim3 objects
– grids can be 1D, 2D or 3D
– blocks can be 1D, 2D or 3D
• This makes it easy to set up different memory addressing for multi-dimensional data.
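The optional shared-memory amount mentioned above is passed as a third parameter inside the `<<< >>>` brackets; it sets the size of dynamically allocated shared memory per block. A hedged sketch (the kernel and sizes are illustrative):

```cuda
// Hypothetical kernel using dynamically sized shared memory.
__global__ void scaleKernel(float *data)
{
    extern __shared__ float tile[];       // size is set at launch time
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i] * 2.0f;
    __syncthreads();
    data[i] = tile[threadIdx.x];
}

// Launch: 100 blocks of 256 threads, 256 floats of shared memory per block
// scaleKernel<<<100, 256, 256 * sizeof(float)>>>(d_data);
```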
Thread addressing:
• 1D addressing example: 100 blocks with 256 threads per block:
dim3 gridDef1(100,1,1);
dim3 blockDef1(256,1,1);
kernel1<<<gridDef1, blockDef1>>>(paramList);
• 2D addressing example: 10x10 blocks with 16x16 threads per block:
dim3 gridDef2(10,10,1);
dim3 blockDef2(16,16,1);
kernel2<<<gridDef2, blockDef2>>>(paramList);
• Both examples launch the same number of threads, but block and thread
indexing is different
– kernel1 uses blockIdx.x, blockDim.x and threadIdx.x
– kernel2 uses blockIdx.[xy], blockDim.[xy], threadIdx.[xy]
One-dimensional addressing example:
__global__ void kernel1(float *idata, float *odata)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    odata[i] = func(idata[i]);   // func: some per-element __device__ function
}
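For completeness, a hedged sketch of the host-side code that would drive kernel1 (error checking omitted; `h_in`, `h_out`, and `N` are assumed to exist, with `N == 100 * 256` matching the 1D launch above):

```cuda
// Hypothetical host-side driver for kernel1.
float *d_in, *d_out;
size_t bytes = N * sizeof(float);
cudaMalloc(&d_in,  bytes);
cudaMalloc(&d_out, bytes);
cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

kernel1<<<100, 256>>>(d_in, d_out);       // 100 blocks x 256 threads = N

cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_in);
cudaFree(d_out);
```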
Two-dimensional addressing example:
__global__ void kernel2(float *idata, float *odata, int pitch)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int i = y * pitch + x;       // pitch = row width in elements
    odata[i] = func(idata[i]);
}
...
dim3 gridDef2(10,10,1);
dim3 blockDef2(16,16,1);
kernel2<<<gridDef2, blockDef2>>>(paramList);
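The pitch parameter of kernel2 pairs naturally with cudaMallocPitch, which pads each row so that rows start at aligned addresses. A hedged sketch (the width/height values are illustrative, chosen to match the 10x10 blocks of 16x16 threads above):

```cuda
// Hypothetical pitched allocation for kernel2.
size_t pitchBytes;
float *d_idata, *d_odata;
int width = 160, height = 160;            // 10x10 blocks of 16x16 threads
cudaMallocPitch((void**)&d_idata, &pitchBytes, width * sizeof(float), height);
cudaMallocPitch((void**)&d_odata, &pitchBytes, width * sizeof(float), height);

// kernel2 indexes in elements, so pass the pitch in elements, not bytes:
int pitchElems = (int)(pitchBytes / sizeof(float));
kernel2<<<gridDef2, blockDef2>>>(d_idata, d_odata, pitchElems);
```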
Addressing diagram: (figure not reproduced here)