CUDA-GPU programming introduction (2)

This post introduces the basic concepts of CUDA programming, including storage class qualifiers, data types, and built-in kernel variables, and explains in detail how to define and launch CUDA kernel functions and how to use dim3 to configure grids and blocks.


CUDA extensions to the C language:

Storage class qualifiers:

Functions:
- `__global__` — device kernels, callable from the host (and from the device on CC 3.5 or higher)
- `__device__` — device functions (only callable from the device)
- `__host__` — host functions (only callable from the host)
  - the default if no qualifier is specified
  - can be combined with `__device__`

Data:
- `__shared__` — memory shared by a block of threads executing on a multiprocessor
- `__constant__` — special memory for constants (cached)
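
A minimal sketch of these qualifiers in use (the names `scale`, `square`, `coeff`, and `tile` are illustrative, not from the original text):

```cuda
__constant__ float coeff[16];               // cached constant memory, set from the host

__host__ __device__ float square(float x)   // compiled for both host and device
{
    return x * x;
}

__global__ void scale(float *data)          // kernel, launched from the host
{
    __shared__ float tile[256];             // shared by the threads of one block
                                            // (assumes blockDim.x == 256)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];
    __syncthreads();
    data[i] = square(tile[threadIdx.x]) * coeff[0];
}
```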

CUDA data types:

• C primitives:
– char, int, float, double, …
• Short vectors:
– int2, int3, int4, uchar2, uchar4, float2, float3, float4, …
– no built-in vector math (although a utility header, cutil_math.h, defines some common operations); see the sketch after this list
• Special type used to represent dimensions:
– dim3
• Support for user-defined structures, e.g.:

```cuda
struct particle
{
    float3 position, velocity, acceleration;
    float mass;
};
```
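
As noted above, the short vector types are plain structs with `.x/.y/.z/.w` members and `make_*()` constructor helpers, so arithmetic has to be written out component by component (a minimal sketch):

```cuda
float3 v = make_float3(1.0f, 2.0f, 3.0f);
float3 w = make_float3(4.0f, 5.0f, 6.0f);

// no operator+ for float3 in plain CUDA C; write it out (or use cutil_math.h)
float3 sum = make_float3(v.x + w.x, v.y + w.y, v.z + w.z);
float  dot = v.x * w.x + v.y * w.y + v.z * w.z;
```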

In essence, CUDA extends C with a handful of vector types plus the dim3 type for expressing dimensions. Several very important built-in variables based on these types are listed below:

Built-in kernel variables:
dim3 gridDim – number of blocks in the grid
dim3 blockDim – number of threads per block
uint3 blockIdx – index of the current block within the grid
uint3 threadIdx – index of the current thread within the block

Basic restrictions on CUDA functions:

CUDA kernels: limitations
• No recursion in `__global__` functions
• Recursion in `__device__` functions is allowed on cards with CC 2.x or higher
• No variable argument lists
• No dynamic memory allocation inside kernels (device-side malloc/free require CC 2.x or higher)
• Function pointers to `__device__` functions in device code are only supported on CC 2.x or higher
• No static variables inside kernels (except `__shared__`)

Note: we can supply different code paths for GPUs of different compute capability (CC), as the sketch below shows.
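A minimal sketch of per-CC code paths, standing in for the missing figure (`fillBuffer` is an illustrative name, not from the original text). Device code can branch on the `__CUDA_ARCH__` macro:

```cuda
__device__ int *fillBuffer(int n)
{
#if __CUDA_ARCH__ >= 200
    return (int *)malloc(n * sizeof(int));   // device-side malloc: CC 2.0 and higher
#else
    return 0;                                // older devices: caller must preallocate
#endif
}
```

At build time, nvcc can embed code for several CCs in one executable with the standard -gencode flags, e.g.:

```
nvcc -gencode arch=compute_20,code=sm_20 -gencode arch=compute_35,code=sm_35 prog.cu
```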

Launching kernels
• Launchable kernels must be declared as `__global__ void`:

```cuda
__global__ void myKernel(paramList);
```

• Kernel calls must specify the device execution environment:

– grid definition – number of blocks in the grid
– block definition – number of threads per block
– optionally, the amount of shared memory per block (more on that later)

• Kernel launch syntax:

```cuda
myKernel<<<GridDef, BlockDef>>>(paramList);
```

• GridDef and BlockDef can be specified as dim3 objects:
– grids can be 1D, 2D or 3D
– blocks can be 1D, 2D or 3D
• This makes it easy to set up different memory addressing for multi-dimensional data, as the end-to-end sketch below shows.
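
A minimal end-to-end launch sketch (the kernel `doubleAll`, the size `N`, and the block size are illustrative assumptions, not from the original text):

```cuda
#include <cuda_runtime.h>

__global__ void doubleAll(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main(void)
{
    const int N = 1024;
    float *d_data;
    cudaMalloc((void **)&d_data, N * sizeof(float));

    dim3 grid(N / 256, 1, 1);            // 4 blocks in a 1D grid
    dim3 block(256, 1, 1);               // 256 threads per block
    doubleAll<<<grid, block>>>(d_data);  // optional third <<<>>> argument
                                         // (shared memory bytes) defaults to 0
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```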

Thread addressing:

• 1D addressing example: 100 blocks with 256 threads per block:

```cuda
dim3 gridDef1(100,1,1);
dim3 blockDef1(256,1,1);
kernel1<<<gridDef1, blockDef1>>>(paramList);
```

• 2D addressing example: 10x10 blocks with 16x16 threads per block:

```cuda
dim3 gridDef2(10,10,1);
dim3 blockDef2(16,16,1);
kernel2<<<gridDef2, blockDef2>>>(paramList);
```

• Both examples launch the same number of threads, but block and thread
indexing is different
– kernel1 uses blockIdx.x, blockDim.x and threadIdx.x
– kernel2 uses blockIdx.[xy], blockDim.[xy], threadIdx.[xy]

One-dimensional addressing example:

```cuda
__global__ void kernel1(float *idata, float *odata)
{
    // global 1D index: block offset plus thread offset within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    odata[i] = func(idata[i]);   // func() is a placeholder per-element operation
}
```
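
In practice, when the data size is not an exact multiple of the block size, the kernel takes the element count and guards against out-of-range threads (a sketch; the parameter `n` is an assumption, and func() is the same placeholder as above):

```cuda
__global__ void kernel1(float *idata, float *odata, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // threads past the end of the data do nothing
        odata[i] = func(idata[i]);
}
```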

Two-dimensional addressing example:

```cuda
__global__ void kernel2(float *idata, float *odata, int pitch)
{
    int x, y, i;
    x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    i = y * pitch + x;                           // pitch = row width in elements
    odata[i] = func(idata[i]);
}
...
dim3 gridDef2(10,10,1);
dim3 blockDef2(16,16,1);
kernel2<<<gridDef2, blockDef2>>>(paramList);
```
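
Pitched rows are typically allocated with cudaMallocPitch, which returns the pitch in bytes, whereas kernel2 above indexes in elements. A minimal sketch, with illustrative sizes matching the 10x10 blocks of 16x16 threads:

```cuda
float *d_idata;
size_t pitchBytes;
// 160x160 floats; rows are padded by the driver for coalesced access
cudaMallocPitch((void **)&d_idata, &pitchBytes, 160 * sizeof(float), 160);
int pitch = (int)(pitchBytes / sizeof(float));   // pitch in elements, as kernel2 expects
```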

(Figure: thread-addressing schematic; image not available.)
