CUDA extensions to the C language:
storage class qualifiers:
functions:
1. `__global__` – device kernels, callable from host (and from device on CC 3.5 or higher)
2. `__device__` – device functions (only callable from device)
3. `__host__` – host functions (only callable from host)
   - default if not specified
   - can be combined with `__device__`
data:
- `__shared__` Memory shared by a block of threads executing on a multiprocessor
- `__constant__` Special memory for constants (cached)
CUDA data types:
• C primitives:
– char, int, float, double, …
• Short vectors:
– int2, int3, int4, uchar2, uchar4, float2, float3, float4, …
– no built-in vector math (although a utility header, cutil_math.h, defines some
common operations)
• Special type used to represent dimensions
– dim3
• Support for user-defined structures, e.g.:
struct particle
{
float3 position, velocity, acceleration;
float mass;
};
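A hedged sketch of how such a structure might be used in a kernel (the kernel name and the time step `dt` are illustrative, not from the original slides). Note that because there is no built-in vector math, each `float3` component is updated explicitly:

```cuda
// Hypothetical Euler integration step over an array of particles.
__global__ void integrate(struct particle *p, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // float3 has no built-in operators, so update components one by one
    p[i].velocity.x += p[i].acceleration.x * dt;
    p[i].velocity.y += p[i].acceleration.y * dt;
    p[i].velocity.z += p[i].acceleration.z * dt;
    p[i].position.x += p[i].velocity.x * dt;
    p[i].position.y += p[i].velocity.y * dt;
    p[i].position.z += p[i].velocity.z * dt;
}
```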
In short, CUDA mainly extends C with the short vector types and the dim3 dimension type.
The most important CUDA built-in kernel variables are:
Built-in kernel variables
dim3 gridDim – number of blocks in the grid
dim3 blockDim – number of threads per block
uint3 blockIdx – index of the current block within the grid
uint3 threadIdx – index of the current thread within the block
Basic restrictions on CUDA functions:
CUDA kernels: limitations
• No recursion in `__global__` functions
• Recursion is allowed in `__device__` functions on cards with CC 2.x or higher
• No variable argument lists
• No dynamic memory allocation inside kernels (device-side malloc/free is available on CC 2.x or higher)
• Function pointers to `__device__` functions in device code are only supported on CC 2.x or higher
• No static variables inside kernels (except `__shared__`)
Note: we can select different code paths depending on the GPU's compute capability (CC).
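The `__shared__` exception in the list above means a statically sized shared array is allowed inside a kernel. A minimal sketch (the kernel name and the fixed block size of 256 are illustrative); it reverses the elements handled by each block:

```cuda
// Hypothetical kernel: static __shared__ allocation inside a kernel is legal.
__global__ void reverseBlock(float *data)
{
    __shared__ float tmp[256];            // statically sized shared array
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tmp[threadIdx.x] = data[i];
    __syncthreads();                      // wait for the whole block to finish writing
    data[i] = tmp[blockDim.x - 1 - threadIdx.x];
}
```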
Launching kernels
• Launchable kernels must be declared as `__global__ void`:
__global__ void myKernel(paramList);
• Kernel calls must specify the device execution environment:
– grid definition – number of blocks in the grid
– block definition – number of threads per block
– optionally, the amount of shared memory per block (more on that later)
• Kernel launch syntax:
myKernel<<<GridDef, BlockDef>>>(paramList);
GridDef and BlockDef can be specified as dim3 objects
– grids can be 1D, 2D or 3D
– blocks can be 1D, 2D or 3D
• This makes it easy to set up different memory addressing for multi-dimensional data.
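The optional shared-memory amount mentioned above is passed as a third parameter inside the `<<< >>>` brackets; it sets the size of dynamically allocated shared memory per block. A hedged sketch (the kernel and sizes are illustrative):

```cuda
// Hypothetical kernel using dynamically sized shared memory.
__global__ void scaleKernel(float *data)
{
    extern __shared__ float tile[];       // size is set at launch time
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i] * 2.0f;
    __syncthreads();
    data[i] = tile[threadIdx.x];
}

// Launch: 100 blocks of 256 threads, 256 floats of shared memory per block
// scaleKernel<<<100, 256, 256 * sizeof(float)>>>(d_data);
```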
Thread addressing:
• 1D addressing example: 100 blocks with 256 threads per block:
dim3 gridDef1(100,1,1);
dim3 blockDef1(256,1,1);
kernel1<<<gridDef1, blockDef1>>>(paramList);
• 2D addressing example: 10x10 blocks with 16x16 threads per block:
dim3 gridDef2(10,10,1);
dim3 blockDef2(16,16,1);
kernel2<<<gridDef2, blockDef2>>>(paramList);
• Both examples launch the same number of threads, but block and thread
indexing is different
– kernel1 uses blockIdx.x, blockDim.x and threadIdx.x
– kernel2 uses blockIdx.[xy], blockDim.[xy], threadIdx.[xy]
One-dimensional addressing example:
__global__ void kernel1(float *idata, float *odata)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    odata[i] = func(idata[i]);   // func: some per-element __device__ function
}
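For completeness, a hedged sketch of the host-side code that would drive kernel1 (error checking omitted; `h_in`, `h_out`, and `N` are assumed to exist, with `N == 100 * 256` matching the 1D launch above):

```cuda
// Hypothetical host-side driver for kernel1.
float *d_in, *d_out;
size_t bytes = N * sizeof(float);
cudaMalloc(&d_in,  bytes);
cudaMalloc(&d_out, bytes);
cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

kernel1<<<100, 256>>>(d_in, d_out);       // 100 blocks x 256 threads = N

cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_in);
cudaFree(d_out);
```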
Two-dimensional addressing example:
__global__ void kernel2(float *idata, float *odata, int pitch)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int i = y * pitch + x;       // pitch = row width in elements
    odata[i] = func(idata[i]);
}
...
dim3 gridDef2(10,10,1);
dim3 blockDef2(16,16,1);
kernel2<<<gridDef2, blockDef2>>>(paramList);
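The pitch parameter of kernel2 pairs naturally with cudaMallocPitch, which pads each row so that rows start at aligned addresses. A hedged sketch (the width/height values are illustrative, chosen to match the 10x10 blocks of 16x16 threads above):

```cuda
// Hypothetical pitched allocation for kernel2.
size_t pitchBytes;
float *d_idata, *d_odata;
int width = 160, height = 160;            // 10x10 blocks of 16x16 threads
cudaMallocPitch((void**)&d_idata, &pitchBytes, width * sizeof(float), height);
cudaMallocPitch((void**)&d_odata, &pitchBytes, width * sizeof(float), height);

// kernel2 indexes in elements, so pass the pitch in elements, not bytes:
int pitchElems = (int)(pitchBytes / sizeof(float));
kernel2<<<gridDef2, blockDef2>>>(d_idata, d_odata, pitchElems);
```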
Addressing diagram: (figure not reproduced here)