CUDA 开发完全指南

CUDA开发核心指南与实践

原创已于 2025-09-12 10:43:01 修改 · 1k 阅读

13 ·

CC 4.0 BY-SA版权

文章标签：

#c++

于 2025-09-09 12:32:52 首次发布

部署运行你感兴趣的模型镜像

第一章 CUDA 核心库与组件

1.1 核心运行时库

CUDA 开发的基础库，提供设备管理、内存操作和内核调度等核心功能。

库文件	类型	用途	使用场景	核心功能
`cudart.lib`	动态链接库 (DLL)	CUDA Runtime API	需要灵活部署，多程序共享库文件	内存管理、设备查询、内核启动、流控制
`cudart_static.lib`	静态链接库	CUDA Runtime API	独立部署，避免 DLL 依赖	与动态库功能一致，直接嵌入可执行文件
`cuda.lib`	动态链接库	CUDA 驱动 API	高级开发、调试工具、直接与 GPU 驱动交互	底层设备控制、驱动版本查询、上下文管理

1.2 数学计算库

针对高性能数值计算场景优化的专业库，提供线性代数、矩阵运算等功能。

1.2.1 线性代数库 (cuBLAS)

库文件	类型	用途	典型应用	支持数据类型
`cublas.lib`	动态库	基本线性代数子程序	机器学习、科学计算、信号处理	单精度 / 双精度浮点、单精度 / 双精度复数
`cublas_static.lib`	静态库	基本线性代数子程序	独立部署的科学计算应用	同上

1.2.2 求解器库 (cuSolver)

库文件	功能分类	用途	使用场景
`cusolver.lib`	基础求解器	线性方程组、特征值、奇异值求解	中小型科学计算、工程模拟
`cusolverMg.lib`	多 GPU 求解器	分布式多 GPU 并行求解	大规模数值计算、超级计算
`cusolverDn.lib`	稠密矩阵求解器	稠密矩阵线性代数运算	全矩阵数据的数值分析
`cusolverSp.lib`	稀疏矩阵求解器	稀疏矩阵优化计算	有限元分析、网络分析、图论

1.2.3 稀疏矩阵库 (cuSparse)

库文件: cusparselib.lib
核心功能: 稀疏矩阵存储格式转换、稀疏矩阵 - 向量乘法、稀疏矩阵分解
使用场景: 处理大型稀疏矩阵（非零元素占比低），如电力网格分析、流体力学模拟
优势: 相比稠密矩阵计算节省 90% 以上内存空间

1.3 快速傅里叶变换库 (cuFFT)

库文件	类型	功能	支持维度	使用场景
`cufft.lib`	动态库	快速傅里叶变换	1D/2D/3D	信号处理、图像处理、频谱分析
`cufft_static.lib`	静态库	快速傅里叶变换	1D/2D/3D	独立部署的信号处理应用
`cufti.lib`	内部库	cuFFT 内部实现支持	-	通常无需直接使用

1.4 其他重要库

库文件	功能分类	核心功能	典型应用
`curand.lib`/`curand_static.lib`	随机数生成	高性能随机数生成	蒙特卡洛模拟、机器学习初始化
`nvrtc.lib`/`nvrtc_static.lib`	运行时编译	动态编译 CUDA 代码	JIT 编译、自定义内核生成
`nvml.lib`	GPU 管理	监控 GPU 状态、性能、功耗	系统监控工具、性能分析软件
`npp.lib`系列	性能原语	图像处理、视频处理	计算机视觉、多媒体处理
`magma.lib`/`magma_static.lib`	混合精度计算	混合精度加速数学运算	高性能计算、深度学习
`nvjpeg.lib`/`nvjpeg_static.lib`	JPEG 编解码	硬件加速 JPEG 处理	图像压缩、视频流处理
`OpenCL.lib`	跨平台计算	跨平台异构计算	多厂商 GPU 兼容应用

第二章静态库与动态库选择指南

2.1 两种链接方式对比

特性	静态库（带`_static`后缀）	动态库（无`_static`后缀）
部署方式	库代码嵌入可执行文件	运行时加载独立 DLL 文件
文件大小	可执行文件体积较大	可执行文件体积小
依赖管理	无外部依赖	需确保目标系统有对应 DLL
更新维护	需重新编译整个程序	单独更新 DLL 即可
内存占用	多个程序运行时重复加载	多个程序共享同一份库内存
编译速度	编译时间较长	编译时间较短

2.2 选择原则

简单应用: 使用 cudart.lib + cublas.lib 组合，平衡性能与部署复杂度
科学计算: 优先选择 cusolver.lib + cublas.lib，按需添加 cusparse.lib
图像处理: 推荐 npp.lib + cufft.lib，利用 GPU 硬件加速图像算法
机器学习: 核心组合 cublas.lib + curand.lib，配合深度学习框架
系统监控: 单独使用 nvml.lib，轻量级获取 GPU 状态信息
独立部署: 全部使用静态库（如 cudart_static.lib），避免 DLL 依赖问题
多程序共享: 采用动态库，减少整体内存占用

2.3 CMake 链接配置示例

2.3.1 基本 CUDA 应用配置

# 基本CUDA应用
target_link_libraries(my_app PRIVATE 
    CUDA::cudart_static
    CUDA::cublas_static)

2.3.2 图像处理应用配置

# 图像处理应用
target_link_libraries(my_app PRIVATE 
    CUDA::cudart_static
    CUDA::npp_static
    CUDA::cufft_static)

2.3.3 科学计算应用配置

# 科学计算应用
target_link_libraries(my_app PRIVATE 
    CUDA::cudart
    CUDA::cusolver
    CUDA::cusparse
    CUDA::cublas)

第三章 CUDA 内存模型与优化

3.1 内存类型与访问效率

CUDA 内存系统采用分层设计，不同内存类型的访问速度差异巨大：

内存类型	分配方式	访问速度	作用范围	典型用途
寄存器	编译器自动分配	最快（~1ns）	单个线程	线程私有变量、临时计算
共享内存	`__shared__`关键字	很快（~10ns）	线程块内共享	线程间数据交换、缓存
常量内存	`__constant__`关键字	较快（~50ns）	所有线程共享	只读常量参数、滤波器系数
全局内存	`cudaMalloc`	较慢（~200-400ns）	整个设备	大规模数据存储
固定内存 (Pinned)	`cudaMallocHost`	中速	主机 - 设备共享	主机 - 设备数据传输缓存
统一内存	`cudaMallocManaged`	中速	主机 - 设备自动迁移	简化内存管理
页内存 (Pageable)	`malloc`/`new`	较慢	主机端	主机端临时数据

3.2 内存访问效率排序

从快到慢:
寄存器 > 共享内存 > 常量内存 > 固定内存(Pinned) > 统一内存 > 全局内存 > 页内存(Pageable)

3.3 内存优化关键策略

优先使用 Pinned 内存: 主机与设备间数据传输时，Pinned 内存比 Pageable 内存快 2-3 倍

// 错误示例：使用Pageable内存
float* h_data = new float[N];  // Pageable内存

// 正确示例：使用Pinned内存
float* h_pinned_data;
cudaMallocHost(&h_pinned_data, N * sizeof(float));  // Pinned内存

合理利用共享内存: 减少全局内存访问次数，尤其适合块内线程协作计算

__global__ void sharedMemoryExample(float* d_data, int N) {
    __shared__ float s_data[256];  // 共享内存
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;
    
    // 加载数据到共享内存（1次全局内存访问）
    s_data[tid] = d_data[idx];
    __syncthreads();  // 等待所有线程加载完成
    
    // 多次共享内存访问（无需全局内存）
    s_data[tid] = s_data[tid] * 2.0f + s_data[(tid+1)%256];
    
    // 写回全局内存（1次全局内存访问）
    d_data[idx] = s_data[tid];
}

统一内存使用场景: 适合数据在主机和设备间频繁交换的场景，简化代码但可能有性能开销

// 统一内存分配
float* d_managed_data;
cudaMallocManaged(&d_managed_data, N * sizeof(float));

// 主机端直接访问
for (int i = 0; i < N; i++) {
    d_managed_data[i] = i;  // 自动在主机内存中
}

// 设备端直接访问（自动迁移数据）
kernel<<<grid, block>>>(d_managed_data, N);

// 主机端再次访问（自动迁移回主机内存）
for (int i = 0; i < 10; i++) {
    printf("%f ", d_managed_data[i]);
}

常量内存优化: 适合存储所有线程共享的只读数据，如滤波器系数、模型参数

// 常量内存声明
__constant__ float filter[5] = {0.1f, 0.2f, 0.4f, 0.2f, 0.1f};

__global__ void convolutionKernel(float* d_input, float* d_output, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        float sum = 0.0f;
        // 访问常量内存（比全局内存快）
        for (int i = 0; i < 5; i++) {
            int pos = idx + i - 2;
            if (pos >= 0 && pos < N) {
                sum += d_input[pos] * filter[i];
            }
        }
        d_output[idx] = sum;
    }
}

第四章 cuBLAS 库详解与实战

4.1 cuBLAS 功能模块划分

cuBLAS 遵循标准 BLAS（Basic Linear Algebra Subprograms）规范，分为三个层级：

4.1.1 Level 1 BLAS（向量运算）

处理向量之间的基本运算，如加法、点积、范数计算等。

函数	功能	代码示例
`cublasSaxpy`	向量加法: y = αx + y	`cublasSaxpy(handle, n, &alpha, x, 1, y, 1);`
`cublasSdot`	向量点积: result = xᵀy	`cublasSdot(handle, n, x, 1, y, 1, &result);`
`cublasSnrm2`	向量 2 范数: result =x₂	`cublasSnrm2(handle, n, x, 1, &result);`
`cublasSscal`	向量缩放: x = αx	`cublasSscal(handle, n, &alpha, x, 1);`
`cublasScopy`	向量复制: y = x	`cublasScopy(handle, n, x, 1, y, 1);`
`cublasSswap`	向量交换: x ↔ y	`cublasSswap(handle, n, x, 1, y, 1);`
`cublasSasum`	向量元素绝对值和	`cublasSasum(handle, n, x, 1, &result);`

4.1.2 Level 2 BLAS（矩阵 - 向量运算）

处理矩阵与向量之间的运算，如矩阵 - 向量乘法、秩 - 1 更新等。

函数	功能	适用矩阵类型
`cublasSgemv`	通用矩阵 - 向量乘法: y = αA x + βy	任意矩阵
`cublasSsymv`	对称矩阵 - 向量乘法: y = αA x + βy	对称矩阵
`cublasStrmv`	三角矩阵 - 向量乘法: x = A x	三角矩阵
`cublasSger`	通用矩阵秩 - 1 更新: A = αx yᵀ + A	任意矩阵
`cublasSsyr`	对称矩阵秩 - 1 更新: A = αx xᵀ + A	对称矩阵

代码示例 - 通用矩阵 - 向量乘法:

// 计算 y = alpha*A*x + beta*y
cublasStatus_t status = cublasSgemv(
    handle,                  // cuBLAS句柄
    CUBLAS_OP_N,             // 矩阵A是否转置（N=不转置，T=转置）
    m,                       // 矩阵A的行数
    n,                       // 矩阵A的列数
    &alpha,                  // 标量alpha
    d_A,                     // 设备端矩阵A
    m,                       // 矩阵A的前导维度（列优先存储）
    d_x,                     // 设备端向量x
    1,                       // 向量x的步长
    &beta,                   // 标量beta
    d_y,                     // 设备端向量y（输入输出）
    1                        // 向量y的步长
);

4.1.3 Level 3 BLAS（矩阵 - 矩阵运算）

处理矩阵之间的运算，是计算密集型应用的核心，如矩阵乘法、批量矩阵运算等。

函数	功能	适用场景
`cublasSgemm`	通用矩阵乘法: C = αAB + βC	基本矩阵乘法
`cublasSgemmBatched`	批量通用矩阵乘法	深度学习中的批量处理
`cublasSsymm`	对称矩阵乘法: C = αAB + βC	对称矩阵优化计算
`cublasStrmm`	三角矩阵乘法: B = αAB	三角矩阵优化计算
`cublasStrsm`	三角矩阵求解: AX = αB	线性方程组求解

代码示例 - 通用矩阵乘法:

// 计算 C = alpha*A*B + beta*C
cublasStatus_t status = cublasSgemm(
    handle,                  // cuBLAS句柄
    CUBLAS_OP_N,             // 矩阵A是否转置
    CUBLAS_OP_N,             // 矩阵B是否转置
    m,                       // 矩阵C的行数（=A的行数）
    n,                       // 矩阵C的列数（=B的列数）
    k,                       // 矩阵A的列数（=B的行数）
    &alpha,                  // 标量alpha
    d_A,                     // 设备端矩阵A (m x k)
    m,                       // 矩阵A的前导维度
    d_B,                     // 设备端矩阵B (k x n)
    k,                       // 矩阵B的前导维度
    &beta,                   // 标量beta
    d_C,                     // 设备端矩阵C (m x n)（输入输出）
    m                        // 矩阵C的前导维度
);

4.2 cuBLAS 完整使用流程

cuBLAS 使用遵循固定流程：初始化句柄 → 数据准备 → 执行计算 → 结果处理 → 资源释放。

4.2.1 封装类实现

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>
#include <iostream>

class CublasProcessor {
private:
    cublasHandle_t handle;    // cuBLAS句柄
    bool isInitialized;       // 初始化状态标记

    // 检查CUDA错误
    void checkCudaError(cudaError_t error, const std::string& message) {
        if (error != cudaSuccess) {
            std::cerr << "[CUDA Error] " << message << ": " 
                      << cudaGetErrorString(error) << std::endl;
            throw std::runtime_error(message);
        }
    }

    // 检查cuBLAS错误
    void checkCublasError(cublasStatus_t error, const std::string& message) {
        if (error != CUBLAS_STATUS_SUCCESS) {
            std::cerr << "[cuBLAS Error] " << message << ": " 
                      << error << std::endl;
            throw std::runtime_error(message);
        }
    }

public:
    // 构造函数
    CublasProcessor() : isInitialized(false) {
        // 初始化cuBLAS句柄
        cublasStatus_t status = cublasCreate(&handle);
        checkCublasError(status, "Failed to create cuBLAS handle");
        isInitialized = true;
    }

    // 析构函数
    ~CublasProcessor() {
        if (isInitialized) {
            cublasDestroy(handle);
            isInitialized = false;
        }
    }

    // 向量点积计算
    float vectorDotProduct(const std::vector<float>& vecA, 
                          const std::vector<float>& vecB) {
        if (!isInitialized) {
            throw std::runtime_error("cuBLAS not initialized");
        }

        int n = vecA.size();
        if (n != vecB.size()) {
            throw std::invalid_argument("Vector sizes do not match");
        }

        float* d_A = nullptr;
        float* d_B = nullptr;
        float result = 0.0f;

        try {
            // 1. 分配设备内存
            checkCudaError(cudaMalloc(&d_A, n * sizeof(float)), 
                          "Failed to allocate device memory for vecA");
            checkCudaError(cudaMalloc(&d_B, n * sizeof(float)), 
                          "Failed to allocate device memory for vecB");

            // 2. 复制数据到设备
            checkCudaError(cudaMemcpy(d_A, vecA.data(), n * sizeof(float), 
                                     cudaMemcpyHostToDevice), 
                          "Failed to copy vecA to device");
            checkCudaError(cudaMemcpy(d_B, vecB.data(), n * sizeof(float), 
                                     cudaMemcpyHostToDevice), 
                          "Failed to copy vecB to device");

            // 3. 执行cuBLAS点积运算
            cublasStatus_t status = cublasSdot(handle, n, d_A, 1, d_B, 1, &result);
            checkCublasError(status, "Failed to compute dot product");

        } catch (...) {
            // 异常处理：释放已分配的内存
            if (d_A != nullptr) cudaFree(d_A);
            if (d_B != nullptr) cudaFree(d_B);
            throw;
        }

        // 4. 释放设备内存
        checkCudaError(cudaFree(d_A), "Failed to free device memory for vecA");
        checkCudaError(cudaFree(d_B), "Failed to free device memory for vecB");

        return result;
    }

    // 矩阵乘法计算 (C = A * B)
    void matrixMultiply(const std::vector<float>& matA,  // m x k
                       const std::vector<float>& matB,  // k x n
                       std::vector<float>& matC,        // 输出: m x n
                       int m, int k, int n) {
        if (!isInitialized) {
            throw std::runtime_error("cuBLAS not initialized");
        }

        // 检查矩阵维度一致性
        if (matA.size() != m * k || matB.size() != k * n) {
            throw std::invalid_argument("Matrix dimensions do not match");
        }
        matC.resize(m * n);  // 确保输出矩阵大小正确

        float* d_A = nullptr;
        float* d_B = nullptr;
        float* d_C = nullptr;
        const float alpha = 1.0f;
        const float beta = 0.0f;

        try {
            // 1. 分配设备内存
            checkCudaError(cudaMalloc(&d_A, m * k * sizeof(float)), 
                          "Failed to allocate device memory for matA");
            checkCudaError(cudaMalloc(&d_B, k * n * sizeof(float)), 
                          "Failed to allocate device memory for matB");
            checkCudaError(cudaMalloc(&d_C, m * n * sizeof(float)), 
                          "Failed to allocate device memory for matC");

            // 2. 复制数据到设备
            checkCudaError(cudaMemcpy(d_A, matA.data(), m * k * sizeof(float), 
                                     cudaMemcpyHostToDevice), 
                          "Failed to copy matA to device");
            checkCudaError(cudaMemcpy(d_B, matB.data(), k * n * sizeof(float), 
                                     cudaMemcpyHostToDevice), 
                          "Failed to copy matB to device");

            // 3. 执行矩阵乘法: C = alpha*A*B + beta*C
            cublasStatus_t status = cublasSgemm(
                handle,
                CUBLAS_OP_N,  // A不转置
                CUBLAS_OP_N,  // B不转置
                m,            // C的行数
                n,            // C的列数
                k,            // A的列数 = B的行数
                &alpha,
                d_A, m,       // A及其前导维度
                d_B, k,       // B及其前导维度
                &beta,
                d_C, m        // C及其前导维度
            );
            checkCublasError(status, "Failed to compute matrix multiplication");

            // 4. 复制结果回主机
            checkCudaError(cudaMemcpy(matC.data(), d_C, m * n * sizeof(float), 
                                     cudaMemcpyDeviceToHost), 
                          "Failed to copy matC to host");

        } catch (...) {
            // 异常处理：释放已分配的内存
            if (d_A != nullptr) cudaFree(d_A);
            if (d_B != nullptr) cudaFree(d_B);
            if (d_C != nullptr) cudaFree(d_C);
            throw;
        }

        // 5. 释放设备内存
        checkCudaError(cudaFree(d_A), "Failed to free device memory for matA");
        checkCudaError(cudaFree(d_B), "Failed to free device memory for matB");
        checkCudaError(cudaFree(d_C), "Failed to free device memory for matC");
    }

    // 向量加法 (y = alpha*x + y)
    void vectorAdd(float alpha, 
                  const std::vector<float>& vecX, 
                  std::vector<float>& vecY) {
        if (!isInitialized) {
            throw std::runtime_error("cuBLAS not initialized");
        }

        int n = vecX.size();
        if (n != vecY.size()) {
            throw std::invalid_argument("Vector sizes do not match");
        }

        float* d_X = nullptr;
        float* d_Y = nullptr;

        try {
            // 1. 分配设备内存
            checkCudaError(cudaMalloc(&d_X, n * sizeof(float)), 
                          "Failed to allocate device memory for vecX");
            checkCudaError(cudaMalloc(&d_Y, n * sizeof(float)), 
                          "Failed to allocate device memory for vecY");

            // 2. 复制数据到设备
            checkCudaError(cudaMemcpy(d_X, vecX.data(), n * sizeof(float), 
                                     cudaMemcpyHostToDevice), 
                          "Failed to copy vecX to device");
            checkCudaError(cudaMemcpy(d_Y, vecY.data(), n * sizeof(float), 
                                     cudaMemcpyHostToDevice), 
                          "Failed to copy vecY to device");

            // 3. 执行向量加法
            cublasStatus_t status = cublasSaxpy(
                handle, n, &alpha, d_X, 1, d_Y, 1
            );
            checkCublasError(status, "Failed to compute vector addition");

            // 4. 复制结果回主机
            checkCudaError(cudaMemcpy(vecY.data(), d_Y, n * sizeof(float), 
                                     cudaMemcpyDeviceToHost), 
                          "Failed to copy vecY to host");

        } catch (...) {
            if (d_X != nullptr) cudaFree(d_X);
            if (d_Y != nullptr) cudaFree(d_Y);
            throw;
        }

        // 5. 释放设备内存
        checkCudaError(cudaFree(d_X), "Failed to free device memory for vecX");
        checkCudaError(cudaFree(d_Y), "Failed to free device memory for vecY");
    }
};

// 使用示例
int main() {
    try {
        CublasProcessor cublasProc;

        // 1. 测试向量点积
        std::vector<float> vecA = {1.0f, 2.0f, 3.0f, 4.0f};
        std::vector<float> vecB = {5.0f, 6.0f, 7.0f, 8.0f};
        float dotProduct = cublasProc.vectorDotProduct(vecA, vecB);
        std::cout << "Vector Dot Product: " << dotProduct << std::endl;  // 预期: 70.0

        // 2. 测试矩阵乘法 (2x3 * 3x2 = 2x2)
        std::vector<float> matA = {1, 2, 3, 4, 5, 6};  // 2行3列
        std::vector<float> matB = {7, 8, 9, 10, 11, 12};  // 3行2列
        std::vector<float> matC;
        cublasProc.matrixMultiply(matA, matB, matC, 2, 3, 2);
        
        std::cout << "Matrix Multiplication Result:" << std::endl;
        std::cout << matC[0] << " " << matC[1] << std::endl;  // 58 64
        std::cout << matC[2] << " " << matC[3] << std::endl;  // 139 154

        // 3. 测试向量加法
        std::vector<float> vecX = {1.0f, 2.0f, 3.0f};
        std::vector<float> vecY = {4.0f, 5.0f, 6.0f};
        cublasProc.vectorAdd(2.0f, vecX, vecY);  // y = 2*x + y
        
        std::cout << "Vector Addition Result: ";
        for (float val : vecY) {
            std::cout << val << " ";  // 预期: 6 9 12
        }
        std::cout << std::endl;

    } catch (const std::exception& e) {
        std::cerr << "Error: " << e.what() << std::endl;
        return 1;
    }
    return 0;
}

4.2.2 异步计算与流处理

使用CUDA流可以实现计算与数据传输的并行，提高GPU利用率：

#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <vector>

class AsyncCublasProcessor {
private:
    cublasHandle_t handle;
    cudaStream_t stream;

public:
    AsyncCublasProcessor() {
        // 创建cuBLAS句柄
        cublasCreate(&handle);
        // 创建CUDA流
        cudaStreamCreate(&stream);
        // 将流与cuBLAS句柄关联
        cublasSetStream(handle, stream);
    }

    ~AsyncCublasProcessor() {
        cublasDestroy(handle);
        cudaStreamDestroy(stream);
    }

    // 异步向量加法
    void asyncVectorAdd(const std::vector<float>& h_X, 
                       std::vector<float>& h_Y, 
                       int n) {
        float *d_X, *d_Y;
        const float alpha = 1.0f;

        // 异步分配设备内存
        cudaMallocAsync(&d_X, n * sizeof(float), stream);
        cudaMallocAsync(&d_Y, n * sizeof(float), stream);

        // 异步复制数据到设备
        cudaMemcpyAsync(d_X, h_X.data(), n * sizeof(float),
                       cudaMemcpyHostToDevice, stream);
        cudaMemcpyAsync(d_Y, h_Y.data(), n * sizeof(float),
                       cudaMemcpyHostToDevice, stream);

        // 异步执行向量加法
        cublasSaxpy(handle, n, &alpha, d_X, 1, d_Y, 1);

        // 异步复制结果回主机
        cudaMemcpyAsync(h_Y.data(), d_Y, n * sizeof(float),
                       cudaMemcpyDeviceToHost, stream);

        // 等待流中所有操作完成
        cudaStreamSynchronize(stream);

        // 异步释放内存
        cudaFreeAsync(d_X, stream);
        cudaFreeAsync(d_Y, stream);
    }
};

4.3 数据类型支持

cuBLAS 支持多种数据类型，函数名后缀标识不同类型：

数据类型	函数名后缀	说明	示例函数
单精度浮点	S	32 位浮点数	`cublasSgemm`, `cublasSaxpy`
双精度浮点	D	64 位浮点数	`cublasDgemm`, `cublasDaxpy`
单精度复数	C	64 位复数（2 个 32 位浮点数）	`cublasCgemm`, `cublasCaxpy`
双精度复数	Z	128 位复数（2 个 64 位浮点数）	`cublasZgemm`, `cublasZaxpy`

4.4 cuBLAS 优势总结

性能卓越：由 NVIDIA 深度优化，通常比手工实现的 CUDA 内核快 2-10 倍
标准化接口：遵循 BLAS 标准，便于现有代码迁移
功能丰富：覆盖线性代数几乎所有常用运算
多 GPU 支持：通过 cuBLASXt 支持多 GPU 分布式计算
持续优化：随 CUDA 版本更新不断提升性能和功能

第五章 CUDA 排序解决方案

5.1 CUB 库排序（高性能首选）

CUB（CUDA Unbound Thread Block）是 NVIDIA 提供的高性能 CUDA 原语库，包含高度优化的排序算法。

5.1.1 单键排序

#include <cub/cub.cuh>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <iostream>

void cubSingleKeySortExample() {
    const int N = 1000000;  // 100万元素
    
    // 1. 创建并初始化随机数据
    thrust::host_vector<int> h_data(N);
    for (int i = 0; i < N; ++i) {
        h_data[i] = rand() % 1000000;
    }
    thrust::device_vector<int> d_data = h_data;  // 复制到设备
    
    // 2. 查询临时存储大小
    size_t temp_storage_bytes = 0;
    cub::DeviceRadixSort::SortKeys(
        nullptr,               // 临时存储指针（空表示查询）
        temp_storage_bytes,    // 输出所需临时存储大小
        d_data.data().get(),   // 输入键
        d_data.data().get(),   // 输出键（可以与输入相同）
        N                      // 元素数量
    );
    
    // 3. 分配临时存储
    thrust::device_vector<uint8_t> d_temp_storage(temp_storage_bytes);
    
    // 4. 执行排序
    cub::DeviceRadixSort::SortKeys(
        d_temp_storage.data().get(),  // 临时存储指针
        temp_storage_bytes,           // 临时存储大小
        d_data.data().get(),          // 输入键
        d_data.data().get(),          // 输出键
        N                             // 元素数量
    );
    
    // 5. 验证结果
    thrust::host_vector<int> h_result = d_data;
    bool is_sorted = true;
    for (int i = 1; i < N; ++i) {
        if (h_result[i] < h_result[i-1]) {
            is_sorted = false;
            break;
        }
    }
    std::cout << "CUB single key sort " << (is_sorted ? "succeeded" : "failed") << std::endl;
}

5.1.2 键值对排序

void cubKeyValueSortExample() {
    const int N = 1000000;
    
    // 1. 初始化键和值
    thrust::device_vector<int> d_keys(N);
    thrust::device_vector<int> d_values(N);
    
    thrust::host_vector<int> h_keys(N), h_values(N);
    for (int i = 0; i < N; ++i) {
        h_keys[i] = rand() % 1000000;
        h_values[i] = i;  // 值为原始索引
    }
    d_keys = h_keys;
    d_values = h_values;
    
    // 2. 查询临时存储大小
    size_t temp_storage_bytes = 0;
    cub::DeviceRadixSort::SortPairs(
        nullptr,
        temp_storage_bytes,
        d_keys.data().get(),
        d_keys.data().get(),  // 输出键
        d_values.data().get(),
        d_values.data().get(),  // 输出值
        N
    );
    
    // 3. 分配临时存储
    thrust::device_vector<uint8_t> d_temp_storage(temp_storage_bytes);
    
    // 4. 执行键值对排序
    cub::DeviceRadixSort::SortPairs(
        d_temp_storage.data().get(),
        temp_storage_bytes,
        d_keys.data().get(),
        d_keys.data().get(),
        d_values.data().get(),
        d_values.data().get(),
        N
    );
    
    // 5. 验证结果
    thrust::host_vector<int> h_sorted_keys = d_keys;
    thrust::host_vector<int> h_sorted_values = d_values;
    
    bool is_sorted = true;
    for (int i = 1; i < N; ++i) {
        if (h_sorted_keys[i] < h_sorted_keys[i-1]) {
            is_sorted = false;
            break;
        }
    }
    std::cout << "CUB key-value sort " << (is_sorted ? "succeeded" : "failed") << std::endl;
}

5.1.3 分段排序

对数组中多个独立段分别进行排序：

void cubSegmentedSortExample() {
    // 示例数据: [5, 2, 8, | 3, 1, 9, 4, | 7, 6]
    // 段划分:    [0, 0, 0, | 1, 1, 1, 1, | 2, 2]
    
    const int num_items = 9;
    const int num_segments = 3;
    
    // 1. 初始化数据和段偏移
    thrust::device_vector<int> d_keys = {5, 2, 8, 3, 1, 9, 4, 7, 6};
    thrust::device_vector<int> d_values = {0, 1, 2, 3, 4, 5, 6, 7, 8};
    thrust::device_vector<int> d_offsets = {0, 3, 7, 9};  // 段边界
    
    // 2. 查询临时存储大小
    size_t temp_storage_bytes = 0;
    cub::DeviceSegmentedRadixSort::SortPairs(
        nullptr,
        temp_storage_bytes,
        d_keys.data().get(),
        d_keys.data().get(),
        d_values.data().get(),
        d_values.data().get(),
        num_items,
        num_segments,
        d_offsets.data().get(),
        d_offsets.data().get() + 1  // 段结束偏移
    );
    
    // 3. 分配临时存储
    thrust::device_vector<uint8_t> d_temp_storage(temp_storage_bytes);
    
    // 4. 执行分段排序
    cub::DeviceSegmentedRadixSort::SortPairs(
        d_temp_storage.data().get(),
        temp_storage_bytes,
        d_keys.data().get(),
        d_keys.data().get(),
        d_values.data().get(),
        d_values.data().get(),
        num_items,
        num_segments,
        d_offsets.data().get(),
        d_offsets.data().get() + 1
    );
    
    // 5. 输出结果
    thrust::host_vector<int> h_result_keys = d_keys;
    std::cout << "Segmented sort result: ";
    for (int key : h_result_keys) {
        std::cout << key << " ";  // 预期: 2 5 8 1 3 4 9 6 7
    }
    std::cout << std::endl;
}

5.2 Thrust 库排序（易用性首选）

Thrust 是 CUDA 的标准模板库，提供类似 STL 的接口，底层可能使用 CUB 实现。

5.2.1 基本排序操作

#include <thrust/sort.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/random.h>

void thrustSortBasics() {
    const int N = 1000000;
    
    // 1. 生成随机数据
    thrust::device_vector<int> d_data(N);
    thrust::host_vector<int> h_data(N);
    
    thrust::default_random_engine rng;
    thrust::uniform_int_distribution<int> dist(1, 1000000);
    for (int i = 0; i < N; ++i) {
        h_data[i] = dist(rng);
    }
    d_data = h_data;
    
    // 2. 升序排序（默认）
    thrust::sort(d_data.begin(), d_data.end());
    
    // 3. 降序排序
    thrust::sort(d_data.begin(), d_data.end(), thrust::greater<int>());
    
    // 4. 自定义比较函数（按绝对值排序）
    auto abs_compare = [] __host__ __device__(int a, int b) {
        return abs(a) < abs(b);
    };
    thrust::sort(d_data.begin(), d_data.end(), abs_compare);
}

5.2.2 键值对排序

void thrustKeyValueSort() {
    const int N = 100000;
    
    // 1. 创建键和值
    thrust::device_vector<float> keys(N);
    thrust::device_vector<int> values(N);
    
    // 2. 初始化数据
    thrust::sequence(keys.begin(), keys.end());
    thrust::shuffle(keys.begin(), keys.end(), thrust::default_random_engine());
    thrust::sequence(values.begin(), values.end());
    
    // 3. 按键排序，值跟随键移动
    thrust::sort_by_key(keys.begin(), keys.end(), values.begin());
    
    // 4. 降序键值对排序
    thrust::sort_by_key(keys.begin(), keys.end(), values.begin(), 
                       thrust::greater<float>());
}

5.3 排序性能比较

#include <chrono>
#include <iostream>

class SortingBenchmark {
public:
    static void compareSortingMethods(int N) {
        // 生成随机数据
        thrust::host_vector<int> h_data(N);
        for (int i = 0; i < N; ++i) {
            h_data[i] = rand() % N;
        }
        
        // 1. Thrust排序性能测试
        thrust::device_vector<int> d_thrust = h_data;
        auto start = std::chrono::high_resolution_clock::now();
        thrust::sort(d_thrust.begin(), d_thrust.end());
        auto end = std::chrono::high_resolution_clock::now();
        auto thrust_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
        std::cout << "Thrust sort (" << N << " elements): " 
                  << thrust_time.count() << " ms" << std::endl;
        
        // 2. CUB排序性能测试
        thrust::device_vector<int> d_cub = h_data;
        size_t temp_storage_bytes = 0;
        cub::DeviceRadixSort::SortKeys(nullptr, temp_storage_bytes, 
                                      d_cub.data().get(), d_cub.data().get(), N);
        thrust::device_vector<uint8_t> d_temp(temp_storage_bytes);
        
        start = std::chrono::high_resolution_clock::now();
        cub::DeviceRadixSort::SortKeys(d_temp.data().get(), temp_storage_bytes,
                                      d_cub.data().get(), d_cub.data().get(), N);
        cudaDeviceSynchronize();
        end = std::chrono::high_resolution_clock::now();
        auto cub_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
        std::cout << "CUB sort (" << N << " elements): " 
                  << cub_time.count() << " ms" << std::endl;
    }
};

// 使用示例
int main() {
    SortingBenchmark::compareSortingMethods(1000000);    // 100万元素
    SortingBenchmark::compareSortingMethods(10000000);   // 1000万元素
    return 0;
}

5.4 排序库选择建议

场景	推荐库	理由
简单排序需求，代码简洁优先	Thrust	接口简单，类似 STL，学习成本低
高性能要求，大规模数据	CUB	手工优化的底层实现，性能最优
分段排序、复杂排序场景	CUB	提供丰富的分段排序功能
跨平台兼容性需求	Thrust	可在 CPU 和 GPU 上运行相同代码
与 STL 代码集成	Thrust	接口设计与 STL 兼容，易于集成

第六章 Thrust 库完全指南

Thrust 是 CUDA 的并行算法库，提供类似 C++ STL 的接口，可在 GPU 和 CPU 上执行高性能并行计算。

6.1 基础容器

Thrust 提供三种主要容器类型，用于管理不同内存空间的数据：

6.1.1 Host Vector（主机向量）

存储在主机内存中，类似标准 vector：

#include <thrust/host_vector.h>
#include <iostream>

void hostVectorExample() {
    // 创建主机向量
    thrust::host_vector<int> h_vec(10);             // 大小为10的向量
    thrust::host_vector<int> h_vec2(5, 42);         // 5个元素，值均为42
    thrust::host_vector<int> h_vec3(h_vec2);        // 拷贝构造
    
    // 元素访问
    h_vec[0] = 1;                                   // 下标访问
    h_vec.at(1) = 2;                                // 带边界检查的访问
    h_vec.push_back(3);                              // 添加元素
    h_vec.pop_back();                                // 删除最后一个元素
    
    // 迭代器访问
    std::cout << "Host vector elements: ";
    for (thrust::host_vector<int>::iterator it = h_vec.begin(); 
         it != h_vec.end(); ++it) {
        std::cout << *it << " ";
    }
    std::cout << std::endl;
}

6.1.2 Device Vector（设备向量）

存储在 GPU 设备内存中，自动管理设备内存：

#include <thrust/device_vector.h>

void deviceVectorExample() {
    // 创建设备向量
    thrust::device_vector<int> d_vec(10);           // 大小为10的设备向量
    thrust::device_vector<int> d_vec2(d_vec);       // 从设备向量拷贝
    
    // 数据传输
    thrust::host_vector<int> h_vec(10, 1);
    d_vec = h_vec;                                   // 主机到设备（隐式拷贝）
    h_vec = d_vec;                                   // 设备到主机（隐式拷贝）
    
    // 获取原始指针（用于传递给CUDA内核）
    int* raw_ptr = thrust::raw_pointer_cast(d_vec.data());
    
    // 设备向量初始化
    thrust::fill(d_vec.begin(), d_vec.end(), 42);   // 填充值
}

6.1.3 Universal Vector（通用向量）

可在主机和设备上透明访问，根据上下文自动选择内存空间：

#include <thrust/universal_vector.h>

void universalVectorExample() {
    thrust::universal_vector<int> u_vec(10);        // 通用向量
    
    // 主机端操作
    for (int i = 0; i < 10; ++i) {
        u_vec[i] = i;                               // 主机内存中
    }
    
    // 设备端操作（通过Thrust算法自动迁移数据）
    thrust::transform(u_vec.begin(), u_vec.end(), u_vec.begin(),
                     [] __host__ __device__(int x) { return x * 2; });
    
    // 主机端再次访问（数据自动迁回）
    std::cout << "Universal vector after transform: ";
    for (int val : u_vec) {
        std::cout << val << " ";
    }
    std::cout << std::endl;
}

6.2 生成与初始化函数

Thrust 提供多种便捷函数用于数据初始化和生成：

#include <thrust/sequence.h>
#include <thrust/fill.h>
#include <thrust/generate.h>
#include <thrust/random.h>

void dataGenerationExamples() {
    thrust::device_vector<int> d_vec(10);
    
    // 1. 生成连续序列
    thrust::sequence(d_vec.begin(), d_vec.end());          // 0, 1, 2, ..., 9
    thrust::sequence(d_vec.begin(), d_vec.end(), 5);       // 5, 6, 7, ..., 14
    thrust::sequence(d_vec.begin(), d_vec.end(), 1, 2);    // 1, 3, 5, ..., 17
    
    // 2. 填充常量
    thrust::fill(d_vec.begin(), d_vec.end(), 42);          // 全部填充为42
    thrust::fill_n(d_vec.begin(), 5, 100);                 // 前5个元素填充为100
    
    // 3. 生成随机数
    thrust::default_random_engine rng;
    thrust::uniform_int_distribution<int> dist(1, 100);
    auto rand_gen = [&rng, &dist]() mutable { return dist(rng); };
    thrust::generate(d_vec.begin(), d_vec.end(), rand_gen);
    
    // 4. 自定义生成器
    struct SquareGenerator {
        __host__ __device__ int operator()(int index) const {
            return index * index;
        }
    };
    thrust::transform(thrust::counting_iterator<int>(0),
                     thrust::counting_iterator<int>(10),
                     d_vec.begin(),
                     SquareGenerator());
}

6.3 排序与搜索算法

Thrust 提供全面的排序和搜索功能，接口与 STL 兼容：

#include <thrust/sort.h>
#include <thrust/binary_search.h>
#include <thrust/find.h>

void sortingAndSearchingExamples() {
    thrust::device_vector<int> d_vec = {5, 2, 8, 1, 9, 3, 7, 4, 6};
    
    // 1. 排序操作
    thrust::sort(d_vec.begin(), d_vec.end());                   // 升序排序
    thrust::sort(d_vec.begin(), d_vec.end(), thrust::greater<int>());  // 降序排序
    
    // 2. 键值对排序
    thrust::device_vector<int> keys = {3, 1, 4, 1, 5};
    thrust::device_vector<int> values = {0, 1, 2, 3, 4};
    thrust::sort_by_key(keys.begin(), keys.end(), values.begin());
    
    // 3. 查找操作
    auto it = thrust::find(d_vec.begin(), d_vec.end(), 4);      // 线性查找
    int index = it - d_vec.begin();                             // 找到元素的索引
    
    // 4. 计数
    int count = thrust::count(d_vec.begin(), d_vec.end(), 2);   // 计数等于2的元素
    int count_gt5 = thrust::count_if(d_vec.begin(), d_vec.end(),
                                    [] __host__ __device__(int x) {
                                        return x > 5;
                                    });
    
    // 5. 二分搜索（要求已排序）
    thrust::sort(d_vec.begin(), d_vec.end());
    bool found = thrust::binary_search(d_vec.begin(), d_vec.end(), 5);
    auto lower = thrust::lower_bound(d_vec.begin(), d_vec.end(), 5);  // 首个不小于5的元素
}

6.4 数值算法

Thrust 提供丰富的并行数值算法，如归约、扫描等：

#include <thrust/reduce.h>
#include <thrust/transform_reduce.h>
#include <thrust/scan.h>

void numericalAlgorithmsExamples() {
    thrust::device_vector<int> d_vec = {1, 2, 3, 4, 5};
    
    // 1. 归约操作（求和）
    int sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0);
    
    // 2. 自定义归约（求积）
    int product = thrust::reduce(d_vec.begin(), d_vec.end(), 1,
                                thrust::multiplies<int>());
    
    // 3. 变换归约（平方和）
    int sum_of_squares = thrust::transform_reduce(
        d_vec.begin(), d_vec.end(),
        [] __host__ __device__(int x) { return x * x; },  // 变换函数
        0,                                                // 初始值
        thrust::plus<int>()                               // 归约操作
    );
    
    // 4. 扫描操作（前缀和）
    thrust::device_vector<int> inclusive_scan(5);
    thrust::inclusive_scan(d_vec.begin(), d_vec.end(),
                          inclusive_scan.begin());  // [1, 3, 6, 10, 15]
    
    thrust::device_vector<int> exclusive_scan(5);
    thrust::exclusive_scan(d_vec.begin(), d_vec.end(),
                          exclusive_scan.begin(), 0);  // [0, 1, 3, 6, 10]
}

6.5 变换与映射

Thrust 提供多种变换算法，用于元素级操作：

#include <thrust/transform.h>
#include <thrust/functional.h>

void transformExamples() {
    thrust::device_vector<int> vec1(5, 1);
    thrust::device_vector<int> vec2(5, 2);
    thrust::device_vector<int> result(5);
    
    // 1. 一元变换（每个元素乘以2）
    thrust::transform(vec1.begin(), vec1.end(), result.begin(),
                     [] __host__ __device__(int x) { return x * 2; });
    
    // 2. 二元变换（两个向量相加）
    thrust::transform(vec1.begin(), vec1.end(),
                     vec2.begin(),
                     result.begin(),
                     thrust::plus<int>());
    
    // 3. 原地变换
    thrust::transform(result.begin(), result.end(), result.begin(),
                     [] __host__ __device__(int x) { return x + 1; });
    
    // 4. 条件变换
    thrust::device_vector<int> input = {1, -2, 3, -4, 5};
    thrust::device_vector<int> output(5);
    
    thrust::transform(input.begin(), input.end(), output.begin(),
                     [] __host__ __device__(int x) {
                         return x > 0 ? x : -x;  // 取绝对值
                     });
}

6.6 集合操作

Thrust 提供标准集合运算，如并集、交集等：

#include <thrust/set_operations.h>
#include <thrust/merge.h>

void setOperationsExamples() {
    // 集合操作要求输入已排序
    thrust::device_vector<int> set1 = {1, 3, 5, 7, 9};
    thrust::device_vector<int> set2 = {2, 3, 5, 8, 9};
    
    // 1. 集合并集
    thrust::device_vector<int> union_result(set1.size() + set2.size());
    auto union_end = thrust::set_union(set1.begin(), set1.end(),
                                      set2.begin(), set2.end(),
                                      union_result.begin());
    
    // 2. 集合交集
    thrust::device_vector<int> intersection_result(std::min(set1.size(), set2.size()));
    auto intersection_end = thrust::set_intersection(set1.begin(), set1.end(),
                                                    set2.begin(), set2.end(),
                                                    intersection_result.begin());
    
    // 3. 集合差集
    thrust::device_vector<int> difference_result(set1.size());
    auto difference_end = thrust::set_difference(set1.begin(), set1.end(),
                                                set2.begin(), set2.end(),
                                                difference_result.begin());
    
    // 4. 合并两个已排序序列
    thrust::device_vector<int> merged_result(set1.size() + set2.size());
    thrust::merge(set1.begin(), set1.end(),
                 set2.begin(), set2.end(),
                 merged_result.begin());
}

6.7 完整示例程序

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/transform.h>
#include <thrust/sequence.h>
#include <thrust/random.h>
#include <iostream>

void thrustCompleteExample() {
    std::cout << "=== Thrust Library Complete Example ===" << std::endl;
    
    // 1. 创建并初始化设备向量
    const int N = 10;
    thrust::device_vector<int> d_vec(N);
    thrust::sequence(d_vec.begin(), d_vec.end(), 1);  // 1, 2, ..., 10
    
    std::cout << "Initial sequence: ";
    for (int i = 0; i < N; ++i) {
        std::cout << d_vec[i] << " ";
    }
    std::cout << std::endl;
    
    // 2. 变换操作：每个元素乘以2
    thrust::transform(d_vec.begin(), d_vec.end(), d_vec.begin(),
                     [] __host__ __device__(int x) { return x * 2; });
    
    std::cout << "After doubling: ";
    for (int i = 0; i < N; ++i) {
        std::cout << d_vec[i] << " ";
    }
    std::cout << std::endl;
    
    // 3. 归约操作：求和
    int sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0);
    std::cout << "Sum: " << sum << std::endl;
    
    // 4. 排序操作
    thrust::sort(d_vec.begin(), d_vec.end(), thrust::greater<int>());
    
    std::cout << "After sorting in descending order: ";
    for (int i = 0; i < N; ++i) {
        std::cout << d_vec[i] << " ";
    }
    std::cout << std::endl;
    
    // 5. 扫描操作：前缀积
    thrust::device_vector<int> scan_result(N);
    thrust::inclusive_scan(d_vec.begin(), d_vec.end(), scan_result.begin(),
                          thrust::multiplies<int>());
    
    std::cout << "Inclusive product scan: ";
    for (int i = 0; i < N; ++i) {
        std::cout << scan_result[i] << " ";
    }
    std::cout << std::endl;
}

int main() {
    thrustCompleteExample();
    return 0;
}

第七章 CUDA 关键字详解

CUDA 提供多个特殊关键字来定义函数、变量和内存的执行位置和访问方式，是 CUDA 编程的基础。

7.1 函数执行空间限定符

这些关键字指定函数在主机还是设备上执行，以及调用规则。

7.1.1 `global`（全局函数 / 内核函数）

特性：在主机代码中调用，在设备上执行
地位：作为 CUDA 内核的入口点
限制：不能有返回值，必须返回 void
调用方式：使用特殊语法 kernel<<<grid, block>>>(args)

#include <cuda_runtime.h>
#include <iostream>

// 向量加法内核
__global__ void vectorAdd(const float* A, const float* B, float* C, int N) {
    // 计算全局线程索引
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    
    // 边界检查
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    const int N = 1024;
    size_t size = N * sizeof(float);
    
    // 分配主机内存
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);
    
    // 分配设备内存
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);
    
    // 初始化主机数据
    for (int i = 0; i < N; i++) {
        h_A[i] = i;
        h_B[i] = i * 2;
    }
    
    // 复制数据到设备
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    
    // 启动内核
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    
    // 复制结果回主机
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    
    // 验证结果
    for (int i = 0; i < 10; i++) {
        std::cout << h_A[i] << " + " << h_B[i] << " = " << h_C[i] << std::endl;
    }
    
    // 清理资源
    free(h_A); free(h_B); free(h_C);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    
    return 0;
}

7.1.2 `device`（设备函数）

特性：只能在设备上调用和执行
用途：供其他设备函数或全局函数调用
限制：不能从主机代码直接调用

// 设备函数：向量加法辅助函数
__device__ float deviceAdd(float a, float b) {
    return a + b;
}

// 设备函数：向量乘法辅助函数
__device__ float deviceMultiply(float a, float b) {
    return a * b;
}

// 内核函数使用设备函数
__global__ void complexKernel(float* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        // 调用设备函数
        float temp = deviceAdd(data[idx], 1.0f);
        data[idx] = deviceMultiply(temp, 2.0f);
    }
}

7.1.3 `host`（主机函数）

特性：只能在主机上执行
用途：普通的 CPU 函数
组合使用：可与__device__组合，创建同时在主机和设备上编译的函数

// 仅主机函数
__host__ void hostOnlyFunction(int* data, int size) {
    for (int i = 0; i < size; i++) {
        data[i] = i * i;
    }
}

// 同时在主机和设备上可用的函数
__host__ __device__ int square(int x) {
    return x * x;
}

// 内核中使用host+device函数
__global__ void kernelUsingSharedFunction(int* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        data[idx] = square(data[idx]);  // 设备上调用
    }
}

int main() {
    int x = 5;
    int result = square(x);  // 主机上调用
    printf("Square of %d is %d\n", x, result);  // 输出: 25
    return 0;
}

7.2 变量内存空间限定符

这些关键字指定变量存储的内存类型，影响访问速度和作用范围。

7.2.1 `shared`（共享内存变量）

特性：在线程块内所有线程间共享
生命周期：限于线程块执行期间
访问速度：比全局内存快得多（~10x）
用途：线程块内数据共享、全局内存访问缓存

// 静态共享内存声明（编译时确定大小）
__global__ void staticSharedMemoryKernel(float* input, float* output, int N) {
    __shared__ float sdata[256];  // 静态分配256个float
    
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    
    // 加载数据到共享内存
    sdata[tid] = (idx < N) ? input[idx] : 0.0f;
    __syncthreads();  // 同步所有线程，确保数据加载完成
    
    // 共享内存中的归约操作
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            sdata[tid] += sdata[tid + stride];
        }
        __syncthreads();  // 每次迭代后同步
    }
    
    // 线程0负责将块结果写入全局内存
    if (tid == 0) {
        output[blockIdx.x] = sdata[0];
    }
}

// 动态共享内存声明（运行时确定大小）
__global__ void dynamicSharedMemoryKernel(float* input, float* output, int N) {
    extern __shared__ float sdata[];  // 外部共享内存声明
    
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    
    // 加载数据到共享内存
    sdata[tid] = (idx < N) ? input[idx] : 0.0f;
    __syncthreads();
    
    // 共享内存计算（示例：简单加倍）
    sdata[tid] *= 2.0f;
    __syncthreads();
    
    // 写回全局内存
    if (idx < N) {
        output[idx] = sdata[tid];
    }
}

// 调用示例
int main() {
    const int N = 1024;
    const int blockSize = 256;
    
    // 动态共享内存需要在启动时指定大小
    dynamicSharedMemoryKernel<<<N/blockSize, blockSize, blockSize * sizeof(float)>>>(
        d_input, d_output, N);
    
    return 0;
}

7.2.2 `managed`（统一内存变量）

特性：自动在主机和设备间迁移
优势：简化内存管理，无需显式数据传输
用途：数据在 CPU 和 GPU 间频繁交换的场景

#include <cuda_runtime.h>
#include <iostream>

// 全局托管内存变量
__managed__ float managedArray[1000];

// 内核函数操作托管内存
__global__ void processManagedMemory() {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < 1000) {
        managedArray[idx] *= 2.0f;  // 设备端修改
    }
}

int main() {
    // 主机端初始化托管内存
    for (int i = 0; i < 1000; i++) {
        managedArray[i] = i;
    }
    
    // 启动内核处理数据
    processManagedMemory<<<(1000 + 255) / 256, 256>>>();
    cudaDeviceSynchronize();  // 等待内核完成
    
    // 主机端直接访问修改后的数据
    for (int i = 0; i < 10; i++) {
        std::cout << managedArray[i] << " ";  // 输出: 0 2 4 ... 18
    }
    std::cout << std::endl;
    
    // 动态分配托管内存
    float* dynamicManaged;
    cudaMallocManaged(&dynamicManaged, 1000 * sizeof(float));
    
    // 使用动态托管内存...
    
    // 释放动态托管内存
    cudaFree(dynamicManaged);
    
    return 0;
}

7.2.3 `constant`（常量内存变量）

特性：存储只读数据，所有线程共享
访问速度：比全局内存快（~5x）
用途：存储常量参数、滤波器系数等

// 常量内存声明
__constant__ float filterCoeffs[5] = {0.1f, 0.2f, 0.4f, 0.2f, 0.1f};

// 使用常量内存的内核
__global__ void convolutionKernel(float* input, float* output, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        float result = 0.0f;
        // 访问常量内存（比全局内存快）
        for (int i = 0; i < 5; i++) {
            int pos = idx + i - 2;
            if (pos >= 0 && pos < N) {
                result += input[pos] * filterCoeffs[i];
            }
        }
        output[idx] = result;
    }
}

int main() {
    const int N = 1024;
    float* d_input, *d_output;
    cudaMalloc(&d_input, N * sizeof(float));
    cudaMalloc(&d_output, N * sizeof(float));
    
    // 初始化输入数据...
    
    // 可选：更新常量内存（从主机）
    float newCoeffs[5] = {0.2f, 0.2f, 0.2f, 0.2f, 0.2f};
    cudaMemcpyToSymbol(filterCoeffs, newCoeffs, 5 * sizeof(float));
    
    // 启动内核
    convolutionKernel<<<(N + 255)/256, 256>>>(d_input, d_output, N);
    
    // 清理...
    return 0;
}

7.3 关键字使用规则总结

关键字	类型	作用范围	访问限制	主要用途
`__global__`	函数	设备执行，主机调用	无返回值，必须异步调用	内核入口点
`__device__`	函数	设备执行，设备调用	不能被主机直接调用	设备端辅助函数
`__host__`	函数	主机执行，主机调用	不能被设备调用	主机端函数
`__host__ __device__`	函数	主机和设备均可执行	分别在主机和设备调用	共享逻辑的函数
`__shared__`	变量	线程块内共享	读写，块内可见	线程块内数据交换
`__constant__`	变量	设备全局，所有线程	只读，设备可见	存储常量参数
`__managed__`	变量	主机和设备共享	读写，自动迁移	简化内存管理

正确使用这些关键字是 CUDA 编程的基础，能够帮助开发者充分利用 GPU 的并行计算能力，同时避免常见的内存访问错误和性能问题。

第八章 CUDA核函数内同步机制详解

CUDA核函数中的同步主要用于协调同一线程块内不同线程之间的执行，确保数据一致性并实现正确的并行计算。主要的同步机制包括：

8.1 Warp级同步

8.1.1 __syncwarp()

__syncwarp() 是CUDA 9.0引入的warp级同步原语，用于同步同一个warp中的所有线程。

// 示例：在warp内进行同步
__global__ void syncwarp_example() {
    int tid = threadIdx.x;
    
    // 执行一些计算
    if (tid % 32 < 16) {
        // 前16个线程执行某些操作
        // ...
    }
    
    // 同步整个warp
    __syncwarp();
    
    // 现在所有warp内的线程都已完成前面的操作
    // 继续执行后续计算
}

8.1.2 使用掩码的__syncwarp()

可以使用掩码来指定需要同步的线程子集：

__global__ void masked_syncwarp_example() {
    int tid = threadIdx.x;
    unsigned mask = 0x0000FFFF; // 只同步前16个线程
    
    // 执行一些计算
    // ...
    
    // 只同步掩码指定的线程
    __syncwarp(mask);
    
    // 掩码指定的线程已完成同步
}

8.2 线程块级同步

8.2.1 __syncthreads()

__syncthreads() 是最常用的线程块级同步原语，用于同步同一个线程块内的所有线程。

__global__ void syncthreads_example(float *data) {
    __shared__ float shared_data[256];
    int tid = threadIdx.x;
    
    // 每个线程将数据写入共享内存
    shared_data[tid] = data[tid];
    
    // 同步确保所有线程都完成了写入操作
    __syncthreads();
    
    // 现在可以安全地读取其他线程写入的数据
    data[tid] = shared_data[(tid + 1) % blockDim.x];
}

8.3 Cooperative Groups同步

CUDA 9.0引入了Cooperative Groups API，提供了更灵活的同步机制。

#include <cooperative_groups.h>
using namespace cooperative_groups;

__global__ void cooperative_groups_example() {
    thread_block block = this_thread_block();
    
    // 执行一些计算
    // ...
    
    // 使用Cooperative Groups进行线程块同步
    block.sync();
}

8.3.1 Warp同步

#include <cooperative_groups.h>
using namespace cooperative_groups;

__global__ void tiled_partition_example() {
    thread_block block = this_thread_block();
    thread_block_tile<32> tile = tiled_partition<32>(block);
    
    // 执行一些计算
    // ...
    
    // 只同步当前tile(warp)内的线程
    tile.sync();
}

8.4 实际应用示例

8.4.1 归约操作中的同步

__global__ void reduction_with_sync(float *input, float *output, int n) {
    extern __shared__ float sdata[];
    
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    
    // 每个线程加载一个元素
    sdata[tid] = (i < n) ? input[i] : 0;
    
    // 同步确保所有线程都完成了数据加载
    __syncthreads();
    
    // 归约操作
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        // 同步确保这一轮归约完成
        __syncthreads();
    }
    
    // 将结果写回全局内存
    if (tid == 0) {
        output[blockIdx.x] = sdata[0];
    }
}

8.4.2 使用Cooperative Groups的高效归约

#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
using namespace cooperative_groups;

__global__ void cooperative_reduction(float *input, float *output, int n) {
    thread_block block = this_thread_block();
    thread_block_tile<32> tile = tiled_partition<32>(block);
    
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    
    // 加载数据
    float val = (i < n) ? input[i] : 0;
    
    // 使用Cooperative Groups进行归约
    float aggregate = reduce(tile, val, plus<float>{});
    
    // 将每个warp的结果存储在共享内存中
    if (tile.thread_rank() == 0) {
        // 存储warp级归约结果
    }
    
    block.sync();
    
    // 继续进行块级归约...
}

8.5 注意事项和最佳实践

避免warp分化：尽量确保同一个warp内的所有线程执行相同的代码路径，以避免性能下降。
合理使用同步原语：
- __syncthreads()只能在同一线程块内的线程间使用
- __syncwarp()用于warp内同步，更加轻量级
- Cooperative Groups提供了更灵活的同步方式
避免死锁：确保同步调用在所有执行路径上都能被调用，避免条件判断导致部分线程无法到达同步点。
性能考虑：
- __syncthreads()开销较大，应尽量减少调用次数
- __syncwarp()相对轻量，适用于warp内同步
- 在现代GPU上，使用shuffle指令配合__syncwarp()可以实现高效的warp级通信

通过合理使用这些同步机制，可以确保CUDA核函数的正确执行并获得最佳性能。

第九章 CUDA流(Streams)的使用和流同步

CUDA流是按顺序执行的一系列操作队列，包括内存拷贝和核函数执行。使用流可以实现操作的并行化，提高GPU的利用率。

9.1 CUDA流的基本概念

CUDA流允许我们将操作组织成不同的队列，这些队列中的操作会按照顺序执行。通过使用多个流，我们可以实现操作之间的并行执行，特别是内存传输和计算之间的重叠。

9.2 创建和销毁流

// 创建流
cudaStream_t stream;
cudaStreamCreate(&stream);

// 创建非阻塞流
cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

// 销毁流
cudaStreamDestroy(stream);

9.3 在流中执行操作

// 在特定流中执行内存拷贝
cudaMemcpyAsync(dst, src, size, cudaMemcpyHostToDevice, stream);

// 在特定流中执行核函数
kernel<<<grid, block, 0, stream>>>(args);

9.4 流同步

CUDA提供了多种流同步机制：

cudaStreamSynchronize()

等待特定流中的所有操作完成：

cudaStreamSynchronize(stream);

cudaStreamWaitEvent()

cudaEvent_t event;
cudaEventCreate(&event);
// ... 执行一些操作并记录事件 ...
cudaEventRecord(event, stream1);

// 让stream2等待event完成
cudaStreamWaitEvent(stream2, event, 0);

9.5 实际示例：simpleStreams

在NVIDIA的simpleStreams示例中，展示了如何使用多个流来重叠计算和内存传输：

cuda
// 创建多个流
cudaStream_t *streams = (cudaStream_t *)malloc(nstreams * sizeof(cudaStream_t));
for (int i = 0; i < nstreams; i++) {
    cudaStreamCreate(&(streams[i]));
}

// 使用多个流并行执行操作
for (int k = 0; k < nreps; k++) {
    // 在不同流中并行执行内核
    for (int i = 0; i < nstreams; i++) {
        init_array<<<blocks, threads, 0, streams[i]>>>(d_a + i * n / nstreams, d_c, niterations);
    }

    // 在对应流中执行内存拷贝
    for (int i = 0; i < nstreams; i++) {
        cudaMemcpyAsync(hAligned_a + i * n / nstreams,
                        d_a + i * n / nstreams,
                        nbytes / nstreams,
                        cudaMemcpyDeviceToHost,
                        streams[i]);
    }
}

// 等待所有操作完成
cudaEventRecord(stop_event, 0);
cudaEventSynchronize(stop_event);

9.6 CUDA事件用于时间测量和同步

CUDA事件提供了一种精确测量GPU操作时间的方法：

// 创建事件
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// 记录事件
cudaEventRecord(start, 0);
// 执行一些CUDA操作
kernel<<<grid, block>>>(data);
cudaEventRecord(stop, 0);

// 等待事件完成并计算时间
cudaEventSynchronize(stop);
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);

在asyncAPI示例中，展示了如何使用事件进行异步操作：

// 创建事件
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// 记录事件
cudaEventRecord(start, 0);
// 执行一些CUDA操作
kernel<<<grid, block>>>(data);
cudaEventRecord(stop, 0);

// 等待事件完成并计算时间
cudaEventSynchronize(stop);
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);