CUDA Programming Interface 编程接口

黄昏贩卖机

已于 2022-10-22 19:42:38 修改

阅读量346

点赞数

分类专栏： CUDA 文章标签：网络分布式

于 2022-10-21 12:04:20 首次发布

原文链接：https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture

版权

CUDA 专栏收录该内容

2 篇文章

订阅专栏

本文详细介绍了CUDA编程接口，包括nvcc编译器的工作流程，如离线编译和即时编译，以及二进制和PTX的兼容性。重点讨论了设备内存管理，如主机内存与设备内存的交互、共享内存的使用，以及页锁定主机内存的特性。此外，还阐述了异步并发执行、流和事件的概念，展示了如何实现数据传输和内核执行的重叠。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

Programming Interface 编程接口

1 用nvcc编译

nvcc是一个编译器驱动，它简化了C或PTX的编译流程：它提供了简单熟悉的命令行选项，同时通过调用一系列实现了不同编译步骤的工具集来执行它们

1.1 Compliation Workflow 编译工作流程

1.1.1 Offline Compliation 离线编译

ncvv 可以编译混合 host code 和 device code 的代码，nvcc的基本工作流程包括从host code 中分离 device ode 然后：

将设备代码编译成汇编形式（PTX代码）或者二进制形式（cubin对象）
将执行配置节引入的<<<, >>>语法转化为必要的CUDA C运行时函数调用以加载和启动每个已编译的内核（来自PTX代码或者cubin对象）

修改后的主机代码要么被输出为C代码供其它工具编译，要么在编译的最后阶段被nvcc调用主机编译器输出为目标代码。

1.1.2 Just-in-Time Compilation 即时编译

环境变量可用于控制即时编译

1.2 Binary Compatibility 二进制兼容性

使用 -code
用-code=sm_13编译时，为计算能力1.3的设备生成二进制代码
保证向后兼容性，但不保证向前兼容

1.3 PTX Compatibility PTX 兼容性

-arch编译器选项指定预设的计算能力
-arch=sm_13
PTX保证完全的向后兼容，而二进制只保证主修订号相同的向后兼容

1.4 Application Compatibility 应用兼容性

2 CUDA Runtime

cudart动态库是运行时的实现，它包含在应用的安装包里，所有的函数前
缀都是cuda。

2.1 Initialization

没有明显的初始化函数
在第一次调用runtime函数时初始化
在初始化时，runtime 位系统的欸一个device 创建上下文 contest，被 host中所有线程共享
当主机线程调用cudaDeviceReset()时，这销毁了主机线程操作的设备的主要上下文。任何以这个设备为当前设备的主机线程调用的运行时函数将为设备重新建立一个主要上下文。

2.2 Device Memory

kernal 不能操作Device Memory，所以运行时提供了分配，释放，拷贝Device Memory和在 device和host间传输数据的函数
device memory 可以被分配为线程内存或CUDA 数组
线性内存分配同城使用 cudaMalloc() 释放使用 cudaFree()
cudaMemcpy() 在host和device之间转移数据

// Device code
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}
            
// Host code
int main()
{
    int N = ...;
    size_t size = N * sizeof(float);

    // Allocate input vectors h_A and h_B in host memory
    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);
    float* h_C = (float*)malloc(size);

    // Initialize input vectors
    ...

    // Allocate vectors in device memory
    float* d_A;
    cudaMalloc(&d_A, size);
    float* d_B;
    cudaMalloc(&d_B, size);
    float* d_C;
    cudaMalloc(&d_C, size);

    // Copy vectors from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Invoke kernel
    int threadsPerBlock = 256;
    int blocksPerGrid =
            (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result from device memory to host memory
    // h_C contains the result in host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
            
    // Free host memory
    ...
}

使用 cudaMallocPitch()分配2D数组，使用cudaMalloc3D()分配3D数组，他们会适当填充以满足对齐要求

// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch,
                width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr,
                         size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}

// Host code
int width = 64, height = 64, depth = 64;
cudaExtent extent = make_cudaExtent(width * sizeof(float),
                                    height, depth);
cudaPitchedPtr devPitchedPtr;
cudaMalloc3D(&devPitchedPtr, extent);
MyKernel<<<100, 512>>>(devPitchedPtr, width, height, depth);

// Device code
__global__ void MyKernel(cudaPitchedPtr devPitchedPtr,
                         int width, int height, int depth)
{
    char* devPtr = devPitchedPtr.ptr;
    size_t pitch = devPitchedPtr.pitch;
    size_t slicePitch = pitch * height;
    for (int z = 0; z < depth; ++z) {
        char* slice = devPtr + z * slicePitch;
        for (int y = 0; y < height; ++y) {
            float* row = (float*)(slice + y * pitch);
            for (int x = 0; x < width; ++x) {
                float element = row[x];
            }
        }
    }
}

2.3 Device Memory L2 Access Management

当一个CUDA kernal 重复读取一个 global memory 的数据区域时，这种数据访问可以被认为是持久的。
如果数据之访问一次，这种数据访问被认为streanming

2.3.1 L2 cache Set-Aside for Persisting Accesses

L2 cache的一部分用作持续数据访问。持续数据访问将优先使用L2cache的这一部分
L2 cache 用于持续数据访问的这一部分大小可以调整

cudaGetDeviceProperties(&prop, device_id);                
size_t size = min(int(prop.l2CacheSize * 0.75), prop.persistingL2CacheMaxSize);
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, size); /* set-aside 3/4 of L2 cache for persisting accesses or the max allowed*/

剩下的先不看了

2.4 Shared Memory 共享内存

共享内存使用 _shared_ 关键词
共享内存比global memory快，它可以被用作高速暂存寄存器，来减少从线程块的global memory的访问

以下的例子是矩阵相乘,没有使用共享内存，A的每一行与矩阵B的每一列做计算，得到矩阵C的一个元素。矩阵A被读取了B.width 次，矩阵B被读取了 A.height 次

// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.width + col)
typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;

// Thread block size
#define BLOCK_SIZE 16

// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    // Load A and B to device memory
    Matrix d_A;
    d_A.width = A.width; d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size,
               cudaMemcpyHostToDevice);
    Matrix d_B;
    d_B.width = B.width; d_B.height = B.height;
    size = B.width * B.height * sizeof(float);
    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size,
               cudaMemcpyHostToDevice);

    // Allocate C in device memory
    Matrix d_C;
    d_C.width = C.width; d_C.height = C.height;
    size = C.width * C.height * sizeof(float);
    cudaMalloc(&d_C.elements, size);

    // Invoke kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    // Read C from device memory
    cudaMemcpy(C.elements, d_C.elements, size,
               cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}

// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Each thread computes one element of C
    // by accumulating results into Cvalue
    float Cvalue = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e]
                * B.elements[e * B.width + col];
    C.elements[row * C.width + col] = Cvalue;
}

20221020113316

在以下的例子中使用了共享内存来做矩阵相乘。每个线程块负责计算矩阵C的一小块 $C_sub$ ，每个线程负责计算 $C_sub$ 的一个元素。 $C_sub$ 相当于A的子矩阵（A.width,block_size）和B的子矩阵 (block_size, A.width )相乘的结果。
为了使用device 资源，将两个矩形矩阵分成多规格 block_size 大小的方阵， $C_{sub}$ 由方阵相乘的和计算出。
每次乘法计算中，将两个对应的方阵载入 shared memory，每个相乘载入对应的方阵的一个元素。
在global memory 中，A只被读了（B.width/block size）次，同时B读了（A.height/block size）次。

// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.stride + col)
typedef struct {
    int width;
    int height;
    int stride; 
    float* elements;
} Matrix;

// Get a matrix element
__device__ float GetElement(const Matrix A, int row, int col)
{
    return A.elements[row * A.stride + col];
}

// Set a matrix element
__device__ void SetElement(Matrix A, int row, int col,
                           float value)
{
    A.elements[row * A.stride + col] = value;
}

// Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is
// located col sub-matrices to the right and row sub-matrices down
// from the upper-left corner of A
 __device__ Matrix GetSubMatrix(Matrix A, int row, int col) 
{
    Matrix Asub;
    Asub.width    = BLOCK_SIZE;
    Asub.height   = BLOCK_SIZE;
    Asub.stride   = A.stride;
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row
                                         + BLOCK_SIZE * col];
    return Asub;
}

// Thread block size
#define BLOCK_SIZE 16

// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    // Load A and B to device memory
    Matrix d_A;
    d_A.width = d_A.stride = A.width; d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size,
               cudaMemcpyHostToDevice);
    Matrix d_B;
    d_B.width = d_B.stride = B.width; d_B.height = B.height;
    size = B.width * B.height * sizeof(float);
    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size,
    cudaMemcpyHostToDevice);

    // Allocate C in device memory
    Matrix d_C;
    d_C.width = d_C.stride = C.width; d_C.height = C.height;
    size = C.width * C.height * sizeof(float);
    cudaMalloc(&d_C.elements, size);

    // Invoke kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    // Read C from device memory
    cudaMemcpy(C.elements, d_C.elements, size,
               cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}

// Matrix multiplication kernel called by MatMul()
 __global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Block row and column
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;

    // Each thread block computes one sub-matrix Csub of C
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);

    // Each thread computes one element of Csub
    // by accumulating results into Cvalue
    float Cvalue = 0;

    // Thread row and column within Csub
    int row = threadIdx.y;
    int col = threadIdx.x;

    // Loop over all the sub-matrices of A and B that are
    // required to compute Csub
    // Multiply each pair of sub-matrices together
    // and accumulate the results
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {

        // Get sub-matrix Asub of A
        Matrix Asub = GetSubMatrix(A, blockRow, m);

        // Get sub-matrix Bsub of B
        Matrix Bsub = GetSubMatrix(B, m, blockCol);

        // Shared memory used to store Asub and Bsub respectively
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load Asub and Bsub from device memory to shared memory
        // Each thread loads one element of each sub-matrix
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);

        // Synchronize to make sure the sub-matrices are loaded
        // before starting the computation
        __syncthreads();
        // Multiply Asub and Bsub together
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write Csub to device memory
    // Each thread writes one element
    SetElement(Csub, row, col, Cvalue);
}

2.5 Distributed Shared Memory 分布式共享内存

在同一个 thread block cluster 中的线程可以访问所有线程块的共享内存。这种划分开的共享内存被称为 ditibuted shred memory，相应的地址空间称为分布式共享存储器地址空间。
属于同一个thread block cluster 的线程可以读写，原子执行分布式地址空间，不论他们是否属于同一个线程块。
通过分布式共享内存访问数据，要保证所有的线程块都存在。使用cluster.sync()能保证所有线程块都开始执行
用户还需要确保在线程块退出之前完成所有分布式共享内存操作

2.6 Page-Locked Host Memory 页锁定主机内存

运行时提供了使用分页锁定主机存储器（也称为pinned）的函数（与常规
的使用malloc()分配的可分页的主机存储器不同）
cudaHostAlloc() and cudaFreeHost() 分配也锁定主机内存
cudaHostRegister()分页锁定一段使用malloc()分配的存储器

使用 page-locked 主机内存由几个好处;

在某些设备上，设备存储器和分页锁定主机存储器间数据拷贝可与内核执行并发进行
在一些设备上，分页锁定主机内存可映射到设备地址空间，减少了和设
备间的数据拷贝
在有前端总线的系统上，如果主机存储器是分页锁定的，主机存储器和设备存储器间的带宽会高些

页锁定内存是稀缺资源，容易分配失败。另外由于减少了系统可分页的物理存储器数量，分
配太多的分页锁定内存会降低系统的整体性能。

2.6.1 Portable Memory 可分享存储器

一块分页锁定存储器可被系统中的所有设备使用，但是默认的情况下，上面说的使用分布锁定存储器的好处只有分配它时，正在使用的设备可以享有。
为了让所有线程可以使用分布锁定共享存储器的好处，可以在使用cudaHostAlloc()分配时传入cudaHostAllocPortable标签，或者在使用cudaHostRegister()分布锁定存储器时，传入cudaHostRegisterPortable标签。

2.6.2 Write-Combining Memory 写结合存储器

默认情况下，分页锁定主机存储器是可缓存的。可以在使用cudaHostAlloc()分配时传入cudaHostAllocWriteCombined标签使其被分配为写结合的。写结合存储器没有一级和二级缓存资源，所以应用的其它部分就有更多的缓存可用。
从主机读取写结合存储器极其慢，所以写结合存储器应当只用于那些主机
只写的存储器。

2.6.3 Mapped Memory 映射存储器

将 page-locked host memory 映射到 device 地址空间
page-locked host memory将会有两个地址，一个在主机存储器上，一个在设备存储器上。主机指针是从cudaHostAlloc（）或malloc()返回的，设备指针可通cudaHostGetDevicePointer（）函数检索到，可以使用这个设备指针在内核中访问这块存储器。

从内核中直接访问主机内存有许多优点:

无须在设备上分配存储器，也不用在这块存储器和主机存储器间显式传输数据；数据传输是在内核需要的时候隐式进行的。
无须使用流 stream 重叠数据传输和内核执行；数据传输和内核执
行自动重叠。

由于被映射分页锁定存储器在主机和设备间共享，应用必须使用流或事件来同步存储器访问以避免任何潜在的读后写，写后读，或写后写危害。
为了在给定的主机线程中能够检索到被映射分页锁定存储器的设备指针，必须在调用任何CUDA运行时函数前调用cudaSetDeviceFlags()，并传入cudaDeviceMapHost标签。否则，cudaHostGetDevicePointer()将会返回错误。

2.7 Asynchronous Concurrent Execution 异步并发执行

CUDA 执行下列操作时，认为是独立的task，因此可以并发执行

Computation on the host;
Computation on the device;
Memory transfers from the host to the device;
Memory transfers from the device to the host;
Memory transfers within the memory of a given device;
Memory transfers among devices.

2.7.1 Concurrent Execution between Host and Device 主机和设备并发执行

为了易于使用主机和设备间的异步执行，一些函数是异步的：在设备完全
完成任务前，控制已经返回给主机线程了。他们是:

Kernel launches;
Memory copies within a single device’s memory;
Memory copies from host to device of a memory block of 64 KB or less;
Memory copies performed by functions that are suffixed with Async; 存储器拷贝函数中带有Async后缀的；
Memory set function calls.设置设备存储器的函数调用

通过将 CUDA _ LAUNCH _ BLOCKING 环境变量设置为1，程序员可以在全局范围内禁用在系统上运行的所有 CUDA 应用程序的内核启动异步性。

2.7.2 Concurrent Kernel Execution 并发内核执行

些计算能力2.x的设备可并发执行多个内核。应用可以检查concurrentKernels属性以查询这种能力）（参见3.2.6），如果等于1，说明支持。
计算能力3.5的设备最大可并发执行的内核数目是32，其余的是16。

2.7.3 Overlap of Data Transfer and Kernel Execution 数据传输和内核执行的重叠

一些设备可以在内核执行的同时从 GPU 执行异步内存复制。如果副本中涉及到主机内存，则它必须是页锁定的。

2.7.4 Concurrent Data Transfers 并发数据传输

2.7.5 Streams 流

流是一个命令序列，可能是由不同的主机线程发射。不同流之间相对无序的或并发的执行它们的命令；这种行为是没有保证的，而且不能作为正确性的的保证（如内核间的通信没有定义）

2.7.5.1 Creation and Destruction

创建一个流对象，并指定他为一系列 kermel launch 和 host与device内存copy 的流参数

下面创建了两个流，并分配了一个页锁定内存

cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);
float* hostPtr;
cudaMallocHost(&hostPtr, 2 * size);

下面的代码定义的每个流是一个由一次主机到设备的传输，一次内核发射，一次设备到主机的传输组成的系列。

for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
    MyKernel <<<100, 512, 0, stream[i]>>>
          (outputDevPtr + i * size, inputDevPtr + i * size, size);
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);
}

必须注意为了使用重叠, hostPtr必须指向分页锁定主机存储器。

调用cudaStreamDestroy()来释放流

for (int i = 0; i < 2; ++i)
    cudaStreamDestroy(stream[i]);

cudaStreamDestory() 会立刻返回，但是直到流中的任务都完成后，流才会 destroy。

2.7.5.2 Default Stream 默认流

没有使用流参数的内核启动和主机设备间数据拷贝，或者等价地
将流参数设为0，此时发射到默认流。因此它们顺序执行。

2.7.5.3 Explicit Synchronization 显式同步

有很多方法显式的在流之间同步。

cudaDeviceSynchronize()直到前面所有流中的命令都执行完
cudaStreamSynchronize()以某个流为参数，强制运行时等待该流中的任务都完成。可用于同步主机和特定流，同时允许其它流继续执行。
cudaStreamWaitEvent()以一个流和一个事件为参数（参见事件节），使得
在调用cudaStreamWaitEvent()后加入到指定流的所有命令暂缓执行直到事件
完成。流可以是0，此时在调用cudaStreamWaitEvent()后加入到所有流的所有
命令等待事件完成。
cudaStreamQuery()用于查询流中的所有之前的命令是否已经完成。

2.7.5.4 Implicit Synchronization 隐式同步

如果是下面中的任何一种操作在来自不同流的两个命令之间，这两个命令也不能并发：

分页锁定主机存储器分配，
设备存储器分配，
设备存储器设置，
设备内两个不同地址间的存储器拷贝函数；
默认流中调用的任何CUDA命令
一级缓存/共享存储器之间配置切换。

2.7.5.5 Overlapping Behavior 重叠行为

两个流的并行度取决于，指令发射到两个流中的顺序，以及设备是否支持数据传输和 kernel执行的重叠、内核并发执行、数据并发传输

如果设备是不执行数据并发传输的，在上例中只能等 strea[0] 中的数据传输完成后，才能执行stream[1]的数据传输

如果代码重写成如下方式（同时假设设备支持数据传输和内核执行重叠）

for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < 2; ++i)
    MyKernel<<<100, 512, 0, stream[i]>>>
          (outputDevPtr + i * size, inputDevPtr + i * size, size);
for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);

此时发射到流1的从主机到设备的存储器拷贝和发射到流0的内核执行重叠

2.7.5.6 Host Functions (Callbacks)

runtime 通过cudaLaunchHostFunc() 将cpu 调用插入流中，一旦流中之前的命令执行完，就会执行 CPU调用

下面的代码例子将回调函数MyCallbak插入到两个流中发射的主机到设备存储器的拷贝、内核执行和设备到主机的存储器拷贝操作之后。在每个设备到主机的存储器拷贝完成后该回调将会在主机上执行。

void CUDART_CB MyCallback(cudaStream_t stream, cudaError_t status, void *data){
    printf("Inside callback %d\n", (size_t)data);
}
...
for (size_t i = 0; i < 2; ++i) {
    cudaMemcpyAsync(devPtrIn[i], hostPtr[i], size, cudaMemcpyHostToDevice, stream[i]);
    MyKernel<<<100, 512, 0, stream[i]>>>(devPtrOut[i], devPtrIn[i], size);
    cudaMemcpyAsync(hostPtr[i], devPtrOut[i], size, cudaMemcpyDeviceToHost, stream[i]);
    cudaLaunchHostFunc(stream[i], MyCallback, (void*)i);
}

A host function 必须不能直接或间接的调用CUDA API，因为此时回调会等待自己，这导致死锁。

2.7.7 Events 事件

运行时还提供了一种密切监视设备进度以及执行精确计时的方法，允许应用程序异步记录程序中任何一点的事件，并在这些事件完成时进行查询。当事件之前的所有任务(或可选地，给定流中的所有命令)都已完成时，事件已完成。

Creation and Destruction

创建两个事件：

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

destroy

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

Elapsed Time 经过时间

cudaEventRecord(start, 0);
for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(inputDev + i * size, inputHost + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
    MyKernel<<<100, 512, 0, stream[i]>>>
               (outputDev + i * size, inputDev + i * size, size);
    cudaMemcpyAsync(outputHost + i * size, outputDev + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);

cudaEventRecord(start); 将start放到默认stream中，因为我们没创建stream，所以是在默认stream。

cudaEventSynchronize(stop);会阻塞CPU，直到特定的event被记录。也就是这里会阻塞，直到stop在stream中被记录才会向下执行。不使用这句话的话，kernel是异步的，还没执行完，CPU就继续往下走了。那么cudaEventElapsedTime就记录不到时间了。因为stop还没加入到device中。