When to call cudaDeviceSynchronize
Why do we need cudaDeviceSynchronize() when using device-side printf in kernels?
Although CUDA kernel launches are asynchronous, all GPU tasks placed in one stream (the default behaviour) execute sequentially.
So, for example,
kernel1<<<X,Y>>>(...); // kernel starts executing; CPU continues to the next statement
kernel2<<<X,Y>>>(...); // kernel is queued and will start after kernel1 finishes; CPU continues to the next statement
cudaMemcpy(...); // CPU blocks until the memory is copied; the copy starts only after all preceding CUDA tasks complete
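To see why device-side printf in particular needs a synchronization point, consider this minimal sketch (the kernel name `hello` is made up for illustration): without the final cudaDeviceSynchronize(), the host can reach the end of main() before the kernel has run, and the device printf buffer is never flushed to stdout.

```cuda
#include <cstdio>

// Hypothetical kernel: prints from the device using device-side printf.
__global__ void hello()
{
    printf("hello from thread %d\n", threadIdx.x);
}

int main()
{
    hello<<<1, 4>>>();        // asynchronous: the host returns immediately

    // Block the host until the kernel finishes; this also flushes
    // the device-side printf buffer so the output actually appears.
    cudaDeviceSynchronize();
    return 0;
}
```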
Meanwhile, the explanation ranked second in Google's Chinese results is somewhat incomplete:
These are all barriers. A barrier prevents code execution beyond it until some condition is met.
- cudaDeviceSynchronize() halts execution in the CPU/host thread (the one that issued the cudaDeviceSynchronize) until the GPU has finished processing all previously requested CUDA tasks (kernels, data copies, etc.).
- cudaThreadSynchronize(), as you've discovered, is just a deprecated version of cudaDeviceSynchronize(). Deprecated means it still works for now, but its use is discouraged (use cudaDeviceSynchronize() instead) and it may become unsupported in the future. Otherwise, cudaThreadSynchronize() and cudaDeviceSynchronize() are basically identical.
- cudaStreamSynchronize() is similar to the above two functions, but it blocks the CPU host thread only until the GPU has finished all previously requested CUDA tasks that were issued in the referenced stream. It takes a stream id as its only parameter; CUDA tasks issued in other streams may or may not be complete when CPU execution continues beyond this barrier.
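The difference between the stream-scoped and device-wide barriers can be sketched as follows (the kernel name `work` is an illustrative placeholder):

```cuda
__global__ void work() { /* some device work */ }

int main()
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    work<<<1, 1, 0, s1>>>();   // issued in stream s1
    work<<<1, 1, 0, s2>>>();   // issued in stream s2

    // Blocks the host only until the work in s1 is done;
    // the kernel in s2 may still be running afterwards.
    cudaStreamSynchronize(s1);

    // Blocks the host until *all* previously issued device work
    // (both streams) has finished.
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```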
Although CUDA kernel launches are asynchronous, GPU tasks in the same stream execute in order. cudaDeviceSynchronize() makes the CPU host thread wait until the GPU has finished all previously requested tasks. cudaThreadSynchronize() is deprecated; cudaDeviceSynchronize() is recommended instead. cudaStreamSynchronize() synchronizes against a specified stream only.