CUDA Programming: __syncthreads()

License: CC 4.0 BY-SA

Tags: #cuda #__syncthreads

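This article describes how to use the CUDA thread synchronization function __syncthreads() and its variants, including how to ensure that all threads in the same block reach a given synchronization point and how to guarantee the consistency of memory accesses. It also covers the functionality and use cases of the more advanced synchronization functions __syncthreads_count(), __syncthreads_and(), __syncthreads_or(), and __syncwarp().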

__syncthreads() synchronizes the threads within a thread block.

B.6. Synchronization Functions

void __syncthreads();

waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block.

__syncthreads() is used to coordinate communication between the threads of the same block. When some threads within a block access the same addresses in shared or global memory, there are potential read-after-write, write-after-read, or write-after-write hazards for some of these memory accesses. These data hazards can be avoided by synchronizing threads in-between these accesses.
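
As an illustration of the read-after-write hazard described above, the following is a minimal sketch in which each thread writes one shared-memory slot and then reads a slot written by a different thread. The names reverseBlock, reverseOnDevice, and kN are hypothetical and chosen only for this example; the kernel assumes a single block of exactly kN threads.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kN = 256;  // hypothetical block size, chosen for illustration

// Reverses kN ints inside one thread block using shared memory.
__global__ void reverseBlock(int* data) {
    __shared__ int tile[kN];
    int t = threadIdx.x;
    tile[t] = data[t];           // each thread writes its own slot
    __syncthreads();             // every write to tile is now visible block-wide
    data[t] = tile[kN - 1 - t];  // safe: reads a slot written by another thread
}

// Hypothetical host helper: returns the reversed array,
// or an empty vector if the input size is wrong or a CUDA call fails.
std::vector<int> reverseOnDevice(std::vector<int> host) {
    if (host.size() != static_cast<size_t>(kN)) return {};
    int* dev = nullptr;
    size_t bytes = kN * sizeof(int);
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return {};
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    reverseBlock<<<1, kN>>>(dev);
    cudaError_t launchErr = cudaGetLastError();
    cudaError_t copyErr = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (launchErr != cudaSuccess || copyErr != cudaSuccess) host.clear();
    return host;
}
```

Without the barrier, a thread could read tile[kN - 1 - t] before the thread that owns that slot has written it, which is exactly the read-after-write hazard the text refers to.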

__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.
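
One common way to respect this rule is to keep the barrier outside any branch that depends on the thread index. The sketch below is one possible arrangement, not the only one; addRightNeighbor, addRightNeighborOnDevice, and kBlock are hypothetical names introduced for the example.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlock = 128;  // hypothetical block size, chosen for illustration

// out[i] = in[i] + (right neighbor within the same block, or 0 at the edge).
__global__ void addRightNeighbor(const int* in, int* out, int n) {
    __shared__ int tile[kBlock];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Unsafe pattern (shown only as a comment):
    //   if (i < n) { tile[threadIdx.x] = in[i]; __syncthreads(); ... }
    // In a partially filled last block, some threads would skip the barrier.

    tile[threadIdx.x] = (i < n) ? in[i] : 0;  // divergent load, no barrier inside
    __syncthreads();                          // reached by every thread of the block

    if (i < n) {
        int right = (threadIdx.x + 1 < blockDim.x) ? tile[threadIdx.x + 1] : 0;
        out[i] = tile[threadIdx.x] + right;
    }
}

// Hypothetical host helper: returns the result, or an empty vector on failure.
std::vector<int> addRightNeighborOnDevice(const std::vector<int>& in) {
    int n = static_cast<int>(in.size());
    if (n == 0) return {};
    size_t bytes = n * sizeof(int);
    int* dIn = nullptr;
    int* dOut = nullptr;
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return {};
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return {}; }
    cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);
    int blocks = (n + kBlock - 1) / kBlock;
    addRightNeighbor<<<blocks, kBlock>>>(dIn, dOut, n);
    std::vector<int> out(n);
    cudaError_t err = cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) out.clear();
    return out;
}
```

The bounds check is applied to the work on either side of the barrier rather than to the barrier itself, so the condition guarding __syncthreads() is trivially identical for the whole block.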

Devices of compute capability 2.x and higher support three variations of __syncthreads() described below.

int __syncthreads_count(int predicate);

is identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns the number of threads for which predicate evaluates to non-zero.

int __syncthreads_and(int predicate);

is identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for all of them.

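int __syncthreads_or(int predicate);

is identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for any of them.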
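
The three predicate variants can be sketched together in one kernel. This is only an illustrative example: votePositive, votePositiveOnDevice, the Votes struct, and kThreads are hypothetical names, and the kernel assumes a single block of exactly kThreads threads.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 128;  // hypothetical block size, chosen for illustration

// Hypothetical result type for this example.
struct Votes {
    int count;  // number of threads whose predicate was non-zero
    bool all;   // true if the predicate held for every thread
    bool any;   // true if the predicate held for at least one thread
};

// Each thread tests whether its element is positive; the block then votes.
__global__ void votePositive(const int* data, int* result) {
    int pred = data[threadIdx.x] > 0;
    int count = __syncthreads_count(pred);  // barrier + population count
    int all = __syncthreads_and(pred);      // barrier + logical AND
    int any = __syncthreads_or(pred);       // barrier + logical OR
    if (threadIdx.x == 0) {                 // every thread gets the same values
        result[0] = count;
        result[1] = (all != 0);
        result[2] = (any != 0);
    }
}

// Hypothetical host helper: returns {-1, false, false} on failure.
Votes votePositiveOnDevice(const std::vector<int>& in) {
    Votes failed{-1, false, false};
    if (in.size() != static_cast<size_t>(kThreads)) return failed;
    int* dData = nullptr;
    int* dResult = nullptr;
    if (cudaMalloc(&dData, kThreads * sizeof(int)) != cudaSuccess) return failed;
    if (cudaMalloc(&dResult, 3 * sizeof(int)) != cudaSuccess) { cudaFree(dData); return failed; }
    cudaMemcpy(dData, in.data(), kThreads * sizeof(int), cudaMemcpyHostToDevice);
    votePositive<<<1, kThreads>>>(dData, dResult);
    int r[3] = {0, 0, 0};
    cudaError_t err = cudaMemcpy(r, dResult, sizeof(r), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dResult);
    if (err != cudaSuccess) return failed;
    return Votes{r[0], r[1] != 0, r[2] != 0};
}
```

Because all threads of the block call each variant unconditionally, the same restriction on conditional code that applies to __syncthreads() is satisfied here.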

void __syncwarp(unsigned mask=0xffffffff);

will cause the executing thread to wait until all warp lanes named in mask have executed a __syncwarp() (with the same mask) before resuming execution. All non-exited threads named in mask must execute a corresponding __syncwarp() with the same mask, or the result is undefined.

Executing __syncwarp() guarantees memory ordering among threads participating in the barrier. Thus, threads within a warp that wish to communicate via memory can store to memory, execute __syncwarp(), and then safely read values stored by other threads in the warp.
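
The store, __syncwarp(), read sequence described above might look like the following minimal sketch. It assumes a warp size of 32 and a launch with exactly one warp; rotateWithinWarp, rotateOnDevice, and kWarp are hypothetical names introduced for the example.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kWarp = 32;  // assumed warp size for this illustration

// Each lane publishes a value, then reads the value published by the next lane.
__global__ void rotateWithinWarp(int* data) {
    __shared__ int slot[kWarp];
    int lane = threadIdx.x;     // launched with exactly one warp
    slot[lane] = data[lane];    // store to memory
    __syncwarp();               // default mask 0xffffffff: all 32 lanes participate
    data[lane] = slot[(lane + 1) % kWarp];  // read a value stored by another lane
}

// Hypothetical host helper: returns the rotated array, or an empty vector on failure.
std::vector<int> rotateOnDevice(std::vector<int> host) {
    if (host.size() != static_cast<size_t>(kWarp)) return {};
    int* dev = nullptr;
    size_t bytes = kWarp * sizeof(int);
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return {};
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    rotateWithinWarp<<<1, kWarp>>>(dev);
    cudaError_t err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) host.clear();
    return host;
}
```

Since only a single warp takes part, a warp-level barrier is sufficient here; with more than one warp sharing the same memory, a block-wide __syncthreads() would be needed instead.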