One trick for improving the performance of parallel reduction style programs in CUDA is to unroll the last warp. It shows up in the official samples CUDA Radix Sort (Thrust Library), CUDA Parallel Reduction, and scan. The CUDA samples documentation says:
The included RadixSort class can sort either key-value pairs (with float or unsigned integer keys) or keys only. The optimized code in this sample (and also in reduction and scan) uses a technique known as warp-synchronous programming, which relies on the fact that within a warp of threads running on a CUDA GPU, all threads execute instructions synchronously. The code uses this to avoid __syncthreads() when threads within a warp are sharing data via __shared__ memory. It is important to note that for this to work correctly without race conditions on all GPUs, the shared memory used in these warp-synchronous expressions must be declared volatile. If it is not declared volatile, then in the absence of __syncthreads(), the compiler is free to delay stores to __shared__ memory and keep the data in registers (an optimization technique), breaking the code.
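The following is a minimal sketch of the idiom the quote describes, following the pattern used in the CUDA reduction sample; the name warpReduce and the float element type are illustrative, not taken from the sources above. The volatile qualifier is what forces every intermediate result to actually be stored to shared memory instead of being kept in a register:

__device__ void warpReduce(volatile float *sdata, unsigned int tid)
{
    // The remaining 32 active threads belong to a single warp, so on
    // pre-Volta GPUs they execute these statements in lockstep and no
    // __syncthreads() is needed between the strided additions.
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

Note that on Volta and newer GPUs, independent thread scheduling means threads in a warp are no longer guaranteed to execute in lockstep, so this idiom additionally requires __syncwarp() (or warp shuffle intrinsics) to be safe there.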

In CUDA programs, the unroll-last-warp technique is used to optimize parallel reduction: it exploits the synchronous execution of threads within a warp to avoid __syncthreads(). For it to execute correctly, however, the shared memory must be declared volatile, otherwise compiler optimizations can introduce a race condition. This article discusses that issue, including the constraints of the Fermi architecture and the observed behavior on Tesla K10 and GTX 850M GPUs.
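To show where the unrolled warp fits into a full reduction, here is a hedged sketch of a block-level sum kernel in the style of the reduction sample, reusing the hypothetical warpReduce helper from the sketch above; the kernel name reduceSum is likewise illustrative. It assumes blockDim.x is a power of two and at least 64:

__global__ void reduceSum(const float *g_in, float *g_out, unsigned int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Load one element per thread into shared memory (0 for out-of-range).
    sdata[tid] = (i < n) ? g_in[i] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory. Barriers are still required while
    // more than one warp is active.
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // The last 64 elements are reduced by a single warp with no barriers.
    if (tid < 32)
        warpReduce(sdata, tid);

    // Thread 0 writes this block's partial sum.
    if (tid == 0)
        g_out[blockIdx.x] = sdata[0];
}

A launch would pass the dynamic shared-memory size explicitly, e.g. reduceSum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n), and the per-block partial sums in d_out would then be reduced in a second pass.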