CUDA: Improving Performance with Vectorized Loads

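This post discusses why vectorized memory access matters in CUDA and how it affects performance. Vectorized loads are usually preferable to scalar loads, but they increase register pressure and reduce overall parallelism. If a kernel is already register limited or has very low parallelism, scalar loads may be the better choice. The post also covers how to construct vector types.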

Reposted from https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-increase-performance-with-vectorized-memory-access/

Note the following:
In almost all cases vectorized loads are preferable to scalar loads. Note however that using vectorized loads increases register pressure and reduces overall parallelism. So if you have a kernel that is already register limited or has very low parallelism, you may want to stick to scalar loads. Also, as discussed earlier, if your pointer is not aligned or your data type size in bytes is not a power of two you cannot use vectorized loads.

You have to weigh this trade-off yourself.
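The trade-off above can be made concrete with a minimal sketch of a copy kernel in both scalar and vectorized form. The names copy_scalar, copy_vec4, aligned16, and device_copy are hypothetical and chosen for illustration; they are not taken from the original article. The sketch assumes int data, a grid of at least three threads, and it omits CUDA error checking for brevity. The host helper falls back to the scalar kernel when either pointer is not 16-byte aligned, which mirrors the alignment restriction in the note.

```cuda
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Scalar baseline: one 4-byte load and one 4-byte store per element.
__global__ void copy_scalar(const int* in, int* out, int n) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = in[i];
}

// Vectorized variant: one 16-byte load and store per four elements.
// Both pointers must be 16-byte aligned for the int4 casts to be valid.
__global__ void copy_vec4(const int* in, int* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    const int4* in4 = reinterpret_cast<const int4*>(in);
    int4* out4 = reinterpret_cast<int4*>(out);
    for (int i = idx; i < n / 4; i += stride)
        out4[i] = in4[i];
    // Up to three leftover elements are copied by the first threads of the grid.
    int tail_start = (n / 4) * 4;
    if (idx < n % 4)
        out[tail_start + idx] = in[tail_start + idx];
}

// Hypothetical helper: true if p is aligned to 16 bytes.
static bool aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical host helper: round-trips src through the device and returns
// the copy. Error checking is omitted to keep the sketch short.
std::vector<int> device_copy(const std::vector<int>& src, bool allow_vector = true) {
    int n = static_cast<int>(src.size());
    std::vector<int> dst(n, 0);
    if (n == 0) return dst;

    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, src.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 256;
    if (allow_vector && aligned16(d_in) && aligned16(d_out)) {
        int blocks = (n / 4 + threads - 1) / threads;
        if (blocks < 1) blocks = 1;  // still need threads for the tail
        copy_vec4<<<blocks, threads>>>(d_in, d_out, n);
    } else {
        int blocks = (n + threads - 1) / threads;
        copy_scalar<<<blocks, threads>>>(d_in, d_out, n);
    }

    cudaMemcpy(dst.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return dst;
}
```

Calling device_copy(data) takes the vectorized path whenever the device pointers are aligned, while device_copy(data, false) forces the scalar kernel, so the two can be timed against each other. Whether the int4 version wins in a real kernel depends on register usage and occupancy, so it is worth measuring rather than assuming.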

Built-in vector types. Built-in types are aligned automatically.
These are vector types derived from the basic integer and floating-point types. They are structures and the 1st, 2nd, 3rd, and 4th components are accessible through the fields x, y, z, and w, respectively. They all come with a constructor function of the form make_<type name>; for example
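As a concrete sketch of these constructors and component fields, the snippet below builds values with make_int2 and make_float4, which are provided by the CUDA runtime headers. The helpers make_ramp, sum4, and fill_ramps are hypothetical names used only for illustration. The static_assert lines record the alignment of int2 and float4 and the 12-byte size of float3, which is why float3 does not qualify for a single vectorized load.

```cuda
#include <cuda_runtime.h>

// Alignment is part of the built-in type definitions.
static_assert(alignof(int2) == 8, "int2 is 8-byte aligned");
static_assert(alignof(float4) == 16, "float4 is 16-byte aligned");
static_assert(sizeof(float3) == 12, "float3 is 12 bytes, not a power of two");

// Hypothetical helper: builds (base, base+1, base+2, base+3) with the
// make_float4 constructor. Usable on both host and device.
__host__ __device__ inline float4 make_ramp(float base) {
    return make_float4(base, base + 1.0f, base + 2.0f, base + 3.0f);
}

// Hypothetical helper: reads the four components through x, y, z, and w.
__host__ __device__ inline float sum4(float4 v) {
    return v.x + v.y + v.z + v.w;
}

// Each thread writes one float4, which is a single 16-byte store.
__global__ void fill_ramps(float4* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = make_ramp(4.0f * static_cast<float>(i));
}
```

On the host, make_int2(3, 7) yields a value whose x field is 3 and whose y field is 7, and the same constructors can be called inside kernels as fill_ramps does.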