On Data Alignment (__align__)

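This article explains why memory alignment matters on GPUs and how to achieve it. Aligned memory accesses improve GPU execution efficiency and reduce the number of cache accesses. It covers optimizing memory access through 16-byte alignment and shows a concrete struct alignment example.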

Reference:
https://www.quora.com/In-CUDA-programming-what-are-Alignment-Requirements

Memory alignment requirements on GPUs are fairly strict when it comes to optimization, and the first step in optimizing is usually to make memory access efficient. This mirrors CPUs, where well-aligned memory access yields better timings; on GPUs, 16-byte alignment is what enables efficient access. To get it, define a struct or class type whose alignment is 16 bytes, then lay the data out as an array of such objects or as a structure of arrays. According to the answer, ideally 8 ints, four floats, or (on Nvidia GPUs) 2 doubles can be accessed in one clock cycle.

Some processors require that objects be stored in memory at an address that is evenly divisible by some number, which is called the alignment of that object.

For example, an array of 32-bit integers with 4-byte alignment must be stored at an address that is evenly divisible by 4.
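
The divisibility rule above can be checked directly in code. The following is a minimal sketch in host-side C++; the helper names (is_aligned_address, is_aligned, int32_array_is_4_byte_aligned) are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if `address` is evenly divisible by `alignment`.
// (Hypothetical helper; `alignment` must be non-zero.)
inline bool is_aligned_address(std::uintptr_t address, std::size_t alignment) {
    assert(alignment != 0);
    return address % alignment == 0;
}

// Convenience overload for real pointers.
inline bool is_aligned(const void* ptr, std::size_t alignment) {
    return is_aligned_address(reinterpret_cast<std::uintptr_t>(ptr), alignment);
}

// Checks that every element of a 32-bit integer array sits at an
// address divisible by 4, i.e. satisfies 4-byte alignment.
inline bool int32_array_is_4_byte_aligned(const std::int32_t* data,
                                          std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        if (!is_aligned(data + i, 4)) {
            return false;
        }
    }
    return true;
}
```

If the first element is 4-byte aligned, every later element is too, because consecutive 32-bit integers are exactly 4 bytes apart.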

Many CPUs support a small alignment size for all objects, maybe 1 byte.

GPUs often have more stringent restrictions, for example that primitive objects have natural alignment (the alignment is equal to the size of the object).

Alignment restrictions exist because they can be implemented more efficiently in hardware. In particular, accesses to objects with natural alignment that are smaller than the cache line size can always be satisfied with one cache access. If the alignment is smaller than the object size, then the object can span cache lines, and the hardware required to access it becomes more complex.
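
The cache-line argument is plain address arithmetic, sketched below in host-side C++. The function names are hypothetical, and the 128-byte line size used in the usage note is only an assumed value; the real line size depends on the hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if an object of `size` bytes starting at `address` touches more
// than one cache line of `line_size` bytes. (Hypothetical helper.)
inline bool spans_cache_lines(std::uintptr_t address, std::size_t size,
                              std::size_t line_size) {
    assert(size > 0 && line_size > 0);
    const std::uintptr_t first_line = address / line_size;
    const std::uintptr_t last_line = (address + size - 1) / line_size;
    return first_line != last_line;
}

// Walks naturally aligned addresses (multiples of `size`) across the first
// few cache lines and reports whether none of those placements spans a line.
// Intended for sizes that evenly divide `line_size`.
inline bool natural_alignment_never_spans(std::size_t size,
                                          std::size_t line_size) {
    for (std::uintptr_t address = 0; address < 4 * line_size; address += size) {
        if (spans_cache_lines(address, size, line_size)) {
            return false;
        }
    }
    return true;
}
```

With an assumed 128-byte line, an 8-byte object at address 120 fits in one line, while the same object at address 124 (only 4-byte aligned) covers bytes 124 to 131 and needs two cache accesses.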

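The variables we use day to day (the built-in types we initialise data structures with) are laid out contiguously and aligned in memory. For user-defined types and structs, however, the layout is less certain: the compiler may insert padding between members, and by default a struct is only as aligned as its most strictly aligned member. We can explicitly use the __align__ specifier so that objects of our struct are stored in aligned, contiguous memory.
For structures, the size and alignment requirements can be enforced by the compiler using the alignment specifiers __align__(8) or __align__(16), for example: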

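// __align__(16) goes between `struct` and the type name.
// The name Float4 is illustrative; the original snippet left the struct unnamed.
struct __align__(16) Float4
{
    float a;
    float b;
    float c;
    float d;
};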

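The members a, b, c, and d are contiguous in memory, and every object of the struct starts at a 16-byte-aligned address.
Purpose:
Coalesced memory access. On the GPU in particular, coalesced access matters a great deal.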
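
To see what the specifier changes, the sketch below compares size and alignment with and without it. It is an illustration under stated assumptions, not a definitive implementation: the ALIGN macro and the type and function names (Vec4, Vec3Plain, Vec3Aligned, all_16_byte_aligned, scale_vec4) are hypothetical, and the macro falls back to standard C++ alignas so that the same file also builds with a plain host compiler. It assumes 4-byte floats.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical portability shim: __align__ under nvcc, alignas elsewhere.
#if defined(__CUDACC__)
#define ALIGN(n) __align__(n)
#else
#define ALIGN(n) alignas(n)
#endif

struct ALIGN(16) Vec4 { float a; float b; float c; float d; };
struct Vec3Plain { float a; float b; float c; };            // default alignment
struct ALIGN(16) Vec3Aligned { float a; float b; float c; }; // padded to 16 bytes

static_assert(sizeof(Vec4) == 16, "four 4-byte floats, no padding");
static_assert(alignof(Vec4) == 16, "objects start on 16-byte boundaries");
static_assert(offsetof(Vec4, d) == 12, "members are contiguous");
static_assert(sizeof(Vec3Plain) == 12, "only 4-byte aligned by default");
static_assert(sizeof(Vec3Aligned) == 16, "size rounds up to the alignment");

// Checks that every element of a Vec4 array starts at a multiple of 16.
inline bool all_16_byte_aligned(const Vec4* data, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        if (reinterpret_cast<std::uintptr_t>(data + i) % 16 != 0) {
            return false;
        }
    }
    return true;
}

#if defined(__CUDACC__)
// Each thread reads and writes one whole 16-byte element. Because Vec4 is
// 16-byte aligned, the compiler is allowed to use a single 16-byte access
// per element, and neighbouring threads touch neighbouring addresses.
__global__ void scale_vec4(Vec4* data, float factor, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        Vec4 v = data[i];
        v.a *= factor; v.b *= factor; v.c *= factor; v.d *= factor;
        data[i] = v;
    }
}
#endif
```

Note the three-float case: without the specifier the struct is 12 bytes and only 4-byte aligned, so elements of an array would not all fall on 16-byte boundaries; with it, each element is padded to 16 bytes.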
