Bank conflict for shared memory

reference:
http://cuda-programming.blogspot.ca/2013/02/bank-conflicts-in-shared-memory-in-cuda.html

http://www.cnblogs.com/1024incn/p/4605502.html

Bank conflicts on GPUs are specific to shared memory, and they are one of many reasons a kernel can run slowly. They arise from particular patterns of access to data in shared memory, and they also depend on the hardware: an access pattern that causes a bank conflict on a device of compute capability 1.x may not cause one on a device of compute capability 2.x.

Fast shared memory is accessible only to the threads of a single block. It is divided into multiple banks (similar to the banks in DRAM modules), and each bank can service only one request at a time. Addresses are therefore interleaved across the banks to increase throughput. If shared memory is interleaved at 32-bit granularity, each bank is 32 bits wide, which is one float. The total number of banks is fixed: 16 on older GPUs (compute capability 1.x) and 32 on GPUs of compute capability 2.x and later.
Because it is on-chip, shared memory is much faster than local and global memory. Shared memory latency is roughly 100x lower than uncached global memory latency, provided there are no bank conflicts.

There are two kinds of memory on a GPU:

· On-board memory

· On-chip memory

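Global memory is a large block of on-board memory with high latency. Shared memory is the opposite: a small, low-latency block of on-chip memory with much higher bandwidth than global memory. It can be treated as a programmable cache, and its main uses are: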

· An intra-block thread communication channel

· A program-managed cache for global memory data

· Scratch pad memory for transforming data to improve global memory access patterns

Memory Banks
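To achieve high bandwidth, shared memory is divided into 32 equally sized memory modules (banks), matching the 32 threads of a warp, and these banks can be accessed simultaneously. How shared memory addresses map to banks depends on the CC version (detailed later). If a warp's shared memory access touches no more than one address in each bank, a single memory transaction services it. Otherwise several transactions are needed, which lowers the effective memory bandwidth.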
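
The transaction rule above can be sketched on the host as a small C++ model. This is only an illustration under stated assumptions: 32 banks of 4 bytes each, and threads that read the same 32-bit word are counted once (treated as a broadcast). The helper names `transactions_for_warp` and `strided_float_access` are hypothetical and not part of any CUDA API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical model: number of serialized passes needed to service one
// warp's shared-memory request, ASSUMING 32 banks that are 4 bytes wide.
// byte_addrs holds the byte address accessed by each thread of the warp.
int transactions_for_warp(const std::vector<std::uint32_t>& byte_addrs) {
    constexpr std::uint32_t kBanks = 32;
    constexpr std::uint32_t kBankWidthBytes = 4;
    assert(byte_addrs.size() <= 32);  // a warp has at most 32 threads

    // Collect the distinct 32-bit words requested from each bank.
    std::vector<std::set<std::uint32_t>> words_per_bank(kBanks);
    for (std::uint32_t addr : byte_addrs) {
        std::uint32_t word = addr / kBankWidthBytes;
        words_per_bank[word % kBanks].insert(word);
    }

    // The busiest bank determines how many passes the request takes.
    int worst = 0;
    for (const auto& words : words_per_bank)
        worst = std::max(worst, static_cast<int>(words.size()));
    return worst;
}

// Byte addresses for a warp in which thread t reads float element t * stride.
std::vector<std::uint32_t> strided_float_access(int stride) {
    std::vector<std::uint32_t> addrs;
    for (int t = 0; t < 32; ++t)
        addrs.push_back(static_cast<std::uint32_t>(t * stride) * 4u);
    return addrs;
}
```

Under this model a stride of 1 needs one pass, a stride of 2 needs two (a 2-way conflict), and a stride of 32 puts every thread in bank 0 and needs 32 passes. An odd stride such as 33 spreads the threads over all banks again, which is why padding an array by one element is a common way to avoid conflicts.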

One explanation given for organizing shared memory this way is:

To facilitate high memory bandwidth, the shared memory on each
multiprocessor is organized into equally-sized banks which can be
accessed simultaneously. And also, the banks are organized such that
consecutive 32-bit words are assigned to consecutive banks.

So the number of banks is fixed and bank numbers repeat cyclically as addresses increase. The bank width depends on the device's CC and may be 4 bytes or 8 bytes.

The bank width for each CC version is:
· 4 bytes for devices of CC 2.x
· 8 bytes for devices of CC 3.x

Likewise, with 8-byte banks, consecutive 64-bit words are assigned to consecutive banks. For 32-bit (4-byte) banks, the bank index is computed as follows:

bank index = (byte address ÷ 4 bytes/bank) % 32 banks
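
As a minimal sketch, the formula can be written as a C++ function. The name `bank_index` and its default parameters are assumptions made for illustration. The 8-byte case simply applies the same formula with a wider bank, following the description of 64-bit words above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper implementing
//   bank index = (byte address / bank width) % number of banks.
// The defaults ASSUME 4-byte banks and 32 banks.
std::uint32_t bank_index(std::uint32_t byte_addr,
                         std::uint32_t bank_width_bytes = 4,
                         std::uint32_t num_banks = 32) {
    assert(bank_width_bytes > 0 && num_banks > 0);
    return (byte_addr / bank_width_bytes) % num_banks;
}
```

For example, with the defaults, byte address 124 falls in bank 31 and byte address 128 wraps around to bank 0. With `bank_width_bytes = 8`, byte addresses 0 through 7 all fall in bank 0.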

The mapping of addresses to banks is illustrated for Fermi. Note that successive words within the same bank are 32 word addresses apart.
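
The spacing within a bank can be checked with a short C++ sketch. It assumes the 4-byte, 32-bank layout discussed above, and the helper name `words_in_bank` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: the first `count` 32-bit word indices that map to
// `bank`, ASSUMING 32 banks of 4 bytes each.
std::vector<std::uint32_t> words_in_bank(std::uint32_t bank, std::size_t count) {
    constexpr std::uint32_t kBanks = 32;
    assert(bank < kBanks);
    std::vector<std::uint32_t> words;
    for (std::size_t row = 0; row < count; ++row)
        words.push_back(bank + static_cast<std::uint32_t>(row) * kBanks);
    return words;
}
```

For bank 0 the first three word indices are 0, 32, and 64. Multiplying a word index by 4 gives the corresponding byte address.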
