reference:
http://cuda-programming.blogspot.ca/2013/02/bank-conflicts-in-shared-memory-in-cuda.html
http://www.cnblogs.com/1024incn/p/4605502.html
Bank conflicts on GPUs are specific to shared memory, and they are one of the many reasons a kernel can slow down. They arise from particular access patterns to data in shared memory, and they also depend on the hardware: an access pattern that causes a bank conflict on a device of compute capability 1.x may not cause one on a device of compute capability 2.x.
Fast shared memory access is restricted to the threads of a block. Shared memory is divided into multiple banks (similar to banks in DRAM modules), and each bank can service only one request at a time. The memory is therefore interleaved across the banks to increase throughput. If it is interleaved in 32-bit units, each bank is 32 bits wide, the size of one float. The total number of banks is fixed: 16 on older GPUs (compute capability 1.x) and 32 on GPUs of compute capability 2.x and later.
Because it is on-chip, shared memory is much faster than local and global memory. Provided there are no bank conflicts, shared memory latency is roughly 100x lower than uncached global memory latency.
There are two kinds of memory on a GPU:
· On-board memory
· On-chip memory
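Global memory is a large on-board memory with high latency. Shared memory is the opposite: a small, low-latency on-chip memory that offers much higher bandwidth than global memory. It can be treated as a programmable cache, and its main uses are: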
· An intra-block thread communication channel
· A program-managed cache for global memory data
· Scratch pad memory for transforming data to improve global memory access patterns
Memory Banks
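To achieve high bandwidth, shared memory is divided into 32 equally sized memory modules, called banks (one per thread in a warp), which can be accessed simultaneously. Depending on the compute capability (CC), shared memory addresses are mapped to banks in different ways (detailed later). If a warp accesses shared memory and touches no more than one address in each bank, the request is served in a single memory transaction; otherwise it needs several transactions, which lowers the effective memory bandwidth.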
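The one-address-per-bank rule above can be sketched as a small host-side model. This is only an illustration, not how the hardware is implemented: the names `conflict_degree` and `strided_access` are hypothetical, and the model assumes 4-byte banks and that threads reading the same word are served together.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical model: number of serialized transactions needed for one
// shared-memory request, given the 32-bit word index each thread accesses.
// Assumes 4-byte banks, so word w lives in bank (w % num_banks).
std::size_t conflict_degree(const std::vector<std::size_t>& word_indices,
                            std::size_t num_banks = 32) {
    // Collect the distinct words requested from each bank.
    std::vector<std::set<std::size_t>> words_per_bank(num_banks);
    for (std::size_t w : word_indices) {
        words_per_bank[w % num_banks].insert(w);
    }
    // The busiest bank determines how many passes are required.
    std::size_t degree = 0;
    for (const auto& words : words_per_bank) {
        degree = std::max(degree, words.size());
    }
    return degree;
}

// Word indices touched when thread t of a warp accesses element t * stride.
std::vector<std::size_t> strided_access(std::size_t stride,
                                        std::size_t warp_size = 32) {
    std::vector<std::size_t> indices(warp_size);
    for (std::size_t t = 0; t < warp_size; ++t) {
        indices[t] = t * stride;
    }
    return indices;
}
```

Under this model, `conflict_degree(strided_access(1))` is 1 (conflict-free), a stride of 2 gives 2, and a stride of 32 gives 32 because every thread lands in bank 0. A stride of 0, where all threads read the same word, also gives 1.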
One stated rationale for organizing shared memory into banks is:
To facilitate high memory bandwidth, the shared memory on each
multiprocessor is organized into equally-sized banks which can be
accessed simultaneously. And also, the banks are organized such that
consecutive 32-bit words are assigned to consecutive banks.
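So the number of banks is fixed, and bank numbers repeat cyclically across the address space. The width of each bank depends on the device's CC and may be 4 bytes or 8 bytes.
The bank width for each CC version is: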
· 4 bytes for devices of CC 2.x
· 8 bytes for devices of CC 3.x (which also offer a 4-byte addressing mode)
Likewise, with 8-byte banks, consecutive 64-bit words are assigned to consecutive banks. For 4-byte (32-bit) banks, the bank index is computed as:
bank index = (byte address ÷ 4 bytes/bank) % 32 banks
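As a minimal sketch, the formula can be written as a small function. The name `bank_index` and its parameters are hypothetical. The 8-byte case simply swaps the divisor, following the description of 64-bit words above.

```cpp
#include <cstddef>

// Hypothetical helper: bank that holds a given shared-memory byte address.
// bank_width_bytes is 4 for 32-bit banks; passing 8 models 64-bit banks.
constexpr std::size_t bank_index(std::size_t byte_address,
                                 std::size_t bank_width_bytes = 4,
                                 std::size_t num_banks = 32) {
    return (byte_address / bank_width_bytes) % num_banks;
}
```

For example, with 4-byte banks, byte addresses 0, 4, and 124 fall in banks 0, 1, and 31, and address 128 wraps around to bank 0 again. With 8-byte banks, `bank_index(8, 8)` is 1.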
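The mapping of the storage space is shown below. The figure below shows the address mapping on Fermi; note that successive addresses within a bank differ by 32,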