CUDA Optimization Techniques Explained

Article link: https://blog.youkuaiyun.com/fb_help/article/details/79436993

Notes on CUDA

Why is the GPU faster than the CPU?

One well-known reason is that a GPU runs far more threads and processes work in parallel rather than serially.
Another significant reason is the GPU programming model, CUDA. Why CUDA? A CPU relies on many low-level hardware optimizations, such as pipelining, out-of-order (OoO) execution, branch prediction, and multiple cores. These optimizations sit so low in the stack that they cannot be tailored to every problem; they perform well only in common cases. CUDA instead hands the optimization work to programmers, who can optimize at a higher level, that is, at the design level. The advantage is that programmers can design their code and consider its efficiency at the same time. In short, programmers write code that explicitly controls the GPU's activity for efficiency.

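Shared memory is on-chip memory, so it is very fast.
The key feature of shared memory is its 32 memory banks.
We should reduce or avoid bank conflicts as much as possible.
We can pad a shared memory array to change how its elements map to banks, which reduces or avoids bank conflicts.

Shared memory is good at serving irregular (out-of-order) memory requests, because it has 32 memory banks and because padding can reduce or avoid bank conflicts. The book uses reduction and matrix transpose as examples; both issue memory requests in an irregular order.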
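
To make the padding idea concrete, here is a minimal host-side C++ sketch (a model, not device code) of which bank each thread of a warp hits when the warp reads one column of a row-major tile. It assumes 32 banks that are each one 4-byte word wide, so word w lives in bank w % 32. The name max_bank_conflict and the tile layout are illustrative and not from the original notes.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Assumption: 32 banks, each one 4-byte word wide, so word w maps to bank w % 32.
constexpr int kBanks = 32;
constexpr int kWarpSize = 32;

// Hypothetical helper: thread t of a warp reads tile[t][col] from a row-major
// tile whose rows are `row_stride` words long. Returns the largest number of
// threads that land in the same bank (1 means conflict-free).
int max_bank_conflict(int row_stride, int col) {
    assert(row_stride > 0 && col >= 0 && col < row_stride);
    std::array<int, kBanks> hits{};  // zero-initialised hit counter per bank
    for (int t = 0; t < kWarpSize; ++t) {
        int word = t * row_stride + col;  // linear word offset of tile[t][col]
        ++hits[word % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under these assumptions, a tile with 32-word rows puts the whole column in one bank (a 32-way conflict), while padding each row by one word (a stride of 33) spreads the column over all 32 banks.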

Programming ideas:

1. Which unit are we working with?
When we optimize code, sometimes we should treat the block as the unit (the logical view), and sometimes we need to treat the warp as the unit (the physical view).
2. Parallel thinking
When you write a kernel, keep parallel thinking in mind at all times. A kernel exists for parallel programming, so the code in a kernel is executed by every thread.
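
The step from the logical view (a thread's coordinates inside its block) to the physical view (its warp and lane) can be sketched on the host like this. The sketch assumes a warp size of 32 and relies on CUDA numbering the threads of a block with x varying fastest. WarpPosition and warp_position are hypothetical names used only for illustration.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;  // assumed warp size

struct WarpPosition {
    int warp;  // which warp of the block the thread belongs to
    int lane;  // position of the thread inside that warp (0..31)
};

// Hypothetical helper mirroring what a kernel could compute from
// threadIdx.x, threadIdx.y and blockDim.x for a 2D block.
WarpPosition warp_position(int tx, int ty, int block_dim_x) {
    assert(tx >= 0 && tx < block_dim_x && ty >= 0);
    int linear = ty * block_dim_x + tx;  // x varies fastest
    return {linear / kWarpSize, linear % kWarpSize};
}
```

For a block that is 8 threads wide, for example, thread (0, 4) is the first lane of the second warp, even though logically it sits in the fifth row of the block.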

Optimization model:

CPU:

C++ code --> machine language --> CPU execution (the CPU optimizes itself at a low level, e.g. OoO execution and branch prediction)

GPU:

C++ code --> CUDA code --> GPU execution (the programmer optimizes explicitly through the CUDA code)

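CUDA methods for improving performance:

1. Tune the block size to increase occupancy.
2. Align and coalesce memory accesses to increase memory transaction efficiency.
3. Reduce branch divergence.
4. Unroll loops.
5. Use shared memory and avoid bank conflicts, i.e. use shared memory correctly.
6. Use texture memory for programs whose memory requests have locality.
7. Use the read-only cache for independent, read-only memory access patterns.
8. Use warp shuffle.
9. Use CUDA streams to control concurrency.
10. Combine OpenMP with asynchronous execution.
11. Use OpenACC to simplify coding.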
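
As an illustration of item 8, the following host-side C++ sketch models the pattern commonly used for a warp-level sum with shuffle-down: in every step each lane adds the value held by the lane `offset` positions above it, and the offset halves until lane 0 holds the total. On the device the exchange would go through a shuffle intrinsic such as __shfl_down_sync rather than an array. The function name warp_reduce_sum and the array-based simulation are assumptions made for this sketch.

```cpp
#include <cassert>
#include <vector>

constexpr int kWarpSize = 32;  // assumed warp size

// Host-side model of a shuffle-down sum over one warp. lanes[i] is the
// register value held by lane i. Returns the value lane 0 ends up with.
int warp_reduce_sum(std::vector<int> lanes) {
    assert(static_cast<int>(lanes.size()) == kWarpSize);
    for (int offset = kWarpSize / 2; offset > 0; offset /= 2) {
        const std::vector<int> before = lanes;  // every lane reads old values
        for (int lane = 0; lane < kWarpSize; ++lane) {
            int src = lane + offset;
            // A lane whose source is outside the warp gets its own value back.
            int received = (src < kWarpSize) ? before[src] : before[lane];
            lanes[lane] += received;
        }
    }
    return lanes[0];  // only lane 0 is guaranteed to hold the full sum
}
```

The point of the pattern is that the five steps exchange values directly between lanes, without going through shared memory.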

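Fat and thin blocks

Thin block: 8×32
Fat block: 32×8
From the hardware's point of view, blocks are one-dimensional. Because the two shapes have the same product, the block size and the number of warps are the same, so the occupancy is the same. However, there is a logical mapping between a block and the memory it accesses (idx = blockIdx.x*blockDim.x+threadIdx.x; idy = … ; index = idy*(blockDim.x*gridDim.x)+idx), so the shape of the block can affect the memory access pattern. A thin block can therefore cover relatively more contiguous memory, which helps alignment and gives higher bandwidth.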
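Note: the optimal block size (the product of the block dimensions) is generally 128 threads. On compute capability 3.x, an SM holds at most 2048 resident threads and at most 16 resident blocks, and 2048/16 = 128; other architectures have different limits.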
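
The arithmetic behind this note can be sketched as follows. The sketch assumes the per-SM limits of compute capability 3.x (2048 resident threads, 16 resident blocks) and deliberately ignores register and shared memory usage, which also limit occupancy in practice. resident_threads is a hypothetical helper, not a CUDA API.

```cpp
#include <algorithm>
#include <cassert>

// Assumed per-SM limits (compute capability 3.x); other architectures differ.
constexpr int kMaxThreadsPerSM = 2048;
constexpr int kMaxBlocksPerSM = 16;

// Threads that can be resident on one SM for a given block size, counting
// only the thread and block limits (registers and shared memory are ignored).
int resident_threads(int block_size) {
    assert(block_size > 0 && block_size <= 1024);
    int blocks = std::min(kMaxBlocksPerSM, kMaxThreadsPerSM / block_size);
    return blocks * block_size;
}
```

With these limits, a block of 64 threads can keep only 16 × 64 = 1024 threads resident, while 128 is the smallest block size that reaches all 2048. An 8×32 block and a 32×8 block both have 256 threads and give the same result.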
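The most important practical consequence of block shape affecting memory access efficiency is this: when you use a two-dimensional block, its first dimension should line up with the memory transaction granularity. For example, with the L1 cache disabled, memory transactions are served in 32-byte segments; in that case, for a 2D block with the usual idx mapping, a first dimension of 32 (one full warp per row of the block) generally performs best.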
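
To see how the first block dimension changes what one warp touches, here is a small host-side C++ sketch. It takes the first warp of block (0, 0), applies the usual mapping index = iy * nx + ix, and counts how many separate runs of consecutive elements the warp accesses. The helper contiguous_runs and the row width nx are assumptions for illustration.

```cpp
#include <cassert>
#include <vector>

constexpr int kWarpSize = 32;  // assumed warp size

// Hypothetical helper: for the first warp of block (0,0), compute the linear
// element index each thread touches with index = iy * nx + ix, then count
// how many separate runs of consecutive indices the warp accesses.
// nx is the row width of the data in elements.
int contiguous_runs(int block_dim_x, int nx) {
    assert(block_dim_x > 0 && nx >= block_dim_x);
    std::vector<int> index(kWarpSize);
    for (int t = 0; t < kWarpSize; ++t) {
        int ix = t % block_dim_x;  // threadIdx.x (blockIdx.x == 0)
        int iy = t / block_dim_x;  // threadIdx.y (blockIdx.y == 0)
        index[t] = iy * nx + ix;
    }
    int runs = 1;
    for (int t = 1; t < kWarpSize; ++t) {
        if (index[t] != index[t - 1] + 1) ++runs;  // a gap starts a new run
    }
    return runs;
}
```

With 32 threads in x, the warp reads one run of 32 consecutive elements. With 8 threads in x, the same warp is spread over four rows and reads four separate runs of 8 elements, which generally needs more memory transactions.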

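Bandwidth and throughput

Bandwidth is the theoretical peak, while throughput is the value actually achieved.
Bandwidth usually describes the maximum possible amount of data transferred per unit of time, while throughput describes the rate at which any kind of information or operation is processed per unit of time.
For example, a highway toll station has 8 lanes, so its bandwidth is 8 cars per cycle. In practice, cars do not arrive in a steady stream: if 18 cars pass in 3 cycles, the throughput over that period is 6 cars per cycle.
In other words, it is the difference between the theoretical speed and the instantaneous or average speed.
Bandwidth is usually used when talking about memory, while throughput is usually used to describe the rate of data or instructions.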
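
The toll station example can be written out as a small C++ sketch. The function names are made up for this illustration; the numbers are the ones from the example above.

```cpp
#include <cassert>

// Peak rate (bandwidth): every lane serves one car per cycle.
double peak_per_cycle(int lanes) {
    assert(lanes > 0);
    return static_cast<double>(lanes);
}

// Achieved rate (throughput): what actually passed, averaged over the cycles.
double achieved_per_cycle(int cars_passed, int cycles) {
    assert(cycles > 0 && cars_passed >= 0);
    return static_cast<double>(cars_passed) / cycles;
}

// Fraction of the peak that was actually used.
double utilization(int lanes, int cars_passed, int cycles) {
    return achieved_per_cycle(cars_passed, cycles) / peak_per_cycle(lanes);
}
```

For 8 lanes and 18 cars in 3 cycles, the achieved rate is 6 cars per cycle against a peak of 8, i.e. 75% of the bandwidth is used. The same ratio applies when comparing a kernel's measured memory throughput with the device's peak memory bandwidth.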