- Identify where the parallelism is: which loop variable becomes the iteration variable of the parallel for
- Balance the load evenly across worker threads
- Use atomic operations instead of mutexes and signals whenever possible
- Try to use map-reduce and parallel sort to organize the data
For GPU
- Check shared memory usage per thread to see whether the GPU's SM processors can be fully utilized
- Check the number of registers and the amount of shared memory each kernel uses; both limit occupancy
- Optimize memory layout: pack your data structures; use block storage for large uniform structures (1D to nD matrices); if two variables are frequently read together, place them adjacent in memory
- Use a memory pool to reduce memory-allocation overhead
- Bit operations are important: packed bit representations save both memory and bandwidth
