What is MPS?
When to use MPS? (The Benefits of MPS)
GPU utilization
A single process may not utilize all the compute and memory-bandwidth capacity available on the GPU. MPS allows kernel and memcopy operations from different processes to overlap on the GPU, achieving higher utilization and shorter running times.
Reduced on-GPU context storage
Without MPS, each CUDA process using a GPU allocates separate storage and scheduling resources on the GPU. In contrast, the MPS server allocates one copy of GPU storage and scheduling resources shared by all its clients. Volta MPS supports increased isolation between MPS clients, so this resource reduction applies to a much lesser degree.
Reduced GPU context switching
Without MPS, when processes share the GPU their scheduling resources must be swapped on and off the GPU. The MPS server shares one set of scheduling resources between all of its clients, eliminating the overhead of swapping when the GPU is scheduling between those clients.
(If nothing is ever swapped in and out, does that mean the scheduling state of every client process must stay resident in GPU memory? And if so, what exactly stays resident, and could that incur a large memory overhead?)
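To make the client/server relationship above concrete, here is a minimal sketch of how MPS is typically enabled on a node. The device index 0 is an assumption; adjust it for your system. These commands require root (or appropriate permissions) and an NVIDIA GPU, so they are shown as a setup fragment rather than a runnable test.

```shell
# Target GPU 0 (assumption; pick the device you want to share).
export CUDA_VISIBLE_DEVICES=0

# Optionally restrict the GPU so only one context owner exists:
# the MPS server then owns the device on behalf of all clients.
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS

# Start the MPS control daemon in background mode.
# CUDA processes launched afterwards connect to it as MPS clients.
nvidia-cuda-mps-control -d

# ... run your CUDA application processes here ...

# Shut the daemon down when finished.
echo quit | nvidia-cuda-mps-control

# Restore the default compute mode if you changed it above.
nvidia-smi -i 0 -c DEFAULT
```

Because all clients funnel through one server context, the per-process storage and scheduling duplication described above disappears for the processes started while the daemon is running.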
Identifying Candidate Applications
MPS is useful when each application process does not generate enough work to saturate the GPU. Multiple processes can be run per node using MPS to enable more concurrency. Applications like this are identified by having a small number of blocks-per-grid.
Further, if the application shows low GPU occupancy because of a small number of threads-per-grid, performance improvements may be achievable with MPS. Using fewer blocks-per-grid and more threads-per-block in the kernel invocation, to increase the occupancy of each block, is recommended. MPS allows the leftover GPU capacity to be occupied by CUDA kernels running from other processes.
Q: If a kernel has a small blocks-per-grid and a large threads-per-block, is it a good fit for MPS? (Does TensorFlow expose an interface for controlling the grid, block, and thread configuration?)
A: The optimal launch configuration will depend on the specifics of the compute kernel. As such, there is no single value. You'll need to identify a particular operation of interest to investigate. As for the sources, some TF native kernels use functions from tensorflow/core/util/gpu_launch_config.h when determining their launch configuration. For XLA, you might start in tensorflow/compiler/xla/service/gpu/partition_assignment.h.
