NVIDIA Multi-Process Service (MPS)

This post introduces the role of MPS (Multi-Process Service) in CUDA programming, covering the control daemon, the client runtime, and the server process, and how MPS improves GPU utilization and reduces context storage and context-switching costs. It discusses the workloads MPS is suited to, as well as its limitations and best practices, such as the choice of GPU Compute Modes. It closes with the relationship between MPS and Hyper-Q and the hardware requirements.

What is MPS?

MPS is a binary-compatible client-server runtime implementation of the CUDA API that consists of several components.

Control Daemon Process – The control daemon is responsible for starting and stopping the server, as well as coordinating connections between clients and servers.

Client Runtime – The MPS client runtime is built into the CUDA driver library and may be used transparently by any CUDA application. (Are there any requirements on the CPU version?)

Server Process – The server is the clients' shared connection to the GPU and provides concurrency between clients.
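
To make the division of labor concrete, here is a minimal sketch of the standard workflow. The environment variables and the nvidia-cuda-mps-control invocation are the documented controls; the kernel and variable names are illustrative only.

// Start the MPS control daemon once per node, before any clients:
//   export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps   # pipes clients use to reach the daemon
//   export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log    # control daemon and server logs
//   nvidia-cuda-mps-control -d
//
// A client is then just an ordinary CUDA program: the client runtime
// built into the CUDA driver library routes its work through the
// shared MPS server process transparently.

#include <cuda_runtime.h>

__global__ void scaleAdd(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    scaleAdd<<<(n + 255) / 256, 256>>>(d, n);  // executed via the MPS server when the daemon is up
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}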

When to use MPS? (The Benefits of MPS)

GPU utilization

A single process may not utilize all the compute and memory-bandwidth capacity available on the GPU. MPS allows kernel and memcopy operations from different processes to overlap on the GPU, achieving higher utilization and shorter running times.
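
As a rough illustration of the kind of per-process workload that benefits, consider a client that launches only a few blocks. The kernel below is a made-up example; cudaDevAttrMultiProcessorCount is the standard attribute query.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] += 1.0f;
}

int main() {
    int smCount = 0;
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, 0);

    const int blocks = 4, threads = 256, n = blocks * threads;
    float *v = nullptr;
    cudaMalloc(&v, n * sizeof(float));

    // With only 4 blocks this process can occupy at most 4 SMs; the
    // rest sit idle. Under MPS, kernels and memcopies issued by other
    // client processes can run on the idle capacity concurrently.
    increment<<<blocks, threads>>>(v, n);
    cudaDeviceSynchronize();

    printf("SMs on device: %d, blocks launched: %d\n", smCount, blocks);
    cudaFree(v);
    return 0;
}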

Reduced on-GPU context storage

Without MPS, each CUDA process using a GPU allocates separate storage and scheduling resources on the GPU. In contrast, the MPS server allocates one copy of GPU storage and scheduling resources shared by all of its clients. Volta MPS supports increased isolation between MPS clients, so this resource reduction is to a much lesser degree.

Reduced GPU context switching

Without MPS, when processes share the GPU their scheduling resources must be swapped on and off the GPU. The MPS server shares one set of scheduling resources between all of its clients, eliminating the overhead of swapping when the GPU is scheduling between those clients.

(Question: if nothing is swapped in and out, does that mean the relevant state of every process (what exactly does that include?) must stay resident in GPU memory? Could that incur a large memory overhead?)

Identifying Candidate Applications

MPS is useful when each application process does not generate enough work to saturate the GPU. Multiple processes can be run per node using MPS to enable more concurrency. Applications like this are identified by having a small number of blocks-per-grid.

Further, if the application shows low GPU occupancy because of a small number of threads-per-grid, performance improvements may be achievable with MPS. Using fewer blocks-per-grid and more threads-per-block in the kernel invocation to increase the occupancy per block is recommended. MPS allows the leftover GPU capacity to be occupied by CUDA kernels running from other processes.
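
One way to reason about launch shapes is the runtime occupancy query. The kernel here is illustrative; cudaOccupancyMaxActiveBlocksPerMultiprocessor is the standard CUDA runtime API.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void work(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= 0.5f;
}

// Report how many resident blocks of a given size fit on one SM,
// a proxy for how much of each SM this launch shape can use alone.
static void report(int blockSize) {
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, work, blockSize, 0);
    printf("blockSize=%4d -> up to %d resident blocks per SM\n", blockSize, blocksPerSM);
}

int main() {
    // Preferring fewer, larger blocks shrinks the grid, so whole SMs
    // stay free for kernels launched by other MPS client processes.
    report(128);
    report(512);
    return 0;
}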

Q: A kernel with a small blocks-per-grid and a large threads-per-block is well suited to MPS. (Does TensorFlow provide an interface for controlling grid, block, and thread configuration?)

A: The optimal launch configuration will depend on the specifics of the compute kernel; as such, there is no single value. You'll need to identify a particular operation of interest to investigate. As for the sources, some TF native kernels use functions from tensorflow/core/util/gpu_launch_config.h when determining their launch configuration. For XLA, you might start in tensorflow/compiler/xla/service/gpu/partition_assignment.h.
