Origin of the problem:
When deploying PyTorch code, you will notice that as long as inference runs on the CPU only (no GPU), PyTorch occupies fairly little RAM when loaded and everything behaves normally; the moment the GPU is used, host memory usage balloons.
Problem description (from the upstream issue):
There is a huge RAM overhead when using the GPU, even for processing small tensors.
Here’s a standalone script:
# test.py
import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument('size', type=int)
args = parser.parse_args()

torch.set_grad_enabled(False)  # inference only; no autograd bookkeeping
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Smallest possible workload: a 1x1 convolution over a size x size image.
model = torch.nn.Conv2d(1, 1, 1).to(device)
x = torch.rand(1, 1, args.size, args.size).to(device)
y = model(x)
Recording using GNU time:
$ /usr/bin/time -v python test.py 100
Command being timed: "python test.py 100"
User time (seconds): 0.26
System time (seconds): 0.03
Percent of CPU this job got: 114%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.26
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1904088
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 16238
Voluntary context switches: 40
Involuntary context switches: 19
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
The line to pay attention to here is: Maximum resident set size (kbytes): 1904088. It takes roughly 2 GB of RAM simply to use the GPU to process a 100x100 image. In contrast, doing the same on the CPU:
$ CUDA_VISIBLE_DEVICES='' /usr/bin/time -v python test.py 100
Command being timed: "python test.py 100"
User time (seconds): 0.29
System time (seconds): 0.04
Percent of CPU this job got: 116%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.29
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 149352
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 16432
Voluntary context switches: 39
Involuntary context switches: 19
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
takes only ~150 MB. The original issue goes on to sweep the image size with a script like the one above and plots RAM usage vs. image size; the plot itself is not reproduced here.
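A minimal sketch of such a sweep (my addition, not from the issue), assuming test.py is the script above and GNU time is installed at /usr/bin/time:

# sweep_rss.py -- run test.py at several sizes and record peak RSS
import re
import subprocess

for size in [100, 500, 1000, 2000]:
    # GNU time -v prints "Maximum resident set size (kbytes): N" on stderr
    result = subprocess.run(
        ['/usr/bin/time', '-v', 'python', 'test.py', str(size)],
        capture_output=True, text=True,
    )
    m = re.search(r'Maximum resident set size \(kbytes\): (\d+)', result.stderr)
    print(size, int(m.group(1)) // 1024, 'MB')

Each measurement runs in a fresh child process, so the peak RSS reported by GNU time isolates the per-run overhead.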
Additional Notes
I’ve observed stranger behavior in the CPU curve: for small images, memory consumption grows roughly exponentially up to ~2 GB, then drops and grows linearly. I’m attempting to reproduce this behavior in a small, standalone script like the one above.
Reference for the explanation below:
https://github.com/pytorch/pytorch/issues/12873
Explanation
This is probably caused by the cuda runtime loading the kernel images.
I traced the library calls and found that a large amount of memory is allocated on the heap by the CUDA runtime when it initializes itself. Since the initialization is implicit, it only happens once you call a CUDA runtime function (not every function, though; you can trigger it with e.g. cudaHostAlloc). You don’t even have to create a CUDA tensor: create a CPU tensor and call pin_memory(), and you will get the same result.
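A quick way to check that claim, as a sketch (my addition, not from the issue; ru_maxrss being in kilobytes assumes Linux):

# pin_memory_rss.py -- pinning a CPU tensor is enough to initialize CUDA
import resource
import torch

def rss_mb():
    # On Linux, ru_maxrss is reported in kilobytes
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss // 1024

print('before pin_memory:', rss_mb(), 'MB')
x = torch.rand(10, 10).pin_memory()  # still a CPU tensor; pinning triggers CUDA init
print('after pin_memory: ', rss_mb(), 'MB')

If the explanation above is right, the RSS on a CUDA machine should jump by well over a gigabyte across the pin_memory() call, even though no CUDA tensor was created.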
You will find that cuDNN behaves similarly. As a simple test, you can write a C++ file:
#include "cudnn.h"

int main(int argc, const char* argv[]) {
    cudnnHandle_t cudnn;
    cudnnCreate(&cudnn);  // creating the handle triggers the large allocation
    while (1);            // keep the process alive so its RSS can be inspected
    return 0;
}
Compile it (note that you need to link against libcudnn) and run it: it consumes about 750 MB of RAM (my environment: cuDNN v7, GTX 1070).
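For reference, a compile-and-inspect sequence along these lines should work (the file name and the include/library paths are assumptions here and depend on your CUDA/cuDNN install):

$ g++ cudnn_test.cpp -o cudnn_test -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudnn
$ ./cudnn_test &
$ grep VmRSS /proc/$!/status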
PyTorch ships its own CUDA kernels; from my measurements, the CUDA runtime allocates ~1 GB of memory for them. If you compile PyTorch with cuDNN enabled, the total comes to roughly 1 GB + 750 MB + other overhead = 2 GB+.
Note that this is just my speculation, as there is no official documentation about this. What puzzles me is that the CUDA runtime allocates much more memory than the actual code size (the two are approximately linearly correlated: if I remove half of PyTorch’s kernels, memory usage also drops by half). I suspect the kernel binaries are either compressed on disk or must be post-processed by the runtime.
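As a rough way to compare the on-disk kernel size against that runtime allocation, you can list the shared libraries bundled with your PyTorch install (my addition; the library names and layout under torch/lib vary across PyTorch versions):

$ ls -lh "$(python -c 'import torch, os; print(os.path.dirname(torch.__file__))')/lib/"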