Excessive memory usage during PyTorch initialization

Where the problem comes from:

When deploying code with PyTorch, you may notice that as long as you run on the CPU only and never touch the GPU, PyTorch takes up little memory when it loads and everything looks normal; the excessive usage appears only once the GPU is used.

Here is the problem description:
There is a huge RAM overhead for using the GPU even for processing small tensors.

Here’s a standalone script:

# test.py
import torch
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('size', type=int)
args = parser.parse_args()

torch.set_grad_enabled(False)  # inference only, no autograd state

# A trivial 1x1 convolution applied to a size x size image,
# on the GPU if one is available.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = torch.nn.Conv2d(1, 1, 1).to(device)
x = torch.rand(1, 1, args.size, args.size).to(device)
y = model(x)

Recording using GNU time:

$ /usr/bin/time -v python test.py 100
        Command being timed: "python test.py 100"
        User time (seconds): 0.26
        System time (seconds): 0.03
        Percent of CPU this job got: 114%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.26
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1904088
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 16238
        Voluntary context switches: 40
        Involuntary context switches: 19
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

The line to pay attention to here is Maximum resident set size (kbytes): 1904088. It takes roughly 2 GB of RAM simply to use the GPU to process a 100x100 image. In contrast, doing the same on the CPU:

$ CUDA_VISIBLE_DEVICES='' /usr/bin/time -v python test.py 100
        Command being timed: "python test.py 100"
        User time (seconds): 0.29
        System time (seconds): 0.04
        Percent of CPU this job got: 116%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.29
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 149352
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 16432
        Voluntary context switches: 39
        Involuntary context switches: 19
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

takes only ~150 MB. Using a script that sweeps the image size, I constructed a plot of RAM usage vs. image size:

[Plot: RAM usage vs. image size; not reproduced here]
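The sweep script itself is not included in this excerpt; a hypothetical sketch of such a measurement, driving the test.py above through GNU time (assumed at /usr/bin/time), might look like this:

# Hypothetical sweep (not the issue author's original script): run test.py
# for several sizes under GNU time and report max RSS for each.
import re
import subprocess

for size in [100, 200, 400, 800, 1600]:
    result = subprocess.run(
        ["/usr/bin/time", "-v", "python", "test.py", str(size)],
        capture_output=True, text=True,
    )
    # GNU time writes its report to stderr
    rss_kb = int(re.search(r"Maximum resident set size \(kbytes\): (\d+)",
                           result.stderr).group(1))
    print(f"{size}: {rss_kb // 1024} MB")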

Additional Notes
I’ve observed stranger behavior in the CPU curve, where for small images the memory consumption grows exponentially up to ~2 GB, then drops and grows linearly. I’m attempting to reproduce this behavior in a small, standalone script like the one above.

Reference for the solution:

https://github.com/pytorch/pytorch/issues/12873

Explanation

This is probably caused by the CUDA runtime loading the kernel images.

I traced the library calls and found that a large amount of memory is allocated on the heap by the CUDA runtime when it initializes itself. Since the initialization is implicit, it only happens after you call a CUDA runtime function (not every function, though; you can trigger it with e.g. cudaMallocHost). You don’t even have to create a CUDA tensor: create a CPU tensor and call pin_memory(), and you will get the same result.
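As a quick check of that claim, a minimal sketch (assuming a CUDA-enabled PyTorch build and Linux, where ru_maxrss is reported in kilobytes) could look like this:

import resource
import torch

def max_rss_mb():
    # ru_maxrss is in kilobytes on Linux (bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

print(f"after import torch: {max_rss_mb():.0f} MB")
x = torch.empty(1).pin_memory()  # pinning implicitly initializes the CUDA runtime
print(f"after pin_memory:   {max_rss_mb():.0f} MB")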

You will find that cuDNN has similar behavior. As a simple test, you can write a C++ file:

#include "cudnn.h"
int main(int argc, const char* argv[]) {
  cudnnHandle_t cudnn;
  cudnnCreate(&cudnn);
  while(1);
  return 0;
}

Compile it (note that you need to link against libcudnn) and run it; you will find it consumes about 750 MB of RAM (my environment: cuDNN v7, GTX 1070).
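For reference, one way to compile and inspect it (cudnn_test.cpp is a placeholder filename; the paths assume a default CUDA install under /usr/local/cuda):

$ g++ cudnn_test.cpp -o cudnn_test -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudnn
$ ./cudnn_test &
$ grep VmRSS /proc/$!/status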

PyTorch has its own CUDA kernels. From my measurement, the CUDA runtime allocates ~1 GB of memory for them. If you compile PyTorch with cuDNN enabled, the total memory usage is 1 GB + 750 MB + others = 2 GB+.
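The ~1 GB figure can be reproduced in the same spirit; here is a sketch that forces CUDA initialization with no model and no tensors (again assuming Linux and a CUDA-enabled build; exact numbers vary with the CUDA/cuDNN versions and the GPU):

import resource
import torch

def max_rss_mb():
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024  # kilobytes on Linux

print(f"after import torch: {max_rss_mb():.0f} MB")
torch.cuda.init()  # force CUDA runtime/context initialization, no tensors involved
print(f"after cuda init:    {max_rss_mb():.0f} MB")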

Note that this is just my speculation, as there is no official documentation about this. What puzzles me is that the CUDA runtime allocates much more memory than the actual code size; the two are approximately linearly correlated (if I remove half of PyTorch’s kernels, the memory usage is also reduced by half). I suspect the kernel binaries are either compressed or have to be post-processed by the runtime.
