Origin of the problem:
When deploying PyTorch code, you will notice that as long as inference runs on the CPU only (no GPU), PyTorch occupies fairly little RAM when loaded and everything behaves normally; the moment the GPU is used, host memory usage balloons.
Problem description (from the upstream issue):
There is a huge RAM overhead when using the GPU, even for processing small tensors.
Here’s a standalone script:
# test.py
import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument('size', type=int)
args = parser.parse_args()

torch.set_grad_enabled(False)  # inference only; no autograd bookkeeping
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Smallest possible workload: a 1x1 convolution over a size x size image.
model = torch.nn.Conv2d(1, 1, 1).to(device)
x = torch.rand(1, 1, args.size, args.size).to(device)
y = model(x)
Recording using GNU time:
$ /usr/bin/time -v python test.py 100
Command being timed: "python test.py 100"
User time (seconds): 0.26
System time (seconds): 0.03
Percent of CPU this job got: 114%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.26
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1904088
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 16238
Voluntary context switches: 40
Involuntary context switches: 19
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
The line to pay attention to here is: Maximum resident set size (kbytes): 1904088. It takes roughly 2 GB of RAM simply to use the GPU to process a 100x100 image. In contrast, doing the same on the CPU:
$ CUDA_VISIBLE_DEVICES='' /usr/bin/time -v python test.py 100
Command being timed: "python test.py 100"
User time (seconds): 0.29
System time (seconds): 0.04
Percent of CPU this job got: 116%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.29
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 149352
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 16432
Voluntary context switches: 39
Involuntary context switches: 19
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
takes only ~150 MB. The original issue goes on to sweep the image size with a script like the one above and plots RAM usage vs. image size; the plot itself is not reproduced here.
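A minimal sketch of such a sweep (my addition, not from the issue), assuming test.py is the script above and GNU time is installed at /usr/bin/time:

# sweep_rss.py -- run test.py at several sizes and record peak RSS
import re
import subprocess

for size in [100, 500, 1000, 2000]:
    # GNU time -v prints "Maximum resident set size (kbytes): N" on stderr
    result = subprocess.run(
        ['/usr/bin/time', '-v', 'python', 'test.py', str(size)],
        capture_output=True, text=True,
    )
    m = re.search(r'Maximum resident set size \(kbytes\): (\d+)', result.stderr)
    print(size, int(m.group(1)) // 1024, 'MB')

Each measurement runs in a fresh child process, so the peak RSS reported by GNU time isolates the per-run overhead.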
Additional Notes
I’ve observed stranger behavior in the CPU curve: for small images, memory consumption grows roughly exponentially up to ~2 GB, then drops and grows linearly. I’m attempting to reproduce this behavior in a small, standalone script like the one above.
Reference for the explanation below:
https://github.com/pytorch/pytorch/issues/12873
Explanation
This is probably caused by the cuda runtime loading the kernel images.
I traced the library calls and found that a large amount of memory is allocated on the heap by the CUDA runtime when it initializes itself. Since the initialization is implicit, it only happens once you call a CUDA runtime function (not every function, though; you can trigger it with e.g. cudaHostAlloc). You don’t even have to create a CUDA tensor: create a CPU tensor and call pin_memory(), and you will get the same result.
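A quick way to check that claim, as a sketch (my addition, not from the issue; ru_maxrss being in kilobytes assumes Linux):

# pin_memory_rss.py -- pinning a CPU tensor is enough to initialize CUDA
import resource
import torch

def rss_mb():
    # On Linux, ru_maxrss is reported in kilobytes
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss // 1024

print('before pin_memory:', rss_mb(), 'MB')
x = torch.rand(10, 10).pin_memory()  # still a CPU tensor; pinning triggers CUDA init
print('after pin_memory: ', rss_mb(), 'MB')

If the explanation above is right, the RSS on a CUDA machine should jump by well over a gigabyte across the pin_memory() call, even though no CUDA tensor was created.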
You will find that cuDNN behaves similarly. As a simple test, you can write a C++ file:
#include "cudnn.h"

int main(int argc, const char* argv[]) {
    cudnnHandle_t cudnn;
    cudnnCreate(&cudnn);  // creating the handle triggers the large allocation
    while (1);            // keep the process alive so its RSS can be inspected
    return 0;
}
Compile it (note that you need to link against libcudnn) and run it: it consumes about 750 MB of RAM (my environment: cuDNN v7, GTX 1070).
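For reference, a compile-and-inspect sequence along these lines should work (the file name and the include/library paths are assumptions here and depend on your CUDA/cuDNN install):

$ g++ cudnn_test.cpp -o cudnn_test -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudnn
$ ./cudnn_test &
$ grep VmRSS /proc/$!/status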
PyTorch ships its own CUDA kernels; from my measurements, the CUDA runtime allocates ~1 GB of memory for them. If you compile PyTorch with cuDNN enabled, the total comes to roughly 1 GB + 750 MB + other overhead = 2 GB+.
Note that this is just my speculation, as there is no official documentation about this. What puzzles me is that the CUDA runtime allocates much more memory than the actual code size (the two are approximately linearly correlated: if I remove half of PyTorch’s kernels, memory usage also drops by half). I suspect the kernel binaries are either compressed on disk or must be post-processed by the runtime.
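As a rough way to compare the on-disk kernel size against that runtime allocation, you can list the shared libraries bundled with your PyTorch install (my addition; the library names and layout under torch/lib vary across PyTorch versions):

$ ls -lh "$(python -c 'import torch, os; print(os.path.dirname(torch.__file__))')/lib/"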