Model run fails with RuntimeError: CUDA out of memory. Tried to allocate 384.00 MiB (GPU 0; 31.75 GiB

Original post, published 2023-03-29 14:07:00 · 1.7k views

Licensed under CC 4.0 BY-SA

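While building a multi-class classification model, I hit a CUDA out-of-memory error: the model would not run even though idle GPUs were available, and the error message reported a failed memory allocation. The fix is to set the CUDA_VISIBLE_DEVICES environment variable to choose which GPUs to use, and to pass the correct device_ids when using nn.DataParallel. When a specified GPU is already occupied, the allocation fails and the error is reported against GPU 0. Make sure the GPUs you end up using, whether chosen explicitly or by default, are idle.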

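While working on a multi-class classification model recently, I ran into the bug below: the server has several GPUs and some of them were idle, yet the model would not run.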

RuntimeError: CUDA out of memory. Tried to allocate 384.00 MiB (GPU 0; 31.75 GiB total capacity; 20.64 GiB already allocated; 265.75 MiB free; 20.75 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

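It turned out the cause was that I had not specified which GPU to use. Sometimes, though, the model still failed even after I had specified one. I later learned that if a GPU you specify is already occupied, the error still names GPU 0 as the device that cannot allocate memory. This is because the visible devices are renumbered from 0, so "GPU 0" in the message is the first GPU in your list, not necessarily physical GPU 0.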
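
One way to avoid pointing at a busy GPU is to check free memory before choosing which devices to expose. The following is a minimal sketch and not part of the original post: it assumes `nvidia-smi` is on the PATH, and the helper names `parse_free_memory`, `pick_idle_gpus` and `query_free_memory`, as well as the threshold used in the note below, are hypothetical choices for illustration.

```python
import subprocess


def parse_free_memory(csv_text):
    """Parse 'index, memory.free' CSV rows into {gpu_index: free_MiB}."""
    free = {}
    for line in csv_text.strip().splitlines():
        index, mem = (field.strip() for field in line.split(","))
        free[int(index)] = int(mem)
    return free


def pick_idle_gpus(free, count, min_free_mib):
    """Return up to `count` GPU indices with at least `min_free_mib` MiB free,
    ordered from most free memory to least."""
    candidates = [i for i, mem in free.items() if mem >= min_free_mib]
    candidates.sort(key=lambda i: free[i], reverse=True)
    return candidates[:count]


def query_free_memory():
    """Ask nvidia-smi for free memory per GPU (needs an NVIDIA driver installed)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.free",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_free_memory(out)
```

On a GPU machine, something like `pick_idle_gpus(query_free_memory(), count=2, min_free_mib=20000)` would return the two emptiest GPUs with at least roughly 20 GiB free; joining those indices with commas gives a value for CUDA_VISIBLE_DEVICES, which should be set before the first CUDA call.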

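import os

# Use physical GPUs 2, 3, 6 and 7 of the server's 8 (numbered from 0).
# Set this before the first CUDA call; the safest place is before importing torch.
os.environ['CUDA_VISIBLE_DEVICES'] = '2,3,6,7'

import torch
import torch.nn as nn

device = torch.device('cuda')

# `model` is the nn.Module built earlier in your script.
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), 'GPUs!')
    # device_ids index into the visible list: 0 and 1 here are physical GPUs 2 and 3.
    model = nn.DataParallel(model, device_ids=[0, 1])
model.to(device)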
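
The `device_ids=[0,1]` argument above uses logical indices, which count from 0 within whatever CUDA_VISIBLE_DEVICES exposes. As a rough illustration of that renumbering, here is a small sketch; the helper `visible_to_physical` is hypothetical and assumes the variable holds plain integer indices rather than GPU UUIDs.

```python
def visible_to_physical(visible_devices):
    """Map logical CUDA indices to physical GPU IDs for a CUDA_VISIBLE_DEVICES value.

    Assumes a comma-separated list of integer indices (no UUIDs).
    """
    ids = [part.strip() for part in visible_devices.split(",") if part.strip()]
    return {logical: int(physical) for logical, physical in enumerate(ids)}


mapping = visible_to_physical("2,3,6,7")
for logical, physical in mapping.items():
    print(f"cuda:{logical} -> physical GPU {physical}")
```

With the value '2,3,6,7', logical devices 0 and 1 correspond to physical GPUs 2 and 3, so those are the two cards that need to be idle for the DataParallel call to succeed.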
