Ubuntu 18.04 with two GPUs: notes on the pitfalls I ran into.

Latest recommended article published on 2024-03-20 14:26:38

Licensed under CC 4.0 BY-SA

Tags:

#pytorch #deep-learning #artificial-intelligence

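This post covers upgrading CUDA to version 11 for PyTorch, handling the "imbalance between your GPUs" warning, choosing batch_size when training on multiple GPUs, fixing the call order of lr_scheduler and optimizer, keeping GPU temperature in check, and selecting a specific GPU. It also gives configuration advice and caveats for running two different GPU models side by side.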

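Upgrading the CUDA version

The RTX 3090 is only supported from CUDA 11 onward. I had been using CUDA 10.2 all along, so an upgrade was required.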
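
After upgrading, it can help to confirm which CUDA version PyTorch was actually built against, since that is what matters at run time. The following is a minimal sketch, assuming PyTorch is installed; the helper names `cuda_version_at_least` and `report` are made up for illustration, and the threshold of 11.0 simply mirrors the "CUDA 11" requirement mentioned above.

```python
def cuda_version_at_least(version, minimum=(11, 0)):
    """Return True if a version string such as '11.3' is >= minimum."""
    if not version:  # torch.version.cuda is None on CPU-only builds
        return False
    parts = [int(p) for p in version.split(".")[:2]]
    while len(parts) < 2:  # treat '11' as '11.0'
        parts.append(0)
    return tuple(parts) >= minimum


def report():
    """Print the CUDA version PyTorch was built with and every visible GPU."""
    import torch  # imported here so the helper above works without PyTorch

    built_with = torch.version.cuda
    print("PyTorch built with CUDA:", built_with)
    print("CUDA 11 or newer:", cuda_version_at_least(built_with))
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            name = torch.cuda.get_device_name(i)
            major, minor = torch.cuda.get_device_capability(i)
            print(f"GPU {i}: {name}, compute capability {major}.{minor}")
```

Calling `report()` on the training machine prints one line per GPU, which is also a quick check that both cards are visible to PyTorch.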

The "imbalance between your GPUs" warning

The following warning appears:

There is an imbalance between your GPUs. You may want to exclude GPU 0 which has less than 75% of the memory or cores of GPU 1. You can do so by setting the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES environment variable.

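You can follow this link and use the line of code below. In my case, though, the warning simply went away once I ran a different codebase...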

import torch.nn as nn
net = nn.DataParallel(model.cuda(), device_ids=[0, 1])  # model is your own nn.Module
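
The warning is triggered when one GPU has less than 75% of the memory or multiprocessor count of another, and its own suggestion is to leave the weaker card out of device_ids. As a rough sketch of that idea (not the code I ended up running), the hypothetical helper `balanced_device_ids` below keeps only the GPUs that reach 75% of the best card on both measures. `wrap_model` is likewise an illustrative name and assumes at least one CUDA GPU is present.

```python
def balanced_device_ids(props, threshold=0.75):
    """props: one (total_memory, multi_processor_count) tuple per GPU.

    Returns the indices of GPUs that reach `threshold` of the best GPU
    on both memory and multiprocessor count.
    """
    max_mem = max(mem for mem, _ in props)
    max_sms = max(sms for _, sms in props)
    return [
        i for i, (mem, sms) in enumerate(props)
        if mem / max_mem >= threshold and sms / max_sms >= threshold
    ]


def wrap_model(model):
    """Wrap `model` in DataParallel using only the balanced GPUs."""
    import torch
    import torch.nn as nn

    props = []
    for i in range(torch.cuda.device_count()):
        p = torch.cuda.get_device_properties(i)
        props.append((p.total_memory, p.multi_processor_count))
    ids = balanced_device_ids(props)
    # DataParallel expects the module's parameters on device_ids[0].
    return nn.DataParallel(model.cuda(ids[0]), device_ids=ids)


# Illustrative numbers only: an 11 GiB / 68 SM card next to a 24 GiB / 82 SM card.
print(balanced_device_ids([(11, 68), (24, 82)]))  # [1]
```

The trade-off is that the smaller card sits idle, so this only makes sense when the imbalance is actually causing out-of-memory errors or slowdowns.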

Setting up multiple GPUs in PyTorch

For model training, see this link.
For loading a saved model, see this link.

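Choosing batch_size

With multiple GPUs, the batch_size I set is what each GPU is assigned, so when setting hyperparameters just pass the usual batch_size rather than batch_size * number of GPUs. Whether this applies depends on how the code parallelizes: it holds when each GPU has its own DataLoader, as with DistributedDataParallel, whereas nn.DataParallel splits the batch it receives across the GPUs. Multiplying anyway may produce the following error: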

RuntimeError: CUDA out of memory. Tried to allocate 1.30 GiB (GPU 1; 10.76 GiB total capacity; 6.51 GiB already allocated; 1.18 GiB free; 8.12 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
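
For the nn.DataParallel case, where one batch is split across the GPUs along dimension 0, the share each GPU receives can be estimated before training. This is only a sketch: `per_gpu_batch_sizes` is a hypothetical helper that imitates an even chunked split rather than calling PyTorch, and it ignores the extra memory the first GPU needs for gathering outputs.

```python
import math


def per_gpu_batch_sizes(batch_size, num_gpus):
    """Sizes of the chunks when a batch is split evenly along dim 0."""
    chunk = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


print(per_gpu_batch_sizes(64, 2))   # [32, 32]
print(per_gpu_batch_sizes(128, 2))  # [64, 64]
```

With two cards of different memory sizes, the smaller card has to hold the same share as the larger one, so it is the smaller card that runs out of memory first.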

UserWarning: Detected call of lr_scheduler.step() before optimizer.step()

The following warning appears:

UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate

Following this link, the fix is to swap the two calls so that optimizer.step() runs before scheduler.step().
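
A minimal sketch of the corrected order, assuming a per-epoch scheduler: the model, the random data, and the StepLR settings are made up for illustration and are not from my training code.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Halve the learning rate after every epoch.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)

# Made-up data: 2 batches of 8 samples each.
batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(2)]

lrs_used = []
for epoch in range(3):
    # Record the learning rate in effect during this epoch.
    lrs_used.append(optimizer.param_groups[0]["lr"])
    for x, y in batches:
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()  # update the weights first
    scheduler.step()      # then advance the schedule, once per epoch

print(lrs_used)
```

Because the scheduler only advances after the weights have been updated, the first epoch really does train at the initial learning rate of 0.1 instead of skipping it.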

GPU temperature

65-75 °C is a good range. You can check the temperature with the nvidia-smi command.
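
To check the temperature from a script rather than by eye, nvidia-smi can print just the fields of interest as CSV. The sketch below assumes nvidia-smi is on the PATH; the function names, the 75 °C limit (taken from the range above), and the sample text are illustrative, not real readings.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_temperatures(csv_text):
    """Parse 'index, name, temperature' lines into (index, name, temp_c) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, rest = line.split(",", 1)
        name, temp = rest.rsplit(",", 1)
        rows.append((int(index), name.strip(), int(temp)))
    return rows


def read_gpu_temperatures():
    """Run nvidia-smi and return the parsed rows, or [] if it is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_temperatures(result.stdout)


def too_hot(rows, limit_c=75):
    """Return the rows whose temperature exceeds limit_c."""
    return [row for row in rows if row[2] > limit_c]


# Made-up sample in the same CSV layout, to show the parsing step.
sample = "0, GeForce RTX 3090, 68\n1, Other GPU, 79\n"
print(too_hot(parse_temperatures(sample)))
```

On the real machine, `too_hot(read_gpu_temperatures())` could be called periodically during training to flag a card that is running above the range.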

Using one specific GPU

Following this link, put the lines below at the top of the file:

import os
gpu_id = "0"  # ID of the GPU to use, as a string
os.environ["CUDA_VISIBLE_DEVICES"] = gpu_id  # set this before CUDA is initialized

gpu_id is the ID number of the GPU, which you can look up by running nvidia-smi in a terminal.
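
One caveat with two different cards: by default CUDA may number GPUs fastest-first, which does not always match the order nvidia-smi shows. Setting CUDA_DEVICE_ORDER to PCI_BUS_ID makes the two agree. The sketch below wraps both settings in a hypothetical helper, `select_gpus`, which should be called before torch is imported.

```python
import os


def select_gpus(*gpu_ids):
    """Expose only the given GPUs to CUDA; call before importing torch."""
    # Number GPUs by PCI bus position, the same order nvidia-smi lists them in.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Only physical GPU 1 stays visible; inside this process it is addressed as cuda:0.
select_gpus(1)
```

Note the renumbering: once a single GPU is selected this way, the code should refer to it as device 0, whatever its ID was in nvidia-smi.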

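Do two GPUs of different models conflict?

According to this link, if the cards are only used to run deep learning code, it is enough that the framework version supports both of them.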
