PyTorch details: multi-GPU training, and a DDP demo that runs standalone on a single GPU under Windows

This post provides a complete example of training with DistributedDataParallel (DDP) in a single-GPU environment, covering model definition, data loading, and the training loop, and it works through the problems you may run into on Windows and on Linux.

A vanilla example of DDP on Linux

# main.py
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.optim as optim

def main_worker(local_rank, nprocs, args):
    # one process per GPU: local_rank is both the process index and the GPU index
    dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456',
                            world_size=nprocs, rank=local_rank)
    torch.cuda.set_device(local_rank)

    train_dataset = ...
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler)

    model = ...
    model = model.cuda(local_rank)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

    criterion = ...
    optimizer = optim.SGD(model.parameters(), lr=...)

    for epoch in range(100):
        train_sampler.set_epoch(epoch)
        for batch_idx, (images, target) in enumerate(train_loader):
            images = images.cuda(non_blocking=True)
            target = target.cuda(non_blocking=True)
            ...
            output = model(images)
            loss = criterion(output, target)
            ...
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

if __name__ == '__main__':
    # spawn one worker process per GPU; each worker receives its index as the first argument
    mp.spawn(main_worker, nprocs=4, args=(4, myargs))
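For orientation: torch.multiprocessing.spawn starts nprocs worker processes and passes each one its own process index as the first positional argument, which is why main_worker receives local_rank before the values packed into args. A minimal sketch of that calling convention:

# mp.spawn(fn, nprocs=N, args=(a, b)) starts N processes and calls
#   fn(0, a, b), fn(1, a, b), ..., fn(N-1, a, b)
# so main_worker's first parameter is the per-node process index (the local rank)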

Errors under Windows

1. 'CUDA_VISIBLE_DEVICES' is not recognized as an internal or external command. Windows cmd does not accept the Linux-style CUDA_VISIBLE_DEVICES=0 python ... prefix, so set the variable inside the script instead:

import os
os.environ["CUDA_VISIBLE_DEVICES"]='0'
# os.environ["CUDA_VISIBLE_DEVICES"]='0,1,2,3'
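One detail not spelled out above: the environment variable has to be set before the process first touches CUDA, otherwise PyTorch has already enumerated the devices and the restriction is ignored. A minimal sketch:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = '0'   # set this before any CUDA call

import torch
print(torch.cuda.device_count())           # now reports only the visible device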

2. EOFError: Ran out of input
Fix: wherever torch.utils.data.DataLoader is constructed, set num_workers=0. On Windows the loader's worker processes are created with spawn and have to pickle the dataset, which fails in this setup, so disabling the workers sidesteps the error.

# num_workers=0,
    train_loader = torch.utils.data.DataLoader(train_dataset,
                                               batch_size=args.batch_size,
                                               num_workers=0,
                                               pin_memory=False,
                                               sampler=train_sampler)

3. RuntimeError: cannot pin 'torch.cuda.FloatTensor' only dense CPU tensors can be pinned. In the single-GPU demo at the end of this post the DataLoader raises this when pin_memory=True, because the dataset tensors xb and yb are created directly on the GPU and only dense CPU tensors can be pinned; set pin_memory=False:

# pin_memory=False,
    train_loader = torch.utils.data.DataLoader(train_dataset,
                                               batch_size=args.batch_size,
                                               num_workers=0,
                                               pin_memory=False,
                                               sampler=train_sampler)
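If pinned memory is actually wanted, an alternative sketch (my suggestion, not part of the original fix) is to keep the dataset tensors on the CPU, where they can be pinned, and move each batch to the GPU inside the training loop:

xb = torch.randn(65536, 2)   # no .cuda() here: keep the raw data on the CPU
yb = torch.randn(65536, 3)
train_dataset = MyDataSet(xb, yb)
train_loader = torch.utils.data.DataLoader(train_dataset,
                                           batch_size=args.batch_size,
                                           num_workers=0,
                                           pin_memory=True,   # legal now: dense CPU tensors
                                           sampler=train_sampler)
# inside the loop: images = images.cuda(local_rank, non_blocking=True)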
4. Windows does not support the NCCL backend, so use backend='gloo'. This error can also be avoided by not enabling DDP at all, but that does not apply here.
    dist.init_process_group(backend='gloo',
                            init_method='tcp://127.0.0.1:23456',
                            world_size=args.nprocs,
                            rank=local_rank)
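Since the backend string is the only difference from the Linux call, one option (an assumption about how you might organize it, not something the original does) is to pick it from the platform:

import sys
backend = 'gloo' if sys.platform == 'win32' else 'nccl'   # NCCL is not available on Windows
dist.init_process_group(backend=backend,
                        init_method='tcp://127.0.0.1:23456',
                        world_size=args.nprocs,
                        rank=local_rank)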

A DDP demo that runs standalone on a single GPU under Windows

Note: because this example has only one GPU, args.nprocs = torch.cuda.device_count() evaluates to 1, and that value becomes the world_size passed to init_process_group, i.e. the total number of processes taking part in training.
[Meaning of the init_process_group arguments](https://blog.youkuaiyun.com/m0_37400316/article/details/107225030)
import torch.multiprocessing as mp
import torch.utils.data.distributed

import argparse
import os
os.environ["CUDA_VISIBLE_DEVICES"]='0'
import random
import shutil
import time
import warnings
import torch
import torch.nn as nn
import torch.nn.parallel
import torch.backends.cudnn as cudnn
import torch.distributed as dist
import torch.optim
import torch.utils.data
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models

import numpy as np
import os, imageio
xb = torch.randn(65536, 2).cuda()
yb = torch.randn(65536, 3).cuda()

from torch.utils.data.dataset import Dataset
class MyDataSet(Dataset):
    def __init__(self, data, label):  # the arguments are the data tensor (data) and the label tensor (label)
        self.data = data
        self.label = label
        self.length = data.shape[0]
    def __getitem__(self, mask):  # returns one sample; the argument is an index (called mask here) into the dataset; the DataLoader generates the indices itself, so we only need to implement the lookup
        label = self.label[mask]
        data = self.data[mask]
        return data, label  # note that the return order (data, label) also matters
    def __len__(self):
        return self.length

class MLP(nn.Module):
    def __init__(self,depth=4,mapping_size=2,hidden_size=256):
        super().__init__()
        layers = []
        layers.append(nn.Linear(mapping_size,hidden_size))
        layers.append(nn.ReLU(inplace=True))
        for _ in range(depth-2):
            layers.append(nn.Linear(hidden_size,hidden_size))
            layers.append(nn.ReLU(inplace=True))
        layers.append(nn.Linear(hidden_size,3))
        self.layers = nn.Sequential(*layers)
    def forward(self,x):
        return torch.sigmoid(self.layers(x))



model_names = sorted(name for name in models.__dict__
                     if name.islower() and not name.startswith("__")
                     and callable(models.__dict__[name]))

parser = argparse.ArgumentParser(description='PyTorch ImageNet Training')
parser.add_argument('--data',metavar='DIR',default='/home/zhangzhi/Data/exports/ImageNet2012',help='path to dataset')
parser.add_argument('-a','--arch',metavar='ARCH',default='resnet18',choices=model_names,help='model architecture: ' + ' | '.join(model_names) +
' (default: resnet18)')
parser.add_argument('-j','--workers',default=4,type=int,metavar='N',help='number of data loading workers (default: 4)')
parser.add_argument('--epochs',default=90,type=int,metavar='N',help='number of total epochs to run')
parser.add_argument('--start-epoch',default=0,type=int,metavar='N',help='manual epoch number (useful on restarts)')
parser.add_argument('-b','--batch-size',default=3200,type=int,metavar='N',help='mini-batch size (default: 256), this is the total '
                    'batch size of all GPUs on the current node when '
                    'using Data Parallel or Distributed Data Parallel')
parser.add_argument('--lr','--learning-rate',default=0.1,type=float,metavar='LR',help='initial learning rate',dest='lr')
parser.add_argument('--momentum',default=0.9,type=float,metavar='M',help='momentum')
parser.add_argument('--wd','--weight-decay',default=1e-4,type=float,metavar='W',help='weight decay (default: 1e-4)',dest='weight_decay')
parser.add_argument('-p','--print-freq',default=10,type=int,metavar='N',help='print frequency (default: 10)')
parser.add_argument('-e','--evaluate',dest='evaluate',action='store_true',help='evaluate model on validation set')
parser.add_argument('--pretrained',dest='pretrained',action='store_true',help='use pre-trained model')
parser.add_argument('--seed',default=None,type=int,help='seed for initializing training. ')


def reduce_mean(tensor, nprocs):
    rt = tensor.clone()
    dist.all_reduce(rt, op=dist.ReduceOp.SUM)
    rt /= nprocs
    return rt
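# note: reduce_mean is not called anywhere in this single-GPU demo; in a multi-process
# run it would typically be used to average a metric across ranks, e.g.
#   reduced_loss = reduce_mean(loss, args.nprocs)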


def main():
    args = parser.parse_args()
    args.nprocs = torch.cuda.device_count()
    mp.spawn(main_worker, nprocs=args.nprocs, args=(args.nprocs, args))


def main_worker(local_rank, nprocs, args):
    args.local_rank = local_rank

    if args.seed is not None:
        random.seed(args.seed)
        torch.manual_seed(args.seed)
        cudnn.deterministic = True
        warnings.warn('You have chosen to seed training. '
                      'This will turn on the CUDNN deterministic setting, '
                      'which can slow down your training considerably! '
                      'You may see unexpected behavior when restarting '
                      'from checkpoints.')

    best_acc1 = .0
    # dist.init_process_group(backend='nccl',
    #                         init_method='tcp://127.0.0.1:23456',
    #                         world_size=args.nprocs,
    #                         rank=local_rank)
    dist.init_process_group(backend='gloo',
                            init_method='tcp://127.0.0.1:23456',
                            world_size=args.nprocs,
                            rank=local_rank)
    # create model
    model = MLP()
    torch.cuda.set_device(local_rank)
    model.cuda(local_rank)
    # When using a single GPU per process and per DistributedDataParallel,
    # we need to divide the batch size ourselves based on the total number of GPUs we have
    args.batch_size = 1204
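    # (the comment above suggests dividing the total batch size by the number of
    #  processes, e.g. args.batch_size = args.batch_size // nprocs; this demo instead
    #  hardcodes a fixed per-process batch size)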
    model = torch.nn.parallel.DistributedDataParallel(model,
                                                      device_ids=[local_rank])

    # define loss function (criterion) and optimizer
    # note: yb holds continuous (N, 3) float targets; on older PyTorch versions
    # CrossEntropyLoss raises "multi-target not supported" with such targets
    # (see the Linux warning below), in which case switch to nn.MSELoss()
    # as the Linux listing does
    criterion = nn.CrossEntropyLoss().cuda(local_rank)
    optimizer = torch.optim.SGD(model.parameters(),
                                args.lr,
                                momentum=args.momentum,
                                weight_decay=args.weight_decay)

    cudnn.benchmark = True

    train_dataset =  MyDataSet(xb,yb)
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    train_loader = torch.utils.data.DataLoader(train_dataset,
                                               batch_size=args.batch_size,
                                               num_workers=0,
                                               pin_memory=False,
                                               sampler=train_sampler)

    for epoch in range(args.start_epoch, args.epochs):

        train_sampler.set_epoch(epoch)
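        # set_epoch above reseeds the DistributedSampler's shuffle so each epoch uses a
        # different ordering; the sampler itself hands each rank a disjoint shard of the data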
        adjust_learning_rate(optimizer, epoch, args)
        # train for one epoch
        train(train_loader, model, criterion, optimizer, epoch, local_rank,
              args)

def train(train_loader, model, criterion, optimizer, epoch, local_rank, args):
    batch_time = AverageMeter('Time', ':6.3f')
    data_time = AverageMeter('Data', ':6.3f')
    losses = AverageMeter('Loss', ':.4e')
    top1 = AverageMeter('Acc@1', ':6.2f')
    top5 = AverageMeter('Acc@5', ':6.2f')
    progress = ProgressMeter(len(train_loader),
                             [batch_time, data_time, losses, top1, top5],
                             prefix="Epoch: [{}]".format(epoch))

    # switch to train mode
    model.train()

    end = time.time()
    for i, (images, target) in enumerate(train_loader):
        # measure data loading time
        data_time.update(time.time() - end)

        images = images.cuda(local_rank, non_blocking=True)
        target = target.cuda(local_rank, non_blocking=True)

        # compute output
        output = model(images)
        loss = criterion(output, target)
        print(loss)

        # compute gradient and do SGD step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()



class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self, name, fmt=':f'):
        self.name = name
        self.fmt = fmt
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

    def __str__(self):
        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
        return fmtstr.format(**self.__dict__)


class ProgressMeter(object):
    def __init__(self, num_batches, meters, prefix=""):
        self.batch_fmtstr = self._get_batch_fmtstr(num_batches)
        self.meters = meters
        self.prefix = prefix

    def display(self, batch):
        entries = [self.prefix + self.batch_fmtstr.format(batch)]
        entries += [str(meter) for meter in self.meters]
        print('\t'.join(entries))

    def _get_batch_fmtstr(self, num_batches):
        num_digits = len(str(num_batches // 1))
        fmt = '{:' + str(num_digits) + 'd}'
        return '[' + fmt + '/' + fmt.format(num_batches) + ']'


def adjust_learning_rate(optimizer, epoch, args):
    """Sets the learning rate to the initial LR decayed by 10 every 30 epochs"""
    lr = args.lr * (0.1**(epoch // 30))
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr



if __name__ == '__main__':
    main()

References

DDP tutorial: https://github.com/rentainhe/pytorch-distributed-training
DDP mode
PyTorch distributed series 3: what does torch.utils.data.distributed.DistributedSampler do during distributed training? https://zhuanlan.zhihu.com/p/98535650

A warning when running on Linux

Computing the cross-entropy loss raises the exception RuntimeError: multi-target not supported at /opt/conda/conda-bld/pytorch_1549635019666/work/aten/src/THNN/generic/ClassNLLCriterion.c:21. The cause is that nn.CrossEntropyLoss expects integer class indices of shape (N,), while the targets yb here are float vectors of shape (N, 3); the Linux listing below therefore switches to nn.MSELoss.
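A short sketch of the shape mismatch (the tensor sizes follow the demo, everything else is illustrative):

import torch
import torch.nn as nn

output = torch.randn(4, 3)                # model output, shape (N, 3)
continuous_target = torch.randn(4, 3)     # like yb: float values, shape (N, 3)
class_target = torch.randint(0, 3, (4,))  # what CrossEntropyLoss expects: class indices, shape (N,)

nn.MSELoss()(torch.sigmoid(output), continuous_target)   # fine, and what this demo actually needs
nn.CrossEntropyLoss()(output, class_target)              # fine
# nn.CrossEntropyLoss()(output, continuous_target)
#   -> raises the "multi-target not supported" error above on older PyTorch versions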


On Linux the same demo runs after a few small changes: two visible GPUs, the NCCL backend, and MSELoss as the criterion.

import torch.multiprocessing as mp
import torch.utils.data.distributed

import argparse
import os
os.environ["CUDA_VISIBLE_DEVICES"]='0,1'
import random
import shutil
import time
import warnings
import torch
import torch.nn as nn
import torch.nn.parallel
import torch.backends.cudnn as cudnn
import torch.distributed as dist
import torch.optim
import torch.utils.data
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models

import numpy as np
import os, imageio

xb = torch.randn(65536, 2).cuda()
yb = torch.randn(65536, 3).cuda()

from torch.utils.data.dataset import Dataset
class MyDataSet(Dataset):
    def __init__(self, data, label):  # the arguments are the data tensor (data) and the label tensor (label)
        self.data = data
        self.label = label
        self.length = data.shape[0]
    def __getitem__(self, mask):  # returns one sample; the argument is an index (called mask here) into the dataset; the DataLoader generates the indices itself, so we only need to implement the lookup
        label = self.label[mask]
        data = self.data[mask]
        return data, label  # note that the return order (data, label) also matters
    def __len__(self):
        return self.length

class MLP(nn.Module):
    def __init__(self,depth=4,mapping_size=2,hidden_size=256):
        super().__init__()
        layers = []
        layers.append(nn.Linear(mapping_size,hidden_size))
        layers.append(nn.ReLU(inplace=True))
        for _ in range(depth-2):
            layers.append(nn.Linear(hidden_size,hidden_size))
            layers.append(nn.ReLU(inplace=True))
        layers.append(nn.Linear(hidden_size,3))
        self.layers = nn.Sequential(*layers)
    def forward(self,x):
        return torch.sigmoid(self.layers(x))



model_names = sorted(name for name in models.__dict__
                     if name.islower() and not name.startswith("__")
                     and callable(models.__dict__[name]))

parser = argparse.ArgumentParser(description='PyTorch ImageNet Training')
parser.add_argument('--data',metavar='DIR',default='/home/zhangzhi/Data/exports/ImageNet2012',help='path to dataset')
parser.add_argument('-a','--arch',metavar='ARCH',default='resnet18',choices=model_names,help='model architecture: ' + ' | '.join(model_names) +
' (default: resnet18)')
parser.add_argument('-j','--workers',default=4,type=int,metavar='N',help='number of data loading workers (default: 4)')
parser.add_argument('--epochs',default=90,type=int,metavar='N',help='number of total epochs to run')
parser.add_argument('--start-epoch',default=0,type=int,metavar='N',help='manual epoch number (useful on restarts)')
parser.add_argument('-b','--batch-size',default=3200,type=int,metavar='N',help='mini-batch size (default: 256), this is the total '
                    'batch size of all GPUs on the current node when '
                    'using Data Parallel or Distributed Data Parallel')
parser.add_argument('--lr','--learning-rate',default=0.1,type=float,metavar='LR',help='initial learning rate',dest='lr')
parser.add_argument('--momentum',default=0.9,type=float,metavar='M',help='momentum')
parser.add_argument('--wd','--weight-decay',default=1e-4,type=float,metavar='W',help='weight decay (default: 1e-4)',dest='weight_decay')
parser.add_argument('-p','--print-freq',default=10,type=int,metavar='N',help='print frequency (default: 10)')
parser.add_argument('-e','--evaluate',dest='evaluate',action='store_true',help='evaluate model on validation set')
parser.add_argument('--pretrained',dest='pretrained',action='store_true',help='use pre-trained model')
parser.add_argument('--seed',default=None,type=int,help='seed for initializing training. ')


def reduce_mean(tensor, nprocs):
    rt = tensor.clone()
    dist.all_reduce(rt, op=dist.ReduceOp.SUM)
    rt /= nprocs
    return rt


def main():
    args = parser.parse_args()
    args.nprocs = torch.cuda.device_count()
    mp.spawn(main_worker, nprocs=args.nprocs, args=(args.nprocs, args))


def main_worker(local_rank, nprocs, args):
    args.local_rank = local_rank

    if args.seed is not None:
        random.seed(args.seed)
        torch.manual_seed(args.seed)
        cudnn.deterministic = True
        warnings.warn('You have chosen to seed training. '
                      'This will turn on the CUDNN deterministic setting, '
                      'which can slow down your training considerably! '
                      'You may see unexpected behavior when restarting '
                      'from checkpoints.')

    best_acc1 = .0
    dist.init_process_group(backend='nccl',
                            init_method='tcp://127.0.0.1:23456',
                            world_size=args.nprocs,
                            rank=local_rank)

    # create model
    model = MLP()
    torch.cuda.set_device(local_rank)
    model.cuda(local_rank)
    # When using a single GPU per process and per DistributedDataParallel,
    # we need to divide the batch size ourselves based on the total number of GPUs we have
    args.batch_size = 1204
    model = torch.nn.parallel.DistributedDataParallel(model,
                                                      device_ids=[local_rank])

    # define loss function (criterion) and optimizer
    # criterion = nn.CrossEntropyLoss().cuda(local_rank)  # raises "multi-target not supported" with these continuous targets
    criterion = nn.MSELoss().cuda(local_rank)
    optimizer = torch.optim.SGD(model.parameters(),
                                args.lr,
                                momentum=args.momentum,
                                weight_decay=args.weight_decay)

    cudnn.benchmark = True

    train_dataset =  MyDataSet(xb,yb)
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    train_loader = torch.utils.data.DataLoader(train_dataset,
                                               batch_size=args.batch_size,
                                               num_workers=0,
                                               pin_memory=False,
                                               sampler=train_sampler)

    for epoch in range(args.start_epoch, args.epochs):

        train_sampler.set_epoch(epoch)
        adjust_learning_rate(optimizer, epoch, args)
        # train for one epoch
        train(train_loader, model, criterion, optimizer, epoch, local_rank,
              args)

def train(train_loader, model, criterion, optimizer, epoch, local_rank, args):
    batch_time = AverageMeter('Time', ':6.3f')
    data_time = AverageMeter('Data', ':6.3f')
    losses = AverageMeter('Loss', ':.4e')
    top1 = AverageMeter('Acc@1', ':6.2f')
    top5 = AverageMeter('Acc@5', ':6.2f')
    progress = ProgressMeter(len(train_loader),
                             [batch_time, data_time, losses, top1, top5],
                             prefix="Epoch: [{}]".format(epoch))

    # switch to train mode
    model.train()

    end = time.time()
    for i, (images, target) in enumerate(train_loader):
        # measure data loading time
        data_time.update(time.time() - end)

        images = images.cuda(local_rank, non_blocking=True)
        target = target.cuda(local_rank, non_blocking=True)

        # compute output
        output = model(images)
        loss = criterion(output, target)
        print(loss)

        # compute gradient and do SGD step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()



class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self, name, fmt=':f'):
        self.name = name
        self.fmt = fmt
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

    def __str__(self):
        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
        return fmtstr.format(**self.__dict__)


class ProgressMeter(object):
    def __init__(self, num_batches, meters, prefix=""):
        self.batch_fmtstr = self._get_batch_fmtstr(num_batches)
        self.meters = meters
        self.prefix = prefix

    def display(self, batch):
        entries = [self.prefix + self.batch_fmtstr.format(batch)]
        entries += [str(meter) for meter in self.meters]
        print('\t'.join(entries))

    def _get_batch_fmtstr(self, num_batches):
        num_digits = len(str(num_batches // 1))
        fmt = '{:' + str(num_digits) + 'd}'
        return '[' + fmt + '/' + fmt.format(num_batches) + ']'


def adjust_learning_rate(optimizer, epoch, args):
    """Sets the learning rate to the initial LR decayed by 10 every 30 epochs"""
    lr = args.lr * (0.1**(epoch // 30))
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr



if __name__ == '__main__':
    main()
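With CUDA_VISIBLE_DEVICES set to '0,1', torch.cuda.device_count() returns 2, so mp.spawn starts two workers (ranks 0 and 1), init_process_group forms a world of size 2 over NCCL, and the DistributedSampler hands each rank half of the 65536 samples.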