Hyperparameter Tuning Notes: GPU Memory Drop & Batch Size (bs) Adjustment

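I recently worked with a generative model codebase that calls torch.cuda.empty_cache() after the training pass of every epoch and then runs eval. The model's default batch size (bs) is 1 and its learning rate (lr) is 0.0001.
I wanted to see how this model performs on my own data, so some simple hyperparameter tuning was needed. The steps are recorded below: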

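  • What I did: with such a small bs each tuning cycle takes a long time, so I changed bs and the corresponding lr

  • Problem 1: in this step I set bs to 32 and lr to 0.01. When monitoring with watch, the GPU reading suddenly drops at the end of each epoch's training pass and hovers around 6%-8%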

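  • Troubleshooting

    • Was the GPU temperature too high? I noticed the GPU gradually heats up to 80°C while running, but logging showed that temperature was not the cause of the sudden drop in GPU memory
    • Was the dataloader's num_workers=8 filling up host memory after a period of training and thereby lowering GPU memory utilization? Setting num_workers=0 and adding pin_memory=True did not solve it
    • Was torch.cuda.empty_cache() the cause? Removing it made little difference. My GPU memory was never fully used, so I was not worried about running out of memory
    • The actual cause turned out to be the eval (validation) pass: its bs was still 1 and it only runs the forward pass, so it uses very little GPU memory, and because there is a lot of validation data it also appears slow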
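
To make the last finding concrete, here is a minimal, framework-agnostic sketch of why a validation pass with bs=1 looks both slow and light on the GPU: it needs one forward pass per sample. The helper names (`iter_batches`, `num_eval_steps`) and the validation set size of 1000 are hypothetical and do not come from the original code. In PyTorch the usual equivalent would be to give the validation DataLoader its own, larger batch_size and to run the loop under torch.no_grad().

```python
import math
from typing import Iterator, List, Sequence


def iter_batches(samples: Sequence, batch_size: int) -> Iterator[List]:
    """Yield consecutive batches; the last batch may be smaller."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(samples), batch_size):
        yield list(samples[start:start + batch_size])


def num_eval_steps(num_samples: int, batch_size: int) -> int:
    """Number of forward passes needed to cover the validation set once."""
    return math.ceil(num_samples / batch_size)


# Hypothetical validation set size, chosen only for illustration.
val_samples = list(range(1000))

steps_bs1 = num_eval_steps(len(val_samples), 1)    # one sample per forward pass
steps_bs32 = num_eval_steps(len(val_samples), 32)  # same data, far fewer passes

print(f"eval steps with bs=1:  {steps_bs1}")
print(f"eval steps with bs=32: {steps_bs32}")
```

With the training bs raised to 32 but the eval bs left at 1, each epoch alternates between a heavily loaded phase and a lightly loaded one, which would match the periodic drop seen in the monitor.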
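  • Problem 2: with bs set to 32 and lr set to 0.01, training should in principle converge faster than with bs=1 and lr=0.0001 and give a reasonably good result. Instead, I found that the generative model becomes unstable late in training, and the loss even gradually increases

  • Troubleshooting 2
    My guess is that, on a specialized dataset, a GAN-based generative model has a hard time learning with a large bs, so after increasing bs the model's capacity is insufficient, which leads to mode collapse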
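
One thing that may be worth checking alongside this guess is the arithmetic of the change itself: bs grew 32x (1 to 32) while lr grew 100x (0.0001 to 0.01). The sketch below compares the lr actually used against two common heuristics, linear scaling and square-root scaling. These heuristics are an assumption brought in for comparison only. They come mostly from supervised training, are not guaranteed to hold for GANs, and the helper `scale_lr` is hypothetical.

```python
import math


def scale_lr(base_lr: float, base_bs: int, new_bs: int, rule: str = "linear") -> float:
    """Scale a learning rate for a new batch size using a common heuristic."""
    ratio = new_bs / base_bs
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule}")


# Values taken from the notes above.
base_lr, base_bs = 1e-4, 1
new_bs, used_lr = 32, 1e-2

linear_lr = scale_lr(base_lr, base_bs, new_bs, "linear")
sqrt_lr = scale_lr(base_lr, base_bs, new_bs, "sqrt")

print(f"linear-scaled lr: {linear_lr:.6f}")
print(f"sqrt-scaled lr:   {sqrt_lr:.6f}")
print(f"lr actually used: {used_lr:.6f} ({used_lr / linear_lr:.3f}x the linear value)")
```

If the lr used is well above both heuristic values, an overly large step size is a second candidate explanation for the late-stage instability, and rerunning with a smaller lr at bs=32 would help separate it from the batch-size effect.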

YOLO model: users of 16-series cards saw their metrics recover after disabling amp and half (ref. [4]). I turned them off, and it made no difference.

First run (batch=16):

    (ultralytics) C:\Users\86187\ultralytics-8.1.0>C:/Users/86187/miniconda3/envs/ultralytics/python.exe c:/Users/86187/ultralytics-8.1.0/tarin.py

                 from  n    params  module                                   arguments
      0            -1  1       464  ultralytics.nn.modules.conv.Conv         [3, 16, 3, 2]
      1            -1  1      4672  ultralytics.nn.modules.conv.Conv         [16, 32, 3, 2]
      2            -1  1      7360  ultralytics.nn.modules.block.C2f         [32, 32, 1, True]
      3            -1  1     18560  ultralytics.nn.modules.conv.Conv         [32, 64, 3, 2]
      4            -1  2     49664  ultralytics.nn.modules.block.C2f         [64, 64, 2, True]
      5            -1  1     73984  ultralytics.nn.modules.conv.Conv         [64, 128, 3, 2]
      6            -1  2    197632  ultralytics.nn.modules.block.C2f         [128, 128, 2, True]
      7            -1  1    295424  ultralytics.nn.modules.conv.Conv         [128, 256, 3, 2]
      8            -1  1    460288  ultralytics.nn.modules.block.C2f         [256, 256, 1, True]
      9            -1  1      8210  ultralytics.nn.Attention.CBAM.CBAM       [256, 3]
     10            -1  1    164608  ultralytics.nn.modules.block.SPPF        [256, 256, 5]
     11            -1  1         0  torch.nn.modules.upsampling.Upsample     [None, 2, 'nearest']
     12       [-1, 6]  1         0  ultralytics.nn.modules.conv.Concat       [1]
     13            -1  1    148224  ultralytics.nn.modules.block.C2f         [384, 128, 1]
     14            -1  1         0  torch.nn.modules.upsampling.Upsample     [None, 2, 'nearest']
     15       [-1, 4]  1         0  ultralytics.nn.modules.conv.Concat       [1]
     16            -1  1     37248  ultralytics.nn.modules.block.C2f         [192, 64, 1]
     17            -1  1     36992  ultralytics.nn.modules.conv.Conv         [64, 64, 3, 2]
     18      [-1, 13]  1         0  ultralytics.nn.modules.conv.Concat       [1]
     19            -1  1    123648  ultralytics.nn.modules.block.C2f         [192, 128, 1]
     20            -1  1    147712  ultralytics.nn.modules.conv.Conv         [128, 128, 3, 2]
     21      [-1, 10]  1         0  ultralytics.nn.modules.conv.Concat       [1]
     22            -1  1    493056  ultralytics.nn.modules.block.C2f         [384, 256, 1]
     23  [16, 19, 22]  1    751702  ultralytics.nn.modules.head.Detect       [2, [64, 128, 256]]
    YOLOv8n-CBAM summary: 236 layers, 3019448 parameters, 3019432 gradients, 8.2 GFLOPs

    New https://pypi.org/project/ultralytics/8.3.174 available 😃 Update with 'pip install -U ultralytics'
    Ultralytics YOLOv8.1.0 🚀 Python-3.9.23 torch-1.12.1+cu116 CUDA:0 (NVIDIA GeForce GTX 1650, 4096MiB)
    WARNING ⚠️ Upgrade to torch>=2.0.0 for deterministic training.
    engine\trainer: task=detect, mode=train, model=C:\Users\86187\ultralytics-8.1.0\ultralytics\cfg\models\v8\yolov8n-CBAM.yaml, data=C:\Users\86187\ultralytics-8.1.0\tomato_train.yaml, epochs=5, time=None, patience=50, batch=16, imgsz=640, save=True, save_period=-1, cache=False, device=None, workers=0, project=None, name=train17, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=None, amp=False, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=runs\detect\train17

    (same layer table and summary as above)

    TensorBoard: Start with 'tensorboard --logdir runs\detect\train17', view at http://localhost:6006/
    Freezing layer 'model.23.dfl.conv.weight'
    train: Scanning C:\Users\86187\ultralytics-8.1.0\datasets\tomato\labels\train.cache... 392 images, 0 backgrounds, 0 corrupt: 100%|██████████|
    val: Scanning C:\Users\86187\ultralytics-8.1.0\datasets\tomato\labels\val.cache... 128 images, 0 backgrounds, 0 corrupt: 100%|██████████| 128/
    Plotting labels to runs\detect\train17\labels.jpg...
    optimizer: 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically...
    optimizer: AdamW(lr=0.001667, momentum=0.9) with parameter groups 57 weight(decay=0.0), 67 weight(decay=0.0005), 63 bias(decay=0.0)
    5 epochs...

          Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
            1/5      3.65G      4.558      4.098      4.315        411        640:   4%|▍         | 1/25 [00:08<03:35, 8.98s/it]
    Traceback (most recent call last):
      File "c:\Users\86187\ultralytics-8.1.0\tarin.py", line 17, in <module>
        results = model.train(data=r'C:\Users\86187\ultralytics-8.1.0\tomato_train.yaml',
      File "c:\Users\86187\ultralytics-8.1.0\ultralytics\engine\model.py", line 390, in train
        self.trainer.train()
      File "c:\Users\86187\ultralytics-8.1.0\ultralytics\engine\trainer.py", line 208, in train
        self._do_train(world_size)
      File "c:\Users\86187\ultralytics-8.1.0\ultralytics\engine\trainer.py", line 379, in _do_train
        self.loss, self.loss_items = self.model(batch)
      File "C:\Users\86187\miniconda3\envs\ultralytics\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "c:\Users\86187\ultralytics-8.1.0\ultralytics\nn\tasks.py", line 81, in forward
        return self.loss(x, *args, **kwargs)
      File "c:\Users\86187\ultralytics-8.1.0\ultralytics\nn\tasks.py", line 260, in loss
        return self.criterion(preds, batch)
      File "c:\Users\86187\ultralytics-8.1.0\ultralytics\utils\loss.py", line 220, in __call__
        _, target_bboxes, target_scores, fg_mask, _ = self.assigner(
      File "C:\Users\86187\miniconda3\envs\ultralytics\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "C:\Users\86187\miniconda3\envs\ultralytics\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
        return func(*args, **kwargs)
      File "c:\Users\86187\ultralytics-8.1.0\ultralytics\utils\tal.py", line 72, in forward
        mask_pos, align_metric, overlaps = self.get_pos_mask(
      File "c:\Users\86187\ultralytics-8.1.0\ultralytics\utils\tal.py", line 92, in get_pos_mask
        mask_in_gts = self.select_candidates_in_gts(anc_points, gt_bboxes)
      File "c:\Users\86187\ultralytics-8.1.0\ultralytics\utils\tal.py", line 227, in select_candidates_in_gts
        bbox_deltas = torch.cat((xy_centers[None] - lt, rb - xy_centers[None]), dim=2).view(bs, n_boxes, n_anchors, -1)
    RuntimeError: CUDA out of memory. Tried to allocate 158.00 MiB (GPU 0; 4.00 GiB total capacity; 3.25 GiB already allocated; 0 bytes free; 3.45 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Second run (batch=8):

    (ultralytics) C:\Users\86187\ultralytics-8.1.0>C:/Users/86187/miniconda3/envs/ultralytics/python.exe c:/Users/86187/ultralytics-8.1.0/tarin.py

    (same layer table, summary, version banner and torch warning as in the first run)

    engine\trainer: same arguments as in the first run, except batch=8, name=train18, save_dir=runs\detect\train18

    TensorBoard: Start with 'tensorboard --logdir runs\detect\train18', view at http://localhost:6006/
    Freezing layer 'model.23.dfl.conv.weight'
    train: Scanning C:\Users\86187\ultralytics-8.1.0\datasets\tomato\labels\train.cache... 392 images, 0 backgrounds, 0 corrupt: 100%|██████████|
    val: Scanning C:\Users\86187\ultralytics-8.1.0\datasets\tomato\labels\val.cache... 128 images, 0 backgrounds, 0 corrupt: 100%|██████████| 128/
    Plotting labels to runs\detect\train18\labels.jpg...
    optimizer: 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically...
    optimizer: AdamW(lr=0.001667, momentum=0.9) with parameter groups 57 weight(decay=0.0), 67 weight(decay=0.0005), 63 bias(decay=0.0)
    5 epochs...

          Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
            1/5      2.02G       4.28      3.798       4.17        231        640: 100%|██████████| 49/49 [01:28<00:00, 1.81s/it]
                     Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:29<00:00, 3.71s/it]
                       all        128       1976          0          0          0          0

          Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
            2/5         2G      3.658      2.893      3.676        213        640: 100%|██████████| 49/49 [01:25<00:00, 1.74s/it]
                     Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:28<00:00, 3.58s/it]
                       all        128       1976          0          0          0          0

          Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
            3/5      2.01G      2.895      2.494      3.126        109        640: 100%|██████████| 49/49 [01:26<00:00, 1.76s/it]
                     Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:30<00:00, 3.87s/it]
                       all        128       1976          0          0          0          0

          Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
            4/5      2.25G      2.666      2.218      2.636        240        640: 100%|██████████| 49/49 [01:26<00:00, 1.76s/it]
                     Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:28<00:00, 3.61s/it]
                       all        128       1976          0          0          0          0
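
As a quick sanity check on the two runs, the progress-bar totals in the log (25 and 49 training iterations, 8 validation iterations) can be reproduced from the image counts it reports. The sketch below is plain arithmetic. The helper `steps_per_epoch` is hypothetical, and it assumes the final partial batch is kept rather than dropped.

```python
import math


def steps_per_epoch(num_images: int, batch_size: int) -> int:
    """Iterations needed to see every image once, keeping the last partial batch."""
    return math.ceil(num_images / batch_size)


# Image counts reported by the "train: Scanning" and "val: Scanning" log lines.
TRAIN_IMAGES = 392
VAL_IMAGES = 128

steps_batch16 = steps_per_epoch(TRAIN_IMAGES, 16)  # first run, which ran out of memory
steps_batch8 = steps_per_epoch(TRAIN_IMAGES, 8)    # second run, which completed epochs

# The validation bar in the second run shows 8 iterations per epoch.
VAL_ITERS = 8
images_per_val_batch = VAL_IMAGES / VAL_ITERS

print(f"train iterations at batch=16: {steps_batch16}")
print(f"train iterations at batch=8:  {steps_batch8}")
print(f"images per validation batch:  {images_per_val_batch}")
```

The training totals match the log, and the validation figure suggests that validation in the second run used larger batches than the training batch of 8.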
08-07