Deep Learning Tips

crasyter

Last modified: 2023-02-23 15:20:12

First published: 2023-02-13 17:07:02

Article link: https://blog.youkuaiyun.com/weixin_41169280/article/details/129012919

1. At the end of the datasets_make function, convert everything to a NumPy array

data=np.array(data)

Otherwise things go wrong. For example, with samples of shape 10×32×16, the batches that trainloader produces (batch_size of 30) look like this:

data.shape #(10,)
data[0].shape #(30,32,16)

rather than (30, 10, 32, 16).
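The behaviour above can be reproduced with a small sketch. It assumes the cause is that each sample is a Python list of 10 arrays, which the default DataLoader collation batches element by element; the two dataset classes and their sizes are hypothetical stand-ins, not the original datasets_make code.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class ListSamples(Dataset):
    """Hypothetical dataset: each sample is a list of 10 arrays of shape (32, 16)."""

    def __len__(self):
        return 60

    def __getitem__(self, idx):
        return [np.zeros((32, 16), dtype=np.float32) for _ in range(10)]


class ArraySamples(ListSamples):
    """Same data, but each sample is converted to one (10, 32, 16) array."""

    def __getitem__(self, idx):
        return np.array(super().__getitem__(idx))


# A list sample is collated per element: 10 tensors of shape (30, 32, 16).
list_batch = next(iter(DataLoader(ListSamples(), batch_size=30)))
# An array sample is stacked along a new batch axis: (30, 10, 32, 16).
array_batch = next(iter(DataLoader(ArraySamples(), batch_size=30)))

print(len(list_batch), tuple(list_batch[0].shape))
print(tuple(array_batch.shape))
```

Converting with np.array inside the dataset keeps the batch dimension first, which is what the model normally expects.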

2. At inference time, preprocess the data first and move it to CUDA

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Move the input to the same device as the model weights
input_batch = input_batch.to(device)

Otherwise you get this error:

RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same
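A minimal sketch of keeping the model and its input on the same device. The nn.Linear model and the tensor shapes are placeholders chosen for illustration, not the model from this post.

```python
import torch
from torch import nn

# Fall back to the CPU when no GPU is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder model; the weights live on `device` after .to(device).
model = nn.Linear(16, 4).to(device)
model.eval()

# The input must be moved to the same device before the forward pass.
input_batch = torch.randn(30, 16).to(device)

with torch.no_grad():
    output = model(input_batch)

print(output.shape, output.device)
```

Using .to(device) for both the model and the data means the same script also runs on a machine without CUDA.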

3. Mind the path when saving a model

torch.save(model.state_dict(), f'{checkpoints_dir}/{epoch:003}.pth')

I had not created the checkpoints_dir folder beforehand, so it raised:

FileNotFoundError: [Errno 2] No such file or directory:

Note: the 003 after epoch: zero-pads the epoch number to three digits, i.e. 000, 001, … (the more common spelling is 03d).
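One way to avoid the FileNotFoundError is to create the directory before building the checkpoint path. The helper name checkpoint_path below is hypothetical, and a temporary directory stands in for the real checkpoint location.

```python
import os
import tempfile


def checkpoint_path(checkpoints_dir, epoch):
    """Create checkpoints_dir if it is missing and return the path for this epoch."""
    os.makedirs(checkpoints_dir, exist_ok=True)  # no error if it already exists
    return os.path.join(checkpoints_dir, f"{epoch:03d}.pth")


# Temporary root so the sketch does not touch a real project folder.
root = tempfile.mkdtemp()
path = checkpoint_path(os.path.join(root, "checkpoints"), 7)
print(os.path.basename(path))  # 007.pth

# With the directory in place, torch.save(model.state_dict(), path)
# no longer fails because of a missing folder.
```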

4. Error: AttributeError: 'tuple' object has no attribute 'permute'

Search results suggested reinstalling numpy, but that did not solve the problem here. Printing the data later showed:

fl: (tensor([[[[[ 1.6029e+00, -9.7684e-01],
[ 1.5993e+00, -9.7150e-01],
[ 1.5962e+00, -9.6179e-01],
…, …,
[ 8.8066e+00, -9.4304e-01],
[ 8.8033e+00, -9.5545e-01],
[ 8.8245e+00, -9.7233e-01]]]]], device='cuda:0'),)

Only the outermost level is a tuple, so fl = fl[0] works around it. To understand the cause, I checked the code and found:

fl = input_batch["data"].cuda(),

There is a trailing comma at the end, which wraps the value in a one-element tuple. A plain careless mistake.
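The effect of the trailing comma can be seen without any tensors. In this small sketch a plain list stands in for input_batch["data"].cuda().

```python
# A plain list stands in for the tensor returned by input_batch["data"].cuda().
value = [1.0, 2.0, 3.0]

with_comma = value,   # the trailing comma builds a one-element tuple
without_comma = value

print(type(with_comma).__name__)     # tuple
print(type(without_comma).__name__)  # list

# Indexing recovers the original object, which is why fl = fl[0] worked.
assert with_comma[0] is value
```

Removing the comma is the real fix; indexing with [0] only hides it.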

5. Converting numpy.ndarray to tensor

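The model input must be a tensor. If you want to build your own test_loader data at test time instead of using torch.utils.data.DataLoader, you have to convert the data to tensors manually. Otherwise you get: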

AttributeError: 'numpy.ndarray' object has no attribute 'cuda'

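This happens because one step in the model code moves the data to CUDA, and that operation only exists on tensors. However, torch.tensor() cannot be applied directly to the dict that was retrieved, only to the numpy.ndarray inside it. Otherwise you get: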

RuntimeError: Could not infer dtype of dict

It should be:

f = torch.tensor(sample["f"])
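A short sketch of converting a hand-built sample. The dict layout, the key "f" holding a (10, 32, 16) float array, and the extra "label" key are assumptions made for illustration.

```python
import numpy as np
import torch

# Hypothetical sample, shaped like what a custom test loader might return.
sample = {
    "f": np.random.rand(10, 32, 16).astype(np.float32),
    "label": np.array([1, 0, 1]),
}

# Convert one array; torch.tensor copies the data.
f = torch.tensor(sample["f"])

# Or convert every value of the dict, one array at a time.
batch = {key: torch.as_tensor(value) for key, value in sample.items()}

print(type(f).__name__, tuple(f.shape), f.dtype)
print({key: tuple(value.shape) for key, value in batch.items()})
```

After this step each value is a tensor, so methods such as .cuda() or .to(device) are available on it.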

6. Error: RuntimeError: CUDA error: device-side assert triggered

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Here the main trigger was the call model.cuda(gpu_ids[0]), and CUDA did not report the error in detail. Add the following to get a detailed report:

import os
# Set this before the first CUDA call so kernels run synchronously
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

7. Error: unable to open shared memory object </torch_> in read-write mode

RuntimeError: false INTERNAL ASSERT FAILED at "../aten/src/ATen/MapAllocator.cpp":263, please report a bug to PyTorch. unable to open shared memory object </torch_786_1> in read-write mode

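Solutions
1. The open-file limit is too low; raise it directly with ulimit -SHn 51200.
2. A related issue on GitHub shows that this happens because num_workers in torch.utils.data.DataLoader(…) is set greater than 0; changing it to 0 fixes it.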
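As a rough alternative to running ulimit in the shell, the open-file limit can also be inspected and raised from inside the Python process with the standard resource module (Unix only). This is a sketch under assumptions: the helper name raise_open_file_limit is hypothetical, and unlike ulimit -SHn it only raises the soft limit, capped at the existing hard limit.

```python
import resource


def raise_open_file_limit(target=51200):
    """Raise the soft open-file limit toward target, capped at the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft, hard  # already high enough, nothing to do
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        pass  # the platform refused the new value; keep the current limits
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft_limit, hard_limit = raise_open_file_limit()
print("open-file limits (soft, hard):", soft_limit, hard_limit)
```

Call it before creating the DataLoader so that worker processes inherit the new limit. If the error persists, setting num_workers to 0 remains the simpler fallback.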