城俊BLOG
From now on, I'll write code properly.
mmcv NCCL error: mmcv/_ext.cpython-37m-x86_64-linux-gnu.so: undefined symbol, RuntimeError: NCCL error i
[Code] mmcv error: mmcv/_ext.cpython-37m-x86_64-linux-gnu.so: undefined symbol. Original · 2023-01-12 02:05:22 · 1925 views · 0 comments
pytorch 1.7.0 torchvision 0.8.1 torch.cuda.amp GradScaler DDP training hangs
Error: the PyTorch DDP model hangs. Code where it hangs: the yolov5 training script train.py contains the line scaler.step(optimizer)  # optimizer.step. When the program reaches the second epoch it hangs, specifically at this statement in /home/xxx/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py: if not sum(v.item() for v... Original · 2022-05-08 22:10:18 · 1143 views · 0 comments
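[Lost process] pytorch DDP distributed training loses one GPU process after 10 epochs
Symptom: this has happened repeatedly, and the bus ID of the card that drops its process differs each time. PyTorch version: 1.7.0; cards: Titan RTX x 8. A very strange bug. Causes considered so far: is the batch size too large, so that each card's memory is nearly full and runs out during training? Question: then why does it not run out at the start? Does the program leak GPU memory? GPU overheating (probably not the cause for now, since the cooling fans are already at maximum and the temperature peaks at about 70 degrees)... Original · 2021-05-30 13:02:45 · 1073 views · 2 comments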
pytorch RuntimeError: one of the variables needed for gradient computation has been modified by an i
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [25088, 512]], which is output 0 of TBackward, is at version 4; expected version 3 instead. Hint: enable anomaly detection... Original · 2021-05-24 14:03:25 · 767 views · 1 comment
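This error typically appears when a tensor that autograd saved for the backward pass is modified in place afterwards. The excerpt is truncated, so the following is only a minimal sketch and not the code from the post: the tensor names and the use of exp are assumptions, chosen to reproduce the message on the CPU and to show an out-of-place rewrite that avoids it.

```python
import torch

x = torch.randn(3, requires_grad=True)

# Failing pattern: exp() saves its output for the backward pass,
# and add_() then overwrites that saved output in place.
y = x.exp()
y.add_(1)
failed = False
message = ""
try:
    y.sum().backward()
except RuntimeError as err:
    failed = True
    message = str(err)

# Out-of-place rewrite: y + 1 creates a new tensor, so the saved output stays intact.
y = x.exp()
y = y + 1
y.sum().backward()
```

As the hint in the message suggests, calling torch.autograd.set_detect_anomaly(True) before the forward pass makes the error also report the forward operation whose saved tensor was modified.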
pytorch BN layer error: ValueError: expected 4D input (got 2D input)
Error: Traceback (most recent call last): File "train_noPfc.py", line 201, in <module> main(args_) File "train_noPfc.py", line 160, in main f_masked, focc_masked, output, output_occ, f_diff, out = backbone(img1, img2) File "/home/user1/min... Translated · 2021-05-20 14:47:37 · 9172 views · 0 comments
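The traceback above is cut off, so the fix used in the post is not visible. As a general sketch under assumed shapes (a batch of 4 feature vectors with 512 channels), the message is what nn.BatchNorm2d raises when it receives a 2D tensor; two common ways around it are to use nn.BatchNorm1d for flat features, or to restore the spatial dimensions before the 2D layer.

```python
import torch
import torch.nn as nn

features = torch.randn(4, 512)  # (batch, channels): a 2D tensor, assumed shape

bn2d = nn.BatchNorm2d(512)
message = ""
try:
    bn2d(features)  # BatchNorm2d expects (batch, channels, height, width)
except ValueError as err:
    message = str(err)

# Option 1: flat features belong in BatchNorm1d.
bn1d = nn.BatchNorm1d(512)
out_1d = bn1d(features)

# Option 2: add singleton spatial dimensions, normalize, then flatten again.
out_2d = bn2d(features.view(4, 512, 1, 1)).flatten(1)
```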
Newer pytorch versions (e.g. 1.7.0): RuntimeError: Legacy autograd function with non-static forward method is deprecate
Just a single forward pass, and it still fails... Code: from torchvision import models; model = models.vgg19(pretrained=True); output = model(input.cuda()) Error: RuntimeError: Legacy autograd function with non-static forward method is deprecated. Full error: Traceback (most recent call last): File... Translated · 2021-05-18 17:13:41 · 898 views · 0 comments
pytorch resnet fully connected (linear) layer error: RuntimeError: mat1 dim 1 must match mat2 dim 0
Traceback (most recent call last): File "/home/user1/pjs/frvt_pytorch/batch_run/2branch_alter_1update_2pfc_MMD_ori_auto/recognition/arcface_torch/tools/visualize.py", line 276, in <module> mask = grad_cam(input, target_index) File "/home/user... Translated · 2021-05-18 10:14:57 · 7020 views · 0 comments
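pytorch retain_graph=True causes a GPU memory leak and OOM (out of memory) during training
Backpropagating several losses during training ran the GPU out of memory (even with the smallest batch_size). After removing retain_graph=True from the backward call, the problem no longer occurred. The cause in my case: the individual losses were not averaged after being computed, so what came back may have been a tensor rather than a scalar; call .mean() to reduce it to a scalar. Fix: criterion = torch.nn.CrossEntropyLoss(); output = module_a(fc1Features, label); arcLoss = criteri... Translated · 2021-05-13 16:53:00 · 2696 views · 2 comments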
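The point about reducing a loss to a scalar can be sketched as follows. This is not the post's code (which is truncated): the random logits and labels and the use of reduction="none" are assumptions, picked only to produce a per-sample loss tensor that then has to be averaged before backward().

```python
import torch

# reduction="none" is an assumption used here to get one loss value per sample.
criterion = torch.nn.CrossEntropyLoss(reduction="none")

logits = torch.randn(8, 10, requires_grad=True)  # hypothetical classifier output
labels = torch.randint(0, 10, (8,))

per_sample = criterion(logits, labels)  # shape (8,), not a scalar
loss = per_sample.mean()                # reduce to a 0-dim tensor

# A plain backward() is enough; retain_graph=True is only needed
# when the same graph must be backpropagated through again.
loss.backward()
```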
pytorch distributed: RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]
Error: RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1 Fix: switching from DP to DDP resolved it. Code that raised the error (DP mode; module_a is a classifier): module_a = torch.nn.DataParallel(module_a) Code after the change (DDP mode): local_rank = 0; mod... Original · 2021-05-13 11:26:46 · 1801 views · 0 comments
pytorch RuntimeError: one of the variables needed for gradient computation has been modified by an i
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [25088, 512]], which is output 0 of TBackward, is at version 4; expected version 3 instead. Hint: enable anomaly detection... Original · 2021-05-07 14:38:03 · 517 views · 1 comment
pytorch RuntimeError: Expected to have finished reduction in the prior iteration before starting a n
Error: Traceback (most recent call last): File "train.py", line 166, in <module> main(args_) File "train.py", line 118, in main f_clean_masked, f_occ_masked, fc, fc_occ = backbone(img1, img2) File "/home/user1/miniconda3/envs/py377/lib/pyt... Original · 2021-04-23 23:48:16 · 6823 views · 0 comments
pytorch RuntimeError: running_mean should contain 1048576 elements not 25088
sss Original · 2021-04-23 23:08:01 · 1070 views · 0 comments
Pytorch RuntimeError: Expected 4-dimensional input for 4-dimensional weight [512, 512, 3, 3], but go
Error: RuntimeError: Expected 4-dimensional input for 4-dimensional weight [512, 512, 3, 3], but got 2-dimensional input of size [4, 512] instead Original · 2021-04-23 22:56:37 · 5946 views · 5 comments
pytorch mxnet ValueError: too many dimensions 'NDArray'
Error: Traceback (most recent call last): File "topFAR_COX_py2_fc1_pytorch_PY3.py", line 199, in <module> IDimage_features_dict = getfeatures_dict(model, IDimage_list, IDimage_path, featurelen) File "topFAR_COX_py2_fc1_pytorch_PY3.py", line 106,... Original · 2021-04-09 11:36:38 · 633 views · 0 comments
Pytorch distributed: dist.init_process_group error, NCCL cannot find a GPU
Full error: Traceback (most recent call last): File "topFAR_COX_py2_fc1_pytorch.py", line 181, in <module> model = constructmodel(args) File "topFAR_COX_py2_fc1_pytorch.py", line 36, in constructmodel dist.init_process_group(backend='nccl', ini... Original · 2021-04-01 23:00:22 · 7910 views · 2 comments
Pytorch RuntimeError: The NVIDIA driver on your system is too old (found version 10010).
Full error: Traceback (most recent call last): File "topFAR_COX_py2_fc1_pytorch.py", line 181, in <module> model = constructmodel(args) File "topFAR_COX_py2_fc1_pytorch.py", line 37, in constructmodel torch.cuda.set_device(rank) File "/home/u... Original · 2021-04-01 22:52:40 · 6003 views · 2 comments
pytorch distributed: RuntimeError: Tensors must be CUDA and dense
Your model or its parameters have not been moved to the GPU. Fix: backbone = eval("backbones.{}".format(args.network))(False, dropout=dropout, fp16=args.fp16).to(rank) The trailing .to(rank) is what does this. In my code rank=0. Original · 2021-04-01 14:16:28 · 8844 views · 0 comments
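A minimal sketch of the same idea, moving the model before it is handed to a distributed wrapper. The nn.Linear stand-in for the backbone, its sizes, and the CPU fallback are assumptions so that the snippet also runs on a machine without a GPU; only rank = 0 comes from the post.

```python
import torch

rank = 0  # as in the post; on a multi-GPU job this would be the local rank

# Fall back to the CPU when CUDA is unavailable, purely so the sketch runs anywhere.
device = torch.device(f"cuda:{rank}") if torch.cuda.is_available() else torch.device("cpu")

# Hypothetical stand-in for the backbone built in the post.
backbone = torch.nn.Linear(512, 128).to(device)

# Every parameter should now live on the target device.
on_device = all(p.device.type == device.type for p in backbone.parameters())
```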
mxnet: error when visualizing model weights with mxboard: No handlers could be found for logger "mxboard.event_file_writer"
No handlers could be found for logger "mxboard.event_file_writer" Fix: $ pip install tensorboard Then add to the code: import logging; logging.basicConfig(level=logging.DEBUG) See https://github.com/reminisce/mxboard-demo/blob/master/train_mnist.py Original · 2021-03-29 17:11:58 · 152 views · 0 comments
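The logging part of the fix can be checked without mxboard installed. The sketch below uses only the standard library; the logger name is taken from the error message, while the debug message text is made up for illustration.

```python
import logging

# Attach a default StreamHandler to the root logger
# (basicConfig does nothing if the root logger already has handlers).
logging.basicConfig(level=logging.DEBUG)

# Same logger name as in the error message; its records propagate to the root handler.
logger = logging.getLogger("mxboard.event_file_writer")
logger.debug("logging is configured for the event file writer")

root_has_handler = len(logging.getLogger().handlers) > 0
```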
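Debugging NN training: diagnosing problems from the loss
1. The loss quickly drops to near 0 and then stops moving, while validation accuracy stays around 50%-60%. Figure: (not shown). Problem: the ground-truth labels passed in may be wrong, for example the same label is passed every time. This needs checking... Original · 2021-03-12 19:58:14 · 182 views · 0 comments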
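One way to run the suggested label check is to count the distinct labels the training loop actually receives. The helper below is a hypothetical sketch (the name summarize_labels is not from the post) using only the standard library; a single distinct label, or one label taking almost the whole share, would point to the problem described.

```python
from collections import Counter


def summarize_labels(labels):
    """Return (number of distinct labels, share of the most frequent label)."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0, 0.0
    top_count = counts.most_common(1)[0][1]
    return len(counts), top_count / total


# Hypothetical batch in which every sample carries the same label.
num_classes, top_share = summarize_labels([3, 3, 3, 3])
```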
mxnet error: TypeError: type <class 'mxnet.initializer.InitDesc'> not supported
When running hand-written mxnet inference code on a single image, this error appears: Traceback (most recent call last): File "/home/user1/pjs/frvt/arcface_Siamese_offline/recognition/tools/eval_on_train_set.py", line 163, in <module> fc7_mod.init_params(arg_params=fc7_overall, aux_params=None) File... Original · 2021-03-11 23:17:37 · 194 views · 3 comments
mxnet error: Check failed: dshp.ndim() == 4U (3 vs. 4) : Input data should be 4D in batch-num_filter-y-x
Error: mxnet.base.MXNetError: Error in operator conv0: [17:40:27] src/operator/nn/convolution.cc:152: Check failed: dshp.ndim() == 4U (3 vs. 4) : Input data should be 4D in batch-num_filter-y-x The input data is clearly 4D, so why the error? Because when the data was packed with collections.namedtuple for the forward pass, data was not wrapped in... Original · 2021-03-11 23:11:49 · 695 views · 0 comments
mxnet/module/base_module.py", line 855, in forward raise NotImplementedError()
Traceback (most recent call last): File "/home/user1/pjs/frvt/mask/arcface_Siamese_offline/recognition/train_0305.py", line 497, in <module> main() File "/home/user1/pjs/frvt/mask/arcface_Siamese_offline/recognition/train_0305.py", line 493,... Original · 2021-03-10 17:41:19 · 215 views · 0 comments
pycharm mxnet src/base.cc:49: GPU context requested, but no GPUs found.
Error: src/base.cc:49: GPU context requested, but no GPUs found. src/storage/storage.cc???? Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: no CUDA-capable device is detected [14:07:55] src/base.cc:49: GPU context requested, but no GPUs... Original · 2021-03-10 13:50:43 · 836 views · 0 comments
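mxnet loss nan, accuracy 0, model feature output is 0
NaN means the result has blown up: a value overflowed to positive or negative infinity and a later operation on it is undefined. Possible causes: 1. The learning rate is too large; for example, lower it from 0.1 to 0.01 or 0.001. If that still does not help, the code is probably at fault. 2. The loss is computed incorrectly, so it becomes too large. 3. The gradient is computed incorrectly; if the loss is hand-built, derive the gradient formula by hand and implement it correctly... Original · 2021-03-08 18:05:32 · 536 views · 0 comments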
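When narrowing down which of these causes applies, it helps to know the first step at which the loss stops being finite. The helper below is a hypothetical, framework-independent sketch (the name first_non_finite and the sample loss values are assumptions); it also shows how an infinity turns into NaN.

```python
import math


def first_non_finite(losses):
    """Return the index of the first NaN or infinite loss, or None if all are finite."""
    for step, value in enumerate(losses):
        if not math.isfinite(value):
            return step
    return None


# Hypothetical loss history: the run diverges at step 2.
history = [0.9, 0.5, float("inf"), float("nan")]
bad_step = first_non_finite(history)

# Subtracting infinity from infinity is undefined, which is one way NaN appears.
inf_minus_inf = float("inf") - float("inf")
```

Logging the learning rate and the gradient norm at the step just before bad_step usually tells cause 1 apart from causes 2 and 3.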
mxnet model loading: IndexError: list index out of range
loading ['/data/user1/log/frvt/models/r34_webface_mask/siamese/siamese34-arcface-webfaceSiamese/model-0009.params'] Traceback (most recent call last): File "train_0305.py", line 447, in <module> main() File "train_0305.py", line 443, in main... Original · 2021-03-07 21:53:35 · 365 views · 0 comments
mxnet stream_gpu-inl.h:62: Check failed: e == cudaSuccess: CUDA: unspecified launch failure Stack tr
Traceback (most recent call last): File "train_parall_fc7.py", line 409, in <module> main() File "train_parall_fc7.py", line 406, in main train_net(args) File "train_parall_fc7.py", line 401, in train_net epoch_end_callback = epoch_... Original · 2021-02-18 23:10:20 · 431 views · 1 comment
A model trained and saved with pytorch 1.7 cannot be loaded in the older 1.4: frame #63: <unknown function> + 0x1db3e0 (0x55ba98ddd3e0 in /data/user
A model trained and saved with the newer pytorch 1.7 cannot be loaded in the older 1.4. Error: torch.load('/home/user1/model_best_b.pth.tar') Traceback (most recent call last): File "/data/user1/pkgs/conda/envs/drc/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3417, in run_code exec(co... Original · 2020-12-10 16:27:19 · 11194 views · 3 comments
mxnet src/imperative/./imperative_utils.h:72: Check failed: inputs[i]->ctx().dev_mask() == ctx.dev_m
mxnet 1.6: training with a custom OP (for computing a metric) fails: src/imperative/./imperative_utils.h:72: Check failed: inputs[i]->ctx().dev_mask() == ctx.dev_mask() (1 vs. 2) : Operator broadcast_add require all inputs live on the same context. But the first argument is on gpu(0) while... Original · 2020-12-09 15:27:34 · 334 views · 0 comments
pytorch AttributeError: ‘_IncompatibleKeys’ object has no attribute ‘eval’
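Friend, load_state_dict does not return the model after loading the parameters, so do not capture its return value; just call load_state_dict. Incorrect code (here a list captures the return values of load_state_dict, but these are not the loaded models; they are _IncompatibleKeys objects): models = [m.load_state_dict(ckpts[i]['state_dict']) for i,m in enumerate(models)] Correct... Original · 2020-12-02 11:35:31 · 3887 views · 0 comments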
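The corrected version is cut off in the excerpt, so the following is a sketch of one way to write it, not the post's code. The tiny nn.Linear models and the in-memory checkpoints are assumptions; the names models, ckpts and the 'state_dict' key follow the post.

```python
import torch

# Hypothetical models and checkpoints standing in for the ones in the post.
models = [torch.nn.Linear(4, 2) for _ in range(2)]
ckpts = [{"state_dict": torch.nn.Linear(4, 2).state_dict()} for _ in range(2)]

for m, ckpt in zip(models, ckpts):
    # load_state_dict updates m in place; its return value only reports
    # missing and unexpected keys, and has no eval() method.
    result = m.load_state_dict(ckpt["state_dict"])
    m.eval()

result_has_eval = hasattr(result, "eval")
```

The list models still holds the actual modules afterwards, so later code can keep using it unchanged.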
MXNetError: Check failed: assign(&dattr, vec.at(i)): Incompatible attr in node at 0-th output: expe
Error: MXNetError: Check failed: assign(&dattr, vec.at(i)): Incompatible attr in node at 0-th output: expected [51249,512], got [59643,512] Odd errors like this are a real headache; the message mxnet gives makes it very hard to locate the offending code. Details: Traceback (most recent call last): File "train_0723.py", line 494, in... Original · 2020-11-09 20:43:58 · 1339 views · 0 comments
mxnet recordio reading a rec file: pos = ctypes.c_size_t(self.. raises KeyError
Traceback (most recent call last): File "image_iter.py", line 42, in <module> C = FaceImageIter(20, (3,112,112), path_imgrec='/home/user1/data/deepglint/train.rec') File "image_iter.py", line 39, in __init__ s = self.imgrec.read_idx(0)... Original · 2020-11-04 23:05:25 · 726 views · 0 comments
mxnet ubuntu18.04 python3.7.7 cuda10.1 AttributeError: module 'mxnet' has no attribute xxx
Error: $ python Python 3.7.7 (default, Mar 26 2020, 15:48:22) [GCC 7.3.0] :: Anaconda, Inc. on linux Type "help", "copyright", "credits" or "license" for more information. >>> import mxnet as mx >>> mx.cpu AttributeError: module 'mxnet' has... Original · 2020-10-26 16:28:04 · 1190 views · 0 comments
mxnet Segmentation fault: 11 libmxnet.so(+0x40c6b50)
Segmentation fault: 11 Stack trace: [bt] (0) /data/user1/pkgs/conda/envs/drc/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x40c6b50) [0x7f2dfe4f8b50] [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f2edc06b4b0] [bt] (2) /lib/x86_64-linux-gn... Original · 2020-09-07 16:37:01 · 1089 views · 0 comments
tensorflow keras InvalidArgumentError: Default MaxPoolingOp only supports NHWC on device type CPU
tf.version '2.3.0', keras.version '2.4.3'. TensorFlow code: out = model.predict(x_test); out = np.array(out) Error: InvalidArgumentError Traceback (most recent call las... Original · 2020-08-21 15:16:32 · 8120 views · 1 comment
mxnet sklearn ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Traceback (most recent call last): File "train_0723.py", line 455, in <module> main() File "train_0723.py", line 451, in main train_net(args) File "train_0723.py", line 445, in train_net epoch_end_callback=epoch_cb) File "/home/us... Original · 2020-08-17 11:39:41 · 606 views · 0 comments
pytorch RuntimeError: expected backend CUDA and dtype Float but got backend CPU and dtype Float
Code: criterion = nn.BCEWithLogitsLoss(reduction='none'); loss = criterion(output, target); loss.mul_(weights) Error: Traceback (most recent call last): File "/home/user1/main_cs_0708.py", line 391, in main() File "/home/user1/main_cs_0708.py", line 301, in mai... Original · 2020-08-14 17:46:08 · 2424 views · 0 comments
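The traceback is truncated, so the cause is not confirmed by the excerpt. Given the message in the title (CUDA expected, CPU found), a plausible assumption is that weights was created on the CPU while loss lives on the GPU. A minimal sketch under that assumption, with made-up shapes and a CPU fallback so it runs anywhere:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical model output and targets, already on the training device.
output = torch.randn(4, 3, device=device, requires_grad=True)
target = torch.randint(0, 2, (4, 3), device=device).float()

# Assumed culprit: a weight tensor created without a device lands on the CPU.
weights = torch.ones(4, 3)

criterion = nn.BCEWithLogitsLoss(reduction="none")
loss = criterion(output, target)

# Move the weights to wherever the loss lives before multiplying.
weighted = loss * weights.to(loss.device)
weighted.mean().backward()
```

The sketch multiplies out of place; the post's loss.mul_(weights) would need the same .to(loss.device) on its argument.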
pytorch network structure visualization: graphviz.backend.ExecutableNotFound: failed to execute ['dot', '-Tpdf', '-O', '
Error: Traceback (most recent call last): File "/data/user1/pkgs/conda/envs/drc/lib/python3.7/site-packages/graphviz/backend.py", line 129, in render subprocess.check_call(args, stderr=stderr, **POPEN_KWARGS) File "/data/user1/pkgs/conda/envs/drc/lib/... Original · 2020-08-12 15:14:51 · 1502 views · 0 comments
linux ubuntu: upgrading sklearn 0.19
sklearn 0.19 throws all sorts of errors and cannot be upgraded either. Error 1 (at use time): ImportError: cannot import name 'multilabel_confusion_matrix' Error 2 (while upgrading): $ pip install git+http://github.com/scikit-learn/scikit-learn.git Looking in indexes: https://mirrors.aliyun.com/pypi/simple/, https://pypi.tuna.t... Original · 2020-08-11 23:57:26 · 727 views · 0 comments
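MxNet base.h:459: Check failed: e == cudaSuccess (30 vs. 0) : CUDA: unknown error; nvidia-smi shows GPU ERR!
The command responds slowly and reports ERR. 1. Reboot the machine. 2. Reinstall the driver. 3. Repair or replace the GPU. Original · 2020-08-09 11:33:27 · 1288 views · 0 comments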
UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figur
The plot does not appear. UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure. Fix: import matplotlib.pyplot as plt; plt.switch_backend('TkAgg') okie @ https://stackoverflow.com/questions/56656777/userwarning-matplotlib-is-curre... Translated · 2020-08-07 11:37:53 · 3576 views · 0 comments
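Switching to TkAgg needs a display and a working Tk installation. On a headless server where neither is available, a different approach (not the fix from the post) is to keep the non-GUI Agg backend and write the figure to a file instead of calling show(). A minimal sketch, with made-up plot data and a temporary output path:

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # select the non-GUI backend explicitly, before importing pyplot
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])  # hypothetical data

# Save to a file instead of opening a window.
out_path = os.path.join(tempfile.mkdtemp(), "figure.png")
fig.savefig(out_path)
plt.close(fig)
```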