Debug result = unpickler.load() ModuleNotFoundError: No module named 'models'

This post describes a problem hit when converting a torch-trained YOLOv5 model to TensorRT: torch.save stored the full model object along with extra training state, so the checkpoint fails to load on other machines even with map_location set. The fix is to convert the .pt to ONNX with YOLOv5's own export.py first, then convert the ONNX to a TensorRT engine.


1. Converting a torch-trained YOLOv5 model to TensorRT fails as follows:

Using CUDA device0 _CudaDeviceProperties(name='NVIDIA GeForce RTX 3080', total_memory=10017MB)

Find Pytorch weight
Traceback (most recent call last):
  File "export.py", line 243, in <module>
    ckpt = torch.load(opt.weight, map_location=device)
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 592, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 851, in _load
    result = unpickler.load()
ModuleNotFoundError: No module named 'models'
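
The error comes from unpickling, not from map_location: torch.save(model, path) pickles the whole model object, including a reference to its class by module path (models.yolo.Model in yolov5), so torch.load() must be able to import the models package. Before the ONNX route below, a quick workaround is to make the yolov5 source importable at load time. A minimal sketch; the repo path is an example, not from the original post:

import sys
import torch

# Put the yolov5 repo root (the directory containing models/) on
# sys.path so pickle can resolve models.yolo.Model during load.
sys.path.insert(0, "/path/to/yolov5")  # example path, adjust to your checkout
ckpt = torch.load("best.pt", map_location="cpu")
model = ckpt.get("ema") or ckpt["model"]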

2. Solution:

Use the export.py that ships with YOLOv5 to convert the .pt weights to an .onnx model first, then convert the ONNX to a TensorRT engine; this resolves the problem.
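
For the stock Ultralytics yolov5 repo the two steps look roughly like this (a hedged sketch: the post's customized export.py in section 3 takes similar but not identical options, file names are examples, and trtexec is TensorRT's bundled CLI):

import subprocess

# Step 1: .pt -> .onnx with yolov5's own export.py. Run from the repo
# root so the pickled `models` package resolves during torch.load().
subprocess.run(
    ["python", "export.py", "--weights", "best.pt", "--include", "onnx", "--opset", "11"],
    check=True,
)

# Step 2: .onnx -> TensorRT engine; --fp16 matches the FP16 engine
# built in the log below.
subprocess.run(
    ["trtexec", "--onnx=best.onnx", "--saveEngine=best.engine", "--fp16"],
    check=True,
)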

Find ONNX weight

TensorRT: starting export with TensorRT 8.4.0.6...
[08/24/2023-18:57:25] [TRT] [I] [MemUsageChange] Init CUDA: CPU +359, GPU +0, now: CPU 426, GPU 401 (MiB)
[08/24/2023-18:57:26] [TRT] [I] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 444 MiB, GPU 401 MiB
[08/24/2023-18:57:27] [TRT] [I] [MemUsageSnapshot] End constructing builder kernel library: CPU 819 MiB, GPU 523 MiB
[08/24/2023-18:57:27] [TRT] [I] ----------------------------------------------------------------
[08/24/2023-18:57:27] [TRT] [I] Input filename:   ../best.onnx
[08/24/2023-18:57:27] [TRT] [I] ONNX IR version:  0.0.6
[08/24/2023-18:57:27] [TRT] [I] Opset version:    11
[08/24/2023-18:57:27] [TRT] [I] Producer name:    pytorch
[08/24/2023-18:57:27] [TRT] [I] Producer version: 1.9
[08/24/2023-18:57:27] [TRT] [I] Domain:           
[08/24/2023-18:57:27] [TRT] [I] Model version:    0
[08/24/2023-18:57:27] [TRT] [I] Doc string:       
[08/24/2023-18:57:27] [TRT] [I] ----------------------------------------------------------------
[08/24/2023-18:57:27] [TRT] [W] onnx2trt_utils.cpp:365: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
TensorRT: Network Description:
TensorRT:       input "images" with shape (1, 3, 640, 640) and dtype DataType.FLOAT
TensorRT:       output "output" with shape (1, 25200, 20) and dtype DataType.FLOAT
TensorRT: building FP16 engine in ../best.engine
[08/24/2023-18:57:29] [TRT] [W] TensorRT was linked against cuBLAS/cuBLAS LT 11.8.0 but loaded cuBLAS/cuBLAS LT 11.3.0
[08/24/2023-18:57:29] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +637, GPU +268, now: CPU 1545, GPU 791 (MiB)
[08/24/2023-18:57:29] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +356, GPU +258, now: CPU 1901, GPU 1049 (MiB)
[08/24/2023-18:57:29] [TRT] [W] TensorRT was linked against cuDNN 8.3.2 but loaded cuDNN 8.0.5
[08/24/2023-18:57:29] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[08/24/2023-18:58:37] [TRT] [I] Some tactics do not have sufficient workspace memory to run. Increasing workspace size will enable more tactics, please check verbose output for requested sizes.
[08/24/2023-19:06:05] [TRT] [I] Detected 1 inputs and 4 output network tensors.
[08/24/2023-19:06:08] [TRT] [I] Total Host Persistent Memory: 218880
[08/24/2023-19:06:08] [TRT] [I] Total Device Persistent Memory: 1197056
[08/24/2023-19:06:08] [TRT] [I] Total Scratch Memory: 0
[08/24/2023-19:06:08] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 48 MiB, GPU 2470 MiB
[08/24/2023-19:06:08] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 29.1457ms to assign 9 blocks to 142 nodes requiring 25804804 bytes.
[08/24/2023-19:06:08] [TRT] [I] Total Activation Memory: 25804804
[08/24/2023-19:06:08] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +40, GPU +42, now: CPU 40, GPU 42 (MiB)
export.py:172: CryptographyDeprecationWarning: Python 3.6 is no longer supported by the Python core team. Therefore, support for it is deprecated in cryptography and will be removed in a future release.
  from cryptography.fernet import Fernet
TensorRT: export success, saved as ../best.engine
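
For reference, the ONNX-to-engine step in the log above corresponds roughly to the following TensorRT 8.x Python API calls. This is a minimal sketch, not the author's export_engine(); file names are examples:

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
network = builder.create_network(flags)
parser = trt.OnnxParser(network, logger)

# Parse the ONNX model produced by export.py
with open("best.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse ONNX model")

config = builder.create_builder_config()
# 4 GiB workspace; TRT 8.4 uses memory pool limits in place of the
# older config.max_workspace_size
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4 << 30)
if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)  # matches the FP16 engine above

engine_bytes = builder.build_serialized_network(network, config)
with open("best.engine", "wb") as f:
    f.write(engine_bytes)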

3. Cause and other solutions

From searching online, the main cause is that the trained model was saved with torch.save(model, path) and loaded with model = torch.load(path), which pickles the entire model object (class references included) rather than just its state_dict. The .pt-loading code in export.py is as follows:

    if pt:
        logger.info("Find Pytorch weight")
        ckpt = torch.load(opt.weight, map_location=device)
        if opt.noema:
            model = ckpt['model']
        else:
            model = ckpt['ema'] if ckpt.get('ema') else ckpt['model']
            
        meta = get_meta_data(ckpt, model, meta)

        if opt.int8:
            zero_scale_fix(model, device)
            if model.__name__ != "EfficentYolo":
                for sub_fusion_list in op_concat_fusion_list[model.__name__]:
                    ops = [get_module(model, op_name) for op_name in sub_fusion_list]
                    concat_quant_amax_fuse(ops)
                for sub_fusion_list in op_concat_fusion_list[model.type]:
                    ops = [get_module(model, op_name) for op_name in sub_fusion_list]
                    concat_quant_amax_fuse(ops)
    
        model.float()
        if not opt.int8:
            model.fuse()
        model.to(device)
        model.eval()
        if opt.int8:
            quant_nn.TensorQuantizer.use_fb_fake_quant = True
        im = torch.zeros(1, 3, *imgsz).to(device)

        # Changes to the model's detect layer required to support ONNX export
        # model.detect.inplace = False
        if not(hasattr(model, 'type') and model.type in ['anchorfree', 'anchorbase']):
            model.type = 'anchorbase'
        model.detect.dynamic = dynamic
        model.detect.export = True  # reduce the number of outputs
        # Verify that the torch model runs correctly
        for _ in range(2):
            y = model(im)  # dry runs
            
        # Read the labels from the model and save them to labels.txt
        labels = str({i:l for i,l in enumerate(model.labels)})
        
        with open(file.parents[0]/'labels.txt','w') as f:
            f.write(labels)
        logger.info("the torch model is very successful, it's no possible!")
        
        if 'onnx' in opt.include or 'trt' in opt.include:
            try:
                import tensorrt as trt
                if model.type == 'anchorfree':
                    export_onnx(model, im, file, opt.opset, train=False, dynamic=False, simple=opt.simple)
                elif model.type == 'anchorbase':
                    if int(trt.__version__[0]) == 7:  # TensorRT 7 handling https://github.com/ultralytics/yolov5/issues/6012
                        model.detect.inplace = False
                        grid = model.detect.anchor_grid
                        model.detect.anchor_grid = [a[..., :1, :1, :] for a in grid]
                        export_onnx(model, im, file, opt.opset, train=False, dynamic=False, simple=opt.simple)  # opset 12
                        model.detect.anchor_grid = grid
                    else:  # TensorRT >= 8
                        export_onnx(model, im, file, opt.opset, train=False, dynamic=False, simple=opt.simple)  # opset 13
            except:
                logger.info("TRT ERROR, will custom onnx!")
                export_onnx(model, im, file, opt.opset, train=False, dynamic=False, simple=opt.simple)
                
            onnx_file = file.with_suffix('.onnx')
            add_meta_to_model(onnx_file, meta)
            if opt.int8:
                get_remove_qdq_onnx_and_cache(file.with_suffix('.onnx'))
                add_meta_to_model(str(onnx_file).replace('.onnx', '_wo_qdq.onnx'), meta)
                
        if 'trt' in opt.include:
            if opt.old:
                meta = False
            export_engine(onnx_file, None, meta=meta, half=opt.half, int8=opt.int8, workspace=opt.worker, encode=opt.encode, verbose=opt.verbose)
    else:    
        logger.info("Find ONNX weight")
        if not opt.old:
            meta = get_meta_data(file, None, meta)
            meta['half'] = opt.half
            meta['int8'] = opt.int8
            meta['encode'] = opt.encode
        if opt.old:
            meta = False

My guesses:
(1) When the model was saved during training, some extra state was stored alongside the weights, possibly including references tied to the training machine's environment. When the model is moved to another machine, for example the one where the TensorRT conversion runs, those references can no longer be resolved; converting to a portable ONNX model first and then to TensorRT avoids the issue.

(2) When this error shows up, the checkpoint is usually not the final, stripped model. For example, a finished yolov5m model is around 40 MB, but the best.pt and last.pt checkpoints saved during training are around 160 MB; the extra ~120 MB is likely state specific to the training machine and run, so moving such a checkpoint to another machine for use or conversion causes problems.
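
Consistent with guess (2), the extra size is mostly optimizer and EMA state stored in the checkpoint. A minimal sketch of stripping it down to inference weights, similar in spirit to yolov5's strip_optimizer() and using the same ckpt keys as the export.py excerpt above:

import torch

ckpt = torch.load("last.pt", map_location="cpu")
if ckpt.get("ema"):
    ckpt["model"] = ckpt["ema"]       # prefer the EMA weights for inference
for k in ("optimizer", "ema", "updates"):
    ckpt[k] = None                    # drop training-only state
ckpt["model"].half()                  # FP16 weights roughly halve the size
for p in ckpt["model"].parameters():
    p.requires_grad = False
torch.save(ckpt, "last_stripped.pt")  # ~40 MB instead of ~160 MB for yolov5m

Note that this still pickles the model object, so it must be run on the training machine (or with the yolov5 source on sys.path); it fixes the checkpoint size, not the import error itself.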
