FSDv2 Environment Setup: A Complete Record of Pitfalls
PyTorch version: 1.8.1+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce RTX 3070
Nvidia driver version: 460.91.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] pytorch-jacinto-ai==1.0.0+76b6fea
[pip3] torch==1.8.1+cu111
[pip3] torch-scatter==2.1.1
[pip3] torchaudio==0.8.1
[pip3] torchelastic==0.2.2
[pip3] torchex==0.1.0
[pip3] torchtext==0.9.0
[pip3] torchvision==0.9.1+cu111
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.1.74 h6bb024c_0 nvidia
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2020.2 256
[conda] mkl-service 2.3.0 py38he904b0f_0
[conda] mkl_fft 1.3.0 py38h54f3939_0
[conda] mkl_random 1.1.1 py38h0573a6f_0
[conda] numpy 1.19.5 pypi_0 pypi
[conda] pytorch-jacinto-ai 1.0.0+76b6fea dev_0 <develop>
[conda] torch 1.8.1+cu111 pypi_0 pypi
[conda] torch-scatter 2.1.1 pypi_0 pypi
[conda] torchaudio 0.8.1 pypi_0 pypi
[conda] torchelastic 0.2.2 pypi_0 pypi
[conda] torchex 0.1.0 dev_0 <develop>
[conda] torchtext 0.9.0 py38 pytorch
[conda] torchvision 0.9.1+cu111 pypi_0 pypi
## pip list
cumm-cu102 0.4.11
mmcv-full 1.3.8
mmdet 2.14.0
mmdet3d 0.15.0 /codes/SST
mmengine 0.8.4
mmpycocotools 12.0.3
mmsegmentation 0.14.1
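When comparing a broken environment against the listings above, it can help to print the same versions programmatically. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package names are simply copied from the listings.

```python
from importlib import metadata

# Distribution names copied from the listings above.
PACKAGES = [
    "torch", "torchvision", "torch-scatter", "torchex",
    "mmcv-full", "mmdet", "mmdet3d", "cumm-cu102",
]


def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name:15s} {version or 'NOT INSTALLED'}")
```

A `None` entry for `mmcv-full` or `torchex` points directly at several of the errors described in this post.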
Problems encountered and fixes:
-
ModuleNotFoundError: No module named 'ingroup_indices'
Fix: download and build TorchEx.
-
provided PTX was compiled with an unsupported toolchain
Fix: check that the CUDA, cuDNN, and torch versions match each other.
Reference: https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/
-
OSError: /opt/conda/lib/python3.8/site-packages/torch_scatter/_scatter_cuda.so: undefined symbol: _ZNK2at6Tensor6deviceEv
Fix: reinstall torch_scatter (the torch and CUDA versions in the wheel index URL must match the installed torch build):
pip install --no-index torch-scatter -f https://pytorch-geometric.com/whl/torch-1.9.0+cu111.html
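Since the wheel index URL encodes a torch version and a CUDA tag, a quick check before installing can catch a mismatch early. The sketch below is illustrative only: `torch_build_tag` and `index_matches_torch` are hypothetical helpers, and comparing only major.minor plus the CUDA tag (ignoring the patch number) is an assumption about how the wheels are grouped.

```python
import re


def torch_build_tag(version_or_url):
    """Extract (major, minor, cuda_tag) from e.g. '1.8.1+cu111' or a wheel index URL."""
    match = re.search(r"(\d+)\.(\d+)\.\d+\+(cu\d+|cpu)", version_or_url)
    if match is None:
        raise ValueError(f"no torch build tag found in {version_or_url!r}")
    return int(match.group(1)), int(match.group(2)), match.group(3)


def index_matches_torch(index_url, torch_version):
    """True if the index URL targets the same torch major.minor and CUDA tag.

    Assumption: patch releases (1.8.0 vs 1.8.1) share the same wheels.
    """
    return torch_build_tag(index_url) == torch_build_tag(torch_version)


if __name__ == "__main__":
    url = "https://pytorch-geometric.com/whl/torch-1.9.0+cu111.html"
    # Version string taken from the environment dump at the top of this post.
    print(index_matches_torch(url, "1.8.1+cu111"))
```

In a live environment, pass `torch.__version__` as the second argument instead of a hard-coded string.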
-
ImportError: /data/codes/SST/mmdet3d/ops/knn/knn_ext.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZNK2at6Tensor7is_cudaEv
Fix: rebuild mmdet3d inside the SST folder (do not download the official 0.15.0 code; the author modified the source):
pip install -e . -v
or
python setup.py build develop
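Both undefined-symbol errors above name mangled C++ symbols. Decoding them shows they are methods of `at::Tensor`, which typically means the extension was compiled against a different PyTorch build than the one currently installed. The command-line tool `c++filt` does this decoding properly; the sketch below is a deliberately tiny, hypothetical decoder (`demangle_simple`) that only handles nested names of functions without arguments, which is enough for these two symbols.

```python
def demangle_simple(symbol):
    """Decode symbols of the form _ZN[K]<len><name>...Ev (Itanium C++ ABI).

    Only nested names of zero-argument functions are supported.
    """
    if not symbol.startswith("_ZN"):
        raise ValueError("not a nested Itanium-mangled name")
    pos = 3
    is_const = symbol.startswith("K", pos)  # 'K' marks a const member function
    if is_const:
        pos += 1
    parts = []
    while pos < len(symbol) and symbol[pos] != "E":
        start = pos
        while pos < len(symbol) and symbol[pos].isdigit():
            pos += 1
        if pos == start:
            raise ValueError("unsupported construct in symbol")
        length = int(symbol[start:pos])  # each name is prefixed by its length
        parts.append(symbol[pos:pos + length])
        pos += length
    if symbol[pos + 1:] != "v":  # 'v' means an empty (void) parameter list
        raise ValueError("only functions without arguments are handled")
    return "::".join(parts) + "()" + (" const" if is_const else "")


if __name__ == "__main__":
    print(demangle_simple("_ZNK2at6Tensor6deviceEv"))   # at::Tensor::device() const
    print(demangle_simple("_ZNK2at6Tensor7is_cudaEv"))  # at::Tensor::is_cuda() const
```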
-
ImportError: cannot import name 'SparseModule' from 'mmcv.ops' (/opt/conda/lib/python3.8/site-packages/mmcv/ops/__init__.py)
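Relevant code in SST:
    from .spconv import IS_SPCONV2_AVAILABLE
    if IS_SPCONV2_AVAILABLE:
        from spconv.pytorch import SparseModule, SparseSequential
    else:
        from mmcv.ops import SparseModule, SparseSequential
Cause: SparseModule only appears in mmcv-full>=1.4.12, so the real reason for the error is that IS_SPCONV2_AVAILABLE=False, i.e. spconv2 was not installed successfully.
Fix: reinstall spconv2. (It requires CUDA > 11.0; prebuilt wheels exist for 10.2/11.3/11.4. Do not trust the author's claim that the 11.x versions are mutually compatible: the 11.3 wheel does not work with CUDA 11.1. In the end the 10.2 build was the one that ran.)
pip install spconv-cuxxx
(replace xxx with the CUDA version of the wheel, e.g. spconv-cu102)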
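Because the fallback branch turns a failed spconv import into a misleading mmcv error, it is worth importing spconv directly to see the underlying exception. The following is a minimal sketch; `probe_import` is a hypothetical helper, and it assumes that the availability flag is derived from whether `spconv.pytorch` imports cleanly (the flag's actual definition is not shown in this post).

```python
import importlib


def probe_import(module_name):
    """Try to import a module; return (ok, reason) where reason explains a failure."""
    try:
        importlib.import_module(module_name)
    except Exception as exc:  # broken CUDA binaries may raise OSError, not just ImportError
        return False, f"{type(exc).__name__}: {exc}"
    return True, ""


if __name__ == "__main__":
    # 'cumm' is the companion package that appears in the pip list as cumm-cu102.
    for name in ("cumm", "spconv", "spconv.pytorch"):
        ok, reason = probe_import(name)
        print(f"{name:15s} {'OK' if ok else 'FAILED  ' + reason}")
```

If `spconv.pytorch` fails here, the printed reason is the real problem to fix, not the `SparseModule` import from mmcv.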
-
AttributeError: module 'mmcv' has no attribute 'Config'
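Fix: first check whether mmcv was installed by mistake instead of mmcv-full. If not, suspect the mmcv version and reinstall it according to the requirement in requirements/mminstall.txt (1.3.8<=mmcv<=1.4.0):
pip install mmcv-full=={mmcv_version} -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.9.0/index.html
The cu111/torch1.9.0 part of the URL must match the installed CUDA and torch versions.
Reference: https://mmcv.readthedocs.io/zh_CN/v1.3.16/get_started/installation.html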
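The two checks described above (which distribution is installed, and whether its version falls inside 1.3.8<=mmcv<=1.4.0) can be sketched as follows. The helper names are hypothetical, and the version parser only handles plain numeric versions such as `1.3.8`, optionally with a `+cu111`-style suffix.

```python
from importlib import metadata


def parse_version(text):
    """Turn '1.3.8' or '1.3.8+cu111' into a comparable tuple such as (1, 3, 8)."""
    core = text.split("+", 1)[0]
    return tuple(int(part) for part in core.split("."))


def mmcv_version_ok(version, low="1.3.8", high="1.4.0"):
    """Check the range taken from requirements/mminstall.txt as quoted above."""
    return parse_version(low) <= parse_version(version) <= parse_version(high)


def installed_mmcv():
    """Return (distribution name, version) for mmcv-full or mmcv, or None."""
    for name in ("mmcv-full", "mmcv"):
        try:
            return name, metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return None


if __name__ == "__main__":
    found = installed_mmcv()
    if found is None:
        print("neither mmcv-full nor mmcv is installed")
    else:
        name, version = found
        print(name, version, "in range" if mmcv_version_ok(version) else "OUT OF RANGE")
```

If the reported name is `mmcv` rather than `mmcv-full`, uninstall it before installing `mmcv-full`.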
-
OSError: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
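Fix: the mmcv and torch versions do not match; reinstall either torch or mmcv-full. To reinstall torch with pip:
pip install -U torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
or with conda:
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge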
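When reinstalling torch, torchvision and torchaudio have to move with it. The sketch below encodes the release pairings for the versions that appear in this post (1.8.0, 1.8.1, and the 1.9.0 used in the wheel URLs); the table is intentionally partial and `check_triple` is a hypothetical helper.

```python
# torch version -> (torchvision version, torchaudio version); partial table.
COMPATIBLE = {
    "1.8.0": ("0.9.0", "0.8.0"),
    "1.8.1": ("0.9.1", "0.8.1"),
    "1.9.0": ("0.10.0", "0.9.0"),
}


def strip_local(version):
    """Drop a local build tag such as '+cu111'."""
    return version.split("+", 1)[0]


def check_triple(torch_v, vision_v, audio_v):
    """True if the three versions belong to the same release set.

    Raises KeyError for torch versions missing from the partial table.
    """
    expected = COMPATIBLE[strip_local(torch_v)]
    return (strip_local(vision_v), strip_local(audio_v)) == expected


if __name__ == "__main__":
    # Versions from the environment dump at the top of this post.
    print(check_triple("1.8.1+cu111", "0.9.1+cu111", "0.8.1"))  # True
```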