I recently started using the distributed training framework DeepSpeed. After installing it, training failed with the following error:
  File "**/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "**/site-packages/deepspeed/ops/op_builder/builder.py", line 479, in load
    return self.jit_load(verbose)
  File "**/site-packages/deepspeed/ops/op_builder/builder.py", line 523, in jit_load
    op_module = load(name=self.name,
  File "**/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "**/site-packages/torch/utils/cpp_extension.py", line 1535, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "**/site-packages/torch/utils/cpp_extension.py", line 1929, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "<frozen importlib._bootstrap>", line 556, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1166, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py38_cu117/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
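The ImportError means the JIT build directory exists but the compiled library is not in it. As a quick check, the sketch below lists op directories under a torch extensions build folder that contain no .so file. The function name find_incomplete_builds is my own, and the default path is only the one from the traceback above; treating a directory without a .so as a failed or interrupted build is an assumption, not something DeepSpeed reports itself.

```python
import sys
from pathlib import Path


def find_incomplete_builds(build_root):
    """Return sorted names of op directories under build_root with no .so file."""
    root = Path(build_root)
    if not root.is_dir():
        return []
    return sorted(
        d.name
        for d in root.iterdir()
        if d.is_dir() and not any(d.glob("*.so"))
    )


if __name__ == "__main__":
    # Default mirrors the directory in the traceback; pass another path as argv[1].
    default = Path.home() / ".cache" / "torch_extensions" / "py38_cu117"
    target = sys.argv[1] if len(sys.argv) > 1 else default
    for name in find_incomplete_builds(target):
        print("no compiled library for op:", name)
```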
The versions involved: OS CentOS 7, kernel 3.10.0-1160.92.1.el7.x86_64, Python 3.8, CUDA version supported by the GPU driver 11.2, torch 2.0.1+cu117, nvcc 11.2, DeepSpeed 0.13.4.
The ds_report output is as follows:
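The two versions that matter most here, the CUDA version torch was built with and the nvcc release, can be collected with a short script. This is a minimal sketch: the helper names are my own, it only looks at the nvcc found on PATH (which may differ from the one under CUDA_HOME that the extension build uses), and it returns None when torch or nvcc is missing.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Extract 'X.Y' from the 'release X.Y' part of `nvcc --version` output."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def nvcc_version():
    """Return the release of the nvcc on PATH, or None if nvcc is not found."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


def torch_cuda_version():
    """Return the CUDA version torch was built with, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda


if __name__ == "__main__":
    print("torch CUDA version:", torch_cuda_version())
    print("nvcc version      :", nvcc_version())
```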
[2024-03-06 17:33:52,658] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['**/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['**/site-packages/deepspeed']
deepspeed info ................... 0.13.4, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.2
deepspeed wheel compiled w. ...... torch 1.10, cuda 10.2
shared memory (/dev/shm) size .... 125.87 GB
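The op table in the report can also be read programmatically, which is handy when checking several machines. The sketch below parses lines of the form shown above into a dictionary; the regular expression is written against this report's layout only and is an assumption about the format, not an interface DeepSpeed guarantees.

```python
import re

# Matches rows such as "fused_adam ............. [NO] ....... [OKAY]".
OP_LINE = re.compile(r"^(\w+) \.+ \[(YES|NO)\] \.+ \[(OKAY|NO)\]$")


def parse_op_table(report_text):
    """Map op name -> (installed, compatible) for each op row in ds_report output."""
    ops = {}
    for line in report_text.splitlines():
        match = OP_LINE.match(line.strip())
        if match:
            name, installed, compatible = match.groups()
            ops[name] = (installed == "YES", compatible == "OKAY")
    return ops


if __name__ == "__main__":
    sample = """\
ninja .................. [OKAY]
op name ................ installed .. compatible
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
"""
    for op, (installed, compatible) in parse_op_table(sample).items():
        print(f"{op}: installed={installed}, compatible={compatible}")
```

To use it on a real machine, capture the output of ds_report (for example by redirecting it to a file) and pass the text to parse_op_table.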
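Here the line fused_adam ............. [NO] ....... [OKAY] shows that fused_adam is not installed. Also, torch's CUDA version is 11.7 while nvcc is 11.2.
I then switched to another machine: Ubuntu 20.04, Python 3.10, CUDA version supported by the GPU driver 12.2, nvcc 11.8, torch 2.1.2+cu121, DeepSpeed 0.13.5.
Running DeepSpeed there produced the same error.
Output of ds_report: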
[2024-03-06 17:47:43,578] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['**/site-packages/torch']
torch version .................... 2.1.2+cu121
deepspeed install path ........... ['**/site-packages/deepspeed']
deepspeed info ................... 0.13.5, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
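shared memory (/dev/shm) size .... 125.76 GB
Here torch's CUDA version is 12.1 but nvcc is 11.8. Following reference 2, I cloned the DeepSpeed repository from GitHub, entered the DeepSpeed directory, and ran DS_BUILD_FUSED_ADAM=1 pip3 install ., which failed with the following error: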
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "**/DeepSpeed/setup.py", line 196, in <module>
ext_modules.append(builder.builder())
File "**/DeepSpeed/op_builder/builder.py", line 633, in builder
assert_no_cuda_mismatch(self.name)
File "**/DeepSpeed/op_builder/builder.py", line 101, in assert_no_cuda_mismatch
raise CUDAMismatchException(
op_builder.builder.CUDAMismatchException: >- DeepSpeed Op Builder: Installed CUDA version 11.8 does not match the version torch was compiled with 12.1, unable to compile cuda/cpp extensions without a matching cuda version.
DS_BUILD_OPS=0
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
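The message says that CUDA 11.8 does not match the 12.1 that torch was compiled with, so I upgraded CUDA to 12.2, entered the DeepSpeed directory again, and reran DS_BUILD_FUSED_ADAM=1 pip3 install . After the install, ds_report showed fused_adam as installed: fused_adam ............. [YES] ...... [OKAY]. Running training again no longer produced the error above. Note that DS_BUILD_FUSED_ADAM=1 pip3 install deepspeed did not work for me: fused_adam was not installed that way.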
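For repeatable setups, the source install can be scripted. The sketch below only wraps the same command used above (the environment variable DS_BUILD_FUSED_ADAM=1 plus pip install . inside the cloned repository); the function names and the repo_dir argument are hypothetical, and nothing runs until install_from_source is called explicitly.

```python
import os
import subprocess
import sys


def build_install_command(op_flag="DS_BUILD_FUSED_ADAM"):
    """Return (cmd, env) for a source install with one op pre-built."""
    env = dict(os.environ)
    env[op_flag] = "1"  # ask the DeepSpeed setup to pre-build this op
    cmd = [sys.executable, "-m", "pip", "install", "."]
    return cmd, env


def install_from_source(repo_dir):
    """Run the install inside repo_dir, a local clone of the DeepSpeed repository."""
    cmd, env = build_install_command()
    subprocess.run(cmd, cwd=repo_dir, env=env, check=True)
```

After it finishes, ds_report should show [YES] in the installed column for fused_adam if the build succeeded.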
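Conclusion: the CUDA version torch was built with needs to agree with the nvcc version. From the attempts above, the major versions at least must be the same (nvcc 11.8 against torch's 12.1 was rejected), while a small difference in the minor version can still work (nvcc 12.2 with torch built for 12.1 compiled fine).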
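The rule of thumb can be written down as a small pre-flight check to run before attempting a build. This is a simplified sketch based only on the version pairs seen on my two machines; compare_cuda is a hypothetical helper and is not DeepSpeed's actual assert_no_cuda_mismatch logic.

```python
def compare_cuda(torch_cuda, nvcc_cuda):
    """Classify a (torch CUDA, nvcc) version pair by major/minor agreement."""
    t_major, t_minor = (int(part) for part in torch_cuda.split(".")[:2])
    n_major, n_minor = (int(part) for part in nvcc_cuda.split(".")[:2])
    if t_major != n_major:
        return "major-mismatch"  # e.g. the rejected 12.1 vs 11.8 build
    if t_minor != n_minor:
        return "minor-mismatch"  # e.g. 12.1 vs 12.2, which built for me
    return "match"


if __name__ == "__main__":
    # Version pairs taken from the machines described in this post.
    for torch_cuda, nvcc_cuda in [("11.7", "11.2"), ("12.1", "11.8"), ("12.1", "12.2")]:
        print(torch_cuda, nvcc_cuda, compare_cuda(torch_cuda, nvcc_cuda))
```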
References:
1 fused_adam.so: cannot open shared object file: No such file or directory, troubleshooting and solution (优快云 blog)
2 fused_adam.so: cannot open shared object file: No such file or directory · Issue #119 · databrickslabs/dolly · GitHub