fused_adam.so: cannot open shared object file: No such file or directory

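This post describes an installation error encountered while using DeepSpeed for distributed training. The main cause was a mismatch between the CUDA toolkit version (nvcc 11.2) and the torch build (2.0.1+cu117). After changing the CUDA version and building the extension from source, the finding was that torch's CUDA version has to match the nvcc version before extensions such as fused_adam install successfully.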

I recently used the distributed training framework DeepSpeed for training. After installation it failed with the following error:

  File "**/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "**/site-packages/deepspeed/ops/op_builder/builder.py", line 479, in load
    return self.jit_load(verbose)
  File "**/site-packages/deepspeed/ops/op_builder/builder.py", line 523, in jit_load
    op_module = load(name=self.name,
  File "**/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "**/site-packages/torch/utils/cpp_extension.py", line 1535, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "**/site-packages/torch/utils/cpp_extension.py", line 1929, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "<frozen importlib._bootstrap>", line 556, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1166, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py38_cu117/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
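The ImportError says the JIT build directory exists but holds no fused_adam.so, which usually means the just-in-time build did not complete. As a rough diagnostic, the sketch below lists build directories that lack their shared object. The helper name find_incomplete_builds is hypothetical, and the directory layout (cache root / variant such as py38_cu117 / op name / op name + ".so") is an assumption taken from the path in the traceback above.

```python
from pathlib import Path


def find_incomplete_builds(root: Path) -> list:
    """Return op build directories under `root` that contain no <op>.so file.

    Assumed layout (from the traceback): <root>/<variant>/<op>/<op>.so,
    e.g. ~/.cache/torch_extensions/py38_cu117/fused_adam/fused_adam.so
    """
    incomplete = []
    if not root.is_dir():
        return incomplete
    for variant_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for op_dir in sorted(p for p in variant_dir.iterdir() if p.is_dir()):
            # A directory without its shared object is a build that never finished.
            if not (op_dir / f"{op_dir.name}.so").exists():
                incomplete.append(op_dir)
    return incomplete


if __name__ == "__main__":
    # Default cache location seen in the traceback; adjust if yours differs.
    cache_root = Path.home() / ".cache" / "torch_extensions"
    for op_dir in find_incomplete_builds(cache_root):
        print(f"no shared object in {op_dir}")
```

A directory reported here only tells you where the build stopped; the reason still has to come from the build log or from ds_report.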

The versions on this machine: CentOS 7 with kernel 3.10.0-1160.92.1.el7.x86_64, Python 3.8, CUDA version supported by the GPU driver 11.2, torch 2.0.1+cu117, nvcc 11.2, DeepSpeed 0.13.4.
The output of ds_report is as follows:

[2024-03-06 17:33:52,658] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['**/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['**/site-packages/deepspeed']
deepspeed info ................... 0.13.4, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.2
deepspeed wheel compiled w. ...... torch 1.10, cuda 10.2
shared memory (/dev/shm) size .... 125.87 GB
 

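Here the line fused_adam ............. [NO] ....... [OKAY] shows that the op is not pre-installed (it would have to be JIT-compiled at runtime), and the torch CUDA version is 11.7 while the nvcc version is 11.2. I then switched to another machine: Ubuntu 20.04, Python 3.10, GPU driver supporting CUDA 12.2, nvcc 11.8, torch 2.1.2+cu121, DeepSpeed 0.13.5.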

Running deepspeed gave the same error as above.

Running ds_report gave:

[2024-03-06 17:47:43,578] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['**/site-packages/torch']
torch version .................... 2.1.2+cu121
deepspeed install path ........... ['**/site-packages/deepspeed']
deepspeed info ................... 0.13.5, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
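shared memory (/dev/shm) size .... 125.76 GB

Here the torch CUDA version is 12.1 but the nvcc version is 11.8. Following reference 2, I cloned the DeepSpeed repository from GitHub, entered the DeepSpeed directory, and ran DS_BUILD_FUSED_ADAM=1 pip3 install ., which failed with the following error: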

      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "**/DeepSpeed/setup.py", line 196, in <module>
          ext_modules.append(builder.builder())
        File "**/DeepSpeed/op_builder/builder.py", line 633, in builder
          assert_no_cuda_mismatch(self.name)
        File "**/DeepSpeed/op_builder/builder.py", line 101, in assert_no_cuda_mismatch
          raise CUDAMismatchException(
      op_builder.builder.CUDAMismatchException: >- DeepSpeed Op Builder: Installed CUDA version 11.8 does not match the version torch was compiled with 12.1, unable to compile cuda/cpp extensions without a matching cuda version.
      DS_BUILD_OPS=0
       [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
       [WARNING]  async_io: please install the libaio-dev package with apt
       [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
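The message shows that CUDA 11.8 does not match the torch build for CUDA 12.1, so I upgraded CUDA to 12.2, went back into the DeepSpeed directory, and ran DS_BUILD_FUSED_ADAM=1 pip3 install . again. After the install, ds_report showed fused_adam as installed: fused_adam ............. [YES] ...... [OKAY]. Running the training again no longer produced the error above. Note that DS_BUILD_FUSED_ADAM=1 pip3 install deepspeed did not work here; it did not install fused_adam.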
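To check the installed/compatible columns without reading the table by eye, the rows of the ds_report op table can be parsed with a small regular expression. This is only a sketch: parse_op_table is a hypothetical helper, and the row format is assumed to be exactly what the ds_report output pasted in this post shows.

```python
import re

# Matches rows such as: "fused_adam ............. [NO] ....... [OKAY]"
_OP_ROW = re.compile(r"^(\w+)\s+\.+\s+\[(YES|NO)\]\s+\.+\s+\[(OKAY|NO)\]$")


def parse_op_table(report: str) -> dict:
    """Map op name -> (installed, compatible) for each op row in ds_report text."""
    ops = {}
    for line in report.splitlines():
        match = _OP_ROW.match(line.strip())
        if match:
            name, installed, compatible = match.groups()
            ops[name] = (installed == "YES", compatible == "OKAY")
    return ops


# Two rows copied from this post: before and after the source build.
before = parse_op_table("fused_adam ............. [NO] ....... [OKAY]")
after = parse_op_table("fused_adam ............. [YES] ...... [OKAY]")
print(before["fused_adam"], after["fused_adam"])
```

On a real machine the text passed in would be the captured output of the ds_report command.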

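Conclusion: the CUDA version that torch was built against needs to match the nvcc version. An exact match is the safe choice. In these tests, an nvcc older than torch's CUDA version failed (11.2 vs. 11.7, and 11.8 vs. 12.1), while an nvcc slightly newer within the same major version worked (12.2 vs. 12.1).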
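The comparison behind this conclusion can be sketched in a few lines. The two helpers below are hypothetical names, not part of DeepSpeed or torch; the only outside assumption is that `nvcc --version` prints a line containing "release X.Y". The code only classifies the difference between the two versions and does not claim which combinations DeepSpeed will accept.

```python
import re


def parse_nvcc_release(nvcc_output: str) -> str:
    """Extract 'major.minor' from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return f"{match.group(1)}.{match.group(2)}"


def compare_cuda_versions(nvcc_version: str, torch_cuda_version: str) -> str:
    """Classify the difference as 'match', 'minor mismatch' or 'major mismatch'."""
    nvcc_major, nvcc_minor = (int(p) for p in nvcc_version.split(".")[:2])
    torch_major, torch_minor = (int(p) for p in torch_cuda_version.split(".")[:2])
    if nvcc_major != torch_major:
        return "major mismatch"
    if nvcc_minor != torch_minor:
        return "minor mismatch"
    return "match"


# The three (nvcc, torch CUDA) combinations reported in this post.
for nvcc_version, torch_cuda in [("11.2", "11.7"), ("11.8", "12.1"), ("12.2", "12.1")]:
    print(nvcc_version, torch_cuda, compare_cuda_versions(nvcc_version, torch_cuda))
```

On a real machine, the torch side would come from torch.version.cuda and the nvcc side from the output of nvcc --version.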

References:

1  fused_adam.so: cannot open shared object file: No such file or directory: troubleshooting and fix (CSDN blog)

2  fused_adam.so: cannot open shared object file: No such file or directory · Issue #119 · databrickslabs/dolly · GitHub

3  [Engineering practice] Fixing nvcc: command not found (CSDN blog)

4  https://github.com/stanford-crfm/mistral/issues/196
