1 Training the model requires transformer_engine
System: Ubuntu 22.04
#uname -a
Linux ubuntu-workstation 6.8.0-54-generic #56~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Sat Feb 8 11:41:24 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Inside the conda virtual environment:
python=3.11
transformers==4.57
pytorch=2.7.0
cuda=12.6, cudnn=8.9.7
cmake version 3.22.1
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
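Most of the problems below come down to version mismatches, so it helps to record what is actually installed before and after each attempt. The following is a minimal sketch using only the standard library; the distribution names in the list are assumptions based on the packages mentioned in this post, and it reports Python packages only, not the system CUDA or cuDNN.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed distribution names; adjust to match your environment.
    wanted = ["torch", "transformers", "transformer_engine", "nvidia-cudnn-cu12"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}: {version or 'not installed'}")
```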
1 Installing with pip, following reference 1
pip3 install --no-build-isolation transformer_engine[pytorch]
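After installation, test it: start python on the command line and run import transformer_engine. The example given in reference 4 can also be used for testing.
The first attempt failed with this error: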
File "/***/TransformerEngine/transformer_engine/__init__.py", line 11, in <module>
import transformer_engine.common
File "/***/TransformerEngine/transformer_engine/common/__init__.py", line 392, in <module>
sanity_checks_for_pypi_installation()
File "/***/TransformerEngine/transformer_engine/common/__init__.py", line 206, in sanity_checks_for_pypi_installation
assert te_installed_via_pypi, "Could not find `transformer-engine` PyPI package."
^^^^^^^^^^^^^^^^^^^^^
AssertionError: Could not find `transformer-engine` PyPI package.
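The cause turned out to be the working directory: it was the TransformerEngine repository cloned from GitHub, which contains a transformer_engine folder with its own __init__.py. Python therefore imported that local folder instead of the installed transformer_engine package, and the sanity check failed. (The same kind of error also appears when the transformer_engine and torch versions do not match.) After switching to another directory, the import succeeded.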
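To check for this kind of shadowing before importing, one option is to ask Python where it would load a module from without executing it. The sketch below uses importlib.util.find_spec from the standard library; the helper names import_location and is_shadowed_by_cwd are made up for this post, and the check is a simple heuristic (the module sits directly inside the current directory).

```python
import importlib.util
import os


def import_location(module_name):
    """Return the directory (package) or file (module) Python would import, or None."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return None
    if spec.submodule_search_locations:
        # A package: report the package directory itself.
        return os.path.realpath(list(spec.submodule_search_locations)[0])
    if spec.has_location and spec.origin:
        return os.path.realpath(spec.origin)
    return None  # built-in or frozen module, no file on disk


def is_shadowed_by_cwd(module_name, cwd=None):
    """True if the module would be loaded from directly inside `cwd`."""
    location = import_location(module_name)
    if location is None:
        return False
    cwd = os.path.realpath(cwd or os.getcwd())
    return os.path.dirname(location) == cwd


if __name__ == "__main__":
    print(import_location("transformer_engine"))
    print(is_shadowed_by_cwd("transformer_engine"))
```

In an interactive session started from the cloned repository, the second call would be expected to return True, which is the situation described above.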
Continuing the test:
import transformer_engine.pytorch
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ubuntu/anaconda3/envs/finetuning/lib/python3.11/site-packages/transformer_engine/pytorch/__init__.py", line 18, in <module>
load_framework_extension("torch")
File "/home/ubuntu/anaconda3/envs/finetuning/lib/python3.11/site-packages/transformer_engine/common/__init__.py", line 190, in load_framework_extension
solib = importlib.util.module_from_spec(spec)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ImportError: /home/ubuntu/anaconda3/envs/finetuning/lib/python3.11/site-packages/transformer_engine/wheel_lib/transformer_engine_torch.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEab
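At this point, delete the whole conda virtual environment, recreate it, and install transformer_engine first, before anything else. Because the environment is clean, pip pulls in the dependent packages as well, including transformer_engine_torch-2.10.0, and the import no longer fails afterwards. Note that this also installs torch, and the version it picks may not be the one you want. torch can be reinstalled afterwards, though a version that differs from the one the extension was built against may bring the undefined-symbol error back.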
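The undefined symbol in the traceback above is a mangled C++ name. The usual tool for decoding such names is c++filt; as an illustration, the sketch below decodes just the simplest form (a nested name followed by single-letter builtin parameter types), which is enough for this symbol. It is not a general demangler, and the function name demangle_simple is made up for this post.

```python
# Builtin type codes from the Itanium C++ ABI name mangling scheme.
BUILTIN_TYPES = {
    "v": "void", "b": "bool", "c": "char", "a": "signed char",
    "h": "unsigned char", "s": "short", "t": "unsigned short",
    "i": "int", "j": "unsigned int", "l": "long", "m": "unsigned long",
    "x": "long long", "y": "unsigned long long", "f": "float", "d": "double",
}


def demangle_simple(symbol):
    """Decode _ZN<len><name>...E<builtin types>; raise ValueError otherwise."""
    if not symbol.startswith("_ZN"):
        raise ValueError("only _ZN...E nested names are supported")
    pos = 3
    parts = []
    while pos < len(symbol) and symbol[pos] != "E":
        start = pos
        while pos < len(symbol) and symbol[pos].isdigit():
            pos += 1
        if pos == start:
            raise ValueError(f"unsupported construct at offset {pos}")
        length = int(symbol[start:pos])
        parts.append(symbol[pos:pos + length])
        pos += length
    if pos >= len(symbol):
        raise ValueError("missing terminating 'E'")
    params = []
    for code in symbol[pos + 1:]:
        if code not in BUILTIN_TYPES:
            raise ValueError(f"unsupported type code {code!r}")
        params.append(BUILTIN_TYPES[code])
    if params == ["void"]:
        params = []
    return "::".join(parts) + "(" + ", ".join(params) + ")"


if __name__ == "__main__":
    print(demangle_simple("_ZN3c104cuda9SetDeviceEab"))
```

The symbol decodes to c10::cuda::SetDevice(signed char, bool), a function in PyTorch's c10 CUDA library. A missing symbol of this kind suggests that the prebuilt transformer_engine_torch extension was compiled against a different torch build than the one installed, which is consistent with the clean reinstall fixing it.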
The command above installs the latest transformer_engine by default (2.10.0 at the time of writing). To install an earlier version, pin it, for example 2.8.0:
pip3 install --no-build-isolation transformer_engine[pytorch]==2.8.0
2 Installing from GitHub
pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
Three submodules also have to be cloned, so this easily times out.
3 Installing from source
# Clone repository, checkout stable branch, clone submodules
git clone --branch stable --recursive https://github.com/NVIDIA/TransformerEngine.git
cd TransformerEngine
export NVTE_FRAMEWORK=pytorch # Optionally set framework
pip3 install --no-build-isolation . # Build and install
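Cloning the source also has to clone the related submodules. If they cannot be cloned automatically, clone or download them by hand into the corresponding directory, ./TransformerEngine/3rdparty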
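A submodule that failed to clone typically leaves an empty directory behind, so a quick check before building is to list the empty subdirectories of the third-party folder. This is a minimal sketch; the function name empty_subdirs is made up for this post, and the path in the usage line assumes the script runs from the directory that contains the clone.

```python
from pathlib import Path


def empty_subdirs(third_party_dir):
    """Return the names of subdirectories that contain no entries at all."""
    root = Path(third_party_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"not a directory: {root}")
    return sorted(
        child.name
        for child in root.iterdir()
        if child.is_dir() and not any(child.iterdir())
    )


if __name__ == "__main__":
    try:
        # Assumed location of the clone; change it to match your checkout.
        missing = empty_subdirs("TransformerEngine/3rdparty")
        print("empty submodule directories:", missing or "none")
    except FileNotFoundError as exc:
        print(exc)
```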
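The following errors came up.
(1) error Version mismatch in cuDNN GRAPH!!!
This is caused by a cuDNN version that is too old (8.9.7 here). The installation guide in the references lists the following prerequisites: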
- Linux x86_64
- NVIDIA Driver supporting CUDA 12.1 or later.
- cuDNN 9.3 or later.
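The cuDNN prerequisite (item 4 in the guide) requires version 9.3 or later. Download the latest cuDNN from https://developer.nvidia.com/cudnn-downloads and update; following the instructions there, version 9.17.0 was installed.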
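The version requirement can be checked mechanically by comparing dotted version strings component by component. The sketch below is a plain-Python illustration with made-up helper names; it assumes purely numeric components such as 8.9.7 or 9.10.2.21.

```python
def parse_version(text):
    """Turn '9.10.2.21' into (9, 10, 2, 21)."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(installed, minimum):
    """True if `installed` is at least `minimum`, padding missing parts with 0."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    for version in ("8.9.7", "9.17.0"):
        print(version, "meets cuDNN >= 9.3:", meets_minimum(version, "9.3"))
```

Comparing integer tuples rather than raw strings matters here: as strings, "9.17.0" would sort before "9.3".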
(2) NCCL 2 is missing
#include <nccl.h> no such name or directory
Install NCCL 2 following https://zhuanlan.zhihu.com/p/25513394133. After installing, download nccl-tests from GitHub and run a test. This machine has only one GPU, so -g is 1:
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
# nccl-tests version 2.17.6 nccl-headers=22403 nccl-library=22403
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 424079 on ubuntu-workstation device 0 [0000:04:00] NVIDIA GeForce RTX 4090
#
#                                                       out-of-place                       in-place
#   size      count   type redop root    time   algbw   busbw #wrong    time   algbw   busbw #wrong
#    (B) (elements)                      (us)  (GB/s)  (GB/s)           (us)  (GB/s)  (GB/s)
       8          2  float   sum   -1    1.97    0.00    0.00      0    0.06    0.13    0.00      0
      16          4  float   sum   -1    2.52    0.01    0.00      0    0.06    0.26    0.00      0
      32          8  float   sum   -1    2.21    0.01    0.00      0    0.06    0.52    0.00      0
      64         16  float   sum   -1    1.84    0.03    0.00      0    0.06    1.03    0.00      0
     128         32  float   sum   -1    1.82    0.07    0.00      0    0.06    2.12    0.00      0
     256         64  float   sum   -1   10.49    0.02    0.00      0    0.06    4.25    0.00      0
     512        128  float   sum   -1    2.40    0.21    0.00      0    0.06    8.76    0.00      0
    1024        256  float   sum   -1    2.41    0.43    0.00      0    0.06   17.39    0.00      0
    2048        512  float   sum   -1    2.46    0.83    0.00      0    0.06   34.83    0.00      0
    4096       1024  float   sum   -1    2.42    1.69    0.00      0    0.06   67.87    0.00      0
    8192       2048  float   sum   -1    2.01    4.07    0.00      0    0.06  137.33    0.00      0
   16384       4096  float   sum   -1    2.05    8.00    0.00      0    0.06  284.69    0.00      0
   32768       8192  float   sum   -1    2.50   13.12    0.00      0    0.06  539.84    0.00      0
   65536      16384  float   sum   -1    1.93   33.97    0.00
Running the installation again no longer reports this error, which means NCCL is installed.
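The nccl-tests banner reports the header and library versions as an integer code (nccl-headers=22403 nccl-library=22403). As a small sketch, the code can be decoded into major.minor.patch. The encoding assumed here is the one used by the NCCL_VERSION macro in nccl.h (major*10000 + minor*100 + patch for recent releases, major*1000 + minor*100 + patch for releases up to 2.8); the function name is made up for this post.

```python
def decode_nccl_version(code):
    """Decode an NCCL version code such as 22403 into (major, minor, patch)."""
    if code >= 10000:
        # Newer encoding: major * 10000 + minor * 100 + patch
        major, rest = divmod(code, 10000)
    else:
        # Older encoding (up to 2.8.x): major * 1000 + minor * 100 + patch
        major, rest = divmod(code, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


if __name__ == "__main__":
    print(".".join(str(part) for part in decode_nccl_version(22403)))
```

For the run above this gives 2.24.3 for both headers and library; matching values indicate that the headers found at build time and the library loaded at run time come from the same NCCL release.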
(3) The third error
RuntimeError: Error when running CMake: Command '['/usr/bin/cmake', '-S', '/***/TransformerEngine/transformer_engine/common', '-B', '/***/TransformerEngine/build/cmake', '-DPython_EXECUTABLE=/home/ubuntu/anaconda3/envs/finetuning/bin/python3.11', '-DPython_INCLUDE_DIR=/home/ubuntu/anaconda3/envs/finetuning/include/python3.11', '-DPython_SITEARCH=/home/ubuntu/anaconda3/envs/finetuning/lib/python3.11/site-packages', '-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_INSTALL_PREFIX=/***/TransformerEngine/build/lib.linux-x86_64-cpython-311', '-DCMAKE_CUDA_ARCHITECTURES=70;80;89;90', '-Dpybind11_DIR=/home/ubuntu/anaconda3/envs/finetuning/lib/python3.11/site-packages/pybind11/share/cmake/pybind11', '-GNinja']' returned non-zero exit status 1.
[end of output]
Following the instructions in reference 2, first check the installed cuDNN package:
pip list | grep nvidia
nvidia-cudnn-cu12 9.10.2.21
Next, export the environment variables:
- export CUDNN_PATH=/path/to/cudnn
- export CPLUS_INCLUDE_PATH=/path/to/cudnn/include
Then check that they are set:
echo $CUDNN_PATH
/home/ubuntu/anaconda3/envs/finetuning/lib/python3.11/site-packages/nvidia/cudnn
echo $CPLUS_INCLUDE_PATH
/home/ubuntu/anaconda3/envs/finetuning/lib/python3.11/site-packages/nvidia/cudnn/include
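Before rebuilding, it may be worth confirming that the directory CUDNN_PATH points at really contains headers and libraries. The sketch below only inspects the filesystem; the include/ and lib/ layout and the cudnn*.h header pattern are assumptions about how the pip-installed cuDNN package is laid out, and the function name is made up for this post.

```python
import os
from pathlib import Path


def inspect_cudnn_root(cudnn_path):
    """Report whether a candidate cuDNN root has include/ and lib/ directories."""
    root = Path(cudnn_path)
    include_dir = root / "include"
    lib_dir = root / "lib"
    headers = []
    if include_dir.is_dir():
        # Assumed naming pattern for cuDNN headers.
        headers = sorted(p.name for p in include_dir.glob("cudnn*.h"))
    return {
        "has_include": include_dir.is_dir(),
        "has_lib": lib_dir.is_dir(),
        "headers": headers,
    }


if __name__ == "__main__":
    path = os.environ.get("CUDNN_PATH")
    if path:
        print(inspect_cudnn_root(path))
    else:
        print("CUDNN_PATH is not set")
```

An empty headers list or a missing lib directory would point to a wrong path rather than a problem in the build itself.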
Running the installation again gave the same error, so it is better to fall back to method 1 (pip).
References:
1 https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/installation.html
2 https://blog.youkuaiyun.com/kurumigao/article/details/152820808
3 https://blog.youkuaiyun.com/qq_41185868/article/details/130983787
4 https://pypi.org/project/transformer-engine/
5 https://blog.youkuaiyun.com/weixin_74277223/article/details/151863470