A40, centos7配置unsloth，本地大模型微调记录，主要是 xformers 和 bitsandbytes 这两个包的问题

本文链接：https://blog.youkuaiyun.com/weixin_43971235/article/details/146401394

想研究一下大模型微调，实验室有两块A40，跟着这个攻略做的：
https://www.datacamp.com/tutorial/fine-tuning-deepseek-r1-reasoning-model
，要用到unsloth和一些相关的包，配环境麻烦的要死，记录一下，免得以后忘了。

首先是升级了一下cuda驱动版本，本来实验室用的是CUDA 11.7，也不知道为啥，用这么老的版本，实在是各种包的冲突解决不，只好升级了，群里问了下有人在用没，没人就升级了。

最后升级到了12.4，cuda升级是跟着这个攻略做的:
【Linux更新驱动、cuda和cuda toolkit】_linux更新显卡驱动-优快云博客

要去官网看一下哪个版本的CUDA和CUDA toolkit，这里也有坑，服务器内核版本太低了（centos7 默认的内核版本），nvidia推荐的版本居然不匹配，最后使用的是cuda_12.4.1_550.54.15_linux.run
这个版本的CUDA toolkit，官网的连接是：
CUDA Toolkit 12.4 Update 1 Downloads | NVIDIA Developer

Drivers Details

这个版本的CUDA

下载下来，上传到服务器，运行，安装，全部选是就行了。

然后是安装unsloth，服务器上不能连接github，下载到本地然后上传到服务器上再在文件夹里pip install .的，链接是：
https://github.com/unslothai/unsloth

缺什么安装什么就行了。需要注意的是 xformers 和 bitsandbytes 这两个包。xformers要和pytorch版本匹配。最后用的版本是：xformers==0.0.28 pytorch==2.6.0，pytorch的链接是：
PyTorch

Xformers这个包即使不安其实也不会报错，但是运行模型的结果会非常奇怪，首先loss会下降的非常快，基本上一轮之后就接近0了，然后输出的结果也全是<think>，很明显根本就没训练。所以还是要安装一下。

bitsandbytes，安装起来超级麻烦的包，不能直接pip，也是从github上clone下来，然后还要自己编译。经常安装的时候不报错，编译也不报错，但是就是不能用，往往要执行一步: python -m bitsandbytes，才会发现又错。因为已经过了一段时间了，我已经找不到当时报错的内容了，只记得第一个消息是显示没有/bitsandbytes-0.45.2/bitsandbytes/libbitsandbytes_cuda124.so这样，查了一下还真没有，在这里卡了好久。

先给出官网的链接：GitHub - bitsandbytes-foundation/bitsandbytes: Accessible large language models via k-bit quantization for PyTorch.wwo

上传到服务器上，cd进目录，跟着这个攻略： Bitsandbytes最新版本编译安装_bitsandbytes安装-优快云博客

这里对CMake版本还有要求，CMake >= 3.22.1 和Python >= 3.8。

然后，编译的时候，遇到非常奇怪的问题，明明前面我们已经把CUDA版本升级了，但是这里，执行

cmake -DCOMPUTE_BACKEND=cuda -S .

之后再查看nvcc，发现版本又变成原来的版本了。这样就会反复报错，这也是之前报错没有 /bitsandbytes-0.45.2/bitsandbytes/libbitsandbytes_cuda124.so的原因，因为他使用的是老版本的cuda，所以产生的是libbitsandbytes_cuda117.so，这样的文件。很奇怪，即使我们修改了环境变量，依然这样，感觉有可能是软连接的问题？懒得改了，这里直接给他指定了cuda的版本124

cmake -DCOMPUTE_BACKEND=cuda -DCUDAToolkit_ROOT=/usr/local/cuda-12.4 -S .

反正问题出在 CMake的时候使用的cuda版本不对。

这次编译完之后没有问题了，运行python -m bitsandbytes显示：

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++ BUG REPORT INFORMATION ++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++ OTHER +++++++++++++++++++++++++++
CUDA specs: CUDASpecs(highest_compute_capability=(8, 6), cuda_version_string='124', cuda_version_tuple=(12, 4))
PyTorch settings found: CUDA_VERSION=124, Highest Compute Capability: (8, 6).
To manually override the PyTorch CUDA version please see: https://github.com/TimDettmers/bitsandbytes/blob/main/docs/source/nonpytorchcuda.mdx
The directory listed in your path is found to be non-existent: /home/lighting/perl5/lib/perl5
The directory listed in your path is found to be non-existent: --install_base /home/lighting/perl5
The directory listed in your path is found to be non-existent: INSTALL_BASE=/home/lighting/perl5
The directory listed in your path is found to be non-existent: /home/lighting/.local/share/autojump/errors.log
Found duplicate CUDA runtime files (see below).

We select the PyTorch default CUDA runtime, which is 12.4,
but this might mismatch with the CUDA version that is needed for bitsandbytes.
To override this behavior set the `BNB_CUDA_VERSION=<version string, e.g. 122>` environmental variable.

For example, if you want to use the CUDA version 122,
BNB_CUDA_VERSION=122 python ...

OR set the environmental variable in your .bashrc:
export BNB_CUDA_VERSION=122

In the case of a manual override, make sure you set LD_LIBRARY_PATH, e.g.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.2,
* Found CUDA runtime at: /usr/local/cuda-12.4/lib64/libcudart.so
* Found CUDA runtime at: /usr/local/cuda-12.4/lib64/libcudart.so.12.4.127
* Found CUDA runtime at: /usr/local/cuda-12.4/lib64/libcudart.so.12
* Found CUDA runtime at: /usr/local/cuda-12.4/lib64/libcudart.so
* Found CUDA runtime at: /usr/local/cuda-12.4/lib64/libcudart.so.12.4.127
* Found CUDA runtime at: /usr/local/cuda-12.4/lib64/libcudart.so.12
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++ DEBUG INFO END ++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Checking that the library is importable and CUDA is callable...
SUCCESS!
Installation was successful!

SUCCESS!

没问题了！

然后是最开始那个攻略里的代码的问题，原本他们是运行在colab里的，我们要放在本地，而且还不能连接huggingface，所以代码还有些不同。大概变化是：

1. 从huggingface上下载模型到本地再上传到服务器上，调用大模型的代码：


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "./LLM_fine_tuning/DeepSeek-R1-Distill-Llama-8B", # 改成自己的路径
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = hf_token, 
)

记得改成自己的路径

2. 数据集的路径，也是一样，改成自己的路径

from datasets import load_dataset
dataset = load_dataset("./datasets/medical-o1-reasoning-SFT-en", split = "train[0:500]")
dataset = dataset.map(formatting_prompts_func, batched = True,)

差不多就这样吧，剩下的想起来再说