A practical record of deploying GGUF-format large models for inference with llama.cpp

xuhf3150

Published 2025-04-09 22:54:52

Article link: https://blog.youkuaiyun.com/xuhf3150/article/details/147103873

1. Download llama.cpp

git clone https://github.com/ggml-org/llama.cpp

Change into the llama.cpp directory:

cd llama.cpp

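2. Build llama.cpp to produce the executables and libraries

Configure the GPU (CUDA) build. (llama.cpp's build documentation describes GGML_CUDA_ENABLE_UNIFIED_MEMORY as a runtime environment variable, so passing it as a -D option here may have no effect at configure time.)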

cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_ENABLE_UNIFIED_MEMORY=1

If the command above fails with an error like the following:

-- Including CUDA backend

-- Could NOT find CURL (missing: CURL_LIBRARY CURL_INCLUDE_DIR)

CMake Error at common/CMakeLists.txt:90 (message):

Could NOT find CURL. Hint: to disable this feature, set -DLLAMA_CURL=OFF

install the libcurl development package first:

sudo apt-get update
sudo apt-get install libcurl4-openssl-dev

Then rerun the GPU configure command.

Configure the CPU-only build:

cmake -B build

Then, for either configuration, run the build:

cmake --build build --config Release -j 8
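The `-j 8` above hard-codes eight parallel jobs. As a minimal sketch, the job count could instead be derived from the number of CPUs on the build machine. The helper name `build_command` is hypothetical and not part of llama.cpp; it only assembles the same cmake invocation shown above.

```python
import os


def build_command(build_dir="build", jobs=None):
    """Return the cmake build command as an argument list."""
    # Fall back to a single job if the CPU count cannot be determined.
    if jobs is None:
        jobs = os.cpu_count() or 1
    return ["cmake", "--build", build_dir, "--config", "Release", "-j", str(jobs)]


if __name__ == "__main__":
    # Only prints the command; pass the list to subprocess.run(...) to execute it.
    print(" ".join(build_command()))
```

On a machine with little RAM, a lower job count than the CPU count may still be the safer choice.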

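After the build finishes, the llama.cpp/build/bin directory contains many executables and libraries, for example llama-cli, llama-server, and llama-quantize. (The original post shows a screenshot of the directory listing here.)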
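A quick way to confirm the build produced the tools used later in this article is to check for them by name. The sketch below assumes it is run from the llama.cpp directory on Linux (no `.exe` suffix); `missing_binaries` is a hypothetical helper, and the list contains only the three names mentioned above.

```python
from pathlib import Path

# Binaries named in the text; the build produces others as well.
EXPECTED = ("llama-cli", "llama-server", "llama-quantize")


def missing_binaries(bin_dir, names=EXPECTED):
    """Return the names from `names` that are not regular files in `bin_dir`."""
    bin_path = Path(bin_dir)
    return [name for name in names if not (bin_path / name).is_file()]


if __name__ == "__main__":
    missing = missing_binaries("build/bin")
    print("missing:", missing if missing else "none")
```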

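3. Install the Python dependencies

Still in the llama.cpp directory, run the following command to install the Python dependencies, which are needed to convert models in other formats to GGUF: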

pip install -r requirements.txt

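4. Deploy a GGUF model for inference

(1) GGUF models can be downloaded from model hubs such as ModelScope, or you can convert a model to GGUF yourself with llama.cpp. The commands for converting a Hugging Face model to GGUF are: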

# Without quantization, preserving the model's quality
python convert_hf_to_gguf.py /root/autodl-tmp/model/Qwen/Qwen1.5-7B-Chat_lora_sft --outtype f16 --verbose --outfile /root/autodl-tmp/model/Qwen/Qwen1.5-7B-Chat-gguf.gguf

# With quantization: faster inference, at some cost in quality
python convert_hf_to_gguf.py /root/autodl-tmp/model/Qwen/Qwen1.5-7B-Chat_lora_sft --outtype q8_0 --verbose --outfile /root/autodl-tmp/model/Qwen/Qwen1.5-7B-Chat-q8.gguf
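The two conversions above differ only in `--outtype` and the output file name, so they can be generated from one helper. This is a minimal sketch: `convert_command` is a hypothetical name, it uses only the flags shown above, and the `<base>-<outtype>.gguf` naming scheme is an assumption that differs slightly from the file names used in the article.

```python
import sys


def convert_command(model_dir, outfile, outtype="f16"):
    """Build the convert_hf_to_gguf.py invocation as an argument list."""
    return [
        sys.executable, "convert_hf_to_gguf.py", model_dir,
        "--outtype", outtype,
        "--verbose",
        "--outfile", outfile,
    ]


if __name__ == "__main__":
    model_dir = "/root/autodl-tmp/model/Qwen/Qwen1.5-7B-Chat_lora_sft"
    out_base = "/root/autodl-tmp/model/Qwen/Qwen1.5-7B-Chat"
    for outtype in ("f16", "q8_0"):
        cmd = convert_command(model_dir, f"{out_base}-{outtype}.gguf", outtype)
        # Only prints the command; run it from the llama.cpp directory,
        # for example with subprocess.run(cmd, check=True).
        print(" ".join(cmd))
```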

(2) Chat with the model on the command line

./build/bin/llama-cli -m /root/autodl-tmp/model/Qwen/Qwen1.5-7B-Chat-q8.gguf -p "you are a helpful assistant" -cnv -ngl 24

(3) Deploy as an HTTP server for inference

./build/bin/llama-server -m /root/autodl-tmp/model/Qwen/Qwen1.5-7B-Chat-q8.gguf --port 8088
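Loading a large model can take a while, so a client started right after the server may see refused connections. The sketch below polls llama-server's `/health` endpoint until it answers HTTP 200. The helper name `wait_until_ready`, the retry count, and the delay are assumptions chosen for illustration; the `opener` parameter exists only so the function can be exercised without a running server.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url, attempts=30, delay=1.0, opener=urllib.request.urlopen):
    """Poll `<base_url>/health` until it answers HTTP 200; return True on success."""
    url = base_url.rstrip("/") + "/health"
    for _ in range(attempts):
        try:
            with opener(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not listening yet, or it answered with an error status
            # (urlopen raises HTTPError for non-2xx responses).
            pass
        time.sleep(delay)
    return False
```

For the server started above, the call would be `wait_until_ready("http://127.0.0.1:8088")`.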

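Once the server is running, open http://127.0.0.1:8088 in a browser to chat with the model in the web UI, or call it for inference through the OpenAI-style API.

Web UI example:

Example of calling the OpenAI-style API:

First, write a Python file that calls the API, callRemoteLLMServer.py: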

# Multi-turn chat
from openai import OpenAI


# Define the multi-turn chat loop
def run_chat_session():
    # Initialize the client. The server above was started without an API key,
    # so any placeholder string works here.
    client = OpenAI(base_url="http://localhost:8088/v1/", api_key="suibianxie")
    # Use the first (and only) model the server reports
    model_name = client.models.list().data[0].id
    print(model_name)
    # Initialize the chat history
    chat_history = []
    # Start the chat loop
    while True:
        # Read the user's input; typing "exit" ends the session
        user_input = input("User: ")
        if user_input.lower() == "exit":
            print("Exiting chat.")
            break
        # Update the chat history (add the user's input)
        chat_history.append({"role": "user", "content": user_input})
        # Ask the model for a reply
        try:
            chat_completion = client.chat.completions.create(
                messages=chat_history, model=model_name, temperature=0.8, top_p=0.8
            )
            # Take the latest reply
            model_response = chat_completion.choices[0]
            print("AI:", model_response.message.content)
            # Update the chat history (add the model's reply)
            chat_history.append({"role": "assistant", "content": model_response.message.content})
        except Exception as e:
            print("Error:", e)
            break


if __name__ == '__main__':
    run_chat_session()
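As a side note, the same endpoint can be reached without the openai package, since it is plain JSON over HTTP. The sketch below uses only the standard library. The function names `build_chat_request` and `chat_once` are hypothetical, and leaving out the `model` field assumes the server falls back to the single model it has loaded; pass `model=` explicitly if it does not.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, model=None, temperature=0.8, top_p=0.8):
    """Build a POST request for the OpenAI-style /v1/chat/completions endpoint."""
    payload = {"messages": messages, "temperature": temperature, "top_p": top_p}
    if model is not None:
        payload["model"] = model
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat_once(base_url, messages, model=None):
    """Send one request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(base_url, messages, model)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat_once("http://localhost:8088", [{"role": "user", "content": "hello"}])` would return the reply as a string. This is only an alternative; callRemoteLLMServer.py above remains the script used in the next step.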

Run callRemoteLLMServer.py in a terminal window: