Basic inference with a local model (LLaMA) via llama_cpp

A from-scratch, hands-on introduction to local inference: calling a local model through llama_cpp.

This article first installs the llama-cpp-python library, then writes a small program that uses it to run the llama-2-7b-chat.Q4_K_M.gguf model.

Background

llama_cpp is the Python binding for llama.cpp, a high-performance C++ library. It runs LLaMA and its derivatives (such as LLaMA 2) efficiently on CPU or GPU and uses quantized models (in the GGUF format) to reduce memory usage, which makes it well suited to local deployment and inference.

llama.cpp was created by the Bulgarian developer Georgi Gerganov (known online as ggerganov). He holds a master's degree in physics and worked in medical physics research before moving into software development. Georgi implemented a lightweight LLM inference framework in pure C/C++, with CPU/GPU acceleration and broad platform support, and created the GGML/GGUF model formats, dramatically lowering the barrier to running models. His work made it possible to run large language models such as LLaMA locally on consumer hardware (MacBooks, Raspberry Pis, phones) and is regarded as a major milestone for open-source AI. He has since founded the company ggml.ai, which focuses on commercializing on-device inference.

Runtime environment

The hardware is an i7 CPU with Iris Xe integrated graphics, so we choose a quantized 7B model (llama-2-7b-chat.Q4_K_M.gguf) and build a simple local question-answering example, tuning the CPU thread count and GPU layer setting to get the best inference performance with integrated graphics (see the sketch at the end of this section).

Operating system: Windows 11

Base software: Visual Studio (MSBuild.exe), CMake, conda, etc.
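For reference, here is a minimal sketch of that tuning, assuming a CPU-only setup: it derives n_threads from os.cpu_count() and keeps n_gpu_layers at 0, since a plain CPU build of llama-cpp-python cannot offload layers to an Iris Xe iGPU. The halving heuristic is my own rule of thumb, not a measured optimum.

import os
from llama_cpp import Llama

# Rough heuristic: one thread per physical core, i.e. half the logical
# cores on a hyper-threaded i7; fall back to 4 threads if unknown.
n_threads = max(1, (os.cpu_count() or 8) // 2)

llm = Llama(
    model_path="llama-2-7b-chat.Q4_K_M.gguf",
    n_ctx=2048,
    n_threads=n_threads,  # CPU threads used for token generation
    n_gpu_layers=0,       # nothing offloaded: CPU-only on integrated graphics
)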

Installing llama-cpp-python

Several installation methods are given below; if one fails, try another. A quick import check you can run afterwards is shown after the list.

  1. Standard install
pip install llama-cpp-python 
  2. Offline install via pip
# Manually download the file:
https://files.pythonhosted.org/packages/a6/38/7a47b1fb1d83eaddd86ca8ddaf20f141cbc019faf7b425283d8e5ef710e5/llama_cpp_python-0.3.7.tar.gz

# Install
pip install llama_cpp_python-0.3.7.tar.gz
  3. Extra index URL
pip install llama-cpp-python  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
  4. Clear the pip cache
pip cache purge
pip install llama_cpp_python
  5. Fall back to an older version (personally verified on Windows 11 with integrated graphics, Python 3.10.0)
pip install llama-cpp-python==0.2.23 --force-reinstall
## Package versions in the virtual environment (output of pip list):
Package            Version
------------------ -----------
build              1.2.2.post1
cmake              3.31.6
colorama           0.4.6
diskcache          5.6.3
importlib_metadata 8.6.1
llama_cpp_python   0.2.23
numpy              2.2.3
packaging          24.2
pip                25.0
pyproject_hooks    1.2.0
setuptools         75.8.2
tomli              2.2.1
typing_extensions  4.12.2
wheel              0.45.1
zipp               3.21.0
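Whichever method succeeds, a quick sanity check is to import the package and print its version (the __version__ attribute is assumed here; it is present in recent llama-cpp-python releases):

import llama_cpp

# If this import fails, the native wheel did not build or install correctly.
print("llama-cpp-python version:", llama_cpp.__version__)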

Calling the model

This section shows how to run the llama-2-7b-chat.Q4_K_M.gguf model locally with llama.cpp, called through its Python bindings.

llama-2-7b-chat.Q4_K_M.gguf is the 7-billion-parameter chat model from Meta's Llama-2 family, distributed in the GGUF format (developed by the llama.cpp team). Its key characteristics:

Efficient quantization: the Q4_K_M method (4-bit) shrinks the model to roughly 4.3 GB while retaining good inference quality, with CPU/GPU acceleration supported.

Format advantages: GGUF provides better tokenizer handling, special-token support, and metadata management; it is extensible and compatible with mainstream tools such as llama.cpp and text-generation-webui.

Use cases: well suited to local deployment, low-resource devices, and chat scenarios that need fast responses; generated text quality is close to the unquantized model, making it easy to integrate into applications at low cost.
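To see the GGUF tokenizer handling mentioned above without loading all ~4.3 GB of weights, a small sketch like the following can help; it assumes the vocab_only flag and the tokenize/detokenize/n_vocab helpers exposed by llama-cpp-python:

from llama_cpp import Llama

# vocab_only loads just the tokenizer and metadata from the GGUF file,
# so this is fast and uses little memory.
llm = Llama(model_path="llama-2-7b-chat.Q4_K_M.gguf", vocab_only=True)

print("vocab size:", llm.n_vocab())

tokens = llm.tokenize("Hello, llama.cpp!".encode("utf-8"))
print("token ids:", tokens)
print("round trip:", llm.detokenize(tokens).decode("utf-8", errors="ignore"))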

Downloading the model

https://cdn-lfs-cn-1.modelscope.cn/prod/lfs-objects/08/a5/566d61d7cb6b420c3e4387a39e0078e1f2fe5f055f3a03887385304d4bfa?filename=llama-2-7b-chat.Q4_K_M.gguf&namespace=Xorbits&repository=Llama-2-7b-Chat-GGUF&revision=master&auth_key=1741233419-991552e421204cd299a693fde0e2f35f-0-232242f927d8051e27e207a6bb877f23
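The ModelScope link above embeds a time-limited auth_key and may expire. As an alternative (my assumption, not the source used in this article), the same quantized file is also published on Hugging Face under TheBloke/Llama-2-7B-Chat-GGUF and can be fetched programmatically:

from huggingface_hub import hf_hub_download

# Downloads to the local Hugging Face cache and returns the file path;
# repo_id and filename are assumptions, verify them before relying on this.
path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf",
)
print("model saved to:", path)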

Demo 1

A minimal question-and-answer session in the Python REPL:

(langchain) PS D:\code\trae> python
Python 3.10.0 | packaged by conda-forge | (default, Nov 10 2021, 13:20:59) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> from llama_cpp import Llama
>>> llm = Llama(model_path="llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)
>>> print(llm("who are you!", max_tokens=50)["choices"][0]["text"])  
Llama.generate: prefix-match hit

llama_print_timings:        load time =    1842.63 ms
llama_print_timings:      sample time =      14.19 ms /    50 runs   (    0.28 ms per token,  3523.86 tokens per second)
llama_print_timings: prompt eval time =     501.53 ms /     3 tokens (  167.18 ms per token,     5.98 tokens per second)
llama_print_timings:        eval time =    9411.83 ms /    49 runs   (  192.08 ms per token,     5.21 tokens per second)
llama_print_timings:       total time =   10064.66 ms
!!?。
姓名:叮当
Age:17
Gender:male
Occupation:Student
Hobbies:Playing video games,watching anime,reading manga,list
>>>
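The reply above drifts because the raw completion call does not use the chat template Llama-2-chat was trained on. Below is a sketch of the same call wrapped in the [INST]/<<SYS>> format; the system text is my own wording, not part of the original session:

from llama_cpp import Llama

llm = Llama(model_path="llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

# Llama-2-chat models expect the [INST] ... [/INST] wrapper; using it
# usually yields a much more coherent answer than a bare prompt.
prompt = (
    "[INST] <<SYS>>\nYou are a helpful assistant. Answer concisely.\n<</SYS>>\n\n"
    "Who are you? [/INST]"
)
out = llm(prompt, max_tokens=100, stop=["</s>"])
print(out["choices"][0]["text"].strip())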

Demo 2

Create a new Python script, llama_qa.py, that implements a local Q&A program on top of llama.cpp. It uses the Q4_K_M-quantized 7B model, which keeps reasonable accuracy while giving decent CPU inference speed.

from llama_cpp import Llama
import time

# Initialize the model (absolute path)
llm = Llama(
    model_path="d:\\code\\trae\\llama-2-7b-chat.Q4_K_M.gguf",
    n_ctx=2048,
    n_threads=8,  # adjust to the i7's core count
    n_gpu_layers=0  # disable GPU acceleration; run entirely on the CPU
)

# System prompt template
SYSTEM_PROMPT = """你是人工智能助手,根据用户的问题提供简洁准确的回答。"""

while True:
    try:
        user_input = input("\n用户:")
        if user_input.lower() in ('exit', 'quit'):
            break
        
        # Record the start time
        start_time = time.time()
        
        # Generate a response
        output = llm.create_chat_completion(
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_input}
            ],
            max_tokens=500,
            temperature=0.7
        )
        
        # Compute elapsed time
        latency = time.time() - start_time
        
        # Extract the answer text
        response = output['choices'][0]['message']['content']
        
        print(f"\n助手(耗时{latency:.2f}s):")
        print(response.replace("。", ".\n"))
        
    except KeyboardInterrupt:
        break
    except Exception as e:
        print(f"发生错误:{str(e)}")

print("\n问答程序已退出")

Run output

(langchain) PS D:\code\trae> python llama_qa.py
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from d:\code\trae\llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     ....
...
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.12 MiB
llm_load_tensors: mem required  = 3891.36 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 159.32 MiB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |

用户:解释一下llama_cpp

llama_print_timings:        load time =   14522.49 ms
llama_print_timings:      sample time =     211.79 ms /   447 runs   (    0.47 ms per token,  2110.57 tokens per second)
llama_print_timings: prompt eval time =   14522.17 ms /    67 tokens (  216.75 ms per token,     4.61 tokens per second)
llama_print_timings:        eval time =  190233.36 ms /   446 runs   (  426.53 ms per token,     2.34 tokens per second)
llama_print_timings:       total time =  207667.52 ms

助手(耗时207.67s):
  Certainly! Llama_cpp is a programming language that is designed to be simple, easy to use, and efficient. It is a lightweight alternative to traditional programming languages like C++ or Java, and it is well-suited for developing small to medium-sized applications.
Here are some key features of Llama_cpp:
1. **Simple syntax**: Llama_cpp has a simple and intuitive syntax that makes it easy to learn and use. It eliminates unnecessary complexity and makes it easier for developers to focus on the logic of their code rather than worrying about syntax.
2. **Efficient**: Llama_cpp is designed to be efficient, which means that it can run fast and smoothly even on low-powered devices. This makes it well-suited for developing applications that need to run quickly and smoothly, such as games or scientific simulations.
3. **Object-oriented**: Llama_cpp is an object-oriented language, which means that it supports the principles of object-oriented programming (OOP). This makes it easy to write reusable code and develop complex applications.
4. **Garbage collection**: Llama_cpp has a built-in garbage collector, which automatically manages memory allocation and deallocation. This eliminates the need for manual memory management, which can be error-prone and time-consuming.      
5. **Cross-platform**: Llama_cpp is designed to be cross-platform, which means that it can run on a variety of operating systems, including Windows, MacOS, and Linux. This makes it easy to develop applications that can run on multiple platforms without having to write separate code for each platform.
Overall, Llama_cpp is a versatile and efficient programming language that is well-suited for developing a wide range of applications. Its simple syntax, object-oriented design, garbage collection, and cross-platform capabilities make it an attractive choice for developers who want to write fast, efficient, and easy-to-maintain code.

用户:
Llama.generate: prefix-match hit

llama_print_timings:        load time =   14522.49 ms
llama_print_timings:      sample time =      28.00 ms /    79 runs   (    0.35 ms per token,  2820.92 tokens per second)
llama_print_timings: prompt eval time =    1981.60 ms /     4 tokens (  495.40 ms per token,     2.02 tokens per second)
llama_print_timings:        eval time =   31865.71 ms /    78 runs   (  408.53 ms per token,     2.45 tokens per second)
llama_print_timings:       total time =   34145.26 ms

助手(耗时34.15s):
  Hello! I'm here to help you with any questions or problems you may have. Please feel free to ask me anything, and I will do my best to provide a concise and accurate response. Whether you need information on a specific topic, assistance with a task, or just someone to talk to, I'm here to help. Just go ahead and ask me anything!