Hands-on local inference for beginners: calling a local model with llama_cpp.
This article first installs the llama-cpp-python library and then writes a program that uses it to run the llama-2-7b-chat.Q4_K_M.gguf model.
Background
llama_cpp is the Python binding for llama.cpp, a high-performance C/C++ library. It runs LLaMA and its derivatives (such as LLaMA 2) efficiently on CPU or GPU and uses quantization (e.g. the GGUF format) to cut memory usage, which makes it well suited to local deployment and inference.
llama.cpp was created by the Bulgarian developer Georgi Gerganov (known online as ggerganov). He holds a master's degree in physics and worked in medical-physics research before moving into software development. Georgi implemented a lightweight LLM inference framework in pure C/C++ with CPU/GPU acceleration and broad platform support, and he introduced the GGML/GGUF model formats, which dramatically lowered the barrier to running these models. His work enabled large language models such as LLaMA to run locally on consumer hardware (MacBooks, Raspberry Pis, phones) and is widely regarded as a major breakthrough for open-source AI. He has since founded ggml.ai, a company focused on commercializing on-device inference.
Runtime environment
Hardware: an i7 CPU with Iris Xe integrated graphics. We pick a quantized 7B model (llama-2-7b-chat.Q4_K_M.gguf) and build a local question-answering example, tuning the CPU thread count and the number of GPU offload layers to get the best inference performance on integrated graphics (see the sketch after this list).
Operating system: Windows 11
Base software: Visual Studio (MSBuild.exe), CMake, conda, etc.
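The two knobs mentioned above map directly onto constructor arguments of the Llama class. The values below are assumptions for illustration rather than measured optima: n_threads is usually set to roughly the number of physical cores, and n_gpu_layers only has an effect when llama-cpp-python was built with GPU support; with the plain CPU wheel it should stay at 0.
from llama_cpp import Llama

# Minimal sketch: thread and GPU-offload settings for an i7 + Iris Xe machine (assumed values)
llm = Llama(
    model_path="llama-2-7b-chat.Q4_K_M.gguf",  # path to the downloaded GGUF file
    n_ctx=2048,        # context window size
    n_threads=8,       # roughly the number of physical cores of the i7
    n_gpu_layers=0     # keep 0 on a CPU-only build; only GPU-enabled builds can offload layers
)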
Installing llama-cpp-python
Several installation methods are given below; if one fails, try another.
- Standard install
pip install llama-cpp-python
- Offline install with pip
# Download the file manually:
https://files.pythonhosted.org/packages/a6/38/7a47b1fb1d83eaddd86ca8ddaf20f141cbc019faf7b425283d8e5ef710e5/llama_cpp_python-0.3.7.tar.gz
# Install
pip install llama_cpp_python-0.3.7.tar.gz
- extra-index-url
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
- Clear the pip cache
pip cache purge
pip install llama_cpp_python
- Install an older version (verified on Windows 11 with integrated graphics and Python 3.10.0)
pip install llama-cpp-python==0.2.23 --force-reinstall
## Dependency versions in the virtual environment:
Package            Version
------------------ -----------
build 1.2.2.post1
cmake 3.31.6
colorama 0.4.6
diskcache 5.6.3
importlib_metadata 8.6.1
llama_cpp_python 0.2.23
numpy 2.2.3
packaging 24.2
pip 25.0
pyproject_hooks 1.2.0
setuptools 75.8.2
tomli 2.2.1
typing_extensions 4.12.2
wheel 0.45.1
zipp 3.21.0
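Whichever method succeeded, a quick sanity check is to import the package and print its version string (the value in the comment assumes the pinned 0.2.23 install above):
import llama_cpp

# Confirm the binding imports cleanly and report which version ended up installed
print(llama_cpp.__version__)   # e.g. 0.2.23 for the pinned install above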
Calling the model
This section shows how to run the llama-2-7b-chat.Q4_K_M.gguf model locally with llama.cpp through its Python binding.
Llama-2-7b-chat.Q4_K_M.gguf is the 7-billion-parameter chat model from Meta's Llama-2 family, packaged in the GGUF format (developed by the llama.cpp team). Its key characteristics:
Efficient quantization: the Q4_K_M scheme (4-bit) shrinks the model to roughly 4.3 GB while preserving most of its inference quality, and it runs on CPU with optional GPU acceleration (see the rough estimate after this list).
Format advantages: GGUF handles tokenization, special tokens, and metadata better, is highly extensible, and is compatible with mainstream tools such as llama.cpp and text-generation-webui.
Use cases: well suited to local deployment, low-resource devices, and chat scenarios that need fast responses; the generated text is close to the original model's quality, so developers can integrate it into applications at low cost.
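The ~4.3 GB figure is consistent with a back-of-the-envelope estimate. Treating Q4_K_M as roughly 5 bits per weight on average (an assumed figure that folds in per-block scales and the tensors kept at higher precision) gives:
# Rough size estimate for a 7B model at an assumed average of ~5 bits per weight
params = 7e9
bits_per_weight = 5                       # 4-bit weights plus quantization overhead (assumption)
approx_gb = params * bits_per_weight / 8 / 1e9
print(f"{approx_gb:.1f} GB")              # ~4.4 GB, in line with the ~4.3 GB file size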
Download the model
https://cdn-lfs-cn-1.modelscope.cn/prod/lfs-objects/08/a5/566d61d7cb6b420c3e4387a39e0078e1f2fe5f055f3a03887385304d4bfa?filename=llama-2-7b-chat.Q4_K_M.gguf&namespace=Xorbits&repository=Llama-2-7b-Chat-GGUF&revision=master&auth_key=1741233419-991552e421204cd299a693fde0e2f35f-0-232242f927d8051e27e207a6bb877f23
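One way to fetch the file is a streaming download followed by a size check. This is only a sketch: it assumes the third-party requests package is installed, and the link above carries a time-limited auth key, so it may need to be refreshed or replaced with another mirror of llama-2-7b-chat.Q4_K_M.gguf.
import os
import requests

MODEL_URL = "<the download link above>"   # placeholder for the full URL
MODEL_FILE = "llama-2-7b-chat.Q4_K_M.gguf"

# Stream the ~4 GB file to disk in 1 MiB chunks instead of loading it into memory
with requests.get(MODEL_URL, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open(MODEL_FILE, "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            f.write(chunk)

# A truncated download is the most common failure mode, so check the size
print(f"{os.path.getsize(MODEL_FILE) / 1e9:.2f} GB")   # expect roughly 4.1-4.4 GB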
Demo 1
A minimal interactive question-and-answer session:
(langchain) PS D:\code\trae> python
Python 3.10.0 | packaged by conda-forge | (default, Nov 10 2021, 13:20:59) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> from llama_cpp import Llama
>>> llm = Llama(model_path="llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)
>>> print(llm("who are you!", max_tokens=50)["choices"][0]["text"])
Llama.generate: prefix-match hit
llama_print_timings: load time = 1842.63 ms
llama_print_timings: sample time = 14.19 ms / 50 runs ( 0.28 ms per token, 3523.86 tokens per second)
llama_print_timings: prompt eval time = 501.53 ms / 3 tokens ( 167.18 ms per token, 5.98 tokens per second)
llama_print_timings: eval time = 9411.83 ms / 49 runs ( 192.08 ms per token, 5.21 tokens per second)
llama_print_timings: total time = 10064.66 ms
!!?。
姓名:叮当
Age:17
Gender:male
Occupation:Student
Hobbies:Playing video games,watching anime,reading manga,list
>>>
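The llm(...) call returns an OpenAI-style completion dictionary, which is why the demo indexes into ["choices"][0]["text"]. The same dictionary also carries token-usage counters, as in this short sketch (it reuses the llm object created above; the prompt is arbitrary):
# Inspect the full completion dictionary rather than just the generated text
out = llm("Briefly introduce yourself.", max_tokens=50)
print(out["choices"][0]["text"])            # the generated text
print(out["choices"][0]["finish_reason"])   # "stop" or "length"
print(out["usage"])                         # prompt_tokens / completion_tokens / total_tokens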
Demo 2
Create a new Python script, llama_qa.py, that implements a local question-answering program on top of llama.cpp. It uses the Q4_K_M-quantized 7B model, which keeps reasonable accuracy while still running at an acceptable speed on the CPU.
from llama_cpp import Llama
import time

# Initialize the model (use an absolute path)
llm = Llama(
    model_path="d:\\code\\trae\\llama-2-7b-chat.Q4_K_M.gguf",
    n_ctx=2048,
    n_threads=8,     # adjust to the number of cores of the i7 CPU
    n_gpu_layers=0   # disable GPU acceleration and run entirely on the CPU
)

# System prompt template
SYSTEM_PROMPT = """You are an AI assistant. Give concise and accurate answers to the user's questions."""

while True:
    try:
        user_input = input("\nUser: ")
        if user_input.lower() in ('exit', 'quit'):
            break

        # Record the start time
        start_time = time.time()

        # Generate a response
        output = llm.create_chat_completion(
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_input}
            ],
            max_tokens=500,
            temperature=0.7
        )

        # Measure the elapsed time
        latency = time.time() - start_time

        # Extract the answer text
        response = output['choices'][0]['message']['content']
        print(f"\nAssistant ({latency:.2f}s):")
        print(response.replace("。", ".\n"))  # put each Chinese full-stop-terminated sentence on its own line

    except KeyboardInterrupt:
        break
    except Exception as e:
        print(f"Error: {str(e)}")

print("\nQ&A program exited")
Run results
(langchain) PS D:\code\trae> python llama_qa.py
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from d:\code\trae\llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: - tensor 0: token_embd.weight q4_K [ 4096, 32000, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_norm.weight f32 [ 4096, 1, 1, ....
...
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.12 MiB
llm_load_tensors: mem required = 3891.36 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 159.32 MiB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
User: Explain llama_cpp
llama_print_timings: load time = 14522.49 ms
llama_print_timings: sample time = 211.79 ms / 447 runs ( 0.47 ms per token, 2110.57 tokens per second)
llama_print_timings: prompt eval time = 14522.17 ms / 67 tokens ( 216.75 ms per token, 4.61 tokens per second)
llama_print_timings: eval time = 190233.36 ms / 446 runs ( 426.53 ms per token, 2.34 tokens per second)
llama_print_timings: total time = 207667.52 ms
Assistant (207.67s):
Certainly! Llama_cpp is a programming language that is designed to be simple, easy to use, and efficient. It is a lightweight alternative to traditional programming languages like C++ or Java, and it is well-suited for developing small to medium-sized applications.
Here are some key features of Llama_cpp:
1. **Simple syntax**: Llama_cpp has a simple and intuitive syntax that makes it easy to learn and use. It eliminates unnecessary complexity and makes it easier for developers to focus on the logic of their code rather than worrying about syntax.
2. **Efficient**: Llama_cpp is designed to be efficient, which means that it can run fast and smoothly even on low-powered devices. This makes it well-suited for developing applications that need to run quickly and smoothly, such as games or scientific simulations.
3. **Object-oriented**: Llama_cpp is an object-oriented language, which means that it supports the principles of object-oriented programming (OOP). This makes it easy to write reusable code and develop complex applications.
4. **Garbage collection**: Llama_cpp has a built-in garbage collector, which automatically manages memory allocation and deallocation. This eliminates the need for manual memory management, which can be error-prone and time-consuming.
5. **Cross-platform**: Llama_cpp is designed to be cross-platform, which means that it can run on a variety of operating systems, including Windows, MacOS, and Linux. This makes it easy to develop applications that can run on multiple platforms without having to write separate code for each platform.
Overall, Llama_cpp is a versatile and efficient programming language that is well-suited for developing a wide range of applications. Its simple syntax, object-oriented design, garbage collection, and cross-platform capabilities make it an attractive choice for developers who want to write fast, efficient, and easy-to-maintain code.
User:
Llama.generate: prefix-match hit
llama_print_timings: load time = 14522.49 ms
llama_print_timings: sample time = 28.00 ms / 79 runs ( 0.35 ms per token, 2820.92 tokens per second)
llama_print_timings: prompt eval time = 1981.60 ms / 4 tokens ( 495.40 ms per token, 2.02 tokens per second)
llama_print_timings: eval time = 31865.71 ms / 78 runs ( 408.53 ms per token, 2.45 tokens per second)
llama_print_timings: total time = 34145.26 ms
Assistant (34.15s):
Hello! I'm here to help you with any questions or problems you may have. Please feel free to ask me anything, and I will do my best to provide a concise and accurate response. Whether you need information on a specific topic, assistance with a task, or just someone to talk to, I'm here to help. Just go ahead and ask me anything!
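At roughly two tokens per second on this CPU, the first answer above took more than three minutes to finish. A common way to make that wait feel shorter (a sketch reusing the llm object and SYSTEM_PROMPT from llama_qa.py) is to pass stream=True so tokens are printed as soon as they are generated instead of after the whole completion:
# Stream the answer token by token instead of waiting for the full completion
stream = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Explain llama_cpp"},
    ],
    max_tokens=500,
    temperature=0.7,
    stream=True,                      # yields incremental chunks instead of one dict
)
for chunk in stream:
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)
print()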