[LLM] CPU Inference for Large Models with llama.cpp

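This article introduces llama.cpp, a library that simplifies CPU and GPU inference for large language models. It supports a wide range of hardware and quantization techniques. The article shows how to download a model from Hugging Face and test it, with a focus on Qwen models.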

llama.cpp

  • Description

    The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware - locally and in the cloud.

    • Plain C/C++ implementation without any dependencies
    • Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
    • AVX, AVX2 and AVX512 support for x86 architectures
    • 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
    • Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP)
    • Vulkan, SYCL, and (partial) OpenCL backend support
    • CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
  • Official repository
    https://github.com/ggerganov/llama.cpp

  • Supported platforms:

     Mac OS
     Linux
     Windows (via CMake)
     Docker
     FreeBSD
    
  • Supported models:

    • Typically finetunes of the base models below are supported as well.

    LLaMA 🦙
    LLaMA 2 🦙🦙
    Mistral 7B
    Mixtral MoE
    Falcon
    Chinese LLaMA / Alpaca and Chinese LLaMA-2 / Alpaca-2
    Vigogne (French)
    Koala
    Baichuan 1 & 2 + derivations
    Aquila 1 & 2
    Starcoder models
    Refact
    Persimmon 8B
    MPT
    Bloom
    Yi models
    StableLM models
    Deepseek models
    Qwen models
    PLaMo-13B
    Phi models
    GPT-2
    Orion 14B
    InternLM2
    CodeShell
    Gemma
    Mamba
    Xverse
    Command-R

    • Multimodal models:

    LLaVA 1.5 models, LLaVA 1.6 models
    BakLLaVA
    Obsidian
    ShareGPT4V
    MobileVLM 1.7B/3B models
    Yi-VL

Installing llama.cpp

Memory/Disk Requirements

[Image: memory/disk requirements]

Quantization

[Image: quantization]
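
As a rough guide to how quantization affects size, weight storage can be approximated as parameter count times bits per weight, divided by 8. The sketch below applies that approximation to the two figures the load log later in this article reports for the Q4_K_M file (1.84 B parameters, 5.28 BPW). The helper name `estimated_model_size_gib` is made up for this illustration, and the estimate ignores metadata and runtime buffers.

```python
GIB = 1024 ** 3  # bytes per GiB


def estimated_model_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB: params x bits per weight / 8 bits per byte."""
    return n_params * bits_per_weight / 8 / GIB


# Figures taken from the load log in this article (qwen1_5-1_8b-chat-q4_k_m.gguf).
q4_size = estimated_model_size_gib(1.84e9, 5.28)
print(f"Q4_K_M (5.28 BPW): {q4_size:.2f} GiB")

# Same parameter count stored at 16 bits per weight, for comparison.
f16_size = estimated_model_size_gib(1.84e9, 16)
print(f"F16   (16 BPW):    {f16_size:.2f} GiB")
```

The first value agrees with the `model size = 1.13 GiB` line in the log, which suggests the approximation is adequate for picking a quantization level that fits in memory.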

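Testing inference

Downloading the model

To download models quickly, see: "Downloading LLM models from Hugging Face at high speed without a VPN".
Here I download qwen/Qwen1.5-1.8B-Chat-GGUF for testing.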

huggingface-cli download --resume-download  qwen/Qwen1.5-1.8B-Chat-GGUF  --local-dir  qwen/Qwen1.5-1.8B-Chat-GGUF
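
The command above fetches every file in the repository, which contains several quantization variants. If only one variant is needed, `huggingface-cli download` accepts an `--include` glob pattern. The sketch below only builds and prints such a command; the pattern `*q4_k_m.gguf` is an assumption based on the file name used later in this article, and `build_download_command` is a hypothetical helper name.

```python
import shlex


def build_download_command(repo_id: str, pattern: str, local_dir: str) -> list[str]:
    """Build a huggingface-cli invocation that downloads only files matching `pattern`."""
    return [
        "huggingface-cli", "download", repo_id,
        "--include", pattern,      # glob filter on file names inside the repo
        "--local-dir", local_dir,  # where the matching files are written
    ]


cmd = build_download_command(
    "qwen/Qwen1.5-1.8B-Chat-GGUF",
    "*q4_k_m.gguf",                 # assumed pattern for the Q4_K_M file
    "qwen/Qwen1.5-1.8B-Chat-GGUF",
)

# shlex.join quotes the glob so the shell does not expand it locally.
print(shlex.join(cmd))
```

To actually run it from Python, the list can be passed to `subprocess.run(cmd, check=True)`.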

Test

cd ./llama.cpp

./main -m /your/path/qwen/Qwen1.5-1.8B-Chat-GGUF/qwen1_5-1_8b-chat-q4_k_m.gguf -n 512 --color -i -cml -f ./prompts/chat-with-qwen.txt

To change the prompt, edit ./prompts/chat-with-qwen.txt.
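
The `-cml` flag runs `main` in ChatML mode, which is why the log below shows the reverse prompt `<|im_start|>user` and the EOS token `<|im_end|>`. To make the role of the prompt file concrete, here is a minimal sketch of the ChatML layout. It illustrates the format only and is not the code llama.cpp uses internally; `to_chatml` is a hypothetical helper.

```python
def to_chatml(messages: list[dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render a list of {"role", "content"} messages in ChatML layout."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


conversation = [
    # System text as it appears in the log in this article.
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's AI?"},
]
print(to_chatml(conversation))
```

In this layout the content of `chat-with-qwen.txt` corresponds to the system message, so editing that file changes the instructions the model sees before the first user turn.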

Output printed while loading the model:

llama.cpp# ./main -m /mnt/data/llm/Qwen1.5-1.8B-Chat-GGUF/qwen1_5-1_8b-chat-q4_k_m.gguf -n 512 --color -i -cml -f ./prompts/chat-with-qwen.txt
Log start
main: build = 2527 (ad3a0505)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1711760850
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /mnt/data/llm/Qwen1.5-1.8B-Chat-GGUF/qwen1_5-1_8b-chat-q4_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.name str              = Qwen1.5-1.8B-Chat-AWQ-fp16
llama_model_loader: - kv   2:                          qwen2.block_count u32              = 24
llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 2048
llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 5504
llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 16
llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  10:                qwen2.use_parallel_residual bool             = true
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  13:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  14:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  15:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  16:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  18:                    tokenizer.chat_template str              = {% for message in messages %}{{'<|im_...
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - kv  20:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q5_0:   12 tensors
llama_model_loader: - type q8_0:   12 tensors
llama_model_loader: - type q4_K:  133 tensors
llama_model_loader: - type q6_K:   13 tensors
llm_load_vocab: special tokens definition check successful ( 293/151936 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 5504
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 1.84 B
llm_load_print_meta: model size       = 1.13 GiB (5.28 BPW)
llm_load_print_meta: general.name     = Qwen1.5-1.8B-Chat-AWQ-fp16
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors:        CPU buffer size =  1155.67 MiB
...................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    96.00 MiB
llama_new_context_with_model: KV self size  =   96.00 MiB, K (f16):   48.00 MiB, V (f16):   48.00 MiB
llama_new_context_with_model:        CPU  output buffer size =   296.75 MiB
llama_new_context_with_model:        CPU compute buffer size =   300.75 MiB
llama_new_context_with_model: graph nodes  = 868
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 4 / 4 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
main: interactive mode on.
Reverse prompt: '<|im_start|>user
'
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = 512, n_keep = 10


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

system
You are a helpful assistant.
user

>

Input text: What’s AI?

Sample output:
[Image: sample output]
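
The KV cache figures in the load log can be reproduced with simple arithmetic, which helps when estimating memory for a larger context. The sketch below assumes each of K and V stores `n_layer * n_ctx * n_embd_*_gqa` elements at 2 bytes each (f16). This formula is inferred from the numbers in the log, not taken from the llama.cpp source, and `kv_cache_mib` is a hypothetical helper.

```python
MIB = 1024 ** 2  # bytes per MiB


def kv_cache_mib(n_layer: int, n_ctx: int, n_embd_k_gqa: int, n_embd_v_gqa: int,
                 bytes_per_elem: int = 2) -> tuple[float, float]:
    """Return (K size, V size) in MiB, assuming f16 storage by default."""
    k = n_layer * n_ctx * n_embd_k_gqa * bytes_per_elem / MIB
    v = n_layer * n_ctx * n_embd_v_gqa * bytes_per_elem / MIB
    return k, v


# Values from the log: n_layer = 24, n_ctx = 512, n_embd_k_gqa = n_embd_v_gqa = 2048.
k, v = kv_cache_mib(24, 512, 2048, 2048)
print(f"n_ctx=512:   K = {k:.2f} MiB, V = {v:.2f} MiB, total = {k + v:.2f} MiB")

# Same model at its full training context (n_ctx_train = 32768).
k_full, v_full = kv_cache_mib(24, 32768, 2048, 2048)
print(f"n_ctx=32768: total = {k_full + v_full:.2f} MiB")
```

The first line matches the log (`KV self size = 96.00 MiB, K (f16): 48.00 MiB, V (f16): 48.00 MiB`). Under the same assumption the cache grows linearly with context length, so the full 32768-token context would need about 6 GiB for the cache alone.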
