Deploying Large Models with vLLM on CPU


Background

As a developer on a budget (no high-end GPU at hand, but still wanting to debug 7B+ models), what do you do when vLLM runs out of GPU memory on a large model? Don't panic: here is a shoestring workaround, deploying large models with vLLM's CPU backend (suitable for testing scenarios only).

Environment

  • OS: Ubuntu 20.04

  • Python: 3.9-3.12

  • gcc/g++: gcc/g++ >= 12.3.0 is recommended as the default compiler; I went straight to gcc/g++ 13. Install and configure it as follows:

     sudo add-apt-repository ppa:ubuntu-toolchain-r/test
     sudo apt update
     sudo apt install gcc-13 g++-13
     # Check the registered alternatives and their priorities
     sudo update-alternatives --config gcc
     # Register gcc/g++ 13 with a higher priority so it becomes the default
     sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-13 100
     sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-13 100
    
    

    The result looks like this:

    (vllm-cpu) dongyongfei786@DYF-PC:/mnt/g/dongyongfei786/vllm_source$  sudo update-alternatives --config gcc
    [sudo] password for dongyongfei786:
    There are 4 choices for the alternative gcc (providing /usr/bin/gcc).
    
      Selection    Path             Priority   Status
    ------------------------------------------------------------
      0            /usr/bin/gcc-13   100       auto mode
      1            /usr/bin/gcc-11   90        manual mode
    * 2            /usr/bin/gcc-13   100       manual mode
      3            /usr/bin/gcc-7    70        manual mode
      4            /usr/bin/gcc-9    80        manual mode
    
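    After switching, it is worth confirming that gcc 13 is now the default compiler (a quick sanity check; the exact version output will vary with your package build):

     gcc --version
     g++ --version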

Building from Source

  • Create a new Python virtual environment
# (Recommended) create a new conda environment
conda create -n vllm-cpu python=3.12 -y
conda activate vllm-cpu
  • Build the wheel from source

    Upgrading gcc/g++ was covered above. libnuma-dev must be installed, otherwise the build will fail.

     sudo apt-get update -y
     # gcc/g++ 13 was already configured above, so only libnuma-dev is still needed here
     sudo apt-get install -y libnuma-dev
    
  • Clone the vLLM repository

    git clone https://github.com/vllm-project/vllm.git vllm_source
    cd vllm_source
    
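    The install output further down shows the version string vllm==0.9.1.dev80+g2dbe8c077, so if you want to reproduce that exact build you could check out the same commit (the short hash below is taken from that version string; any recent commit should also work):

     git checkout 2dbe8c077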
  • Install the Python packages required to build the vLLM CPU backend

    pip install --upgrade pip
    pip install "cmake>=3.26" wheel packaging ninja "setuptools-scm>=8" numpy
    pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
    
  • Build and install the vLLM CPU backend

    VLLM_TARGET_DEVICE=cpu python setup.py install
    

    If you see output like the following, the installation succeeded:

    # VLLM_TARGET_DEVICE=cpu python setup.py install
    
    Using /home/dongyongfei786/miniconda3/envs/vllm-cpu/lib/python3.12/site-packages
    Searching for markdown-it-py==3.0.0
    Best match: markdown-it-py 3.0.0
    Adding markdown-it-py 3.0.0 to easy-install.pth file
    Installing markdown-it script to /home/dongyongfei786/miniconda3/envs/vllm-cpu/bin
    
    Using /home/dongyongfei786/miniconda3/envs/vllm-cpu/lib/python3.12/site-packages
    Searching for mdurl==0.1.2
    Best match: mdurl 0.1.2
    Adding mdurl 0.1.2 to easy-install.pth file
    
    Using /home/dongyongfei786/miniconda3/envs/vllm-cpu/lib/python3.12/site-packages
    Finished processing dependencies for vllm==0.9.1.dev80+g2dbe8c077.cpu
    
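    You can also verify the installed package directly from Python (the version string will differ depending on the commit you built):

     python -c "import vllm; print(vllm.__version__)"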
  • Quick sanity check: vllm serve --help

    (vllm-cpu) dongyongfei786@DYF-PC:/mnt/g/dongyongfei786/VLM-GUI/AgentCPM-GUI/deploy$ vllm serve --help
    ...
    options:
      --allow-credentials   Allow credentials. (default: False)
      --allowed-headers ALLOWED_HEADERS
                            Allowed headers. (default: ['*'])
      --allowed-methods ALLOWED_METHODS
                            Allowed methods. (default: ['*'])
      --allowed-origins ALLOWED_ORIGINS
                            Allowed origins. (default: ['*'])
      --api-key API_KEY     If provided, the server will require this key to be presented in the header. (default: None)
      --api-server-count API_SERVER_COUNT, -asc API_SERVER_COUNT
                            How many API server processes to run. (default: 1)
    

Deploying the AgentCPM-GUI (8B) Model with vLLM-CPU

  • Start the model

    vllm serve /mnt/n/model/VLM-GUI/AgentCPM-GUI --served-model-name AgentCPM-GUI --tensor_parallel_size 1 --trust-remote-code --limit_mm_per_prompt image=2
    
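    The startup log below shows the server being launched through a small wrapper script, vllm_cpu.sh. Here is a minimal sketch of such a script, assuming you also want to raise the CPU KV cache size (the 8 GiB value is an assumption to tune for your machine; as the log warns, VLLM_CPU_KVCACHE_SPACE defaults to 4 GiB when unset):

     #!/bin/bash
     # CPU KV cache size in GiB; vLLM falls back to 4 GiB if this is unset
     export VLLM_CPU_KVCACHE_SPACE=8
     vllm serve /mnt/n/model/VLM-GUI/AgentCPM-GUI \
         --served-model-name AgentCPM-GUI \
         --tensor_parallel_size 1 \
         --trust-remote-code \
         --limit_mm_per_prompt image=2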
  • Successful startup

    (vllm-cpu) dongyongfei786@DYF-PC:/mnt/g/dongyongfei786/VLM-GUI/AgentCPM-GUI/deploy$ bash vllm_cpu.sh
    [W531 08:20:53.911671810 OperatorEntry.cpp:154] Warning: Warning only once for all operators,  other operators may also be overridden.
      Overriding a previously registered kernel for the same operator and the same dispatch key
      operator: aten::_addmm_activation(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1, bool use_gelu=False) -> Tensor
        registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
      dispatch key: AutocastCPU
      previous kernel: registered at /pytorch/aten/src/ATen/autocast_mode.cpp:327
           new kernel: registered at /opt/workspace/ipex-cpu-dev/csrc/cpu/autocast/autocast_mode.cpp:112 (function operator())
    INFO 05-31 08:20:58 [__init__.py:243] Automatically detected platform cpu.
    INFO 05-31 08:21:03 [__init__.py:31] Available plugins for group vllm.general_plugins:
    INFO 05-31 08:21:03 [__init__.py:33] - lora_filesystem_resolver -> 
    INFO 05-31 08:21:05 [config.py:1934] Disabled the custom all-reduce kernel because it is not supported on current platform.
    INFO 05-31 08:21:05 [cli_args.py:300] non-default args: {'trust_remote_code': True, 'served_model_name': ['AgentCPM-GUI'], 'limit_mm_per_prompt': {'image': 2}}
    INFO 05-31 08:21:13 [config.py:813] This model supports multiple tasks: {'reward', 'generate', 'score', 'classify', 'embed'}. Defaulting to 'generate'.
    WARNING 05-31 08:21:13 [_logger.py:72] device type=cpu is not supported by the V1 Engine. Falling back to V0.
    INFO 05-31 08:21:13 [config.py:1934] Disabled the custom all-reduce kernel because it is not supported on current platform.
    WARNING 05-31 08:21:13 [_logger.py:72] Environment variable VLLM_CPU_KVCACHE_SPACE (GiB) for CPU backend is not set, using 4 by default.
    WARNING 05-31 08:21:13 [_logger.py:72] uni is not supported on CPU, fallback to mp distributed executor backend.
    INFO 05-31 08:21:13 [api_server.py:266] Started engine process with PID 26110
    [W531 08:21:17.882850704 OperatorEntry.cpp:154] Warning: Warning only once for all operators,  other operators may also be overridden.
      Overriding a previously registered kernel for the same operator and the same dispatch key
      operator: aten::_addmm_activation(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1, bool use_gelu=False) -> Tensor
        registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
      dispatch key: AutocastCPU
      previous kernel: registered at /pytorch/aten/src/ATen/autocast_mode.cpp:327
           new kernel: registered at /opt/workspace/ipex-cpu-dev/csrc/cpu/autocast/autocast_mode.cpp:112 (function operator())
    INFO 05-31 08:21:18 [__init__.py:243] Automatically detected platform cpu.
    INFO 05-31 08:21:20 [__init__.py:31] Available plugins for group vllm.general_plugins:
    INFO 05-31 08:21:20 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
    ...
    INFO 05-31 08:22:17 [launcher.py:36] Route: /v2/rerank, Methods: POST
    INFO 05-31 08:22:17 [launcher.py:36] Route: /invocations, Methods: POST
    INFO 05-31 08:22:17 [launcher.py:36] Route: /metrics, Methods: GET
    INFO:     Started server process [26058]
    INFO:     Waiting for application startup.
    INFO:     Application startup complete.
    
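    Once the log shows "Application startup complete.", you can smoke-test the OpenAI-compatible endpoint (vLLM listens on port 8000 by default; the text-only prompt here is just an illustration, and CPU inference will be noticeably slow):

     curl http://localhost:8000/v1/chat/completions \
         -H "Content-Type: application/json" \
         -d '{
               "model": "AgentCPM-GUI",
               "messages": [{"role": "user", "content": "Hello"}],
               "max_tokens": 64
             }'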

Official Documentation

vLLM official documentation: https://vllm.hyper.ai/docs/getting-started/installation/cpu/

Care to follow along?

Since you've read this far, why not follow the WeChat official account DYF-AI, make a friend, and get more practical tips on large-model algorithms, deployment, and more!
