Deploying Large Models with vLLM on CPU
Background

As a developer on a budget (no high-end GPU at hand, but still wanting to debug models of 7B parameters and up), what do you do when vLLM runs out of GPU memory on a large model? Don't panic: here is a "budget" workaround, deploying large models with vLLM's CPU backend (suitable for testing scenarios only).
Environment

- OS: Ubuntu 20.04
- Python: 3.9-3.12
- gcc/g++: gcc/g++ >= 12.3.0 is recommended as the default compiler. I went straight to gcc/g++ 13 here; the installation works like this:

```bash
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt update
sudo apt install gcc-13 g++-13
# Check the current priorities
sudo update-alternatives --config gcc
# Raise the priority of the version-13 toolchain
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-13 100
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-13 100
```
The result looks like this:

```
(vllm-cpu) dongyongfei786@DYF-PC:/mnt/g/dongyongfei786/vllm_source$ sudo update-alternatives --config gcc
[sudo] password for dongyongfei786:
There are 4 choices for the alternative gcc (providing /usr/bin/gcc).

  Selection    Path               Priority   Status
------------------------------------------------------------
  0            /usr/bin/gcc-13    100        auto mode
  1            /usr/bin/gcc-11    90         manual mode
* 2            /usr/bin/gcc-13    100        manual mode
  3            /usr/bin/gcc-7     70         manual mode
  4            /usr/bin/gcc-9     80         manual mode
```
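To confirm the switch took effect, a quick version query should now report 13.x (the exact output depends on your distro's package build):

```bash
# Verify that gcc/g++ 13 is now the default toolchain
gcc --version
g++ --version
```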
Building from Source
- Create a new Python virtual environment

```bash
# (Recommended) Create a fresh conda environment
conda create -n vllm python=3.12 -y
conda activate vllm
```
- Build the wheel from source

  Upgrading gcc/g++ was covered above. libnuma-dev is mandatory; the build fails without it.

```bash
sudo apt-get update -y
sudo apt-get install -y gcc-12 g++-12 libnuma-dev
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-13 10 --slave /usr/bin/g++ g++ /usr/bin/g++-13
```
- Clone the vLLM repository

```bash
git clone https://github.com/vllm-project/vllm.git vllm_source
cd vllm_source
```
- Install the Python packages required to build the vLLM CPU backend

```bash
pip install --upgrade pip
pip install "cmake>=3.26" wheel packaging ninja "setuptools-scm>=8" numpy
pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
```
- Build and install the vLLM CPU backend

```bash
VLLM_TARGET_DEVICE=cpu python setup.py install
```

If you see output like the following, the installation succeeded:
```
# VLLM_TARGET_DEVICE=cpu python setup.py install
Using /home/dongyongfei786/miniconda3/envs/vllm-cpu/lib/python3.12/site-packages
Searching for markdown-it-py==3.0.0
Best match: markdown-it-py 3.0.0
Adding markdown-it-py 3.0.0 to easy-install.pth file
Installing markdown-it script to /home/dongyongfei786/miniconda3/envs/vllm-cpu/bin
Using /home/dongyongfei786/miniconda3/envs/vllm-cpu/lib/python3.12/site-packages
Searching for mdurl==0.1.2
Best match: mdurl 0.1.2
Adding mdurl 0.1.2 to easy-install.pth file
Using /home/dongyongfei786/miniconda3/envs/vllm-cpu/lib/python3.12/site-packages
Finished processing dependencies for vllm==0.9.1.dev80+g2dbe8c077.cpu
```
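As an extra sanity check (the version string below comes from my checkout and will differ on yours), you can verify that the freshly built package imports cleanly and carries the CPU-backend suffix:

```bash
# A successful CPU build reports a version ending in .cpu
python -c "import vllm; print(vllm.__version__)"
# e.g. 0.9.1.dev80+g2dbe8c077.cpu
```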
- Quick smoke test: vllm serve --help
```
(vllm-cpu) dongyongfei786@DYF-PC:/mnt/g/dongyongfei786/VLM-GUI/AgentCPM-GUI/deploy$ vllm serve --help
...
options:
  --allow-credentials   Allow credentials. (default: False)
  --allowed-headers ALLOWED_HEADERS
                        Allowed headers. (default: ['*'])
  --allowed-methods ALLOWED_METHODS
                        Allowed methods. (default: ['*'])
  --allowed-origins ALLOWED_ORIGINS
                        Allowed origins. (default: ['*'])
  --api-key API_KEY     If provided, the server will require this key to be presented in the header. (default: None)
  --api-server-count API_SERVER_COUNT, -asc API_SERVER_COUNT
                        How many API server processes to run. (default: 1)
```
Deploying the AgentCPM-GUI (8B) Model with vllm-cpu
- Launch the model

```bash
vllm serve /mnt/n/model/VLM-GUI/AgentCPM-GUI \
    --served-model-name AgentCPM-GUI \
    --tensor_parallel_size 1 \
    --trust-remote-code \
    --limit_mm_per_prompt image=2
```
- Successful startup
```
(vllm-cpu) dongyongfei786@DYF-PC:/mnt/g/dongyongfei786/VLM-GUI/AgentCPM-GUI/deploy$ bash vllm_cpu.sh
[W531 08:20:53.911671810 OperatorEntry.cpp:154] Warning: Warning only once for all operators, other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_addmm_activation(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1, bool use_gelu=False) -> Tensor
    registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: AutocastCPU
  previous kernel: registered at /pytorch/aten/src/ATen/autocast_mode.cpp:327
       new kernel: registered at /opt/workspace/ipex-cpu-dev/csrc/cpu/autocast/autocast_mode.cpp:112 (function operator())
INFO 05-31 08:20:58 [__init__.py:243] Automatically detected platform cpu.
INFO 05-31 08:21:03 [__init__.py:31] Available plugins for group vllm.general_plugins:
INFO 05-31 08:21:03 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
INFO 05-31 08:21:05 [config.py:1934] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 05-31 08:21:05 [cli_args.py:300] non-default args: {'trust_remote_code': True, 'served_model_name': ['AgentCPM-GUI'], 'limit_mm_per_prompt': {'image': 2}}
INFO 05-31 08:21:13 [config.py:813] This model supports multiple tasks: {'reward', 'generate', 'score', 'classify', 'embed'}. Defaulting to 'generate'.
WARNING 05-31 08:21:13 [_logger.py:72] device type=cpu is not supported by the V1 Engine. Falling back to V0.
INFO 05-31 08:21:13 [config.py:1934] Disabled the custom all-reduce kernel because it is not supported on current platform.
WARNING 05-31 08:21:13 [_logger.py:72] Environment variable VLLM_CPU_KVCACHE_SPACE (GiB) for CPU backend is not set, using 4 by default.
WARNING 05-31 08:21:13 [_logger.py:72] uni is not supported on CPU, fallback to mp distributed executor backend.
INFO 05-31 08:21:13 [api_server.py:266] Started engine process with PID 26110
[W531 08:21:17.882850704 OperatorEntry.cpp:154] Warning: Warning only once for all operators, other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_addmm_activation(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1, bool use_gelu=False) -> Tensor
    registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: AutocastCPU
  previous kernel: registered at /pytorch/aten/src/ATen/autocast_mode.cpp:327
       new kernel: registered at /opt/workspace/ipex-cpu-dev/csrc/cpu/autocast/autocast_mode.cpp:112 (function operator())
INFO 05-31 08:21:18 [__init__.py:243] Automatically detected platform cpu.
INFO 05-31 08:21:20 [__init__.py:31] Available plugins for group vllm.general_plugins:
INFO 05-31 08:21:20 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
...
INFO 05-31 08:22:17 [launcher.py:36] Route: /v2/rerank, Methods: POST
INFO 05-31 08:22:17 [launcher.py:36] Route: /invocations, Methods: POST
INFO 05-31 08:22:17 [launcher.py:36] Route: /metrics, Methods: GET
INFO:     Started server process [26058]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
```
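Two follow-ups worth noting. First, the warning in the logs shows that VLLM_CPU_KVCACHE_SPACE defaults to 4 GiB. If your machine has RAM to spare, you can export a larger value before launching; the 16 below is an arbitrary size chosen for illustration, so pick whatever fits your box:

```bash
# Give the CPU KV cache more room than the 4 GiB default (value is in GiB)
export VLLM_CPU_KVCACHE_SPACE=16
vllm serve /mnt/n/model/VLM-GUI/AgentCPM-GUI \
    --served-model-name AgentCPM-GUI \
    --tensor_parallel_size 1 \
    --trust-remote-code \
    --limit_mm_per_prompt image=2
```

Second, once startup completes, the server exposes the standard OpenAI-compatible API, on localhost:8000 unless you changed --host or --port. A minimal text-only smoke test, assuming those defaults:

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "AgentCPM-GUI",
          "messages": [{"role": "user", "content": "Hello! Briefly introduce yourself."}],
          "max_tokens": 64
        }'
```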
Official Documentation

vLLM official docs: https://vllm.hyper.ai/docs/getting-started/installation/cpu/
Care to follow?

Since you've read this far, why not follow the WeChat official account DYF-AI? Let's be friends, and you'll get more practical tips on large-model algorithms, deployment, and beyond!