Deploying Large Models with vLLM on CPU


Background

As a developer on a budget (no high-end GPU at hand, but still wanting to debug 7B+ models), what do you do when vLLM runs out of GPU memory on a large model? Don't panic: here is a shoestring workaround, deploying large models with vLLM's CPU backend (suitable for testing scenarios only).

Environment

  • OS: Ubuntu 20.04

  • Python: 3.9-3.12

  • gcc/g++: gcc/g++ >= 12.3.0 is recommended as the default compiler; I went straight to gcc/g++ 13. Install and configure it as follows:

     sudo add-apt-repository ppa:ubuntu-toolchain-r/test
     sudo apt update
     sudo apt install gcc-13 g++-13
     # Check the registered alternatives and their priorities
     sudo update-alternatives --config gcc
     # Register gcc/g++ 13 with a higher priority so it becomes the default
     sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-13 100
     sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-13 100
    
    

    The result looks like this:

    (vllm-cpu) dongyongfei786@DYF-PC:/mnt/g/dongyongfei786/vllm_source$  sudo update-alternatives --config gcc
    [sudo] password for dongyongfei786:
    There are 4 choices for the alternative gcc (providing /usr/bin/gcc).
    
      Selection    Path             Priority   Status
    ------------------------------------------------------------
      0            /usr/bin/gcc-13   100       auto mode
      1            /usr/bin/gcc-11   90        manual mode
    * 2            /usr/bin/gcc-13   100       manual mode
      3            /usr/bin/gcc-7    70        manual mode
      4            /usr/bin/gcc-9    80        manual mode
    
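    After switching, it is worth confirming that gcc 13 is now the default compiler (a quick sanity check; the exact version output will vary with your package build):

     gcc --version
     g++ --version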

Building from Source

  • Create a new Python virtual environment
# (Recommended) create a new conda environment
conda create -n vllm-cpu python=3.12 -y
conda activate vllm-cpu
  • Build the wheel from source

    Upgrading gcc/g++ was covered above. libnuma-dev must be installed, otherwise the build will fail.

     sudo apt-get update -y
     # gcc/g++ 13 was already configured above, so only libnuma-dev is still needed here
     sudo apt-get install -y libnuma-dev
    
  • Clone the vLLM repository

    git clone https://github.com/vllm-project/vllm.git vllm_source
    cd vllm_source
    
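    The install output further down shows the version string vllm==0.9.1.dev80+g2dbe8c077, so if you want to reproduce that exact build you could check out the same commit (the short hash below is taken from that version string; any recent commit should also work):

     git checkout 2dbe8c077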
  • Install the Python packages required to build the vLLM CPU backend

    pip install --upgrade pip
    pip install "cmake>=3.26" wheel packaging ninja "setuptools-scm>=8" numpy
    pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
    
  • Build and install the vLLM CPU backend

    VLLM_TARGET_DEVICE=cpu python setup.py install
    

    If you see output like the following, the installation succeeded:

    # VLLM_TARGET_DEVICE=cpu python setup.py install
    
    Using /home/dongyongfei786/miniconda3/envs/vllm-cpu/lib/python3.12/site-packages
    Searching for markdown-it-py==3.0.0
    Best match: markdown-it-py 3.0.0
    Adding markdown-it-py 3.0.0 to easy-install.pth file
    Installing markdown-it script to /home/dongyongfei786/miniconda3/envs/vllm-cpu/bin
    
    Using /home/dongyongfei786/miniconda3/envs/vllm-cpu/lib/python3.12/site-packages
    Searching for mdurl==0.1.2
    Best match: mdurl 0.1.2
    Adding mdurl 0.1.2 to easy-install.pth file
    
    Using /home/dongyongfei786/miniconda3/envs/vllm-cpu/lib/python3.12/site-packages
    Finished processing dependencies for vllm==0.9.1.dev80+g2dbe8c077.cpu
    
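    You can also verify the installed package directly from Python (the version string will differ depending on the commit you built):

     python -c "import vllm; print(vllm.__version__)"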
  • Quick sanity check: vllm serve --help

    (vllm-cpu) dongyongfei786@DYF-PC:/mnt/g/dongyongfei786/VLM-GUI/AgentCPM-GUI/deploy$ vllm serve --help
    ...
    options:
      --allow-credentials   Allow credentials. (default: False)
      --allowed-headers ALLOWED_HEADERS
                            Allowed headers. (default: ['*'])
      --allowed-methods ALLOWED_METHODS
                            Allowed methods. (default: ['*'])
      --allowed-origins ALLOWED_ORIGINS
                            Allowed origins. (default: ['*'])
      --api-key API_KEY     If provided, the server will require this key to be presented in the header. (default: None)
      --api-server-count API_SERVER_COUNT, -asc API_SERVER_COUNT
                            How many API server processes to run. (default: 1)
    

Deploying the AgentCPM-GUI (8B) Model with vLLM-CPU

  • Start the model

    vllm serve /mnt/n/model/VLM-GUI/AgentCPM-GUI --served-model-name AgentCPM-GUI --tensor_parallel_size 1 --trust-remote-code --limit_mm_per_prompt image=2
    
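    The startup log below shows the server being launched through a small wrapper script, vllm_cpu.sh. Here is a minimal sketch of such a script, assuming you also want to raise the CPU KV cache size (the 8 GiB value is an assumption to tune for your machine; as the log warns, VLLM_CPU_KVCACHE_SPACE defaults to 4 GiB when unset):

     #!/bin/bash
     # CPU KV cache size in GiB; vLLM falls back to 4 GiB if this is unset
     export VLLM_CPU_KVCACHE_SPACE=8
     vllm serve /mnt/n/model/VLM-GUI/AgentCPM-GUI \
         --served-model-name AgentCPM-GUI \
         --tensor_parallel_size 1 \
         --trust-remote-code \
         --limit_mm_per_prompt image=2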
  • Successful startup

    (vllm-cpu) dongyongfei786@DYF-PC:/mnt/g/dongyongfei786/VLM-GUI/AgentCPM-GUI/deploy$ bash vllm_cpu.sh
    [W531 08:20:53.911671810 OperatorEntry.cpp:154] Warning: Warning only once for all operators,  other operators may also be overridden.
      Overriding a previously registered kernel for the same operator and the same dispatch key
      operator: aten::_addmm_activation(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1, bool use_gelu=False) -> Tensor
        registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
      dispatch key: AutocastCPU
      previous kernel: registered at /pytorch/aten/src/ATen/autocast_mode.cpp:327
           new kernel: registered at /opt/workspace/ipex-cpu-dev/csrc/cpu/autocast/autocast_mode.cpp:112 (function operator())
    INFO 05-31 08:20:58 [__init__.py:243] Automatically detected platform cpu.
    INFO 05-31 08:21:03 [__init__.py:31] Available plugins for group vllm.general_plugins:
    INFO 05-31 08:21:03 [__init__.py:33] - lora_filesystem_resolver -> 
    INFO 05-31 08:21:05 [config.py:1934] Disabled the custom all-reduce kernel because it is not supported on current platform.
    INFO 05-31 08:21:05 [cli_args.py:300] non-default args: {'trust_remote_code': True, 'served_model_name': ['AgentCPM-GUI'], 'limit_mm_per_prompt': {'image': 2}}
    INFO 05-31 08:21:13 [config.py:813] This model supports multiple tasks: {'reward', 'generate', 'score', 'classify', 'embed'}. Defaulting to 'generate'.
    WARNING 05-31 08:21:13 [_logger.py:72] device type=cpu is not supported by the V1 Engine. Falling back to V0.
    INFO 05-31 08:21:13 [config.py:1934] Disabled the custom all-reduce kernel because it is not supported on current platform.
    WARNING 05-31 08:21:13 [_logger.py:72] Environment variable VLLM_CPU_KVCACHE_SPACE (GiB) for CPU backend is not set, using 4 by default.
    WARNING 05-31 08:21:13 [_logger.py:72] uni is not supported on CPU, fallback to mp distributed executor backend.
    INFO 05-31 08:21:13 [api_server.py:266] Started engine process with PID 26110
    [W531 08:21:17.882850704 OperatorEntry.cpp:154] Warning: Warning only once for all operators,  other operators may also be overridden.
      Overriding a previously registered kernel for the same operator and the same dispatch key
      operator: aten::_addmm_activation(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1, bool use_gelu=False) -> Tensor
        registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
      dispatch key: AutocastCPU
      previous kernel: registered at /pytorch/aten/src/ATen/autocast_mode.cpp:327
           new kernel: registered at /opt/workspace/ipex-cpu-dev/csrc/cpu/autocast/autocast_mode.cpp:112 (function operator())
    INFO 05-31 08:21:18 [__init__.py:243] Automatically detected platform cpu.
    INFO 05-31 08:21:20 [__init__.py:31] Available plugins for group vllm.general_plugins:
    INFO 05-31 08:21:20 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
    ...
    INFO 05-31 08:22:17 [launcher.py:36] Route: /v2/rerank, Methods: POST
    INFO 05-31 08:22:17 [launcher.py:36] Route: /invocations, Methods: POST
    INFO 05-31 08:22:17 [launcher.py:36] Route: /metrics, Methods: GET
    INFO:     Started server process [26058]
    INFO:     Waiting for application startup.
    INFO:     Application startup complete.
    
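    Once the log shows "Application startup complete.", you can smoke-test the OpenAI-compatible endpoint (vLLM listens on port 8000 by default; the text-only prompt here is just an illustration, and CPU inference will be noticeably slow):

     curl http://localhost:8000/v1/chat/completions \
         -H "Content-Type: application/json" \
         -d '{
               "model": "AgentCPM-GUI",
               "messages": [{"role": "user", "content": "Hello"}],
               "max_tokens": 64
             }'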

Official Documentation

vLLM official documentation: https://vllm.hyper.ai/docs/getting-started/installation/cpu/

Care to follow along?

Since you've read this far, why not follow the WeChat official account DYF-AI, make a friend, and get more practical tips on large-model algorithms, deployment, and more!
