Deploying vLLM, an LLM Inference Acceleration Tool, on Windows 11

Original article, last modified 2024-06-18 10:09:50

License: CC 4.0 BY-SA

First published 2024-04-08 10:50:46

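This article introduces vLLM, an open-source inference framework developed at UC Berkeley and adopted by LMSYS, which substantially improves the efficiency of serving language models in real-time scenarios and clearly outperforms HuggingFace Transformers on throughput. It walks through installing and configuring vLLM with Docker, deploying it as an API service, and calling it both online and offline.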

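vLLM is an open-source, high-speed inference framework for large language models, developed at UC Berkeley and adopted by LMSYS. It aims to greatly improve the throughput and memory efficiency of language model serving in real-time scenarios. vLLM is a fast, easy-to-use library for LLM inference and serving that integrates seamlessly with HuggingFace models. Its core is PagedAttention, an attention algorithm that manages the attention keys and values (the KV cache) efficiently.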

In terms of throughput, vLLM delivers up to 24x the throughput of HuggingFace Transformers (HF) and up to 3.5x that of Text Generation Inference (TGI).
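
To make the paging idea behind PagedAttention more concrete, here is a toy sketch of a block-table allocator. It is an illustration only, not vLLM's implementation: the class `PagedKVCache` and its methods are hypothetical names, and it tracks only which fixed-size block each token's keys and values would occupy, not the tensors themselves.

```python
class PagedKVCache:
    """Toy block-table allocator (hypothetical, not vLLM code).

    KV memory is split into fixed-size blocks. Each sequence owns a list of
    physical block ids (its block table), so its cache need not be contiguous.
    """

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # unused physical blocks
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored so far

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new block is needed only when the last one is full (or none exists).
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("a")   # sequence "a" grows to 3 tokens, using 2 blocks
cache.append_token("b")       # sequence "b" gets its own block
print(cache.block_tables)     # {'a': [0, 1], 'b': [2]}
```

Because blocks are allocated on demand, a sequence wastes at most one partially filled block, and freed blocks can be reused by any other sequence right away.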

Installing with Docker

Pull the CUDA image

docker pull nvcr.io/nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04

Create the container

docker run --gpus=all -it --name vllm -p 8010:8000 -v D:\llm-model:/llm-model nvcr.io/nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04

Install the system dependencies

apt-get update -yq --fix-missing
# DEBIAN_FRONTEND must be set on the same command line to reach apt-get
DEBIAN_FRONTEND=noninteractive apt-get install -yq --no-install-recommends pkg-config wget cmake curl git vim

Install Miniconda3

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh -b -u -p ~/miniconda3
~/miniconda3/bin/conda init
source ~/.bashrc

Create the environment

conda create -n vllm python=3.10
conda activate vllm

Install the Python dependencies

pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install torch==2.1.2+cu118 torchvision==0.16.2+cu118 torchaudio==2.1.2 xformers==0.0.23.post1+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
pip install transformers
pip install requests
pip install gradio==4.14.0

export VLLM_VERSION=0.4.0
# Must match the Python version of the conda environment (3.10 -> 310)
export PYTHON_VERSION=310
pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
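
The `cp…` tag in the wheel filename has to match the interpreter that pip runs under (CPython 3.10 gives `cp310`), otherwise pip rejects the wheel as unsupported on this platform. As a small sketch, the URL can be derived from the running interpreter instead of being typed by hand. The helper name `vllm_wheel_url` is hypothetical, and the URL pattern is simply the one used in the command above; whether a wheel exists for a given version and CUDA tag still has to be checked on the releases page.

```python
import sys


def vllm_wheel_url(vllm_version, cuda_tag="cu118", python_version=None):
    """Build the release wheel URL following the pattern used above.

    python_version is a (major, minor) tuple; it defaults to the
    interpreter running this script.
    """
    if python_version is None:
        python_version = sys.version_info[:2]
    major, minor = python_version
    cp = f"cp{major}{minor}"  # e.g. (3, 10) -> "cp310"
    return (
        "https://github.com/vllm-project/vllm/releases/download/"
        f"v{vllm_version}/vllm-{vllm_version}+{cuda_tag}-{cp}-{cp}"
        "-manylinux1_x86_64.whl"
    )


# URL for the interpreter of the currently active environment
print(vllm_wheel_url("0.4.0"))
```

Running this inside the activated `vllm` environment prints a URL that can be passed to `pip install` together with the same `--extra-index-url` option.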

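Online serving

vLLM can be deployed as an API service built on the FastAPI web framework. The API service uses the AsyncLLMEngine class to handle requests asynchronously.

Start the server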

python -m vllm.entrypoints.openai.api_server --model /llm-model/Baichuan2-7B-Chat --served-model-name Baichuan2-7B-Chat --trust-remote-code

# List the available GPUs
nvidia-smi

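# Pin a GPU and choose a port (GPU index 7 and port 10086 are examples:
# use an index reported by nvidia-smi and a port published with docker -p)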
CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 10086 --model /llm-model/Baichuan2-7B-Chat --served-model-name Baichuan2-7B-Chat --trust-remote-code
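
The server needs some time to load the model before it accepts requests, so a script that starts it may want to wait until it is ready. A minimal sketch, assuming the OpenAI-compatible server answers `GET /v1/models` once it is up; the function names `models_endpoint_ok` and `wait_until_ready` and the timeout values are illustrative choices, not part of vLLM.

```python
import time
import urllib.error
import urllib.request


def models_endpoint_ok(base_url):
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, HTTP error: treat as not ready yet
        return False


def wait_until_ready(base_url, timeout_s=300.0, interval_s=2.0,
                     probe=models_endpoint_ok):
    """Poll probe(base_url) until it succeeds or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For example, `wait_until_ready("http://localhost:8000")` returns `True` as soon as the server responds and `False` if it is still unreachable after five minutes.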

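Calling the API

The commands below target port 8000, the server's default, and work from inside the container. From the Windows host, use port 8010 instead, which the docker run command maps to container port 8000.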

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Baichuan2-7B-Chat",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Baichuan2-7B-Chat",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'
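
The same chat request can be sent from Python with only the standard library. This is a sketch rather than an official client: the helper names `build_chat_request`, `extract_reply` and `chat` are hypothetical, and the response is assumed to follow the OpenAI chat-completions layout (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message,
                       system_message="You are a helpful assistant."):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of a parsed chat-completions response."""
    return response_body["choices"][0]["message"]["content"]


def chat(base_url, model, user_message):
    """Send one chat turn and return the reply text (needs a running server)."""
    request = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=60) as resp:
        return extract_reply(json.load(resp))
```

With the server running, `chat("http://localhost:8000", "Baichuan2-7B-Chat", "Who won the world series in 2020?")` sends the same request as the second curl command.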

Offline inference

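from vllm import LLM, SamplingParams

# Model directory inside the container (mounted from D:\llm-model)
MODEL_PATH = "/llm-model/Baichuan2-7B-Chat"

prompts = ["San Francisco is a"]
# temperature=0 selects greedy decoding; generate at most 100 new tokens
sampling_params = SamplingParams(temperature=0, max_tokens=100)

llm = LLM(
    model=MODEL_PATH,
    tokenizer_mode="auto",
    trust_remote_code=True,      # the model directory ships custom code
    enforce_eager=True,          # run in eager mode, without CUDA graphs
    enable_prefix_caching=True,  # reuse the KV cache of shared prompt prefixes
)

# generate() returns one result object per prompt
outputs = llm.generate(prompts, sampling_params=sampling_params)
for output in outputs:
    generated_text = output.outputs[0].text
    print(generated_text)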