LLM Inference Framework: vLLM

小枫@码

License: CC 4.0 BY-SA

Category: Large Models. Tags: language models

Article link: https://blog.youkuaiyun.com/wsq1011/article/details/146018261

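I. Introduction to vLLM

vLLM is an open-source framework for high-speed large language model inference, originally developed in the Sky Computing Lab at UC Berkeley and used by the LMSYS organization for its model serving. It is built around a new attention algorithm, PagedAttention, and aims to make LLM serving easy, fast, and cheap.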
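
PagedAttention is described by its authors as storing the attention key-value cache in fixed-size blocks, much as an operating system pages virtual memory. The toy sketch below only illustrates that bookkeeping idea; it is not vLLM's code, and `ToyBlockManager`, `BLOCK_SIZE`, and the sequence names are hypothetical.

```python
# Toy illustration of paged KV-cache bookkeeping; NOT vLLM's implementation.
# BLOCK_SIZE and every name below are assumptions made for this example.
BLOCK_SIZE = 4  # tokens stored per cache block


class ToyBlockManager:
    def __init__(self, num_blocks):
        # Pool of free physical block ids.
        self.free_blocks = list(range(num_blocks))
        # Per-sequence block table: logical block index -> physical block id.
        self.block_tables = {}
        # Number of tokens stored so far for each sequence.
        self.lengths = {}

    def append_token(self, seq_id):
        """Reserve room for one more token and return its physical block id."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:
            # All allocated blocks are full (or none exist yet): take a new one.
            if not self.free_blocks:
                raise MemoryError("no free cache blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1
        return table[-1]

    def free(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


manager = ToyBlockManager(num_blocks=8)
for _ in range(6):
    manager.append_token("seq-a")  # 6 tokens need 2 blocks
for _ in range(3):
    manager.append_token("seq-b")  # 3 tokens need 1 block
print(manager.block_tables)
```

Because blocks are handed out on demand, a sequence only ever wastes the unused tail of its last block, which is the memory-saving intuition behind the paging approach.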

II. Installing vLLM

2.1 Installing with GPU

vLLM is a Python library that also ships precompiled C++ and CUDA (12.1) binaries.

1. Requirements:

OS: Linux
Python: 3.8 – 3.11
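
The requirements above can be checked from Python before installing. This is a minimal sketch: `meets_gpu_install_requirements` is a hypothetical helper written for this article, not part of vLLM, and it checks only the OS and Python range listed here.

```python
import platform
import sys


def meets_gpu_install_requirements(os_name, python_version):
    """Check the OS and Python range listed above (hypothetical helper)."""
    is_linux = os_name == "Linux"
    # Compare only (major, minor) against the supported 3.8 to 3.11 range.
    python_ok = (3, 8) <= tuple(python_version[:2]) <= (3, 11)
    return is_linux and python_ok


ok = meets_gpu_install_requirements(platform.system(), sys.version_info)
print("Requirements met" if ok else "Requirements not met")
```

If the check passes, the package is normally installed with `pip install vllm`.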

2.2 Installing with CPU

vLLM also supports basic model inference and serving on x86 CPU platforms. The supported data types are FP32 and BF16.

1. Requirements:

OS: Linux
Compiler: gcc/g++>=12.3.0 (recommended)
Instruction set architecture (ISA) requirement: AVX512 is required.

2. Install build dependencies:

yum install -y gcc gcc-c++

3. Download the source code:

git clone https://github.com/vllm-project/vllm.git

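4. Install Python dependencies:

# Quote the version specifier so the shell does not treat ">" as a redirect
pip install wheel packaging ninja "setuptools>=49.4.0" numpy psutil

# Run this from inside the cloned source directory (cd vllm)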
pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu

5. Run the installation:

VLLM_TARGET_DEVICE=cpu python setup.py install
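
Before starting the CPU build steps above, it can help to confirm the AVX512 and compiler requirements. The sketch below parses text in the format of /proc/cpuinfo and `gcc --version`; both helper functions and the sample strings are hypothetical, and checking the `avx512f` flag is an assumption about what "AVX512" means here.

```python
import re


def has_avx512(cpuinfo_text):
    """Return True if any 'flags' line lists the avx512f flag."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            flags = line.split(":", 1)[1].split()
            if "avx512f" in flags:
                return True
    return False


def gcc_version_ok(version_output, minimum=(12, 3, 0)):
    """Parse `gcc --version`-style text and compare against the minimum."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        return False
    return tuple(int(part) for part in match.groups()) >= minimum


# Sample inputs; on a real machine, read /proc/cpuinfo and run `gcc --version`.
sample_cpuinfo = "processor\t: 0\nflags\t\t: fpu sse2 avx2 avx512f avx512bw\n"
sample_gcc = "gcc (GCC) 12.3.0"
print(has_avx512(sample_cpuinfo), gcc_version_ok(sample_gcc))
```

Note that the gcc packaged for a distribution may be older than the recommended 12.3.0, so the compiler check is worth running after the `yum install` step.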

2.3 Configuration

1. By default, vLLM downloads models from HuggingFace. To download models from ModelScope instead, set the following environment variable:

export VLLM_USE_MODELSCOPE=True
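
The same variable can also be set from inside a Python script. This is a sketch under one assumption: the variable is set before vLLM is imported, which is the safe ordering regardless of when vLLM reads it.

```python
import os

# Set the variable first so it is already in place when vLLM is imported.
os.environ["VLLM_USE_MODELSCOPE"] = "True"

# from vllm import LLM  # import vLLM only after the variable is set
print(os.environ["VLLM_USE_MODELSCOPE"])
```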

III. Using vLLM

3.1 Offline inference

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Model name or path; GPTQ and AWQ models also work.
# Use the same value for the tokenizer and the engine. A local copy such as
# /data/weisx/model/Qwen1.5-4B-Chat can be used instead of the Hub name.
model_path = "Qwen/Qwen1.5-4B-Chat"

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Pass the default decoding hyperparameters of Qwen1.5-4B-Chat.
# max_tokens is the maximum number of tokens to generate.
sampling_params = SamplingParams(
    temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512
)

# Load the model into the vLLM engine
llm = LLM(model=model_path, trust_remote_code=True)

# Prepare your prompts
prompt = "Tell me something about large language models."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
# Render the chat messages into a single prompt string
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Generate outputs
outputs = llm.generate([text], sampling_params)

# Print the outputs
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
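
To make the `apply_chat_template` step less opaque, here is a hand-written approximation of the ChatML-style layout that Qwen chat models use. `render_chatml` is a hypothetical helper for illustration only; the authoritative template ships with the tokenizer and may differ in details.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Hand-written approximation of a ChatML-style chat template."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|> markers.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me something about large language models."},
]
rendered = render_chatml(example_messages)
print(rendered)
```

This is why `llm.generate` receives a plain string: the chat structure has already been flattened into text before it reaches the engine.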

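3.2 OpenAI-compatible API service

With vLLM it is easy to build a service that is compatible with the OpenAI API and can be deployed as a server implementing the OpenAI API protocol. By default the server starts at http://localhost:8000. Use the --host and --port arguments to change the address. Run the following command:

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen1.5-4B-Chat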
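
When the address needs to change, the command line can be assembled programmatically. The sketch below only builds and prints the command; `build_server_command` is a hypothetical helper, and the host `0.0.0.0` and port `8001` are example values, not defaults taken from vLLM.

```python
import shlex


def build_server_command(model, host, port):
    """Assemble the api_server command with explicit --host and --port."""
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--host", host,
        "--port", str(port),
    ]


command = build_server_command("Qwen/Qwen1.5-4B-Chat", host="0.0.0.0", port=8001)
# Print a copy-pasteable shell command; pass `command` to subprocess to launch it.
print(shlex.join(command))
```

Clients would then use the matching base URL, for example http://localhost:8001/v1 for the port chosen above.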

Query Qwen with curl:

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen1.5-4B-Chat",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me something about large language models."}
  ]
}'
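
The same request can be built with only the Python standard library, which is handy where neither curl nor the openai package is available. This sketch constructs the request without sending it; the commented lines show how it would be sent once the server is running.

```python
import json
import urllib.request

payload = {
    "model": "Qwen/Qwen1.5-4B-Chat",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
    ],
}

# Build a POST request equivalent to the curl command above.
request = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Sending requires a running server:
# with urllib.request.urlopen(request) as response:
#     body = json.load(response)
#     print(body["choices"][0]["message"]["content"])
print(request.get_method(), request.full_url)
```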

Query Qwen with the Python client:

from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen1.5-4B-Chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
    ],
)
print("Chat response:", chat_response)
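
Printing the whole response object shows metadata along with the answer. To get only the generated text, read `choices[0].message.content`. The sketch below demonstrates the access path on a stand-in object so that it runs without a server; `first_reply_text` and the fake response are hypothetical.

```python
from types import SimpleNamespace


def first_reply_text(chat_response):
    """Return the text of the first choice in a chat completion response."""
    return chat_response.choices[0].message.content


# Stand-in with the same attribute layout as a chat completion response.
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            message=SimpleNamespace(role="assistant", content="Example reply.")
        )
    ]
)
print(first_reply_text(fake_response))
```

With a live server, the same helper can be called on the `chat_response` returned by `client.chat.completions.create`.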