vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention

vLLM is an open-source library that uses PagedAttention to dramatically improve the inference speed and memory efficiency of large language model (LLM) serving. It delivers up to 24x higher throughput than HuggingFace Transformers without requiring any changes to the model architecture. PagedAttention optimizes memory use with virtual-memory-style management and block-wise attention, making LLM serving affordable even for small teams.

Resources:

paper: https://arxiv.org/pdf/2309.06180.pdf

repo: https://github.com/vllm-project/vllm (a high-throughput and memory-efficient inference and serving engine for LLMs)

highlights blog by authors: "vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention" on the vLLM Blog

Blog Note with Details from Paper

LLMs promise to fundamentally change how we use AI across all industries. However, actually serving these models is challenging and can be surprisingly slow even on expensive hardware. Today we are excited to introduce vLLM, an open-source library for fast LLM inference and serving. vLLM utilizes PagedAttention, our new attention algorithm that effectively manages attention keys and values. vLLM equipped with PagedAttention redefines the new state of the art in LLM serving: it delivers up to 24x higher throughput than HuggingFace Transformers, without requiring any model architecture changes.

vLLM has been developed at UC Berkeley and deployed at Chatbot Arena and Vicuna Demo for the past two months. It is the core technology that makes LLM serving affordable even for a small research team like LMSYS with limited compute resources. Try out vLLM now with a single command at our GitHub repository.

Beyond State-of-the-art Performance

We compare the throughput of vLLM with HuggingFace Transformers (HF), the most popular LLM library and HuggingFace Text Generation Inference (TGI), the previous state of the art. We evaluate in two settings: LLaMA-7B on an NVIDIA A10G GPU and LLaMA-13B on an NVIDIA A100 GPU (40GB). We sample the requests’ input/output lengths from the ShareGPT dataset. In our experiments, vLLM achieves up to 24x higher throughput compared to HF and up to 3.5x higher throughput than TGI.

Serving throughput when each request asks for one output completion. vLLM achieves 14x - 24x higher throughput than HF and 2.2x - 2.5x higher throughput than TGI.

Serving throughput when each request asks for three parallel output completions. vLLM achieves 8.5x - 15x higher throughput than HF and 3.3x - 3.5x higher throughput than TGI.

The Secret Sauce: PagedAttention

In vLLM, we identify that the performance of LLM serving is bottlenecked by memory. In the autoregressive decoding process, all the input tokens to the LLM produce their attention key and value tensors, and these tensors are kept in GPU memory to generate next tokens. These cached key and value tensors are often referred to as KV cache. The KV cache is

  • Large: Takes up to 1.7GB for a single sequence in LLaMA-13B.
  • Dynamic: Its size depends on the sequence length, which is highly variable and unpredictable.

As a result, efficiently managing the KV cache presents a significant challenge. We find that existing systems waste 60% – 80% of memory due to fragmentation and over-reservation.
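
As a back-of-the-envelope check on the 1.7GB figure, here is a rough calculation, assuming LLaMA-13B's published configuration (40 layers, hidden size 5120) and 16-bit (FP16) keys and values:

# Rough KV cache size for LLaMA-13B, assuming FP16 keys and values (illustrative).
num_layers, hidden_size, fp16_bytes = 40, 5120, 2
kv_bytes_per_token = 2 * num_layers * hidden_size * fp16_bytes  # the 2 covers K and V
print(kv_bytes_per_token)               # 819200 bytes, roughly 0.8 MB per token
print(kv_bytes_per_token * 2048 / 1e9)  # about 1.7 GB at the 2048-token context limit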

To address this problem, we introduce PagedAttention, an attention algorithm inspired by the classic idea of virtual memory and paging in operating systems. Unlike the traditional attention algorithms, PagedAttention allows storing continuous keys and values in non-contiguous memory space. Specifically, PagedAttention partitions the KV cache of each sequence into blocks, each block containing the keys and values for a fixed number of tokens. During the attention computation, the PagedAttention kernel identifies and fetches these blocks efficiently.

PagedAttention: the KV cache is partitioned into blocks. Blocks do not need to be contiguous in memory space.

Because the blocks do not need to be contiguous in memory, we can manage the keys and values in a more flexible way as in OS’s virtual memory: one can think of blocks as pages, tokens as bytes, and sequences as processes. The contiguous logical blocks of a sequence are mapped to non-contiguous physical blocks via a block table. The physical blocks are allocated on demand as new tokens are generated.
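
To make the virtual-memory analogy concrete, here is a minimal, illustrative sketch of such a block table; it is not vLLM's actual implementation, and the class name, block size, and methods are invented for illustration:

BLOCK_SIZE = 16  # tokens per KV block (an illustrative value)

class BlockManager:
    """Toy bookkeeping that maps a sequence's logical blocks to physical blocks."""

    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))  # pool of free physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids, in logical order

    def append_token(self, seq_id, new_seq_len):
        """Allocate a new physical block only when the sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        blocks_needed = -(-new_seq_len // BLOCK_SIZE)  # ceiling division
        while len(table) < blocks_needed:
            table.append(self.free_blocks.pop())  # physical blocks need not be contiguous

    def physical_slot(self, seq_id, token_index):
        """Translate a token's logical position into (physical block id, offset in block)."""
        table = self.block_tables[seq_id]
        return table[token_index // BLOCK_SIZE], token_index % BLOCK_SIZE

A sequence's table grows one physical block at a time as tokens are generated, mirroring the on-demand allocation described above.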

Example generation process for a request with PagedAttention.

In PagedAttention, memory waste only happens in the last block of a sequence. In practice, this results in near-optimal memory usage, with under 4% waste. This boost in memory efficiency proves highly beneficial: it allows the system to batch more sequences together, increase GPU utilization, and thereby significantly increase the throughput, as shown in the performance results above.
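
For a sense of scale: with a block size of 16 tokens (an illustrative value), a 500-token sequence occupies 32 blocks (512 token slots) and wastes only the unfilled tail of its last block, here 12 slots, or about 2% of its KV memory, versus the 60% – 80% wasted by contiguous pre-allocation.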

PagedAttention has another key advantage: efficient memory sharing. For example, in parallel sampling, multiple output sequences are generated from the same prompt. In this case, the computation and memory for the prompt can be shared between the output sequences.

Example of parallel sampling.

PagedAttention naturally enables memory sharing through its block table. Similar to how processes share physical pages, different sequences in PagedAttention can share blocks by mapping their logical blocks to the same physical block. To ensure safe sharing, PagedAttention keeps track of the reference counts of the physical blocks and implements a copy-on-write mechanism. ==> easier and more efficient concatenation, courtesy of block-wise storage

Copy-on-write (COW) is a resource-management technique used to efficiently implement a "duplicate" or "copy" operation on modifiable resources (most commonly memory pages, storage sectors, files, and data structures). It is sometimes referred to as implicit sharing or shadowing.

In virtual memory management, copy-on-write is mainly used by operating systems to share the physical memory of computers running multiple processes, notably in the implementation of the fork() system call. Typically, the new process does not modify any memory and immediately executes a new program, replacing the address space entirely; copying all of the old process's memory during the fork only to immediately discard the copy would waste processor time and memory. Copy-on-write can be implemented efficiently using the page table: certain pages of memory are marked read-only, and a count of the number of references to each page is kept. When data is written to such a page, the operating-system kernel intercepts the write attempt and allocates a new physical page initialized with the copy-on-write data (the allocation can be skipped if there is only one reference). The kernel then updates the page table with the new, writable page, decrements the number of references, and performs the write. The new allocation ensures that a change in the memory of one process is not visible in another's.
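
The same bookkeeping carries over to KV blocks. Below is a minimal, illustrative sketch of reference-counted sharing with copy-on-write, extending the toy BlockManager above; it is not vLLM's code, and the method names are invented for illustration:

class CowBlockManager(BlockManager):
    """Toy copy-on-write over shared KV blocks (illustrative only)."""

    def __init__(self, num_physical_blocks):
        super().__init__(num_physical_blocks)
        self.ref_count = {}  # physical block id -> number of sequences referencing it

    def fork(self, parent_id, child_id):
        """Share the parent's blocks with a child sequence, e.g. for parallel sampling."""
        shared = list(self.block_tables[parent_id])
        self.block_tables[child_id] = shared
        for block in shared:
            self.ref_count[block] = self.ref_count.get(block, 1) + 1

    def write_block(self, seq_id, logical_block):
        """Give a sequence a private copy of a shared block before it writes into it."""
        table = self.block_tables[seq_id]
        block = table[logical_block]
        if self.ref_count.get(block, 1) > 1:  # block is shared: copy on write
            new_block = self.free_blocks.pop()
            # A real system would also copy the block's KV data into new_block here.
            self.ref_count[block] -= 1
            self.ref_count[new_block] = 1
            table[logical_block] = new_block

Because only the block a sequence is about to modify gets copied, the prompt's blocks stay shared across all samples, which is where the memory savings for parallel sampling and beam search come from.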

Example generation process for a request that samples multiple outputs.

PagedAttention's memory sharing greatly reduces the memory overhead of complex sampling algorithms, such as parallel sampling and beam search, cutting their memory usage by up to 55%. This can translate into up to a 2.2x improvement in throughput. This makes such sampling methods practical in LLM services.

PagedAttention is the core technology behind vLLM, our LLM inference and serving engine that supports a variety of models with high performance and an easy-to-use interface. For more technical details about vLLM and PagedAttention, check out our GitHub repo and stay tuned for our paper.

The Silent Hero Behind LMSYS Vicuna and Chatbot Arena

This April, LMSYS developed the popular Vicuna chatbot models and made them publicly available. Since then, Vicuna has been served in Chatbot Arena for millions of users. Initially, LMSYS FastChat adopted an HF Transformers based serving backend to serve the chat demo. As the demo became more popular, the peak traffic ramped up several times, making the HF backend a significant bottleneck. The LMSYS and vLLM teams worked together and soon developed the FastChat-vLLM integration to use vLLM as the new backend in order to support the growing demands (up to 5x more traffic). In an early internal micro-benchmark by LMSYS, the vLLM serving backend achieved up to 30x higher throughput than the initial HF backend.

Since mid-April, the most popular models, such as Vicuna, Koala, and LLaMA, have all been successfully served using the FastChat-vLLM integration. With FastChat as the multi-model chat serving frontend and vLLM as the inference backend, LMSYS is able to harness a limited number of university-sponsored GPUs to serve Vicuna to millions of users with high throughput and low latency. LMSYS is expanding the use of vLLM to a wider range of models, including Databricks Dolly, LAION's OpenAssistant, and Stability AI's StableLM. Support for more models is being developed and forthcoming.

Requests served by the FastChat-vLLM integration in the Chatbot Arena from April to May. Indeed, more than half of the requests to Chatbot Arena use vLLM as the inference backend.

This utilization of vLLM has also significantly reduced operational costs. With vLLM, LMSYS was able to cut the number of GPUs used for serving the above traffic by 50%. vLLM has been handling an average of 30K requests daily and a peak of 60K, which is a clear demonstration of vLLM’s robustness.

Get started with vLLM

Install vLLM with the following command (check out our installation guide for more):

$ pip install vllm

vLLM can be used for both offline inference and online serving. To use vLLM for offline inference, you can import vLLM and use the LLM class in your Python scripts:

from vllm import LLM

prompts = ["Hello, my name is", "The capital of France is"]  # Sample prompts.
llm = LLM(model="lmsys/vicuna-7b-v1.3")  # Create an LLM.
outputs = llm.generate(prompts)  # Generate texts from the prompts.
for output in outputs:  # Each result pairs the prompt with its generated text.
    print(output.prompt, output.outputs[0].text)

To use vLLM for online serving, you can start an OpenAI API-compatible server via:

$ python -m vllm.entrypoints.openai.api_server --model lmsys/vicuna-7b-v1.3

You can query the server with the same format as OpenAI API:

$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "lmsys/vicuna-7b-v1.3",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
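
Equivalently, because the endpoint speaks the OpenAI completions API, you can query it from Python. The sketch below assumes the openai client package (v1 or later) is installed; the api_key value is a placeholder that the client library requires even though the local server does not check one by default:

from openai import OpenAI

# Point the OpenAI client at the local vLLM server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="lmsys/vicuna-7b-v1.3",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)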

For more ways to use vLLM, please check out the quickstart guide.
