A Deep Dive into vLLM: LLM Compute Acceleration Series, Exploring the Scheduler Strategy

Latest recommended article published on 2025-05-08 19:00:00

fengbeely


Article link: https://blog.youkuaiyun.com/fengbeely/article/details/140117942


Original article:
Illustrated LLM Compute Acceleration Series: vLLM Source Code Walkthrough 2, Scheduler Strategy (Scheduler)


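Recap and Overview of This Article
1. Entry Function
2. SequenceGroup
2.1 Raw Input
2.2 What SequenceGroup Does
2.3 Structure of SequenceGroup
3. add_request(): Adding a seq_group to the Scheduler's waiting Queue
4. step(): The Scheduler Strategy
4.1 Scheduler Structure
4.2 Overall Scheduling Flow
4.3 _passed_delay: Deciding When to Schedule the waiting Queue
4.4 can_allocate: Can Physical Blocks Be Allocated to a seq_group for Prefill?
4.5 can_append_slot: Can Physical Blocks Be Allocated to a seq_group for Decode?
4.6 allocate and append_slot: Allocating Physical Blocks to a seq_group
4.7 preempt: The Preemption Policy
4.8 Core Scheduler Code
5. Summary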

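Hi everyone, the second installment of the vLLM source code walkthrough is here. In this installment we look at vLLM's scheduler strategy.

Because the vLLM code is complex and its logic is deeply nested, what I took in first when reading the source was a set of fragments. Once the code gets long and the details pile up, it is hard to assemble those fragments into a whole picture. So throughout this series, whichever part we cover, I follow a "big picture (with diagrams) -> details (with source code)" approach: first clarify what vLLM is trying to do here and why, and then go through the implementation of each piece together.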

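[LLM Compute Acceleration Series]

猛猿: Illustrated LLM Compute Acceleration Series: FlashAttention V1, from Hardware to Computation Logic

猛猿: Illustrated LLM Compute Acceleration Series: Flash Attention V2, from Principles to Parallel Computation

猛猿: Illustrated Mixtral 8 * 7b Inference Optimization: Principles and Source Code

猛猿: Illustrated LLM Compute Acceleration Series: PagedAttention, the Core Technique of vLLM

猛猿: Illustrated LLM Compute Acceleration Series: vLLM Source Code Walkthrough 1, Overall Architecture

猛猿: Illustrated LLM Compute Acceleration Series: vLLM Source Code Walkthrough 2, Scheduler Strategy (Scheduler)

[Index of Past Articles]

猛猿: [Must Read] Guide to Past Technical Articles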

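Recap and Overview of This Article

In the previous article on vLLM's overall code architecture, we mentioned that both "offline batched inference (synchronous)" and "online streaming serving (asynchronous)" use the same inference engine core, LLMEngine, whose overall architecture is as follows:

Here:

- In each inference stage, the scheduler (Scheduler) decides which requests may take part in inference and sets up the logical block -> physical block mapping for them.
- In each inference stage, the distributed executor (the Distributed Workers part of the figure; judging from the code, model_executor would be a more fitting name) receives the requests passed on by the scheduler and dispatches them to the workers for inference. The CacheEngine in each Worker does the actual KV Cache management; the model in each Worker loads the model and runs inference, and the PagedAttention implementation and its calls live under model.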

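**Here, one inference stage is defined as follows: prefill counts as one inference stage, and each decode counts as one inference stage. In this article we consistently use step to denote "one inference stage".**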

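- In this article, we go through the scheduler (Scheduler) in full detail.
- In the next article, we will go through the block manager (blockmanager) in full detail and, using parallel sampling, beam search, and prefix caching as examples, walk through the entire left half of the figure above.
- In later articles, we will cover the right half of the figure above (I have not yet worked through that logic, so I do not know how many articles it will take).

Because the block manager and the scheduler are nested layer upon layer in the code, this article also gives explanations, as brief and clear as possible, of the block manager parts it touches on, so that they do not get in the way of understanding the scheduler.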

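1. Entry Function

As mentioned in the source-code architecture article, this series takes "offline batched inference" as the entry point for explaining each part of the engine core LLMEngine in detail, and then builds on that to look at how "online streaming serving" works. So let us first review how offline batched inference is invoked: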

from vllm import LLM, SamplingParams

# ===========================================================================
# batch prompts
# ===========================================================================
prompts = ["Hello, my name is",
           "The president of the United States is",
           "The capital of France is",
           "The future of AI is",]

# ===========================================================================
# Sampling parameters
# ===========================================================================
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# ===========================================================================
# Initialize a vLLM offline batched inference instance and load the specified model
# ===========================================================================
llm = LLM(model="facebook/opt-125m")

# ===========================================================================
# Run inference
# ===========================================================================
outputs = llm.generate(prompts, sampling_params)

# ===========================================================================
# For each prompt, print its inference result
# ===========================================================================
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

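Two points are worth noting:

- llm = LLM(model="facebook/opt-125m"): instantiates a vLLM object for offline batched inference, which in essence instantiates an engine core LLMEngine object. During this step, LLMEngine runs a profiling pass to determine how much GPU memory to reserve for KV Cache blocks (for the profiling procedure, see Section 3.2 of Part 1 of this source-code series; you can read the source alongside it, and this article does not go into that code again).
- The inference entry point is line 24 of the snippet above, outputs = llm.generate(prompts, sampling_params). Now let us go into the LLM class and look at this generate function. The code is as follows: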

# vllm/entrypoints/llm.py
class LLM:
    """An LLM for generating texts from given prompts and sampling parameters.
       ...
    """

    def __init__(
        self,
        model: str,
        tokenizer: Optional[str] = None,
        tokenizer_mode: str = "auto",
        trust_remote_code: bool = False,
        tensor_parallel_size: int = 1,
        dtype: str = "auto",
        quantization: Optional[str] = None,
        revision: Optional[str] = None,
        tokenizer_revision: Optional[str] = None,
        seed: int = 0,
        gpu_memory_utilization: float = 0.9,
        swap_space: int = 4,
        enforce_eager: bool = False,
        max_context_len_to_capture: int = 8192,
        disable_custom_all_reduce: bool = True,
        **kwargs,
    ) -> None:
        ...
        # ==============================================================================
        # Initialize the LLMEngine instance with the configured engine arguments
        # ==============================================================================
        self.llm_engine = LLMEngine.from_engine_args(
            engine_args, usage_context=UsageContext.LLM_CLASS)
        # ==============================================================================
        # Counter used to produce globally unique request_ids.
        # Inside the vLLM engine core, 1 prompt is treated as 1 request and gets a globally unique request_id
        # ==============================================================================
        self.request_counter = Counter()
        
        ...

    def generate(
        self,
        prompts: Optional[Union[str, List[str]]] = None, 
        sampling_params: Optional[SamplingParams] = None,
        prompt_token_ids: Optional[List[List[int]]] = None, 
        use_tqdm: bool = True,
        lora_request: Optional[LoRARequest] = None,
        multi_modal_data: Optional[MultiModalData] = None,
    ) -> List[RequestOutput]:
        """Generates the completions for the input prompts.

        NOTE: This class automatically batches the given prompts, considering
        the memory constraint. For the best performance, put all of your prompts
        into a single list and pass it to this method.

        Args:
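            prompts: either a str or a list[str]
            sampling_params: sampling hyperparameters such as temperature and top_k; if None, vLLM's defaults are used
            prompt_token_ids: token ids of the prompts; if not provided, vLLM calls the tokenizer to convert the prompts
            use_tqdm: whether to show a progress bar
            lora_request: to request a specific LoRA adapter, wrap its path and related info in this request;
                          however, vLLM advises avoiding this where possible, because private LoRA adapters may
                          introduce security issues
            multi_modal_data: multimodal data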

        Returns:
            A list of `RequestOutput` objects containing the generated
            completions in the same order as the input prompts.
        """
        if prompts is None and prompt_token_ids is None:
            raise ValueError("Either prompts or prompt_token_ids must be "
                             "provided.")

        if isinstance(prompts, str):
            # Convert a single prompt to a list.
            prompts = [prompts]
        if (prompts is not None and prompt_token_ids is not None
                and len(prompts) != len(prompt_token_ids)):
            raise ValueError("The lengths of prompts and prompt_token_ids "
                             "must be the same.")
        
        if sampling_params is None:
            # Use default sampling params.
            sampling_params = SamplingParams()

        if multi_modal_data:
            multi_modal_data.data = multi_modal_data.data.to(torch.float16)

        # ============================================================================
        # Add the requests to the engine.
        # In the vLLM engine core, 1 prompt counts as 1 request and needs 1 globally unique request_id
        # ============================================================================
        num_requests = len(prompts) if prompts is not None else len(
            prompt_token_ids)
        for i in range(num_requests):
            prompt = prompts[i] if prompts is not None else None
            token_ids = None if prompt_token_ids is None else prompt_token_ids[
                i]
            # =======================================================================
            # Add each prompt to LLMEngine. _add_request does the following:
            # - turns each prompt into a specific input type (a SequenceGroup instance, covered later)
            # - adds each prompt to the Scheduler's waiting queue to await processing
            # =======================================================================
            self._add_request(
                prompt,
                sampling_params,
                token_ids,
                lora_request=lora_request,
                # Get ith image while maintaining the batch dim.
                multi_modal_data=MultiModalData(
                    type=multi_modal_data.type,
                    data=multi_modal_data.data[i].unsqueeze(0))
                if multi_modal_data else None,
            )
        
        # ============================================================================
        # Once every prompt in this batch has been added, run inference; see _run_engine for details
        # ============================================================================
        return self._run_engine(use_tqdm)

    def _add_request(
        self,
        prompt: Optional[str],
        sampling_params: SamplingParams,
        prompt_token_ids: Optional[List[int]],
        lora_request: Optional[LoRARequest] = None,
        multi_modal_data: Optional[MultiModalData] = None,
    ) -> None:
        # Assign each prompt 1 request_id
        request_id = str(next(self.request_counter))
        self.llm_engine.add_request(request_id,
                                    prompt,
                                    sampling_params,
                                    prompt_token_ids,
                                    lora_request=lora_request,
                                    multi_modal_data=multi_modal_data)

    def _run_engine(self, use_tqdm: bool) -> List[RequestOutput]:
        # Initialize tqdm.
        if use_tqdm:
            num_requests = self.llm_engine.get_num_unfinished_requests()
            pbar = tqdm(total=num_requests,
                        desc="Processed prompts",
                        dynamic_ncols=True)
        
        # ===========================================================================
        # Loop while the scheduler still has unfinished requests (any of its waiting/running/swapped queues is non-empty)
        # ===========================================================================
        outputs: List[RequestOutput] = []
        while self.llm_engine.has_unfinished_requests():
            # =========================================================================
            # Run 1 scheduling step to decide which requests' data takes part in this round of inference
            # =========================================================================
            step_outputs = self.llm_engine.step()
            for output in step_outputs:
                # =====================================================================
                # If a request has finished inference after this step, put its result into outputs
                # =====================================================================
                if output.finished:
                    outputs.append(output)
                    if use_tqdm:
                        pbar.update(1)
        if use_tqdm:
            pbar.close()
        # Sort the outputs by request ID.
        # This is necessary because some requests may be finished earlier than
        # its previous requests.
        outputs = sorted(outputs, key=lambda x: int(x.request_id))
        return outputs

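To summarize, when we call outputs = llm.generate(prompts, sampling_params), it actually does two things:

- _add_request: passes the input data to LLMEngine. Specifically, it:
  - Wraps each prompt in a SequenceGroup object. From the client's point of view, one request may contain multiple prompts; in offline batched inference, for example, you can think of one batch as one request. From LLMEngine's point of view, however, 1 prompt is 1 request, so it preprocesses the input data accordingly. In the later discussion of SequenceGroup we will see why vLLM does this.
  - Adds the SequenceGroup-wrapped data to the scheduler's (Scheduler) waiting queue to await processing. We leave the details for later.
- _run_engine: runs inference. As long as any of the scheduler's waiting/running/swapped queues is non-empty, the batch has not finished inference, so we call LLMEngine's step() function to perform one round of scheduling that decides which data to send for inference.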
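
The two-phase flow above (enqueue everything, then call step() until nothing is left) can be sketched with a toy engine. ToyEngine, ToyRequest, run_batch, tokens_left, and max_running are hypothetical names that exist only for this illustration. The toy ignores KV Cache memory and the prefill/decode distinction; it only mirrors the control flow of generate(), _add_request() and _run_engine() shown above.

```python
from collections import deque
from dataclasses import dataclass
from itertools import count
from typing import Deque, List


@dataclass
class ToyRequest:
    """Hypothetical stand-in for one request (one prompt) inside the engine."""
    request_id: str
    prompt: str
    tokens_left: int  # how many steps this request still needs
    finished: bool = False


class ToyEngine:
    """Hypothetical miniature of the add_request()/step() interface."""

    def __init__(self, max_running: int = 2) -> None:
        self.waiting: Deque[ToyRequest] = deque()
        self.running: List[ToyRequest] = []
        self.max_running = max_running  # assumed capacity limit

    def add_request(self, request_id: str, prompt: str, tokens: int) -> None:
        # New requests always enter the waiting queue first.
        self.waiting.append(ToyRequest(request_id, prompt, tokens))

    def has_unfinished_requests(self) -> bool:
        return bool(self.waiting or self.running)

    def step(self) -> List[ToyRequest]:
        # "Schedule": admit waiting requests while there is room.
        while self.waiting and len(self.running) < self.max_running:
            self.running.append(self.waiting.popleft())
        # "Execute": every running request advances by one token.
        for req in self.running:
            req.tokens_left -= 1
            req.finished = req.tokens_left <= 0
        outputs = list(self.running)
        self.running = [r for r in self.running if not r.finished]
        return outputs


def run_batch(prompts: List[str], tokens_per_prompt: List[int]) -> List[ToyRequest]:
    engine = ToyEngine()
    request_counter = count()  # one id per prompt, in submission order
    for prompt, tokens in zip(prompts, tokens_per_prompt):
        engine.add_request(str(next(request_counter)), prompt, tokens)
    finished: List[ToyRequest] = []
    while engine.has_unfinished_requests():
        for out in engine.step():
            if out.finished:
                finished.append(out)
    # Requests may finish out of order, so restore submission order.
    return sorted(finished, key=lambda r: int(r.request_id))


results = run_batch(["a", "b", "c"], [3, 1, 2])
print([r.request_id for r in results])
```

In this run the second prompt finishes first, which is why the final sort by request id matters: the caller gets results back in the order the prompts were submitted.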

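So to understand how the scheduler works, we only need to start from two LLMEngine functions: add_request() and step().
Before getting into those two functions, though, let us first look at a question about the input data: why wrap every prompt in a SequenceGroup instance, and what does a SequenceGroup look like?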

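2. SequenceGroup

2.1 Raw Input

In a typical inference scenario, we usually pass the model 1 prompt plus its sampling parameters and let it run inference. Your input might then look like this: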

("To be or not to be,",
SamplingParams(temperature=0.8, top_k=5, presence_penalty=0.2)),

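In other scenarios, however, the model's decoding strategy can be more complex, for example:

- Parallel Sampling: you pass the model 1 prompt and want it to produce n different outputs based on that prompt.
- Beam Search: you pass the model 1 prompt; with Beam Search, every inference stage yields the top k outputs, where k is called the beam width.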
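
To make "keep the top k at every stage" concrete, here is a minimal sketch of a single beam search step. It is a toy over hand-written log-probabilities, not vLLM's implementation; beam_search_step, the Beam alias, and the token tables are hypothetical names introduced for illustration.

```python
import math
from typing import Dict, List, Tuple

# A beam is (tokens so far, cumulative log-probability).
Beam = Tuple[List[str], float]


def beam_search_step(beams: List[Beam],
                     next_token_logprobs: List[Dict[str, float]],
                     beam_width: int) -> List[Beam]:
    """Expand every beam with every candidate token, keep the best beam_width."""
    candidates: List[Beam] = []
    for (tokens, score), logprobs in zip(beams, next_token_logprobs):
        for token, logprob in logprobs.items():
            candidates.append((tokens + [token], score + logprob))
    # Highest cumulative log-probability first.
    candidates.sort(key=lambda beam: beam[1], reverse=True)
    return candidates[:beam_width]


# Stage 1: a single beam (the prompt) with three candidate next tokens.
step1 = beam_search_step(
    [(["to"], 0.0)],
    [{"be": math.log(0.6), "do": math.log(0.3), "go": math.log(0.1)}],
    beam_width=2,
)
# Stage 2: each surviving beam gets its own candidate table.
step2 = beam_search_step(
    step1,
    [{"or": math.log(0.7), "a": math.log(0.3)},
     {"it": math.log(0.9), "so": math.log(0.1)}],
    beam_width=2,
)
print([tokens for tokens, _ in step2])
```

Note that the number of live outputs stays at the beam width after every stage, while parallel sampling simply keeps n independent outputs; both are cases of one prompt fanning out into several sequences.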

In these cases, the input you pass to the model might look like this:

# Parallel Sampling
("What is the meaning of life?",
SamplingParams(n=2, temperature=0.8, top_p=0.95, frequency_penalty=0.1))

# Beam Search (best_of = beam width)
("It is only with the heart that one can see rightly",
SamplingParams(n=3, best_of=3, use_beam_search=True, temperature=0.0)),

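[Note: SamplingParams follows the OpenAI API conventions; for an explanation of its parameters, see the official OpenAI documentation.]

In short, "1 prompt -> multiple outputs" situations can arise. Can we design a way to manage all the outputs of one prompt together, so that vLLM can run inference more conveniently?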

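2.2 What SequenceGroup Does

A "1 prompt -> multiple outputs" structure forms one SequenceGroup instance.
Each "prompt -> output" pair forms one sequence (seq, a Sequence instance). Each seq carries a status attribute, which takes one of the following values:
- WAITING: in the waiting queue. No seq in the waiting queue has gone through prefill yet.
- RUNNING: in the running queue, i.e. inference has started.
- SWAPPED: in the swapped queue. GPU resources were insufficient, so the seq_group was preempted and its inference paused; its KV blocks were swapped out to the CPU and will be swapped back in to continue computation once GPU resources are sufficient.
- Several finish-related statuses, which mean that inference on the seq has ended:
  - FINISHED_STOPPED: finished normally, for example because the seq hit the <eos> token.
  - FINISHED_LENGTH_CAPPED: inference ended because the seq reached the maximum length limit.
  - FINISHED_ABORTED: inference was terminated because of an abnormal condition. For example, if the client disconnects, the server aborts inference for the affected seqs.
  - FINISHED_IGNORED: inference was terminated because the prompt was too long. In essence this is also a length limit.
vLLM makes one important assumption: all seqs in a seq_group share 1 prompt.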
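
As a rough mental model of the structure just described, the sketch below encodes the statuses as an enum and groups sequences under one shared prompt. The status names come from the list above; ToySequence, ToySequenceGroup, and their fields and methods are hypothetical simplifications written for this illustration, not vLLM's actual class definitions.

```python
import enum
from dataclasses import dataclass, field
from typing import List, Optional


class SequenceStatus(enum.Enum):
    WAITING = enum.auto()
    RUNNING = enum.auto()
    SWAPPED = enum.auto()
    FINISHED_STOPPED = enum.auto()
    FINISHED_LENGTH_CAPPED = enum.auto()
    FINISHED_ABORTED = enum.auto()
    FINISHED_IGNORED = enum.auto()

    @property
    def is_finished(self) -> bool:
        # All four terminal statuses share the FINISHED_ prefix.
        return self.name.startswith("FINISHED_")


@dataclass
class ToySequence:
    """Hypothetical: one "prompt -> output" pair."""
    seq_id: int
    prompt_token_ids: List[int]
    output_token_ids: List[int] = field(default_factory=list)
    status: SequenceStatus = SequenceStatus.WAITING


@dataclass
class ToySequenceGroup:
    """Hypothetical: all seqs that belong to one request (one prompt)."""
    request_id: str
    seqs: List[ToySequence]

    def __post_init__(self) -> None:
        # The key assumption from the text: every seq shares one prompt.
        prompts = {tuple(seq.prompt_token_ids) for seq in self.seqs}
        if len(prompts) != 1:
            raise ValueError("all seqs in a group must share the same prompt")

    def get_seqs(self, status: Optional[SequenceStatus] = None) -> List[ToySequence]:
        if status is None:
            return list(self.seqs)
        return [seq for seq in self.seqs if seq.status == status]

    def is_finished(self) -> bool:
        # The group is done only when every one of its seqs is done.
        return all(seq.status.is_finished for seq in self.seqs)


prompt = [101, 2000, 2022]  # made-up token ids
group = ToySequenceGroup(
    request_id="0",
    seqs=[ToySequence(seq_id=i, prompt_token_ids=prompt) for i in range(2)],
)
group.seqs[0].status = SequenceStatus.FINISHED_STOPPED
print(group.is_finished())  # one seq is still WAITING at this point
group.seqs[1].status = SequenceStatus.FINISHED_LENGTH_CAPPED
print(group.is_finished())
```

Grouping by request in this way lets a scheduler make decisions for a whole group at once while still tracking each output's status separately.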

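Let us go through a concrete example to get a better feel for what SequenceGroup does:

- Before inference starts, the seq_group contains only 1 seq, the prompt itself, with status waiting.
- In the first inference stage, the scheduler selects this seq_group. Because its sampling parameters specify n = 4, it produces 4 seqs after prefill, all with status running.
- After several inference stages, GPU resources run short and this seq_group is unluckily preempted by the scheduler (preemption); its KV blocks are swapped out to the CPU, and all of its seqs change to status swapped. Note that when a seq_group is preempted, it is handled in one of two ways:
  - Swap: if the seq_group has more than 1 seq, the swap strategy is used: the KV blocks of [all] seqs in the seq_group are offloaded from the GPU to the CPU. (With many seqs, simply discarding the computed KV blocks would be a waste.)
  - Recomputation: if the seq_group has exactly 1 seq, the recomputation strategy is used: all physical blocks belonging to the seq_group are freed, and it is put back into the waiting queue. The next time it is selected for inference, it starts over from the prefill stage…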
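
The lifecycle above, including the two preemption branches, can be traced with a small state sketch. ToyGroup, Status, and PreemptionMode here are hypothetical stand-ins written for this illustration, and the rule in preempt() simply encodes the description above (more than 1 seq: swap; exactly 1 seq: recompute). It does not model KV blocks or reproduce vLLM's actual code.

```python
import enum
from dataclasses import dataclass, field
from typing import List


class Status(enum.Enum):
    WAITING = "waiting"
    RUNNING = "running"
    SWAPPED = "swapped"


class PreemptionMode(enum.Enum):
    SWAP = "swap"
    RECOMPUTE = "recompute"


@dataclass
class ToyGroup:
    """Hypothetical seq_group that tracks only the status of each seq."""
    prompt: str
    n: int  # number of outputs requested in the sampling parameters
    statuses: List[Status] = field(default_factory=lambda: [Status.WAITING])

    def prefill(self) -> None:
        # After prefill, the single prompt seq fans out into n running seqs.
        self.statuses = [Status.RUNNING] * self.n

    def preempt(self) -> PreemptionMode:
        if len(self.statuses) > 1:
            # Several seqs: keep the computed KV blocks by moving them to CPU.
            self.statuses = [Status.SWAPPED] * len(self.statuses)
            return PreemptionMode.SWAP
        # One seq: free its blocks and send it back to the waiting queue,
        # so prefill is redone the next time it is scheduled.
        self.statuses = [Status.WAITING]
        return PreemptionMode.RECOMPUTE


parallel = ToyGroup(prompt="What is the meaning of life?", n=4)
parallel.prefill()
print(parallel.preempt(), [s.value for s in parallel.statuses])

single = ToyGroup(prompt="To be or not to be,", n=1)
single.prefill()
print(single.preempt(), [s.value for s in single.statuses])
```

The trade-off the sketch captures is cost: swapping preserves work already done for many seqs, while recomputing a single seq only repeats one prefill.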
