Ref
https://docs.vllm.ai/en/stable/design/v1/torch_compile.html
https://pytorch.org/docs/stable/torch.compiler.html
For graph capture, see also:
vLLM’s torch.compile integration — vLLM
Why doesn't vLLM support CUDA graphs in the prefill phase?
Graph capture flow
support_torch_compile is defined in vllm/compilation/decorators.py, and its core logic lives in _support_torch_compile: "A decorator to add support for compiling the forward method of a class."
Qwen2Model is decorated with support_torch_compile, so during inference Qwen2Model either performs graph capture or runs inference with the graph it has already captured.
@support_torch_compile(
    dynamic_arg_dims={
        "input_ids": 0,
        # positions is of shape (3, seq_len) if mrope is enabled for qwen2-vl,
        # otherwise (seq_len, ).
        "positions": -1,
        "intermediate_tensors": 0,
        "inputs_embeds": 0,
    })
class Qwen2Model(nn.Module):
    ...
Qwen2ForCausalLM.forward calls hidden_states = self.model(input_ids, positions, kv_caches...), which triggers the graph compilation in vllm/compilation/decorators.py:
output = self.compiled_callable(*args, **kwargs)
support_torch_compile mainly does two things: (1) set up and validate dynamic_arg_dims, and (2) call _support_torch_compile.
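Since the decorator takes configuration arguments, it follows the usual decorator-factory pattern: the outer call receives the options, the inner function receives the class. A stripped-down sketch of that shape (not vLLM's actual implementation; names and details are illustrative only):

from typing import Dict, Optional, Type, Union

def support_torch_compile_sketch(
        cls: Optional[Type] = None,
        *,
        dynamic_arg_dims: Optional[Dict[str, Union[int, list]]] = None):
    """Toy decorator factory mirroring the shape of support_torch_compile."""
    def decorator(target_cls: Type) -> Type:
        # the real decorator validates dynamic_arg_dims against the signature
        # of target_cls.forward, then hands off to _support_torch_compile
        target_cls._dynamic_arg_dims = dynamic_arg_dims
        return target_cls

    if cls is None:
        # used as @support_torch_compile_sketch(dynamic_arg_dims={...})
        return decorator
    # used as a bare @support_torch_compile_sketch
    return decorator(cls)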
_support_torch_compile
What follows only sketches the overall flow; the really deep details would require digging into torch.compile's internals.
The main flow of _support_torch_compile is as follows:
1. Add TorchCompileWrapperWithCustomDispatcher as an extra base class of the decorated class (e.g. Qwen2Model):
cls.__bases__ = cls.__bases__ + (TorchCompileWrapperWithCustomDispatcher, )
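Appending a base class after the class has been defined is plain Python. A minimal, self-contained demonstration of the same trick (class names here are made up):

class Base:
    pass

class Mixin:
    def extra(self):
        return "added at runtime"

# note: appending to a class that inherits only from object can fail with a
# TypeError in CPython; here Model already has a non-object base (like nn.Module in vLLM)
class Model(Base):
    pass

# append a new base class after Model was defined, mirroring what the decorator does
Model.__bases__ = Model.__bases__ + (Mixin,)

print(Model().extra())  # -> "added at runtime"
print(Model.__mro__)    # Model, Base, Mixin, object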
2. Define a new __init__ and swap it in:
old_init = cls.__init__

def __init__(self, *, vllm_config: VllmConfig, prefix: str = '', **kwargs):
    old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
    self.vllm_config = vllm_config
    # for CompilationLevel.DYNAMO_AS_IS , the upper level model runner
    # will handle the compilation, so we don't need to do anything here.
    self.do_not_compile = \
        vllm_config.compilation_config.level in [
            CompilationLevel.NO_COMPILATION, CompilationLevel.DYNAMO_AS_IS
        ] or not supports_dynamo()
    if self.do_not_compile:
        return
    compilation_counter.num_models_seen += 1
    TorchCompileWrapperWithCustomDispatcher.__init__(
        self, compilation_level=vllm_config.compilation_config.level)

cls.__init__ = __init__
This calls TorchCompileWrapperWithCustomDispatcher.__init__, which runs torch.compile and assigns the result to compiled_callable:
class TorchCompileWrapperWithCustomDispatcher:
    """
    A wrapper class for torch.compile, with a custom dispatch logic.
    Subclasses should:
    1. Implement the forward method
    2. Implement the dispatch logic in the __call__ method
    It can use `self.compiled_codes` to access the compiled bytecode,
    and `with self.dispatch_to_code(index):` to dispatch to
    the compiled code.
    3. Implement the `__init__` method to determine how to call
    `torch.compile` over the forward method.
    """

    def __init__(self,
                 compiled_callable: Optional[Callable] = None,
                 compilation_level: int = 0):
        vllm_config = get_current_vllm_config()
        self.vllm_config = vllm_config
        if compiled_callable is None:
            # default compilation settings
            # compiling the forward method
            backend = vllm_config.compilation_config.init_backend(vllm_config)
            compiled_callable = torch.compile(
                self.forward,
                fullgraph=envs.VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE,
                backend=backend)
        self.compiled_callable = compiled_callable
        self.original_code_object = self.__class__.forward.__code__
        self.compiled_codes: List[CodeType] = []
        torch._dynamo.convert_frame.register_bytecode_hook(self.bytecode_hook)
The torch.compile backend here is vllm.compilation.backends.VllmBackend.
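To see what a "backend" is in this context, here is a minimal custom torch.compile backend (the toy inspect_backend below is mine, not vLLM's VllmBackend): a backend receives the FX graph that Dynamo captured and returns the callable that will actually run it.

import torch

def inspect_backend(gm: torch.fx.GraphModule, example_inputs):
    # print what Dynamo captured, then run the graph unchanged (eager execution)
    print(f"captured {len(list(gm.graph.nodes))} FX nodes")
    return gm.forward

@torch.compile(backend=inspect_backend)
def f(x):
    return torch.relu(x) + 1

f(torch.randn(4))  # the first call triggers Dynamo capture and invokes the backend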
3. Define a new __call__ and swap it in.
3.1. The first thing __call__ does is mark the dynamic-shape dimensions via torch._dynamo.mark_dynamic(arg, dims) (a standalone sketch of this call follows the summary at the end of this section).
3.2. If len(self.compiled_codes) < 1 or not self.use_custom_dispatcher, it calls the torch.compile-wrapped model for inference, which is where graph capture and compilation actually happen. In other words, the first call into the compiled model performs the actual capture and compilation.
with patch.object(InliningInstructionTranslator, 'inline_call',
                  patched_inline_call):
    output = self.compiled_callable(*args, **kwargs)
return output
3.3. If len(self.compiled_codes) >= 1 (in practice, for qwen2 the length I observed never exceeds 1):
# usually, capturing the model once is enough, and then we can
# dispatch to the compiled code directly, without going through
# the Dynamo guard mechanism.
with self.dispatch_to_code(0):
    model_output = self.forward(*args, **kwargs)
    return model_output
@contextmanager
def dispatch_to_code(self, index: int):
    """Context manager to dispatch to the compiled code.

    Why does this work? Because Dynamo guarantees that the compiled
    bytecode has exactly the same arguments, cell variables, and free
    variables as the original code. Therefore we can directly switch
    the code object in the function and call it.

    See https://dev-discuss.pytorch.org/t/what-is-the-relationship-requirement-among-original-bytecode-transformed-bytecode-and-bytecode-returned-by-hooks-in-dynamo/1693/7 for more details.
    """  # noqa
    self.__class__.forward.__code__ = self.compiled_codes[index]
    yield
    self.__class__.forward.__code__ = self.original_code_object
Although what gets called here is literally the original forward function, dispatch_to_code swaps in the compiled code object, so it is still the compiled graph that actually runs.
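The code-object swap that dispatch_to_code relies on can be demonstrated in isolation (a toy example, nothing to do with the actual compiled bytecode):

def original(x):
    return x + 1

def replacement(x):
    return x * 10

saved = original.__code__
# legal because both functions have the same arguments and no free variables
original.__code__ = replacement.__code__
print(original(3))  # 30 -- the name "original" now runs the replacement bytecode
original.__code__ = saved
print(original(3))  # 4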
In short: torch.compile wraps the model, the first inference call performs the actual compilation, the model is then warmed up with some predefined inputs, and finally real inference runs.
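The torch._dynamo.mark_dynamic call from step 3.1 can also be illustrated on its own: marking a dimension as dynamic tells Dynamo not to specialize on that size, so later calls that differ only in that dimension typically reuse the same compiled graph instead of recompiling (standalone sketch, independent of vLLM):

import torch

@torch.compile
def double(x):
    return x * 2

x = torch.randn(8, 16)
torch._dynamo.mark_dynamic(x, 0)  # treat dim 0 (think: seq_len) as dynamic
double(x)                         # compiles once with a symbolic size for dim 0
double(torch.randn(32, 16))       # different dim-0 size, typically no recompilation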
GPUModelRunner capture_model
After vLLM loads the model, its init path invokes the graph-capture routine, which ultimately calls GPUModelRunner.capture_model:
def capture_model(self) -> None:
    start_time = time.perf_counter()
    start_free_gpu_memory = torch.cuda.mem_get_info()[0]

    # Trigger CUDA graph capture for specific shapes.
    # Capture the large shapes first so that the smaller shapes
    # can reuse the memory pool allocated for the large shapes.
    with graph_capture(device=self.device):
        for num_tokens in reversed(self.cudagraph_batch_sizes):
            for _ in range(self.vllm_config.compilation_config.cudagraph_num_of_warmups):
                self._dummy_run(num_tokens)
            self._dummy_run(num_tokens)

    end_time = time.perf_counter()
    end_free_gpu_memory = torch.cuda.mem_get_info()[0]
    elapsed_time = end_time - start_time
    cuda_graph_size = start_free_gpu_memory - end_free_gpu_memory
    # This usually takes 5~20 seconds.
    logger.info("Graph capturing finished in %.0f secs, took %.2f GiB",
                elapsed_time, cuda_graph_size / (1 << 30))
In practice cudagraph_num_of_warmups = 1. What is this warmup for during graph capture?
The num_tokens values of the captured graphs come from cudagraph_batch_sizes, which is set in vllm.config.VllmConfig._set_cudagraph_sizes:
batch_size_capture_list = []
if self.model_config is not None and not self.model_config.enforce_eager:
    batch_size_capture_list = [1, 2, 4] + [i for i in range(8, 513, 8)]
That is, the torch.compile-wrapped model is run once for each of a set of predefined input token counts, which guarantees good inference performance at those sizes.
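At run time a batch whose token count is not in this list has to be padded up to the nearest captured size before it can hit a CUDA graph; a hedged sketch of that lookup (the helper name is mine, not vLLM's):

import bisect

cudagraph_batch_sizes = sorted([1, 2, 4] + list(range(8, 513, 8)))

def padded_num_tokens(num_tokens: int) -> int:
    """Round num_tokens up to the smallest captured size; fall back to eager if larger than all of them."""
    i = bisect.bisect_left(cudagraph_batch_sizes, num_tokens)
    if i == len(cudagraph_batch_sizes):
        return num_tokens  # bigger than any captured graph -> run without a CUDA graph
    return cudagraph_batch_sizes[i]

print(padded_num_tokens(3))    # 4
print(padded_num_tokens(100))  # 104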
Piecewise CUDA graphs
How does this show up in the code?
The relevant logic is in vllm.compilation.backends.VllmBackend.
class VllmBackend:
    """The compilation backend for `torch.compile` with VLLM.

    It is used for compilation level of `CompilationLevel.PIECEWISE`,
    where we customize the compilation.

    The major work of this backend is to split the graph into
    piecewise graphs, and pass them to the piecewise backend.

    This backend also adds the PostGradPassManager to Inductor config,
    which handles the post-grad passes.
    """