vLLM V1 part 5 - graph capture

References

https://docs.vllm.ai/en/stable/design/v1/torch_compile.html

https://pytorch.org/docs/stable/torch.compiler.html

For additional background on graph capture, see:

vLLM's torch.compile integration — vLLM

Why doesn't vLLM support CUDA graphs in the prefill stage? (vllm 为什么没在 prefill 阶段支持 cuda graph?)

A brief look at CUDA graphs in LLM inference (浅谈cuda graph在llm推理中的应用)

Understanding CUDA graphs in one article (一文读懂cudagraph)

Graph capture flow

support_torch_compile is defined in vllm/compilation/decorators.py, with the core logic in _support_torch_compile: "A decorator to add support for compiling the forward method of a class."

Qwen2Model is decorated with support_torch_compile, which makes Qwen2Model capture the graph during inference, or run with the previously captured graph.

@support_torch_compile(
    dynamic_arg_dims={
        "input_ids": 0,
        # positions is of shape (3, seq_len) if mrope is enabled for qwen2-vl,
        # otherwise (seq_len, ).
        "positions": -1,
        "intermediate_tensors": 0,
        "inputs_embeds": 0,
    })
class Qwen2Model(nn.Module):
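
The dimensions listed in dynamic_arg_dims are later marked as dynamic via torch._dynamo.mark_dynamic (see step 3.1 below). Here is a minimal standalone sketch of what that marking means, outside of vLLM; toy_forward is a made-up function, not vLLM code:

import torch

# Toy sketch (not vLLM code): marking a dimension as dynamic tells Dynamo to
# compile one graph with a symbolic size instead of re-specializing for every
# new sequence length.
def toy_forward(input_ids: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
    return input_ids.float() * positions.float()

compiled = torch.compile(toy_forward, backend="eager")

input_ids = torch.arange(16)
positions = torch.arange(16)
# Dim 0 is the token dimension, analogous to dynamic_arg_dims={"input_ids": 0, ...};
# -1 means "the last dimension", e.g. positions of shape (3, seq_len) for mrope.
torch._dynamo.mark_dynamic(input_ids, 0)
torch._dynamo.mark_dynamic(positions, 0)

out = compiled(input_ids, positions)                # first call: trace + compile
out = compiled(torch.arange(32), torch.arange(32))  # new length, should not recompile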

Inside Qwen2ForCausalLM.forward, the call hidden_states = self.model(input_ids, positions, kv_caches...) triggers the graph compilation path in vllm/compilation/decorators.py:

output = self.compiled_callable(*args, **kwargs)

support_torch_compile mainly does two things: (1) set up and validate dynamic_arg_dims, and (2) call _support_torch_compile.

_support_torch_compile

This section only sketches the overall flow; for the deeper details you would have to dig into the internals of torch.compile.

The main steps of _support_torch_compile are as follows:

1. Add TorchCompileWrapperWithCustomDispatcher as a base class of the decorated class (e.g. Qwen2Model):

cls.__bases__ = cls.__bases__ + (TorchCompileWrapperWithCustomDispatcher, )

2. Define a new __init__ and swap it in:

    old_init = cls.__init__

    def __init__(self, *, vllm_config: VllmConfig, prefix: str = '', **kwargs):
        old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
        self.vllm_config = vllm_config
        # for CompilationLevel.DYNAMO_AS_IS , the upper level model runner
        # will handle the compilation, so we don't need to do anything here.
        self.do_not_compile = \
            vllm_config.compilation_config.level in [
            CompilationLevel.NO_COMPILATION, CompilationLevel.DYNAMO_AS_IS
        ] or not supports_dynamo()
        if self.do_not_compile:
            return
        compilation_counter.num_models_seen += 1
        TorchCompileWrapperWithCustomDispatcher.__init__(
            self, compilation_level=vllm_config.compilation_config.level)

    cls.__init__ = __init__

The new __init__ calls TorchCompileWrapperWithCustomDispatcher.__init__, which is where torch.compile runs and the result is stored in compiled_callable:

class TorchCompileWrapperWithCustomDispatcher:
    """
    A wrapper class for torch.compile, with a custom dispatch logic.
    Subclasses should:
    1. Implement the forward method
    2. Implement the dispatch logic in the __call__ method
        It can use `self.compiled_codes` to access the compiled bytecode,
        and `with self.dispatch_to_code(index):` to dispatch to
        the compiled code.
    3. Implement the `__init__` method to determine how to call
        `torch.compile` over the forward method.
    """

    def __init__(self, compiled_callable: Optional[Callable] = None, compilation_level: int = 0):
        vllm_config = get_current_vllm_config()
        self.vllm_config = vllm_config
        if compiled_callable is None:
            # default compilation settings
            # compiling the forward method
            backend = vllm_config.compilation_config.init_backend(vllm_config)

            compiled_callable = torch.compile(
                self.forward,
                fullgraph=envs.VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE,
                backend=backend)

        self.compiled_callable = compiled_callable
        self.original_code_object = self.__class__.forward.__code__
        self.compiled_codes: List[CodeType] = []
        torch._dynamo.convert_frame.register_bytecode_hook(self.bytecode_hook)

Here the torch.compile backend is vllm.compilation.backends.VllmBackend.
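
For context, a torch.compile backend is simply a callable that receives the traced FX graph plus example inputs and returns a callable to run in its place; VllmBackend implements this contract. A minimal hand-rolled backend sketch (my_backend and f are made-up names, not vLLM code):

import torch
from typing import Callable, List

# Toy sketch of the torch.compile backend contract that VllmBackend implements:
# take an FX GraphModule plus example inputs, return a callable that replaces it.
def my_backend(gm: torch.fx.GraphModule,
               example_inputs: List[torch.Tensor]) -> Callable:
    print(f"captured a graph with {len(list(gm.graph.nodes))} nodes")
    # A real backend (e.g. VllmBackend) would split/optimize the graph here;
    # returning gm.forward simply runs the captured graph unchanged.
    return gm.forward

@torch.compile(backend=my_backend)
def f(x: torch.Tensor) -> torch.Tensor:
    return torch.relu(x) + 1

f(torch.randn(4))   # triggers capture and prints the node count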

3. Define a new __call__ and swap it in.

3.1. The first thing this __call__ does is mark the dynamic-shape dimensions: torch._dynamo.mark_dynamic(arg, dims).

3.2. If len(self.compiled_codes) < 1 or not self.use_custom_dispatcher, it runs inference through the torch.compile-wrapped model, which is where the actual graph capture and compilation happen. In other words, the first call to the compiled model performs the real capture and compilation:

with patch.object(InliningInstructionTranslator, 'inline_call',
                  patched_inline_call):
    output = self.compiled_callable(*args, **kwargs)
return output

3.3. If len(self.compiled_codes) >= 1 (in practice, for Qwen2 I only ever see a length of 1), dispatch directly to the compiled bytecode:

    # usually, capturing the model once is enough, and then we can
    # dispatch to the compiled code directly, without going through
    # the Dynamo guard mechanism.
    with self.dispatch_to_code(0):
        model_output = self.forward(*args, **kwargs)
        return model_output

    @contextmanager
    def dispatch_to_code(self, index: int):
        """Context manager to dispatch to the compiled code.
        Why does this work? Because Dynamo guarantees that the compiled
        bytecode has exactly the same arguments, cell variables, and free
        variables as the original code. Therefore we can directly switch
        the code object in the function and call it.

        See https://dev-discuss.pytorch.org/t/what-is-the-relationship-requirement-among-original-bytecode-transformed-bytecode-and-bytecode-returned-by-hooks-in-dynamo/1693/7 for more details.
        """ # noqa
        self.__class__.forward.__code__ = self.compiled_codes[index]
        yield
        self.__class__.forward.__code__ = self.original_code_object

Although this path calls the original forward function directly, dispatch_to_code swaps in the compiled code object, so what actually runs is still the compiled graph.
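
The code-object swap can be demonstrated in plain Python, independent of Dynamo (a toy sketch, not vLLM code; it relies on the two functions having identical arguments, cell variables and free variables, which is exactly the property Dynamo guarantees):

from contextlib import contextmanager

# Toy sketch of the __code__-swapping trick behind dispatch_to_code.
def forward(x):
    return x + 1          # "original" implementation

def compiled_forward(x):
    return x * 10         # stands in for Dynamo's transformed bytecode

original_code = forward.__code__

@contextmanager
def dispatch_to(code):
    # Swap the code object in, run the caller's block, then restore it.
    forward.__code__ = code
    try:
        yield
    finally:
        forward.__code__ = original_code

print(forward(3))                                    # 4: original code
with dispatch_to(compiled_forward.__code__):
    print(forward(3))                                # 30: swapped code object
print(forward(3))                                    # 4: restored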

In summary: torch.compile is applied to the forward method, the first inference call performs the actual capture and compilation, the model is then called with a set of predefined dummy inputs for warmup and graph capture, and finally real inference runs.

GPUModelRunner capture_model

After vLLM loads the model, the initialization path invokes the graph-capture routine, which ultimately calls GPUModelRunner.capture_model:

def capture_model(self) -> None:
    start_time = time.perf_counter()
    start_free_gpu_memory = torch.cuda.mem_get_info()[0]

    # Trigger CUDA graph capture for specific shapes.
    # Capture the large shapes first so that the smaller shapes
    # can reuse the memory pool allocated for the large shapes.
    with graph_capture(device=self.device):
        for num_tokens in reversed(self.cudagraph_batch_sizes):
            for _ in range(self.vllm_config.compilation_config.cudagraph_num_of_warmups):
                self._dummy_run(num_tokens)
            self._dummy_run(num_tokens)

    end_time = time.perf_counter()
    end_free_gpu_memory = torch.cuda.mem_get_info()[0]
    elapsed_time = end_time - start_time
    cuda_graph_size = start_free_gpu_memory - end_free_gpu_memory
    # This usually takes 5~20 seconds.
    logger.info("Graph capturing finished in %.0f secs, took %.2f GiB", elapsed_time, cuda_graph_size / (1 << 30))

In practice cudagraph_num_of_warmups = 1; what is this warmup for during graph capture? (A common reason with CUDA graphs generally is that a warmup run outside the capture triggers lazy initialization, autotuning and memory allocations so they are not baked into the captured graph.)

The num_tokens values for which graphs are captured come from cudagraph_batch_sizes, set in vllm.config.VllmConfig._set_cudagraph_sizes:

batch_size_capture_list = []
if self.model_config is not None and not self.model_config.enforce_eager:
    batch_size_capture_list = [1, 2, 4] + [i for i in range(8, 513, 8)]

So the torch.compile-wrapped model is driven with a predefined set of input token counts, which guarantees good inference performance for exactly those sizes.
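
To make concrete what gets captured for each num_tokens bucket, here is a minimal raw CUDA graph capture/replay sketch using PyTorch's public CUDA graph API. This is not the vLLM code path (which drives capture through torch.compile and _dummy_run), but it shows why static buffers and a pre-capture warmup are needed:

import torch

assert torch.cuda.is_available()
model = torch.nn.Linear(64, 64).cuda()

# Static input/output buffers: a CUDA graph replays fixed memory addresses,
# which is why a separate graph is captured per padded num_tokens bucket.
static_in = torch.zeros(8, 64, device="cuda")   # e.g. num_tokens = 8
static_out = None

# Warm up on a side stream before capture so lazy initialization, autotuning
# and allocations do not get baked into the graph.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = model(static_in)

# Replay: copy new data into the static buffer, then relaunch the whole graph.
static_in.copy_(torch.randn(8, 64, device="cuda"))
g.replay()
print(static_out[0, :4])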

Piecewise CUDA graphs

How does the piecewise part show up in the code?

The relevant logic lives in vllm.compilation.backends.VllmBackend:

class VllmBackend:
    """The compilation backend for `torch.compile` with VLLM.
    It is used for compilation level of `CompilationLevel.PIECEWISE`,
    where we customize the compilation.

    The major work of this backend is to split the graph into
    piecewise graphs, and pass them to the piecewise backend.

    This backend also adds the PostGradPassManager to Inductor config,
    which handles the post-grad passes.
    """
