TensorRT-LLM: How to Add a New Model to the PyTorch Backend

Article link: https://blog.youkuaiyun.com/gitblog_01152/article/details/148415941


TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines. Project repository: https://gitcode.com/gh_mirrors/te/TensorRT-LLM

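Introduction

TensorRT-LLM is NVIDIA's high-performance inference framework for large language models (LLMs). This article walks through adding support for a new model to TensorRT-LLM's PyTorch backend, so that developers can extend the set of models the framework can run.

Prerequisites

Before adding a new model, make sure that you:

- have TensorRT-LLM installed correctly
- are familiar with the basics of PyTorch model development
- understand the basic architecture of the target model

Model Configuration

Reusing a HuggingFace Configuration

If the target model is already implemented in the HuggingFace Transformers library, you can reuse its configuration class directly: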

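from transformers import LlamaConfig  # using LLaMA as an example

This approach avoids duplicated work and keeps the configuration parameters consistent with the upstream implementation.

Custom Configuration Class

For models that are not included in the Transformers library, define a custom configuration class: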

from transformers.configuration_utils import PretrainedConfig

class MyConfig(PretrainedConfig):
    def __init__(self, 
                 vocab_size=32000,
                 hidden_size=4096,
                 num_hidden_layers=32,
                 num_attention_heads=32,
                 **kwargs):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads

The configuration class should inherit from PretrainedConfig and define the model's key parameters.
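
Before a configuration reaches the model code, it can help to check that its values are consistent with each other. The sketch below is an illustration only. It assumes the common convention that the per-head dimension equals hidden_size // num_attention_heads, which not every architecture follows. A plain SimpleNamespace stands in for MyConfig so the snippet runs without transformers, and check_attention_shape is a hypothetical helper, not part of TensorRT-LLM or Transformers.

```python
from types import SimpleNamespace


def check_attention_shape(config) -> int:
    """Return the per-head dimension, or raise if the sizes do not divide evenly."""
    if config.hidden_size % config.num_attention_heads != 0:
        raise ValueError(
            f"hidden_size ({config.hidden_size}) must be divisible by "
            f"num_attention_heads ({config.num_attention_heads})"
        )
    return config.hidden_size // config.num_attention_heads


# Stand-in for MyConfig() with the default values used in this article.
default_config = SimpleNamespace(
    vocab_size=32000,
    hidden_size=4096,
    num_hidden_layers=32,
    num_attention_heads=32,
)
head_dim = check_attention_shape(default_config)
print(head_dim)  # 4096 // 32 = 128
```

A check like this could also live in the configuration class's own __init__, so that an inconsistent checkpoint config fails early rather than inside the attention module.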

Model Architecture

Core Components

Attention:

from tensorrt_llm._torch.modules.attention import Attention

class MyAttention(Attention):
    def __init__(self, model_config, layer_idx=None):
        super().__init__(
            hidden_size=model_config.pretrained_config.hidden_size,
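            # Keyword names should match Attention.__init__ in your installed version
            num_attention_heads=model_config.pretrained_config.num_attention_heads,
            # other required arguments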
        )

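Decoder layer:

from torch import nn
from tensorrt_llm._torch.modules.decoder_layer import DecoderLayer

class MyDecoderLayer(DecoderLayer):
    def __init__(self, model_config, layer_idx):
        super().__init__()
        # Sizes live on the wrapped HuggingFace config, as in MyAttention
        self.input_layernorm = nn.LayerNorm(model_config.pretrained_config.hidden_size)
        self.self_attn = MyAttention(model_config, layer_idx)
        self.mlp = nn.Sequential(
            # MLP layers go here
        )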

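Full Model

from torch import nn
from tensorrt_llm._torch.models.modeling_utils import DecoderModel, DecoderModelForCausalLM

class MyModel(DecoderModel):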
    def __init__(self, model_config):
        super().__init__(model_config)
        self.embed_tokens = nn.Embedding(
            model_config.pretrained_config.vocab_size,
            model_config.pretrained_config.hidden_size
        )
        self.layers = nn.ModuleList([
            MyDecoderLayer(model_config, i) 
            for i in range(model_config.pretrained_config.num_hidden_layers)
        ])

Language Model Head

class MyModelForCausalLM(DecoderModelForCausalLM[MyModel, MyConfig]):
    def __init__(self, model_config):
        super().__init__(
            MyModel(model_config),
            config=model_config,
            hidden_size=model_config.pretrained_config.hidden_size,
            vocab_size=model_config.pretrained_config.vocab_size
        )

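Weight Loading

Weight loading is a key step in adapting a model: the tensors in the original checkpoint must be mapped correctly onto the modules of the new model structure: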

import torch

def load_weights(self, weights: dict):
    # Load the token embedding weights (assumes HuggingFace-style keys,
    # with the same "model." prefix as the layer keys below)
    self.model.embed_tokens.weight.data = weights["model.embed_tokens.weight"]

    # Load the weights of each decoder layer
    for layer_idx in range(self.config.num_hidden_layers):
        prefix = f"model.layers.{layer_idx}."
        layer = self.model.layers[layer_idx]

        # Load the attention weights: fuse q, k and v into a single qkv_proj matrix
        q_weight = weights[prefix + "self_attn.q_proj.weight"]
        k_weight = weights[prefix + "self_attn.k_proj.weight"]
        v_weight = weights[prefix + "self_attn.v_proj.weight"]
        layer.self_attn.qkv_proj.weight.data = torch.cat([q_weight, k_weight, v_weight], dim=0)

        # Load the remaining layer weights...
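
The q/k/v fusion in load_weights relies on row order: concatenating along dimension 0 stacks the Q rows first, then K, then V. The toy sketch below makes that ordering visible. It uses nested lists in place of tensors so it runs without torch, and fuse_qkv is a hypothetical helper written for this illustration, not a TensorRT-LLM function.

```python
def fuse_qkv(weights: dict, prefix: str) -> list:
    """Stack the q, k and v projection rows in that order, like torch.cat(..., dim=0)."""
    fused = []
    for name in ("q_proj", "k_proj", "v_proj"):
        key = f"{prefix}self_attn.{name}.weight"
        if key not in weights:
            raise KeyError(f"missing checkpoint tensor: {key}")
        fused.extend(weights[key])
    return fused


# Toy checkpoint: each projection is a 2x2 matrix stored as a list of rows.
toy_weights = {
    "model.layers.0.self_attn.q_proj.weight": [[1, 1], [1, 1]],
    "model.layers.0.self_attn.k_proj.weight": [[2, 2], [2, 2]],
    "model.layers.0.self_attn.v_proj.weight": [[3, 3], [3, 3]],
}
fused = fuse_qkv(toy_weights, "model.layers.0.")
print(len(fused), len(fused[0]))  # 6 rows, 2 columns
```

Raising on a missing key, rather than silently skipping it, tends to surface naming mismatches between the checkpoint and the model early.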

Model Registration

Core Model Registration

To add the model to the framework's core model library:

In tensorrt_llm/_torch/models/__init__.py, add:

from .modeling_mymodel import MyModelForCausalLM

__all__ = [
    ...,
    "MyModelForCausalLM",
]

Register the model with the decorator:

from tensorrt_llm._torch.models.modeling_utils import register_auto_model

@register_auto_model("MyModelForCausalLM")
class MyModelForCausalLM(...):
    ...
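
To see why a decorator is enough to make a class discoverable by name, here is a minimal registry sketch. It is not TensorRT-LLM's implementation of register_auto_model; MODEL_REGISTRY, register_model and ToyModelForCausalLM are hypothetical names that only illustrate the general pattern.

```python
MODEL_REGISTRY = {}  # hypothetical: architecture name -> model class


def register_model(name: str):
    """Return a class decorator that records the class under `name`."""
    def decorator(cls):
        if name in MODEL_REGISTRY:
            raise ValueError(f"model {name!r} is already registered")
        MODEL_REGISTRY[name] = cls
        return cls  # the class itself is returned unchanged
    return decorator


@register_model("ToyModelForCausalLM")
class ToyModelForCausalLM:
    pass


# Registration happened when the class statement ran, so a lookup by name works.
model_cls = MODEL_REGISTRY["ToyModelForCausalLM"]
print(model_cls.__name__)  # ToyModelForCausalLM
```

Because registration runs when the class is defined, importing the module that contains the class is what makes it available for lookup.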

External Model Registration

If you prefer not to modify the framework code, you can register the model externally:

# In your own script
from tensorrt_llm._torch import LLM
import modeling_mymodel  # your model implementation file; importing it runs the registration decorator

llm = LLM(model="MyModelForCausalLM", ...)

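Performance Optimization Tips

Use optimized modules:
- Replace standard Linear layers with tensorrt_llm._torch.modules.linear.Linear
- Use the optimized Embedding implementation
- Use the high-performance RotaryEmbedding and RMSNorm implementations
Attention optimization:
- Make sure attn_metadata is handled correctly
- Implement an efficient KV cache mechanism
Tensor parallelism support:
- Account for tensor parallelism in the model design
- Implement the weight-splitting logic correctly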
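
As a rough illustration of weight splitting for tensor parallelism, the sketch below divides the rows of a weight matrix evenly across ranks. This corresponds to splitting the output dimension of a linear layer whose weight has shape [out_features, in_features]. It is a toy under stated assumptions: nested lists stand in for tensors, shard_rows is a hypothetical helper, and the framework's own optimized modules may already handle sharding for you.

```python
def shard_rows(weight: list, tp_size: int, tp_rank: int) -> list:
    """Return the contiguous block of rows owned by `tp_rank`."""
    if not 0 <= tp_rank < tp_size:
        raise ValueError("tp_rank must be in [0, tp_size)")
    if len(weight) % tp_size != 0:
        raise ValueError("row count must be divisible by tp_size")
    rows_per_rank = len(weight) // tp_size
    start = tp_rank * rows_per_rank
    return weight[start:start + rows_per_rank]


# Toy 4x2 weight split across 2 ranks: each rank keeps 2 rows.
weight = [[0, 1], [2, 3], [4, 5], [6, 7]]
shards = [shard_rows(weight, tp_size=2, tp_rank=r) for r in range(2)]
print(shards[0])  # [[0, 1], [2, 3]]
print(shards[1])  # [[4, 5], [6, 7]]
```

For a fused QKV matrix, each of the Q, K and V blocks would typically need to be sharded separately before fusing, so that every rank ends up with matching heads from all three projections.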