LLMs-from-scratch: A Complete Guide to Converting a GPT Model to Llama 2

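LLMs-from-scratch is a project that guides developers step by step through building their own large language model (LLM) from scratch, with detailed explanations of each step and the principles behind it. Project URL: https://gitcode.com/GitHub_Trending/ll/LLMs-from-scratch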

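Do the differences between large language model (LLM) architectures leave you confused? This article walks step by step through converting a GPT model into a Llama 2 model, covering the core techniques of architecture migration. After reading it, you should be able to: understand the key differences between GPT and Llama 2, convert the five core modules, load and run pretrained Llama weights, and speed up inference.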

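1. Architectural Differences Between GPT and Llama 2

GPT (Generative Pre-trained Transformer) and Llama 2 (Large Language Model Meta AI) are both Transformer architectures, but they differ in several implementation details that affect model quality, cost, and the scenarios each is suited to.

1.1 Core architecture comparison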

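Feature	GPT	Llama 2
Normalization layer	LayerNorm	RMSNorm
Activation function	GELU	SiLU (Swish)
Feedforward network	Standard linear layers	SwiGLU
Positional encoding	Absolute positional embeddings	Rotary position embeddings (RoPE)
Context window	1024 (GPT-2)	4096
Attention mechanism	Multi-head attention	Multi-head attention + RoPE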

1.2 Conversion workflow overview

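The conversion consists of five key steps, each explained in turn below:

Replace LayerNorm with RMSNorm
Replace the GELU activation function with SiLU
Update the feedforward network to use SwiGLU
Implement rotary position embeddings (RoPE)
Update the multi-head attention module to integrate RoPE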

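2. Step-by-Step Conversion

2.1 From LayerNorm to RMSNorm

RMSNorm (Root Mean Square Layer Normalization) is the normalization method used in Llama 2. Compared with standard LayerNorm, it drops the mean-centering step and keeps only the scaling by the root mean square. This makes it cheaper to compute, and in practice it has been reported to give comparable or better training stability.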

The RMSNorm formula: $$y_i = \frac{x_i}{\text{RMS}(x)} \gamma_i, \quad \text{where} \quad \text{RMS}(x) = \sqrt{\epsilon + \frac{1}{n} \sum_{j=1}^{n} x_j^2}$$

The implementation is in ch05/07_gpt_to_llama/converting-gpt-to-llama2.ipynb:

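import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    def __init__(self, emb_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.emb_dim = emb_dim
        # Learnable per-dimension scale (gamma), created directly in float32
        self.weight = nn.Parameter(torch.ones(emb_dim, dtype=torch.float32))

    def forward(self, x):
        # Mean of squares over the embedding dimension
        means = x.pow(2).mean(dim=-1, keepdim=True)
        # Divide by the root mean square; eps guards against division by zero
        x_normed = x * torch.rsqrt(means + self.eps)
        return (x_normed * self.weight).to(dtype=x.dtype)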
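
To make the difference from LayerNorm concrete, here is a small worked example. It is a minimal sketch, not the notebook's code: the helper name rms_norm and the input values are chosen only for illustration.

```python
import torch
import torch.nn.functional as F


def rms_norm(x, weight, eps=1e-5):
    # Hypothetical helper: scale by the root mean square, without mean-centering
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms * weight


x = torch.tensor([[1.0, 2.0, 3.0, 4.0]])
weight = torch.ones(4)

y = rms_norm(x, weight)
layer_norm_y = F.layer_norm(x, (4,))

print("RMSNorm output:          ", y)
print("mean of squares (about 1):", y.pow(2).mean(dim=-1))
print("RMSNorm output mean:     ", y.mean(dim=-1))             # not zero: no centering
print("LayerNorm output mean:   ", layer_norm_y.mean(dim=-1))  # about zero
```

After RMSNorm the mean of the squared outputs is close to 1, but the outputs are not centered around zero, which is exactly the step RMSNorm skips relative to LayerNorm.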

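2.2 Activation function: from GELU to SiLU

Llama 2 uses the SiLU (Sigmoid Linear Unit) activation function, also known as Swish. It is defined as $ \text{silu}(x) = x \cdot \sigma(x) $, where $ \sigma(x) $ is the sigmoid function. Like GELU, it is a smooth activation; it is somewhat cheaper to compute, and any difference in model quality between the two is generally small.

The implementation is in ch05/07_gpt_to_llama/converting-gpt-to-llama2.ipynb: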

class SiLU(nn.Module):
    def __init__(self):
        super(SiLU, self).__init__()

    def forward(self, x):
        return x * torch.sigmoid(x)
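
As a quick sanity check, the hand-written formula can be compared with PyTorch's built-in torch.nn.functional.silu, and with GELU to see how close the two activations are. This is a small sketch on an arbitrary grid of input values.

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-4.0, 4.0, steps=9)

manual_silu = x * torch.sigmoid(x)   # same formula as the SiLU module above
builtin_silu = F.silu(x)             # PyTorch's built-in implementation
gelu = F.gelu(x)

matches_builtin = torch.allclose(manual_silu, builtin_silu, atol=1e-6)
max_gap_to_gelu = (manual_silu - gelu).abs().max().item()

print("matches F.silu:", matches_builtin)
print("largest |SiLU - GELU| on this grid:", max_gap_to_gelu)
```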

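2.3 Feedforward network: from standard linear layers to SwiGLU

Llama 2 uses SwiGLU, a variant of the gated linear unit (GLU), as its feedforward network. It has two parallel linear layers: the output of one passes through SiLU and is multiplied elementwise with the output of the other, and a third linear layer projects the result back to the embedding dimension. This gating structure increases the expressiveness of the feedforward block.

The implementation is in ch05/07_gpt_to_llama/converting-gpt-to-llama2.ipynb: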

class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.fc1 = nn.Linear(cfg["emb_dim"], cfg["hidden_dim"], dtype=cfg["dtype"], bias=False)
        self.fc2 = nn.Linear(cfg["emb_dim"], cfg["hidden_dim"], dtype=cfg["dtype"], bias=False)
        self.fc3 = nn.Linear(cfg["hidden_dim"], cfg["emb_dim"], dtype=cfg["dtype"], bias=False)
        self.silu = SiLU()

    def forward(self, x):
        x_fc1 = self.fc1(x)
        x_fc2 = self.fc2(x)
        x = self.silu(x_fc1) * x_fc2
        return self.fc3(x)
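
The following self-contained sketch runs a SwiGLU block on a dummy input and counts its parameters. The class name SwiGLUFeedForward and the tiny sizes (emb_dim=8, hidden_dim=16) are assumptions made for illustration and are not real Llama 2 settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUFeedForward(nn.Module):
    # Illustrative stand-in for the FeedForward class above, without the cfg dict
    def __init__(self, emb_dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(emb_dim, hidden_dim, bias=False)  # gate branch
        self.fc2 = nn.Linear(emb_dim, hidden_dim, bias=False)  # value branch
        self.fc3 = nn.Linear(hidden_dim, emb_dim, bias=False)  # output projection

    def forward(self, x):
        # SiLU-activated gate multiplies the value branch elementwise
        return self.fc3(F.silu(self.fc1(x)) * self.fc2(x))


torch.manual_seed(123)
ffn = SwiGLUFeedForward(emb_dim=8, hidden_dim=16)
x = torch.randn(2, 3, 8)  # (batch, tokens, emb_dim)
out = ffn(x)
num_params = sum(p.numel() for p in ffn.parameters())

print(out.shape)   # torch.Size([2, 3, 8])
print(num_params)  # 3 * 8 * 16 = 384
```

Because SwiGLU uses three weight matrices instead of two, the hidden dimension is usually chosen smaller than the 4x expansion of a GPT-style feedforward block to keep the parameter count comparable.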

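2.4 Implementing rotary position embeddings (RoPE)

RoPE (Rotary Position Embedding) is one of the key components of Llama 2 (the technique itself was introduced earlier, in the RoFormer paper). It rotates the query and key vectors by an angle that depends on the token position, so the attention scores carry both absolute and relative position information, which helps the model handle long sequences.

The RoPE implementation has two key steps: precomputing the rotation parameters and applying the rotation.

Code for precomputing the rotation parameters: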

def precompute_rope_params(head_dim, theta_base=10_000, context_length=4096):
    assert head_dim % 2 == 0, "Embedding dimension must be even"
    
    # Compute the inverse frequencies
    inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2)[: (head_dim // 2)].float() / head_dim))
    
    # Generate the position indices
    positions = torch.arange(context_length)
    
    # Compute the angles
    angles = positions.unsqueeze(1) * inv_freq.unsqueeze(0)  # Shape: (context_length, head_dim // 2)
    
    # Duplicate the angles to match head_dim
    angles = torch.cat([angles, angles], dim=1)  # Shape: (context_length, head_dim)
    
    # Precompute sine and cosine
    cos = torch.cos(angles)
    sin = torch.sin(angles)
    
    return cos, sin

Code for applying the rotation:

def compute_rope(x, cos, sin):
    # x: (batch_size, num_heads, seq_len, head_dim)
    batch_size, num_heads, seq_len, head_dim = x.shape
    assert head_dim % 2 == 0, "Head dimension must be even"
    
    # Split x into its first half and second half
    x1 = x[..., : head_dim // 2]  # First half
    x2 = x[..., head_dim // 2 :]  # Second half
    
    # Adjust the shapes of sin and cos
    cos = cos[:seq_len, :].unsqueeze(0).unsqueeze(0)  # Shape: (1, 1, seq_len, head_dim)
    sin = sin[:seq_len, :].unsqueeze(0).unsqueeze(0)
    
    # Apply the rotation
    rotated = torch.cat((-x2, x1), dim=-1)
    x_rotated = (x * cos) + (rotated * sin)
    
    return x_rotated.to(dtype=x.dtype)
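
The relative-position property can be checked numerically. The sketch below re-implements the two steps in compact form for 2D inputs (the names rope_tables and apply_rope are hypothetical) and places the same query and key vector at every position. If RoPE encodes relative position, the score between a query at position m and a key at position n should depend only on n - m.

```python
import torch


def rope_tables(head_dim, context_length, theta_base=10_000.0):
    # Same idea as precompute_rope_params above, in compact form
    inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.arange(context_length).float().unsqueeze(1) * inv_freq.unsqueeze(0)
    angles = torch.cat([angles, angles], dim=1)
    return torch.cos(angles), torch.sin(angles)


def apply_rope(x, cos, sin):
    # x: (seq_len, head_dim); row i is rotated by the angles for position i
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    rotated = torch.cat((-x2, x1), dim=-1)
    seq_len = x.shape[-2]
    return x * cos[:seq_len] + rotated * sin[:seq_len]


torch.manual_seed(0)
head_dim, seq_len = 8, 6
cos, sin = rope_tables(head_dim, seq_len)

q = torch.randn(head_dim)
k = torch.randn(head_dim)

# Put the same query/key vector at every position, then rotate
q_rot = apply_rope(q.repeat(seq_len, 1), cos, sin)
k_rot = apply_rope(k.repeat(seq_len, 1), cos, sin)
scores = q_rot @ k_rot.T  # scores[m, n]: query at position m, key at position n

norm_preserved = torch.allclose(q_rot.norm(dim=-1), q.norm().expand(seq_len), atol=1e-4)
same_offset_equal = torch.allclose(scores[0, 2], scores[3, 5], atol=1e-4)

print("vector norms unchanged by rotation:", norm_preserved)
print("score depends only on the offset:  ", same_offset_equal)
```

Both checks pass because each pair of dimensions (i, i + head_dim // 2) undergoes a plain 2D rotation, which preserves lengths and turns the dot product into a function of the angle difference.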

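2.5 Updating the multi-head attention module

Integrating RoPE into the multi-head attention module is a key step of the conversion. Unlike GPT, which adds positional embeddings to the input embeddings, Llama 2 applies the position encoding to the query and key vectors inside attention.

The implementation is in ch05/07_gpt_to_llama/converting-gpt-to-llama2.ipynb: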

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, num_heads, dtype=None):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by n_heads"
        
        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads  # Reduce the projection dim to match the desired output dim
        
        # Set bias=False and dtype=dtype for all linear layers
        self.W_query = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
        self.W_key = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
        self.W_value = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
        self.out_proj = nn.Linear(d_out, d_out, bias=False, dtype=dtype)  # Linear layer to combine head outputs
        self.register_buffer("mask", torch.triu(torch.ones(context_length, context_length), diagonal=1))
        
        # Precompute the RoPE parameters
        cos, sin = precompute_rope_params(head_dim=self.head_dim, context_length=context_length)
        self.register_buffer("cos", cos)
        self.register_buffer("sin", sin)
    
    
    def forward(self, x):
        
        b, num_tokens, d_in = x.shape
        
        keys = self.W_key(x)  # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)
        
        # Implicitly split the matrix by adding a `num_heads` dimension
        # Unroll the last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
        
        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)
        
        # Apply RoPE to keys and queries (values are left unrotated)
        keys = compute_rope(keys, self.cos, self.sin)
        queries = compute_rope(queries, self.cos, self.sin)
        
        # Compute scaled dot-product attention with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head
        
        # Truncate the original mask to the number of tokens and convert to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
        
        # Use the mask to fill the attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)
        
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        
        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)
        
        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.reshape(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec)  # Optional projection
        
        return context_vec
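
The masking step in the forward pass can be seen in isolation on a single head. This is a minimal sketch with arbitrary small sizes (4 tokens, head dimension 8) and random inputs.

```python
import torch

torch.manual_seed(0)
num_tokens, head_dim = 4, 8
queries = torch.randn(num_tokens, head_dim)
keys = torch.randn(num_tokens, head_dim)

attn_scores = queries @ keys.T

# True above the diagonal: position i must not attend to positions j > i
mask = torch.triu(torch.ones(num_tokens, num_tokens), diagonal=1).bool()
attn_scores = attn_scores.masked_fill(mask, -torch.inf)

# softmax maps the -inf entries to exactly 0
attn_weights = torch.softmax(attn_scores / head_dim**0.5, dim=-1)
print(attn_weights)
```

The printed matrix is lower triangular with rows summing to 1, and the first token can only attend to itself.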

2.6 Assembling the TransformerBlock

With the modules above converted, the TransformerBlock needs to be updated to combine the new components:

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dtype=cfg["dtype"],
        )
        self.ffn = FeedForward(cfg)
        self.norm1 = RMSNorm(emb_dim=cfg["emb_dim"], eps=cfg["rms_norm_eps"])
        self.norm2 = RMSNorm(emb_dim=cfg["emb_dim"], eps=cfg["rms_norm_eps"])

    def forward(self, x):
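        # Self-attention sublayer (pre-norm, with residual connection)
        x = x + self.att(self.norm1(x))
        
        # Feedforward sublayer (pre-norm, with residual connection)
        x = x + self.ffn(self.norm2(x))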
        
        return x
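
To get a feel for the size of such a block, the arithmetic sketch below counts parameters using the publicly documented Llama 2 7B hyperparameters. The dictionary name LLAMA2_CONFIG_7B and the helper functions are assumptions made for this example; the key names follow the cfg keys used above.

```python
# Llama 2 7B hyperparameters (taken from the public model configuration)
LLAMA2_CONFIG_7B = {
    "vocab_size": 32_000,
    "context_length": 4096,
    "emb_dim": 4096,
    "n_heads": 32,
    "n_layers": 32,
    "hidden_dim": 11_008,
}


def count_block_params(cfg):
    emb, hidden = cfg["emb_dim"], cfg["hidden_dim"]
    attention = 4 * emb * emb        # W_query, W_key, W_value, out_proj (no biases)
    feed_forward = 3 * emb * hidden  # fc1, fc2, fc3 (no biases)
    norms = 2 * emb                  # two RMSNorm weight vectors
    return attention + feed_forward + norms


def count_model_params(cfg):
    # Token embedding plus a separate (untied) output head
    embeddings = 2 * cfg["vocab_size"] * cfg["emb_dim"]
    final_norm = cfg["emb_dim"]
    return cfg["n_layers"] * count_block_params(cfg) + embeddings + final_norm


block_params = count_block_params(LLAMA2_CONFIG_7B)
total_params = count_model_params(LLAMA2_CONFIG_7B)

print(f"Parameters per block: {block_params:,}")  # 202,383,360
print(f"Total parameters:     {total_params:,}")  # 6,738,415,616
```

RoPE adds no learnable parameters, since the cos and sin tables are buffers, so the total of roughly 6.7 billion comes entirely from the linear layers, embeddings, and norm weights.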

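3. Loading and Running the Model

Once the architecture has been converted, the next step is to load pretrained weights and generate text. Note that the code in this section uses the llms_from_scratch package with Llama 3.2 checkpoints (llama3.2-1B and llama3.2-3B) and the Llama3Model class, not Llama 2 weights.

3.1 Installing the required dependencies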

pip install llms_from_scratch blobfile

3.2 Model and text generation settings

# Specify the model to use
MODEL_FILE = "llama3.2-1B-instruct.pth"
# MODEL_FILE = "llama3.2-1B-base.pth"
# MODEL_FILE = "llama3.2-3B-instruct.pth"
# MODEL_FILE = "llama3.2-3B-base.pth"

# Text generation settings
if "instruct" in MODEL_FILE:
    PROMPT = "What do llamas eat?"
else:
    PROMPT = "Llamas eat"

MAX_NEW_TOKENS = 150
TEMPERATURE = 0.
TOP_K = 1
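
With TEMPERATURE = 0. and TOP_K = 1, generation is effectively greedy: the most likely token is picked at every step. The sketch below shows how such settings are commonly interpreted. The function sample_next_token is a hypothetical helper written for this illustration and is not the package's generate function.

```python
import torch


def sample_next_token(logits, top_k=None, temperature=0.0):
    # Hypothetical helper: pick the next token id from a 1D logits vector
    if top_k is not None:
        # Keep only the top_k largest logits; mask the rest with -inf
        top_values, _ = torch.topk(logits, top_k)
        logits = torch.where(logits < top_values[-1], torch.tensor(float("-inf")), logits)
    if temperature > 0.0:
        # Temperature scaling followed by sampling from the softmax distribution
        probs = torch.softmax(logits / temperature, dim=-1)
        return torch.multinomial(probs, num_samples=1).item()
    # temperature == 0: deterministic argmax (greedy decoding)
    return torch.argmax(logits, dim=-1).item()


logits = torch.tensor([1.0, 3.5, 0.2, 2.9])

greedy = sample_next_token(logits, top_k=1, temperature=0.0)
print("greedy choice:", greedy)  # index 1, the largest logit

torch.manual_seed(123)
sampled = sample_next_token(logits, top_k=2, temperature=1.0)
print("sampled from the top 2:", sampled)  # either index 1 or index 3
```

Raising the temperature or top_k makes the output more varied, at the cost of reproducibility.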

3.3 Downloading and loading the model weights

import os
import urllib.request
import torch
from llms_from_scratch.llama3 import Llama3Model

# Download the model weights
url = f"https://huggingface.co/rasbt/llama-3.2-from-scratch/resolve/main/{MODEL_FILE}"

if not os.path.exists(MODEL_FILE):
    urllib.request.urlretrieve(url, MODEL_FILE)
    print(f"Downloaded to {MODEL_FILE}")

# Load the model configuration
if "1B" in MODEL_FILE:
    from llms_from_scratch.llama3 import LLAMA32_CONFIG_1B as LLAMA32_CONFIG
elif "3B" in MODEL_FILE:
    from llms_from_scratch.llama3 import LLAMA32_CONFIG_3B as LLAMA32_CONFIG
else:
    raise ValueError("Incorrect model file name")

# Initialize the model and load the weights
model = Llama3Model(LLAMA32_CONFIG)
model.load_state_dict(torch.load(MODEL_FILE, weights_only=True, map_location="cpu"))

# Move the model to the appropriate device
device = (
    torch.device("cuda") if torch.cuda.is_available() else
    torch.device("mps") if torch.backends.mps.is_available() else
    torch.device("cpu")
)
model.to(device)

3.4 Initializing the tokenizer

from llms_from_scratch.llama3 import Llama3Tokenizer, ChatFormat, clean_text

TOKENIZER_FILE = "tokenizer.model"

# Download the tokenizer
url = f"https://huggingface.co/rasbt/llama-3.2-from-scratch/resolve/main/{TOKENIZER_FILE}"

if not os.path.exists(TOKENIZER_FILE):
    urllib.request.urlretrieve(url, TOKENIZER_FILE)
    print(f"Downloaded to {TOKENIZER_FILE}")
    
# Initialize the tokenizer
tokenizer = Llama3Tokenizer(TOKENIZER_FILE)

# For instruct models, apply the chat format
if "instruct" in MODEL_FILE:
    tokenizer = ChatFormat(tokenizer)

3.5 Generating text

import time
from llms_from_scratch.ch05 import generate, text_to_token_ids, token_ids_to_text

torch.manual_seed(123)

start = time.time()

# Generate text
token_ids = generate(
    model=model,
    idx=text_to_token_ids(PROMPT, tokenizer).to(device),
    max_new_tokens=MAX_NEW_TOKENS,
    context_size=LLAMA32_CONFIG["context_length"],
    top_k=TOP_K,
    temperature=TEMPERATURE
)

total_time = time.time() - start
print(f"Time: {total_time:.2f} sec")
print(f"{int(len(token_ids[0])/total_time)} tokens/sec")

# Print memory usage (CUDA devices only)
if torch.cuda.is_available():
    max_mem_bytes = torch.cuda.max_memory_allocated()
    max_mem_gb = max_mem_bytes / (1024 ** 3)
    print(f"Max memory allocated: {max_mem_gb:.2f} GB")

# Decode and print the generated text
output_text = token_ids_to_text(token_ids, tokenizer)
if "instruct" in MODEL_FILE:
    output_text = clean_text(output_text)

print("\n\nOutput text:\n\n", output_text)

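4. Performance Optimization Tips

4.1 Speeding up inference with FlashAttention

Using Llama3ModelFast instead of Llama3Model takes advantage of PyTorch's torch.nn.functional.scaled_dot_product_attention function, which can use FlashAttention on Ampere and newer GPUs and noticeably improves inference speed.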

from llms_from_scratch.llama3 import Llama3ModelFast

model = Llama3ModelFast(LLAMA32_CONFIG)
model.load_state_dict(torch.load(MODEL_FILE, weights_only=True, map_location="cpu"))
model.to(device)

Performance comparison on an A100 GPU:

Model	tokens/sec	Memory
Llama3Model	42	2.91 GB
Llama3ModelFast	54	2.91 GB
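
The fused PyTorch function computes the same result as the manual attention code from section 2.5. The sketch below checks this on random tensors with arbitrary small shapes; it assumes PyTorch 2.0 or newer and does not use the Llama3ModelFast class itself.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
b, num_heads, num_tokens, head_dim = 1, 2, 5, 8
q = torch.randn(b, num_heads, num_tokens, head_dim)
k = torch.randn(b, num_heads, num_tokens, head_dim)
v = torch.randn(b, num_heads, num_tokens, head_dim)

# Manual causal attention: mask, scale, softmax, weighted sum
mask = torch.triu(torch.ones(num_tokens, num_tokens), diagonal=1).bool()
scores = (q @ k.transpose(2, 3)).masked_fill(mask, -torch.inf)
manual_out = torch.softmax(scores / head_dim**0.5, dim=-1) @ v

# Fused version; PyTorch selects a backend (FlashAttention where available)
fused_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

max_abs_diff = (manual_out - fused_out).abs().max().item()
print("max abs difference:", max_abs_diff)
```

The difference is at the level of floating-point rounding, so swapping in the fused call changes speed and memory traffic, not the model's output.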

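4.2 Speeding up with PyTorch compilation

PyTorch's torch.compile can speed up model inference considerably. In the A100 benchmark below, throughput for Llama3Model rises about fourfold (from 42 to 170 tokens/sec). Compilation happens on the first calls, so those are slower than later ones.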

model = torch.compile(model)
model.to(device)

Performance comparison after compilation (A100 GPU):

Model	tokens/sec	Memory
Llama3Model	42	2.91 GB
Llama3Model (compiled)	170	3.12 GB
Llama3ModelFast (compiled)	177	3.61 GB

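4.3 Using a KV cache for long text generation

A KV cache (key-value cache) stores the key and value vectors computed for earlier tokens so they are not recomputed at every generation step. This substantially reduces the compute cost of long text generation, at the price of extra memory for the cache.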

from llms_from_scratch.kv_cache.llama3 import Llama3Model
from llms_from_scratch.kv_cache.generate import generate_text_simple

model = Llama3Model(LLAMA32_CONFIG)
model.load_state_dict(torch.load(MODEL_FILE, weights_only=True, map_location="cpu"))
model.to(device)

# Generate text using the KV cache
token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(PROMPT, tokenizer).to(device),
    max_new_tokens=MAX_NEW_TOKENS,
    context_size=LLAMA32_CONFIG["context_length"],
)

Performance comparison on different devices:

Model	Mode	Hardware	tokens/sec	GPU memory
Llama3Model	Regular	Mac Mini M4 CPU	1	-
Llama3Model	KV cache	Mac Mini M4 CPU	68	-
Llama3Model	KV cache + compiled	Mac Mini M4 CPU	86	-
Llama3Model	Regular	Nvidia A100 GPU	42	2.91 GB
Llama3Model	KV cache + compiled	Nvidia A100 GPU	161	3.61 GB
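
The idea behind the KV cache can be illustrated with a single attention head. The sketch below is a toy example with random weights, not the package's implementation; all names and sizes are chosen for illustration. It shows that appending one key/value row per step gives the same output for the newest token as recomputing every key and value from scratch.

```python
import torch

torch.manual_seed(0)
emb_dim = 8
W_q = torch.randn(emb_dim, emb_dim)
W_k = torch.randn(emb_dim, emb_dim)
W_v = torch.randn(emb_dim, emb_dim)
x = torch.randn(5, emb_dim)  # embeddings of 5 tokens


def attend(q, k, v):
    weights = torch.softmax(q @ k.T / emb_dim**0.5, dim=-1)
    return weights @ v


# Without a cache: recompute keys and values for all tokens at the last step
uncached_out = attend(x[-1:] @ W_q, x @ W_k, x @ W_v)

# With a cache: each step only projects the newest token and appends it
k_cache = torch.empty(0, emb_dim)
v_cache = torch.empty(0, emb_dim)
for t in range(x.shape[0]):
    x_t = x[t : t + 1]
    k_cache = torch.cat([k_cache, x_t @ W_k], dim=0)
    v_cache = torch.cat([v_cache, x_t @ W_v], dim=0)
    cached_out = attend(x_t @ W_q, k_cache, v_cache)

max_abs_diff = (cached_out - uncached_out).abs().max().item()
print("cache shape:", tuple(k_cache.shape))
print("max abs difference at the last step:", max_abs_diff)
```

The cache grows by one row per generated token for every layer and head, which is why the technique trades memory for compute.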

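5. Summary and Outlook

Following the steps in this article, we converted a GPT model into a Llama 2 model, from replacing the core modules to assembling the complete block. The key takeaways are:

Understanding the core architectural differences between GPT and Llama 2
Implementing the key techniques: RMSNorm, SiLU, SwiGLU, and RoPE
Loading and running pretrained Llama weights
Several ways to speed up inference

Moving on from Llama 2 to Llama 3 requires only a few additional modifications; see converting-llama2-to-llama3.ipynb for details.

We hope this article helps you better understand and apply different LLM architectures and choose the model that best fits your project. If you have questions or suggestions, feel free to open an issue or PR in the project repository.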

The complete code is available in the ch05/07_gpt_to_llama directory of the project repository, including all conversion steps and examples.

Disclosure: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.