A Minimal From-Scratch Implementation of a Transformer-Based Large Model

🧱 1. Goals and Principles

The minimal Transformer we are going to build:

  • Pure PyTorch only (no dependence on HuggingFace)

  • Architecture: Embedding → MultiHeadAttention → FeedForward → DecoderStack → LinearHead

  • Data: any short piece of text, e.g. a tiny Shakespeare corpus

  • Result: given an input sentence, the model autoregressively generates the continuation

⚙️ This is a "micro-GPT" style structure that can serve as the prototype for any large model.


🧩 2. Core Architecture Walkthrough

The core modules of a Transformer decoder are:

Input tokens → word Embedding →
+ Positional Encoding →
Multi-Head Self-Attention →
FeedForward network →
residual connection + LayerNorm →
output logits (softmax predicts the next token)
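
Before the full implementation, a minimal shape walkthrough may help. The sketch below uses illustrative dimensions (batch 8, sequence length 20, d_model 128, vocabulary 100, matching the defaults in the code that follows) and only traces how tensor shapes change through the pipeline; it is not part of the script itself.

import torch
import torch.nn as nn

B, T, d_model, vocab = 8, 20, 128, 100          # illustrative sizes
tokens = torch.randint(0, vocab, (B, T))        # [B, T] integer token ids
emb = nn.Embedding(vocab, d_model)(tokens)      # [B, T, d_model] after embedding; positional encoding keeps this shape
# ... self-attention and the feed-forward network also preserve [B, T, d_model] ...
logits = nn.Linear(d_model, vocab)(emb)         # [B, T, vocab] logits; softmax over the last dim scores the next token
print(tokens.shape, emb.shape, logits.shape)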

💻 3. Complete Runnable Code (Pure PyTorch)

Save it as mini_transformer.py

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

# ====== 1. Positional Encoding ======
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

# ====== 2. Transformer Decoder Block ======
class TransformerBlock(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead)
        self.ff = nn.Sequential(
            nn.Linear(d_model, dim_feedforward),
            nn.ReLU(),
            nn.Linear(dim_feedforward, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # note: no causal mask here, so attention can also see future positions (see the masking upgrade in section 5)
        attn_out, _ = self.attn(x, x, x)
        x = self.ln1(x + attn_out)
        ff_out = self.ff(x)
        x = self.ln2(x + ff_out)
        return x

# ====== 3. Full Transformer Model ======
class MiniTransformerLM(nn.Module):
    def __init__(self, vocab_size, d_model=128, nhead=4, num_layers=2, dim_ff=256):
        super().__init__()
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = PositionalEncoding(d_model)
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, nhead, dim_ff) for _ in range(num_layers)
        ])
        self.ln_final = nn.LayerNorm(d_model)
        self.fc_out = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        # x: [batch, seq_len] token ids
        x = self.embed(x) * math.sqrt(self.d_model)  # scale embeddings by sqrt(d_model)
        x = self.pos(x)
        x = x.transpose(0, 1)  # to [seq_len, batch, dim], the default layout for nn.MultiheadAttention
        for block in self.blocks:
            x = block(x)
        x = self.ln_final(x)
        x = x.transpose(0, 1)
        return self.fc_out(x)

# ====== 4. Simple Training and Inference ======
if __name__ == "__main__":
    vocab_size = 100  # small vocabulary
    model = MiniTransformerLM(vocab_size)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # 8 random training sequences of length 21; inputs are the first 20 tokens,
    # targets are the same tokens shifted left by one (next-token prediction).
    # Without a causal mask (see section 5) the model can still peek at future positions.
    data = torch.randint(0, vocab_size, (8, 21))
    x, y = data[:, :-1], data[:, 1:]

    for step in range(100):
        out = model(x)
        loss = F.cross_entropy(out.reshape(-1, vocab_size), y.reshape(-1))  # reshape also handles non-contiguous tensors
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % 10 == 0:
            print(f"Step {step}, Loss: {loss.item():.4f}")

    # ====== Generation ======
    @torch.no_grad()
    def generate(model, seed, max_len=30):
        model.eval()
        for _ in range(max_len):
            logits = model(seed)[:, -1, :]                           # logits for the last position
            next_token = torch.argmax(logits, dim=-1).unsqueeze(1)   # greedy decoding
            seed = torch.cat([seed, next_token], dim=1)
        return seed

    seed = torch.randint(0, vocab_size, (1, 5))
    print("Seed:", seed)
    result = generate(model, seed)
    print("Generated:", result)

🧠 4. Running It (works on CPU)

python mini_transformer.py

Sample output:

Step 0, Loss: 4.6347
Step 10, Loss: 3.9201
...
Seed: tensor([[34, 22, 45, 56, 13]])
Generated: tensor([[34, 22, 45, 56, 13, 17, 17, 32, 88, 91,  9, ...]])

The falling loss and the generated continuation show the model is learning to "predict the next token" from the preceding context.
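
The generate function in the script decodes greedily with argmax. A common variant is to sample from the softmax distribution instead; the sketch below adds a hypothetical temperature parameter (generate_sampled is an illustrative name, not part of the original script):

import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_sampled(model, seed, max_len=30, temperature=1.0):
    """Autoregressive generation that samples the next token instead of taking the argmax."""
    model.eval()
    for _ in range(max_len):
        logits = model(seed)[:, -1, :] / temperature            # logits for the last position, rescaled
        probs = F.softmax(logits, dim=-1)                        # [batch, vocab_size] distribution
        next_token = torch.multinomial(probs, num_samples=1)     # sample one token per sequence
        seed = torch.cat([seed, next_token], dim=1)
    return seed

Higher temperatures flatten the distribution and give more varied output; as temperature approaches 0 this behaves like the greedy generate above.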


⚡ 5. Further Upgrades (directions to keep practicing)

Goal | Method
A real language task | character-level language modeling on tiny-shakespeare.txt
Add a causal mask | nn.TransformerDecoderLayer + generate_square_subsequent_mask to enforce the autoregressive constraint (a sketch follows this table)
Adapt to Chinese | build a simple vocabulary + a Chinese corpus, change vocab_size
Speed up the model | torch.compile(model) (PyTorch 2.0+)
Save the model | torch.save(model.state_dict(), "mini_llm.pt")
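
As a rough sketch of the masking upgrade, kept close to the existing TransformerBlock rather than switching to nn.TransformerDecoderLayer (causal_forward is an illustrative helper, not part of the script); it builds the causal mask with nn.Transformer.generate_square_subsequent_mask and passes it to nn.MultiheadAttention as attn_mask:

import torch.nn as nn

def causal_forward(block, x):
    """Run a TransformerBlock with a causal mask so each position only attends
    to itself and earlier positions; x is [seq_len, batch, d_model]."""
    seq_len = x.size(0)
    # Float mask with -inf above the diagonal, produced by the built-in helper
    mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(x.device)
    attn_out, _ = block.attn(x, x, x, attn_mask=mask)
    x = block.ln1(x + attn_out)
    x = block.ln2(x + block.ff(x))
    return x

With the same mask applied in every layer during training, the loss at position i depends only on tokens up to i, which is the autoregressive constraint the table refers to.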

🧩 6. Summary: Your Local Transformer Prototype

Module | Role | Corresponding GPT component
Embedding + PositionalEncoding | token and position embeddings | Token & Positional Embedding
MultiHeadAttention | captures context | Self-Attention
FeedForward | feature transformation | MLP layer
LayerNorm + residual | stabilizes training | Residual Connection
Linear + Softmax | predicts the next token | Output Head
