[Long Read] Dissecting the Transformer Layer by Layer: The Full Technical Journey from Embedding to Output Logits

Latest recommended article published on 2025-11-23 17:56:55

License: CC 4.0 BY-SA

Tags:

#transformer #embedding #deep-learning #machine-learning #artificial-intelligence #autonomous-driving #object-tracking

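Bird's-eye view: fix the whole picture in mind with one diagram
Data ingress: Tokenizer → Embedding → Positional Encoding
Encoder tower: 6×(Self-Attention + FFN + Residual + Norm)
Decoder tower: 6×(Masked Self-Attention + Cross-Attention + FFN)
Multi-head attention: math, code, complexity, optimization
Residuals and normalization: Pre-LN vs Post-LN, DeepNorm, RMSNorm
Output head: Linear → logits → Softmax
Training pipeline: loss function, label smoothing, learning-rate schedule
Inference: KV-Cache, Beam Search, Top-k/p sampling
Interview cheat sheet: 10 frequently asked questions and answers
Further reading & references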

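1. Bird's-eye view: overall information flow

Code preview

2. Data ingress: Tokenizer → Embedding → Positional Encoding

2.1 Tokenizer

BPE / WordPiece / SentencePiece: subword-level tokenization, which avoids out-of-vocabulary (OOV) tokens
In practice: load a tokenizer in one line with transformers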

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok("I love NLP", return_tensors="pt")["input_ids"]
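
To make the "subword-level, avoids OOV" point concrete, here is a minimal sketch of greedy longest-match subword splitting in the WordPiece style. The vocabulary `TOY_VOCAB` and the function name `subword_tokenize` are hypothetical, chosen only for illustration; this is not how GPT-2's BPE tokenizer works internally (BPE applies learned merge rules instead).

```python
def subword_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matched, not even a short piece
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption for this example).
TOY_VOCAB = {"token", "##ize", "##r", "##s", "un", "##seen", "##ly"}

print(subword_tokenize("tokenizers", TOY_VOCAB))
print(subword_tokenize("unseenly", TOY_VOCAB))
```

A word that never appeared as a whole in the vocabulary is still representable as long as its pieces are. Real vocabularies usually also contain single characters or bytes, so the `[UNK]` fallback is rarely hit.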

2.2 Embedding

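Trainable matrix E ∈ ℝ^(V×d_model); an embedding is a table lookup (row E[token_id])
Weight tying: sharing the input embedding with the output projection saves V×d_model parameters (the 25% reduction cited here depends on vocabulary size and model size)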
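
The two points above can be sketched in a few lines of plain Python: the lookup is just row indexing, and tying reuses the same matrix to produce output scores. The toy sizes and the helper names (`embed`, `output_logits`, `embedding_params`) are assumptions for illustration; the last call uses GPT-2-small-like sizes (V=50257, d_model=768) only as an arithmetic example.

```python
import random

V, d_model = 8, 4  # toy sizes (hypothetical)
random.seed(0)
# Embedding matrix E with V rows of d_model values each.
E = [[random.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(V)]


def embed(token_ids):
    """Embedding lookup: pick row E[i] for every token id."""
    return [E[i] for i in token_ids]


def output_logits(h):
    """Tied output head: logits[v] = h . E[v], reusing the same matrix."""
    return [sum(hj * ej for hj, ej in zip(h, row)) for row in E]


def embedding_params(vocab_size, dim, tied):
    """Parameters in the input embedding plus the output projection."""
    return vocab_size * dim * (1 if tied else 2)


x = embed([3, 1, 3])
print(len(x), len(x[0]))          # sequence length and d_model
print(len(output_logits(x[0])))   # one score per vocabulary entry
saved = embedding_params(50257, 768, tied=False) - embedding_params(50257, 768, tied=True)
print(saved)                      # parameters saved by tying
```

How large that saving is relative to the whole model depends on how many parameters sit in the layers themselves, which is why a single percentage does not hold across model sizes.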

2.3 Positional Encoding

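Sinusoidal formula

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$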
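
As a worked version of the sinusoidal formula above, the following sketch fills a `seq_len × d_model` table in plain Python. The function name `sinusoidal_pe` is an assumption; in a PyTorch model the same table would normally be built as a tensor and added to the embeddings.

```python
import math


def sinusoidal_pe(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        # i walks over the even dimensions, so i plays the role of 2i in the formula.
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


pe = sinusoidal_pe(seq_len=4, d_model=6)
print(pe[0])  # position 0: sin(0)=0 on even dims, cos(0)=1 on odd dims
```

Low dimensions oscillate quickly with position and high dimensions slowly, so each position gets a distinct pattern without any trainable parameters.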

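RoPE (Rotary Position Embedding) has become standard in models such as LLaMA and ChatGLM. It injects relative position by rotating Q and K, and it mitigates (rather than fully solves) the length-extrapolation problem.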
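
A minimal sketch of the rotation idea behind RoPE, assuming the interleaved-pair convention from the RoFormer paper (some implementations pair dimensions differently). The helper names `rope` and `dot` are hypothetical. The point it illustrates: after rotation, the query-key dot product depends only on the distance between the two positions.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate each (even, odd) pair of vec by an angle proportional to pos."""
    d = len(vec)
    assert d % 2 == 0
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower dims rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (3), very different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 105), rope(k, 102))
print(abs(near - far) < 1e-9)
```

Because only the offset matters, the attention score pattern learned at one position transfers to others, which is the property usually credited for RoPE's better behavior on longer inputs.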

3. Encoder tower: six layers, black box or white box?

Each layer = Self-Attention → Add&Norm → FFN → Add&Norm

3.1 Input shape

[batch, seq_len, 512] (taking d_model=512 as an example)

3.2 Code snippet (single layer)

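import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Post-LN encoder layer: Self-Attention -> Add&Norm -> FFN -> Add&Norm."""
    def __init__(self, d_model=512, heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # MultiHeadAttention is the class defined in section 5.3
        self.mha = MultiHeadAttention(d_model, heads, dropout)
        # Position-wise feed-forward network: d_model -> d_ff -> d_model
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # x: [batch, seq_len, d_model]; the output has the same shape
        # Self-Attention (query, key and value are all x)
        attn_out = self.mha(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_out))
        # FFN
        ffn_out = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_out))
        return x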

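4. Decoder tower: masking + cross-attention

Key differences:

Masked Self-Attention: prevents a position from seeing future tokens
Cross-Attention: K/V come from the encoder output, Q from the decoder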

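class DecoderLayer(nn.Module):
    # __init__ is not shown in the original; this one mirrors EncoderLayer (an assumption)
    def __init__(self, d_model=512, heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.mha1 = MultiHeadAttention(d_model, heads, dropout)  # masked self-attention
        self.mha2 = MultiHeadAttention(d_model, heads, dropout)  # cross-attention
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out, src_mask=None, tgt_mask=None):
        # 1. Masked Self-Attention
        mha1_out = self.mha1(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(mha1_out))
        # 2. Cross-Attention
        mha2_out = self.mha2(x, enc_out, enc_out, src_mask)
        x = self.norm2(x + self.dropout(mha2_out))
        # 3. FFN
        ffn_out = self.ffn(x)
        x = self.norm3(x + self.dropout(ffn_out))
        return x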
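
The `tgt_mask` used by the masked self-attention step is typically a lower-triangular matrix. The sketch below builds one in plain Python and applies it to a single row of scores, following the same "mask == 0 means blocked" convention as the attention code in this article. The helper names `causal_mask` and `masked_softmax` are assumptions for illustration.

```python
import math


def causal_mask(n):
    """mask[i][j] == 1 if position i may attend to position j (j <= i)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask_row):
    """Softmax over one row of scores, with blocked positions set to -1e9."""
    masked = [s if m == 1 else -1e9 for s, m in zip(scores, mask_row)]
    mx = max(masked)
    exps = [math.exp(s - mx) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(3)
for row in mask:
    print(row)

# Position 1 may look at positions 0 and 1, but not at position 2.
weights = masked_softmax([2.0, 1.0, 3.0], mask[1])
print(weights)
```

In PyTorch the same pattern is commonly built with `torch.tril(torch.ones(n, n))` and broadcast over the batch and head dimensions.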

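5. Multi-head attention: taking it apart

5.1 Math

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Multi-head: h=8 heads run in parallel; their outputs are concatenated and then projected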
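
To see the attention formula act on concrete numbers, here is a single-head sketch in plain Python using lists instead of tensors. The function names and the tiny inputs are assumptions chosen so the result is easy to check by hand: the query matches both keys equally, so the output is the average of the two value rows.

```python
import math


def softmax(xs):
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one head, with lists of rows."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows, one output dimension at a time.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out


Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [1.0, 0.0]]   # two identical keys
V = [[2.0, 0.0], [0.0, 4.0]]
print(attention(Q, K, V))
```

Multi-head attention repeats this computation h times on separately projected d_k-dimensional slices and concatenates the results.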

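5.2 Complexity

Time: O(n²d); memory for the attention matrix: O(n²). FlashAttention computes the same exact attention, so FLOPs stay O(n²d), but its tiling avoids materializing the n×n matrix, which brings the extra memory down to linear in n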
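
The reason the n×n matrix never has to be stored is the online-softmax trick: scores can be consumed block by block while keeping only a running maximum, a running denominator and a running weighted sum. The sketch below shows this for a single query with scalar values; it is a simplified illustration of the idea FlashAttention builds on, not the actual GPU kernel, and the function names are assumptions.

```python
import math


def naive_attention_row(scores, values):
    """Reference: softmax over all scores at once, then a weighted sum."""
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    return sum(e * v for e, v in zip(exps, values)) / sum(exps)


def streaming_attention_row(scores, values, block=2):
    """Same result, but scores are consumed one block at a time."""
    running_max = float("-inf")
    denom = 0.0   # running sum of exp(score - running_max)
    acc = 0.0     # running sum of exp(score - running_max) * value
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        # Rescale what was accumulated under the old maximum.
        scale = math.exp(running_max - new_max)  # exp(-inf) == 0.0 on the first block
        denom *= scale
        acc *= scale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc += w * v
        running_max = new_max
    return acc / denom


scores = [0.5, 2.0, -1.0, 3.5, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(naive_attention_row(scores, values))
print(streaming_attention_row(scores, values))
```

The streaming version does the same arithmetic as the naive one, which is why the speedup comes from fewer memory reads and writes rather than from fewer FLOPs.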

5.3 Code

import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, heads=8, dropout=0.1):
        super().__init__()
        assert d_model % heads == 0
        self.d_k = d_model // heads  # per-head dimension
        self.h = heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, q, k, v, mask=None):
        B, L, _ = q.shape
        # Project, then split into heads: [B, seq, d_model] -> [B, h, seq, d_k]
        Q = self.W_q(q).view(B, L, self.h, self.d_k).transpose(1, 2)
        K = self.W_k(k).view(B, -1, self.h, self.d_k).transpose(1, 2)
        V = self.W_v(v).view(B, -1, self.h, self.d_k).transpose(1, 2)

        # Scaled dot-product scores: [B, h, L_q, L_k]
        scores = (Q @ K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            # mask must broadcast to [B, h, L_q, L_k]; 0 marks blocked positions
            scores = scores.masked_fill(mask == 0, -1e9)
        attn = F.softmax(scores, dim=-1)
        attn = self.dropout(attn)
        # Merge heads back: [B, h, L_q, d_k] -> [B, L_q, d_model]
        out = (attn @ V).transpose(1, 2).contiguous().view(B, L, -1)
        return self.W_o(out)

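6. Residuals and normalization: Pre-LN vs Post-LN

Post-LN (original): sublayer → residual add → LayerNorm; training needs learning-rate warmup
Pre-LN (common): LayerNorm → sublayer → residual add; more stable
DeepNorm: scales the residual at every layer, which allows depth to be pushed to 1000+ layers

RMSNorm: drops the mean-centering step; faster (a speedup of about 7% is cited here, and the gain varies by model) with comparable quality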
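
The differences listed above come down to where the normalization sits and what it computes. The sketch below expresses both wirings and both norms on plain Python lists; the learnable gain and bias are omitted (treated as 1 and 0) to keep it short, and all function names are assumptions for illustration.

```python
import math


def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-5):
    """Divide by the root mean square only; no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def post_ln(x, sublayer):
    """Post-LN: normalize after the residual add."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln(x, sublayer):
    """Pre-LN: normalize the sublayer input; the residual path stays untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def double(h):
    # Stand-in for attention or FFN (hypothetical sublayer).
    return [2.0 * v for v in h]


x = [1.0, 2.0, 3.0, 4.0]
print(post_ln(x, double))
print(pre_ln(x, double))
print(rms_norm([3.0, 4.0]))
```

In the Pre-LN wiring the input x reaches the output through an identity path, which is one common explanation for why gradients stay better behaved in deep stacks.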

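7. Output head

import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    def __init__(self, d_model, vocab):
        super().__init__()
        # Projects each hidden state to one score (logit) per vocabulary entry
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        # self.proj(x) are the logits; log_softmax turns them into log-probabilities
        return F.log_softmax(self.proj(x), dim=-1)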
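
The step from logits to log-probabilities can be written out by hand. This is a minimal sketch of a numerically stable log-softmax in plain Python (the function name mirrors the PyTorch one, but this is an illustrative reimplementation, not the library code); subtracting the maximum first keeps `exp` from overflowing on large logits.

```python
import math


def log_softmax(logits):
    """log(softmax(logits)) computed via the log-sum-exp trick."""
    mx = max(logits)
    lse = mx + math.log(sum(math.exp(z - mx) for z in logits))
    return [z - lse for z in logits]


logits = [2.0, 1.0, 0.1]
log_probs = log_softmax(logits)
probs = [math.exp(lp) for lp in log_probs]
print(log_probs)
print(sum(probs))

# Large logits do not overflow because the maximum is subtracted first.
print(log_softmax([1000.0, 1001.0]))
```

Greedy decoding only needs the argmax, which is the same for logits, probabilities and log-probabilities; the normalization matters for the loss and for sampling.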

8. Training pipeline

Loss: cross-entropy with label smoothing (ε=0.1)
Scheduler: Noam (lr ∝ d_model^{-0.5}·min(step^{-0.5}, step·warmup^{-1.5}))
Regularization: dropout 0.1, weight tying, gradient clipping at 1.0
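
Two of the items above are simple enough to write out directly. The sketch below implements the Noam schedule exactly as given in the formula, and a label-smoothed loss for one token. Assumptions to note: `warmup=4000` is the value from the original Transformer paper, not something this article specifies, and the smoothing spreads ε uniformly over all V classes (the PyTorch convention; other implementations spread it over the V-1 wrong classes).

```python
import math


def noam_lr(step, d_model=512, warmup=4000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)  # avoid 0 ** -0.5 at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


def label_smoothed_nll(log_probs, target, eps=0.1):
    """Cross-entropy against a smoothed target distribution for one token."""
    num_classes = len(log_probs)
    loss = 0.0
    for i, lp in enumerate(log_probs):
        # eps/V on every class, plus the remaining 1-eps on the true class.
        q = eps / num_classes + (1.0 - eps if i == target else 0.0)
        loss -= q * lp
    return loss


for step in (1, 2000, 4000, 8000, 100000):
    print(step, noam_lr(step))

log_probs = [math.log(0.7), math.log(0.2), math.log(0.1)]
print(label_smoothed_nll(log_probs, target=0, eps=0.0))  # plain cross-entropy
print(label_smoothed_nll(log_probs, target=0, eps=0.1))  # smoothed
```

The learning rate rises linearly until `step == warmup`, where the two terms inside `min` are equal, and then decays as the inverse square root of the step.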
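9. Inference optimizations

- KV-Cache: store the K/V computed for earlier tokens so each decoding step only processes the new token; per-step attention cost drops from O(n²) to O(n)
- Beam Search: beam=4 raises BLEU by roughly 2-3 points in MT
- Top-k/p sampling: a staple for generative models; reduces repetitive output

10. Interview cheat sheet

Why divide by √d_k?
→ Dot products grow with d_k; without scaling, softmax saturates and its gradients vanish.
How is it parallelized?
→ Attention for all tokens is computed at once, and the matrix multiplications are highly parallel.
Biggest advantage over RNNs?
→ The path length between any two positions is O(1), versus O(n) for an RNN.
Can positional encodings be learned?
→ Yes, but learned absolute embeddings extrapolate poorly beyond the training length; RoPE mitigates this.
How do you keep memory from exploding?
→ Gradient checkpointing, FlashAttention, ZeRO, 8-bit quantization.
Why is Pre-LN more stable?
→ Gradient norms are more uniform across layers, which avoids exploding gradients in deep stacks.
Pros and cons of a shared embedding?
→ Saves parameters, but requires the input and output vocabularies to be identical.
Why does the Transformer use LayerNorm instead of BatchNorm?
→ Sequence lengths vary, so BatchNorm statistics are unstable.
How do you evaluate an attention visualization?
→ Check whether the alignment matrix lines up with the key source-side words.
How do you scale to million-token contexts?
→ Longformer sparse attention, ALiBi, StreamingLLM.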
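
The answer to the √d_k question can be checked numerically. If the components of q and k are independent with zero mean and unit variance, their dot product has variance d_k, so dividing by √d_k brings it back to about 1. The sketch below estimates this by sampling; the function name, trial count and seed are assumptions chosen for illustration.

```python
import random


def dot_variance(d_k, trials=2000, seed=0):
    """Empirical variance of q . k for random unit-variance q and k."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / trials
    return sum((x - mean) ** 2 for x in dots) / trials


for d_k in (4, 64):
    var = dot_variance(d_k)
    # Dividing scores by sqrt(d_k) divides their variance by d_k.
    print(d_k, round(var, 2), round(var / d_k, 2))
```

With d_k=64 the unscaled scores have a standard deviation of about 8, which is enough to push softmax toward a near one-hot output where gradients are tiny.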
11. Further reading

- Vaswani et al., 2017. Attention Is All You Need
- Xiong et al., 2020. On Layer Normalization in the Transformer Architecture
- Dao et al., 2022. FlashAttention: Fast and Memory-Efficient Exact Attention
- Su et al., 2021. RoFormer: Enhanced Transformer with Rotary Position Embedding