The Technology Behind ChatGPT: Visualizing Transformer Positional Encoding and Attention in Practice

Latest recommended article published 2025-11-24 19:47:22

License: CC 4.0 BY-SA

Tags:

#chatgpt #transformer #deep-learning #tensorflow #pytorch #computer-vision #artificial-intelligence

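1. Introduction: From ChatGPT to Transformer

ChatGPT is built on the Transformer architecture, and two of the Transformer's key components are:

Positional encoding
Attention mechanism

This article takes these two techniques apart with visualizations and hands-on NumPy/PyTorch code, so you can "see" how ChatGPT makes sense of language.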

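2. Positional Encoding: Letting the Model "Know" Where Each Word Is

2.1 Why is positional encoding needed?

Unlike an RNN, which processes its input one step at a time, a Transformer processes the whole sentence in parallel, so it has no built-in sense of word order. To solve this, the Transformer adds a positional encoding to each word embedding.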
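
To make this concrete, here is a minimal sketch of why order is invisible without positional information. It uses a stripped-down self-attention with Q = K = V = x; the function name `self_attention`, the sizes, and the permutation are illustrative assumptions, not part of the original article.

```python
import torch

torch.manual_seed(0)

def self_attention(x):
    # Plain self-attention with Q = K = V = x (no learned projections, no positional encoding).
    scores = x @ x.transpose(-2, -1) / x.size(-1) ** 0.5
    return torch.softmax(scores, dim=-1) @ x

x = torch.randn(4, 8)              # 4 tokens, 8-dimensional embeddings
perm = torch.tensor([2, 0, 3, 1])  # an arbitrary reordering of the tokens

out = self_attention(x)
out_shuffled = self_attention(x[perm])

# Shuffling the input only shuffles the output rows in the same way:
# the layer itself carries no notion of "first" or "last".
print(torch.allclose(out[perm], out_shuffled, atol=1e-5))
```

Because the output for each token is the same no matter where the token sits in the sequence, word order has to be injected into the embeddings themselves.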

2.2 The sinusoidal positional encoding formula

The original Transformer paper uses a sine-cosine positional encoding:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right)$$
$$PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$

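where:

pos: the position of the word in the sentence
i: the index of a dimension pair (dimensions 2i and 2i+1), with 0 ≤ i < d/2
d: the embedding dimension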
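
As a quick check of the formula, the sketch below evaluates it in vectorized NumPy for a tiny case. The function name `sinusoidal_encoding` and the sizes `max_len=4`, `d_model=6` are hypothetical choices for illustration, and `d_model` is assumed to be even.

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    # Vectorized form of the formula above; d_model is assumed to be even.
    pos = np.arange(max_len)[:, None]          # shape (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]  # the values 2i = 0, 2, 4, ...
    angle = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)  # even dimensions
    pe[:, 1::2] = np.cos(angle)  # odd dimensions
    return pe

pe = sinusoidal_encoding(max_len=4, d_model=6)
print(pe[0])      # position 0: sin(0) = 0 and cos(0) = 1 alternate
print(pe[1, :2])  # position 1, i = 0: sin(1) and cos(1)
```

At position 0 every sine entry is 0 and every cosine entry is 1; as `pos` grows, the low-index dimensions oscillate quickly and the high-index dimensions change slowly.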

2.3 Hands-on visualization (NumPy and Matplotlib)

import matplotlib.pyplot as plt
import numpy as np

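def get_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) array of sinusoidal encodings; d_model must be even."""
    pe = np.zeros((max_len, d_model))
    for pos in range(max_len):
        # i steps over the even column indices, so it already equals "2i" in the formula
        for i in range(0, d_model, 2):
            angle = pos / 10000 ** (i / d_model)
            pe[pos, i] = np.sin(angle)      # even dimensions use sine
            pe[pos, i + 1] = np.cos(angle)  # odd dimensions use cosine
    return pe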

pe = get_positional_encoding(100, 512)
plt.figure(figsize=(10, 6))
plt.imshow(pe, cmap='viridis', aspect='auto')
plt.title("Positional Encoding Heatmap")
plt.xlabel("Embedding Dimension")
plt.ylabel("Position")
plt.show()

📊 Visualization result:
https://img-blog.csdnimg.cn/direct/pos_encoding.png

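3. Attention: Letting the Model "See" How Words Relate

3.1 Self-attention

Core idea: every word "looks at" all the words in the sentence (itself included) and decides which ones matter most.

Formula: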

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

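where:

$Q$, $K$, $V$ are obtained by multiplying the input word vectors by the weight matrices $W^Q$, $W^K$, $W^V$
$d_k$ is the dimension of the key vectors; dividing by $\sqrt{d_k}$ keeps the dot products from growing with the dimension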
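
The formula can be traced step by step on a tiny input. This is a minimal sketch under stated assumptions: the sizes are illustrative, and `W_q`, `W_k`, `W_v` are randomly initialized stand-ins for the learned matrices, not trained weights.

```python
import torch

torch.manual_seed(0)

seq_len, d_model, d_k = 3, 8, 4    # illustrative sizes, not from the article
X = torch.randn(seq_len, d_model)  # one row per token

# Randomly initialized stand-ins for the learned matrices W^Q, W^K, W^V.
W_q = torch.randn(d_model, d_k)
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_k)

Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / d_k ** 0.5            # (seq_len, seq_len) similarity scores
weights = torch.softmax(scores, dim=-1)  # each row is a distribution over keys
output = weights @ V                     # weighted average of the value vectors

print(weights.sum(dim=-1))  # every row sums to 1
print(output.shape)         # torch.Size([3, 4])
```

Row t of `weights` says how much token t draws from each token's value vector, which is exactly what the heatmaps later in the article display.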

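3.2 Multi-head attention

Q, K, and V are split into several "heads". Each head learns a different kind of relationship, and the head outputs are concatenated at the end: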

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}$$
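
The split-attend-concatenate pattern might look like the following sketch. The four `torch.nn.Linear` layers are untrained stand-ins for $W^Q$, $W^K$, $W^V$, $W^O$, the helper `split_heads` is a hypothetical name, and the sizes (8 heads, `d_model = 512`) are assumptions chosen to match the later example.

```python
import torch

torch.manual_seed(0)

batch, seq_len, d_model, n_heads = 1, 5, 512, 8
d_head = d_model // n_heads  # 64 dimensions per head

x = torch.randn(batch, seq_len, d_model)
# Untrained stand-ins for the learned projections W^Q, W^K, W^V, W^O.
W_q, W_k, W_v, W_o = (torch.nn.Linear(d_model, d_model, bias=False) for _ in range(4))

def split_heads(t):
    # (batch, seq_len, d_model) -> (batch, n_heads, seq_len, d_head)
    return t.view(batch, seq_len, n_heads, d_head).transpose(1, 2)

Q, K, V = split_heads(W_q(x)), split_heads(W_k(x)), split_heads(W_v(x))

# Scaled dot-product attention runs independently inside each head.
weights = torch.softmax(Q @ K.transpose(-2, -1) / d_head ** 0.5, dim=-1)
heads = weights @ V  # (batch, n_heads, seq_len, d_head)

# Concat(head_1, ..., head_h): merge the heads back into d_model columns, then apply W^O.
concat = heads.transpose(1, 2).reshape(batch, seq_len, d_model)
out = W_o(concat)

print(weights.shape)  # torch.Size([1, 8, 5, 5])
print(out.shape)      # torch.Size([1, 5, 512])
```

Each head sees only a 64-dimensional slice of the projected vectors, so the total cost stays close to that of a single 512-dimensional head while producing eight separate attention maps.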

3.3 Hands-on visualization: attention weight heatmap

The following code is based on a CSDN hands-on project:

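import math

import matplotlib.pyplot as plt
import seaborn as sns
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Return softmax(Q K^T / sqrt(d_k)) V together with the attention weights."""
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 receive zero weight after the softmax
        scores = scores.masked_fill(mask == 0, float('-inf'))
    attention_weights = torch.softmax(scores, dim=-1)
    output = torch.matmul(attention_weights, V)
    return output, attention_weights

# Simulated input: random Q, K, V, already split into heads
batch_size, seq_len, d_model = 1, 5, 512
n_heads = 8
d_head = d_model // n_heads  # 64 dimensions per head
Q = torch.randn(batch_size, n_heads, seq_len, d_head)
K = torch.randn(batch_size, n_heads, seq_len, d_head)
V = torch.randn(batch_size, n_heads, seq_len, d_head)

output, attn_weights = scaled_dot_product_attention(Q, K, V)
# Keep the first batch element and the first head: a (seq_len, seq_len) matrix
attn_weights = attn_weights[0, 0].detach().numpy()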

sns.heatmap(attn_weights, annot=True, fmt=".2f", cmap="Blues")
plt.title("Self-Attention Weights Heatmap")
plt.xlabel("Key Position")
plt.ylabel("Query Position")
plt.show()

📊 Visualization result:
https://img-blog.csdnimg.cn/direct/attention_heatmap.png
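
The `mask` argument of the function above goes unused in that example. GPT-style decoders apply a causal mask so that a position cannot attend to later positions; the sketch below shows one way to build such a mask with `torch.tril`. The function is repeated so the snippet runs on its own, and the random inputs are placeholders rather than real model activations.

```python
import math
import torch

torch.manual_seed(0)

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Same computation as the function above, repeated so this sketch runs on its own.
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ V, weights

seq_len = 5
Q = K = V = torch.randn(1, 8, seq_len, 64)

# Lower-triangular mask: query position t may attend to key positions 0..t only.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

_, weights = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(weights[0, 0])  # entries above the diagonal are exactly 0
```

The (seq_len, seq_len) mask broadcasts across the batch and head dimensions, so the same triangular pattern is applied to every head.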

4. Hands-on Project: Building a Tiny Transformer Block

You can refer to the complete PyTorch implementation on CSDN; below is a simplified version:

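import torch

class TransformerBlock(torch.nn.Module):
    """Post-norm encoder block: self-attention, then a feed-forward network, each with a residual."""
    def __init__(self, d_model, n_heads, d_ff=2048):
        super().__init__()
        # batch_first=True: inputs are (batch, seq_len, d_model), matching the earlier examples
        self.attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = torch.nn.Sequential(
            torch.nn.Linear(d_model, d_ff),
            torch.nn.ReLU(),
            torch.nn.Linear(d_ff, d_model)
        )
        self.norm1 = torch.nn.LayerNorm(d_model)
        self.norm2 = torch.nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)  # self-attention: query, key, and value are all x
        x = self.norm1(x + attn_out)      # residual connection, then layer norm
        ffn_out = self.ffn(x)
        x = self.norm2(x + ffn_out)       # second residual connection and layer norm
        return x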
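
One detail worth checking when using `torch.nn.MultiheadAttention` is the input layout: by default it expects tensors shaped (seq_len, batch, d_model), and `batch_first=True` switches it to (batch, seq_len, d_model). The shape check below is a small sketch with assumed sizes (batch of 2, 5 tokens).

```python
import torch

torch.manual_seed(0)

d_model, n_heads, batch, seq_len = 512, 8, 2, 5

attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
x = torch.randn(batch, seq_len, d_model)

# Self-attention: the same tensor is passed as query, key, and value.
out, weights = attn(x, x, x)
print(out.shape)      # torch.Size([2, 5, 512])
print(weights.shape)  # torch.Size([2, 5, 5]), averaged over the 8 heads by default
```

The returned weights are averaged across heads unless the module is asked otherwise, so they are a summary rather than a per-head view.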

5. Summary: ChatGPT's "Eyes" and "Ears" in One Table

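Component	Role	Visualization
Positional encoding	Tells the model "where each word is"	Heatmap (2D plot)
Attention mechanism	Tells the model "which words matter more"	Attention weight heatmap
Multi-head attention	Lets the model "view relationships from several angles"	Side-by-side comparison of per-head weights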
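
The per-head comparison in the last row of the table is not plotted elsewhere in the article, so here is one possible sketch. It reuses the random-input setup from section 3.3 (so the patterns carry no linguistic meaning), draws with plain Matplotlib instead of seaborn, and the helper name `plot_heads` is hypothetical.

```python
import matplotlib.pyplot as plt
import torch

torch.manual_seed(0)

n_heads, seq_len, d_head = 8, 5, 64
# Random stand-ins for per-head queries and keys, as in the example in section 3.3.
Q = torch.randn(1, n_heads, seq_len, d_head)
K = torch.randn(1, n_heads, seq_len, d_head)
# Attention weights for the first batch element: (n_heads, seq_len, seq_len)
weights = torch.softmax(Q @ K.transpose(-2, -1) / d_head ** 0.5, dim=-1)[0]

def plot_heads(weights):
    # One small heatmap per head on a 2 x 4 grid, sharing a color scale so heads are comparable.
    fig, axes = plt.subplots(2, 4, figsize=(12, 6))
    for h, ax in enumerate(axes.flat):
        ax.imshow(weights[h].numpy(), cmap="Blues", vmin=0.0, vmax=1.0)
        ax.set_title(f"Head {h}")
        ax.set_xlabel("Key Position")
        ax.set_ylabel("Query Position")
    fig.tight_layout()
    return fig

fig = plot_heads(weights)
```

Call `plt.show()` afterwards to display the figure. With a trained model, the same grid would be fed real attention weights, and differences between heads would reflect the different relationships each head has learned.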