A Complete Worked Transformer Example: Step-by-Step from Input to Output

Below, I walk through a complete example, starting from the input text and demonstrating every step of the Transformer's computation, including detailed calculation and interpretation of all intermediate results.

Case Setup

Input text: "cat sits" (assumed already tokenized and lowercased)

Parameter Settings

  • Vocabulary: {"<pad>": 0, "cat": 1, "sits": 2} (simplified example)

  • d_model = 4 (dimension reduced for demonstration)

  • d_k = d_v = 2

  • Number of heads h = 2

  • Feed-forward dimension d_ff = 8

  • Sequence length = 2

Step 1: Input Processing

Word Embeddings

Token indices

  • "cat" → 1

  • "sits" → 2

Embedding matrix E (3×4):

E = [
    [0, 0, 0, 0],    # <pad>
    [0.1, 0.2, 0.3, 0.4],  # cat
    [0.5, 0.6, 0.7, 0.8]   # sits
]

Looking up the word vectors

e_cat = E[1] = [0.1, 0.2, 0.3, 0.4]
e_sits = E[2] = [0.5, 0.6, 0.7, 0.8]

Positional Encoding

Encoding for position 0 (pos = 0):

PE(0,0) = sin(0/10000^(0/4)) = 0
PE(0,1) = cos(0/10000^(0/4)) = 1
PE(0,2) = sin(0/10000^(2/4)) = 0
PE(0,3) = cos(0/10000^(2/4)) = 1
→ PE_0 = [0, 1, 0, 1]

Encoding for position 1 (pos = 1):

PE(1,0) = sin(1/10000^(0/4)) ≈ 0.8415
PE(1,1) = cos(1/10000^(0/4)) ≈ 0.5403
PE(1,2) = sin(1/10000^(2/4)) = sin(0.01) ≈ 0.0100
PE(1,3) = cos(1/10000^(2/4)) = cos(0.01) ≈ 1.0000
→ PE_1 ≈ [0.8415, 0.5403, 0.0100, 1.0000]

Final Input Representation

h_cat = e_cat + PE_0 = [0.1+0, 0.2+1, 0.3+0, 0.4+1] = [0.1, 1.2, 0.3, 1.4]
h_sits = e_sits + PE_1 ≈ [0.5+0.8415, 0.6+0.5403, 0.7+0.0100, 0.8+1.0000]
       ≈ [1.3415, 1.1403, 0.7100, 1.8000]
H_0 = [[0.1, 1.2, 0.3, 1.4],
       [1.3415, 1.1403, 0.7100, 1.8000]]
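
The input pipeline can be verified with a few lines of NumPy. Below is a minimal sketch (the helper name positional_encoding is only illustrative) that builds the sinusoidal encodings and reproduces H_0 above:

import numpy as np

# Toy embedding matrix from the example above (rows: <pad>, cat, sits)
E = np.array([[0.0, 0.0, 0.0, 0.0],
              [0.1, 0.2, 0.3, 0.4],
              [0.5, 0.6, 0.7, 0.8]])

def positional_encoding(seq_len, d_model):
    # Standard sinusoidal encoding: sin on even dimensions, cos on odd ones
    pe = np.zeros((seq_len, d_model))
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    i = np.arange(0, d_model, 2)                  # even dimension indices: 0, 2, ...
    angle = pos / np.power(10000.0, i / d_model)  # (seq_len, d_model/2)
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

token_ids = np.array([1, 2])   # "cat", "sits"
H_0 = E[token_ids] + positional_encoding(seq_len=2, d_model=4)
print(H_0)  # ≈ [[0.1, 1.2, 0.3, 1.4], [1.3415, 1.1403, 0.7100, 1.8000]]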

Step 2: Encoder Layer 1

Multi-Head Attention (two heads)

Head-1 parameters

W_Q1 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
W_K1 = [[0.9, 1.0], [1.1, 1.2], [1.3, 1.4], [1.5, 1.6]]
W_V1 = [[0.1, -0.1], [-0.2, 0.2], [0.3, -0.3], [-0.4, 0.4]]

Computing Q, K, V (head 1)

Q1 = H_0 @ W_Q1 = [
    [0.1*0.1+1.2*0.3+0.3*0.5+1.4*0.7, 0.1*0.2+1.2*0.4+0.3*0.6+1.4*0.8],
    [1.3415*0.1+1.1403*0.3+0.7100*0.5+1.8000*0.7, ...]
] ≈ [
    [1.500, 1.800],
    [2.091, 2.590]
]

K1 = H_0 @ W_K1 ≈ [
    [3.900, 4.200],
    [6.085, 6.584]
]

V1 = H_0 @ W_V1 ≈ [
    [-0.700, 0.700],
    [-0.601, 0.601]
]

Computing the attention scores (head 1)

attn_scores1 = Q1 @ K1.T / sqrt(2) ≈ [
    [1.500*3.900+1.800*4.200, 1.500*6.085+1.800*6.584],
    [2.091*3.900+2.590*4.200, 2.091*6.085+2.590*6.584]
] / 1.414 ≈ [
    [13.41, 20.98],
    [19.03, 29.78]
] / 1.414 ≈ [
    [9.48, 14.83],
    [13.46, 21.06]
]

The softmax is applied row by row, so each query position's weights over all key positions sum to 1:

attn_weights1 = softmax(attn_scores1) ≈ [
    [e^9.48/(e^9.48+e^14.83), e^14.83/(e^9.48+e^14.83)],
    [e^13.46/(e^13.46+e^21.06), e^21.06/(e^13.46+e^21.06)]
] ≈ [
    [0.005, 0.995],
    [0.0005, 0.9995]
]

Because the toy activations are fairly large, the two scores in each row differ substantially and the weights are nearly saturated: both tokens attend almost entirely to "sits".

Computing the head-1 output

head1 = attn_weights1 @ V1 ≈ [
    [0.005*-0.700+0.995*-0.601, 0.005*0.700+0.995*0.601],
    [0.0005*-0.700+0.9995*-0.601, 0.0005*0.700+0.9995*0.601]
] ≈ [
    [-0.601, 0.601],
    [-0.601, 0.601]
]
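
Head 1 end to end, as a small NumPy sketch (scaled_dot_product_attention is an illustrative helper, not a library function); it reproduces the weights and the head-1 output above:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                              # (seq, seq)
    # Softmax over the last axis: every query row sums to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

H_0 = np.array([[0.1, 1.2, 0.3, 1.4],
                [1.3415, 1.1403, 0.7100, 1.8000]])
W_Q1 = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]])
W_K1 = np.array([[0.9, 1.0], [1.1, 1.2], [1.3, 1.4], [1.5, 1.6]])
W_V1 = np.array([[0.1, -0.1], [-0.2, 0.2], [0.3, -0.3], [-0.4, 0.4]])

head1, attn_weights1 = scaled_dot_product_attention(H_0 @ W_Q1, H_0 @ W_K1, H_0 @ W_V1)
print(attn_weights1)  # ≈ [[0.005, 0.995], [0.0005, 0.9995]]
print(head1)          # ≈ [[-0.601, 0.601], [-0.601, 0.601]]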

Head 2 (the procedure is the same; assume its result is):

head2 ≈ [
    [0.150, -0.150],
    [0.350, -0.350]
]

Concatenating the Multi-Head Outputs

multi_head = concat([head1, head2]) ≈ [
    [-0.601, 0.601, 0.150, -0.150],
    [-0.601, 0.601, 0.350, -0.350]
]

W_O = [[0.1,0.2,0.3,0.4], [0.5,0.6,0.7,0.8], [0.9,1.0,1.1,1.2], [1.3,1.4,1.5,1.6]]
attn_output = multi_head @ W_O ≈ [
    [-0.601*0.1+0.601*0.5+0.150*0.9-0.150*1.3, ..., ...],
    [...]
] ≈ [
    [0.180, 0.180, 0.180, 0.180],
    [0.100, 0.100, 0.100, 0.100]
]

(All four components coincide here only because each row of multi_head sums to zero and the columns of W_O differ by a constant shift, so the column index drops out.)
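
A quick NumPy check of the concatenation and output projection (head2 is the assumed value from above):

import numpy as np

head1 = np.array([[-0.601, 0.601], [-0.601, 0.601]])
head2 = np.array([[0.150, -0.150], [0.350, -0.350]])   # assumed head-2 output
W_O = np.array([[0.1, 0.2, 0.3, 0.4],
                [0.5, 0.6, 0.7, 0.8],
                [0.9, 1.0, 1.1, 1.2],
                [1.3, 1.4, 1.5, 1.6]])

# Concatenate the heads along the feature axis, then project back to d_model
multi_head = np.concatenate([head1, head2], axis=-1)   # shape (2, 4)
attn_output = multi_head @ W_O
print(attn_output)  # ≈ [[0.180, 0.180, 0.180, 0.180], [0.100, 0.100, 0.100, 0.100]]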

Residual Connection and Layer Normalization

Residual connection

residual1 = H_0 + attn_output ≈ [
    [0.1+0.180, 1.2+0.180, 0.3+0.180, 1.4+0.180],
    [1.3415+0.100, 1.1403+0.100, 0.7100+0.100, 1.8000+0.100]
] ≈ [
    [0.280, 1.380, 0.480, 1.580],
    [1.4415, 1.2403, 0.8100, 1.9000]
]

Layer normalization (simplified computation):

mean = np.mean(residual1, axis=1) ≈ [0.930, 1.348]
std = np.std(residual1, axis=1) ≈ [0.559, 0.392]
gamma = [1, 1, 1, 1]
beta = [0, 0, 0, 0]
norm_output1 ≈ [
    [(0.280-0.930)/0.559, ..., ...],
    [...]
] ≈ [
    [-1.163, 0.805, -0.805, 1.163],
    [0.239, -0.275, -1.372, 1.408]
]
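
Layer normalization operates on each token (row) independently. A minimal sketch, with gamma and beta left at their identity values as in the example:

import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-6):
    # Normalize every row to zero mean / unit variance, then scale and shift
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta

residual1 = np.array([[0.280, 1.380, 0.480, 1.580],
                      [1.4415, 1.2403, 0.8100, 1.9000]])
print(layer_norm(residual1))
# ≈ [[-1.163, 0.805, -0.805, 1.163], [0.239, -0.275, -1.372, 1.408]]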

Feed-Forward Network

First layer

W1 = np.random.randn(4, 8)  # assumed (random) weights
b1 = np.zeros(8)
ffn_hidden = relu(norm_output1 @ W1 + b1)  # assume the result is
≈ [
    [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2],
    [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1]
]

Second layer

W2 = np.random.randn(8, 4)  # assumed (random) weights
b2 = np.zeros(4)
ffn_output = ffn_hidden @ W2 + b2  # assume the result is
≈ [
    [0.22, 0.33, 0.44, 0.55],
    [0.20, 0.30, 0.40, 0.50]
]
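
The position-wise feed-forward network is simply two linear layers with a ReLU in between, applied to every token independently. A minimal sketch; since the weights here are random, only the shapes (not the assumed values above) are meaningful:

import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise FFN: Linear -> ReLU -> Linear
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 8)), np.zeros(8)   # d_model=4 -> d_ff=8
W2, b2 = rng.standard_normal((8, 4)), np.zeros(4)   # d_ff=8 -> d_model=4

norm_output1 = np.array([[-1.163, 0.805, -0.805, 1.163],
                         [0.239, -0.275, -1.372, 1.408]])
print(feed_forward(norm_output1, W1, b1, W2, b2).shape)  # (2, 4): back to d_model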

Second Residual Connection

residual2 = norm_output1 + ffn_output ≈ [
    [-1.163+0.22, 0.805+0.33, -0.805+0.44, 1.163+0.55],
    [0.239+0.20, -0.275+0.30, -1.372+0.40, 1.408+0.50]
] ≈ [
    [-0.943, 1.135, -0.365, 1.713],
    [0.439, 0.025, -0.972, 1.908]
]

Final Layer Normalization

H_1 ≈ [
    [-1.231, 0.695, -0.695, 1.231],
    [0.086, -0.314, -1.277, 1.505]
]  # output of encoder layer 1
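
Putting the two sub-layers together, an encoder layer follows the post-LN pattern "sub-layer → add residual → LayerNorm", twice. A minimal wiring sketch with identity stand-ins for the sub-layers (the real ones are the attention and FFN shown above):

import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def encoder_layer(x, self_attention, feed_forward):
    # Post-LN encoder block: x = LN(x + SelfAttn(x)); x = LN(x + FFN(x))
    x = layer_norm(x + self_attention(x))
    x = layer_norm(x + feed_forward(x))
    return x

# Shape check with identity stand-ins for the sub-layers
H_0 = np.array([[0.1, 1.2, 0.3, 1.4], [1.3415, 1.1403, 0.7100, 1.8000]])
print(encoder_layer(H_0, lambda h: h, lambda h: h).shape)  # (2, 4)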

Step 3: Output Prediction (final decoder step)

Assume the decoder's final output is:

decoder_output ≈ [
    [0.1, 0.2, 0.3, 0.4],  # first position
    [0.5, 0.6, 0.7, 0.8]   # second position
]

Output-Layer Weights

W_out = [
    [0.1, 0.2, 0.3],  # columns correspond to <pad>, cat, sits
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2]
]

Computing the logits

logits = decoder_output @ W_out = [
    [0.1*0.1+0.2*0.4+0.3*0.7+0.4*1.0, ..., ...],
    [...]
] ≈ [
    [0.70, 0.80, 0.90],
    [1.58, 1.84, 2.10]
]

Softmax Probabilities

probs = softmax(logits) ≈ [
    [e^0.70/(e^0.70+e^0.80+e^0.90), ..., ...],
    [...]
] ≈ [
    [0.301, 0.332, 0.367],
    [0.251, 0.326, 0.423]
]

At both positions, "sits" receives the highest probability.
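
A minimal NumPy check of the projection to vocabulary logits and the softmax (the helper subtracts the row maximum for numerical stability):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

decoder_output = np.array([[0.1, 0.2, 0.3, 0.4],
                           [0.5, 0.6, 0.7, 0.8]])
W_out = np.array([[0.1, 0.2, 0.3],
                  [0.4, 0.5, 0.6],
                  [0.7, 0.8, 0.9],
                  [1.0, 1.1, 1.2]])   # columns: <pad>, cat, sits

logits = decoder_output @ W_out
print(logits)           # ≈ [[0.70, 0.80, 0.90], [1.58, 1.84, 2.10]]
print(softmax(logits))  # ≈ [[0.301, 0.332, 0.367], [0.251, 0.326, 0.423]]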

Summary of the Complete Pipeline

Input processing

  • Convert "cat" and "sits" into word vectors

  • Add positional encodings to obtain the initial representation H_0

Encoder computation

  • Multi-head attention (two heads)

  • Residual connection and layer normalization

  • Feed-forward transformation

  • A second residual connection and normalization yields H_1

Output prediction

  • Assume the decoder's output

  • Linear projection to the vocabulary size

  • Softmax gives a probability distribution over the words

This simplified example shows the complete Transformer computation from input to output. Although the dimensions are scaled down, every key computational step and the underlying mathematics are preserved. In practice, d_model is typically 512 or 1024 and there are 6-12 layers, so the amount of computation is far larger, but the basic principles are the same.
