Below, I will walk through a complete worked example, starting from the input text and demonstrating every step of a Transformer's computation, including detailed calculation and interpretation of all intermediate results.
Example Setup
Input text: "cat sits" (assumed to be already tokenized and lowercased)
Parameter settings:
- Vocabulary: {"<pad>": 0, "cat": 1, "sits": 2} (simplified for this example)
- d_model = 4 (dimensions reduced for the demonstration)
- d_k = d_v = 2
- Number of heads h = 2
- Feed-forward dimension d_ff = 8
- Sequence length = 2
Step 1: Input Processing
Word Embeddings
Token indices:
- "cat" → 1
- "sits" → 2
Embedding matrix E (3×4):
E = [
[0, 0, 0, 0], # <pad>
[0.1, 0.2, 0.3, 0.4], # cat
[0.5, 0.6, 0.7, 0.8] # sits
]
Look up the word vectors:
e_cat = E[1] = [0.1, 0.2, 0.3, 0.4]
e_sits = E[2] = [0.5, 0.6, 0.7, 0.8]
Positional Encoding
Encoding for position 0 (pos = 0):
PE(0,0) = sin(0/10000^(0/4)) = 0
PE(0,1) = cos(0/10000^(0/4)) = 1
PE(0,2) = sin(0/10000^(2/4)) = 0
PE(0,3) = cos(0/10000^(2/4)) = 1
→ PE_0 = [0, 1, 0, 1]
Encoding for position 1 (pos = 1):
PE(1,0) = sin(1/10000^0) ≈ 0.8415
PE(1,1) = cos(1/10000^0) ≈ 0.5403
PE(1,2) = sin(1/10000^(2/4)) = sin(0.01) ≈ 0.0100
PE(1,3) = cos(1/10000^(2/4)) = cos(0.01) ≈ 1.0000
→ PE_1 ≈ [0.8415, 0.5403, 0.0100, 1.0000]
Final input representation:
h_cat = e_cat + PE_0 = [0.1+0, 0.2+1, 0.3+0, 0.4+1] = [0.1, 1.2, 0.3, 1.4]
h_sits = e_sits + PE_1 ≈ [0.5+0.8415, 0.6+0.5403, 0.7+0.0100, 0.8+1.0000]
≈ [1.3415, 1.1403, 0.7100, 1.8000]
H_0 = [[0.1, 1.2, 0.3, 1.4],
[1.3415, 1.1403, 0.7100, 1.8000]]
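To make these numbers easy to check, here is a minimal NumPy sketch of the embedding lookup and the sinusoidal positional encoding used above; the helper name positional_encoding is just for illustration.
import numpy as np
# Embedding matrix and token ids from the example above
E = np.array([[0.0, 0.0, 0.0, 0.0],    # <pad>
              [0.1, 0.2, 0.3, 0.4],    # cat
              [0.5, 0.6, 0.7, 0.8]])   # sits
token_ids = np.array([1, 2])           # "cat sits"
d_model = 4
def positional_encoding(seq_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe
H_0 = E[token_ids] + positional_encoding(len(token_ids), d_model)
print(np.round(H_0, 4))   # ≈ [[0.1 1.2 0.3 1.4], [1.3415 1.1403 0.71 1.8]]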
Step 2: First Encoder Layer
Multi-Head Attention (two heads)
Head 1 parameters:
W_Q1 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
W_K1 = [[0.9, 1.0], [1.1, 1.2], [1.3, 1.4], [1.5, 1.6]]
W_V1 = [[0.1, -0.1], [-0.2, 0.2], [0.3, -0.3], [-0.4, 0.4]]
Compute Q, K, V (head 1):
Q1 = H_0 @ W_Q1 = [
[0.1*0.1+1.2*0.3+0.3*0.5+1.4*0.7, 0.1*0.2+1.2*0.4+0.3*0.6+1.4*0.8],
[1.3415*0.1+1.1403*0.3+0.7100*0.5+1.8000*0.7, ...]
] ≈ [
[1.500, 1.800],
[2.091, 2.590]
]
K1 = H_0 @ W_K1 ≈ [
[3.900, 4.200],
[6.085, 6.584]
]
V1 = H_0 @ W_V1 ≈ [
[-0.700, 0.700],
[-0.601, 0.601]
]
Compute the attention scores (head 1):
attn_scores1 = Q1 @ K1.T / sqrt(2) ≈ [
[1.500*3.900+1.800*4.200, 1.500*6.085+1.800*6.584],
[2.091*3.900+2.590*4.200, 2.091*6.085+2.590*6.584]
] / 1.414 ≈ [
[13.410, 20.979],
[19.033, 29.776]
] / 1.414 ≈ [
[9.48, 14.84],
[13.46, 21.06]
]
attn_weights1 = softmax(attn_scores1)   # softmax over each row, i.e. across the keys
≈ [
[e^9.48/(e^9.48+e^14.84), e^14.84/(e^9.48+e^14.84)],
[e^13.46/(e^13.46+e^21.06), e^21.06/(e^13.46+e^21.06)]
] ≈ [
[0.0047, 0.9953],
[0.0005, 0.9995]
]
Compute the head-1 output:
head1 = attn_weights1 @ V1 ≈ [
[0.0047*-0.700+0.9953*-0.601, 0.0047*0.700+0.9953*0.601],
[0.0005*-0.700+0.9995*-0.601, 0.0005*0.700+0.9995*0.601]
] ≈ [
[-0.601, 0.601],
[-0.601, 0.601]
]
(With these example parameters the scores are large and lopsided, so both positions attend almost entirely to "sits", and the two output rows are nearly identical.)
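The head-1 computation above can be reproduced with a short scaled-dot-product-attention sketch in NumPy; note that the softmax is applied to each row of the score matrix (i.e. across the keys).
import numpy as np
def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, softmax taken row-wise
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
H_0 = np.array([[0.1, 1.2, 0.3, 1.4],
                [1.3415, 1.1403, 0.7100, 1.8000]])
W_Q1 = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]])
W_K1 = np.array([[0.9, 1.0], [1.1, 1.2], [1.3, 1.4], [1.5, 1.6]])
W_V1 = np.array([[0.1, -0.1], [-0.2, 0.2], [0.3, -0.3], [-0.4, 0.4]])
head1, attn_weights1 = scaled_dot_product_attention(H_0 @ W_Q1, H_0 @ W_K1, H_0 @ W_V1)
print(np.round(attn_weights1, 4))   # ≈ [[0.0047 0.9953], [0.0005 0.9995]]
print(np.round(head1, 3))           # ≈ [[-0.601 0.601], [-0.601 0.601]]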
Head 2 (computed the same way; assume the result is):
head2 ≈ [
[0.150, -0.150],
[0.350, -0.350]
]
Concatenate the head outputs:
multi_head = concat([head1, head2]) ≈ [
[-0.601, 0.601, 0.150, -0.150],
[-0.601, 0.601, 0.350, -0.350]
]
W_O = [[0.1,0.2,0.3,0.4], [0.5,0.6,0.7,0.8], [0.9,1.0,1.1,1.2], [1.3,1.4,1.5,1.6]]
attn_output = multi_head @ W_O ≈ [
[-0.601*0.1+0.601*0.5+0.150*0.9-0.150*1.3, ..., ...],
[...]
] ≈ [
[0.180, 0.180, 0.180, 0.180],
[0.100, 0.100, 0.100, 0.100]
]
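A small sketch of the concatenation and output projection, reusing head1 from the computation above and the assumed head2 values:
import numpy as np
head1 = np.array([[-0.601, 0.601], [-0.601, 0.601]])   # computed above
head2 = np.array([[0.150, -0.150], [0.350, -0.350]])   # assumed values from the text
W_O = np.array([[0.1, 0.2, 0.3, 0.4],
                [0.5, 0.6, 0.7, 0.8],
                [0.9, 1.0, 1.1, 1.2],
                [1.3, 1.4, 1.5, 1.6]])
# Concatenate along the feature dimension (2 + 2 = d_model), then project back to d_model
multi_head = np.concatenate([head1, head2], axis=-1)
attn_output = multi_head @ W_O
print(np.round(attn_output, 3))   # ≈ [[0.18 0.18 0.18 0.18], [0.1 0.1 0.1 0.1]]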
Residual Connection and Layer Normalization
Residual connection:
residual1 = H_0 + attn_output ≈ [
[0.1+0.180, 1.2+0.180, 0.3+0.180, 1.4+0.180],
[1.3415+0.100, 1.1403+0.100, 0.7100+0.100, 1.8000+0.100]
] ≈ [
[0.280, 1.380, 0.480, 1.580],
[1.4415, 1.2403, 0.8100, 1.9000]
]
Layer normalization (simplified calculation):
mean = np.mean(residual1, axis=1) ≈ [0.930, 1.348]
std = np.std(residual1, axis=1) ≈ [0.559, 0.392]
gamma = [1, 1, 1, 1]
beta = [0, 0, 0, 0]
norm_output1 ≈ [
[(0.280-0.930)/0.559, ..., ...],
[...]
] ≈ [
[-1.163, 0.805, -0.805, 1.163],
[0.239, -0.275, -1.372, 1.408]
]
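A minimal layer-normalization sketch that reproduces these numbers (gamma and beta are the identity settings used above):
import numpy as np
def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each row to zero mean and unit variance, then scale and shift
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta
residual1 = np.array([[0.280, 1.380, 0.480, 1.580],
                      [1.4415, 1.2403, 0.8100, 1.9000]])
norm_output1 = layer_norm(residual1, gamma=np.ones(4), beta=np.zeros(4))
print(np.round(norm_output1, 3))   # ≈ [[-1.163 0.805 -0.805 1.163], [0.239 -0.275 -1.372 1.408]]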
Feed-Forward Network
First layer:
W1 = np.random.randn(4,8) # assumed random weights
b1 = np.zeros(8)
ffn_hidden = relu(norm_output1 @ W1 + b1) # assume the result is
≈ [
[0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2],
[0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1]
]
Second layer:
W2 = np.random.randn(8,4) # assumed random weights
b2 = np.zeros(4)
ffn_output = ffn_hidden @ W2 + b2 ≈ [
[0.22, 0.33, 0.44, 0.55],
[0.20, 0.30, 0.40, 0.50]
]
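Because W1 and W2 are drawn at random, the exact hidden values above cannot be reproduced; the sketch below only illustrates the position-wise feed-forward computation (two linear layers with a ReLU in between), with a seeded random generator as an assumption.
import numpy as np
def feed_forward(x, W1, b1, W2, b2):
    # FFN(x) = ReLU(x W1 + b1) W2 + b2, applied independently at each position
    hidden = np.maximum(0, x @ W1 + b1)
    return hidden @ W2 + b2
rng = np.random.default_rng(0)   # assumed seed, for reproducibility only
d_model, d_ff = 4, 8
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)
norm_output1 = np.array([[-1.163, 0.805, -0.805, 1.163],
                         [0.239, -0.275, -1.372, 1.408]])
ffn_output = feed_forward(norm_output1, W1, b1, W2, b2)
print(ffn_output.shape)   # (2, 4); the values depend on the random weights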
Second residual connection:
residual2 = norm_output1 + ffn_output ≈ [
[-1.163+0.22, 0.805+0.33, -0.805+0.44, 1.163+0.55],
[0.239+0.20, -0.275+0.30, -1.372+0.40, 1.408+0.50]
] ≈ [
[-0.943, 1.135, -0.365, 1.713],
[0.439, 0.025, -0.972, 1.908]
]
Final layer normalization:
H_1 ≈ [
[-1.231, 0.695, -0.695, 1.231],
[0.086, -0.314, -1.277, 1.505]
] # output of the first encoder layer
Step 3: Output Prediction (final decoder step)
Assume the final decoder output is:
decoder_output ≈ [
[0.1, 0.2, 0.3, 0.4], # first position
[0.5, 0.6, 0.7, 0.8] # second position
]
Output projection weights:
W_out = [
[0.1, 0.2, 0.3], # columns correspond to <pad>, cat, sits
[0.4, 0.5, 0.6],
[0.7, 0.8, 0.9],
[1.0, 1.1, 1.2]
]
Compute the logits:
logits = decoder_output @ W_out = [
[0.1*0.1+0.2*0.4+0.3*0.7+0.4*1.0, ..., ...],
[...]
] ≈ [
[0.70, 0.80, 0.90],
[1.58, 1.84, 2.10]
]
Softmax probabilities:
probs = softmax(logits) ≈ [
[e^0.70/(e^0.70+e^0.80+e^0.90), ..., ...],
[...]
] ≈ [
[0.301, 0.332, 0.367],
[0.251, 0.326, 0.423]
]
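The output projection and softmax can be checked with this short sketch:
import numpy as np
def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)
decoder_output = np.array([[0.1, 0.2, 0.3, 0.4],
                           [0.5, 0.6, 0.7, 0.8]])
W_out = np.array([[0.1, 0.2, 0.3],
                  [0.4, 0.5, 0.6],
                  [0.7, 0.8, 0.9],
                  [1.0, 1.1, 1.2]])
logits = decoder_output @ W_out
probs = softmax(logits)
print(np.round(logits, 2))   # [[0.7 0.8 0.9], [1.58 1.84 2.1]]
print(np.round(probs, 3))    # ≈ [[0.301 0.332 0.367], [0.251 0.326 0.423]]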
Summary of the Full Pipeline
Input processing
- Convert "cat" and "sits" into word vectors
- Add positional encodings to obtain the initial representation H_0
Encoder computation
- Multi-head attention (two heads)
- Residual connection and layer normalization
- Feed-forward network transformation
- A second residual connection and normalization to obtain H_1
Output prediction
- Assume a decoder output
- Linear projection to the vocabulary size
- Softmax to obtain a probability distribution over the vocabulary
This simplified example shows the complete computation flow of a Transformer from input to output. Although the dimensions are scaled down, all of the key computational steps and the underlying mathematics are preserved. In practice, d_model is typically 512 or 1024 and models have 6 to 12 layers, so the amount of computation is far larger, but the basic principles are the same.