In 《手撕Transformer!!从每一模块原理讲解到代码实现【超详细!】》, the principle and code implementation of each Transformer module are explained in detail.
### Positional Encoding
Positional encoding supplies each position in the input sequence with position information, since the Transformer itself has no built-in notion of token order. The idea is to generate sine and cosine waves at different frequencies from a fixed formula and use them to represent positions. A possible implementation looks like this:
```python
import torch
import torch.nn as nn
import math

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, d_model)
        # Column vector of positions: shape (max_len, 1)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # Frequency term for each even dimension index
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions: sine
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions: cosine
        # Registered as a buffer: saved with the module but not a trainable parameter
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        # Add the encoding for the first x.size(1) positions to the input
        x = x + self.pe[:, :x.size(1)]
        return x
```
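As a quick sanity check, the same table can be built functionally (the `sinusoidal_table` helper below is illustrative, not part of the module above): at position 0, every sine column is exactly 0 and every cosine column is exactly 1.

```python
import math
import torch

def sinusoidal_table(max_len, d_model):
    # Same formula as PositionalEncoding, without the nn.Module wrapper
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_table(max_len=10, d_model=8)
# Position 0: sin(0) = 0 in even columns, cos(0) = 1 in odd columns
```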
### Multi-Head Attention
Multi-head attention lets the model attend to different parts of the input sequence in parallel, across different representation subspaces. The input is processed by several attention heads, and the results are concatenated and passed through a linear transformation. Example code:
```python
import torch
import torch.nn as nn
import math  # needed for math.sqrt in the scaling step

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, d_model):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0  # d_model must split evenly across heads
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)
        # Project, then reshape to (batch, num_heads, seq_len, d_k)
        Q = self.W_q(Q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(K).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(V).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            # Masked positions get a large negative score, i.e. ~0 weight after softmax
            scores = scores.masked_fill(mask == 0, -1e9)
        attention = torch.softmax(scores, dim=-1)
        out = torch.matmul(attention, V)
        # Concatenate heads back to (batch, seq_len, d_model) and project
        out = out.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)
        return self.W_o(out)
```
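The core of the forward pass above is scaled dot-product attention. A standalone sketch with made-up sizes shows how the tensors flow and that the softmax produces a proper weighting (each row of attention weights sums to 1):

```python
import math
import torch

torch.manual_seed(0)
batch, heads, seq, d_k = 2, 4, 5, 8  # arbitrary toy sizes
Q = torch.randn(batch, heads, seq, d_k)
K = torch.randn(batch, heads, seq, d_k)
V = torch.randn(batch, heads, seq, d_k)

# (batch, heads, seq, seq): one score per query/key pair
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
attention = torch.softmax(scores, dim=-1)  # rows sum to 1
out = torch.matmul(attention, V)           # back to (batch, heads, seq, d_k)
```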
### Feed-Forward Network (FeedForward) and Layer Normalization (NormLayer)
- **FeedForward module**: The feed-forward network usually consists of two linear layers with an activation function in between, applying a further non-linear transformation to the multi-head attention output. Code walkthrough:
```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForward, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)  # expand to the hidden size d_ff
        self.fc2 = nn.Linear(d_ff, d_model)  # project back to d_model
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))
```
- **NormLayer module**: Layer normalization normalizes the features of each sample, which helps stabilize training. Code walkthrough:
```python
import torch
import torch.nn as nn

class NormLayer(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super(NormLayer, self).__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))  # learnable scale
        self.beta = nn.Parameter(torch.zeros(d_model))  # learnable shift
        self.eps = eps

    def forward(self, x):
        # Normalize over the feature dimension; eps avoids division by zero
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True)
        return self.gamma * (x - mean) / (std + self.eps) + self.beta
```
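The arithmetic in `forward` can be checked in isolation. A minimal sketch (random data, made-up shapes) confirms that subtracting the mean and dividing by the standard deviation leaves each feature vector with roughly zero mean and unit spread. One subtlety worth knowing: `torch.std` is unbiased by default (it divides by n - 1), so this module's output differs slightly from `nn.LayerNorm`, which uses the population variance with `eps` inside the square root.

```python
import torch

torch.manual_seed(0)
x = torch.randn(2, 3, 16)  # (batch, seq_len, d_model), toy sizes
mean = x.mean(dim=-1, keepdim=True)
std = x.std(dim=-1, keepdim=True)  # unbiased by default
y = (x - mean) / (std + 1e-6)
# Each feature vector of y now has ~zero mean and ~unit standard deviation
```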
### Encoder
- **Encoder class**: The encoder is a stack of EncoderLayer blocks that encodes the input sequence.
```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, num_layers, d_model, num_heads, d_ff):
        super(Encoder, self).__init__()
        # num_layers identical EncoderLayer blocks applied in sequence
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff) for _ in range(num_layers)])

    def forward(self, x, mask):
        for layer in self.layers:
            x = layer(x, mask)
        return x
```
- **EncoderLayer class**: Each EncoderLayer contains two sub-layers, multi-head attention and a feed-forward network, with layer normalization applied after each sub-layer.
```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(num_heads, d_model)
        self.feed_forward = FeedForward(d_model, d_ff)
        self.norm1 = NormLayer(d_model)
        self.norm2 = NormLayer(d_model)

    def forward(self, x, mask):
        # Sub-layer 1: self-attention with a residual connection, then LayerNorm
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + attn_output)
        # Sub-layer 2: feed-forward network with a residual connection, then LayerNorm
        ff_output = self.feed_forward(x)
        x = self.norm2(x + ff_output)
        return x
```
- **Forward pass**: The input sequence is first positionally encoded, then processed by the EncoderLayer blocks in turn.
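The `mask` argument threaded through the encoder is never constructed in the snippets above. Under the convention used in `MultiHeadAttention` (positions where `mask == 0` are suppressed), a source padding mask can be built as follows; the `pad_idx` value and the token ids here are made up for illustration:

```python
import torch

pad_idx = 0  # assumed id of the padding token
src = torch.tensor([[5, 7, 2, 0, 0],
                    [3, 1, 4, 9, 0]])  # (batch, src_len), 0s are padding
# Shape (batch, 1, 1, src_len) so it broadcasts over heads and query positions
src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)
```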
### Decoder
- **Decoder class**: The decoder is likewise a stack of DecoderLayer blocks; it generates the target sequence conditioned on the encoder output.
```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, num_layers, d_model, num_heads, d_ff):
        super(Decoder, self).__init__()
        # num_layers identical DecoderLayer blocks applied in sequence
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff) for _ in range(num_layers)])

    def forward(self, x, enc_output, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, enc_output, src_mask, tgt_mask)
        return x
```
- **DecoderLayer class**: Each DecoderLayer contains three sub-layers, self-attention, encoder-decoder (cross) attention, and a feed-forward network, with layer normalization applied after each sub-layer.
```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(num_heads, d_model)
        self.cross_attn = MultiHeadAttention(num_heads, d_model)
        self.feed_forward = FeedForward(d_model, d_ff)
        self.norm1 = NormLayer(d_model)
        self.norm2 = NormLayer(d_model)
        self.norm3 = NormLayer(d_model)

    def forward(self, x, enc_output, src_mask, tgt_mask):
        # Sub-layer 1: masked self-attention over the target sequence
        attn_output1 = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + attn_output1)
        # Sub-layer 2: cross-attention, queries from the decoder, keys/values from the encoder
        attn_output2 = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + attn_output2)
        # Sub-layer 3: feed-forward network
        ff_output = self.feed_forward(x)
        x = self.norm3(x + ff_output)
        return x
```
- **Forward pass**: The target sequence is first positionally encoded, then processed by the DecoderLayer blocks in turn, attending to the encoder output via cross-attention.
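The `tgt_mask` used by the decoder's self-attention must prevent each position from seeing later positions. A causal (subsequent) mask, again following the `mask == 0` means suppressed convention, can be built with a lower-triangular matrix; the length here is arbitrary:

```python
import torch

tgt_len = 4  # toy target length
# Lower-triangular: position i may only attend to positions <= i
causal = torch.tril(torch.ones(tgt_len, tgt_len, dtype=torch.bool))
# Shape (1, 1, tgt_len, tgt_len) broadcasts over batch and heads
tgt_mask = causal.unsqueeze(0).unsqueeze(1)
```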
### Overall Transformer Architecture
```python
import torch
import torch.nn as nn

class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff):
        super(Transformer, self).__init__()
        self.encoder = Encoder(num_layers, d_model, num_heads, d_ff)
        self.decoder = Decoder(num_layers, d_model, num_heads, d_ff)
        self.src_embedding = nn.Embedding(src_vocab_size, d_model)
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model)
        # Final projection from d_model to target-vocabulary logits
        self.fc = nn.Linear(d_model, tgt_vocab_size)

    def forward(self, src, tgt, src_mask, tgt_mask):
        # Embed tokens and add positional information
        src_embedded = self.positional_encoding(self.src_embedding(src))
        tgt_embedded = self.positional_encoding(self.tgt_embedding(tgt))
        enc_output = self.encoder(src_embedded, src_mask)
        dec_output = self.decoder(tgt_embedded, enc_output, src_mask, tgt_mask)
        output = self.fc(dec_output)
        return output
```
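To tie the pieces together, a shape walkthrough with toy sizes (all numbers below are made up) shows what flows in and out of the embedding and final projection steps; a random tensor stands in for the decoder output so the sketch stays self-contained:

```python
import torch
import torch.nn as nn

src_vocab, tgt_vocab, d_model = 100, 120, 32  # toy vocabulary and model sizes
src = torch.randint(0, src_vocab, (2, 7))  # (batch, src_len) token ids
tgt = torch.randint(0, tgt_vocab, (2, 5))  # (batch, tgt_len) token ids

# Embedding maps ids to vectors: (batch, src_len) -> (batch, src_len, d_model)
src_emb = nn.Embedding(src_vocab, d_model)(src)

# The final linear layer maps decoder output to per-position vocabulary logits
fc = nn.Linear(d_model, tgt_vocab)
dec_output = torch.randn(2, 5, d_model)  # stand-in for the decoder output
logits = fc(dec_output)  # (batch, tgt_len, tgt_vocab)
```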