Sure, here is a Self-Attention module implemented with the PyTorch framework:
```python
import torch
import torch.nn as nn
class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads
        assert (self.head_dim * heads == embed_size), "Embed size needs to be divisible by heads"
        self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.fc_out = nn.Linear(heads * self.head_dim, embed_size)

    def forward(self, values, keys, query, mask):
        # Get number of training examples
        N = query.shape[0]
        # Split the embedding into self.heads pieces
        values = values.reshape(N, -1, self.heads, self.head_dim)
        keys = keys.reshape(N, -1, self.heads, self.head_dim)
        query = query.reshape(N, -1, self.heads, self.head_dim)
        # Apply the per-head linear projections; shapes stay [N, seq_len, heads, head_dim]
        values = self.values(values)
        keys = self.keys(keys)
        query = self.queries(query)
        # Calculate dot product attention [N, self.heads, query_len, key_len]
        energy = torch.einsum("nqhd,nkhd->nhqk", [query, keys])
        if mask is not None:
            # Apply mask
            energy = energy.masked_fill(mask == 0, float("-1e20"))
        # Apply softmax to get attention weights
        attention = torch.softmax(energy / (self.embed_size ** (1/2)), dim=3)
        # Multiply attention weights with values to get context vector
        out = torch.einsum("nhql,nlhd->nqhd", [attention, values]).reshape(N, -1, self.heads * self.head_dim)
        # Apply final linear layer to get the output of self-attention layer
        out = self.fc_out(out)
        return out
```
Here is a block-by-block explanation of the code:
```python
import torch
import torch.nn as nn
```
Import PyTorch and its neural network module.
```python
class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads
        assert (self.head_dim * heads == embed_size), "Embed size needs to be divisible by heads"
```
Define the SelfAttention class, where `embed_size` is the dimensionality of the token embeddings, `heads` is the number of attention heads, and `head_dim` is the dimensionality of each head. In the constructor, `embed_size` and `heads` are stored as attributes and the per-head dimension is computed. Finally, an assertion ensures that `embed_size` is divisible by `heads`, which is a requirement of multi-head self-attention.
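As a quick aside, here is how the dimension split works in practice; the sizes below (256 dimensions, 8 heads) are purely illustrative and not part of the module itself:

```python
# Illustrative values only: 256-dimensional embeddings split across 8 heads.
attention = SelfAttention(embed_size=256, heads=8)
print(attention.head_dim)  # 32, since 256 // 8 == 32

# SelfAttention(embed_size=256, heads=6) would raise the AssertionError,
# because 256 is not divisible by 6.
```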
```python
self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)
self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False)
self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False)
self.fc_out = nn.Linear(heads * self.head_dim, embed_size)
```
Initialize the linear layers of the SelfAttention class. The `values`, `keys`, and `queries` layers are per-head projections that map each head's slice from `head_dim` to `head_dim`, producing the value, key, and query representations used to compute the attention weights. `fc_out` is the final linear layer that maps the concatenated outputs of all heads (`heads * head_dim`) back to `embed_size`, producing the module's final result.
```python
    def forward(self, values, keys, query, mask):
        # Get number of training examples
        N = query.shape[0]
        # Split the embedding into self.heads pieces
        values = values.reshape(N, -1, self.heads, self.head_dim)
        keys = keys.reshape(N, -1, self.heads, self.head_dim)
        query = query.reshape(N, -1, self.heads, self.head_dim)
```
In the forward pass, `reshape` splits the embedding dimension of `values`, `keys`, and `query` into `self.heads` pieces, giving each tensor the shape `[N, seq_len, heads, head_dim]`. Keeping the per-head dimension last is what allows the attention weights to be computed with `einsum` later on.
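A minimal shape check, again using only illustrative sizes (batch of 4, sequence length 10, the 256-dimensional, 8-head configuration from above):

```python
import torch

# Illustrative sizes only: batch of 4 sequences of length 10 with 256-dim embeddings.
N, seq_len, embed_size, heads = 4, 10, 256, 8
head_dim = embed_size // heads  # 32

x = torch.randn(N, seq_len, embed_size)
x_split = x.reshape(N, -1, heads, head_dim)
print(x_split.shape)  # torch.Size([4, 10, 8, 32])
```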
```python
# Apply the per-head linear projections; shapes stay [N, seq_len, heads, head_dim]
values = self.values(values)
keys = self.keys(keys)
query = self.queries(query)
```
Pass `values`, `keys`, and `query` through their linear layers. `nn.Linear` operates on the last dimension, so each head's `head_dim`-sized slice is projected independently and the shape `[N, seq_len, heads, head_dim]` is preserved; the attention weights can now be computed with `einsum`.
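To see why the shape is preserved, here is a small sketch with hypothetical sizes (not part of the module):

```python
import torch
import torch.nn as nn

# nn.Linear acts on the trailing dimension, so a [N, seq_len, heads, head_dim]
# tensor keeps its shape after the per-head projection.
proj = nn.Linear(32, 32, bias=False)  # head_dim -> head_dim, as in self.keys
x = torch.randn(4, 10, 8, 32)         # [N, seq_len, heads, head_dim]
print(proj(x).shape)                  # torch.Size([4, 10, 8, 32])
```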
```python
# Calculate dot product attention [N, self.heads, query_len, key_len]
energy = torch.einsum("nqhd,nkhd->nhqk", [query, keys])
```
Compute the dot-product attention scores with `einsum`. The dot products between `query` and `keys` form a similarity matrix (`energy`) of shape `[N, heads, query_len, key_len]`, which is later passed through softmax to obtain the attention weights.
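This einsum is equivalent to moving the head dimension forward and batch-multiplying the queries against the transposed keys; a small check with illustrative shapes:

```python
import torch

# Illustrative shapes only: [N, seq_len, heads, head_dim]
q = torch.randn(4, 10, 8, 32)
k = torch.randn(4, 12, 8, 32)

energy = torch.einsum("nqhd,nkhd->nhqk", [q, k])           # [N, heads, q_len, k_len]
reference = q.permute(0, 2, 1, 3) @ k.permute(0, 2, 3, 1)  # same computation via matmul
print(energy.shape)                                  # torch.Size([4, 8, 10, 12])
print(torch.allclose(energy, reference, atol=1e-5))  # True
```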
```python
if mask is not None:
    # Apply mask
    energy = energy.masked_fill(mask == 0, float("-1e20"))
```
If a mask is provided, positions where the mask is 0 are filled with a very large negative number (effectively negative infinity), so that they receive zero weight after the softmax.
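One common way to build such a mask (an illustrative padding mask, not something the module constructs for you) is to mark real tokens with 1 and padding with 0, then add singleton dimensions so it broadcasts against `energy`'s `[N, heads, query_len, key_len]` shape:

```python
import torch

# Illustrative padding mask: batch of 2 sequences of length 5, where the
# second sequence has its last two positions padded (0 is the padding id here).
token_ids = torch.tensor([[5, 7, 2, 9, 4],
                          [3, 8, 6, 0, 0]])
mask = (token_ids != 0).unsqueeze(1).unsqueeze(2)  # [N, 1, 1, key_len]

# mask broadcasts against energy of shape [N, heads, query_len, key_len],
# so masked_fill suppresses attention to the padded key positions.
print(mask.shape)  # torch.Size([2, 1, 1, 5])
```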
```python
# Apply softmax to get attention weights
attention = torch.softmax(energy / (self.embed_size ** (1/2)), dim=3)
```
Apply softmax along the key dimension (`dim=3`) to turn the scaled similarity scores into attention weights. Note that this implementation scales the scores by the square root of `embed_size`, whereas the original Transformer formulation scales by the square root of the per-head key dimension (`head_dim`).
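As a small sanity check (illustrative sizes again), the softmax normalizes over the key dimension, so the weights for each query position sum to one:

```python
import torch

# Illustrative scores of shape [N, heads, query_len, key_len]
energy = torch.randn(4, 8, 10, 12)
attention = torch.softmax(energy / (256 ** 0.5), dim=3)  # scaled as in the module, with embed_size=256
print(torch.allclose(attention.sum(dim=3), torch.ones(4, 8, 10)))  # True
```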
```python
# Multiply attention weights with values to get context vector
out = torch.einsum("nhql,nlhd->nqhd", [attention, values]).reshape(N, -1, self.heads * self.head_dim)
```
Use `einsum` to multiply the attention weights with `values`, producing a context tensor of shape `[N, query_len, heads, head_dim]`. The `reshape` then concatenates the outputs of all heads into the last dimension (`heads * head_dim`) so they can be passed to the final linear layer.
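A shape trace with the same illustrative sizes as before:

```python
import torch

# Illustrative shapes: attention is [N, heads, q_len, k_len], values is [N, k_len, heads, head_dim].
attention = torch.softmax(torch.randn(4, 8, 10, 12), dim=3)
values = torch.randn(4, 12, 8, 32)

context = torch.einsum("nhql,nlhd->nqhd", [attention, values])  # [N, q_len, heads, head_dim]
concatenated = context.reshape(4, -1, 8 * 32)                   # [N, q_len, heads * head_dim]
print(context.shape, concatenated.shape)
# torch.Size([4, 10, 8, 32]) torch.Size([4, 10, 256])
```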
```python
# Apply final linear layer to get the output of self-attention layer
out = self.fc_out(out)
return out
```
Apply the final linear layer `fc_out` to the concatenated head outputs, mapping them from `heads * head_dim` back to `embed_size`, which gives the final output of the Self-Attention module.
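Putting it all together, here is a minimal usage sketch of the module defined above; the tensor sizes are illustrative, and for plain self-attention the same tensor is passed as values, keys, and query:

```python
import torch

# Illustrative end-to-end check of the SelfAttention module defined above.
embed_size, heads = 256, 8
attention = SelfAttention(embed_size, heads)

x = torch.randn(4, 10, embed_size)   # [N, seq_len, embed_size]
out = attention(x, x, x, mask=None)  # self-attention: values = keys = query = x
print(out.shape)                     # torch.Size([4, 10, 256])
```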