Attention-----using the correct version

The post points out that the XP SP1 DDK does not work under SP2; in an XP SP2 environment it is better to use the Windows Server 2003 Service Pack 1 DDK. Microsoft's official website has a note on this, and the URL is given in the post.

Here is an implementation of a Self-Attention module based on the PyTorch framework:

```python
import torch
import torch.nn as nn


class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        assert (self.head_dim * heads == embed_size), "Embed size needs to be divisible by heads"

        self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.fc_out = nn.Linear(heads * self.head_dim, embed_size)

    def forward(self, values, keys, query, mask):
        # Get number of training examples
        N = query.shape[0]

        # Split the embedding into self.heads pieces: [N, seq_len, heads, head_dim]
        values = values.reshape(N, -1, self.heads, self.head_dim)
        keys = keys.reshape(N, -1, self.heads, self.head_dim)
        query = query.reshape(N, -1, self.heads, self.head_dim)

        # Project each head independently (shapes stay [N, seq_len, heads, head_dim])
        values = self.values(values)
        keys = self.keys(keys)
        query = self.queries(query)

        # Calculate dot-product attention scores: [N, heads, query_len, key_len]
        energy = torch.einsum("nqhd,nkhd->nhqk", [query, keys])

        if mask is not None:
            # Mask out positions that must not be attended to
            energy = energy.masked_fill(mask == 0, float("-1e20"))

        # Apply softmax to get attention weights
        attention = torch.softmax(energy / (self.embed_size ** (1 / 2)), dim=3)

        # Multiply attention weights with values and concatenate the heads
        out = torch.einsum("nhql,nlhd->nqhd", [attention, values]).reshape(
            N, -1, self.heads * self.head_dim
        )

        # Apply the final linear layer to get the output of the self-attention layer
        out = self.fc_out(out)
        return out
```

A step-by-step explanation follows:

```python
import torch
import torch.nn as nn
```

Import PyTorch and its neural-network module.

```python
class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        assert (self.head_dim * heads == embed_size), "Embed size needs to be divisible by heads"
```

Define the SelfAttention class, where `embed_size` is the dimension of the word embeddings, `heads` is the number of attention heads, and `head_dim` is the dimension of each head. In the constructor we store `embed_size` and `heads` as attributes and compute the per-head dimension. The assertion ensures that `embed_size` is divisible by `heads`, which this multi-head layout requires.

```python
        self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.fc_out = nn.Linear(heads * self.head_dim, embed_size)
```

Initialize the linear layers. The `values`, `keys`, and `queries` layers each project a head's features from `head_dim` to `head_dim` (without bias), producing the vectors from which the attention weights are computed. `fc_out` is the final linear layer that takes the concatenated outputs of all heads and maps them back to `embed_size`.

```python
    def forward(self, values, keys, query, mask):
        # Get number of training examples
        N = query.shape[0]

        # Split the embedding into self.heads pieces: [N, seq_len, heads, head_dim]
        values = values.reshape(N, -1, self.heads, self.head_dim)
        keys = keys.reshape(N, -1, self.heads, self.head_dim)
        query = query.reshape(N, -1, self.heads, self.head_dim)
```

In the forward pass we first split `values`, `keys`, and `query` into `heads` pieces by reshaping them to `[N, seq_len, heads, head_dim]`. Keeping the per-head dimension last lets us compute the attention weights with `einsum`.

```python
        # Project each head independently (shapes stay [N, seq_len, heads, head_dim])
        values = self.values(values)
        keys = self.keys(keys)
        query = self.queries(query)
```

Pass `values`, `keys`, and `query` through their linear layers. Each head is projected independently, so the shapes do not change.

```python
        # Calculate dot-product attention scores: [N, heads, query_len, key_len]
        energy = torch.einsum("nqhd,nkhd->nhqk", [query, keys])
```
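As a quick sanity check of the shapes involved, here is a minimal sketch (the tensor sizes are made up purely for illustration) that runs the same `einsum` contraction on random tensors and confirms the `[N, heads, query_len, key_len]` result:

```python
import torch

# Hypothetical sizes, chosen only for illustration
N, seq_len, heads, head_dim = 2, 5, 8, 32

query = torch.randn(N, seq_len, heads, head_dim)
keys = torch.randn(N, seq_len, heads, head_dim)

# Same contraction as in the module: dot products over head_dim
energy = torch.einsum("nqhd,nkhd->nhqk", [query, keys])
print(energy.shape)  # torch.Size([2, 8, 5, 5]) -> [N, heads, query_len, key_len]
```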
Compute the dot-product attention scores with `einsum`: the dot products of `query` and `keys` form a similarity matrix `energy` of shape `[N, heads, query_len, key_len]`, which is later passed through softmax to obtain the attention weights.

```python
        if mask is not None:
            # Mask out positions that must not be attended to
            energy = energy.masked_fill(mask == 0, float("-1e20"))
```

If a mask is given, set the masked positions of the similarity matrix to a very large negative value, so that they receive effectively zero weight after softmax.

```python
        # Apply softmax to get attention weights
        attention = torch.softmax(energy / (self.embed_size ** (1 / 2)), dim=3)
```

Apply softmax over the key dimension to turn the similarity matrix into attention weights. Note that this implementation scales the scores by `sqrt(embed_size)`; the original Transformer paper scales by `sqrt(d_k)`, i.e. `head_dim`.

```python
        # Multiply attention weights with values and concatenate the heads
        out = torch.einsum("nhql,nlhd->nqhd", [attention, values]).reshape(
            N, -1, self.heads * self.head_dim
        )
```

Use `einsum` to multiply the attention weights with `values`, giving the per-head context vectors, then `reshape` to concatenate the outputs of all heads so they can be fed to the final linear layer.

```python
        # Apply the final linear layer to get the output of the self-attention layer
        out = self.fc_out(out)
        return out
```

The `reshape` above has already concatenated the heads; the final linear layer `fc_out` mixes them and produces the output of the Self-Attention module.
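For completeness, here is a minimal usage sketch (batch size, sequence length, and dimensions are again made up) that instantiates the module above and runs self-attention, passing the same tensor as values, keys, and query:

```python
import torch

# Hypothetical sizes, chosen only for illustration
embed_size, heads = 256, 8
N, seq_len = 2, 10

attention = SelfAttention(embed_size, heads)
x = torch.randn(N, seq_len, embed_size)

# In self-attention, values, keys, and query all come from the same input
out = attention(x, x, x, mask=None)
print(out.shape)  # torch.Size([2, 10, 256]) -> same shape as the input
```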