Query-Key Normalization for Transformers

Cosine-similarity attention
Paper: Query-Key Normalization for Transformers
Introduction: Low-resource language translation is a challenging but socially valuable NLP task. Building on recent work that adapts Transformer normalization to this setting, the authors propose QKNorm, a normalization technique that modifies the attention mechanism so that the softmax function is less susceptible to arbitrary saturation without sacrificing expressivity. Specifically, instead of dividing by the square root of the embedding dimension, they apply ℓ2 normalization to the query and key matrices along the head dimension before multiplying them, and then scale the result by a learnable parameter.
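The core idea can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' released implementation: the tensor layout `(batch, heads, seq_len, head_dim)`, the function name `qknorm_attention`, and the initial value of the learnable scale `g` are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def qknorm_attention(q, k, v, g):
    """Attention with l2-normalized queries and keys (QKNorm-style).

    q, k, v: (batch, heads, seq_len, head_dim); g: learnable scalar.
    """
    # l2-normalize queries and keys along the per-head feature dimension
    q = F.normalize(q, p=2, dim=-1)
    k = F.normalize(k, p=2, dim=-1)
    # Cosine-similarity scores, scaled by a learnable parameter g
    # instead of dividing by sqrt(head_dim)
    scores = g * torch.matmul(q, k.transpose(-2, -1))
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, v)

# Example usage (shapes and the initial value of g are illustrative)
q = k = v = torch.randn(2, 8, 16, 64)
g = torch.nn.Parameter(torch.tensor(10.0))
out = qknorm_attention(q, k, v, g)  # -> (2, 8, 16, 64)
```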

### Key Components and Concepts in the Transformer Architecture

Transformers rely heavily on self-attention, which is a critical part of their design[^1]. Self-attention lets the model weigh the significance of each word in a sentence relative to every other word, enabling effective processing of sequential data without being constrained by fixed-length context windows. The architecture also uses multi-head attention layers, which let the model jointly attend to information from different representation subspaces at different positions; each head learns distinct patterns, leading to richer representations overall.

Positional encodings are added to the input embeddings because the self-attention layer does not inherently capture positional relationships between tokens. These encodings supply the ordering information the network needs to understand where each element appears in the sequence. Normalization techniques such as Layer Normalization also play an essential role: they stabilize training dynamics across the stacked transformer blocks, keeping performance consistent in deep networks. A feed-forward network follows each normalization step, providing the non-linearity required to learn complex mappings between inputs and outputs.

```python
import torch.nn as nn


class TransformerBlock(nn.Module):
    def __init__(self, embed_size, heads, dropout, forward_expansion):
        super(TransformerBlock, self).__init__()
        # MultiHeadAttention is assumed to be defined elsewhere
        # (a standard multi-head self-attention module).
        self.attention = MultiHeadAttention(embed_size, heads)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)
        # Position-wise feed-forward network with an expansion factor
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, forward_expansion * embed_size),
            nn.ReLU(),
            nn.Linear(forward_expansion * embed_size, embed_size),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, value, key, query, mask):
        attention = self.attention(value, key, query, mask)
        # Add & Norm (residual connection around attention)
        x = self.dropout(self.norm1(attention + query))
        forward = self.feed_forward(x)
        # Add & Norm (residual connection around the feed-forward network)
        out = self.dropout(self.norm2(forward + x))
        return out
```
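The positional encodings mentioned above are commonly the sinusoidal scheme from the original Transformer paper. The short sketch below assumes an even `embed_size` and that the result is simply added to the token embeddings before the first block.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, embed_size):
    # Standard sinusoidal positional encodings (assumes embed_size is even)
    position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, embed_size, 2, dtype=torch.float)
        * (-math.log(10000.0) / embed_size)
    )
    pe = torch.zeros(seq_len, embed_size)
    pe[:, 0::2] = torch.sin(position * div_term)  # even feature indices
    pe[:, 1::2] = torch.cos(position * div_term)  # odd feature indices
    return pe  # shape: (seq_len, embed_size), added to token embeddings
```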