Transformer from Scratch: A NumPy Implementation

AlgorithmWillBeFine

License: CC 4.0 BY-SA

Tags: transformer, numpy, deep learning, artificial intelligence, computer vision, nlp

Article link: https://blog.youkuaiyun.com/weixin_44491772/article/details/132351298

Attention is all you need

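In a Transformer, the input first passes through an embedding layer to obtain an embedding for each word. Positional encoding is then added to give each word its final representation. From that final representation, we generate Q (Query), K (Key), and V (Value) for each input so that the attention weights can be computed.

The steps are as follows:

Word embedding: We start with an embedding matrix of shape (vocab_size, d_model), where vocab_size is the vocabulary size and d_model is the model dimension. Each word in the input sentence, represented as an integer index into the vocabulary, is mapped to its embedding by looking up the corresponding row of this matrix.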
```
embeddings = embedding_matrix[input_sentence]
```
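Positional encoding: Next, we add the positional encoding, an array with the same shape as the embeddings. This step gives the model information about word positions, because the Transformer itself has no inherent notion of position.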
```
embeddings += positional_encoding
```
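The snippet above does not say how `positional_encoding` is built. One common choice is the sinusoidal encoding from the "Attention Is All You Need" paper. The following is a minimal sketch under that assumption; the function name `sinusoidal_positional_encoding` and the sizes used are illustrative and do not come from the original article.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) array of sinusoidal position encodings."""
    positions = np.arange(seq_len)[:, np.newaxis]   # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]        # (1, d_model)
    # Dimensions 2i and 2i+1 share the frequency 1 / 10000^(2i / d_model).
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                # (seq_len, d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])     # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])     # odd dimensions use cosine
    return encoding

# Illustrative sizes: a 4-word sentence with model dimension 8.
positional_encoding = sinusoidal_positional_encoding(seq_len=4, d_model=8)
print(positional_encoding.shape)  # (4, 8)
```

Because the encoding depends only on the position and the dimension index, it can be computed once and reused for every sentence of the same length.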
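Generating Q, K, and V: Next, we pass the result (word embedding + positional encoding) through three separate fully connected (dense) layers to obtain Q, K, and V. In the code below, each layer is a plain matrix multiplication without a bias term.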
```
Q = np.dot(embeddings, WQ)
K = np.dot(embeddings, WK)
V = np.dot(embeddings, WV)
```
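Here WQ, WK, and WV are three weight matrices; they are parameters that the model learns.

In short, the input is first converted to embeddings, positional encoding is added, and three separate dense layers then produce Q, K, and V.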
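To make the shapes concrete, the three steps can be run end to end as a small sketch. Everything numeric here is an assumption for illustration: the sizes, the token ids, and the random values, which stand in for learned parameters and for a real positional encoding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and token ids (not from the original article).
vocab_size, d_model, d_k = 10, 8, 4
input_sentence = np.array([3, 1, 7])     # a 3-word sentence as vocabulary indices
seq_len = len(input_sentence)

# Random placeholders for what would be learned parameters in a real model.
embedding_matrix = rng.normal(size=(vocab_size, d_model))
positional_encoding = rng.normal(size=(seq_len, d_model))  # stand-in for a real encoding
WQ = rng.normal(size=(d_model, d_k))
WK = rng.normal(size=(d_model, d_k))
WV = rng.normal(size=(d_model, d_k))

# Step 1: row lookup. Integer-array indexing returns a copy, so the
# in-place addition below does not modify embedding_matrix.
embeddings = embedding_matrix[input_sentence]   # (seq_len, d_model)
# Step 2: add position information.
embeddings += positional_encoding
# Step 3: project to queries, keys, and values.
Q = embeddings @ WQ
K = embeddings @ WK
V = embeddings @ WV
print(Q.shape, K.shape, V.shape)  # (3, 4) (3, 4) (3, 4)
```

Each of Q, K, and V has one row per word in the sentence, which is the form the attention computation expects.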

Scaled Dot-Product Attention
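
For reference, scaled dot-product attention as defined in the "Attention Is All You Need" paper can be written as follows, where $d_k$ is the dimension of each key vector and the softmax is taken over each row (that is, over the keys for a given query):

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with very small gradients.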

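```
import numpy as np

def scaled_dot_product_attention(q, k, v, mask=None):
    """Compute softmax(q k^T / sqrt(d_k)) v and return (output, attention_weights)."""
    # Transpose only the last two axes so batched (3-D) inputs also work.
    matmul_qk = np.matmul(q, np.swapaxes(k, -1, -2))
    d_k = k.shape[-1]
    scaled_attention_logits = matmul_qk / np.sqrt(d_k)

    if mask is not None:
        # mask is 1 at positions to hide; a large negative logit gives them ~0 weight.
        scaled_attention_logits = scaled_attention_logits + (mask * -1e9)

    # Subtract the row-wise max before exponentiating for numerical stability.
    exp_logits = np.exp(scaled_attention_logits - np.max(scaled_attention_logits, axis=-1, keepdims=True))
    attention_weights = exp_logits / np.sum(exp_logits, axis=-1, keepdims=True)
    output = np.matmul(attention_weights, v)
    return output, attention_weights
```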

# Test
d_k = 3
batch_size = 1

# Query: what we are looking up. Here it is a single sentence of 3 words, each with embedding dimension 3.
q = np.array([[1, 0, 1], [0, 2, 0], [1, 1, 0]])

# Key: what the queries are matched against
k = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 1]</