Let's Talk About How Q, K, and V Are Computed

架构进化论

Published 2025-05-31 15:06:11

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/jsntghf/article/details/148352064

Before starting, you may want to review the earlier post: 初探注意力机制 (A First Look at the Attention Mechanism).

Here is a simple example to build intuition for Q, K, and V.

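Q (Query): the question you are asking, e.g. "What should I eat tonight?"
K (Key): the dish names on the menu (the keywords), e.g. "braised pork" or "stir-fried greens".
V (Value): what each dish actually is (the concrete information), e.g. "braised pork = rich but not greasy".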

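The attention mechanism works like this: match your question (Q) against the menu (K) to find the most relevant dishes, then return their descriptions (V), weighted by how well each one matches.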

Attention is computed with the scaled dot-product formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V=\text{softmax}\left(\frac{\begin{bmatrix} q_1^\top \\ q_2^\top \\\vdots \\ q_n^\top \\\end{bmatrix} \begin{bmatrix} k_1 & k_2 & \cdots & k_m \\\end{bmatrix}}{\sqrt{d_k}}\right) \begin{bmatrix} v_1^\top \\ v_2^\top \\\vdots \\ v_m^\top \\\end{bmatrix}$$
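
Read row by row, the matrix formula says that the output for the i-th query is a weighted average of the value vectors, where the weights come from a softmax over that query's scaled dot products. The weight symbol α_ij and the name output_i below are notation introduced here for illustration, not part of the original formula:

$$\alpha_{ij} = \frac{\exp\left(q_i^\top k_j / \sqrt{d_k}\right)}{\sum_{l=1}^{m} \exp\left(q_i^\top k_l / \sqrt{d_k}\right)}, \qquad \text{output}_i = \sum_{j=1}^{m} \alpha_{ij}\, v_j$$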

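Softmax turns a vector of arbitrary real numbers into a probability distribution. In the formula below, the upper limit K of the sum is the length of the vector z and has nothing to do with the key matrix K: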

$$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

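To avoid overflow in the exponentials, practical implementations subtract the maximum element first, which leaves the result unchanged: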

$$\sigma(\mathbf{z})_i = \frac{e^{z_i - \max(\mathbf{z})}}{\sum_{j=1}^{K} e^{z_j - \max(\mathbf{z})}}$$
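
The two softmax formulas can be compared with a small plain-Python sketch. The function names `softmax_naive` and `softmax_stable` and the sample inputs are illustrative choices, not from the article; the sketch assumes the input is a non-empty list of floats.

```python
import math

def softmax_naive(z):
    """Direct translation of the first formula; can overflow for large inputs."""
    exps = [math.exp(x) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(z):
    """Second formula: subtract max(z) so every exponent is <= 0."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]
large = [1000.0, 1001.0, 1002.0]  # same gaps as `small`, shifted by 999

# On small inputs both versions agree.
agree_on_small = all(
    abs(a - b) < 1e-12
    for a, b in zip(softmax_naive(small), softmax_stable(small))
)

# On large inputs math.exp overflows in the naive version.
try:
    softmax_naive(large)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

stable_large = softmax_stable(large)
print(agree_on_small, naive_overflowed)
print([round(p, 4) for p in stable_large])
```

Because softmax only depends on the differences between the inputs, `stable_large` is the same distribution as the one for `small`.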

Let's work through a small example.

Suppose we have the sentence "The teacher said", with the following word vectors:

$$\text{x}_{\text{The}} = [0.9, 2.1, 4.3]$$

$$\text{x}_{\text{teacher}} = [1.2, 3.4, 5.6]$$

$$\text{x}_{\text{said}} = [2.5, 3.6, 4.7]$$

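Assume $$d_k = 64$$, so the scaling factor is $$\sqrt{d_k} = 8$$. Strictly speaking, $$d_k$$ is the dimension of the key vectors, which is 3 in this toy example; the example nevertheless uses 64, which gives the round factor 8. Scaling keeps the attention scores from growing too large. Without it, softmax tends to put almost all of the weight on a single word, a winner-takes-all effect.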
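
To see the winner-takes-all effect in numbers, here is a minimal sketch. The three raw scores are hypothetical values chosen for illustration, not taken from the example below; only `d_k = 64` comes from the text.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
raw_scores = [16.0, 24.0, 32.0]  # hypothetical dot products
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

unscaled_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)

print("without scaling:", [round(w, 4) for w in unscaled_weights])
print("with scaling:   ", [round(w, 4) for w in scaled_weights])
```

Without scaling, the largest score takes more than 99% of the weight. After dividing by 8 it takes roughly two thirds, so the other words still contribute.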

Assume the weight matrices for Q, K, and V are all 3 × 3, as follows:

$$\text{W}^{\text{Q}} = \begin{bmatrix} 0.1 & 0.2 & 0.3 \\ 0.4 & 0.5 & 0.6 \\ 0.7 & 0.8 & 0.9 \\\end{bmatrix}$$

$$\text{W}^{\text{K}} = \begin{bmatrix} 0.9 & 0.8 & 0.7 \\ 0.6 & 0.5 & 0.4 \\ 0.3 & 0.2 & 0.1 \\\end{bmatrix}$$

$$\text{W}^{\text{V}} = \begin{bmatrix} 0.2 & 0.3 & 0.4 \\ 0.5 & 0.6 & 0.7 \\ 0.8 & 0.9 & 1.0 \\\end{bmatrix}$$

Computing Q

The Q vector of each word is its word vector multiplied by the Q weight matrix:

$$\text{Q}_{\text{The}}=[0.9,2.1,4.3] \times \begin{bmatrix} 0.1 & 0.2 & 0.3 \\ 0.4 & 0.5 & 0.6 \\ 0.7 & 0.8 & 0.9 \\\end{bmatrix}=[3.94,4.67,5.4]$$

$$\text{Q}_{\text{teacher}}=[1.2,3.4,5.6] \times \begin{bmatrix} 0.1 & 0.2 & 0.3 \\ 0.4 & 0.5 & 0.6 \\ 0.7 & 0.8 & 0.9 \\\end{bmatrix}=[5.4,6.42,7.44]$$

$$\text{Q}_{\text{said}}=[2.5,3.6,4.7] \times \begin{bmatrix} 0.1 & 0.2 & 0.3 \\ 0.4 & 0.5 & 0.6 \\ 0.7 & 0.8 & 0.9 \\\end{bmatrix}=[4.98,6.06,7.14]$$

Computing K

The K vector of each word is its word vector multiplied by the K weight matrix:

$$\text{K}_{\text{The}}=[0.9,2.1,4.3] \times \begin{bmatrix} 0.9 & 0.8 & 0.7 \\ 0.6 & 0.5 & 0.4 \\ 0.3 & 0.2 & 0.1 \\\end{bmatrix}=[3.36,2.63,1.9]$$

$$\text{K}_{\text{teacher}}=[1.2,3.4,5.6] \times \begin{bmatrix} 0.9 & 0.8 & 0.7 \\ 0.6 & 0.5 & 0.4 \\ 0.3 & 0.2 & 0.1 \\\end{bmatrix}=[4.8,3.78,2.76]$$

$$\text{K}_{\text{said}}=[2.5,3.6,4.7] \times \begin{bmatrix} 0.9 & 0.8 & 0.7 \\ 0.6 & 0.5 & 0.4 \\ 0.3 & 0.2 & 0.1 \\\end{bmatrix}=[5.82,4.74,3.66]$$

Computing V

The V vector of each word is its word vector multiplied by the V weight matrix:

$$\text{V}_{\text{The}}=[0.9,2.1,4.3] \times \begin{bmatrix} 0.2 & 0.3 & 0.4 \\ 0.5 & 0.6 & 0.7 \\ 0.8 & 0.9 & 1.0 \\\end{bmatrix}=[4.67,5.4,6.13]$$

$$\text{V}_{\text{teacher}}=[1.2,3.4,5.6] \times \begin{bmatrix} 0.2 & 0.3 & 0.4 \\ 0.5 & 0.6 & 0.7 \\ 0.8 & 0.9 & 1.0 \\\end{bmatrix}=[6.42,7.44,8.46]$$

$$\text{V}_{\text{said}}=[2.5,3.6,4.7] \times \begin{bmatrix} 0.2 & 0.3 & 0.4 \\ 0.5 & 0.6 & 0.7 \\ 0.8 & 0.9 & 1.0 \\\end{bmatrix}=[6.06,7.14,8.22]$$

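These calculations give Q, K, and V for each of the three words, nine vectors in total. The three new vectors of each word are the foundation of the self-attention layer.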
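
The nine projections can be reproduced with a few lines of plain Python. This is a sketch using the word vectors and weight matrices given above; the helper name `row_times_matrix` is made up for illustration.

```python
def row_times_matrix(x, W):
    """Multiply a row vector x (length n) by an n x m matrix W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

X = {
    "The": [0.9, 2.1, 4.3],
    "teacher": [1.2, 3.4, 5.6],
    "said": [2.5, 3.6, 4.7],
}
W_Q = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
W_K = [[0.9, 0.8, 0.7], [0.6, 0.5, 0.4], [0.3, 0.2, 0.1]]
W_V = [[0.2, 0.3, 0.4], [0.5, 0.6, 0.7], [0.8, 0.9, 1.0]]

# One Q, K, and V vector per word.
Q = {word: row_times_matrix(x, W_Q) for word, x in X.items()}
K = {word: row_times_matrix(x, W_K) for word, x in X.items()}
V = {word: row_times_matrix(x, W_V) for word, x in X.items()}

for word in X:
    print(word,
          [round(v, 2) for v in Q[word]],
          [round(v, 2) for v in K[word]],
          [round(v, 2) for v in V[word]])
```

In a real model the three projections are done as matrix-matrix products over the whole sequence at once; the per-word loop here just mirrors the hand calculation.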

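Computing the attention scores

Next come the attention scores. For each word, its Q vector is dotted with the K vector of every word in the sentence, including its own, which yields one score per word. A larger score means the two words are more strongly related; a smaller score means the relation is weaker. You can also read the score as how much attention the current word pays to each of the words.

Attention scores of The against all words: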

$$\text{score(The,The)}=\text{Q}_{\text{The}} \cdot \text{K}_{\text{The}}=3.94 \times 3.36 + 4.67 \times 2.63 + 5.4 \times 1.9=35.781$$

$$\text{score(The,teacher)}=\text{Q}_{\text{The}} \cdot \text{K}_{\text{teacher}}=3.94 \times 4.8 + 4.67 \times 3.78 + 5.4 \times 2.76=51.469$$

$$\text{score(The,said)}=\text{Q}_{\text{The}} \cdot \text{K}_{\text{said}}=3.94 \times 5.82 + 4.67 \times 4.74 + 5.4 \times 3.66=64.831$$

Attention scores of teacher against all words:

$$\text{score(teacher,The)}=\text{Q}_{\text{teacher}} \cdot \text{K}_{\text{The}}=5.4 \times 3.36 + 6.42 \times 2.63 + 7.44 \times 1.9=49.165$$

$$\text{score(teacher,teacher)}=\text{Q}_{\text{teacher}} \cdot \text{K}_{\text{teacher}}=5.4 \times 4.8 + 6.42 \times 3.78 + 7.44 \times 2.76=70.722$$

$$\text{score(teacher,said)}=\text{Q}_{\text{teacher}} \cdot \text{K}_{\text{said}}=5.4 \times 5.82 + 6.42 \times 4.74 + 7.44 \times 3.66=89.089$$

Attention scores of said against all words:

$$\text{score(said,The)}=\text{Q}_{\text{said}} \cdot \text{K}_{\text{The}}=4.98 \times 3.36 + 6.06 \times 2.63 + 7.14 \times 1.9=46.237$$

$$\text{score(said,teacher)}=\text{Q}_{\text{said}} \cdot \text{K}_{\text{teacher}}=4.98 \times 4.8 + 6.06 \times 3.78 + 7.14 \times 2.76=66.517$$

$$\text{score(said,said)}=\text{Q}_{\text{said}} \cdot \text{K}_{\text{said}}=4.98 \times 5.82 + 6.06 \times 4.74 + 7.14 \times 3.66=83.84$$
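
The nine dot products above form a 3 × 3 score matrix whose entry in row i, column j is the Q vector of word i dotted with the K vector of word j. A minimal sketch, starting from the Q and K vectors computed earlier (the helper `dot` and the variable names are illustrative):

```python
words = ["The", "teacher", "said"]
Q = [[3.94, 4.67, 5.40], [5.40, 6.42, 7.44], [4.98, 6.06, 7.14]]
K = [[3.36, 2.63, 1.90], [4.80, 3.78, 2.76], [5.82, 4.74, 3.66]]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# scores[i][j] = Q_i . K_j, i.e. the matrix product Q K^T.
scores = [[dot(q, k) for k in K] for q in Q]

for word, row in zip(words, scores):
    print(word, [round(s, 3) for s in row])
```

Each row holds the scores of one query word against every key, so the softmax in the later steps is applied row by row.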

Next, scale the scores by dividing by $$\sqrt{d_k} = 8$$:

Scaled scores of The against all words:

$$\text{score(The,The)}=\frac{35.781}{8}=4.4726$$

$$\text{score(The,teacher)}=\frac{51.469}{8}=6.4336$$

$$\text{score(The,said)}=\frac{64.831}{8}=8.1039$$

Scaled scores of teacher against all words:

$$\text{score(teacher,The)}=\frac{49.165}{8}=6.1456$$

$$\text{score(teacher,teacher)}=\frac{70.722}{8}=8.8403$$

$$\text{score(teacher,said)}=\frac{89.089}{8}=11.1361$$

Scaled scores of said against all words:

$$\text{score(said,The)}=\frac{46.237}{8}=5.7796$$

$$\text{score(said,teacher)}=\frac{66.517}{8}=8.3146$$

$$\text{score(said,said)}=\frac{83.84}{8}=10.48$$

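For numerical stability, subtract each row's maximum from its scaled scores. The shifted results are:

Shifted scores of The against all words: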

$$\text{score(The,The)}=4.4726-8.1039=-3.6313$$

$$\text{score(The,teacher)}=6.4336-8.1039=-1.6703$$

$$\text{score(The,said)}=8.1039-8.1039=0$$

Shifted scores of teacher against all words:

$$\text{score(teacher,The)}=6.1456-11.1361=-4.9905$$

$$\text{score(teacher,teacher)}=8.8403-11.1361=-2.2958$$

$$\text{score(teacher,said)}=11.1361-11.1361=0$$

Shifted scores of said against all words:

$$\text{score(said,The)}=5.7796-10.48=-4.7004$$

$$\text{score(said,teacher)}=8.3146-10.48=-2.1654$$

$$\text{score(said,said)}=10.48-10.48=0$$
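
Scaling and shifting are both simple row-wise operations. The sketch below starts from the raw scores computed by hand above and uses `d_k = 64` as in the article; the variable names are illustrative.

```python
import math

d_k = 64
raw = [
    [35.781, 51.469, 64.831],  # The
    [49.165, 70.722, 89.089],  # teacher
    [46.237, 66.517, 83.840],  # said
]

scale = math.sqrt(d_k)  # 8.0
scaled = [[s / scale for s in row] for row in raw]

# Subtract each row's own maximum, so the largest entry in every row becomes 0.
shifted = [[s - max(row) for s in row] for row in scaled]

for row in shifted:
    print([round(s, 4) for s in row])
```

The shift is applied per row because each row is normalized by its own softmax; subtracting a single global maximum would also be valid but protects less against underflow in rows with small scores.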

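Next, apply softmax to the scores. This turns them into a probability distribution that sums to 1 while preserving their relative order.

Softmax weights for the three scores of The: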

$$\text{Weight}_{1}=\frac{e^{-3.6313}}{e^{-3.6313}+e^{-1.6703}+e^{0}}≈0.0217$$

$$\text{Weight}_{2}=\frac{e^{-1.6703}}{e^{-3.6313}+e^{-1.6703}+e^{0}}≈0.1550$$

$$\text{Weight}_{3}=\frac{e^{0}}{e^{-3.6313}+e^{-1.6703}+e^{0}}≈0.8233$$

Softmax weights for the three scores of teacher:

$$\text{Weight}_{1}=\frac{e^{-4.9905}}{e^{-4.9905}+e^{-2.2958}+e^{0}}≈0.0061$$

$$\text{Weight}_{2}=\frac{e^{-2.2958}}{e^{-4.9905}+e^{-2.2958}+e^{0}}≈0.0911$$

$$\text{Weight}_{3}=\frac{e^{0}}{e^{-4.9905}+e^{-2.2958}+e^{0}}≈0.9028$$

Softmax weights for the three scores of said:

$$\text{Weight}_{1}=\frac{e^{-4.7004}}{e^{-4.7004}+e^{-2.1654}+e^{0}}≈0.0081$$

$$\text{Weight}_{2}=\frac{e^{-2.1654}}{e^{-4.7004}+e^{-2.1654}+e^{0}}≈0.1021$$

$$\text{Weight}_{3}=\frac{e^{0}}{e^{-4.7004}+e^{-2.1654}+e^{0}}≈0.8898$$
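
The weights can be checked with a short sketch that applies softmax to each row of shifted scores from the previous step. Computed in floating point, the results agree with the hand-computed weights to about three decimal places; the last digit can differ slightly. The function name `softmax_shifted` is illustrative.

```python
import math

words = ["The", "teacher", "said"]
shifted = [
    [-3.6313, -1.6703, 0.0],
    [-4.9905, -2.2958, 0.0],
    [-4.7004, -2.1654, 0.0],
]

def softmax_shifted(row):
    """Softmax for a row whose maximum has already been subtracted."""
    exps = [math.exp(x) for x in row]  # every exponent is <= 0, so no overflow
    total = sum(exps)
    return [e / total for e in exps]

weights = [softmax_shifted(row) for row in shifted]

for word, row in zip(words, weights):
    print(word, [round(w, 4) for w in row], "sum =", round(sum(row), 6))
```

In every row the third weight (for said) is the largest, matching the order of the raw scores.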

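Finally, use the attention weights from the previous step to take a weighted sum of the V vectors, which gives the new vector representation.

Output for The: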

$$\begin{equation} \begin{aligned} \text{output}_{\text{The}}&=(0.0217 \times \text{V}_{\text{The}}) + (0.1550 \times \text{V}_{\text{teacher}}) + (0.8233 \times \text{V}_{\text{said}}) \\ &= \begin{bmatrix} 6.085637 & 7.148742 & 8.211847 \end{bmatrix} \end{aligned} \end{equation}$$

Output for teacher:

$$\begin{equation} \begin{aligned} \text{output}_{\text{teacher}}&=(0.0061 \times \text{V}_{\text{The}}) + (0.0911 \times \text{V}_{\text{teacher}}) + (0.9028 \times \text{V}_{\text{said}}) \\ &=\begin{bmatrix} 6.084317 & 7.156716 & 8.229115 \end{bmatrix} \end{aligned} \end{equation}$$

Output for said:

$$\begin{equation} \begin{aligned} \text{output}_{\text{said}}&=(0.0081 \times \text{V}_{\text{The}}) + (0.1021 \times \text{V}_{\text{teacher}}) + (0.8898 \times \text{V}_{\text{said}})\\ &=\begin{bmatrix} 6.085497&7.156536&8.227575 \end{bmatrix} \end{aligned} \end{equation}$$
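
The weighted sums can be reproduced directly from the rounded weights and the V vectors used in the hand calculation. This is a sketch; the helper name `weighted_sum` is made up for illustration.

```python
words = ["The", "teacher", "said"]
V = [
    [4.67, 5.40, 6.13],  # V_The
    [6.42, 7.44, 8.46],  # V_teacher
    [6.06, 7.14, 8.22],  # V_said
]
# Rounded attention weights from the softmax step, one row per query word.
weights = [
    [0.0217, 0.1550, 0.8233],
    [0.0061, 0.0911, 0.9028],
    [0.0081, 0.1021, 0.8898],
]

def weighted_sum(w, vectors):
    """Return sum_j w[j] * vectors[j], computed component by component."""
    dim = len(vectors[0])
    return [sum(wj * v[d] for wj, v in zip(w, vectors)) for d in range(dim)]

outputs = [weighted_sum(w, V) for w in weights]

for word, out in zip(words, outputs):
    print(word, [round(o, 6) for o in out])
```

Since said receives most of the weight in every row, all three outputs end up close to the V vector of said.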

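In the end, every word gets a new vector representation (a contextual representation), which is passed on to later network layers or to the decoder.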
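
Putting all the steps together, the whole walkthrough fits in one small function. This is a plain-Python sketch under the article's assumptions, not a production implementation: the function names are illustrative, there is a single attention head with no masking, and `d_k` is passed in explicitly as 64 to mirror the example even though the vectors have only 3 components.

```python
import math

def matmul(A, B):
    """Matrix product of two lists-of-rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_Q, W_K, W_V, d_k):
    """Single-head scaled dot-product self-attention over the rows of X."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    K_T = [list(col) for col in zip(*K)]   # transpose of K
    scores = matmul(Q, K_T)                # scores[i][j] = Q_i . K_j
    scale = math.sqrt(d_k)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, V)              # weighted sum of V rows

X = [[0.9, 2.1, 4.3], [1.2, 3.4, 5.6], [2.5, 3.6, 4.7]]  # The, teacher, said
W_Q = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
W_K = [[0.9, 0.8, 0.7], [0.6, 0.5, 0.4], [0.3, 0.2, 0.1]]
W_V = [[0.2, 0.3, 0.4], [0.5, 0.6, 0.7], [0.8, 0.9, 1.0]]

outputs = self_attention(X, W_Q, W_K, W_V, d_k=64)
for word, out in zip(["The", "teacher", "said"], outputs):
    print(word, [round(o, 4) for o in out])
```

Because nothing is rounded along the way, the printed vectors match the hand-computed outputs to about three decimal places rather than digit for digit.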

Let's go through it once more with the help of a figure.