从零开始逐步指导开发者构建自己的大型语言模型（LLM）学习笔记- 第3章编码注意力机制Coding Attention Mechanisms

本文链接：https://blog.youkuaiyun.com/chenchihwen/article/details/144936755

这段文本主要介绍了 “Chapter 3: Coding Attention Mechanisms”（第三章：编码注意力机制）的相关资料。

其中，“Main Chapter Code”（主章节代码）部分，

代码存于 [01_main-chapter-code] 中；

“Bonus Materials”（额外资料）部分包含两方面内容，

一是 [02_bonus_efficient-multihead-attention] 实现并比较了多头注意力的不同实现变体，

二是 [03_understanding-buffers] 阐释了 PyTorch 缓冲区背后的概念，该缓冲区在第三章用于实现因果注意力机制。

Chapter 3: Coding Attention Mechanisms

《从头构建大语言模型》（Build a Large Language Model From Scratch）一书中关于注意力机制编码的章节，围绕注意力机制这一 LLMs 的核心引擎展开，详细介绍了多种注意力机制的原理、代码实现及应用，具体内容如下：

Packages that are being used in this notebook:

[1]

from importlib.metadata import version

print("torch version:", version("torch"))
torch version: 2.4.0

This chapter covers attention mechanisms, the engine of LLMs:

3.1 The problem with modeling long sequences

长序列建模难题：在机器翻译任务中，逐词翻译因源语言和目标语言的语法结构差异而不可行。在 Transformer 模型出现之前，编码器 - 解码器 RNNs 常用于机器翻译，编码器通过隐藏状态生成输入序列的浓缩表示。
注意力机制的作用：通过注意力机制，网络的文本生成解码器部分能够有选择地访问所有输入令牌，意味着某些输入令牌在生成特定输出令牌时更重要。自注意力是一种增强输入表示的技术，使序列中的每个位置能够与其他位置相互作用并确定相关性。
Translating a text word by word isn't feasible due to the differences in grammatical structures between the source and target languages:

Prior to the introduction of transformer models, encoder-decoder RNNs were commonly used for machine translation tasks
In this setup, the encoder processes a sequence of tokens from the source language, using a hidden state—a kind of intermediate layer within the neural network—to generate a condensed representation of the entire input sequence:

- 在引入变压器模型之前，编码器 - 解码器循环神经网络（RNN）通常用于机器翻译任务。
- “变压器模型”指的是一种特定的机器学习模型，通常在自然语言处理等领域有广泛应用。
- “编码器 - 解码器循环神经网络（RNN）”是一种用于处理序列数据的神经网络结构，由编码器和解码器两部分组成，在机器翻译中，编码器将源语言的序列转换为一种中间表示，解码器再将这种中间表示转换为目标语言的序列。
- “机器翻译任务”即把一种语言的文本自动转换为另一种语言的文本的任务。
- 在这种设置中，编码器处理来自源语言的一系列标记，使用隐藏状态（神经网络中的一种中间层）来生成整个输入序列的浓缩表示。
- “设置”在这里指的是使用编码器 - 解码器 RNN 进行机器翻译的这种情况。
- “源语言”是要进行翻译的原始语言。
- “一系列标记”可以理解为源语言文本被分割成的一个个单元，比如单词或字符等。
- “隐藏状态”是 RNN 中的一种内部状态，它在处理序列数据的过程中不断更新，用于存储对之前输入的信息的记忆，从而帮助网络更好地理解当前的输入和生成输出。
- “中间层”是神经网络中的一层，在编码器 - 解码器 RNN 中，隐藏状态所在的层起到了连接输入和输出的中间作用，它将输入序列进行编码，生成一个能够代表整个序列的浓缩表示。
- “浓缩表示”是对输入序列的一种简化和抽象，它包含了输入序列的关键信息，以便解码器能够根据这个表示来生成目标语言的输出。
- “整个输入序列”即源语言的文本序列。

3.2 Capturing data dependencies with attention mechanisms

章节 3.2， “利用注意力机制捕获数据依赖关系”。这里可能是在讨论一种技术方法，即通过注意力机制来获取数据之间的依赖关系，注意力机制通常用于有选择地关注输入数据的特定部分，从而更好地理解数据的结构和关系。在特定的技术领域，如机器学习、深度学习等领域，这可能涉及到对数据特征的提取和分析，以提高模型的性能和准确性。

No code in this section
Through an attention mechanism, the text-generating decoder segment of the network is capable of selectively accessing all input tokens, implying that certain input tokens hold more significance than others in the generation of a specific output token:

通过注意力机制，网络的文本生成解码器部分能够有选择地访问所有输入标记，这意味着在生成特定输出标记时，某些输入标记比其他标记更重要。

Self-attention in transformers is a technique designed to enhance input representations by enabling each position in a sequence to engage with and determine the relevance of every other position within the same sequence

在 Transformer 中，自注意力（Self-attention）是一种旨在通过使序列中的每个位置都能够与同一序列中的其他每个位置进行交互并确定其相关性，从而增强输入表示的技术。

具体来说，自注意力机制允许模型在处理一个序列时，不同位置的元素能够根据彼此的重要性进行动态调整权重。例如在自然语言处理任务中，一个句子中的不同单词可以通过自注意力机制来确定彼此对于当前任务的重要程度。这样可以更好地捕捉句子中的长距离依赖关系以及不同部分之间的复杂语义联系。

比如在翻译任务中，模型可以利用自注意力机制来确定源语言句子中哪些单词对于目标语言的翻译结果最为关键，从而更准确地进行翻译。同时，自注意力机制也使得模型能够并行地处理整个序列，提高了计算效率。

3.3 Attending to different parts of the input with self-attention

3.3 使用自注意力关注输入的不同部分

3.3.1 A simple self-attention mechanism without trainable weights一种没有可训练权重的简单自注意力机制）

This section explains a very simplified variant of self-attention, which does not contain any trainable weights
This is purely for illustration purposes and NOT the attention mechanism that is used in transformers
The next section, section 3.3.2, will extend this simple attention mechanism to implement the real self-attention mechanism
Suppose we are given an input sequence 𝑥(1)x(1) to 𝑥(𝑇)x(T)
- The input is a text (for example, a sentence like "Your journey starts with one step") that has already been converted into token embeddings as described in chapter 2
- For instance, 𝑥(1)x(1) is a d-dimensional vector representing the word "Your", and so forth
Goal: compute context vectors 𝑧(𝑖)z(i) for each input sequence element 𝑥(𝑖)x(i) in 𝑥(1)x(1) to 𝑥(𝑇)x(T) (where 𝑧z and 𝑥x have the same dimension)
- A context vector 𝑧(𝑖)z(i) is a weighted sum over the inputs 𝑥(1)x(1) to 𝑥(𝑇)x(T)
- The context vector is "context"-specific to a certain input
  - Instead of 𝑥(𝑖)x(i) as a placeholder for an arbitrary input token, let's consider the second input, 𝑥(2)x(2)
  - And to continue with a concrete example, instead of the placeholder 𝑧(𝑖)z(i), we consider the second output context vector, 𝑧(2)z(2)
  - The second context vector, 𝑧(2)z(2), is a weighted sum over all inputs 𝑥(1)x(1) to 𝑥(𝑇)x(T) weighted with respect to the second input element, 𝑥(2)x(2)
  - The attention weights are the weights that determine how much each of the input elements contributes to the weighted sum when computing 𝑧(2)z(2)
  - In short, think of 𝑧(2)z(2) as a modified version of 𝑥(2)x(2) that also incorporates information about all other input elements that are relevant to a given task at hand

(Please note that the numbers in this figure are truncated to one
digit after the decimal point to reduce visual clutter; similarly, other figures may also contain truncated values)

By convention, the unnormalized attention weights are referred to as "attention scores" whereas the normalized attention scores, which sum to 1, are referred to as "attention weights"

The code below walks through the figure above step by step

Step 1: compute unnormalized attention scores 𝜔ω
Suppose we use the second input token as the query, that is, 𝑞(2)=𝑥(2)q(2)=x(2), we compute the unnormalized attention scores via dot products:
- 𝜔21=𝑥(1)𝑞(2)⊤ω21=x(1)q(2)⊤
- 𝜔22=𝑥(2)𝑞(2)⊤ω22=x(2)q(2)⊤
- 𝜔23=𝑥(3)𝑞(2)⊤ω23=x(3)q(2)⊤
- ...
- 𝜔2𝑇=𝑥(𝑇)𝑞(2)⊤ω2T=x(T)q(2)⊤
Above, 𝜔ω is the Greek letter "omega" used to symbolize the unnormalized attention scores
- The subscript "21" in 𝜔21ω21 means that input sequence element 2 was used as a query against input sequence element 1

Suppose we have the following input sentence that is already embedded in 3-dimensional vectors as described in chapter 3 (we use a very small embedding dimension here for illustration purposes, so that it fits onto the page without line breaks):

[2]

import torch

inputs = torch.tensor(

[[0.43, 0.15, 0.89], # Your (x^1)

[0.55, 0.87, 0.66], # journey (x^2)

[0.57, 0.85, 0.64], # starts (x^3)

[0.22, 0.58, 0.33], # with (x^4)

[0.77, 0.25, 0.10], # one (x^5)

[0.05, 0.80, 0.55]] # step (x^6)

)

(In this book, we follow the common machine learning and deep learning convention where training examples are represented as rows and feature values as columns; in the case of the tensor shown above, each row represents a word, and each column represents an embedding dimension)
The primary objective of this section is to demonstrate how the context vector 𝑧(2)z(2)
is calculated using the second input sequence, 𝑥(2)x(2), as a query
The figure depicts the initial step in this process, which involves calculating the attention scores ω between 𝑥(2)x(2)
and all other input elements through a dot product operation