CS336 Lecture 2

PyTorch is an open-source Python machine learning library, based on the Torch library and implemented in C++ under the hood. It is widely used in AI applications such as computer vision and natural language processing.


Total FLOPs:

$$6 \times \text{number of model parameters} \times \text{number of tokens}$$

The factor 6 counts the FLOPs of the forward pass (2 per parameter per token) plus the backward pass (4 per parameter per token).

Estimation problems (napkin math)

Training time estimate:

$$\text{time} = \frac{\text{total FLOPs}}{\text{FLOP/s per GPU} \times \text{number of GPUs} \times \text{utilization}}$$
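
To make the formula concrete, here is a napkin-math sketch; every number in it (model size, token count, per-GPU throughput, utilization) is an illustrative assumption, not a value from the lecture:

    num_parameters = 70e9                           # assumed model size
    num_tokens = 15e12                              # assumed number of training tokens
    total_flops = 6 * num_parameters * num_tokens   # forward (2) + backward (4)

    flops_per_gpu = 989e12                          # assumed peak, roughly an H100's dense BF16 throughput
    num_gpus = 1024                                 # assumed cluster size
    utilization = 0.4                               # assumed MFU

    seconds = total_flops / (flops_per_gpu * num_gpus * utilization)
    print(f"~{seconds / 86400:.0f} days")           # about half a year under these assumptions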

PyTorch basics

x = torch.randn(4, 8)  # 4x8 matrix of iid Normal(0, 1) samples @inspect x
  • Creates a 4×8 tensor whose elements are i.i.d. samples from the standard normal distribution N(0, 1).
  • Commonly used for random initialization of neural-network weights.
nn.init.trunc_normal_(x, mean=0, std=1, a=-2, b=2)  # @inspect x
  • Truncated normal distribution: only samples falling inside [a, b] are kept.
  • Here mean=0, std=1, a=-2, b=2: sample from the standard normal, but keep only values between -2 and 2.

This is a more conservative initialization than torch.randn, often used to avoid extreme initial values that would make training unstable.

    x = torch.zeros(4, 8)  # @inspect x
    assert x.dtype == torch.float32  # Default type
    assert x.numel() == 4 * 8
    assert x.element_size() == 4  # Float is 4 bytes
    assert get_memory_usage(x) == 4 * 8 * 4  # 128 bytes

x.numel() returns the total number of elements in the tensor.

x.element_size() returns the number of bytes each element occupies in memory.

get_memory_usage(x) is a helper that returns the total memory the tensor occupies, in bytes.

  • Common data types:
    • Float32 (single precision): 4 bytes, the standard precision for deep learning.
    • Float16 (half precision): 2 bytes, small dynamic range, prone to underflow.
    • BFloat16: same memory as Float16 but a much larger dynamic range, well suited to deep learning.
    • FP8: introduced in 2022; 8-bit precision saves even more memory but has a smaller range and less precision.
  • Recommendation: use mixed-precision training, keeping the critical parts (e.g., optimizer state, gradients) in Float32.
    text("Let's compare the dynamic ranges and memory usage of the different data types:")
    float32_info = torch.finfo(torch.float32)  # @inspect float32_info
    float16_info = torch.finfo(torch.float16)  # @inspect float16_info
    bfloat16_info = torch.finfo(torch.bfloat16)  # @inspect bfloat16_info

torch.finfo(dtype) returns the numerical properties of a floating-point dtype, including:

  • .bits — total number of bits
  • .eps — the gap between 1 and the next representable value (precision)
  • .max — the largest representable value
  • .min — the smallest representable value (most negative)
  • .tiny — the smallest positive normal number (not subnormal)
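
A small sketch (not from the lecture code) that prints these fields for the three dtypes side by side:

    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        info = torch.finfo(dtype)
        # e.g., float16 maxes out at 65504 (small dynamic range), while bfloat16 has
        # roughly the same max (~3.4e38) as float32, at the cost of a much larger eps.
        print(dtype, "bits:", info.bits, "eps:", info.eps, "max:", info.max, "tiny:", info.tiny)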
    if not torch.cuda.is_available():
        return

    num_gpus = torch.cuda.device_count()  # @inspect num_gpus
    for i in range(num_gpus):
        properties = torch.cuda.get_device_properties(i)  # @inspect properties

    memory_allocated = torch.cuda.memory_allocated()  # @inspect memory_allocated

num_gpus = torch.cuda.device_count() returns the number of GPUs available on this machine.

get_device_properties(i) returns the properties of GPU i.

memory_allocated = torch.cuda.memory_allocated() returns the number of bytes PyTorch has currently allocated on the default GPU (index 0).

Note: this counts only memory allocated by PyTorch, not the GPU's total usage (memory used by other processes is not included).
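
A minimal sketch (assuming a CUDA GPU is present; not from the lecture code) showing that memory_allocated tracks only PyTorch's own allocations:

    if torch.cuda.is_available():
        before = torch.cuda.memory_allocated()
        tmp = torch.zeros(1024, 1024, device="cuda:0")  # allocate ~4 MB on GPU 0
        after = torch.cuda.memory_allocated()
        # The delta covers only this PyTorch allocation (rounded up to the caching
        # allocator's block size); memory used by other processes is not counted.
        print("allocated delta (bytes):", after - before)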

    x = torch.tensor([
        [0., 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11],
        [12, 13, 14, 15],
    ])

    text("To go to the next row (dim 0), skip 4 elements in storage.")
    assert x.stride(0) == 4

    text("To go to the next column (dim 1), skip 1 element in storage.")
    assert x.stride(1) == 1

stride(0) is how many elements in storage you skip to move from row i to row i+1.

  • Each row has 4 elements, so it is 4.

stride(1) is how many elements you skip to move from column j to column j+1 within a row.

  • The gap between adjacent columns is 1, because they are contiguous in memory.
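
As a sketch of what strides mean in terms of the flat storage (reusing the 4×4 tensor defined above; not from the lecture code):

    # Element (i, j) of the contiguous 4x4 tensor above lives at
    # storage offset i * stride(0) + j * stride(1).
    i, j = 2, 3
    flat = x.flatten()  # storage order, since x is contiguous
    assert flat[i * x.stride(0) + j * x.stride(1)] == x[i, j]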
    x = torch.tensor([[1., 2, 3], [4, 5, 6]])  # @inspect x

    text("Many operations simply provide a different **view** of the tensor.")
    text("This does not make a copy, and therefore mutations in one tensor affects the other.")

    text("Get row 0:")
    y = x[0]  # @inspect y
    assert torch.equal(y, torch.tensor([1., 2, 3]))
    assert same_storage(x, y)

    text("Get column 1:")
    y = x[:, 1]  # @inspect y
    assert torch.equal(y, torch.tensor([2, 5]))
    assert same_storage(x, y)

    text("View 2x3 matrix as 3x2 matrix:")
    y = x.view(3, 2)  # @inspect y
    assert torch.equal(y, torch.tensor([[1, 2], [3, 4], [5, 6]]))
    assert same_storage(x, y)

    text("Transpose the matrix:")
    y = x.transpose(1, 0)  # @inspect y
    assert torch.equal(y, torch.tensor([[1, 4], [2, 5], [3, 6]]))
    assert same_storage(x, y)

    text("Check that mutating x also mutates y.")
    x[0][0] = 100  # @inspect x, @inspect y
    assert y[0][0] == 100

    text("Note that some views are non-contiguous entries, which means that further views aren't possible.")
    x = torch.tensor([[1., 2, 3], [4, 5, 6]])  # @inspect x
    y = x.transpose(1, 0)  # @inspect y
    assert not y.is_contiguous()
    try:
        y.view(2, 3)
        assert False
    except RuntimeError as e:
        assert "view size is not compatible with input tensor's size and stride" in str(e)

    text("One can enforce a tensor to be contiguous first:")
    y = x.transpose(1, 0).contiguous().view(2, 3)  # @inspect y
    assert not same_storage(x, y)
    text("Views are free, copying take both (additional) memory and compute.")

Taking a row or a column gives a view, which means:

  • x and y share the same underlying storage
  • hence assert same_storage(x, y) holds

.view() is a reshape; it does not copy data and still shares memory, so same_storage(x, y) holds.

.transpose() is also a view that shares storage, but this kind of view is non-contiguous.

Because y's memory layout after .transpose() is non-contiguous, you cannot call .view() on it directly; it raises an error.

.contiguous() copies the data and lays it out contiguously in memory, after which .view() works. At that point x and y no longer share memory: not same_storage(x, y).
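
Note that reshape (unlike view) handles the non-contiguous case automatically: it returns a view when the strides allow it and makes a copy otherwise. A small sketch using the same_storage helper from above:

    x = torch.tensor([[1., 2, 3], [4, 5, 6]])
    y = x.transpose(0, 1).reshape(2, 3)  # works: reshape copies because the transposed tensor is non-contiguous
    assert not same_storage(x, y)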

    x = torch.tensor([1, 4, 9])
    assert torch.equal(x.rsqrt(), torch.tensor([1, 1 / 2, 1 / 3]))  # i -> 1/sqrt(x_i)

    text("`triu` takes the upper triangular part of a matrix.")
    x = torch.ones(3, 3).triu()  # @inspect x
    assert torch.equal(x, torch.tensor([
        [1, 1, 1],
        [0, 1, 1],
        [0, 0, 1]],
    ))
    text("This is useful for computing an causal attention mask, where M[i, j] is the contribution of i to j.")

x.rsqrt(): reciprocal square root, i.e., each element x_i maps to 1/sqrt(x_i).

torch.ones(3, 3) creates a 3×3 matrix of ones; .triu() keeps the upper triangle (including the diagonal) and zeroes out the rest.

    x = torch.ones(2, 2, 3)  # batch, sequence, hidden  @inspect x
    y = torch.ones(2, 2, 3)  # batch, sequence, hidden  @inspect y
    z = x @ y.transpose(-2, -1)  # batch, sequence, sequence  @inspect z

The resulting shape is z.shape == (2, 2, 2).

    text("Define two tensors:")
    x: Float[torch.Tensor, "batch seq1 hidden"] = torch.ones(2, 3, 4)  # @inspect x
    y: Float[torch.Tensor, "batch seq2 hidden"] = torch.ones(2, 3, 4)  # @inspect y

    text("Old way:")
    z = x @ y.transpose(-2, -1)  # batch, sequence, sequence  @inspect z

    text("New (einops) way:")
    z = einsum(x, y, "batch seq1 hidden, batch seq2 hidden -> batch seq1 seq2")  # @inspect z
    text("Dimensions that are not named in the output are summed over.")

    text("Or can use `...` to represent broadcasting over any number of dimensions:")
    z = einsum(x, y, "... seq1 hidden, ... seq2 hidden -> ... seq1 seq2")  # @inspect z
Comparing the traditional style with the einsum style:

  • Implicit dimensions: traditional — yes (with @ you have to remember which axis is -2 and which is -1); einsum — no, every dimension is named explicitly.
  • Dimension bugs hard to debug: traditional — yes; einsum — no, since you control the dimensions directly.
  • Generality: traditional — low, a fixed matrix-multiplication pattern; einsum — high, it also expresses weighted sums, bilinear maps, attention, and other complex operations.
  • Readability: einsum — high (especially handy for teaching and for reproducing papers).
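
As an example of that generality, here is a sketch of computing (unmasked) attention scores with named dimensions; the shapes are made up for illustration and this snippet is not from the lecture code:

    from einops import einsum

    batch, heads, seq, head_dim = 2, 4, 5, 16   # illustrative sizes
    q = torch.randn(batch, heads, seq, head_dim)
    k = torch.randn(batch, heads, seq, head_dim)

    # Contract over head_dim; keep separate query and key positions.
    scores = einsum(q, k, "batch heads query d, batch heads key d -> batch heads query key") / head_dim ** 0.5
    assert scores.shape == (batch, heads, seq, seq)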
    x: Float[torch.Tensor, "batch seq hidden"] = torch.ones(2, 3, 4)  # @inspect x

    text("Old way:")
    y = x.mean(dim=-1)  # @inspect y

    text("New (einops) way:")
    y = reduce(x, "... hidden -> ...", "sum")  # @inspect y
Traditional style vs. einops.reduce:

  • x.mean(dim=2) vs. reduce(x, "... hidden -> ...", "mean") — the meaning of each dimension is clearer (no need to remember axis numbers).
  • x.sum(dim=1) vs. reduce(x, "batch seq hidden -> batch hidden", "sum") — it states explicitly which dimension is reduced away.
  • x.max(dim=-1) vs. reduce(x, "... hidden -> ...", "max") — more readable, well suited to complex tensor manipulations.
def einops_rearrange():
    text("Sometimes, a dimension represents two dimensions")
    text("...and you want to operate on one of them.")

    x: Float[torch.Tensor, "batch seq total_hidden"] = torch.ones(2, 3, 8)  # @inspect x
    text("...where `total_hidden` is a flattened representation of `heads * hidden1`")
    w: Float[torch.Tensor, "hidden1 hidden2"] = torch.ones(4, 4)

    text("Break up `total_hidden` into two dimensions (`heads` and `hidden1`):")
    x = rearrange(x, "... (heads hidden1) -> ... heads hidden1", heads=2)  # @inspect x

    text("Perform the transformation by `w`:")
    x = einsum(x, w, "... hidden1, hidden1 hidden2 -> ... hidden2")  # @inspect x

    text("Combine `heads` and `hidden2` back together:")
    x = rearrange(x, "... heads hidden2 -> ... (heads hidden2)")  # @inspect x
Step by step:

  1. rearrange (split): break total_hidden into (heads × hidden1).
  2. einsum: apply the linear map w to each head's hidden1.
  3. rearrange (merge): flatten (heads × hidden2) back into a single dimension.
B = 16384  # batch size
D = 32768  # input feature dimension
K = 8192   # output feature dimension

x = torch.ones(B, D)  # input
w = torch.randn(D, K)  # weights
y = x @ w  # matrix multiplication

Multiplications: each output element requires D multiplications.

Additions: each output element requires D - 1 additions ≈ D.

So the total FLOP count is $2 \times B \times D \times K$.
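
To see how close a real matrix multiplication gets to the hardware's peak, you can time it and divide; a sketch below, where the peak-FLOP/s value is an assumption to replace with your GPU's spec-sheet number:

    import time

    if torch.cuda.is_available():
        B2, D2, K2 = 4096, 4096, 4096               # smaller illustrative sizes
        x2 = torch.ones(B2, D2, device="cuda")
        w2 = torch.randn(D2, K2, device="cuda")

        torch.cuda.synchronize()                    # make sure prior GPU work is done
        start = time.time()
        y2 = x2 @ w2
        torch.cuda.synchronize()                    # wait for the matmul kernel before stopping the clock
        elapsed = time.time() - start

        actual_flops = 2 * B2 * D2 * K2             # the count derived above
        promised_flops_per_sec = 989e12             # assumed peak throughput; use your GPU's number
        print(f"achieved {actual_flops / elapsed:.2e} FLOP/s, "
              f"MFU ~ {actual_flops / elapsed / promised_flops_per_sec:.2f}")
        # In practice, warm up first and average over many iterations.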

Compute resources

  • By default, tensors are stored on the CPU; you need to move them to the GPU explicitly for faster computation:

    x = torch.zeros(32, 32).to('cuda:0')
    
  • Data transfer between CPU and GPU adds extra overhead.

    text("Forward pass: compute loss")
    x = torch.tensor([1., 2, 3])
    w = torch.tensor([1., 1, 1], requires_grad=True)  # Want gradient
    pred_y = x @ w
    loss = 0.5 * (pred_y - 5).pow(2)

    text("Backward pass: compute gradients")
    loss.backward()
    assert loss.grad is None
    assert pred_y.grad is None
    assert x.grad is None
    assert torch.equal(w.grad, torch.tensor([1, 2, 3]))

By default, intermediate variables (such as loss, pred_y, and x) do not get gradients; only tensors created with requires_grad=True (here, w) have their .grad populated.
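
If you do need the gradient of a non-leaf tensor such as pred_y, you can ask PyTorch to keep it with retain_grad(); a small sketch:

    x = torch.tensor([1., 2, 3])
    w = torch.tensor([1., 1, 1], requires_grad=True)
    pred_y = x @ w
    pred_y.retain_grad()                    # keep the gradient of this intermediate (non-leaf) tensor
    loss = 0.5 * (pred_y - 5).pow(2)
    loss.backward()
    assert pred_y.grad is not None          # d loss / d pred_y = pred_y - 5 = 1
    assert torch.equal(w.grad, torch.tensor([1., 2, 3]))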

def gradients_flops():
    text("Let us do count the FLOPs for computing gradients.")

    text("Revisit our linear model")
    if torch.cuda.is_available():
        B = 16384  # Number of points
        D = 32768  # Dimension
        K = 8192   # Number of outputs
    else:
        B = 1024
        D = 256
        K = 64

    device = get_device()
    x = torch.ones(B, D, device=device)
    w1 = torch.randn(D, D, device=device, requires_grad=True)
    w2 = torch.randn(D, K, device=device, requires_grad=True)

    text("Model: x --w1--> h1 --w2--> h2 -> loss")
    h1 = x @ w1
    h2 = h1 @ w2
    loss = h2.pow(2).mean()

    text("Recall the number of forward FLOPs: "), link(tensor_operations_flops)
    text("- Multiply x[i][j] * w1[j][k]")
    text("- Add to h1[i][k]")
    text("- Multiply h1[i][j] * w2[j][k]")
    text("- Add to h2[i][k]")
    num_forward_flops = (2 * B * D * D) + (2 * B * D * K)  # @inspect num_forward_flops

    text("How many FLOPs is running the backward pass?")
    h1.retain_grad()  # For debugging
    h2.retain_grad()  # For debugging
    loss.backward()

    text("Recall model: x --w1--> h1 --w2--> h2 -> loss")

    text("- h1.grad = d loss / d h1")
    text("- h2.grad = d loss / d h2")
    text("- w1.grad = d loss / d w1")
    text("- w2.grad = d loss / d w2")

    text("Focus on the parameter w2.")
    text("Invoke the chain rule.")

    num_backward_flops = 0  # @inspect num_backward_flops

    text("w2.grad[j,k] = sum_i h1[i,j] * h2.grad[i,k]")
    assert w2.grad.size() == torch.Size([D, K])
    assert h1.size() == torch.Size([B, D])
    assert h2.grad.size() == torch.Size([B, K])
    text("For each (i, j, k), multiply and add.")
    num_backward_flops += 2 * B * D * K  # @inspect num_backward_flops

    text("h1.grad[i,j] = sum_k w2[j,k] * h2.grad[i,k]")
    assert h1.grad.size() == torch.Size([B, D])
    assert w2.size() == torch.Size([D, K])
    assert h2.grad.size() == torch.Size([B, K])
    text("For each (i, j, k), multiply and add.")
    num_backward_flops += 2 * B * D * K  # @inspect num_backward_flops

    text("This was for just w2 (D*K parameters).")
    text("Can do it for w1 (D*D parameters) as well (though don't need x.grad).")
    num_backward_flops += (2 + 2) * B * D * D  # @inspect num_backward_flops

    text("A nice graphical visualization: "), article_link("https://medium.com/@dzmitrybahdanau/the-flops-calculus-of-language-model-training-3b19c1f025e4")
    image("https://miro.medium.com/v2/resize:fit:1400/format:webp/1*VC9y_dHhCKFPXj90Qshj3w.gif", width=500)

    text("Putting it togther:")
    text("- Forward pass: 2 (# data points) (# parameters) FLOPs")
    text("- Backward pass: 4 (# data points) (# parameters) FLOPs")
    text("- Total: 6 (# data points) (# parameters) FLOPs")

retain_grad() keeps the gradient of an intermediate (non-leaf) node; it is mostly used for debugging.

The forward pass consists of two matrix multiplications, and the FLOPs of each are:

  • For the first multiplication, $(B \times D) \times (D \times D)$, the cost is $2 \times B \times D \times D$ FLOPs.
  • For the second multiplication, $(B \times D) \times (D \times K)$, the cost is $2 \times B \times D \times K$ FLOPs.

The backward pass computes the following gradients:

  • h1.grad = ∂loss/∂h1
  • h2.grad = ∂loss/∂h2
  • w1.grad = ∂loss/∂w1
  • w2.grad = ∂loss/∂w2

Taking the weight w2 as an example, the gradient and its FLOP count:

By the chain rule:

  • w2.grad[j,k] = sum_i h1[i,j] * h2.grad[i,k]

FLOPs required:

  • Each (i, j, k) triple needs one multiply and one add, i.e., 2 floating-point operations.
  • The total is therefore $2 \times B \times D \times K$.

Similarly, for the hidden-layer gradient h1.grad:

  • h1.grad[i,j] = sum_k w2[j,k] * h2.grad[i,k]
  • which again costs $2 \times B \times D \times K$ FLOPs.

Summary:

  • Forward pass: $2 \times \text{number of data points} \times \text{number of parameters}$ FLOPs
  • Backward pass: $4 \times \text{number of data points} \times \text{number of parameters}$ FLOPs
  • Total: $6 \times \text{number of data points} \times \text{number of parameters}$ FLOPs
def module_parameters():
    input_dim = 16384
    output_dim = 32

    text("Model parameters are stored in PyTorch as `nn.Parameter` objects.")
    w = nn.Parameter(torch.randn(input_dim, output_dim))
    assert isinstance(w, torch.Tensor)  # Behaves like a tensor
    assert type(w.data) == torch.Tensor  # Access the underlying tensor

    text("## Parameter initialization")

    text("Let's see what happens.")
    x = nn.Parameter(torch.randn(input_dim))
    output = x @ w  # @inspect output
    assert output.size() == torch.Size([output_dim])
    text(f"Note that each element of `output` scales as sqrt(input_dim): {output[0]}.")
    text("Large values can cause gradients to blow up and cause training to be unstable.")

    text("We want an initialization that is invariant to `input_dim`.")
    text("To do that, we simply rescale by 1/sqrt(input_dim)")
    w = nn.Parameter(torch.randn(input_dim, output_dim) / np.sqrt(input_dim))
    output = x @ w  # @inspect output
    text(f"Now each element of `output` is constant: {output[0]}.")

    text("Up to a constant, this is Xavier initialization. "), link(title="[paper]", url="https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf"), link(title="[stackexchange]", url="https://ai.stackexchange.com/questions/30491/is-there-a-proper-initialization-technique-for-the-weight-matrices-in-multi-head")

    text("To be extra safe, we truncate the normal distribution to [-3, 3] to avoid any chance of outliers.")
    w = nn.Parameter(nn.init.trunc_normal_(torch.empty(input_dim, output_dim), std=1 / np.sqrt(input_dim), a=-3, b=3))

This defines a model parameter matrix w of size (input_dim, output_dim).

What nn.Parameter does:

  • Gradient tracking: PyTorch automatically includes it in gradient computation.
  • It marks a learnable parameter used in model training.

Under the hood, nn.Parameter is just a special kind of Tensor.

If the random initialization is too large, the outputs can become extremely large or small, causing gradients to explode or vanish during training so that it fails to converge.

Initialization schemes therefore try to keep the scale of the model's outputs in a fixed range, independent of the input dimension, which keeps training stable.

Xavier initialization
  • To keep the output scale from growing with the input dimension, Xavier (Glorot) initialization was proposed:
    • each weight is drawn from a normal distribution with mean 0 and standard deviation 1/sqrt(input_dim).

Each output element then has variance (assuming unit-variance inputs)
$$\mathrm{Var}(\text{output}) = \text{input\_dim} \times \mathrm{Var}(w) = \text{input\_dim} \times \frac{1}{\text{input\_dim}} = 1$$
so the output variance is a constant (1): a moderate, stable scale.

A truncated normal distribution (trunc_normal) restricts the sampled values to an interval such as [-3, 3], avoiding extreme outliers.
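
A quick empirical check of the scaling argument above (sizes are illustrative; not from the lecture code): with the naive init the output scale grows like sqrt(input_dim), while the rescaled init keeps it near 1.

    input_dim, output_dim = 16384, 32
    x = torch.randn(input_dim)

    w_naive = torch.randn(input_dim, output_dim)
    w_scaled = torch.randn(input_dim, output_dim) / np.sqrt(input_dim)

    print("naive output std:", (x @ w_naive).std().item())    # ~ sqrt(16384) = 128
    print("scaled output std:", (x @ w_scaled).std().item())  # ~ 1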

def note_about_randomness():
    text("Randomness shows up in many places: parameter initialization, dropout, data ordering, etc.")
    text("For reproducibility, we recommend you always pass in a different random seed for each use of randomness.")
    text("Determinism is particularly useful when debugging, so you can hunt down the bug.")

    text("There are three places to set the random seed which you should do all at once just to be safe.")

    # Torch
    seed = 0
    torch.manual_seed(seed)

    # NumPy
    import numpy as np
    np.random.seed(seed)

    # Python
    import random
    random.seed(seed)

We can set the random seed in each of these places to make the randomness reproducible.

def data_loading():
    text("In language modeling, data is a sequence of integers (output by the tokenizer).")

    text("It is convenient to serialize them as numpy arrays (done by the tokenizer).")
    orig_data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=np.int32)
    orig_data.tofile("data.npy")

    text("You can load them back as numpy arrays.")
    text("Don't want to load the entire data into memory at once (LLaMA data is 2.8TB).")
    text("Use memmap to lazily load only the accessed parts into memory.")
    data = np.memmap("data.npy", dtype=np.int32)
    assert np.array_equal(data, orig_data)

    text("A *data loader* generates a batch of sequences for training.")
    B = 2  # Batch size
    L = 4  # Length of sequence
    x = get_batch(data, batch_size=B, sequence_length=L, device=get_device())
    assert x.size() == torch.Size([B, L])

The data is stored as a numpy array with dtype int32, which is compact and fast to read and write.

tofile("data.npy") writes the numpy array to disk in raw binary form.

Memory mapping (memmap) is an efficient way to work with very large files.

memmap loads data into memory only when it is actually accessed, rather than all at once (lazy loading).

This is especially useful when the training data is huge (e.g., LLaMA's training data is about 2.8 TB).

The assert np.array_equal check confirms that the lazily loaded data matches the original exactly.
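
The get_batch helper used above comes from the lecture code; as a rough idea of what such a function might do, here is a hypothetical sketch (sample random offsets, slice out sequences, move them to the device; the real implementation may differ, e.g., by using pinned memory):

    def get_batch_sketch(data: np.ndarray, batch_size: int, sequence_length: int, device: str) -> torch.Tensor:
        # Sample random start positions and gather contiguous slices of length `sequence_length`.
        starts = np.random.randint(0, len(data) - sequence_length, size=batch_size)
        batch = np.stack([data[s : s + sequence_length] for s in starts])  # (batch_size, sequence_length)
        return torch.from_numpy(batch.astype(np.int64)).to(device)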

    # Parameters
    num_parameters = (D * D * num_layers) + D  # @inspect num_parameters
    assert num_parameters == get_num_parameters(model)

    # Activations
    num_activations = B * D * num_layers  # @inspect num_activations

    # Gradients
    num_gradients = num_parameters  # @inspect num_gradients

    # Optimizer states
    num_optimizer_states = num_parameters  # @inspect num_optimizer_states

    # Putting it all together, assuming float32
    total_memory = 4 * (num_parameters + num_activations + num_gradients + num_optimizer_states)  # @inspect total_memory

Memory usage comes mainly from four parts:

  • model parameters
  • activations
  • gradients
  • optimizer state
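
The same accounting, wrapped into a small helper (a sketch assuming float32 everywhere and one optimizer-state value per parameter, matching the code above; Adam, by contrast, keeps two values per parameter):

    def training_memory_bytes(num_parameters: int, num_activations: int, bytes_per_value: int = 4) -> int:
        num_gradients = num_parameters           # one gradient per parameter
        num_optimizer_states = num_parameters    # one optimizer statistic per parameter (assumed)
        return bytes_per_value * (num_parameters + num_activations + num_gradients + num_optimizer_states)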
def checkpointing():
    text("Training language models take a long time and certainly will certainly crash.")
    text("You don't want to lose all your progress.")

    text("During training, it is useful to periodically save your model and optimizer state to disk.")

    model = Cruncher(dim=64, num_layers=3).to(get_device())
    optimizer = AdaGrad(model.parameters(), lr=0.01)

    text("Save the checkpoint:")
    checkpoint = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    torch.save(checkpoint, "model_checkpoint.pt")

    text("Load the checkpoint:")
    loaded_checkpoint = torch.load("model_checkpoint.pt")

Training a language model usually takes a long time (hours or even days), and interruptions (power failures, crashes) are a real risk.
To avoid losing progress, we periodically save the model state and optimizer state to disk; this is called checkpointing.

  • A checkpoint is a Python dict containing all the important state of the model and optimizer.

  • torch.save serializes the dict and writes it to a local file (here "model_checkpoint.pt").

    • You can save it periodically, e.g., every 100 training steps.
  • torch.load reads a previously saved checkpoint file.

  • It returns a Python dict identical to the one that was saved, containing:

    • "model": the model's parameter state
    • "optimizer": the optimizer's state (e.g., learning rate, momentum, gradient statistics)

(The example only loads the checkpoint; in practice you also restore it into the model and optimizer, as sketched below.)
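
Restoring is symmetric: push each saved state dict back into the corresponding object. A minimal sketch using the model and optimizer defined above:

    model.load_state_dict(loaded_checkpoint["model"])           # restore the parameters
    optimizer.load_state_dict(loaded_checkpoint["optimizer"])   # restore optimizer state (e.g., accumulated statistics)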

def mixed_precision_training():
    text("Choice of data type (float32, bfloat16, fp8) have tradeoffs.")
    text("- Higher precision: more accurate/stable, more memory, more compute")
    text("- Lower precision: less accurate/stable, less memory, less compute")

    text("How can we get the best of both worlds?")

    text("Solution: use float32 by default, but use {bfloat16, fp8} when possible.")

    text("A concrete plan:")
    text("- Use {bfloat16, fp8} for the forward pass (activations).")
    text("- Use float32 for the rest (parameters, gradients).")

    text("- Mixed precision training "), link("https://arxiv.org/pdf/1710.03740.pdf")

    text("Pytorch has an automatic mixed precision (AMP) library.")
    link("https://pytorch.org/docs/stable/amp.html")
    link("https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/")

    text("NVIDIA's Transformer Engine supports FP8 for linear layers")
    text("Use FP8 pervasively throughout training "), link("https://arxiv.org/pdf/2310.18313.pdf")

