What is torch.nn really? (Condensed version)

This article builds a neural network from scratch and then gradually introduces PyTorch's higher-level features, such as torch.nn, torch.optim, Dataset, and DataLoader, to show how to train a model efficiently.

This is a condensed version of the tutorial "What is torch.nn really?".

We first create a basic neural network, then gradually add the functionality of torch.nn, torch.optim, Dataset, and DataLoader, to show exactly what each part does.

1. Setting up the MNIST data

We use the classic MNIST dataset, which consists of black-and-white images of handwritten digits (0-9).

We use pathlib to handle paths (part of the Python 3 standard library) and requests to download the data.

from pathlib import Path
import requests

DATA_PATH = Path("data")
PATH = DATA_PATH / "mnist"

PATH.mkdir(parents=True, exist_ok=True)

URL = "http://deeplearning.net/data/mnist/"
FILENAME = "mnist.pkl.gz"

if not (PATH / FILENAME).exists():
        content = requests.get(URL + FILENAME).content
        (PATH / FILENAME).open("wb").write(content)

The dataset is stored as NumPy arrays, serialized with pickle (and compressed with gzip).

import pickle
import gzip

with gzip.open((PATH / FILENAME).as_posix(), "rb") as f:
        ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding="latin-1")

Each image is 28x28 and is stored as a flattened row of length 784 (= 28x28).

Let's look at one of the images:

from matplotlib import pyplot
import numpy as np

pyplot.imshow(x_train[0].reshape((28, 28)), cmap="gray")
print(x_train.shape)

Output:

[Image: the first training image, shown as a 28x28 grayscale digit]

(50000, 784)

PyTorch uses tensors rather than NumPy arrays, so we need to convert our data.

import torch

x_train, y_train, x_valid, y_valid = map(
    torch.tensor, (x_train, y_train, x_valid, y_valid)
)
n, c = x_train.shape
x_train, x_train.shape, y_train.min(), y_train.max()
print(x_train, y_train)
print(x_train.shape)
print(y_train.min(), y_train.max())

Output:

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]]) tensor([5, 0, 4,  ..., 8, 4, 8])
torch.Size([50000, 784])
tensor(0) tensor(9)

2. Building a neural network from scratch (without torch.nn)

First, we create a model using nothing but plain PyTorch tensor operations.

#initializing the weights with Xavier initialisation (by multiplying with 1/sqrt(n)).

import math

weights = torch.randn(784, 10) / math.sqrt(784)
weights.requires_grad_()
bias = torch.zeros(10, requires_grad=True)

def log_softmax(x):
    # log softmax written out from its definition: x - log(sum(exp(x)))
    return x - x.exp().sum(-1).log().unsqueeze(-1)

def model(xb):
    # a simple linear model followed by log softmax; @ is matrix multiplication
    return log_softmax(xb @ weights + bias)

def nll(input, target):
    # negative log-likelihood: take the log-probability of the correct class,
    # average over the batch, and negate
    return -input[range(target.shape[0]), target].mean()

def accuracy(out, yb):
    # fraction of predictions whose argmax matches the label
    preds = torch.argmax(out, dim=1)
    return (preds == yb).float().mean()

loss_func = nll

bs = 64  # batch size

xb = x_train[0:bs]  # a mini-batch from x
yb = y_train[0:bs]

preds = model(xb)  # predictions

print(preds[0], preds.shape)
print(loss_func(preds, yb))
print(accuracy(preds, yb))

Output:

tensor([-1.7022, -3.0342, -2.4138, -2.6452, -2.7764, -2.0892, -2.2945, -2.5480,
        -2.3732, -1.8915], grad_fn=<SelectBackward>) torch.Size([64, 10])

tensor(2.3783, grad_fn=<NegBackward>)
tensor(0.0938)

Now we can run a training loop. For each iteration, we will:

  • select a mini-batch of data
  • use the model to make predictions
  • compute the loss
  • call loss.backward() to compute the gradients of the model's parameters, in this case the weights and bias

We then use those gradients to update the weights and bias (inside torch.no_grad(), since this update should not itself be tracked for gradients) and zero the gradients afterwards so they do not accumulate across batches:

from IPython.core.debugger import set_trace

lr = 0.5  # learning rate
epochs = 2  # how many epochs to train for

for epoch in range(epochs):
    for i in range((n - 1) // bs + 1):
       #set_trace()
        start_i = i * bs
        end_i = start_i + bs
        xb = x_train[start_i:end_i]
        yb = y_train[start_i:end_i]
        pred = model(xb)
        loss = loss_func(pred, yb)

        loss.backward()
        with torch.no_grad():
            weights -= weights.grad * lr
            bias -= bias.grad * lr
            weights.grad.zero_()
            bias.grad.zero_()

print(loss_func(model(xb), yb), accuracy(model(xb), yb))

Output:

tensor(0.0806, grad_fn=<NegBackward>) tensor(1.)

3. Using torch.nn.functional

If you are using negative log-likelihood loss together with a log softmax activation, PyTorch provides F.cross_entropy, which combines the two. So we can even remove the activation function from our model.

import torch.nn.functional as F

loss_func = F.cross_entropy

def model(xb):
    return xb @ weights + bias

Note that we no longer call log_softmax in the model function. Let's confirm that the loss and accuracy are the same as before:

print(loss_func(model(xb), yb), accuracy(model(xb), yb))

Output:

tensor(0.0806, grad_fn=<NllLossBackward>) tensor(1.)
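
As a quick cross-check (not part of the original tutorial; it reuses the xb and yb mini-batch defined above), here is a minimal sketch showing that F.cross_entropy on raw logits is equivalent to applying log_softmax followed by the negative log-likelihood loss:

logits = model(xb)  # raw scores: the model no longer applies log_softmax
manual = F.nll_loss(F.log_softmax(logits, dim=-1), yb)  # two-step version
fused = F.cross_entropy(logits, yb)                     # fused version
print(torch.allclose(manual, fused))  # expected: True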

4. Refactoring with nn.Module

We subclass nn.Module (which is itself a class and is able to keep track of state) and then instantiate the model:

from torch import nn

class Mnist_Logistic(nn.Module):
    def __init__(self):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(784, 10) / math.sqrt(784))
        self.bias = nn.Parameter(torch.zeros(10))

    def forward(self, xb):
        return xb @ self.weights + self.bias
        
model = Mnist_Logistic()

print(loss_func(model(xb), yb))

Output:

tensor(2.3558, grad_fn=<NllLossBackward>)

We wrap the training loop in a fit function so that we can run it again later.

def fit():
    for epoch in range(epochs):
        for i in range((n - 1) // bs + 1):
            start_i = i * bs
            end_i = start_i + bs
            xb = x_train[start_i:end_i]
            yb = y_train[start_i:end_i]
            pred = model(xb)
            loss = loss_func(pred, yb)

            loss.backward()
            with torch.no_grad():
                for p in model.parameters():
                    p -= p.grad * lr
                model.zero_grad()

fit()

print(loss_func(model(xb), yb))

Output:

tensor(0.0826, grad_fn=<NllLossBackward>)

5. Refactoring with nn.Linear

We use PyTorch's nn.Linear class to build a linear layer, instead of manually defining and initializing self.weights and self.bias and computing xb @ self.weights + self.bias ourselves.

class Mnist_Logistic(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(784, 10)

    def forward(self, xb):
        return self.lin(xb)

model = Mnist_Logistic()
print(loss_func(model(xb), yb))

Output:

tensor(2.3156, grad_fn=<NllLossBackward>)

We can still use the fit function exactly as before:

fit()

print(loss_func(model(xb), yb))

Output:

tensor(0.0809, grad_fn=<NllLossBackward>)

6. Refactoring with optim

We define a function to create our model and optimizer so we can reuse it later. In the training loop below, opt.step() and opt.zero_grad() replace the manual parameter update and gradient zeroing from before.

from torch import optim

def get_model():
    model = Mnist_Logistic()
    return model, optim.SGD(model.parameters(), lr=lr)

model, opt = get_model()
print(loss_func(model(xb), yb))

for epoch in range(epochs):
    for i in range((n - 1) // bs + 1):
        start_i = i * bs
        end_i = start_i + bs
        xb = x_train[start_i:end_i]
        yb = y_train[start_i:end_i]
        pred = model(xb)
        loss = loss_func(pred, yb)

        loss.backward()
        opt.step()
        opt.zero_grad()

print(loss_func(model(xb), yb))

Output:

tensor(2.2861, grad_fn=<NllLossBackward>)
tensor(0.0815, grad_fn=<NllLossBackward>)

7. Refactoring with Dataset

TensorDataset wraps our tensors so that x_train and y_train can be indexed and sliced together in a single step, instead of slicing them separately.

from torch.utils.data import TensorDataset

train_ds = TensorDataset(x_train, y_train)
model, opt = get_model()

for epoch in range(epochs):
    for i in range((n - 1) // bs + 1):
        xb, yb = train_ds[i * bs: i * bs + bs]
        pred = model(xb)
        loss = loss_func(pred, yb)

        loss.backward()
        opt.step()
        opt.zero_grad()

print(loss_func(model(xb), yb))

Output:

tensor(0.0800, grad_fn=<NllLossBackward>)

8. Refactoring with DataLoader

DataLoader takes any Dataset and manages batching for us, giving us each mini-batch automatically, so we no longer have to compute the slice indices ourselves.

from torch.utils.data import DataLoader

train_ds = TensorDataset(x_train, y_train)
train_dl = DataLoader(train_ds, batch_size=bs)

for epoch in range(epochs):
    for xb, yb in train_dl:
        pred = model(xb)
        loss = loss_func(pred, yb)

        loss.backward()
        opt.step()
        opt.zero_grad()

print(loss_func(model(xb), yb))

Output:

tensor(0.0821, grad_fn=<NllLossBackward>)

9. Adding validation

In practice you should always have a validation set as well, in order to detect overfitting. We shuffle the training data to avoid correlation between consecutive batches, while the validation set does not need shuffling; and since validation requires no backpropagation (and thus less memory), we can use a batch size twice as large for it.

train_ds = TensorDataset(x_train, y_train)
train_dl = DataLoader(train_ds, batch_size=bs, shuffle=True)

valid_ds = TensorDataset(x_valid, y_valid)
valid_dl = DataLoader(valid_ds, batch_size=bs * 2)

We will compute and print the validation loss at the end of each epoch. (Note that we always call model.train() before training and model.eval() before inference, because these modes are used by layers such as nn.BatchNorm2d and nn.Dropout to ensure appropriate behavior in each phase.)
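
As a small illustration (not from the original tutorial; it assumes torch and nn are already imported as above), here is a sketch of why this mode switch matters: nn.Dropout only zeroes activations in training mode.

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()    # training mode: roughly half the entries are zeroed,
print(drop(x))  # the survivors are scaled by 1 / (1 - p) = 2

drop.eval()     # evaluation mode: dropout is a no-op
print(drop(x))  # prints all ones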

model, opt = get_model()

for epoch in range(epochs):
    model.train()
    for xb, yb in train_dl:
        pred = model(xb)
        loss = loss_func(pred, yb)

        loss.backward()
        opt.step()
        opt.zero_grad()

    model.eval()
    with torch.no_grad():
        valid_loss = sum(loss_func(model(xb), yb) for xb, yb in valid_dl)

    print(epoch, valid_loss / len(valid_dl))

Output:

0 tensor(0.2981)
1 tensor(0.3033)

10. Creating fit() and get_data()

The loss_batch function computes the loss for one batch; if an optimizer is passed in, it also performs the backward pass and the update step.

def loss_batch(model, loss_func, xb, yb, opt=None):
    loss = loss_func(model(xb), yb)

    if opt is not None:
        loss.backward()
        opt.step()
        opt.zero_grad()

    return loss.item(), len(xb)

fit runs the operations needed to train our model and computes the validation loss for each epoch.

import numpy as np

def fit(epochs, model, loss_func, opt, train_dl, valid_dl):
    for epoch in range(epochs):
        model.train()
        for xb, yb in train_dl:
            loss_batch(model, loss_func, xb, yb, opt)

        model.eval()
        with torch.no_grad():
            losses, nums = zip(
                *[loss_batch(model, loss_func, xb, yb) for xb, yb in valid_dl]
            )
        val_loss = np.sum(np.multiply(losses, nums)) / np.sum(nums)

        print(epoch, val_loss)

get_data returns DataLoaders for the training and validation sets.

def get_data(train_ds, valid_ds, bs):
    return (
        DataLoader(train_ds, batch_size=bs, shuffle=True),
        DataLoader(valid_ds, batch_size=bs * 2),
    )

Now the whole process of getting the DataLoaders and fitting the model can be run in 3 lines of code:

train_dl, valid_dl = get_data(train_ds, valid_ds, bs)
model, opt = get_model()
fit(epochs, model, loss_func, opt, train_dl, valid_dl)

Output:

0 0.3055081913471222
1 0.31777948439121245

11. Summary

We now have a general data pipeline and training loop that you can use to train many kinds of PyTorch models. The role of each part is summarized below; a small illustrative sketch follows the list:

  • torch.nn
    • Module: creates a callable object that behaves like a function but can also hold state (such as neural network layer weights). It knows which Parameters it contains, and can zero all their gradients, loop through them to update weights, and so on.
    • Parameter: a wrapper around a tensor that tells a Module it has weights that need to be updated during backpropagation. Only tensors with the requires_grad attribute set are updated.
    • functional: a module (usually imported into the F namespace by convention) that contains activation functions, loss functions, etc., as well as stateless (non-stateful) versions of layers such as convolutional and linear layers.
  • torch.optim: contains optimizers such as SGD, which update the weights of Parameters during the backward step.
  • Dataset: an abstract interface for objects with __len__ and __getitem__, including classes provided by PyTorch such as TensorDataset.
  • DataLoader: takes any Dataset and creates an iterator that returns batches of data.
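
To make the Module / Parameter / functional relationship concrete (as mentioned above), here is a minimal sketch that is not part of the original tutorial: a hand-rolled module built from nn.Parameter and the stateless F.linear produces the same output as nn.Linear once both hold identical weights.

class TinyLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        # Parameters registered on a Module are found by model.parameters()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # F.linear is the stateless counterpart of nn.Linear: x @ weight.T + bias
        return F.linear(x, self.weight, self.bias)

tiny = TinyLinear(784, 10)
ref = nn.Linear(784, 10)
with torch.no_grad():
    ref.weight.copy_(tiny.weight)  # copy the same weights into the built-in layer
    ref.bias.copy_(tiny.bias)

x = torch.randn(2, 784)
print(torch.allclose(tiny(x), ref(x)))  # expected: True
print(len(list(tiny.parameters())))     # 2: weight and bias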
