Hands-on PyTorch: Overfitting and Underfitting

言敔

Published 2025-01-06 18:50:43

Tags: deep learning, pytorch, neural networks, artificial intelligence

Article link: https://blog.youkuaiyun.com/weixin_48047941/article/details/144969940

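Basic concepts

Training error: the error of the model computed on the training dataset.

Generalization error: the expected error of the model on an infinite stream of samples drawn from the same underlying distribution as the training data.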
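To make the two definitions concrete, here is a minimal sketch in plain Python (not PyTorch). The data distribution `y = 2x + noise` and the "memorizing" model are made up for illustration and do not come from this article. The model reproduces its training labels exactly, so its training error is zero, while its error on a large fresh sample (a finite stand-in for "infinitely many" samples) is not.

```python
import random

random.seed(0)

def sample(n):
    """Draw n (x, y) pairs from y = 2x + Gaussian noise (a made-up distribution)."""
    data = []
    for _ in range(n):
        x = random.gauss(0.0, 1.0)
        data.append((x, 2.0 * x + random.gauss(0.0, 0.5)))
    return data

def memorizer(train_set):
    """Return a model that predicts the label of the nearest training input."""
    def predict(x):
        nearest = min(train_set, key=lambda pair: abs(pair[0] - x))
        return nearest[1]
    return predict

def mse(model, data):
    """Mean squared error of the model on a list of (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

train_set = sample(10)
fresh_set = sample(10_000)  # large fresh sample approximating the true distribution
model = memorizer(train_set)

train_error = mse(model, train_set)
generalization_estimate = mse(model, fresh_set)
print(f"training error: {train_error:.4f}")
print(f"estimated generalization error: {generalization_estimate:.4f}")
```

The gap between the two numbers is what the rest of this article tries to control.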

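Generating the dataset

The dataset is generated from the following function:

y = 5 + 1.2x - 3.4x^2/2! + 5.6x^3/3! + ε, where ε ~ N(0, 0.1^2)

The data is generated in the following order:

Generate the raw inputs x -> compute the powers x^i for i = 0, ..., 19 -> divide the i-th power by i! (that is, gamma(i+1)) -> take the dot product with the coefficients and add noise to get the labels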
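As a worked check of these steps for a single input, the sketch below uses plain Python only. The helper names `poly_features_of` and `noiseless_label` are hypothetical and the noise term is left out so the result is deterministic.

```python
import math

true_coeffs = [5.0, 1.2, -3.4, 5.6]  # the four non-zero coefficients used in this article

def poly_features_of(x, max_degree):
    """Return [x^0/0!, x^1/1!, ..., x^(max_degree-1)/(max_degree-1)!]."""
    return [x ** i / math.factorial(i) for i in range(max_degree)]

def noiseless_label(x):
    """Dot product of the rescaled powers with the coefficients (no noise added)."""
    feats = poly_features_of(x, len(true_coeffs))
    return sum(w * f for w, f in zip(true_coeffs, feats))

feats = poly_features_of(2.0, 4)
label = noiseless_label(2.0)
print(feats)  # 1, x, x^2/2!, x^3/3! evaluated at x = 2
print(label)  # 5 + 1.2*2 - 3.4*2 + 5.6*(8/6)
```

Dividing by i! keeps the high-degree features from growing very large, which keeps the gradients and losses in a reasonable range.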

import math
import numpy as np
import torch
from torch import nn
from d2l import torch as d2l

max_degree = 20  # maximum degree of the polynomial
n_train, n_test = 100, 100  # training and test set sizes
true_w = np.zeros(max_degree)  # room for all coefficients; unused ones stay zero
true_w[0:4] = np.array([5, 1.2, -3.4, 5.6])

features = np.random.normal(size=(n_train + n_test, 1))
np.random.shuffle(features)
poly_features = np.power(features, np.arange(max_degree).reshape(1, -1))
for i in range(max_degree):
    poly_features[:, i] /= math.gamma(i + 1)  # gamma(n) = (n-1)!, so this divides by i!

# shape of labels: (n_train + n_test,)
labels = np.dot(poly_features, true_w)
labels += np.random.normal(scale=0.1, size=labels.shape)

# convert everything to tensors
true_w, features, poly_features, labels = [torch.tensor(x, dtype=
    torch.float32) for x in [true_w, features, poly_features, labels]]

# show the first two samples
features[:2], poly_features[:2, :], labels[:2]

Training and testing

def evaluate_loss(net,data_iter,loss):
    metric = d2l.Accumulator(2)
    for X,y in data_iter:
        out = net(X)
        y = y.reshape(out.shape)
        l = loss(out,y)
        metric.add(l.sum(),y.numel())
    return metric[0]/metric[1]

def train(train_features, test_features, train_labels, test_labels,
          num_epochs=400):
    # define the loss function
    loss = nn.MSELoss(reduction='none')
    input_shape = train_features.shape[-1]
    # define the model; no bias, because the constant term is already among the polynomial features
    net = nn.Sequential(nn.Linear(input_shape, 1, bias=False))
    batch_size = min(10, train_labels.shape[0])

    # build the data iterators for the training and test sets
    train_iter = d2l.load_array((train_features, train_labels.reshape(-1,1)),
                                batch_size)
    test_iter = d2l.load_array((test_features, test_labels.reshape(-1,1)),
                               batch_size, is_train=False)

    # set up the optimizer and the plotting tool
    trainer = torch.optim.SGD(net.parameters(), lr=0.01)
    animator = d2l.Animator(xlabel='epoch', ylabel='loss', yscale='log',
                            xlim=[1, num_epochs], ylim=[1e-3, 1e2],
                            legend=['train', 'test'])

    # start training
    for epoch in range(num_epochs):
        d2l.train_epoch_ch3(net, train_iter, loss, trainer)
        if epoch == 0 or (epoch + 1) % 20 == 0:
            animator.add(epoch + 1, (evaluate_loss(net, train_iter, loss),
                                     evaluate_loss(net, test_iter, loss)))
    print('weight:', net[0].weight.data.numpy())

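Train on the first four dimensions, which matches the degree of the polynomial that actually generated the data

# pick the first 4 dimensions of the polynomial features, i.e. 1, x, x^2/2!, x^3/3!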
train(poly_features[:n_train, :4], poly_features[n_train:, :4],
      labels[:n_train], labels[n_train:])

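The learned weights turn out to be close to the true ones, [5, 1.2, -3.4, 5.6]

When only the first two dimensions are selected:

# pick the first 2 dimensions of the polynomial features, i.e. 1 and x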
train(poly_features[:n_train, :2], poly_features[n_train:, :2],
      labels[:n_train], labels[n_train:])

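The loss never comes down, and both the training and test losses stay high, which is underfitting

When a higher-dimensional linear layer is used to fit the data:

# pick all dimensions of the polynomial features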
train(poly_features[:n_train, :], poly_features[n_train:, :],
      labels[:n_train], labels[n_train:], num_epochs=1500)

Although the loss on the training set becomes very low, the test loss remains high, which shows that the model has overfitted

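Solutions

Regarding overfitting:

In linear regression, overfitting can occur when the model has too many parameters but too few samples

In neural networks, the model focuses on interactions among different features and may over-strengthen those associations, which leads to overfitting

At the root,

one kind of overfitting comes from having too few samples for too many parameters, so the model is heavily disturbed by noise

the other comes from over-mining the associations among features and imposing associations that are not really there

For the former, we use regularization, relying on weight decay to dampen the effect of noise

For the latter, we use dropout, which randomly zeroes some hidden units during training to make the model more robust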

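Weight decay

The reasoning behind this fix for overfitting: when the model has many parameters but the sample is small, the model can become very sensitive to tiny perturbations in the data, and this leads to overfitting. To counter this, we make the model less sensitive to the data in individual features by shrinking the weights. This is called weight decay. It helps the model learn more general features and reduces the influence of noise

We add a penalty on the size of the weights to the loss function. If the model becomes too complex, the penalty keeps the weights in check, which gives the model better generalization

Earlier we introduced the L1 and L2 norms, both of which measure the size of the weights

Here we choose the L2 norm. The two differ as follows:

Our goal is for the model to spread its weight evenly across a large number of features, which makes it more robust to measurement error in any single variable

The L1 norm, in contrast, concentrates the weight on a small subset of features. That does not fit the current scenario and is better suited to feature selection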
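A small numeric comparison may help here. The sketch below is plain Python with two hand-picked weight vectors (hypothetical values, chosen so that their entries sum to the same total). The L1 penalty cannot tell them apart, while the squared L2 penalty is clearly lower for the evenly spread vector, which is why minimizing it pushes toward spread-out weights.

```python
def l1_penalty(w):
    """Sum of absolute values of the weights."""
    return sum(abs(v) for v in w)

def l2_penalty(w):
    """Half the sum of squared weights, the same form as L2_penalty in this article."""
    return sum(v * v for v in w) / 2

concentrated = [1.0, 0.0, 0.0, 0.0]  # all weight on a single feature
spread = [0.25, 0.25, 0.25, 0.25]    # same total weight, evenly distributed

print("L1:", l1_penalty(concentrated), l1_penalty(spread))
print("L2:", l2_penalty(concentrated), l2_penalty(spread))
```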

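Next, we implement this in code for the following linear function with d = 200 inputs:

y = 0.05 + sum_{i=1}^{d} 0.01 x_i + ε, where ε ~ N(0, 0.01^2) is the noise added by d2l.synthetic_data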

# set the true parameters and generate the dataset
n_train, n_test, num_inputs, batch_size = 20, 100, 200, 5
true_w, true_b = torch.ones((num_inputs, 1)) * 0.01, 0.05
train_data = d2l.synthetic_data(true_w, true_b, n_train)
train_iter = d2l.load_array(train_data, batch_size)
test_data = d2l.synthetic_data(true_w, true_b, n_test)
test_iter = d2l.load_array(test_data, batch_size, is_train=False)

# initialize the parameters
def init_params():
    w = torch.normal(0, 1, size=(num_inputs, 1), requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    return [w, b]

# define the penalty term: squared L2 norm divided by 2
def L2_penalty(w):
    return torch.sum(w.pow(2)) / 2

def train(lambd):
    # initialize the parameters
    w, b = init_params()
    # define the model and the loss; "lambda X: ..." is an anonymous function of X
    net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss

    # set the number of epochs and the learning rate, and configure the plot
    num_epochs, lr = 100, 0.03
    animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
                            xlim=[5, num_epochs], legend=['train', 'test'])

    for epoch in range(num_epochs):
        for X, y in train_iter:
            # loss with the penalty term added
            l = loss(net(X), y) + lambd * L2_penalty(w)
            l.sum().backward()
            d2l.sgd([w, b], lr, batch_size)

        # record a point on the plot every 5 epochs
        if (epoch + 1) % 5 == 0:
            animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
                                     d2l.evaluate_loss(net, test_iter, loss)))
    print('L2 norm of w:', torch.norm(w).item())

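First, train without the penalty term in the loss

The training loss decreases but the test loss barely changes, so the model is very likely overfitting

Next, introduce the penalty term. The training loss is not as low as before, but the test loss drops noticeably, so the regularization is working

Next, use PyTorch's built-in functionality for a concise implementation of the code above

Deep learning frameworks build weight decay into the optimizer, which makes it easy to combine with any loss function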

def train_concise(wd):
    # define the network
    net = nn.Sequential(nn.Linear(num_inputs, 1))
    # initialize the parameters
    for param in net.parameters():
        param.data.normal_()
    # define the loss
    loss = nn.MSELoss(reduction='none')
    num_epochs, lr = 100, 0.03
    # define the optimizer: weight decay on the weight only, not on the bias
    trainer = torch.optim.SGD([
        {"params": net[0].weight, 'weight_decay': wd},
        {"params": net[0].bias}],
        lr=lr
    )
    # plotting tool
    animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
                            xlim=[5, num_epochs], legend=['train', 'test'])
    # training
    for epoch in range(num_epochs):
        for X, y in train_iter:
            trainer.zero_grad()
            l = loss(net(X), y)
            l.mean().backward()
            trainer.step()
        # record a point every 5 epochs (once per epoch, outside the batch loop)
        if (epoch + 1) % 5 == 0:
            animator.add(epoch + 1,
                         (d2l.evaluate_loss(net, train_iter, loss),
                          d2l.evaluate_loss(net, test_iter, loss)))
    print('L2 norm of w:', net[0].weight.norm().item())

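This achieves the same effect as before

What this does not spell out is how weight decay works together with the loss function. With plain SGD, setting weight_decay=wd adds wd * w to the gradient of each affected parameter before the update, and wd * w is exactly the gradient of the penalty (wd / 2) * ||w||^2. The result is the same as adding the L2 penalty to the loss by hand

Implementing weight decay in the optimizer rather than modifying the loss function directly has the following advantages:

Efficiency: the optimizer adjusts the gradient directly, so the regularization term never has to be computed explicitly.
Flexibility: different parameters (such as weights and biases) can be given different regularization rules. In the code above, for example, the weight net[0].weight is regularized while the bias net[0].bias is not.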
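The equivalence between the two approaches can be checked on a single scalar weight. The following is a minimal sketch in plain Python, assuming plain SGD (no momentum) and a one-sample squared loss. The function names and numbers are hypothetical and are not taken from PyTorch's source.

```python
import math

def grad_data_loss(w, x, y):
    """Gradient of the squared loss 0.5 * (w*x - y)^2 with respect to w."""
    return (w * x - y) * x

def step_explicit_penalty(w, x, y, lr, lambd):
    """SGD step on 0.5*(w*x - y)^2 + lambd * w^2 / 2 (penalty inside the loss)."""
    grad = grad_data_loss(w, x, y) + lambd * w  # d/dw of lambd * w^2 / 2 is lambd * w
    return w - lr * grad

def step_weight_decay(w, x, y, lr, wd):
    """SGD step on the plain loss, with the weight shrunk by the 'optimizer'."""
    return (1 - lr * wd) * w - lr * grad_data_loss(w, x, y)

w0, x, y, lr, coeff = 1.5, 2.0, 1.0, 0.03, 3.0
w_penalty = step_explicit_penalty(w0, x, y, lr, coeff)
w_decay = step_weight_decay(w0, x, y, lr, coeff)
print(w_penalty, w_decay)
```

Both updates land on the same value, and the second form shows where the name comes from: each step first multiplies the weight by (1 - lr * wd), so it "decays" toward zero before the data gradient is applied.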

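Dropout

Goals of using dropout:

(1) Reduce overfitting

Problem: neural networks usually have a large number of parameters and may overfit the training data, learning noise or patterns specific to that data, which hurts generalization.
What dropout does:

- Randomly dropping some neurons forces the network to learn redundant, general features rather than relying too heavily on particular neurons.
- Improves the model's robustness to noise.

(2) Make the network more robust

Problem: if a few neurons have an outsized influence on the output, the model may become very sensitive to small changes in the input.
What dropout does:

- Simulates many different network architectures (similar to model ensembling), which makes the model more robust.
- Prevents the model from depending too heavily on specific neurons.

(3) Reduce mutual dependence between neurons

Problem: neurons may come to depend on one another, which gets in the way of learning better features.
What dropout does:

- Breaks these dependencies, so each unit is encouraged to learn useful features on its own.

For example, dropout for an MLP can be implemented as follows: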

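def dropout(X, dropout):
    assert 0 <= dropout <= 1
    # drop everything
    if dropout == 1:
        return torch.zeros_like(X)
    # keep everything
    if dropout == 0:
        return X
    # generate a random mask: 1 keeps an element, 0 drops it
    mask = (torch.rand(X.shape) > dropout).float()

    # apply the mask and rescale so the expected value is unchanged
    return mask * X / (1.0 - dropout)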

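Although this is described as dropping each element of the input X with probability dropout, on a small sample the fraction actually dropped need not equal dropout

That is because the implementation draws a random number for each element with rand and compares it against dropout, so each element is dropped independently

For example, in one actual run 11 of the 16 elements were dropped. As the number of elements grows, the dropped fraction approaches the dropout probability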
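The small-sample effect can be reproduced without PyTorch. Below is a minimal sketch in plain Python that mirrors the masking and rescaling logic on a list. The helper names, the seed, and the sample sizes are assumptions made for illustration. No specific output is claimed for the 16-element case because it depends on the random draws.

```python
import random

def dropout_list(values, p, rng):
    """Inverted dropout on a list: keep an element if its random draw exceeds p."""
    assert 0 <= p < 1
    return [v / (1.0 - p) if rng.random() > p else 0.0 for v in values]

def dropped_fraction(n, p, rng):
    """Fraction of n all-ones inputs that were zeroed by dropout."""
    out = dropout_list([1.0] * n, p, rng)
    return sum(1 for v in out if v == 0.0) / n

rng = random.Random(0)
p = 0.5
small = dropped_fraction(16, p, rng)       # can be far from p
large = dropped_fraction(200_000, p, rng)  # close to p
out = dropout_list([1.0] * 200_000, p, rng)
mean_after = sum(out) / len(out)           # rescaling keeps the mean near 1

print(f"dropped fraction, 16 elements: {small:.3f}")
print(f"dropped fraction, 200000 elements: {large:.3f}")
print(f"mean after dropout: {mean_after:.3f}")
```

The last line shows why the kept elements are divided by (1 - dropout): the expected value of each activation stays the same, so nothing needs to be rescaled at test time.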

Next, we apply dropout to an MLP with two hidden layers

# hyperparameters
num_inputs, num_outputs, num_hiddens1, num_hiddens2 = 784, 10, 256, 256

dropout1, dropout2 = 0.2, 0.5

# build the model
class Net(nn.Module):
    def __init__(self, num_inputs, num_outputs, num_hiddens1, num_hiddens2, is_training=True):
        super(Net, self).__init__()
        self.num_inputs = num_inputs
        self.training = is_training
        self.lin1 = nn.Linear(num_inputs, num_hiddens1)
        self.lin2 = nn.Linear(num_hiddens1, num_hiddens2)
        self.lin3 = nn.Linear(num_hiddens2, num_outputs)
        self.relu = nn.ReLU()

    def forward(self, X):
        # flatten each input to a vector of length num_inputs
        H1 = self.relu(self.lin1(X.reshape((-1, self.num_inputs))))
        # apply dropout only while training
        if self.training:
            H1 = dropout(H1, dropout1)
        H2 = self.relu(self.lin2(H1))
        if self.training:
            H2 = dropout(H2, dropout2)
        out = self.lin3(H2)
        return out

net = Net(num_inputs, num_outputs, num_hiddens1, num_hiddens2)

Concise implementation: deep learning frameworks already provide a dropout layer, so we can use it directly

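# define the network structure
net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(dropout1),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Dropout(dropout2),
    nn.Linear(256, 10)
)

def init_weights(m):
    # initialize the weights of every linear layer from N(0, 0.1^2)
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, std=0.1)

net.apply(init_weights);

# Assumed training setup (not given in the original post): Fashion-MNIST data
# and these hyperparameters; adjust them as needed
num_epochs, lr, batch_size = 10, 0.5, 256
loss = nn.CrossEntropyLoss(reduction='none')
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
trainer = torch.optim.SGD(net.parameters(), lr=lr)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)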