Deep Learning: Multilayer Perceptrons (Part 3)

Article link: https://blog.youkuaiyun.com/qq_52952281/article/details/144477593

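I. Multilayer Perceptrons

1. Limitations of Linear Models

Softmax regression maps inputs directly to outputs through a single affine transformation and then applies the softmax operation.
However, the linearity built into an affine transformation is a strong assumption: many relationships are nonlinear rather than linear.

2. Hidden Layers

We can overcome the limitations of linear models by adding one or more hidden layers to the network, so that it can handle more general classes of functions. The simplest way to do this is to stack many fully connected layers on top of one another. Each layer feeds into the layer above it until the final output is produced. We can think of the first L-1 layers as the representation and the last layer as a linear predictor. This architecture is commonly called a multilayer perceptron (MLP).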
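Stacking fully connected layers only helps if something nonlinear sits between them. The plain-Python sketch below (all matrices are made-up illustrative values, and the helper names `matmul` and `affine` are hypothetical) suggests why: two affine layers with no activation in between give the same outputs as one affine layer with W = W1 W2 and b = b1 W2 + b2.

```python
def matmul(A, B):
    """Multiply matrix A (n x k) by matrix B (k x m), both given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def affine(X, W, b):
    """Compute X W + b, adding the bias row b to every row of X W."""
    return [[v + bj for v, bj in zip(row, b)] for row in matmul(X, W)]

# Illustrative values only (not taken from the article)
X = [[1.0, 2.0], [3.0, -1.0]]
W1, b1 = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]], [0.1, 0.2, 0.3]
W2, b2 = [[1.0], [-2.0], [0.5]], [0.7]

# Two stacked affine layers with no nonlinearity in between
stacked = affine(affine(X, W1, b1), W2, b2)

# The single affine layer they collapse to
W = matmul(W1, W2)
b = [v + bj for v, bj in zip(matmul([b1], W2)[0], b2)]
collapsed = affine(X, W, b)

# Largest elementwise difference between the two outputs (floating-point noise only)
max_diff = max(abs(s - c) for rs, rc in zip(stacked, collapsed) for s, c in zip(rs, rc))
print(max_diff)
```

This is the motivation for the activation functions in the next subsection: a nonlinearity after each hidden layer keeps the stack from collapsing back into a linear model.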

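3. Activation Functions

ReLU: the rectified linear unit, ReLU(x) = max(x, 0). It passes positive inputs through unchanged and sets negative inputs to 0, so its output lies in [0, +∞).
Sigmoid: squashes the input into the interval (0, 1), damping the influence of very large and very small values.
Tanh: squashes the input into the interval (-1, 1).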
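As a quick check of the ranges listed above, here is a minimal sketch that evaluates the three activations on a few scalar inputs using only Python's `math` module. The sample points are arbitrary.

```python
import math

def relu(x):
    """Rectified linear unit: max(x, 0)."""
    return max(x, 0.0)

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + exp(-x)), with values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Hyperbolic tangent, with values in (-1, 1)."""
    return math.tanh(x)

# Arbitrary sample inputs
for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  sigmoid={sigmoid(x):.4f}  tanh={tanh(x):.4f}")
```

ReLU is unbounded above, while sigmoid and tanh saturate for large |x|.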

II. Implementing a Multilayer Perceptron from Scratch

import torch
from torch import nn
from d2l import torch as d2l

# 1. MLP with a single hidden layer
num_epochs, lr = 10, 0.01
batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)

num_inputs, num_outputs, num_hiddens = 784, 10, 256

# Small random weights and zero biases; nn.Parameter already sets requires_grad=True
w1 = nn.Parameter(torch.randn(num_inputs, num_hiddens) * 0.01)
b1 = nn.Parameter(torch.zeros(num_hiddens))
w2 = nn.Parameter(torch.randn(num_hiddens, num_outputs) * 0.01)
b2 = nn.Parameter(torch.zeros(num_outputs))

params = [w1, b1, w2, b2]

def relu(x):
    a = torch.zeros_like(x)
    return torch.max(x,a)

def net(X):
    X = X.reshape((-1, num_inputs))
    H = relu(X @ w1 + b1) # "@" denotes matrix multiplication
    return (H @ w2 + b2)

loss = nn.CrossEntropyLoss(reduction='none')

updater = torch.optim.SGD(params, lr=lr)

# 2. Train the model


d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, updater)

III. Concise Implementation of the Multilayer Perceptron

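import torch
import torch.nn as nn
from d2l import torch as d2l  # needed below for the data loader and training loop

# One hidden layer; the output layer has 10 units, one per Fashion-MNIST class
net = nn.Sequential(nn.Flatten(),
                    nn.Linear(784, 256),
                    nn.ReLU(),
                    nn.Linear(256, 10)
                    )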
def init_weight(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, std=0.01)

net.apply(init_weight); 
batch_size, lr, num_epochs = 256, 0.1, 40

loss = nn.CrossEntropyLoss(reduction='none')
optimer = torch.optim.SGD(net.parameters(), lr=lr)
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)

d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, optimer)

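# Variant with two hidden layers: 784 -> 256 -> 64 -> 10
import torch
import torch.nn as nn
from d2l import torch as d2l  # needed below for the data loader and training loop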

net = nn.Sequential(nn.Flatten(),
                    nn.Linear(784, 256),
                    nn.ReLU(),
                    nn.Linear(256, 64),
                    nn.ReLU(),
                    nn.Linear(64, 10),     
                    )

def init_weight(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, std=0.01)  # normal_ is the in-place initializer; nn.init.normal is deprecated

net.apply(init_weight); 
batch_size, lr, num_epochs = 256, 0.1, 20

loss = nn.CrossEntropyLoss(reduction='none')
optimer = torch.optim.SGD(net.parameters(), lr=lr)
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)

d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, optimer)

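IV. Regularization

1. Weight Decay

When overfitting occurs, we need regularization techniques to mitigate it. When training parametric machine learning models, weight decay is one of the most widely used regularization techniques; it is also commonly called L2 regularization.

A linear model with L2 regularization is the classic ridge regression algorithm, while linear regression with L1 regularization is a similarly fundamental model in statistics, usually known as lasso regression. One reason to use the L2 norm is that it places an outsized penalty on large components of the weight vector.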
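To see what the penalty does to a single update, here is a minimal plain-Python sketch. It assumes the objective is loss + (wd / 2) * ||w||^2, so the penalty adds wd * w to the gradient. The function names and numbers are illustrative, not taken from the article.

```python
def sgd_step_with_weight_decay(w, grad, lr, wd):
    """One SGD step on loss + (wd / 2) * ||w||^2.

    The penalty contributes wd * w to the gradient, so each weight is
    shrunk by a factor (1 - lr * wd) on top of the usual gradient step.
    """
    return [wi - lr * (gi + wd * wi) for wi, gi in zip(w, grad)]

def l2_norm(w):
    """Euclidean norm of a weight vector given as a list."""
    return sum(wi * wi for wi in w) ** 0.5

w = [3.0, -4.0]          # illustrative weights with L2 norm 5
zero_grad = [0.0, 0.0]   # pretend the data loss has zero gradient here

w_decayed = sgd_step_with_weight_decay(w, zero_grad, lr=0.1, wd=0.5)
print(l2_norm(w), l2_norm(w_decayed))
```

Even with a zero data gradient the weights move toward the origin, which is where the name "weight decay" comes from. Larger components shrink by a larger absolute amount.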

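import torch
import torch.nn as nn
from d2l import torch as d2l

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)

def train_concise(wd):
    # Flatten each 1x28x28 image before the linear layer
    net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
    # Initialize the network parameters
    for param in net.parameters():
        param.data.normal_()
    # The labels are class indices, so use cross-entropy here rather than MSE
    loss = nn.CrossEntropyLoss(reduction='none')
    num_epochs, lr = 100, 0.003
    # Build the optimizer: weight decay applies to the weights only, not the bias
    optimer = torch.optim.SGD([
        {"params": net[1].weight, 'weight_decay': wd},
        {"params": net[1].bias}], lr=lr)
    # Training loop
    for epoch in range(num_epochs):
        for x, y in train_iter:
            optimer.zero_grad()
            ls = loss(net(x), y)
            ls.mean().backward()
            optimer.step()
        if (epoch + 1) % 5 == 0:
            print(f'epoch: {epoch + 1}')

    print('L2 norm of w:', net[1].weight.norm().item())

train_concise(0)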

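2. Dropout

In standard dropout regularization, the bias introduced at each layer is removed by normalizing by the fraction of nodes that are retained (not dropped). In other words, each intermediate activation h is replaced by a random variable h′: with dropout probability p it is set to 0, and otherwise it becomes h / (1 - p), so that the expectation is unchanged, E[h′] = h.

We implement a dropout_layer function that drops elements of the input tensor X with probability dropout and rescales the remainder as described above, dividing the surviving elements by 1.0 - dropout.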
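The rescaling rule can be checked without any framework. The sketch below is a plain-Python stand-in for the same idea, not the article's dropout_layer. The function name `dropout`, the seed, and the sample size are arbitrary choices. It drops list entries with probability p, divides the survivors by 1 - p, and compares the mean before and after.

```python
import random

def dropout(values, p, rng):
    """Zero each value with probability p and divide the survivors by (1 - p)."""
    assert 0 <= p <= 1
    if p == 1:
        return [0.0 for _ in values]   # everything is dropped
    if p == 0:
        return list(values)            # everything is kept
    return [v / (1.0 - p) if rng.random() > p else 0.0 for v in values]

rng = random.Random(0)     # fixed seed, chosen arbitrarily
h = [1.0] * 100_000        # constant activations make the mean easy to read
h_drop = dropout(h, p=0.5, rng=rng)

mean_after = sum(h_drop) / len(h_drop)
print(mean_after)          # stays close to the original mean of 1.0
```

About half of the entries become 0 and the rest become 2.0, so the average is preserved up to sampling noise.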

import torch  
import torch.nn as nn 
from d2l import torch as d2l
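x = torch.arange(0, 16, 1).reshape((2, 8))  # small example input for trying out dropout_layer
# Dropout layer
def dropout_layer(X, dropout):
    assert 0 <= dropout <= 1
    # In this case, all elements are dropped
    if dropout == 1:
        return torch.zeros_like(X)
    # In this case, all elements are kept
    if dropout == 0:
        return X
    # Keep each element with probability 1 - dropout, then rescale the survivors
    mask = (torch.rand(X.shape) > dropout).float()
    return mask * X / (1.0 - dropout)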


num_inputs, num_outputs, num_hiddens1, num_hiddens2 = 784, 10, 256, 256

# Define the model
dropout1, dropout2 = 0.2, 0.5
class Net(nn.Module):
    def __init__(self, num_inputs, num_outputs, num_hiddens1, num_hiddens2, is_training = True):
        super(Net, self).__init__()
        self.num_inputs = num_inputs
        self.training = is_training
        self.lin1 = nn.Linear(num_inputs, num_hiddens1)
        self.lin2 = nn.Linear(num_hiddens1, num_hiddens2)
        self.lin3 = nn.Linear(num_hiddens2, num_outputs)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        H1 = self.relu(self.lin1(x.reshape((-1, self.num_inputs))))
        # Apply dropout only while training
        if self.training:
            H1 = dropout_layer(H1, dropout=dropout1)
        H2 = self.relu(self.lin2(H1))
        if self.training:
            H2 = dropout_layer(H2, dropout=dropout2)
        out = self.lin3(H2)
        return out

net = Net(num_inputs, num_outputs, num_hiddens1, num_hiddens2)

num_epochs, lr, batch_size = 10, 0.5, 256
loss = nn.CrossEntropyLoss(reduction='none')
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
trainer = torch.optim.SGD(net.parameters(), lr=lr)

d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)

# Concise implementation
num_inputs, num_outputs, num_hiddens1, num_hiddens2 = 784, 10, 256, 256

# Define the model
dropout1, dropout2 = 0.2, 0.5
net = nn.Sequential(nn.Flatten(),
                    nn.Linear(num_inputs, num_hiddens1),
                    nn.ReLU(),
                    nn.Dropout(dropout1),
                    nn.Linear(num_hiddens1,num_hiddens2),
                    nn.ReLU(),
                    nn.Dropout(dropout2),  # the second dropout layer uses dropout2
                    nn.Linear(num_hiddens2, num_outputs))
def init_weights(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, std=0.01)

net.apply(init_weights)

trainer = torch.optim.SGD(net.parameters(), lr=lr)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)

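V. Forward Propagation, Backward Propagation, and Computational Graphs

When we trained models with minibatch stochastic gradient descent so far, we only implemented the forward pass ourselves. To compute gradients, we simply called the backpropagation function provided by the deep learning framework.

Forward propagation (or forward pass) means computing and storing the results of each layer of the neural network in order, from the input layer to the output layer.

Backpropagation (or backward propagation) is the method for computing the gradients of the neural network's parameters. In short, it traverses the network in reverse order, from the output layer to the input layer, following the chain rule from calculus. The algorithm stores any intermediate variables (partial derivatives) needed to compute the gradients of some parameters.

When training a neural network, forward and backward propagation depend on each other. In forward propagation, we traverse the computational graph in the direction of its dependencies and compute every variable along the way; backpropagation then uses these values while visiting the graph in the opposite order.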
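To make this dependence concrete, here is a hand-written forward and backward pass for a tiny scalar network, o = w2 * relu(w1 * x), with squared-error loss. It is only a sketch: the network, the input values, and the finite-difference check are illustrative assumptions, not something the article specifies.

```python
def forward(x, t, w1, w2):
    """Forward pass: compute and keep every intermediate value."""
    z = w1 * x                 # hidden pre-activation
    h = max(z, 0.0)            # ReLU
    o = w2 * h                 # output
    loss = 0.5 * (o - t) ** 2  # squared error against target t
    return z, h, o, loss

def backward(x, t, w1, w2, z, h, o):
    """Backward pass: chain rule from the loss back toward the input."""
    d_o = o - t                          # dL/do
    d_w2 = d_o * h                       # dL/dw2 reuses h from the forward pass
    d_h = d_o * w2                       # dL/dh
    d_z = d_h * (1.0 if z > 0 else 0.0)  # dL/dz through ReLU, reuses z
    d_w1 = d_z * x                       # dL/dw1
    return d_w1, d_w2

x, t, w1, w2 = 2.0, 1.0, 0.5, -1.5   # illustrative values
z, h, o, loss = forward(x, t, w1, w2)
g_w1, g_w2 = backward(x, t, w1, w2, z, h, o)

# Central finite differences as an independent check on the analytic gradients
eps = 1e-6
num_w1 = (forward(x, t, w1 + eps, w2)[3] - forward(x, t, w1 - eps, w2)[3]) / (2 * eps)
num_w2 = (forward(x, t, w1, w2 + eps)[3] - forward(x, t, w1, w2 - eps)[3]) / (2 * eps)
print(g_w1, num_w1, g_w2, num_w2)
```

The backward pass needs z, h and o from the forward pass, which is why frameworks keep intermediate values in memory until the gradients have been computed.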

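VI. Numerical Stability and Model Initialization

The risk posed by unstable gradients goes beyond numerical representation: unstable gradients also threaten the stability of our optimization algorithms. We may face one of two problems. Either we get exploding gradients, where parameter updates are so large that they destroy the model's stable convergence, or we get vanishing gradients, where parameter updates are so small that the parameters barely move at each step and the model cannot learn.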

# Why the sigmoid function causes vanishing gradients
# %matplotlib inline works directly in IPython/Jupyter: it renders plots inline, so plt.show() can be omitted.
import matplotlib.pyplot as plt
import torch
from d2l import torch as d2l
x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)
y = torch.sigmoid(x)
y.backward(torch.ones_like(x))

fig, ax1 = plt.subplots()

# Plot the sigmoid output
ax1.set_xlabel('x')
ax1.set_ylabel('sigmoid')
ax1.plot(x.detach().numpy(), y.detach().numpy(), color='red', label='sigmoid')
ax1.tick_params(axis='y')

# Create a second y-axis for the gradient
ax2 = ax1.twinx()  
ax2.set_ylabel('gradient')  
ax2.plot(x.detach().numpy(), x.grad.numpy(), label='gradient')
ax2.tick_params(axis='y')

# Add a legend
fig.legend(loc="upper right")
plt.show()
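The sigmoid derivative never exceeds 0.25 (its value at x = 0). As a rough sketch of what that means for depth, the snippet below multiplies this best-case factor over several layers. It ignores the weight matrices entirely, which is a simplifying assumption, and the depths are arbitrary.

```python
import math

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

max_grad = sigmoid_grad(0.0)   # 0.25, the largest value the derivative takes

# Best-case gradient factor after passing through `depth` sigmoid layers
for depth in (1, 5, 10, 20):
    print(depth, max_grad ** depth)

# Far from zero the derivative is already tiny for a single layer
print(sigmoid_grad(8.0))
```

Even in the best case the factor shrinks geometrically with depth, which is one reason ReLU is usually preferred over sigmoid in hidden layers.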

# Exploding gradients

M = torch.normal(0, 1, size=(4,4))
print('A single matrix \n',M)
for i in range(100):
    M = torch.mm(M,torch.normal(0, 1, size=(4, 4)))
print('After multiplying by 100 matrices\n', M)

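VII. A Taxonomy of Learning Problems

Batch Learning

In batch learning, we have access to a set of training features and labels $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, which we use to train a model $f(x)$. We then deploy this model to score new data $(x, y)$ drawn from the same distribution. For example, we might train a cat detector on a large collection of cat and dog pictures. Once trained, it ships as part of a smart cat-door computer vision system that lets only cats in. The system is then installed in customers' homes and is essentially never updated again.

Online Learning

Instead of learning in batches, we can also learn from the data $(x_i, y_i)$ one sample at a time, online. More specifically, we first observe $x_i$, then produce an estimate $f(x_i)$, and only after we have done so do we observe $y_i$. Depending on our decision, we then receive a reward or incur a loss. Many real problems fall into this category. For example, we need to predict tomorrow's stock price so that we can trade on that prediction; at the end of the day we find out whether the prediction was profitable. In other words, online learning runs as a loop in which we keep improving our model as new observations arrive.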
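The loop can be sketched in a few lines of plain Python. Everything here is a made-up toy: a one-parameter linear model, a hypothetical environment whose true slope is 3.0, squared-error loss, and an arbitrary learning rate and seed. It is only meant to show the order of events.

```python
import random

rng = random.Random(0)   # fixed seed, chosen arbitrarily
true_w = 3.0             # hypothetical parameter of the environment
w, lr = 0.0, 0.1         # model parameter and learning rate

losses = []
for i in range(200):
    x_i = rng.uniform(-1.0, 1.0)              # 1. observe x_i
    y_hat = w * x_i                           # 2. commit to the estimate f(x_i)
    y_i = true_w * x_i                        # 3. only now observe y_i
    losses.append(0.5 * (y_hat - y_i) ** 2)   # 4. incur a loss
    w -= lr * (y_hat - y_i) * x_i             # 5. update the model

print(w, losses[-1])
```

Unlike batch learning, the model is never frozen: each prediction is made with the parameters learned from all earlier observations.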

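Control

In many cases the environment remembers what we did. Not necessarily in an adversarial way, but it remembers, and its response depends on what happened before. For example, a coffee boiler controller observes different temperatures depending on whether it heated the boiler previously. PID (proportional-integral-derivative) controller algorithms are a popular choice here. Likewise, a user's behavior on a news site depends on what we showed her previously (for example, she reads most news items only once). Many such algorithms build a model of the environment in which they act, so that their decisions appear less random. Recently, control theory (for example, PID variants) has also been used to tune hyperparameters automatically, to achieve better disentangling and reconstruction quality and to improve the diversity of generated text and the reconstruction quality of generated images (Shao et al., 2020).

Reinforcement Learning

Reinforcement learning is concerned with how to act in an environment so as to maximize the expected reward. Chess, Go, backgammon, and StarCraft are all examples of reinforcement learning applications. Another example is building a controller for an autonomous car, where other cars may react to the autonomous car's driving style in various ways (for example, trying to avoid it, trying to cause an accident, or trying to cooperate with it).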