PyTorch Deep Learning in Practice: Activation Functions and Neural Network Optimization

Original article, last modified 2025-11-18 17:32:56

License: CC 4.0 BY-SA

Tags: #deep-learning #pytorch #neural-networks #activation-functions

First published 2025-11-10 16:12:16

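Activation Functions

Learning Objectives

In this course, you will take a closer look at common activation functions and examine how they affect the optimization properties of neural networks. The goal is to show why choosing a good activation function matters, how to make that choice, and what problems can arise if you choose poorly.

Course Content

1 Activation Functions

Activation functions are a key concept in neural networks. They introduce non-linearity into the network, which allows it to model more complex problems and increases its representational power.

Common activation functions include Sigmoid, Tanh, and ReLU. Sigmoid maps its input to the interval (0, 1), Tanh maps it to (-1, 1), and ReLU returns the input unchanged when it is positive and 0 otherwise.

When choosing an activation function, consider how cheap it is to compute, whether it suffers from vanishing gradients, and what the model requires.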
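As a quick illustration of the output ranges described above, the following minimal sketch evaluates the three functions on a handful of scalar inputs. It deliberately uses only Python's built-in math module rather than PyTorch, and the helper names `sigmoid` and `relu` are defined here purely for illustration.

```python
import math

def sigmoid(x):
    # Logistic function: squashes any real input into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    # Passes positive inputs through unchanged, clamps the rest to 0
    return max(0.0, x)

for x in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.4f}  "
          f"tanh={math.tanh(x):+.4f}  relu={relu(x):.1f}")
```

Sigmoid and Tanh saturate toward the ends of their ranges as the input grows in magnitude, while ReLU keeps growing linearly for positive inputs.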

1.1 Installing the required dependencies

%pip install seaborn ipywidgets

Before starting, import the commonly used libraries and set up a few basic functions:

## Standard libraries
import os
import json
import math
import numpy as np 


import matplotlib.pyplot as plt
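%matplotlib inline
# IPython.display.set_matplotlib_formats is deprecated; use the matplotlib_inline package instead
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg', 'pdf')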
import seaborn as sns
sns.set()


from tqdm.notebook import tqdm

## PyTorch libraries
import torch
import torch_npu
from torch_npu.contrib import transfer_to_npu
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as data
import torch.optim as optim

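Next, define a function that sets the random seed for all libraries that may be used in this course (here, numpy and torch). This makes the training runs reproducible. Note, however, that unlike on the CPU, the same seed may produce different results on different NPU architectures.

In addition, the following code cell defines two paths: DATASET_PATH and CHECKPOINT_PATH. The dataset path is the directory holding the datasets used in this course. The checkpoint path is the directory where trained model weights and other files are stored.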

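# Path to the folder where the datasets are or should be downloaded
DATASET_PATH = "./data"
# Path to the folder where the pretrained models are saved
CHECKPOINT_PATH = "./saved_models/tutorial3"

# Set the random seed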
def set_seed(seed):
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.npu.is_available():
        torch.npu.manual_seed(seed)
        torch.npu.manual_seed_all(seed)
set_seed(42)

device = torch.device("cpu") if not torch.npu.is_available() else torch.device("npu")
print("Using device", device)

Out:
Using device npu
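The effect of seeding can be illustrated without any deep learning library. The sketch below uses Python's built-in random module as a stand-in for the numpy and torch generators seeded by set_seed; the helper `sample` is hypothetical and exists only for this demonstration.

```python
import random

def sample(seed, n=3):
    # Re-seeding the generator restarts the same pseudo-random sequence
    random.seed(seed)
    return [random.random() for _ in range(n)]

first = sample(42)
second = sample(42)
other = sample(43)

print(first == second)  # same seed, same numbers
print(first == other)   # different seed, different numbers
```

The same principle is what makes weight initialization and data shuffling repeatable once the seed is fixed, subject to the hardware caveat mentioned above.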

Download the data and model files:

!wget https://model-community-picture.obs.cn-north-4.myhuaweicloud.com/ascend-zone/notebook_datasets/a3bbe07ce84111efb94afa163edcddae/data.zip --no-check-certificate

!wget https://model-community-picture.obs.cn-north-4.myhuaweicloud.com/ascend-zone/notebook_models/a3ac2aece84111efb94afa163edcddae/saved_models.zip --no-check-certificate

Unzip the downloaded data and model files:

!unzip saved_models.zip
!unzip data.zip

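1.2 Common activation functions

This course implements several common activation functions by hand. Most of them are also available in the torch.nn package, but writing them ourselves gives a better understanding of how they work.

To make the activation functions easier to compare, first define a base class from which all later modules inherit: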

class ActivationFunction(nn.Module):
    
    def __init__(self):
        super().__init__()
        self.name = self.__class__.__name__
        self.config = {"name": self.name}

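Every activation function is an nn.Module, so it can be integrated cleanly into a network. The config dictionary stores the adjustable parameters that some activation functions have.

Next, implement two activation functions that are still widely used across many tasks: sigmoid and tanh. PyTorch provides both as functions (torch.sigmoid, torch.tanh) and as modules (nn.Sigmoid, nn.Tanh). Here they are implemented by hand: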

##############################

class Sigmoid(ActivationFunction):
    
    def forward(self, x):
        return 1 / (1 + torch.exp(-x))

##############################   
    
class Tanh(ActivationFunction):
    
    def forward(self, x):
        x_exp, neg_x_exp = torch.exp(x), torch.exp(-x)
        return (x_exp - neg_x_exp) / (x_exp + neg_x_exp)
    
##############################
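The two formulas above are closely related: tanh(x) equals 2 * sigmoid(2x) - 1. The following sketch checks this numerically on scalars using only the math module; `tanh_from_exp` mirrors the exponential form used in the Tanh module, and both helper names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_exp(x):
    # Same formula as the Tanh module above, for a single float
    e_pos, e_neg = math.exp(x), math.exp(-x)
    return (e_pos - e_neg) / (e_pos + e_neg)

for x in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    via_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0
    print(f"x={x:+.1f}  tanh={tanh_from_exp(x):+.6f}  "
          f"2*sigmoid(2x)-1={via_sigmoid:+.6f}")
```

One caveat of the exponential form is that exp overflows for inputs of very large magnitude, so the hand-written versions are best treated as teaching code; the built-in torch.tanh and torch.sigmoid are the safer choice in practice.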

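Another popular activation function, and one that made training deeper networks possible, is the Rectified Linear Unit (ReLU). Although it is just a simple piecewise linear function, ReLU has one major advantage over sigmoid and tanh: a strong, stable gradient across a large range of input values.

Building on this idea, many ReLU variants have been proposed. This course implements three of them: LeakyReLU, ELU, and Swish.

LeakyReLU replaces the zero output for negative inputs with a small slope, so that gradients can flow in this part of the input as well. Similarly, ELU replaces the negative part with an exponential decay. The third and most recently proposed function is Swish, which came out of a large-scale experiment aimed at finding the "optimal" activation function. Compared with the other activation functions, Swish is both smooth and non-monotonic (that is, its gradient changes sign). This has been shown to prevent dead neurons, as they occur with the standard ReLU activation, especially in deep networks. If you are interested, the paper "Searching for Activation Functions" discusses the benefits of Swish in more detail.

The four activation functions are implemented below: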

##############################

class ReLU(ActivationFunction):
    
    def forward(self, x):
        return x * (x > 0).float()

##############################

class LeakyReLU(ActivationFunction):
    
    def __init__(self, alpha=0.1):
        super().__init__()
        self.config["alpha"] = alpha
        
    def forward(self, x):
        return torch.where(x > 0, x, self.config["alpha"] * x)

##############################
    
class ELU(ActivationFunction):
    
    def forward(self, x):
        return torch.where(x > 0, x, torch.exp(x)-1)

##############################
    
class Swish(ActivationFunction):
    
    def forward(self, x):
        return x * torch.sigmoid(x)
    
##############################
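To make the differences on the negative side concrete, here is a small scalar sketch of the three variants in plain Python. The function names are illustrative, and alpha=0.1 matches the default of the LeakyReLU module above.

```python
import math

def leaky_relu(x, alpha=0.1):
    # Small slope instead of a hard zero for negative inputs
    return x if x > 0 else alpha * x

def elu(x):
    # Exponential decay toward -1 for negative inputs
    return x if x > 0 else math.exp(x) - 1.0

def swish(x):
    # x * sigmoid(x), written as a single fraction
    return x / (1.0 + math.exp(-x))

for x in [-5.0, -1.0, 0.0, 2.0]:
    print(f"x={x:+.1f}  leaky_relu={leaky_relu(x):+.4f}  "
          f"elu={elu(x):+.4f}  swish={swish(x):+.4f}")
```

The Swish column shows the non-monotonic behavior mentioned above: its value at -1 is lower than at both -5 and 0, so the function dips and then rises again.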

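For later use, all activation functions are collected in a dictionary that maps each name to its class object. If you implement a new activation function yourself, add it here so that it is included in later comparisons: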

act_fn_by_name = {
    "sigmoid": Sigmoid,
    "tanh": Tanh,
    "relu": ReLU,
    "leakyrelu": LeakyReLU,
    "elu": ELU,
    "swish": Swish
}

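1.3 Visualizing activation functions

To see what each activation function actually does, this section visualizes them. Besides the activation values themselves, the gradient of the function is important because it is crucial for optimizing the network. PyTorch computes gradients with a simple call to the backward function: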

def get_grads(act_fn, x):
    """
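    Computes the gradients of an activation function at the specified positions.
    
    Inputs:
        act_fn - An object of the class "ActivationFunction" with an implemented forward pass.
        x - 1D input tensor.
    Output:
        A tensor with the same size as x containing the gradients of act_fn at x.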
    """
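    x = x.clone().requires_grad_() # Mark the input as a tensor for which gradients should be stored
    out = act_fn(x)
    out.sum().backward() # Summing the outputs gives every element of `x` an upstream gradient of 1
    return x.grad # Access the gradients of x via `x.grad`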
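A gradient obtained from automatic differentiation can be sanity-checked against a finite-difference estimate. The sketch below does this for the sigmoid in plain Python, assuming a central difference with a hypothetical step size `eps`; it is an independent check, not a replacement for get_grads.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def numeric_grad(fn, x, eps=1e-5):
    # Central finite difference: (f(x + eps) - f(x - eps)) / (2 * eps)
    return (fn(x + eps) - fn(x - eps)) / (2.0 * eps)

def sigmoid_grad(x):
    # Analytic derivative: sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-2.0, 0.0, 2.0]:
    print(f"x={x:+.1f}  numeric={numeric_grad(sigmoid, x):.6f}  "
          f"analytic={sigmoid_grad(x):.6f}")
```

The analytic derivative peaks at x = 0 with a value of 0.25, which is the upper bound on the sigmoid gradient referred to later in this course.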

All activation functions and their gradients can now be visualized:

def vis_act_fn(act_fn, ax, x):
    # Run the activation function
    y = act_fn(x)
    y_grads = get_grads(act_fn, x)
    # Move x, y, and the gradients back to the CPU for plotting
    x, y, y_grads = x.cpu().numpy(), y.cpu().numpy(), y_grads.cpu().numpy()
    ## Plotting
    ax.plot(x, y, linewidth=2, label="ActFn")
    ax.plot(x, y_grads, linewidth=2, label="Gradient")
    ax.set_title(act_fn.name)
    ax.legend()
    ax.set_ylim(-1.5, x.max())

# Add more activation functions here if needed.
act_fns = [act_fn() for act_fn in act_fn_by_name.values()]
x = torch.linspace(-5, 5, 1000) # Range over which to visualize the activation functions

rows = math.ceil(len(act_fns)/2.0)
fig, ax = plt.subplots(rows, 2, figsize=(8, rows*4))
for i, act_fn in enumerate(act_fns):
    vis_act_fn(act_fn, ax[divmod(i,2)], x)
fig.subplots_adjust(hspace=0.3)
plt.show()
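The plotting loop relies on divmod(i, 2) to turn a flat index into a (row, column) position in the two-column subplot grid. A minimal sketch of that mapping, using the six names from act_fn_by_name and no plotting at all:

```python
import math

names = ["sigmoid", "tanh", "relu", "leakyrelu", "elu", "swish"]
rows = math.ceil(len(names) / 2.0)  # two plots per row

for i, name in enumerate(names):
    row, col = divmod(i, 2)  # quotient is the row, remainder is the column
    print(f"{name:>9} -> ax[{row}, {col}]")
```

Because math.ceil rounds up, an odd number of activation functions would still get enough rows, with one subplot left empty.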

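1.4 Analyzing the effect of activation functions

With the activation functions implemented and visualized, the next step is to understand their effect. To do so, this course uses a simple neural network trained on the FashionMNIST dataset and examines several aspects of the model, including its performance and gradient flow.
First, set up the neural network. The chosen network treats each image as a 1D tensor and passes it through a sequence of linear layers, each followed by the specified activation function.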

class BaseNetwork(nn.Module):
    
    def __init__(self, act_fn, input_size=784, num_classes=10, hidden_sizes=[512, 256, 256, 128]):
        """
        Inputs:
            act_fn - Object of the activation function that should be used as non-linearity in the network.
            input_size - Size of the input images in pixels
            num_classes - Number of classes we want to predict
            hidden_sizes - A list of integers specifying the hidden layer sizes in the NN
        """
        super().__init__()
        
        # Create the network based on the specified hidden sizes
        layers = []
        layer_sizes = [input_size] + hidden_sizes
        for layer_index in range(1, len(layer_sizes)):
            layers += [nn.Linear(layer_sizes[layer_index-1], layer_sizes[layer_index]),
                       act_fn]
        layers += [nn.Linear(layer_sizes[-1], num_classes)]
        self.layers = nn.Sequential(*layers) # `nn.Sequential` combines a list of modules into a single module and applies them in order
        
        # Store all hyperparameters in a dictionary for saving and loading the model
        self.config = {"act_fn": act_fn.config, "input_size": input_size, "num_classes": num_classes, "hidden_sizes": hidden_sizes} 
        
    def forward(self, x):
        x = x.view(x.size(0), -1) # Reshape the images into flat vectors
        out = self.layers(x)
        return out
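To get a feel for the size of this network, the layer shapes and parameter counts can be worked out by hand from the default arguments. The sketch below does the arithmetic in plain Python; it assumes every linear layer has a bias term (the nn.Linear default) and that the activation functions contribute no learnable parameters.

```python
input_size, num_classes = 784, 10
hidden_sizes = [512, 256, 256, 128]

# Consecutive (in_features, out_features) pairs, including the output layer
layer_sizes = [input_size] + hidden_sizes + [num_classes]
shapes = list(zip(layer_sizes[:-1], layer_sizes[1:]))

# Each linear layer has in*out weights plus out biases
param_counts = [n_in * n_out + n_out for n_in, n_out in shapes]
total_params = sum(param_counts)

for (n_in, n_out), count in zip(shapes, param_counts):
    print(f"Linear({n_in}, {n_out}): {count} parameters")
print("Total:", total_params)
```

The first layer dominates the total of 633,226 parameters, since it maps all 784 input pixels to 512 hidden units.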

Next, add functions for loading and saving the model. The hyperparameters are stored in a configuration file (a simple JSON file):

def _get_config_file(model_path, model_name):
    # Name of the file that stores the hyperparameter details
    return os.path.join(model_path, model_name + ".config")

def _get_model_file(model_path, model_name):
    # Name of the file that stores the network parameters
    return os.path.join(model_path, model_name + ".tar")

def load_model(model_path, model_name, net=None):
    """
    Loads a saved model from disk.
    
    Inputs:
        model_path - Path of the checkpoint directory
        model_name - Name of the model (str)
        net - (Optional) If given, the state dict is loaded into this model. Otherwise, a new model is created.
    """
    config_file, model_file = _get_config_file(model_path, model_name), _get_model_file(model_path, model_name)
    assert os.path.isfile(config_file), f"Could not find the config file \"{config_file}\". Are you sure this is the correct path and you have your model config stored here?"
    assert os.path.isfile(model_file), f"Could not find the model file \"{model_file}\". Are you sure this is the correct path and you have your model stored here?"
    with open(config_file, "r") as f:
        config_dict = json.load(f)
    if net is None:
        act_fn_name = config_dict["act_fn"].pop("name").lower()
        act_fn = act_fn_by_name[act_fn_name](**config_dict.pop("act_fn"))
        net = BaseNetwork(act_fn=act_fn, **config_dict)
    net.load_state_dict(torch.load(model_file, map_location=device))
    return net
    
def save_model(model, model_path, model_name):
    """
    Given a model, we save the state_dict and hyperparameters.
    
    Inputs:
        model - Network object to save parameters from
        model_path - Path of the checkpoint directory
        model_name - Name of the model (str)
    """
    config_dict = model.config
    os.makedirs(model_path, exist_ok=True)
    config_file, model_file = _get_config_file(model_path, model_name), _get_model_file(model_path, model_name)
    with open(config_file, "w") as f:
        json.dump(config_dict, f)
    torch.save(model.state_dict(), model_file)
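The way load_model rebuilds the activation function from the JSON config can be traced without PyTorch. The sketch below uses hypothetical stand-in classes (`FakeReLU`, `FakeLeakyReLU`) and a hypothetical `registry` dictionary in place of act_fn_by_name; only the JSON round trip and the name lookup follow the functions above.

```python
import json
import os
import tempfile

# Hypothetical stand-ins for the activation classes; they only record their settings
class FakeReLU:
    def __init__(self):
        self.config = {"name": "FakeReLU"}

class FakeLeakyReLU:
    def __init__(self, alpha=0.1):
        self.config = {"name": "FakeLeakyReLU", "alpha": alpha}

registry = {"fakerelu": FakeReLU, "fakeleakyrelu": FakeLeakyReLU}

config = {"act_fn": FakeLeakyReLU(alpha=0.2).config, "input_size": 784,
          "num_classes": 10, "hidden_sizes": [512, 256, 256, 128]}

# Write the config to a temporary file and read it back, as save_model/load_model do
with tempfile.TemporaryDirectory() as tmp_dir:
    config_file = os.path.join(tmp_dir, "demo.config")
    with open(config_file, "w") as f:
        json.dump(config, f)
    with open(config_file, "r") as f:
        loaded = json.load(f)

# Pop the activation settings, look up the class by its lowercased name,
# and pass the remaining entries (here: alpha) as keyword arguments
act_fn_kwargs = loaded.pop("act_fn")
act_fn_name = act_fn_kwargs.pop("name").lower()
restored = registry[act_fn_name](**act_fn_kwargs)
print(type(restored).__name__, restored.config, loaded)
```

After the pops, `loaded` holds only the network hyperparameters, which is why load_model can pass the rest of the dictionary straight to the BaseNetwork constructor.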

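The dataset used for training is FashionMNIST. FashionMNIST is a more complex version of MNIST that contains grayscale images of clothing instead of digits. Its 10 classes include trousers, coats, shoes, bags, and more. To load the dataset, this course uses another PyTorch package, torchvision, which provides popular datasets, model architectures, and common image transformations for computer vision.

Load the dataset and visualize a few images to get an intuition for the data: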

import torchvision
from torchvision.datasets import FashionMNIST
from torchvision import transforms

# Transformations applied to each image => first convert it to a tensor, then normalize it to the range -1 to 1
transform = transforms.Compose([transforms.ToTensor(), 
                                transforms.Normalize((0.5,), (0.5,))])

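# Load the training dataset. It needs to be split into a training part and a validation part.
# Read from the local copy (download=False) instead of downloading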
train_dataset = FashionMNIST(root=DATASET_PATH, train=True, transform=transform, download=False)
train_set, val_set = torch.utils.data.random_split(train_dataset, [50000, 10000])

# Load the test set
test_set = FashionMNIST(root=DATASET_PATH, train=False, transform=transform, download=False)

# Define a set of data loaders for later use.
# Note that the actual training uses a different data loader with a smaller batch size.
train_loader = data.DataLoader(train_set, batch_size=1024, shuffle=True, drop_last=False)
val_loader = data.DataLoader(val_set, batch_size=1024, shuffle=False, drop_last=False)
test_loader = data.DataLoader(test_set, batch_size=1024, shuffle=False, drop_last=False)

exmp_imgs = [train_set[i][0] for i in range(16)]
# Arrange the images in a grid for better visualization
img_grid = torchvision.utils.make_grid(torch.stack(exmp_imgs, dim=0), nrow=4, normalize=True, pad_value=0.5)
img_grid = img_grid.permute(1, 2, 0)

plt.figure(figsize=(8,8))
plt.title("FashionMNIST examples")
plt.imshow(img_grid)
plt.axis('off')
plt.show()
plt.close()
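The normalization step in the transform above is simple arithmetic: ToTensor scales pixel values to the range 0 to 1, and Normalize then computes (value - mean) / std per channel. A plain-Python sketch of that second step, with the illustrative helper `normalize` standing in for transforms.Normalize((0.5,), (0.5,)):

```python
def normalize(pixel, mean=0.5, std=0.5):
    # Per-pixel operation performed by Normalize: (value - mean) / std
    return (pixel - mean) / std

for pixel in [0.0, 0.25, 0.5, 1.0]:
    print(f"{pixel:.2f} -> {normalize(pixel):+.2f}")
```

With a mean and standard deviation of 0.5, the endpoints 0 and 1 map to -1 and 1, so the network sees inputs centered around zero.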

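1.5 Visualizing the gradient flow after initialization

An important aspect of activation functions is how they propagate gradients through the network. Imagine a very deep neural network with more than 50 layers. The gradients for the input layer (the first layer) have passed through the activation function more than 50 times, yet we still want them to be of a reasonable size. If the gradient through the activation function is (on average) considerably smaller than 1, the gradients vanish before they reach the input layer. If it is larger than 1, the gradients grow exponentially and may explode.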
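The compounding effect described above can be made concrete with a deliberately simplified model: assume the gradient is multiplied by the same constant factor at every layer and ignore the weight matrices entirely. The helper `scaled_gradient` and the factors 0.25 and 1.5 are illustrative choices, with 0.25 being the largest gradient the sigmoid can provide.

```python
def scaled_gradient(factor, depth, start=1.0):
    # Multiply the gradient by the same factor once per layer
    return start * factor ** depth

for depth in [5, 10, 50]:
    print(f"depth={depth:2d}  "
          f"factor 0.25 -> {scaled_gradient(0.25, depth):.3e}  "
          f"factor 1.5 -> {scaled_gradient(1.5, depth):.3e}")
```

Even in the best case for the sigmoid, five layers already shrink the gradient to below 0.001, while a factor only moderately above 1 grows by many orders of magnitude over 50 layers.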

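To see how each activation function affects the gradients, look at a freshly initialized network and measure the gradients of each parameter for a batch of images: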

def visualize_gradients(net, color="C0"):
    """
    Inputs:
        net - Object of class BaseNetwork
        color - Color in which we want to visualize the histogram (for easier separation of activation functions)
    """
    net.eval()
    small_loader = data.DataLoader(train_set, batch_size=256, shuffle=False)
    imgs, labels = next(iter(small_loader))
    imgs, labels = imgs.to(device), labels.to(device)
    
    # Pass one batch through the network and compute the gradients for the weights
    net.zero_grad()
    preds = net(imgs)
    loss = F.cross_entropy(preds, labels)
    loss.backward()
    # To reduce the number of plots, the visualization is limited to the weight parameters and excludes the biases
    grads = {name: params.grad.data.view(-1).cpu().clone().numpy() for name, params in net.named_parameters() if "weight" in name}
    net.zero_grad()
    
   
    columns = len(grads)
    fig, ax = plt.subplots(1, columns, figsize=(columns*3.5, 2.5))
    fig_index = 0
    for key in grads:
        key_ax = ax[fig_index%columns]
        sns.histplot(data=grads[key], bins=30, ax=key_ax, color=color, kde=True)
        key_ax.set_title(str(key))
        key_ax.set_xlabel("Grad magnitude")
        fig_index += 1
    fig.suptitle(f"Gradient magnitude distribution for activation function {net.config['act_fn']['name']}", fontsize=14, y=1.05)
    fig.subplots_adjust(wspace=0.45)
    plt.show()
    plt.close()

# Seaborn prints warnings if the histogram contains small values. These can be ignored for now.
import warnings
warnings.filterwarnings('ignore')
## Create a plot for every activation function
for i, act_fn_name in enumerate(act_fn_by_name):
    set_seed(42) # Setting the seed ensures the same weight initialization for each activation function
    act_fn = act_fn_by_name[act_fn_name]()
    net_actfn = BaseNetwork(act_fn=act_fn).to(device)
    visualize_gradients(net_actfn, color=f"C{i}")

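The sigmoid activation function shows clearly undesirable behavior. While the gradients of the output layer are large, up to nearly 0.1, the input layer has the lowest gradient norm of all the activation functions shown, at only about 1e-5. This is because the sigmoid's own gradient is small, at most 0.25, so the gradient shrinks at every layer it passes through. In this setup, it is not possible to find a learning rate that suits all layers. All other activation functions show similar gradient norms across layers. The ReLU activation has a spike around 0, which is caused by the zero part of its gradient on the left and by dead neurons; a later section looks at this in more detail.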

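1.6 Training the model

Next, train the model with different activation functions on the FashionMNIST dataset and compare the resulting performance. Overall, the final goal is to achieve the best possible performance on the chosen dataset. The next code cell therefore contains a training loop that includes validation after every epoch and a final test of the best model: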

def train_model(net, model_name, max_epochs=50, patience=7, batch_size=256, overwrite=False):
    """
    Train a model on the training set of FashionMNIST
    
    Inputs:
        net - Object of BaseNetwork
        model_name - (str) Name of the model, used for creating the checkpoint names
        max_epochs - Number of epochs we want to (maximally) train for
        patience - If the performance on the validation set has not improved for #patience epochs, we stop training early
        batch_size - Size of batches used in training
        overwrite - Determines how to handle the case when there already exists a checkpoint. If True, it will be overwritten. Otherwise, we skip training.
    """
    file_exists = os.path.isfile(_get_model_file(CHECKPOINT_PATH, model_name))
    if file_exists and not overwrite:
        print("Model file already exists. Skipping training...")
    else:
        if file_exists:
            print("Model file exists, but will be overwritten...")
            
        # Define the optimizer, loss function, and data loader
        optimizer = optim.SGD(net.parameters(), lr=1e-2, momentum=0.9) # Default parameters, feel free to change
        loss_module = nn.CrossEntropyLoss() 
        train_loader_local = data.DataLoader(train_set, batch_size=batch_size, shuffle=True, drop_last=True, pin_memory=True)

        val_scores = []
        best_val_epoch = -1
        for epoch in range(max_epochs):
            ############
            # Training #
            ############
            net.train()
            true_preds, count = 0., 0
            for imgs, labels in tqdm(train_loader_local, desc=f"Epoch {epoch+1}", leave=False):
                imgs, labels = imgs.to(device), labels.to(device) 
                optimizer.zero_grad() # `zero_grad` can be called anywhere before `loss.backward()`
                preds = net(imgs)
                loss = loss_module(preds, labels)
                loss.backward()
                optimizer.step()
                # Record statistics during training
                true_preds += (preds.argmax(dim=-1) == labels).sum()
                count += labels.shape[0]
            train_acc = true_preds / count

            ##############
            # Validation #
            ##############
            val_acc = test_model(net, val_loader)
            val_scores.append(val_acc)
            print(f"[Epoch {epoch+1:2d}] Training accuracy: {train_acc*100.0:05.2f}%, Validation accuracy: {val_acc*100.0:05.2f}%")

            if len(val_scores) == 1 or val_acc > val_scores[best_val_epoch]:
                print("\t   (New best performance, saving model...)")
                save_model(net, CHECKPOINT_PATH, model_name)
                best_val_epoch = epoch
            elif best_val_epoch <= epoch - patience:
                print(f"Early stopping due to no improvement over the last {patience} epochs")
                break

        # Plot the validation accuracy curve
        plt.plot([i for i in range(1,len(val_scores)+1)], val_scores)
        plt.xlabel("Epochs")
        plt.ylabel("Validation accuracy")
        plt.title(f"Validation performance of {model_name}")
        plt.show()
        plt.close()
    
    load_model(CHECKPOINT_PATH, model_name, net=net)
    test_acc = test_model(net, test_loader)
    print((f" Test accuracy: {test_acc*100.0:4.2f}% ").center(50, "=")+"\n")
    return test_acc
    

def test_model(net, data_loader):
    """
    Test a model on a specified dataset.
    
    Inputs:
        net - Trained model of type BaseNetwork
        data_loader - DataLoader object of the dataset to test on (validation or test)
    """
    net.eval()
    true_preds, count = 0., 0
    for imgs, labels in data_loader:
        imgs, labels = imgs.to(device), labels.to(device)
        with torch.no_grad():
            preds = net(imgs).argmax(dim=-1)
            true_preds += (preds == labels).sum().item()
            count += labels.shape[0]
    test_acc = true_preds / count
    return test_acc
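The checkpointing and early-stopping decisions in train_model can be replayed on a plain list of validation scores. The sketch below copies only that control flow; the helper name `early_stopping_trace` is hypothetical, and the scores are made up purely for illustration and are not results from this experiment.

```python
def early_stopping_trace(val_scores, patience):
    # Replays the checkpoint/early-stop decisions of train_model on given scores
    best_epoch, seen = -1, []
    for epoch, score in enumerate(val_scores):
        seen.append(score)
        if len(seen) == 1 or score > seen[best_epoch]:
            best_epoch = epoch          # new best: a checkpoint would be saved here
        elif best_epoch <= epoch - patience:
            return best_epoch, epoch    # no improvement for `patience` epochs
    return best_epoch, len(val_scores) - 1

# Hypothetical validation accuracies, made up purely for illustration
scores = [0.70, 0.80, 0.79, 0.78, 0.80, 0.77]
best_epoch, last_epoch = early_stopping_trace(scores, patience=3)
print(f"best epoch index: {best_epoch}, stopped after epoch index: {last_epoch}")
```

Note that the comparison is strict: the score at index 4 merely ties the best one, so it does not count as an improvement and training stops there when patience is 3.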

Train one model for each activation function. If you are running this notebook on a CPU, it is recommended to use the pretrained models to save time.

for act_fn_name in act_fn_by_name:
    print(f"Training BaseNetwork with {act_fn_name} activation...")
    set_seed(42)
    act_fn = act_fn_by_name[act_fn_name]()
    net_actfn = BaseNetwork(act_fn=act_fn).to(device)
    train_model(net_actfn, f"FashionMNIST_{act_fn_name}", overwrite=False)

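As the results show, the model with the sigmoid activation function performs poorly: it does no better than random guessing (with 10 classes, a random guess is correct with probability 1/10).

All other activation functions perform similarly. To draw more reliable conclusions, the models would need to be trained with multiple random seeds and the results averaged.

However, the "optimal" activation function also depends on many other factors (hidden sizes, number of layers, layer types, task, dataset, optimizer, learning rate, and so on), so a thorough grid search would not be useful in this setting. In the literature, the activation functions that have been shown to work well for deep networks are the ReLU-type functions used in this experiment, with small gains for particular activation functions in particular networks.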

1.7 Visualizing the activation distribution

After training the models, we can look at the actual activation values inside each model.

def visualize_activations(net, color="C0"):
    activations = {}
    
    net.eval()
    small_loader = data.DataLoader(train_set, batch_size=1024)
    imgs, labels = next(iter(small_loader))
    with torch.no_grad():
        layer_index = 0
        imgs = imgs.to(device)
        imgs = imgs.view(imgs.size(0), -1)
        # The layers have to be iterated manually so that all activations can be saved
        for layer_index, layer in enumerate(net.layers[:-1]):
            imgs = layer(imgs)
            activations[layer_index] = imgs.view(-1).cpu().numpy()
    
    
    columns = 4
    rows = math.ceil(len(activations)/columns)
    fig, ax = plt.subplots(rows, columns, figsize=(columns*2.7, rows*2.5))
    fig_index = 0
    for key in activations:
        key_ax = ax[fig_index//columns][fig_index%columns]
        sns.histplot(data=activations[key], bins=50, ax=key_ax, color=color, kde=True, stat="density")
        key_ax.set_title(f"Layer {key} - {net.layers[key].__class__.__name__}")
        fig_index += 1
    fig.suptitle(f"Activation distribution for activation function {net.config['act_fn']['name']}", fontsize=14)
    fig.subplots_adjust(hspace=0.4, wspace=0.4)
    plt.show()
    plt.close()

for i, act_fn_name in enumerate(act_fn_by_name):
    net_actfn = load_model(model_path=CHECKPOINT_PATH, model_name=f"FashionMNIST_{act_fn_name}").to(device)
    visualize_activations(net_actfn, color=f"C{i}")

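Because the model with the sigmoid activation trained so poorly, its activations carry little information and are all clustered around 0.5.

The tanh activation shows more diverse behavior. In the input layer, a large number of neurons have activations close to -1 and 1, where the gradient is close to zero, while in the two following layers the activations lie closer to zero. This is probably because the input layer looks for specific features in the input image, which the following layers then combine. The activations of the last layer are again biased toward the extreme points, because the classification layer can be seen as a weighted average of those values (the gradients push the activations toward those extremes).

As initially expected, ReLU has a strong peak at 0. Having no gradient for negative values means that the network does not show a Gaussian-like distribution after the linear layers, but a longer tail toward positive values. LeakyReLU shows very similar behavior, while ELU follows a more Gaussian-like distribution. The Swish activation appears to lie in between, although it is worth noting that Swish reaches considerably higher values than the other activation functions, up to nearly 20.

Although all activation functions perform similarly in this simple network, they each behave slightly differently. Clearly, the choice of the "optimal" activation function depends on many factors and is not the same for every possible network.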

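1.8 Finding dead neurons in ReLU networks

A known drawback of the ReLU activation is the occurrence of "dead neurons", i.e. neurons whose gradient is zero for every training input. The problem with dead neurons is that, since no gradient passes through them, the parameters of the preceding layer that feed this neuron cannot be trained to make it output non-zero values. For a neuron to be dead, the output of the corresponding neuron in the linear layer before the ReLU must be negative for all input images. Given the large number of neurons in a neural network, this is not unlikely to happen.

To get a better sense of how big this problem is and when to pay particular attention to it, measure how many dead neurons different networks have. To do so, implement a function that runs the network over the whole training set and records whether each neuron outputs exactly zero for all data points: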

def measure_number_dead_neurons(net):

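    # For each neuron, create a boolean variable initialized to 1 (True). If the neuron's activation is non-zero
    # at any point, set this variable to 0 (False). After iterating over the whole training set, only dead neurons still have the value 1.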
    neurons_dead = [
        torch.ones(layer.weight.shape[0], device=device, dtype=torch.bool) for layer in net.layers[:-1] if isinstance(layer, nn.Linear)
    ] # Same shapes as the hidden sizes in `BaseNetwork`

    net.eval()
    with torch.no_grad():
        for imgs, labels in tqdm(train_loader, leave=False): # Iterate over the whole training set
            layer_index = 0
            imgs = imgs.to(device)
            imgs = imgs.view(imgs.size(0), -1)
            for layer in net.layers[:-1]:
                imgs = layer(imgs)
                if isinstance(layer, ActivationFunction):
                    # Are all activations zero in this batch, and did no earlier batch record a non-zero activation?
                    neurons_dead[layer_index] = torch.logical_and(neurons_dead[layer_index], (imgs == 0).all(dim=0))
                    layer_index += 1
    number_neurons_dead = [t.sum().item() for t in neurons_dead]
    print("Number of dead neurons:", number_neurons_dead)
    print("In percentage:", ", ".join([f"{(100.0 * num_dead / tens.shape[0]):4.2f}%" for tens, num_dead in zip(neurons_dead, number_neurons_dead)]))
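The bookkeeping in measure_number_dead_neurons can be followed on a toy example small enough to check by hand. The sketch below uses plain Python lists with made-up pre-activation values for a single layer of three neurons; it mirrors the "assume dead, then disprove" logic but none of the tensor operations.

```python
def relu(x):
    return max(0.0, x)

# Hypothetical pre-activations: 2 batches, each with 2 samples and 3 neurons
batches = [
    [[-1.0, 0.5, -0.2], [-0.3, -0.1, -0.4]],
    [[-2.0, -0.7, -0.9], [-0.5, 1.2, -0.1]],
]

neurons_dead = [True, True, True]  # assume dead until a non-zero output is seen
for batch in batches:
    activations = [[relu(v) for v in sample] for sample in batch]
    for j in range(len(neurons_dead)):
        # A neuron stays marked as dead only if it is zero for every sample so far
        all_zero = all(sample[j] == 0 for sample in activations)
        neurons_dead[j] = neurons_dead[j] and all_zero

print("Dead neurons:", neurons_dead)
print(f"In percentage: {100.0 * sum(neurons_dead) / len(neurons_dead):.2f}%")
```

Here the first and third neurons never receive a positive input and stay marked as dead, while the second one fires for at least one sample and is cleared.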

First, measure the number of dead neurons in an untrained network:

set_seed(42)
net_relu = BaseNetwork(act_fn=ReLU()).to(device)
measure_number_dead_neurons(net_relu)

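Only a small number of neurons are dead, and the number of dead neurons increases with layer depth. With so few dead neurons, however, this is not a problem, because updates to the weights of earlier layers change the inputs to later layers. Dead neurons in later layers can therefore "come back to life" and become active again.

What does this look like for a trained network (with the same initialization)?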

net_relu = load_model(model_path=CHECKPOINT_PATH, model_name="FashionMNIST_relu").to(device)
measure_number_dead_neurons(net_relu)

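The number of dead neurons in the later layers has in fact decreased. Note, however, that dead neurons are especially problematic in the input layer. Since the input does not change across epochs (the training set stays the same), training the network cannot bring those neurons back to life. Still, the input data usually has a sufficiently high standard deviation, which lowers the risk of dead neurons.

Finally, look at how the number of dead neurons changes with increasing layer depth. For example, consider the following 10-layer neural network: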

set_seed(42)
net_relu = BaseNetwork(act_fn=ReLU(), hidden_sizes=[256, 256, 256, 256, 256, 128, 128, 128, 128, 128]).to(device)
measure_number_dead_neurons(net_relu)

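The number of dead neurons is noticeably higher than before, which harms the gradient flow, especially in the first iterations.

1.9 Conclusion

This course reviewed a set of six activation functions for neural networks (Sigmoid, Tanh, ReLU, LeakyReLU, ELU, and Swish) and discussed how they influence the gradient distribution across layers. Sigmoid tends to perform poorly in deep neural networks because the largest gradient it can provide is 0.25, which leads to vanishing gradients in the early layers. All ReLU-based activation functions perform well and, apart from the original ReLU, do not suffer from dead neurons. When implementing your own neural network, it is recommended to start with a ReLU-based network and to choose the specific activation function based on the properties of the network.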