Batch Normalization and Layer Normalization Explained

周常见

Last modified: 2025-02-26 14:06:38

Tags: pytorch

First published: 2025-02-26 14:03:26

Article link: https://blog.youkuaiyun.com/weixin_63170160/article/details/145869033

Contents

Batch Normalization
Layer Normalization
- How Layer Normalization works
- Experimenting with Layer Normalization in PyTorch

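Reference links:
Batch Normalization explained, with PyTorch experiments
Layer Normalization explained
BatchNorm and LayerNorm: an easy-to-understand explanation
LayerNorm in neural networks explained
LayerNorm (layer normalization)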

Batch Normalization

How Batch Normalization works

“For an input x with d dimensions, we normalize each of its dimensions separately.”

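For example, with an input of shape [B,C,H,W], Batch Normalization needs a mean and a variance for each dimension it normalizes. Following the quote above, that dimension is the channel: for each of the C channels, BN takes the slice of shape [B,1,H,W] and computes one mean and one variance over all B*H*W values in it, so every sample in the batch contributes to the statistics. (See the reference links for details.)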
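
The per-channel statistics described above can be sketched in a few lines. This is a minimal illustration, assuming a toy input of shape [2,2,2,2]; the variable names are made up for the example and do not come from the referenced articles.

```python
import torch

# Hypothetical toy input: batch B=2, channels C=2, height=width=2
x = torch.arange(16, dtype=torch.float32).view(2, 2, 2, 2)

# Reduce over batch and spatial dims (0, 2, 3): one statistic per channel remains
channel_mean = x.mean(dim=(0, 2, 3))
channel_var = x.var(dim=(0, 2, 3), unbiased=False)  # population variance

# Broadcast the per-channel statistics back over [B, C, H, W]
eps = 1e-5
x_hat = (x - channel_mean[None, :, None, None]) / torch.sqrt(
    channel_var[None, :, None, None] + eps
)

print(channel_mean)  # tensor([5.5000, 9.5000])
print(channel_var)   # tensor([17.2500, 17.2500])
```

Each channel ends up with its own mean and variance, shared by all samples in the batch.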

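This is only how the statistics are computed during training. At inference time the batch statistics are unreliable: in the extreme case of a batch size of 1, the mean and variance would come from a single sample and say nothing about the data distribution. Standardization is meant to give each channel zero mean and unit variance across the batch (it does not turn the data into a normal distribution). PyTorch therefore keeps two buffers, running_mean and running_var, which are accumulated during training and used in place of the batch statistics at inference. Roughly, on every training step the update is running_mean = 0.9 * running_mean + 0.1 * batch_mean, and likewise for the variance (with the default momentum of 0.1); see the reference links for details.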
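
As a rough check of the update rule, the sketch below runs one training-mode forward pass through nn.BatchNorm2d and compares its buffers with a hand-computed update. The toy input and variable names are assumptions made for illustration. Note that the running variance is updated with the unbiased (sample) batch variance.

```python
import torch
import torch.nn as nn

x = torch.arange(16, dtype=torch.float32).view(2, 2, 2, 2)
bn = nn.BatchNorm2d(2, eps=1e-5, momentum=0.1)  # momentum=0.1 is the PyTorch default

bn.train()
_ = bn(x)  # one training-mode forward pass updates the running statistics

# Hand-computed update, starting from running_mean=0 and running_var=1
batch_mean = x.mean(dim=(0, 2, 3))
batch_var_unbiased = x.var(dim=(0, 2, 3), unbiased=True)
expected_mean = 0.9 * torch.zeros(2) + 0.1 * batch_mean
expected_var = 0.9 * torch.ones(2) + 0.1 * batch_var_unbiased

mean_matches = torch.allclose(bn.running_mean, expected_mean)
var_matches = torch.allclose(bn.running_var, expected_var)
print(mean_matches, var_matches)  # True True

# In eval mode a single sample is normalized with the running statistics
bn.eval()
single = x[:1]
out = bn(single)
manual = (single - bn.running_mean[None, :, None, None]) / torch.sqrt(
    bn.running_var[None, :, None, None] + bn.eps
)
eval_matches = torch.allclose(out, manual, atol=1e-6)
print(eval_matches)  # True
```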

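The linked article also mentions a scale parameter γ and a shift parameter β. Why are these two parameters needed when the data has already been standardized?

Here is the answer I got when I asked ChatGPT:

After BN, the features have mean 0 and variance 1, but this distribution is not necessarily optimal and may lose information.
γ and β let the network learn a suitable mean and variance, so that the model's expressive power is not reduced.
Both parameters are learnable during training, so the model can end up with the feature distribution that best suits the task at hand.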
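
In PyTorch, γ and β are exposed as the weight and bias of the BN layer. The sketch below sets them by hand to 2 and 3 (arbitrary values picked for illustration) to show that the output then has roughly that standard deviation and mean per channel.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(2)
# gamma is bn.weight (initialized to 1), beta is bn.bias (initialized to 0);
# both are learnable parameters
print(bn.weight.requires_grad, bn.bias.requires_grad)  # True True

x = torch.arange(16, dtype=torch.float32).view(2, 2, 2, 2)
with torch.no_grad():
    bn.weight.fill_(2.0)  # illustrative gamma
    bn.bias.fill_(3.0)    # illustrative beta

out = bn(x)  # training mode: batch statistics are used
out_mean = out.mean(dim=(0, 2, 3))
out_std = out.std(dim=(0, 2, 3), unbiased=False)
print(out_mean)  # close to 3 for every channel
print(out_std)   # close to 2 for every channel
```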

Experimenting with Batch Normalization in PyTorch

The code below is adapted from the reference links.

import numpy as np
import torch.nn as nn
import torch


def bn_process(feature, mean, var):
    # feature is a NumPy array; it is normalized in place, one channel at a time
    feature_shape = feature.shape
    for i in range(feature_shape[1]):
        # [batch, channel, height, width]
        feature_t = feature[:, i, :, :]
        mean_t = feature_t.mean()
        # population standard deviation (divides by N)
        std_t1 = feature_t.std()
        # sample standard deviation (divides by N - 1)
        std_t2 = feature_t.std(ddof=1)

        # bn process
        # add eps here to stay consistent with PyTorch
        feature[:, i, :, :] = (feature[:, i, :, :] - mean_t) / np.sqrt(std_t1 ** 2 + 1e-5)
        # update the running mean and var
        mean[i] = mean[i] * 0.9 + mean_t * 0.1
        var[i] = var[i] * 0.9 + (std_t2 ** 2) * 0.1
    print(f'bn_process result: {feature}')


# Create a feature tensor with batch=2, channel=2, height=width=2
# [batch, channel, height, width]
# feature1 = torch.randn(2, 2, 2, 2)
feature1 = torch.arange(16).to(dtype=torch.float32).view(2, 2, 2, 2)
# initialize the running mean and variance
calculate_mean = [0.0, 0.0]
calculate_var = [1.0, 1.0]
print(f'original feature1: {feature1.numpy()}')

# original feature1: [[[[ 0.  1.]
#    [ 2.  3.]]
# 
#   [[ 4.  5.]
#    [ 6.  7.]]]
# 
# 
#  [[[ 8.  9.]
#    [10. 11.]]
# 
#   [[12. 13.]
#    [14. 15.]]]]

# numpy() shares memory with the tensor, so pass a copy() to leave feature1 unmodified
bn_process(feature1.numpy().copy(), calculate_mean, calculate_var)

bn = nn.BatchNorm2d(2, eps=1e-5)
output = bn(feature1)
print(f'nn.BatchNorm2d result: {output}')

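For feature1 above, the mean of the first channel is (0+1+2+3+8+9+10+11)/8 = 5.5, and its population standard deviation is ((1/8)*((0-5.5)**2 + ... + (11-5.5)**2))**(1/2) = 4.153. The first value, 0, should therefore become (0-5.5)/4.153 = -1.3242 after BN, and so on for the remaining values. (One more detail is eps=1e-5, which is added to the variance inside the square root; see the reference links.)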
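
The hand calculation can be double-checked with plain Python. This is a small sketch using only the standard library; the variable names are chosen for the example.

```python
import math

# Values of the first channel across both samples of feature1
values = [0, 1, 2, 3, 8, 9, 10, 11]

mean = sum(values) / len(values)
var = sum((v - mean) ** 2 for v in values) / len(values)  # population variance
std = math.sqrt(var)

eps = 1e-5
normalized_first = (values[0] - mean) / math.sqrt(var + eps)

print(mean)                        # 5.5
print(round(std, 4))               # 4.1533
print(round(normalized_first, 4))  # -1.3242
```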

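Things to watch out for when using BN:

- Set the training flag to True during training and to False during validation. In PyTorch this is controlled through the model's model.train() and model.eval() methods. This matters because BN computes its statistics differently in training and validation, as described above.
- Make the batch size as large as you can; with a small batch size BN can perform very poorly. The larger the batch, the closer the batch mean and variance are to those of the whole training set. (If the batch size has to be small, for example because a larger one runs out of GPU memory, consider Group Normalization instead.)
- Place the BN layer between the convolution layer (Conv) and the activation layer (e.g. ReLU), and do not use a bias in the convolution layer, because BN subtracts the per-channel mean and so cancels any constant bias; see the reference links.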
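
The last point can be illustrated with a small experiment: two convolutions share the same weights, one of them has a deliberately large bias, and both are followed by BN in training mode. The layer sizes and the bias value of 5.0 are arbitrary choices for this sketch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 3, 8, 8)

conv_with_bias = nn.Conv2d(3, 6, kernel_size=3, padding=1, bias=True)
conv_no_bias = nn.Conv2d(3, 6, kernel_size=3, padding=1, bias=False)
with torch.no_grad():
    # Same kernel weights in both layers; exaggerated bias in the first one
    conv_no_bias.weight.copy_(conv_with_bias.weight)
    conv_with_bias.bias.fill_(5.0)

bn_a = nn.BatchNorm2d(6).train()
bn_b = nn.BatchNorm2d(6).train()

out_a = bn_a(conv_with_bias(x))
out_b = bn_b(conv_no_bias(x))

# BN subtracts the per-channel mean, so the constant bias disappears
same = torch.allclose(out_a, out_b, atol=1e-4)
print(same)  # True
```

Since the bias has no effect on the output, dropping it saves parameters without changing what the block can represent.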

Layer Normalization

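How Layer Normalization works

BN normalizes each channel across the batch: for an input of shape [B,C,H,W], it takes one [B,1,H,W] slice per channel, computes its mean and variance, and normalizes with them.

LN normalizes the specified dimensions of each individual sample and does not depend on the batch.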
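
The difference can be checked directly. In this sketch (random toy data, illustrative variable names), the LN output for a sample is the same whether it is processed alone or with the rest of the batch, while the training-mode BN output changes with the batch composition.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 5, 10)  # [B, S, D]

# LayerNorm: each sample is normalized on its own
ln = nn.LayerNorm(10)
ln_full = ln(x)
ln_alone = ln(x[:1])
ln_same = torch.allclose(ln_full[:1], ln_alone, atol=1e-6)
print(ln_same)  # True

# BatchNorm (training mode): statistics come from the whole batch
bn = nn.BatchNorm1d(10).train()
xb = x[:, 0, :]       # [B, D]
bn_full = bn(xb)      # statistics over 4 samples
bn_pair = bn(xb[:2])  # statistics over 2 samples
bn_same = torch.allclose(bn_full[:2], bn_pair, atol=1e-6)
print(bn_same)  # False
```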

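When the input is natural-language data of shape [B,S,D], LN in PyTorch usually normalizes over the D-dimensional vector (normalized_shape is matched against the trailing dimensions, from last to first); see the PyTorch documentation in the reference links, whose NLP example normalizes over a single dimension. It can also normalize over [S,D]; see the blog post in the reference links that covers the [S,D] case.
When the input is image data of shape [B,C,H,W], the example in the PyTorch documentation normalizes over [C,H,W], whereas the ConvNeXt network normalizes over the channel dimension C only, by permuting [B,C,H,W] -> [B,H,W,C] before applying LN.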
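
The channel-only variant can be sketched as follows, assuming a random toy input and no learnable scale or shift. The permute route is compared against a manual computation over dim=1 to confirm that both normalize over C only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 3, 4, 4)  # [B, C, H, W]
C = x.shape[1]
eps = 1e-6

# Route 1: move C to the last position, apply layer norm over it, move it back
x_last = x.permute(0, 2, 3, 1)               # [B, H, W, C]
y_last = F.layer_norm(x_last, (C,), eps=eps)
y_permute = y_last.permute(0, 3, 1, 2)       # back to [B, C, H, W]

# Route 2: compute mean and variance over the channel dim directly
mean = x.mean(dim=1, keepdim=True)
var = (x - mean).pow(2).mean(dim=1, keepdim=True)
y_manual = (x - mean) / torch.sqrt(var + eps)

routes_match = torch.allclose(y_permute, y_manual, atol=1e-5)
print(routes_match)  # True
```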

Experimenting with Layer Normalization in PyTorch

The following example is taken from the reference links.

import torch
import torch.nn as nn

# NLP Example
batch, sentence_length, embedding_dim = 20, 5, 10
embedding = torch.randn(batch, sentence_length, embedding_dim)
layer_norm = nn.LayerNorm(embedding_dim)
# Activate module
layer_norm(embedding)
# Image Example
N, C, H, W = 20, 5, 10, 10
input = torch.randn(N, C, H, W)
# Normalize over the last three dimensions (i.e. the channel and spatial dimensions)
layer_norm = nn.LayerNorm([C, H, W])
output = layer_norm(input)

The following code is adapted from the reference links.

import torch
import torch.nn as nn


def layer_norm_process(feature: torch.Tensor, beta=0., gamma=1., eps=1e-5):
    # torch.var_mean returns (variance, mean); unbiased=False gives the population variance
    var_mean = torch.var_mean(feature, dim=-1, unbiased=False)
    # mean
    mean = var_mean[1]
    # variance
    var = var_mean[0]

    # layer norm process
    feature = (feature - mean[..., None]) / torch.sqrt(var[..., None] + eps)
    feature = feature * gamma + beta

    return feature


def main():
    # t = torch.rand(4, 2, 3)   # input from the original code; replaced with the line below so the result is easy to verify by hand
    t = torch.arange(24).to(dtype=torch.float32).view(2, 3, 4)
    print(t)
    # normalize over the last dimension only
    norm = nn.LayerNorm(normalized_shape=4, eps=1e-5)
    # official layer norm
    t1 = norm(t)
    # hand-written layer norm
    t2 = layer_norm_process(t, eps=1e-5)
    print("t1:\n", t1)
    print("t2:\n", t2)


if __name__ == '__main__':
    main()

The following code is from ConvNeXt.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-6, data_format='channels_last'):
        super().__init__()
        # learnable scale (gamma) and shift (beta), one value per channel
        self.weight = nn.Parameter(torch.ones(normalized_shape), requires_grad=True)
        self.bias = nn.Parameter(torch.zeros(normalized_shape), requires_grad=True)
        self.eps = eps
        self.data_format = data_format
        if self.data_format not in ['channels_last', 'channels_first']:
            raise ValueError(f"unsupported data format '{self.data_format}'")
        self.normalized_shape = (normalized_shape,)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.data_format == 'channels_last':       # [B,H,W,C]
            return F.layer_norm(x, self.normalized_shape, self.weight, self.bias, self.eps)
        elif self.data_format == "channels_first":    #  [B,C,H,W]
            # normalize over the channel dimension (dim=1) by hand
            mean = x.mean(1, keepdim=True)
            var = (x - mean).pow(2).mean(1, keepdim=True)
            x = (x - mean) / torch.sqrt(var + self.eps)
            x = self.weight[:, None, None] * x + self.bias[:, None, None]
            return x