BN: Batch Normalization

Reference:

https://zhuanlan.zhihu.com/p/27938792

Method

Suppose the input of each batch is x=[x_0,x_1,x_2,...,x_n] (where each x_i is a sample and n is the batch size). If a Batch Normalization layer is added after the first layer, the computation of h_1 is replaced by the steps described below.


  • The matrix x first goes through the linear transformation W_{h_1} to obtain s_1.
    • Note: because the batch mean \mu_B is subtracted afterwards, the effect of a bias b would be cancelled out, so there is no need to include b.
  • s_1 then has the batch mean \mu_B subtracted and is divided by the batch standard deviation \sqrt{\sigma^2_B+\epsilon} to obtain s_2. \epsilon is a tiny positive number used to avoid division by zero.
    • \mu_B=\frac {1}{m} \sum^m_{i=1}W_{h_1}x_{i,:}
    • \sigma^2_B=\frac {1}{m} \sum^m_{i=1}(W_{h_1}x_{i,:}-\mu_B)^2
    • Note: however, s_2 is now essentially constrained to a standard normal distribution, which weakens the network's expressive power. To fix this, two new parameters \gamma and \beta are introduced; \gamma and \beta are learned by the network itself during training.
  • s_2 is multiplied by \gamma to rescale it, and then \beta is added as an offset, giving s_3.
  • To add nonlinearity, s_3 is then passed through an activation function such as ReLU.
  • The resulting h_1 is finally fed to the next layer as its input (a NumPy sketch of these steps follows this list).
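
To make the steps concrete, here is a minimal NumPy sketch of the training-time forward pass described in the list above; the function name bn_layer_forward, the shapes, and the example data are illustrative assumptions, not from the original post.

```python
import numpy as np

def bn_layer_forward(x, W_h1, gamma, beta, eps=1e-5):
    """Training-time forward pass of one layer followed by batch normalization.

    x:     (m, d_in)   batch of m samples
    W_h1:  (d_in, d_out) weight matrix (no bias b, as noted in the list)
    gamma, beta: (d_out,) learned scale and shift parameters
    """
    s1 = x @ W_h1                                # linear transform, no bias
    mu_B = s1.mean(axis=0)                       # batch mean per feature
    sigma2_B = s1.var(axis=0)                    # batch variance per feature
    s2 = (s1 - mu_B) / np.sqrt(sigma2_B + eps)   # normalize
    s3 = gamma * s2 + beta                       # rescale and shift
    h1 = np.maximum(s3, 0.0)                     # ReLU activation
    return h1, mu_B, sigma2_B

# Example usage with random data (shapes are illustrative)
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))                     # batch size 32, 8 input features
W_h1 = rng.normal(size=(8, 16))
gamma, beta = np.ones(16), np.zeros(16)
h1, mu_B, sigma2_B = bn_layer_forward(x, W_h1, gamma, beta)
```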

Note that the computation above applies during training. At test time we often predict just a single new sample, i.e. the batch size is 1. If \mu_B were still computed in the same way, \mu_B would simply be the new sample itself, and s_1-\mu_B would become 0.

So at test time, the \mu and \sigma^2 that are used are the mean \mu_P and variance \sigma^2_P of the entire training set.

The training-set mean \mu_P and variance \sigma^2_P are themselves usually computed during training with a moving average.
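
Below is a sketch of how these population statistics can be tracked with an exponential moving average during training and then used at test time; the function names and the momentum value are assumptions for illustration, continuing the NumPy sketch above.

```python
import numpy as np

def update_running_stats(running_mu, running_var, mu_B, sigma2_B, momentum=0.9):
    """Blend the current batch statistics into the running (population) estimates."""
    running_mu = momentum * running_mu + (1 - momentum) * mu_B
    running_var = momentum * running_var + (1 - momentum) * sigma2_B
    return running_mu, running_var

def bn_layer_inference(x, W_h1, gamma, beta, running_mu, running_var, eps=1e-5):
    """Test-time forward pass: normalize with the running statistics, not the batch's."""
    s1 = x @ W_h1
    s2 = (s1 - running_mu) / np.sqrt(running_var + eps)
    return np.maximum(gamma * s2 + beta, 0.0)    # scale, shift, ReLU
```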

Reposted from: https://www.cnblogs.com/abella/p/10282225.html

### Batch Normalization in Neural Networks Explained

#### Definition and Purpose

Batch normalization is a method used during the training of artificial neural networks to improve the speed, performance, and stability of these models by normalizing layer inputs through re-centering and rescaling[^2]. This helps mitigate internal covariate shift, i.e. changes in the distribution of inputs to layers deep within the network as the weights are updated.

#### Implementation Details

During each mini-batch update step, batch normalization computes the mean μ_B and variance σ²_B across all instances in the current batch B. Then, for every instance x^(i), the normalized value y^(i) is calculated as

\[
y^{(i)}=\frac{x^{(i)}-\mu_{B}}{\sqrt{\sigma_{B}^{2}+\epsilon}}
\]

where ε is a small constant added for numerical stability. Afterward, two learnable parameters γ and β are introduced per feature dimension, so that the transformed output z^(i) = γ·y^(i) + β controls the scale and offset after normalization has been applied. This keeps activations well behaved throughout forward propagation while also providing regularization benefits similar to the dropout techniques described elsewhere[^3].

#### Benefits During Training Phase

Applying batch norm brings several advantages: faster convergence due to reduced vanishing/exploding-gradient problems; better generalization, partly because the mechanism acts like another form of noise injection into the learned representations; and less sensitivity to specific weight-initialization schemes, since the acceptable range of initial values becomes more forgiving after normalization.

```python
import tensorflow as tf

model = tf.keras.models.Sequential([
    # ... earlier layers ...
    tf.keras.layers.BatchNormalization(),  # normalize the previous layer's outputs
    # ... later layers ...
])
```
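
As a usage note (a sketch assuming the Keras model defined above), tf.keras.layers.BatchNormalization switches between batch statistics and the accumulated moving averages via the `training` argument, which matches the training/test distinction described earlier.

```python
import numpy as np

x = np.random.normal(size=(4, 32)).astype("float32")

# training=True: normalize with this batch's mean and variance
y_train = model(x, training=True)

# training=False (the default in model.predict): normalize with the
# moving averages accumulated during training
y_test = model(x, training=False)
```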