Input
$X_i=(x_{i0},x_{i1},\dots,x_{i(n-1)})$, $i \in [0,m-1]$, where the batch size is $m$ and the feature dimension is $n$.
Output
$Y_i=(y_{i0},y_{i1},\dots,y_{i(n-1)})$, $i \in [0,m-1]$, with the same dimensions as the input $X$.
Forward computation
- Mean
$\mu = \{\mu_0,\mu_1,\dots,\mu_{n-1}\}$, where
$$\mu_p = \frac{1}{m}\sum_i x_{ip}$$
- Variance
$\sigma^2 = \{\sigma_0^2,\sigma_1^2,\dots,\sigma_{n-1}^2\}$, where
$$\sigma_p^2 = \frac{1}{m}\sum_i (x_{ip}-\mu_p)^2$$
- Intermediate result
$$\overline x_{ip}=\frac{x_{ip}-\mu_p}{\sqrt{\sigma_p^2+\epsilon}}$$
- Result
$$y_{ip}=\gamma_p \overline x_{ip}+\beta_p$$
where $\gamma = \{\gamma_0,\gamma_1,\dots,\gamma_{n-1}\}$ and $\beta = \{\beta_0,\beta_1,\dots,\beta_{n-1}\}$ are learnable parameters.
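The forward pass can be sketched in plain Python, looping over the same indices as the formulas. This is a minimal illustration, not a reference implementation: the function name `batchnorm_forward`, the default `eps`, and the sample data are assumptions made for this example, and the variance is taken to be the biased batch variance (divide by $m$).

```python
import math

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch-normalize an m x n list of lists, one feature column at a time."""
    m, n = len(x), len(x[0])
    # Per-feature mean: mu_p = (1/m) * sum_i x_ip
    mu = [sum(x[i][p] for i in range(m)) / m for p in range(n)]
    # Per-feature biased variance: sigma_p^2 = (1/m) * sum_i (x_ip - mu_p)^2
    var = [sum((x[i][p] - mu[p]) ** 2 for i in range(m)) / m for p in range(n)]
    # Normalized intermediate result x_bar_ip
    x_bar = [[(x[i][p] - mu[p]) / math.sqrt(var[p] + eps) for p in range(n)]
             for i in range(m)]
    # Scale and shift: y_ip = gamma_p * x_bar_ip + beta_p
    y = [[gamma[p] * x_bar[i][p] + beta[p] for p in range(n)] for i in range(m)]
    return y, x_bar, mu, var

# Hypothetical batch with m = 3 samples and n = 2 features.
x = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
y, x_bar, mu, var = batchnorm_forward(x, gamma=[1.0, 1.0], beta=[0.0, 0.0])
print(mu)  # [3.0, 6.0]
```

With $\gamma = 1$ and $\beta = 0$ the output equals $\overline x$, so each output column has mean zero and variance close to one (slightly below, because of $\epsilon$).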
Backward computation
By the chain rule, with $O$ denoting the scalar objective:
$$\frac{\partial O}{\partial x_{ij}}=\sum_{kl} \frac{\partial O}{\partial y_{kl}} \frac{\partial y_{kl}}{\partial x_{ij}} = \sum_{kl} \frac{\partial O}{\partial y_{kl}} \frac{\partial y_{kl}}{\partial \overline x_{kl}} \frac{\partial \overline x_{kl}}{\partial x_{ij}} = \sum_{kl} \frac{\partial O}{\partial y_{kl}} \gamma_l \frac{\partial \overline x_{kl}}{\partial x_{ij}} \quad (1)$$
Applying the quotient rule to $\overline x_{kl}$:
$$\frac{\partial \overline x_{kl}}{\partial x_{ij}} = \frac{ \frac{\partial (x_{kl}-\mu_l)}{\partial x_{ij}} \sqrt{\sigma_l^2+\epsilon} - \frac{\partial \sqrt{\sigma_l^2+\epsilon}}{\partial x_{ij}} (x_{kl}-\mu_l) }{ \sigma_l^2+\epsilon } \quad (2)$$
$$\frac{\partial (x_{kl}-\mu_l)}{\partial x_{ij}} = \delta_{ki}\delta_{lj} - \delta_{lj} \frac{1}{m} \quad (3)$$
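where
$$\delta_{pq}=
\begin{cases}
1 & p=q \\
0 & \text{otherwise}
\end{cases}$$
This symbol (the Kronecker delta) stands in for the if-else branches in the derivation. When it appears under a summation over one of its indices, it eliminates that sum.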
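As a small worked example of how the delta eliminates a summation, take an arbitrary indexed quantity $a$ (a placeholder symbol introduced only for this illustration). Only the term whose index matches survives:

$$\sum_{l} \delta_{lj}\, a_{l} = a_{j}, \qquad \sum_{k}\sum_{l} \delta_{ki}\,\delta_{lj}\, a_{kl} = a_{ij}$$

The same collapse is what removes the sum over $l$ when the factor $\delta_{lj}$ meets $\sum_{kl}$ in equation (1).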
$$\frac{\partial \sqrt{\sigma_l^2 + \epsilon}}{\partial x_{ij}} = \frac{1}{m} \frac{1}{\sqrt{\sigma_l^2+\epsilon}} \delta_{lj} (x_{il} - \mu_l) \quad (4)$$
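Equation (4) is stated without intermediate steps. One way to reach it, treating $\sigma_l^2$ as the batch variance and reusing equation (3), is sketched below. The key fact is that the centered values sum to zero, $\sum_k (x_{kl}-\mu_l)=0$, which removes the contribution of $\partial \mu_l / \partial x_{ij}$:

$$\frac{\partial \sigma_l^2}{\partial x_{ij}} = \frac{2}{m}\sum_k (x_{kl}-\mu_l)\,\frac{\partial (x_{kl}-\mu_l)}{\partial x_{ij}} = \frac{2}{m}\,\delta_{lj}\left[(x_{il}-\mu_l) - \frac{1}{m}\sum_k (x_{kl}-\mu_l)\right] = \frac{2}{m}\,\delta_{lj}\,(x_{il}-\mu_l)$$

$$\frac{\partial \sqrt{\sigma_l^2+\epsilon}}{\partial x_{ij}} = \frac{1}{2\sqrt{\sigma_l^2+\epsilon}}\,\frac{\partial \sigma_l^2}{\partial x_{ij}} = \frac{1}{m}\,\frac{1}{\sqrt{\sigma_l^2+\epsilon}}\,\delta_{lj}\,(x_{il}-\mu_l)$$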
Substituting (3) and (4) into (2) gives
$$\frac{\partial \overline x_{kl}}{\partial x_{ij}} = \delta_{lj}
\frac
{
(\delta_{ki} - \frac{1}{m}) \sqrt{\sigma_l^2 + \epsilon} -
\frac{1}{m\sqrt{\sigma_l^2 + \epsilon}}(x_{kl}-\mu_l)(x_{il}-\mu_l)
}
{\sigma_l^2 + \epsilon}$$
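The Jacobian of the normalized value $\overline x_{kl}$ with respect to $x_{ij}$ can be checked numerically against central finite differences. The sketch below is only a sanity check under assumed inputs: the helper names `normalize`, `delta`, and `jacobian_entry`, the sample batch, and the step size `h` are all choices made for this example.

```python
import math

def normalize(x, eps):
    """Return x_bar, the per-feature mean, and the per-feature biased variance."""
    m, n = len(x), len(x[0])
    mu = [sum(row[p] for row in x) / m for p in range(n)]
    var = [sum((row[p] - mu[p]) ** 2 for row in x) / m for p in range(n)]
    x_bar = [[(row[p] - mu[p]) / math.sqrt(var[p] + eps) for p in range(n)]
             for row in x]
    return x_bar, mu, var

def delta(p, q):
    """Kronecker delta."""
    return 1.0 if p == q else 0.0

def jacobian_entry(x, k, l, i, j, eps):
    """Closed-form d x_bar_kl / d x_ij obtained by substituting (3), (4) into (2)."""
    m = len(x)
    _, mu, var = normalize(x, eps)
    v = var[l] + eps
    s = math.sqrt(v)
    numerator = ((delta(k, i) - 1.0 / m) * s
                 - (x[k][l] - mu[l]) * (x[i][l] - mu[l]) / (m * s))
    return delta(l, j) * numerator / v

# Hypothetical batch with m = 3 samples and n = 2 features.
x = [[0.5, -1.0], [2.0, 0.3], [-1.5, 4.0]]
eps, h = 1e-5, 1e-6
m, n = len(x), len(x[0])

max_err = 0.0
for i in range(m):
    for j in range(n):
        # Perturb a single input entry x_ij in both directions.
        x_plus = [row[:] for row in x]
        x_minus = [row[:] for row in x]
        x_plus[i][j] += h
        x_minus[i][j] -= h
        bar_plus, _, _ = normalize(x_plus, eps)
        bar_minus, _, _ = normalize(x_minus, eps)
        for k in range(m):
            for l in range(n):
                numeric = (bar_plus[k][l] - bar_minus[k][l]) / (2 * h)
                analytic = jacobian_entry(x, k, l, i, j, eps)
                max_err = max(max_err, abs(numeric - analytic))

print(max_err)  # small if the closed form matches the finite differences
```

Entries with $l \neq j$ are exactly zero in the closed form, reflecting that each feature column is normalized independently of the others.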
Substituting this into (1), the factor $\delta_{lj}$ eliminates the sum over $l$ and $\delta_{ki}$ picks out the $k=i$ term, which gives
$$\frac{\partial O}{\partial x_{ij}} =
\frac{\gamma_j}{m\sqrt{\sigma_j^2 + \epsilon}\,(\sigma_j^2 + \epsilon)} \left(
(\sigma_j^2 + \epsilon)\left( m\frac{\partial O}{\partial y_{ij}}-\sum_k\frac{\partial O}{\partial y_{kj}}\right) - (x_{ij}-\mu_j)\sum_k (x_{kj}-\mu_j)\frac{\partial O}{\partial y_{kj}}
\right) \quad \text{(done)}$$
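The final gradient can be sketched in plain Python and compared with finite differences. The code reads the first term as $\partial O/\partial y_{ij}$ and keeps $(x_{kj}-\mu_j)$ inside the sum over $k$, which is what the substitution into (1) produces. Everything named here is an assumption for illustration: the functions `forward`, `backward`, and `loss`, the weights `w` (chosen so that $O=\sum_{kl} w_{kl}\,y_{kl}$ and therefore $\partial O/\partial y_{kl}=w_{kl}$), and the sample data.

```python
import math

def forward(x, gamma, beta, eps):
    """Batch norm forward pass; returns y, the means, and the biased variances."""
    m, n = len(x), len(x[0])
    mu = [sum(row[p] for row in x) / m for p in range(n)]
    var = [sum((row[p] - mu[p]) ** 2 for row in x) / m for p in range(n)]
    y = [[gamma[p] * (row[p] - mu[p]) / math.sqrt(var[p] + eps) + beta[p]
          for p in range(n)] for row in x]
    return y, mu, var

def backward(x, gamma, dy, eps):
    """Closed-form dO/dx_ij, where dy[k][j] holds dO/dy_kj."""
    m, n = len(x), len(x[0])
    _, mu, var = forward(x, gamma, [0.0] * n, eps)
    dx = [[0.0] * n for _ in range(m)]
    for j in range(n):
        v = var[j] + eps
        sum_dy = sum(dy[k][j] for k in range(m))
        sum_dy_centered = sum(dy[k][j] * (x[k][j] - mu[j]) for k in range(m))
        coeff = gamma[j] / (m * math.sqrt(v) * v)
        for i in range(m):
            dx[i][j] = coeff * (v * (m * dy[i][j] - sum_dy)
                                - (x[i][j] - mu[j]) * sum_dy_centered)
    return dx

def loss(x, gamma, beta, w, eps):
    """Hypothetical objective O = sum_kl w_kl * y_kl, so dO/dy_kl = w_kl."""
    y, _, _ = forward(x, gamma, beta, eps)
    return sum(w[k][l] * y[k][l] for k in range(len(x)) for l in range(len(x[0])))

# Hypothetical batch with m = 3 samples and n = 2 features.
x = [[0.5, -1.0], [2.0, 0.3], [-1.5, 4.0]]
gamma, beta, eps = [1.5, -0.7], [0.1, 0.2], 1e-5
w = [[0.3, -0.2], [1.0, 0.5], [-0.4, 0.8]]
dx = backward(x, gamma, w, eps)

h = 1e-6
max_err = 0.0
for i in range(len(x)):
    for j in range(len(x[0])):
        x_plus = [row[:] for row in x]
        x_minus = [row[:] for row in x]
        x_plus[i][j] += h
        x_minus[i][j] -= h
        numeric = (loss(x_plus, gamma, beta, w, eps)
                   - loss(x_minus, gamma, beta, w, eps)) / (2 * h)
        max_err = max(max_err, abs(numeric - dx[i][j]))

print(max_err)  # small if the closed form matches the finite differences
```

Computing the two column sums once per feature keeps the cost at $O(mn)$, rather than re-evaluating the sum over $k$ for every $x_{ij}$.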