Does freezing the later layers of a neural network affect the gradients of the earlier layers?
The answer is no.
Consider the simplest possible neural network: a single input $x$, one hidden neuron $h$, one output neuron $y$, a mean squared error loss $L$, and a ground-truth label $t$:
$$
\begin{gathered}
h = w_1 \cdot x \\
y = w_2 \cdot h \\
L = \frac{1}{2}(y - t)^2
\end{gathered}
$$
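As a concrete reference, here is a minimal PyTorch sketch of this toy model; the numeric values are hypothetical and chosen only for illustration:

```python
import torch

# Hypothetical toy values; any numbers would do for this illustration.
x = torch.tensor(2.0)                       # input
t = torch.tensor(1.0)                       # ground-truth label
w1 = torch.tensor(0.5, requires_grad=True)  # first-layer weight
w2 = torch.tensor(0.3, requires_grad=True)  # second-layer weight

h = w1 * x              # hidden neuron: h = w1 * x
y = w2 * h              # output neuron: y = w2 * h
L = 0.5 * (y - t) ** 2  # mean squared error loss
```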
Below we consider two cases, depending on whether $w_2$ is frozen, i.e. whether `w2.requires_grad` is True or False.
Case 1: `w2.requires_grad` is True
In this case, the gradient of $L$ with respect to $w_1$ is:
$$
\frac{\partial L}{\partial w_1}=\frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial h} \cdot \frac{\partial h}{\partial w_1}
$$
$$
\frac{\partial L}{\partial y}=\frac{\partial}{\partial y}\left(\frac{1}{2}(y-t)^2\right)=y-t
$$

$$
\frac{\partial y}{\partial h}=\frac{\partial}{\partial h}\left(w_2 \cdot h\right)=w_2
$$

$$
\frac{\partial h}{\partial w_1}=\frac{\partial}{\partial w_1}\left(w_1 \cdot x\right)=x
$$
Therefore:
$$
\frac{\partial L}{\partial w_1}=\frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial h} \cdot \frac{\partial h}{\partial w_1}=(y-t) \cdot w_2 \cdot x
$$
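This can be checked numerically with autograd. The sketch below uses the same hypothetical values as above and compares `w1.grad` against the hand-derived $(y-t) \cdot w_2 \cdot x$:

```python
import torch

x = torch.tensor(2.0)
t = torch.tensor(1.0)
w1 = torch.tensor(0.5, requires_grad=True)
w2 = torch.tensor(0.3, requires_grad=True)  # Case 1: w2 is trainable (not frozen)

h = w1 * x
y = w2 * h
L = 0.5 * (y - t) ** 2
L.backward()

# Autograd's gradient for w1 matches the hand-derived formula (y - t) * w2 * x.
manual = ((y - t) * w2 * x).detach()
print(w1.grad, manual)                  # both are approximately -0.42
print(torch.allclose(w1.grad, manual))  # True
```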
Case 2: `w2.requires_grad` is False
In this case, $w_2$ is treated as a constant, and the gradient of $L$ with respect to $w_1$ is still:
$$
\frac{\partial L}{\partial w_1}=\frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial h} \cdot \frac{\partial h}{\partial w_1}=(y-t) \cdot w_2 \cdot x
$$
This is because, whether or not $w_2$ is frozen, the partial derivative $\frac{\partial y}{\partial h}=\frac{\partial}{\partial h}\left(w_2 \cdot h\right)=w_2$ stays the same.
Computing the gradient of $w_1$ does not require the gradient of $w_2$; it only requires the value of the parameter $w_2$.
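Repeating the same sketch with $w_2$ frozen makes this concrete: the gradient that reaches $w_1$ is unchanged, and the only difference is that no gradient is stored for $w_2$.

```python
import torch

x = torch.tensor(2.0)
t = torch.tensor(1.0)
w1 = torch.tensor(0.5, requires_grad=True)
w2 = torch.tensor(0.3, requires_grad=False)  # Case 2: w2 is frozen

h = w1 * x
y = w2 * h
L = 0.5 * (y - t) ** 2
L.backward()

# w1 still receives (y - t) * w2 * x, exactly as in Case 1;
# only the value of w2 is needed, not its gradient.
print(torch.allclose(w1.grad, ((y - t) * w2 * x).detach()))  # True
print(w2.grad)                                               # None: no gradient is computed for w2
```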