Optimization Algorithms
Weighted averages (exponentially weighted moving averages)
$v_t = \beta v_{t-1} + (1-\beta)\theta_t$
Bias correction
$v_t^{\text{corrected}} = \frac{v_t}{1-\beta^t}$
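A minimal NumPy sketch of the exponentially weighted moving average with bias correction; the function name `ewma` and the sample series are illustrative assumptions, not from the original notes:

```python
import numpy as np

def ewma(theta, beta=0.9, correct_bias=True):
    """Exponentially weighted moving average of a 1-D series theta (illustrative helper)."""
    v = 0.0
    out = []
    for t, theta_t in enumerate(theta, start=1):
        v = beta * v + (1 - beta) * theta_t                   # v_t = beta*v_{t-1} + (1-beta)*theta_t
        out.append(v / (1 - beta**t) if correct_bias else v)  # bias correction: v_t / (1 - beta^t)
    return np.array(out)

# With beta = 0.9 the average roughly tracks the last 1/(1-beta) = 10 points.
print(ewma(np.ones(5)))  # bias-corrected values stay at 1.0 even for small t
```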
Gradient descent with Momentum
$v_{dw} = \beta v_{dw} + (1-\beta)\,dw$
$v_{db} = \beta v_{db} + (1-\beta)\,db$
$w := w - \alpha v_{dw}$
$b := b - \alpha v_{db}$
$\beta$ is commonly set to 0.9.
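A minimal sketch of a single Momentum update under these formulas; the function name `momentum_step` and the default `alpha=0.01` are illustrative assumptions:

```python
def momentum_step(w, b, dw, db, v_dw, v_db, alpha=0.01, beta=0.9):
    """One gradient-descent-with-Momentum update for parameters w, b."""
    v_dw = beta * v_dw + (1 - beta) * dw  # v_dw := beta*v_dw + (1-beta)*dw
    v_db = beta * v_db + (1 - beta) * db  # v_db := beta*v_db + (1-beta)*db
    w = w - alpha * v_dw                  # w := w - alpha * v_dw
    b = b - alpha * v_db                  # b := b - alpha * v_db
    return w, b, v_dw, v_db

# v_dw and v_db start at 0 and are carried across iterations.
```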
RMSprop (root mean square prop)
$S_{dw} = \beta_2 S_{dw} + (1-\beta_2)(dw)^2$
$S_{db} = \beta_2 S_{db} + (1-\beta_2)(db)^2$
$w := w - \alpha \frac{dw}{\sqrt{S_{dw}+\epsilon}}$
$b := b - \alpha \frac{db}{\sqrt{S_{db}+\epsilon}}$
where $\epsilon$ is a very small value, e.g. $10^{-8}$, added to prevent division by zero.
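A minimal NumPy sketch of one RMSprop step; the function name `rmsprop_step` and the default `alpha` are illustrative assumptions, and `eps` is kept inside the square root to match the formulas above:

```python
import numpy as np

def rmsprop_step(w, b, dw, db, s_dw, s_db, alpha=0.001, beta2=0.999, eps=1e-8):
    """One RMSprop update; s_dw, s_db accumulate squared gradients."""
    s_dw = beta2 * s_dw + (1 - beta2) * dw**2  # S_dw := beta2*S_dw + (1-beta2)*(dw)^2
    s_db = beta2 * s_db + (1 - beta2) * db**2  # S_db := beta2*S_db + (1-beta2)*(db)^2
    w = w - alpha * dw / np.sqrt(s_dw + eps)   # eps inside the sqrt, as written above
    b = b - alpha * db / np.sqrt(s_db + eps)
    return w, b, s_dw, s_db
```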
Adam (Adaptive Moment Estimation)
Initialize $V_{dw}=0,\ S_{dw}=0,\ V_{db}=0,\ S_{db}=0$. On iteration $t$:
$V_{dw} = \beta_1 V_{dw} + (1-\beta_1)\,dw,\quad V_{db} = \beta_1 V_{db} + (1-\beta_1)\,db$
$S_{dw} = \beta_2 S_{dw} + (1-\beta_2)(dw)^2,\quad S_{db} = \beta_2 S_{db} + (1-\beta_2)(db)^2$
$V_{dw}^{\text{corrected}} = \frac{V_{dw}}{1-\beta_1^t},\quad V_{db}^{\text{corrected}} = \frac{V_{db}}{1-\beta_1^t}$
$S_{dw}^{\text{corrected}} = \frac{S_{dw}}{1-\beta_2^t},\quad S_{db}^{\text{corrected}} = \frac{S_{db}}{1-\beta_2^t}$
$w := w - \alpha \frac{V_{dw}^{\text{corrected}}}{\sqrt{S_{dw}^{\text{corrected}}+\epsilon}}$
$b := b - \alpha \frac{V_{db}^{\text{corrected}}}{\sqrt{S_{db}^{\text{corrected}}+\epsilon}}$
Hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$; the learning rate $\alpha$ needs to be tuned.
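Putting the pieces together, a minimal NumPy sketch of one Adam step; the function name `adam_step` is an illustrative assumption, and `t` is the 1-based iteration count used in the bias correction:

```python
import numpy as np

def adam_step(w, b, dw, db, v_dw, v_db, s_dw, s_db, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration t (t starts at 1)."""
    v_dw = beta1 * v_dw + (1 - beta1) * dw     # first moment (Momentum part)
    v_db = beta1 * v_db + (1 - beta1) * db
    s_dw = beta2 * s_dw + (1 - beta2) * dw**2  # second moment (RMSprop part)
    s_db = beta2 * s_db + (1 - beta2) * db**2
    v_dw_c = v_dw / (1 - beta1**t)             # bias-corrected first moment
    v_db_c = v_db / (1 - beta1**t)
    s_dw_c = s_dw / (1 - beta2**t)             # bias-corrected second moment
    s_db_c = s_db / (1 - beta2**t)
    w = w - alpha * v_dw_c / np.sqrt(s_dw_c + eps)  # eps inside sqrt, matching the notes
    b = b - alpha * v_db_c / np.sqrt(s_db_c + eps)
    return w, b, v_dw, v_db, s_dw, s_db

# Initialize v_dw = v_db = s_dw = s_db = 0 before the first call, as above.
```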