Gradient Descent with Momentum
Compute an exponentially weighted average of the gradients, and use that average (rather than the raw gradient) to update the weights.
Algorithm
On iteration t:
- Compute $dW$ and $db$ on the current minibatch
- $V_{dW} = \beta_1 V_{dW} + (1 - \beta_1)\, dW$
- $V_{db} = \beta_1 V_{db} + (1 - \beta_1)\, db$
- $W = W - \alpha V_{dW}$
- $b = b - \alpha V_{db}$
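The update above can be sketched in NumPy as a single step (the function and variable names are illustrative, not from the source; defaults $\alpha = 0.01$, $\beta_1 = 0.9$ are common choices, not prescribed here):

```python
import numpy as np

def momentum_update(W, b, dW, db, V_dW, V_db, alpha=0.01, beta1=0.9):
    """One Momentum step: update the exponentially weighted averages
    of the gradients, then move the parameters along those averages."""
    V_dW = beta1 * V_dW + (1 - beta1) * dW
    V_db = beta1 * V_db + (1 - beta1) * db
    W = W - alpha * V_dW
    b = b - alpha * V_db
    return W, b, V_dW, V_db
```

The velocity terms $V_{dW}, V_{db}$ are initialized to zeros with the same shapes as $W, b$ and carried across iterations.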
RMSprop
On iteration t:
- Compute $dW$ and $db$ on the current minibatch
- $S_{dW} = \beta_2 S_{dW} + (1 - \beta_2)\, dW^2$ (element-wise square)
- $S_{db} = \beta_2 S_{db} + (1 - \beta_2)\, db^2$ (element-wise square)
- $W = W - \alpha \dfrac{dW}{\sqrt{S_{dW} + \varepsilon}}$
- $b = b - \alpha \dfrac{db}{\sqrt{S_{db} + \varepsilon}}$
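A minimal NumPy sketch of one RMSprop step, following the update rules above (names and the defaults $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$ are illustrative assumptions):

```python
import numpy as np

def rmsprop_update(W, b, dW, db, S_dW, S_db, alpha=0.001, beta2=0.999, eps=1e-8):
    """One RMSprop step: track an exponentially weighted average of the
    squared gradients, then scale each update by its root."""
    S_dW = beta2 * S_dW + (1 - beta2) * dW**2   # element-wise square
    S_db = beta2 * S_db + (1 - beta2) * db**2
    W = W - alpha * dW / np.sqrt(S_dW + eps)    # eps guards against division by zero
    b = b - alpha * db / np.sqrt(S_db + eps)
    return W, b, S_dW, S_db
```

Dividing by $\sqrt{S}$ shrinks the step in directions with consistently large gradients and enlarges it in flat directions, which is the point of the method.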
Adam
Adam (Adaptive Moment Estimation) combines Momentum and RMSprop.
Algorithm
On iteration t:
- Compute $dW$ and $db$ on the current minibatch
- $V_{dW} = \beta_1 V_{dW} + (1 - \beta_1)\, dW$, $\quad V_{db} = \beta_1 V_{db} + (1 - \beta_1)\, db$
- $S_{dW} = \beta_2 S_{dW} + (1 - \beta_2)\, dW^2$, $\quad S_{db} = \beta_2 S_{db} + (1 - \beta_2)\, db^2$ (element-wise square)
- $V^{\text{corrected}}_{dW} = \dfrac{V_{dW}}{1 - \beta_1^t}$, $\quad V^{\text{corrected}}_{db} = \dfrac{V_{db}}{1 - \beta_1^t}$ (bias correction)
- $S^{\text{corrected}}_{dW} = \dfrac{S_{dW}}{1 - \beta_2^t}$, $\quad S^{\text{corrected}}_{db} = \dfrac{S_{db}}{1 - \beta_2^t}$ (bias correction)
- $W = W - \alpha \dfrac{V^{\text{corrected}}_{dW}}{\sqrt{S^{\text{corrected}}_{dW}} + \varepsilon}$
- $b = b - \alpha \dfrac{V^{\text{corrected}}_{db}}{\sqrt{S^{\text{corrected}}_{db}} + \varepsilon}$
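The full Adam step for one parameter can be sketched as follows ($b$ is updated identically with its own $V$, $S$ accumulators; the defaults $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$ follow common practice and are not prescribed by the source):

```python
import numpy as np

def adam_update(W, dW, V, S, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a single parameter tensor W.

    V: Momentum-style average of gradients (first moment).
    S: RMSprop-style average of squared gradients (second moment).
    t: iteration count starting at 1, used for bias correction.
    """
    V = beta1 * V + (1 - beta1) * dW
    S = beta2 * S + (1 - beta2) * dW**2          # element-wise square
    V_corrected = V / (1 - beta1**t)             # undo zero-initialization bias
    S_corrected = S / (1 - beta2**t)
    W = W - alpha * V_corrected / (np.sqrt(S_corrected) + eps)
    return W, V, S
```

The bias corrections matter mostly in early iterations: with $V, S$ initialized to zeros, the uncorrected averages are biased toward zero until $\beta^t$ decays.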