Lecture 9: Machine Translation, LSTM & GRU
Current statistical machine translation systems
Parallel corpus: many sentences translated from one language into another.
- Source language, e.g. French
- Target language, e.g. English
\hat e = \arg\max_e p(e|f) = \arg\max_e p(f|e)\, p(e)
(by Bayes' rule; p(f) is constant with respect to e and can be dropped)
- Train p(f|e) on the parallel corpus
- Train p(e) on an English-only corpus
Step 1: Alignment (p(f|e))
Goal: learn which words or phrases in the source language translate to which words or phrases in the target language.
- Zero-fertility words: not translated.
- one-to-many alignment
- many-to-one alignment
- many-to-many alignment
Then we consider reordering of translated phrases.
Decode: search for the best of many hypotheses (beam search, etc.).
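To make the decoding step concrete, here is a minimal beam-search sketch in Python. The `step_fn` callback, the start/end tokens, and the toy model at the bottom are assumptions made for this illustration, not part of the lecture.

```python
import heapq

def beam_search(step_fn, start_token, end_token, beam_size=4, max_len=20):
    """Minimal beam-search sketch. `step_fn(prefix)` is an assumed callback
    that returns a list of (next_token, log_prob) continuations."""
    beams = [(0.0, [start_token])]        # (cumulative log-prob, token list)
    completed = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == end_token:   # finished hypothesis: set it aside
                completed.append((score, tokens))
                continue
            for tok, logp in step_fn(tokens):
                candidates.append((score + logp, tokens + [tok]))
        if not candidates:
            break
        # Keep only the beam_size highest-scoring hypotheses.
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    completed.extend(beams)
    return max(completed, key=lambda c: c[0])

# Toy usage: a fake "model" that prefers token "b" and then stops.
toy = lambda prefix: [("b", -0.1), ("</s>", -0.5)] if len(prefix) < 3 else [("</s>", 0.0)]
print(beam_search(toy, "<s>", "</s>", beam_size=2))
```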
Deep learning method
Encoder-Decoder
Encoder:
h_t = \phi(h_{t-1}, x_t)
Decoder:
h_t = \phi(h_{t-1})
y_t = \mathrm{softmax}(W^{(S)} h_t)
Training objective:
\max_\theta \frac{1}{N} \sum_{n=1}^N \log p_\theta (y^{(n)} | x^{(n)})
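A minimal NumPy sketch of this encoder-decoder, matching the recurrences and softmax above. The dimensions, the weight names (W_enc, U_dec, W_S, the embedding table), and the greedy decoding loop are illustrative assumptions, not a faithful NMT system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumed): vocabulary of 10 symbols, hidden size 8, embedding size 8.
V, H, E = 10, 8, 8

W_enc = rng.normal(scale=0.1, size=(H, E))   # encoder input-to-hidden
U_enc = rng.normal(scale=0.1, size=(H, H))   # encoder hidden-to-hidden
U_dec = rng.normal(scale=0.1, size=(H, H))   # decoder hidden-to-hidden
W_S   = rng.normal(scale=0.1, size=(V, H))   # softmax projection W^{(S)}
embed = rng.normal(scale=0.1, size=(V, E))   # source word embeddings

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def encode(src_ids):
    # h_t = phi(h_{t-1}, x_t), with phi a tanh of a linear map.
    h = np.zeros(H)
    for t in src_ids:
        h = np.tanh(W_enc @ embed[t] + U_enc @ h)
    return h            # last hidden state summarizes the source sentence

def decode(h, steps=5):
    # Simplest decoder from the notes: h_t = phi(h_{t-1}), y_t = softmax(W^{(S)} h_t).
    out = []
    for _ in range(steps):
        h = np.tanh(U_dec @ h)
        y = softmax(W_S @ h)
        out.append(int(y.argmax()))   # greedy choice at each step
    return out

print(decode(encode([1, 4, 2])))      # target-symbol ids for a toy source
```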
Some tricks/extensions for Encoder-Decoder:
- Train different RNNs for encoding and decoding
- Feed the encoder's last hidden vector c into every decoder step:
h_{D,t} = \phi_D(h_{t-1}, c, y_{t-1})
- Deeper RNNs
- Bidirectional encoder
- Train on the reversed source sequence:
instead of A B C -> X Y,
we use C B A -> X Y
Paper Reading: Better LM
Three ways to get a better LM:
- Better inputs
char level: d r e a m w o r k s
subword level: dre+am+wo+rks
word level: dreamworks
- Better regularization/preprocessing
Randomly replace words in a sentence with other words, or use bigram statistics to generate Kneser-Ney-inspired replacements.
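A small sketch of the random-replacement preprocessing, assuming a uniform replacement distribution over a toy vocabulary; a Kneser-Ney-inspired variant would instead bias the sampling toward words that appear in many distinct bigram contexts.

```python
import random

def noise_sentence(tokens, vocab, replace_prob=0.1, rng=random.Random(0)):
    """Data-noising sketch (hyperparameters assumed): with probability
    `replace_prob`, swap each token for a word drawn uniformly from `vocab`."""
    return [rng.choice(vocab) if rng.random() < replace_prob else tok
            for tok in tokens]

vocab = ["the", "cat", "dog", "sat", "ran", "on", "mat"]
print(noise_sentence("the cat sat on the mat".split(), vocab, replace_prob=0.3))
```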
GRU
Update gate:
z_t = \sigma(W^{(z)} x_t + U^{(z)} h_{t-1})
Reset gate:
r_t = \sigma(W^{(r)} x_t + U^{(r)} h_{t-1})
New memory content:
\hat h_t = \tanh(W x_t + r_t \odot U h_{t-1})
Final memory:
h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \hat h_t
- Allows the model to drop information that is irrelevant in the future.
- Units with short-term dependencies often have very active reset gates.
- Units with long-term dependencies often have active update gates z.
- If z is close to 1, there is less vanishing gradient!
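A single GRU step in NumPy, written directly from the four equations above; the parameter names and the toy sizes are assumptions for illustration.

```python
import numpy as np

def gru_cell(x_t, h_prev, params):
    """One GRU step, following the equations above. `params` holds the
    weight matrices (names are assumptions made for this sketch)."""
    W_z, U_z, W_r, U_r, W, U = (params[k] for k in ("W_z", "U_z", "W_r", "U_r", "W", "U"))
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)          # update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)          # reset gate
    h_hat = np.tanh(W @ x_t + r_t * (U @ h_prev))    # new memory content
    return z_t * h_prev + (1.0 - z_t) * h_hat        # final memory

# Toy usage with random weights (input size 3, hidden size 4).
rng = np.random.default_rng(0)
params = {k: rng.normal(scale=0.1, size=(4, 3) if k.startswith("W") else (4, 4))
          for k in ("W_z", "U_z", "W_r", "U_r", "W", "U")}
h = gru_cell(np.ones(3), np.zeros(4), params)
print(h.shape)   # (4,)
```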
LSTM
\begin{aligned}
i_t &= \sigma(W^{(i)} x_t + U^{(i)} h_{t-1}) \\
f_t &= \sigma(W^{(f)} x_t + U^{(f)} h_{t-1}) \\
o_t &= \sigma(W^{(o)} x_t + U^{(o)} h_{t-1}) \\
\hat c_t &= \tanh(W^{(c)} x_t + U^{(c)} h_{t-1}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \hat c_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
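The corresponding sketch for one LSTM step, following the six equations above; again the parameter layout and toy sizes are illustrative assumptions.

```python
import numpy as np

def lstm_cell(x_t, h_prev, c_prev, p):
    """One LSTM step, following the equations above. `p` maps assumed
    parameter names (W_i, U_i, ...) to weight matrices."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev)     # input gate
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev)     # forget gate
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev)     # output gate
    c_hat = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev)   # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat                      # new cell state
    h_t = o_t * np.tanh(c_t)                              # new hidden state
    return h_t, c_t

# Toy usage with random weights (input size 3, hidden size 4).
rng = np.random.default_rng(1)
p = {f"{m}_{g}": rng.normal(scale=0.1, size=(4, 3) if m == "W" else (4, 4))
     for g in "ifoc" for m in ("W", "U")}
h, c = lstm_cell(np.ones(3), np.zeros(4), np.zeros(4), p)
```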
Recent improvements to RNNs
Softmax Problem:
- Answers can only be predicted if they were seen during training and are part of the softmax vocabulary.
- It’s natural to learn new words in an active conversation.
Solution:
Mixture Model of softmax and pointers.
p(y_i|x_i) = g\, p_{vocab}(y_i|x_i) + (1-g)\, p_{ptr}(y_i|x_i)
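A small sketch of how this mixture can be evaluated, assuming the model has already produced the vocabulary softmax p_vocab, the pointer attention weights over the context window, and the mixing gate g.

```python
import numpy as np

def pointer_softmax_mixture(p_vocab, context_ids, attention, g, vocab_size):
    """Combine the vocabulary softmax with a pointer distribution, as in the
    equation above. All inputs are assumed to be given by the model."""
    p_ptr = np.zeros(vocab_size)
    for pos, word_id in enumerate(context_ids):
        p_ptr[word_id] += attention[pos]       # scatter pointer mass onto word ids
    return g * p_vocab + (1.0 - g) * p_ptr

# Toy example: 5-word vocabulary, 3-word context window.
p_vocab = np.array([0.1, 0.4, 0.2, 0.2, 0.1])
attention = np.array([0.7, 0.2, 0.1])          # pointer weights over context positions
print(pointer_softmax_mixture(p_vocab, [2, 0, 2], attention, g=0.6, vocab_size=5))
```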