Lecture 9: Machine Translation, LSTM & GRU
Current statistical machine translation systems
Parallel corpus: many sentences translated from one language into another.
- Source language, e.g. French
- Target language, e.g. English
\hat e = \arg\max_e p(e|f) = \arg\max_e p(f|e)\, p(e)
(by Bayes' rule; p(f) is constant with respect to e and can be dropped)
- Train p(f|e) on the parallel corpus
- Train p(e) on an English-only corpus
Step 1: Alignment (p(f|e))
Goal: learn which words or phrases in the source language translate to which words or phrases in the target language.
- Zero-fertility words: not translated.
- one-to-many alignment
- many-to-one alignment
- many-to-many alignment
Then we consider reordering of translated phrases.
Decode: search for the best of many hypotheses (beam search, etc.).
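To make the decoding step concrete, here is a minimal beam-search sketch in Python. The `step_fn` callback, the start/end tokens, and the toy model at the bottom are assumptions made for this illustration, not part of the lecture.

```python
import heapq

def beam_search(step_fn, start_token, end_token, beam_size=4, max_len=20):
    """Minimal beam-search sketch. `step_fn(prefix)` is an assumed callback
    that returns a list of (next_token, log_prob) continuations."""
    beams = [(0.0, [start_token])]        # (cumulative log-prob, token list)
    completed = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == end_token:   # finished hypothesis: set it aside
                completed.append((score, tokens))
                continue
            for tok, logp in step_fn(tokens):
                candidates.append((score + logp, tokens + [tok]))
        if not candidates:
            break
        # Keep only the beam_size highest-scoring hypotheses.
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    completed.extend(beams)
    return max(completed, key=lambda c: c[0])

# Toy usage: a fake "model" that prefers token "b" and then stops.
toy = lambda prefix: [("b", -0.1), ("</s>", -0.5)] if len(prefix) < 3 else [("</s>", 0.0)]
print(beam_search(toy, "<s>", "</s>", beam_size=2))
```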
Deep learning method
Encoder-Decoder
Encoder:
h_t = \phi(h_{t-1}, x_t)
Decoder:
h_t = \phi(h_{t-1})
y_t = \mathrm{softmax}(W^{(S)} h_t)
Training objective:
\max_\theta \frac{1}{N} \sum_{n=1}^N \log p_\theta (y^{(n)} | x^{(n)})
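A minimal NumPy sketch of this encoder-decoder, matching the recurrences and softmax above. The dimensions, the weight names (W_enc, U_dec, W_S, the embedding table), and the greedy decoding loop are illustrative assumptions, not a faithful NMT system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumed): vocabulary of 10 symbols, hidden size 8, embedding size 8.
V, H, E = 10, 8, 8

W_enc = rng.normal(scale=0.1, size=(H, E))   # encoder input-to-hidden
U_enc = rng.normal(scale=0.1, size=(H, H))   # encoder hidden-to-hidden
U_dec = rng.normal(scale=0.1, size=(H, H))   # decoder hidden-to-hidden
W_S   = rng.normal(scale=0.1, size=(V, H))   # softmax projection W^{(S)}
embed = rng.normal(scale=0.1, size=(V, E))   # source word embeddings

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def encode(src_ids):
    # h_t = phi(h_{t-1}, x_t), with phi a tanh of a linear map.
    h = np.zeros(H)
    for t in src_ids:
        h = np.tanh(W_enc @ embed[t] + U_enc @ h)
    return h            # last hidden state summarizes the source sentence

def decode(h, steps=5):
    # Simplest decoder from the notes: h_t = phi(h_{t-1}), y_t = softmax(W^{(S)} h_t).
    out = []
    for _ in range(steps):
        h = np.tanh(U_dec @ h)
        y = softmax(W_S @ h)
        out.append(int(y.argmax()))   # greedy choice at each step
    return out

print(decode(encode([1, 4, 2])))      # target-symbol ids for a toy source
```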
Some tricks/extensions for Encoder-Decoder:
- Train different RNNs for encoding and decoding
- Feed the encoder's last hidden vector c into every decoder step:
h_{D,t} = \phi_D(h_{t-1}, c, y_{t-1})
- Deeper RNNs
- Bidirectional encoder
- Train on the reversed source sequence:
instead of A B C -> X Y,
we use C B A -> X Y
Paper Reading: Better LM
Three ways to get a better LM:
- Better inputs
char level: d r e a m w o r k s
subword level: dre+am+wo+rks
word level: dreamworks
- Better regularization/preprocessing
Randomly replace words in a sentence with other words, or use bigram statistics to generate Kneser-Ney-inspired replacements.
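A small sketch of the random-replacement preprocessing, assuming a uniform replacement distribution over a toy vocabulary; a Kneser-Ney-inspired variant would instead bias the sampling toward words that appear in many distinct bigram contexts.

```python
import random

def noise_sentence(tokens, vocab, replace_prob=0.1, rng=random.Random(0)):
    """Data-noising sketch (hyperparameters assumed): with probability
    `replace_prob`, swap each token for a word drawn uniformly from `vocab`."""
    return [rng.choice(vocab) if rng.random() < replace_prob else tok
            for tok in tokens]

vocab = ["the", "cat", "dog", "sat", "ran", "on", "mat"]
print(noise_sentence("the cat sat on the mat".split(), vocab, replace_prob=0.3))
```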
GRU
Update gate:
z_t = \sigma(W^{(z)} x_t + U^{(z)} h_{t-1})
Reset gate:
r_t = \sigma(W^{(r)} x_t + U^{(r)} h_{t-1})
New memory content:
\hat h_t = \tanh(W x_t + r_t \odot U h_{t-1})
Final memory:
h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \hat h_t
- Allows the model to drop information that is irrelevant in the future.
- Units with short-term dependencies often have very active reset gates.
- Units with long-term dependencies often have active update gates z.
- If z is close to 1, there is less vanishing gradient!
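A single GRU step in NumPy, written directly from the four equations above; the parameter names and the toy sizes are assumptions for illustration.

```python
import numpy as np

def gru_cell(x_t, h_prev, params):
    """One GRU step, following the equations above. `params` holds the
    weight matrices (names are assumptions made for this sketch)."""
    W_z, U_z, W_r, U_r, W, U = (params[k] for k in ("W_z", "U_z", "W_r", "U_r", "W", "U"))
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)          # update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)          # reset gate
    h_hat = np.tanh(W @ x_t + r_t * (U @ h_prev))    # new memory content
    return z_t * h_prev + (1.0 - z_t) * h_hat        # final memory

# Toy usage with random weights (input size 3, hidden size 4).
rng = np.random.default_rng(0)
params = {k: rng.normal(scale=0.1, size=(4, 3) if k.startswith("W") else (4, 4))
          for k in ("W_z", "U_z", "W_r", "U_r", "W", "U")}
h = gru_cell(np.ones(3), np.zeros(4), params)
print(h.shape)   # (4,)
```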
LSTM
\begin{aligned}
i_t &= \sigma(W^{(i)} x_t + U^{(i)} h_{t-1}) \\
f_t &= \sigma(W^{(f)} x_t + U^{(f)} h_{t-1}) \\
o_t &= \sigma(W^{(o)} x_t + U^{(o)} h_{t-1}) \\
\hat c_t &= \tanh(W^{(c)} x_t + U^{(c)} h_{t-1}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \hat c_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
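The corresponding sketch for one LSTM step, following the six equations above; again the parameter layout and toy sizes are illustrative assumptions.

```python
import numpy as np

def lstm_cell(x_t, h_prev, c_prev, p):
    """One LSTM step, following the equations above. `p` maps assumed
    parameter names (W_i, U_i, ...) to weight matrices."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev)     # input gate
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev)     # forget gate
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev)     # output gate
    c_hat = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev)   # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat                      # new cell state
    h_t = o_t * np.tanh(c_t)                              # new hidden state
    return h_t, c_t

# Toy usage with random weights (input size 3, hidden size 4).
rng = np.random.default_rng(1)
p = {f"{m}_{g}": rng.normal(scale=0.1, size=(4, 3) if m == "W" else (4, 4))
     for g in "ifoc" for m in ("W", "U")}
h, c = lstm_cell(np.ones(3), np.zeros(4), np.zeros(4), p)
```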
Recent improvements to RNNs
Softmax Problem:
- Answers can only be predicted if they were seen during training and are part of the softmax vocabulary.
- It’s natural to learn new words in an active conversation.
Solution:
Mixture Model of softmax and pointers.
p(y_i|x_i) = g\, p_{vocab}(y_i|x_i) + (1-g)\, p_{ptr}(y_i|x_i)
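A small sketch of how this mixture can be evaluated, assuming the model has already produced the vocabulary softmax p_vocab, the pointer attention weights over the context window, and the mixing gate g.

```python
import numpy as np

def pointer_softmax_mixture(p_vocab, context_ids, attention, g, vocab_size):
    """Combine the vocabulary softmax with a pointer distribution, as in the
    equation above. All inputs are assumed to be given by the model."""
    p_ptr = np.zeros(vocab_size)
    for pos, word_id in enumerate(context_ids):
        p_ptr[word_id] += attention[pos]       # scatter pointer mass onto word ids
    return g * p_vocab + (1.0 - g) * p_ptr

# Toy example: 5-word vocabulary, 3-word context window.
p_vocab = np.array([0.1, 0.4, 0.2, 0.2, 0.1])
attention = np.array([0.7, 0.2, 0.1])          # pointer weights over context positions
print(pointer_softmax_mixture(p_vocab, [2, 0, 2], attention, g=0.6, vocab_size=5))
```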