CS224N notes_chapter9_machine translation & LSTM & GRU

Lecture 9: Machine Translation & LSTM & GRU

Current statistical machine translation systems

Parallel corpus: a large collection of sentence pairs from one language to another.

  • Source language, e.g. French
  • Target language, e.g. English
    $\hat e = \arg\max_e p(e|f) = \arg\max_e p(f|e)\,p(e)$ (derivation below)
  • Train $p(f|e)$ on the parallel corpus
  • Train $p(e)$ on an English-only corpus
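Why the product $p(f|e)\,p(e)$: by Bayes' rule the denominator $p(f)$ does not depend on $e$, so it drops out of the argmax:

$$\hat e = \arg\max_e p(e|f) = \arg\max_e \frac{p(f|e)\,p(e)}{p(f)} = \arg\max_e p(f|e)\,p(e)$$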

Step 1: Alignment (p(f|e))
Goal: figure out which words or phrases in the source language translate to which words or phrases in the target language.

  • zero-fertility words: not translated
  • one-to-many alignment
  • many-to-one alignment
  • many-to-many alignment

After alignment, we consider the reordering of the translated phrases.

Decoding: search for the best among many hypotheses
(beam search, etc.)
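As a rough illustration of the decoding step, here is a minimal beam-search sketch. The scoring callable `step_fn`, the token ids, and the beam size are hypothetical stand-ins for whatever model the decoder provides; this is not the implementation of any particular MT system.

```python
import math
from typing import Callable, List, Sequence, Tuple

def beam_search(
    step_fn: Callable[[Sequence[int]], Sequence[float]],  # prefix -> log-probs over the vocab
    bos_id: int,
    eos_id: int,
    beam_size: int = 4,
    max_len: int = 20,
) -> List[int]:
    """Return the highest-scoring hypothesis found by beam search."""
    beams: List[Tuple[List[int], float]] = [([bos_id], 0.0)]   # (tokens, cumulative log-prob)
    finished: List[Tuple[List[int], float]] = []

    for _ in range(max_len):
        candidates: List[Tuple[List[int], float]] = []
        for tokens, score in beams:
            for tok, lp in enumerate(step_fn(tokens)):
                candidates.append((tokens + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)       # keep only the best expansions
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == eos_id else beams).append((tokens, score))
        if not beams:                                           # every surviving beam has ended
            break

    pool = finished if finished else beams
    return max(pool, key=lambda c: c[1])[0]

# Toy usage: a fake model that always prefers token 2, which we treat as EOS.
fake_log_probs = [math.log(p) for p in (0.1, 0.2, 0.7)]
print(beam_search(lambda prefix: fake_log_probs, bos_id=0, eos_id=2))  # -> [0, 2]
```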

Deep learning method

Encoder-Decoder

Encoder:
$$h_t=\phi(h_{t-1},x_t)$$
Decoder:
$$h_t=\phi(h_{t-1}), \qquad y_t=\mathrm{softmax}(W^{(S)}h_t)$$
Target:
$$\max_\theta \frac{1}{N} \sum_{n=1}^N \log p_\theta (y^{(n)}|x^{(n)})$$
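A compact NumPy sketch of these equations, with $\phi$ taken to be a plain $\tanh$ recurrence and all dimensions and weights chosen purely for illustration (forward pass only, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, vocab = 8, 16, 50          # input dim, hidden dim, target vocab size (arbitrary)

W_enc = rng.normal(scale=0.1, size=(d_h, d_h))   # encoder: h_{t-1} -> h_t
U_enc = rng.normal(scale=0.1, size=(d_h, d_x))   # encoder: x_t     -> h_t
W_dec = rng.normal(scale=0.1, size=(d_h, d_h))   # decoder: h_{t-1} -> h_t
W_S   = rng.normal(scale=0.1, size=(vocab, d_h)) # softmax projection W^{(S)}

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def encode(xs):
    """h_t = tanh(W h_{t-1} + U x_t); return the final hidden state."""
    h = np.zeros(d_h)
    for x in xs:
        h = np.tanh(W_enc @ h + U_enc @ x)
    return h

def decode(h, steps=5):
    """h_t = tanh(W h_{t-1});  y_t = softmax(W^{(S)} h_t)."""
    ys = []
    for _ in range(steps):
        h = np.tanh(W_dec @ h)
        ys.append(softmax(W_S @ h))
    return ys

source = [rng.normal(size=d_x) for _ in range(6)]   # toy source "sentence"
outputs = decode(encode(source))
print([int(np.argmax(y)) for y in outputs])          # greedy token ids
```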

Some tricks/extensions for Encoder-Decoder:

  • Train different RNN for encoding and decoding
  • Add the last hidden vector of the encoder, $c$, to every decoder step:
    $$h_{D,t}=\phi_D(h_{t-1},c,y_{t-1})$$
  • Deeper RNN.
  • Bidirectional encoder
  • Train with the source sequence reversed:
    instead of A B C -> X Y,
    we use C B A -> X Y

Paper Reading: Better LM

Three ways to get a better LM:

  1. better inputs
    char level: d r e a m w o r k s
    subword level: dre+am+wo+rks
    word level: dreamworks
  2. Better Regularization/Preprocessing
    Randomly replace words in a sentence with other words, or use bigram statistics to generate Kneser-Ney-inspired replacements (a rough sketch follows this list).
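A rough sketch of the random-replacement idea. This shows only uniform replacement; the bigram / Kneser-Ney-inspired variant would swap in a smarter sampler. The function name and the probability `p` are just placeholders.

```python
import random

def randomly_replace(tokens, vocab, p=0.1, seed=0):
    """Replace each token with a random vocabulary word with probability p."""
    rng = random.Random(seed)
    return [rng.choice(vocab) if rng.random() < p else tok for tok in tokens]

# Toy usage on a whitespace-tokenized sentence.
print(randomly_replace("the cat sat on the mat".split(),
                       vocab=["dog", "tree", "ran", "blue"], p=0.3))
```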

GRU

update gate:
$$z_t = \sigma(W^{(z)}x_t + U^{(z)}h_{t-1})$$
reset gate:
$$r_t = \sigma(W^{(r)}x_t + U^{(r)}h_{t-1})$$
New memory content:
$$\hat h_t = \tanh(Wx_t + r_t \odot Uh_{t-1})$$
Final memory:
$$h_t = z_t \odot h_{t-1} + (1-z_t) \odot \hat h_t$$

  • Allows the model to drop information that is irrelevant for the future.
  • Units with short-term dependencies often have very active reset gates.
  • Units with long-term dependencies often have active update gates $z$.
  • If $z$ is close to 1, the gradient vanishes less (the previous state is carried through largely unchanged).
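The GRU step above, written out in NumPy. The weight matrices are random placeholders and only the forward computation is shown:

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 8, 16                      # illustrative input / hidden sizes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One (W, U) pair per gate, as in the formulas.
W_z, U_z = rng.normal(scale=0.1, size=(d_h, d_x)), rng.normal(scale=0.1, size=(d_h, d_h))
W_r, U_r = rng.normal(scale=0.1, size=(d_h, d_x)), rng.normal(scale=0.1, size=(d_h, d_h))
W_h, U_h = rng.normal(scale=0.1, size=(d_h, d_x)), rng.normal(scale=0.1, size=(d_h, d_h))

def gru_step(x_t, h_prev):
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)               # update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)               # reset gate
    h_tilde = np.tanh(W_h @ x_t + r_t * (U_h @ h_prev))   # new memory content
    return z_t * h_prev + (1.0 - z_t) * h_tilde           # final memory

h = np.zeros(d_h)
for x in [rng.normal(size=d_x) for _ in range(5)]:
    h = gru_step(x, h)
print(h.shape)  # (16,)
```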

LSTM

$$
\begin{aligned}
i_t &= \sigma(W^{(i)}x_t + U^{(i)}h_{t-1}) \\
f_t &= \sigma(W^{(f)}x_t + U^{(f)}h_{t-1}) \\
o_t &= \sigma(W^{(o)}x_t + U^{(o)}h_{t-1}) \\
\hat c_t &= \tanh(W^{(c)}x_t + U^{(c)}h_{t-1}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \hat c_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$
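And the corresponding LSTM step, again as a NumPy forward pass with placeholder random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 8, 16

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_pair():
    """One (W, U) pair of random weights for a single gate."""
    return (rng.normal(scale=0.1, size=(d_h, d_x)),
            rng.normal(scale=0.1, size=(d_h, d_h)))

(W_i, U_i), (W_f, U_f), (W_o, U_o), (W_c, U_c) = (make_pair() for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    i_t = sigmoid(W_i @ x_t + U_i @ h_prev)      # input gate
    f_t = sigmoid(W_f @ x_t + U_f @ h_prev)      # forget gate
    o_t = sigmoid(W_o @ x_t + U_o @ h_prev)      # output gate
    c_hat = np.tanh(W_c @ x_t + U_c @ h_prev)    # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat             # new cell state
    h_t = o_t * np.tanh(c_t)                     # new hidden state
    return h_t, c_t

h, c = np.zeros(d_h), np.zeros(d_h)
for x in [rng.normal(size=d_x) for _ in range(5)]:
    h, c = lstm_step(x, h, c)
print(h.shape, c.shape)
```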

Recent improvements on RNN

Softmax Problem:

  • A word can only be predicted if it was seen during training and is part of the softmax vocabulary.
  • Yet it is natural to pick up new words in the middle of an active conversation.

Solution:
Mixture Model of softmax and pointers.
$$p(y_i|x_i)=g\,p_{\mathrm{vocab}}(y_i|x_i)+(1-g)\,p_{\mathrm{ptr}}(y_i|x_i)$$
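A small numeric sketch of the mixture. Here both the pointer attention scores and the gate $g$ are hard-coded toy values (in the actual model they are produced by the network); the point is only to show how probability mass from the context is mixed with the vocabulary softmax:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

vocab_size = 10
context_ids = [3, 7, 3, 9]                   # token ids in the recent context window

vocab_logits = np.random.default_rng(0).normal(size=vocab_size)
p_vocab = softmax(vocab_logits)              # standard softmax over the vocabulary

ptr_scores = softmax(np.array([0.2, 1.5, 0.4, 0.9]))  # attention over context positions
p_ptr = np.zeros(vocab_size)
for pos, tok in enumerate(context_ids):
    p_ptr[tok] += ptr_scores[pos]            # copy probability mass onto context tokens

g = 0.7                                      # mixture gate (learned in practice)
p = g * p_vocab + (1.0 - g) * p_ptr
print(p.sum())                               # ~1.0: still a valid distribution
```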
