- Traditional language models
- RNNs
- RNN language models
- training problems and tricks
- RNN for other sequence tasks
- Bidirectional and deep RNNs
Language Models
computes a probability for a sequence of words: P(w1, ..., wT)
MT (useful for machine translation)
- word ordering (ab vs ba)
- word choice (home vs house)
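For reference, the standard chain-rule factorization behind this (the n-gram Markov assumption below truncates each history to the previous n-1 words):

```latex
P(w_1,\dots,w_T) = \prod_{t=1}^{T} P(w_t \mid w_1,\dots,w_{t-1})
\approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1},\dots,w_{t-1})
```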
Traditional
- conditioned on window of n previous words
- Markov assumption
- use counts to estimate prob.
- RAM requirement scales with the n-gram window size (counts for all observed n-grams must be stored)
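A toy sketch of the count-based estimate (my own illustration; real n-gram models also need smoothing/backoff, omitted here):

```python
from collections import Counter

def train_trigram_lm(corpus):
    """Count-based trigram LM: P(w | u, v) = count(u, v, w) / count(u, v)."""
    tri, bi = Counter(), Counter()
    for sent in corpus:
        words = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(words)):
            tri[(words[i-2], words[i-1], words[i])] += 1
            bi[(words[i-2], words[i-1])] += 1
    def prob(w, u, v):  # P(w | u, v), zero if the bigram context was never seen
        return tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    return prob

# toy usage
p = train_trigram_lm([["the", "cat", "is", "small"], ["the", "cat", "sat"]])
print(p("is", "the", "cat"))  # 0.5
```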
Recurrent Neural Networks
- RAM requirement scales only with the number of words (vocabulary size), not with sequence length
- use the same set of weights W at all time steps
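A minimal numpy sketch of one RNN LM time step (variable names and the tanh nonlinearity are my choices): h_t = f(W_hh h_{t-1} + W_hx x_t), y_hat_t = softmax(W_s h_t).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_lm_step(x_t, h_prev, W_hh, W_hx, W_s):
    """One time step of a vanilla RNN language model.
    The same W_hh, W_hx, W_s matrices are reused at every time step."""
    h_t = np.tanh(W_hh @ h_prev + W_hx @ x_t)  # new hidden state
    y_hat = softmax(W_s @ h_t)                 # distribution over the vocabulary
    return h_t, y_hat
```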
Vanishing or exploding gradients
- long-distance dependencies: in practice the model can only memorize about 5-6 previous words
- solution 1: initialize W to the identity matrix and use ReLU: f(z) = rect(z) = max(z, 0)
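A sketch of that initialization trick (the identity-RNN recipe; dimensions and names are my own):

```python
import numpy as np

def init_identity_relu_rnn(hidden_dim, input_dim, vocab_size):
    """Identity-initialized recurrent weights plus a ReLU activation,
    so that early in training h_t ~ relu(h_{t-1} + input term) and
    gradients pass through many time steps without shrinking."""
    W_hh = np.eye(hidden_dim)                             # identity recurrence
    W_hx = np.random.randn(hidden_dim, input_dim) * 0.01  # small random input weights
    W_s = np.random.randn(vocab_size, hidden_dim) * 0.01  # output projection
    relu = lambda z: np.maximum(z, 0.0)                   # f(z) = max(z, 0)
    return W_hh, W_hx, W_s, relu
```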
Bi-RNN
- the backward RNN just runs over the sequence in reversed order (see sketch below)
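A sketch of the bidirectional wiring (step_fwd/step_bwd stand for any per-step RNN update that returns the new hidden state):

```python
import numpy as np

def bi_rnn(xs, step_fwd, step_bwd, h0_fwd, h0_bwd):
    """Run one RNN left-to-right and a second RNN over the reversed sequence,
    then concatenate the two hidden states at every time step."""
    hs_fwd, h = [], h0_fwd
    for x in xs:                        # forward direction
        h = step_fwd(x, h)
        hs_fwd.append(h)
    hs_bwd, h = [], h0_bwd
    for x in reversed(xs):              # same network structure, reversed input order
        h = step_bwd(x, h)
        hs_bwd.append(h)
    hs_bwd.reverse()                    # align backward states with forward ones
    return [np.concatenate([f, b]) for f, b in zip(hs_fwd, hs_bwd)]
```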
SMT
f: French, source language
e: English, target language
ê = argmax_e p(e|f) = argmax_e p(f|e) p(e)
p(e): language model, acts as a weighting/prior term that controls the fluency of the output
p(f|e): translation model
or, analogously (e.g. speech recognition):
p(text|voice) ∝ p(voice|text) p(text)
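Why the product p(f|e) p(e) is enough: by Bayes' rule the denominator p(f) does not depend on e, so it drops out of the argmax:

```latex
\hat{e} = \arg\max_{e} p(e \mid f)
        = \arg\max_{e} \frac{p(f \mid e)\, p(e)}{p(f)}
        = \arg\max_{e} p(f \mid e)\, p(e)
```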
Translation model
- alignment: hard
- zero (a source word with no counterpart)
- one to many
- many to one
- many to many
- reordering
- many decoding options: use beam search (see sketch below)
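A minimal beam-search sketch (toy code of my own; `expand(seq)` is a hypothetical callback that returns candidate next words with their probabilities under the model):

```python
import math

def beam_search(start, expand, beam_size=5, max_len=20, eos="</s>"):
    """Keep only the `beam_size` best partial hypotheses at each step
    instead of exploring every possible continuation."""
    beams = [([start], 0.0)]                       # (partial sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            if seq[-1] == eos:                     # finished hypotheses are kept as-is
                candidates.append((seq, logp))
                continue
            for word, p in expand(seq):            # expand(seq) -> [(next_word, prob), ...]
                candidates.append((seq + [word], logp + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]                             # best-scoring sequence
```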
NMT
Advantage: an end-to-end trainable model; just specify a single final objective function, and everything else is learned inside the model
RNN Translation model extensions
- Train different RNN weights for encoding and decoding
- compute every hidden state in the decoder from (see sketch after this list):
- Previous hidden state
- last hidden vector of encoder
- previously predicted output word
- train stacked (deep) RNNs
- train a bidirectional encoder (occasionally)
- train on the input sequence in reverse order for simpler optimization (helps escape vanishing gradients): A B C -> X Y becomes C B A -> X Y
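A sketch of extension 2, one decoder step conditioned on the three inputs listed above (weight names are my own):

```python
import numpy as np

def decoder_step(h_prev, c_enc, y_prev_emb, W_hh, W_hc, W_hy, W_s):
    """Decoder hidden state computed from the previous decoder state h_prev,
    the encoder's last hidden vector c_enc, and the embedding of the
    previously predicted word y_prev_emb."""
    h_t = np.tanh(W_hh @ h_prev + W_hc @ c_enc + W_hy @ y_prev_emb)
    scores = W_s @ h_t
    e = np.exp(scores - scores.max())
    return h_t, e / e.sum()                        # distribution over the target vocabulary
```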
Advanced RNN
- LSTM
- GRU
GRU
- update gate: based on current input word vector and hidden state
z_t = sig(W^(z) x_t + U^(z) h_{t-1})
- reset gate:
r_t = sig(W^(r) x_t + U^(r) h_{t-1})
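The full GRU update as a numpy sketch (the candidate memory and final interpolation are the standard GRU equations; parameter names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU step: the reset gate decides how much past state feeds the new
    candidate memory; the update gate interpolates old state and candidate."""
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)            # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)            # reset gate
    h_tilde = np.tanh(W @ x_t + U @ (r_t * h_prev))  # candidate new memory
    h_t = z_t * h_prev + (1.0 - z_t) * h_tilde       # final hidden state
    return h_t
```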
LSTM
- Input gate
- Forget gate
- Output gate
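The corresponding LSTM step as a numpy sketch (standard formulation; parameter names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wi, Ui, Wf, Uf, Wo, Uo, Wc, Uc):
    """One LSTM step: input, forget, and output gates control a separate memory cell."""
    i_t = sigmoid(Wi @ x_t + Ui @ h_prev)            # input gate
    f_t = sigmoid(Wf @ x_t + Uf @ h_prev)            # forget gate
    o_t = sigmoid(Wo @ x_t + Uo @ h_prev)            # output gate
    c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev)        # candidate cell content
    c_t = f_t * c_prev + i_t * c_tilde               # updated memory cell
    h_t = o_t * np.tanh(c_t)                         # exposed hidden state
    return h_t, c_t
```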
Recent Improvements
- problem with softmax: no zero-shot word predictions (it cannot predict words unseen during training)
- solution: combine a pointer (copy words from the context) with the softmax (see sketch below)
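A rough sketch of the pointer/softmax combination (my own simplification of pointer-mixture models; `gate` and `attn_weights` are assumed to be produced by the network):

```python
def pointer_softmax_mix(p_vocab, context_words, attn_weights, gate):
    """Mix the usual softmax over the vocabulary with a pointer distribution
    that copies words from the recent context:
    p(w) = gate * p_vocab(w) + (1 - gate) * attention mass pointing at w."""
    p_ptr = {}
    for w, a in zip(context_words, attn_weights):    # aggregate pointer mass per word
        p_ptr[w] = p_ptr.get(w, 0.0) + a
    return {w: gate * pv + (1.0 - gate) * p_ptr.get(w, 0.0)
            for w, pv in p_vocab.items()}
```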
Tricks
- problem: the softmax over the full vocabulary is huge and slow
- class-based word prediction (instead of a full softmax): p(w_t|history) = p(class_t|history) * p(w_t|class_t) (see sketch below)
- implementation trick: only one backward pass over the sequence is needed, accumulating the gradients from every time step
- initialize W to the identity matrix and use ReLU (as above)
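A sketch of class-based (factored) prediction (toy code of my own; `word2class` and `class_members` are hypothetical lookup tables mapping word ids to a class and classes to their member word ids):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def class_based_prob(h_t, word_id, word2class, class_members, W_class, W_word):
    """p(w | history) = p(class(w) | history) * p(w | class(w), history):
    compute one small softmax over classes and one over the words in that class,
    instead of a single softmax over the whole vocabulary."""
    c = word2class[word_id]
    p_class = softmax(W_class @ h_t)[c]              # p(class | history)
    members = class_members[c]                       # word ids inside class c
    p_word = softmax(W_word[members] @ h_t)[members.index(word_id)]
    return p_class * p_word
```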
How to improve word embeddings
Input: word -> subword
- morphemes: byte-pair encoding (BPE) (see sketch below)
- character-level embeddings
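A toy sketch of one byte-pair-encoding merge step (my own illustration of the idea: repeatedly merge the most frequent adjacent symbol pair so frequent character sequences become subword units):

```python
from collections import Counter

def most_frequent_pair(vocab):
    """vocab maps a word (as a tuple of symbols) to its corpus frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(vocab, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# toy usage: start from characters, apply one merge
vocab = {tuple("lower"): 2, tuple("lowest"): 1}
vocab = merge_pair(vocab, most_frequent_pair(vocab))
```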
regularization
- preprocessing: replace some words, drop frequent words, and add infrequent words
Task List
- NER todo: see lecture 8
- Machine Translation:
todos:
- Recap the word vector equations shown at the beginning of lecture 9: Machine Translation and Advanced Recurrent LSTMs and GRUs
- replicate the NER paper