CS224N Notes 06: NMT, Seq2Seq, Attention

These notes cover machine translation with deep learning, focusing on the Seq2Seq model and its extension with an attention mechanism. They start from the basic Seq2Seq architecture, including how the encoder and decoder work, then show how bidirectional recurrent neural networks help capture dependencies in both directions, and finally introduce attention to improve translation quality.


CS224n: Natural Language Processing with Deep Learning

1. Neural Machine Translation with Seq2Seq

So far in this class, we’ve dealt with problems of predicting a single output: an NER label for a word, the single most likely next word in a sentence given the past few, and so on. However, there is a whole class of NLP tasks that rely on sequential output, or outputs that are sequences of potentially varying length. For example,

  • Translation: taking a sentence in one language as input and outputting the same sentence in another language.
  • Conversation: taking a statement or question as input and responding to it
  • Summarization: taking a large body of text as input and outputting a summary of it
In these notes, we’ll look at sequence-to-sequence models, a deep learning-based framework for handling these types of problems. This framework proved to be very effective and has, in fewer than 3 years, become the standard for machine translation.
1.1 Brief Note on Historical Approaches

In the past, translation systems were based on probabilistic models constructed from:

  • A translation model, telling us what a sentence/phrase in a source language most likely translates into.
  • A language model, telling us how likely a given sentence/phrase is overall.

These components were used to build translation systems based on words or phrases. As you might expect, a naïve word-based system would completely fail to capture differences in ordering between languages (e.g. where negation words go, the location of subject vs. verb in a sentence, etc.).
Phrase-based systems were most common prior to Seq2Seq. A phrase-based translation system can consider inputs and outputs in terms of sequences of phrases and can handle more complex syntaxes than word-based systems. However, long-term dependencies are still difficult to capture in phrase-based systems.
The advantage that Seq2Seq brought to the table, especially with its use of LSTMs, is that modern translation systems can generate arbitrary output sequences after seeing the entire input. They can even focus in on specific parts of the input automatically to help generate a useful translation.

1.2 Sequence-to-sequence Basics

Seq2Seq is a relatively new paradigm, with its first published usage in 2014 for English-French translation. At a high level, a seq-to-seq model is an end-to-end model made up of two recurrent neural networks:

  • An encoder, which takes the model’s input sequence as input and encodes it into a fixed-size “context vector”, and
  • A decoder, which uses the context vector from above as a “seed” from which to generate an output sequence.
For this reason, Seq2Seq models are often referred to as “encoder-decoder models.” We’ll look at the details of these two networks separately.
1.3 Seq2Seq architecture – encoder

The encoder network’s job is to read the input sequence to our model and generate a fixed-dimensional context vector C for the sequence. To do so, the encoder will use a recurrent neural network cell – usually an LSTM – to read the input tokens one at a time. The final hidden state of the cell will then become C. However, because it’s so difficult to compress an arbitrary-length sequence into a single fixed-size vector (especially for difficult tasks like translation), the encoder will usually consist of stacked LSTMs: a series of LSTM “layers” where each layer’s outputs are the input sequence to the next layer. The final layer’s LSTM hidden state will be used as C.
Seq2Seq encoders will often do something strange: they will process the input sequence in reverse. This is actually done on purpose. The idea is that, by doing this, the last thing the encoder sees will (roughly) correspond to the first thing that the model outputs; this makes it easier for the decoder to “get started” on the output, which in turn gives the decoder an easier time generating a proper output sentence. In the context of translation, we’re allowing the network to translate the first few words of the input as soon as it sees them; once it has the first few words translated correctly, it’s much easier to go on to construct a correct sentence than it is to do so from scratch. See Fig. 1 for an example of what such an encoder network might look like.
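The notes don’t prescribe an implementation, but as a minimal sketch (assuming PyTorch and illustrative sizes such as `embed_dim=256`, `hidden_dim=512`, `num_layers=3`), such an encoder might look like the following:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Stacked-LSTM encoder: reads the (reversed) source sequence and
    returns the final hidden state of the top layer as the context vector C."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)

    def forward(self, src_tokens):
        # src_tokens: (batch, src_len) integer token ids
        reversed_src = torch.flip(src_tokens, dims=[1])  # read the input in reverse
        embedded = self.embed(reversed_src)              # (batch, src_len, embed_dim)
        outputs, (h_n, c_n) = self.lstm(embedded)
        context = h_n[-1]                                # top layer's final hidden state = C
        return context, (h_n, c_n)
```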

1.4 Seq2Seq architecture – decoder

The decoder is also an LSTM network, but its usage is a little more complex than that of the encoder network. Essentially, we’d like to use it as a language model that’s “aware” of the words that it’s generated so far and of the input. To that end, we’ll keep the “stacked” LSTM architecture from the encoder, but we’ll initialize the hidden state of our first layer with the context vector from above; the decoder will literally use the context of the input to generate an output.
Once the decoder is set up with its context, we’ll pass in a special token to signify the start of output generation; in the literature, this is usually an <EOS> token appended to the end of the input (there’s also one at the end of the output). Then, we’ll run all three layers of LSTM, one after the other, following up with a softmax on the final layer’s output to generate the first output word. Then, we pass that word into the first layer, and repeat the generation. This is how we get the LSTMs to act like a language model. See Fig. 2 for an example of a decoder network.
Once we have the output sequence, we use the same learning strategy as usual. We define a loss, the cross entropy on the prediction sequence, and we minimize it with a gradient descent algorithm and back-propagation. Both the encoder and the decoder are trained at the same time, so that they both learn the same context vector representation.
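Continuing the same sketch (PyTorch, illustrative sizes, and the hypothetical `Encoder` above), a decoder seeded with the encoder’s final state and trained with cross entropy might look like this:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Stacked-LSTM decoder used as a conditional language model over the target vocabulary."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)     # scores over the target vocabulary

    def forward(self, tgt_tokens, encoder_state):
        # tgt_tokens: (batch, tgt_len), beginning with the special start token.
        # For simplicity, every decoder layer is seeded with the corresponding encoder
        # layer's state; the notes describe seeding the first layer with C.
        embedded = self.embed(tgt_tokens)
        outputs, state = self.lstm(embedded, encoder_state)
        logits = self.out(outputs)                       # (batch, tgt_len, vocab_size)
        return logits, state

def seq2seq_loss(encoder, decoder, src, tgt_in, tgt_out):
    """Cross-entropy loss on the predicted sequence; back-propagating through this
    loss updates the encoder and decoder jointly."""
    _, enc_state = encoder(src)
    logits, _ = decoder(tgt_in, enc_state)
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))
```

At inference time we would instead feed one token at a time, apply a softmax to the final layer’s output (`cross_entropy` does this internally during training), and pass the selected word back in as the next input until an end-of-sequence token is produced.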

1.5 Recap Basic NMT Example

Note that there is no connection between the lengths of the input and output; any length input can be passed in and any length output can be generated. However, Seq2Seq models are known to lose effectiveness on very long inputs, a consequence of the practical limits of LSTMs.
To recap, let’s think about what a Seq2Seq model does in order to translate the English “what is your name” into the French “comment t’appelles tu”. First, we start with 4 one-hot vectors for the input. These inputs may or may not be embedded into a dense vector representation (for translation, they usually are). Then, a stacked LSTM network reads the sequence in reverse and encodes it into a context vector. This context vector is a vector space representation of the notion of asking someone for their name. It’s used to initialize the first layer of another stacked LSTM. We run one step of each layer of this network, perform a softmax on the last layer’s output, and use that to select our first output word. This word is fed back into the network as input, and the rest of the sentence “comment t’appelles tu” is decoded in this fashion. During backpropagation, the encoder’s LSTM weights are updated so that it learns a better vector space representation for sentences, while the decoder’s LSTM weights are trained to allow it to generate grammatically correct sentences that are relevant to the context vector.

1.6 Bidirectional RNNs

Recall from earlier in this class that dependencies in sentences don’t just work in one direction; a word can have a dependency on another word before or after it. The formulation of Seq2Seq that we’ve talked about so far doesn’t account for that; at any timestep, we are only considering information (via the LSTM hidden state) from words before the current word. For NMT, we need to be able to effectively encode any input, regardless of dependency directions within that input, so this won’t cut it.
Bidirectional RNNs fix this problem by traversing a sequence in both directions and concatenating the resulting outputs (both cell outputs and final hidden states). For every RNN cell, we simply add another cell but feed inputs to it in the opposite direction; the output $o_t$ corresponding to the $t$-th word is the concatenated vector $[o_t^{(f)} \; o_t^{(b)}]$, where $o_t^{(f)}$ is the output of the forward-direction RNN on word $t$ and $o_t^{(b)}$ is the corresponding output from the reverse-direction RNN. Similarly, the final hidden state is $h = [h^{(f)} \; h^{(b)}]$, where $h^{(f)}$ is the final hidden state of the forward RNN and $h^{(b)}$ is the final hidden state of the reverse RNN.
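As a minimal sketch (again assuming PyTorch; sizes are illustrative), this concatenation is exactly what a built-in bidirectional LSTM produces:

```python
import torch
import torch.nn as nn

# Bidirectional LSTM: each time step's output is the concatenation
# [o_t^{(f)} ; o_t^{(b)}] of the forward and backward cell outputs.
birnn = nn.LSTM(input_size=256, hidden_size=512, batch_first=True, bidirectional=True)

x = torch.randn(8, 10, 256)               # (batch, seq_len, input_size)
outputs, (h_n, c_n) = birnn(x)
print(outputs.shape)                      # torch.Size([8, 10, 1024]): forward and backward halves concatenated
# h_n has shape (2, 8, 512): h_n[0] is h^{(f)}, h_n[1] is h^{(b)}.
h = torch.cat([h_n[0], h_n[1]], dim=-1)   # combined final hidden state [h^{(f)} ; h^{(b)}], shape (8, 1024)
```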

2 Attention Mechanism
2.1 Motivation

When you hear the sentence “the ball is on the field”, you don’t assign the same importance to all 6 words. You primarily take note of the words “ball”, “on” and “field”, since those are the words most important to you. Similarly, Bahdanau et al. noticed the flaw in using the final RNN hidden state as the single “context vector” for sequence-to-sequence models: often, different parts of an input have different levels of significance. Moreover, different parts of the output may even consider different parts of the input “important”. For example, in translation, the first word of the output is usually based on the first few words of the input, but the last word is likely based on the last few words of the input.
Attention mechanisms make use of this observation by providing the decoder network with a look at the entire input sequence at every decoding step; the decoder can then decide what input words are important at any point in time. There are many types of attention mechanisms, but we’ll examine the one introduced by Bahdanau et al.

2.2 Bahdanau et al. NMT model

Remember that our seq2seq model is made of two parts: an encoder that encodes the input sentence, and a decoder that leverages the information extracted by the encoder to produce the translated sentence. Basically, our input is a sequence of words $x_1,\cdots,x_n$ that we want to translate, and our target sentence is a sequence of words $y_1,\cdots,y_m$.
Encoder:
Let $(h_1,\cdots,h_n)$ be the hidden vectors representing the input sentence. These vectors are the output of a bi-LSTM, for instance, and capture the contextual representation of each word in the sentence.
Decoder:
We want to compute the hidden states $s_i$ of the decoder using a recursive formula of the form $s_i = f(s_{i-1}, y_{i-1}, c_i)$, where $s_{i-1}$ is the previous hidden vector, $y_{i-1}$ is the word generated at the previous step, and $c_i$ is a context vector that captures the context from the original sentence that is relevant to time step $i$ of the decoder.
The context vector $c_i$ captures relevant information for the $i$-th decoding time step (unlike the standard Seq2Seq, in which there is only one context vector). For each hidden vector from the original sentence $h_j$, compute a score $e_{i,j} = a(s_{i-1}, h_j)$, where $a$ is any function with values in $\mathbb{R}$, for instance a single-layer fully-connected neural network. We then end up with a sequence of scalar values $e_{i,1},\cdots,e_{i,n}$. Normalize these scores into a vector $\alpha_i = (\alpha_{i,1},\cdots,\alpha_{i,n})$ using a softmax layer: $\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k=1}^{n}\exp(e_{i,k})}$. The context vector is then the weighted sum of the encoder hidden vectors, $c_i = \sum_{j=1}^{n} \alpha_{i,j} h_j$. Intuitively, this vector captures the relevant contextual information from the original sentence for the $i$-th step of the decoder.
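As a minimal sketch of these equations (assuming PyTorch and a single-layer MLP for the scoring function $a$; all names and sizes are illustrative):

```python
import torch
import torch.nn as nn

class BahdanauStyleAttention(nn.Module):
    """Scores each encoder hidden vector h_j against the previous decoder state s_{i-1}
    with a single-layer MLP, softmax-normalizes the scores into alpha_i, and returns
    the context vector c_i = sum_j alpha_{i,j} * h_j."""
    def __init__(self, dec_dim, enc_dim, attn_dim=256):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=False)
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, s_prev, h):
        # s_prev: (batch, dec_dim)    previous decoder state s_{i-1}
        # h:      (batch, n, enc_dim) encoder hidden vectors h_1, ..., h_n
        e = self.v(torch.tanh(self.W_s(s_prev).unsqueeze(1) + self.W_h(h)))  # (batch, n, 1) scores e_{i,j}
        alpha = torch.softmax(e.squeeze(-1), dim=-1)       # (batch, n) attention weights alpha_i
        c = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)    # (batch, enc_dim) context vector c_i
        return c, alpha
```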

2.3 Connection with translation alignment

(To be continued)
