NLP interview sheet

Word Embedding

  • One-hot: sparse, dimension grows with the vocabulary size, carries no semantic information.
  • Word2Vec
    • Skip-gram: predict surrounding words according to central word.
    • CBOW: predict central word according to average embedding of surrounding words.
      Generally, skip-gram performs better than CBOW, because skip-gram adjusts the embedding of the central word against each surrounding word separately (so it is trained W times per occurrence, where W is the number of context words), while CBOW uses the average embedding of the surrounding words (trained only once). See the gensim sketch below.
  • GloVe: train embeddings such that the dot product of two word embeddings approximates the log of their co-occurrence count.

These static embeddings lack context information and face the out-of-vocabulary (OOV) problem.
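
A minimal sketch with gensim, assuming a toy two-sentence corpus and toy hyperparameters (vector_size, window); sg=1 selects skip-gram and sg=0 selects CBOW:

```python
from gensim.models import Word2Vec

# Toy corpus; in practice this would be a large collection of tokenized sentences.
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "log"]]

# sg=1 -> skip-gram (predict context from the central word); sg=0 -> CBOW.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(skipgram.wv["cat"].shape)             # (50,) dense embedding for "cat"
print(cbow.wv.most_similar("cat", topn=3))  # nearest neighbours in embedding space
```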

Tokenization

  • Word-level: faces the out-of-vocabulary problem; <UNK> is not enough to handle it, because <UNK> cannot be produced as meaningful output (by a generative model) and it discards the relation between words that share the same root.

  • Character-level: small vocabulary, but it loses much information and the tokenized sequence becomes quite long; the model has to spend much more effort to rebuild word-level information.

  • Subword tokenization:

    • BPE: starting from the character level, count the frequency of each adjacent token pair and merge the most frequent pair, adding the merged token to the vocabulary. Repeat the merges until the desired vocabulary size is reached (see the sketch after this list).
    • Byte-level BPE (BBPE): apply BPE at the byte level rather than the character level, which shrinks the base vocabulary and transfers easily across languages.
    • Unigram: initialize a vocabulary larger than the desired size and assign a unigram probability to each token based on frequency. Use dynamic programming to tokenize sentences with maximum log-likelihood, then discard the tokens that contribute least to the optimal tokenizations until the desired vocabulary size is reached.
      Unigram also suits Subword Regularization, which randomly samples one of the top-k tokenizations by log-likelihood instead of a fixed tokenization to improve generalizability.
    • SentencePiece: a tool that implements tokenization, including BPE and Unigram.
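
A minimal sketch of the BPE merge loop, assuming the classic toy corpus (low, lower, newest, widest) and a fixed number of merges instead of a target vocabulary size:

```python
from collections import Counter

def pair_counts(word_freqs):
    """Count adjacent symbol pairs over the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every occurrence of the chosen pair with its concatenation."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Words as tuples of characters plus an end-of-word marker, with corpus frequencies.
word_freqs = {tuple("low") + ("</w>",): 5, tuple("lower") + ("</w>",): 2,
              tuple("newest") + ("</w>",): 6, tuple("widest") + ("</w>",): 3}

for _ in range(10):  # in practice: repeat until the desired vocabulary size is reached
    pairs = pair_counts(word_freqs)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    word_freqs = merge_pair(best, word_freqs)
    print("merged", best)
```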

Different N’s impact in N-gram

Smaller N: denser statistics but more generic predictions.
Larger N: more specific context but sparser counts.
N-gram models also face the out-of-vocabulary problem (see the sketch below for the density/specificity trade-off).
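
A small sketch of the trade-off on an assumed 12-token toy corpus: as N grows, the n-grams become more specific but the counts become sparser (most n-grams are seen only once):

```python
from collections import Counter

tokens = "the cat sat on the mat the dog sat on the log".split()

def ngram_counts(tokens, n):
    """Count all n-grams of length n in the token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

for n in (1, 2, 3, 4):
    counts = ngram_counts(tokens, n)
    singletons = sum(1 for c in counts.values() if c == 1)
    print(f"n={n}: {len(counts)} distinct n-grams, {singletons} seen only once")
```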

Metrics to evaluate Language Model

Perplexity: roughly how many candidate words the model is choosing among when predicting the next word, or the uncertainty in deciding the next word. In theory, the lower the perplexity, the better the model.

However, the perplexity of human-written text is around 12, with great diversity. So greedy decoding or exhaustive beam search may not give the best output; usually top-k, top-p, or temperature sampling is applied.
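
A minimal sketch of the computation, assuming we already have the model's (natural) log-probability for each observed token: perplexity is the exponential of the average negative log-likelihood per token.

```python
import math

def perplexity(log_probs):
    """Perplexity = exp(-average log-likelihood per token)."""
    avg_nll = -sum(log_probs) / len(log_probs)
    return math.exp(avg_nll)

# Toy example: probabilities the model assigned to 4 observed tokens.
probs = [0.2, 0.5, 0.1, 0.4]
print(perplexity([math.log(p) for p in probs]))  # ~3.98, i.e. ~4 effective choices per token
```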

Why RNN is likely to suffer gradient explosion and vanishing

The derivative of the loss with respect to an early hidden state is a product of the derivatives between consecutive hidden states, and each of those derivatives is a function of the RNN's recurrent weights. So the gradient contains a power of the recurrent weight matrix (if we ignore the activation function). Once the number of recurrent steps becomes large, weights greater than 1 make it explode and weights less than 1 make it vanish.

Activation functions like sigmoid and tanh also tend to cause gradient vanishing because their derivative is near 0 for inputs with large absolute value. This makes it very difficult for the RNN to learn to preserve information over many timesteps.

CNNs and MLPs are less likely to suffer gradient explosion and vanishing because there is no repeated multiplication by the same weights. However, in very deep networks, gradient explosion and vanishing can still happen because of the stacked non-linear activations.
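
A small numpy sketch of the repeated-multiplication effect, assuming a toy 64-dimensional hidden state, an orthogonal recurrent matrix scaled below or above 1, and 100 steps of backpropagation through time with the activation ignored:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 64, 100  # hidden size and number of timesteps (toy values)

for scale in (0.8, 1.2):  # largest singular value below vs. above 1
    W = scale * np.linalg.qr(rng.normal(size=(d, d)))[0]  # scaled orthogonal matrix
    grad = np.ones(d)
    for _ in range(T):
        grad = W.T @ grad  # one step of backprop through time (activation ignored)
    print(f"scale={scale}: gradient norm after {T} steps = {np.linalg.norm(grad):.2e}")
# scale=0.8 -> the norm shrinks toward 0 (vanishing); scale=1.2 -> it blows up (explosion).
```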

How to deal with gradient explosion and vanishing

  • RNN
    • Layer Normalization: normalizes over the features of each sample.
    • Gradient Clipping: restrict the magnitude of the gradient to prevent gradient explosion (see the PyTorch sketch after this list).
      • How to set the threshold? Log the gradient norms during training and adjust accordingly. Usually larger models use a larger threshold and smaller models a smaller one.
    • Gated architecture: gates like the forget gate, input gate and output gate control how information flows through the network. In detail, the weighted sum of the memory (cell state) and the current input in an LSTM makes the gradient path additive, which is much less likely to explode or vanish. Also, the forget gate can scale the gradients flowing backward.
    • Shorter Sequences: with fewer recurrent steps, the powers of the weights are less likely to explode or vanish.
  • CNN and MLP
    • Batch Normalization: normalizes each channel over the batch.
    • Residual Connections: allow gradients to flow directly to earlier layers.
    • ReLU Activation, the gradient is constant 1 for activated neurons.
  • General
    • Xavier or Kaiming initialization: try to keep the variance of the activations roughly constant along the forward pass. Xavier for symmetric activation functions like tanh or sigmoid; Kaiming for ReLU.
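
A minimal PyTorch sketch of gradient clipping inside one training step, assuming a toy LSTM and random data; torch.nn.utils.clip_grad_norm_ also returns the pre-clipping norm, which is useful to log when choosing the threshold:

```python
import torch

model = torch.nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 50, 32)        # (batch, seq_len, features), toy data
target = torch.randn(8, 50, 64)   # toy regression target

output, _ = model(x)
loss = torch.nn.functional.mse_loss(output, target)

optimizer.zero_grad()
loss.backward()
# Rescale all gradients so that their global norm is at most max_norm.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
print("gradient norm before clipping:", float(total_norm))
```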

Variants of RNN

  • Bidirectional RNN: concatenate the forward and backward hidden states; only applicable when we have access to the whole sequence.
  • Multilayer RNN: the hidden states of one layer are fed as the input to the next RNN layer.
  • Gated RNN: in an LSTM, a forget gate controls how much information is retained in long-term memory (the cell state), an input gate controls how much new information is added to it, and an output gate controls how much of it influences the output (equations sketched below).
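
For reference, the standard LSTM gate equations (σ is the sigmoid, ⊙ is the element-wise product); the additive cell-state update is what keeps the backward path from being a pure product of weights:

```latex
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}; x_t] + b_f) &&\text{forget gate} \\
i_t &= \sigma(W_i [h_{t-1}; x_t] + b_i) &&\text{input gate} \\
o_t &= \sigma(W_o [h_{t-1}; x_t] + b_o) &&\text{output gate} \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}; x_t] + b_c) &&\text{candidate cell state} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t &&\text{additive cell-state update} \\
h_t &= o_t \odot \tanh(c_t) &&\text{hidden state / output}
\end{aligned}
```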

In Self-Attention, why is the dot product of Q and K divided by the square root of d?

  • The magnitude of the dot product grows quickly as the embedding dimension d increases, and inputs with large magnitude fall into the small-gradient (saturated) region of softmax, leading to gradient vanishing.
  • It works like normalization: if the components of Q and K have unit variance, the dot product has variance d, so dividing by the square root of d brings the variance back to roughly 1.
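
A small numpy sketch of the variance argument, assuming query/key components drawn i.i.d. from N(0, 1): the variance of q·k grows linearly with d, while q·k/sqrt(d) stays near 1:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (16, 64, 256, 1024):
    q = rng.normal(size=(10000, d))  # 10k query vectors with unit-variance components
    k = rng.normal(size=(10000, d))  # 10k key vectors
    scores = (q * k).sum(axis=1)     # raw dot products q·k
    print(f"d={d:4d}  var(q·k)={scores.var():7.1f}  "
          f"var(q·k/sqrt(d))={(scores / np.sqrt(d)).var():.2f}")
```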

Tricks in multi-head self-attention

  • LN: Normalizes the outputs to be within a consistent range, preventing too much variance in scale of outputs.
  • Residual connections: make training converge faster and allow for deeper networks.

Both of them are beneficial for preventing gradient explosion and vanishing.

  • Multi-head: aggregates information from different representation subspaces in parallel.
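
A minimal PyTorch sketch combining the three tricks, assuming toy sizes: multi-head self-attention followed by a residual connection and LayerNorm, as in a Transformer block:

```python
import torch

d_model, n_heads = 64, 8
attn = torch.nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)
ln = torch.nn.LayerNorm(d_model)

x = torch.randn(2, 10, d_model)          # (batch, seq_len, d_model), toy input
attn_out, attn_weights = attn(x, x, x)   # self-attention: Q = K = V = x
y = ln(x + attn_out)                     # residual connection followed by LayerNorm
print(y.shape, attn_weights.shape)       # (2, 10, 64) and (2, 10, 10)
```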

Positional Encoding

  • Absolute Encoding (e.g., the sinusoidal encoding of the original Transformer; see the sketch after this list).
  • Relative Encoding
    • ALiBi: add a distance-proportional bias (penalty) to the attention scores.
    • RoPE: rotate the query and key vectors according to their positions; the relative position is then captured by their dot product.
  • Trainable Encoding:
    • DeBERTa: learns relative-position embeddings that enter attention through disentangled content-to-position and position-to-content terms (see the DeBERTa entry below).
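
A minimal sketch of the classic absolute (sinusoidal) encoding from the original Transformer, assuming toy sizes:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Sine on even dimensions, cosine on odd dimensions."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)    # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

print(sinusoidal_positional_encoding(max_len=128, d_model=64).shape)  # (128, 64)
```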

BERT and its variants

  • RoBERTa (“Robustly optimized BERT approach”)
    • Train BERT with more data, longer, bigger batches, and more training steps.
    • Dynamic masking: BERT uses the same masking pattern for every epoch; RoBERTa recomputes the masks each time a sequence is fed to the model.
  • ELECTRA
    • Replace some of the tokens (using a small generator) and train the model to predict whether each token was replaced, so that every token in the corpus provides a training signal.
  • DeBERTa (Decoding-enhanced BERT with Disentangled Attention)
    • Each token is represented by 2 vectors, one for content and one for position.
    • Additional Q and K represent the relative position, and content-to-position and position-to-content attention terms are added to the final attention score.

Encoder-Decoder LM

  • BART (Bidirectional and Auto-Regressive Transformers): Reconstruct corrupted sentences.
  • T5: treat all tasks as seq2seq (text-to-text).

Why most of today’s NLP approaches use attention as the backbone rather than LSTM

  • Parallelism: all tokens in the sentence are processed simultaneously in the attention computation, which can be parallelized across devices, while an LSTM has to process the tokens sequentially. (Although the time complexity of an LSTM is O(n) and attention is O(n^2), the sequential nature limits the maximum training speed.)
  • Long-term Memory and Representation: an LSTM only uses a fixed-size hidden state to memorize the information from previous tokens. Even with the forget gate and input gate controlling the information in the hidden state, an LSTM still struggles with long-term dependencies in very long contexts. (Researchers have found LSTMs are mostly sensitive to roughly the nearest 50 tokens and memorize at most around 200 tokens of information.) In contrast, the attention mechanism directly computes pairwise interactions between all tokens, explicitly allowing information retrieval from all previous tokens.
  • Scalability: thanks to the parallelism and long-range memory of the attention mechanism, attention-based models readily gain performance as the model scale increases.

Sampling Method

  • Top-k: sample the next token from the k most probable candidates.
  • Top-p (nucleus): rank the candidates in descending order of probability and sample from the smallest top set whose cumulative probability reaches p.
  • Temperature: divide the logits by a given temperature before softmax. High temperature (>1) decreases the differences (flattens the distribution) and low temperature (<1) increases them (sharpens it).
  • Hybrid: apply top-k/top-p filtering, then temperature scaling and sampling (see the sketch below).
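
A minimal numpy sketch of the hybrid pipeline, assuming toy values for k, p and temperature and the common "filter first, then rescale and sample" order:

```python
import numpy as np

def sample_next_token(logits, k=50, p=0.9, temperature=0.8, seed=0):
    """Hybrid sampling sketch: top-k filter -> top-p filter -> temperature -> sample."""
    rng = np.random.default_rng(seed)
    # Top-k: keep only the k highest-logit candidates.
    top_k_idx = np.argsort(logits)[-k:]
    logits = logits[top_k_idx]
    # Top-p (nucleus): keep the smallest prefix whose cumulative probability reaches p.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, p) + 1]
    # Temperature: rescale the surviving logits, renormalize, and sample.
    scaled = logits[keep] / temperature
    final = np.exp(scaled - scaled.max())
    final /= final.sum()
    choice = rng.choice(len(keep), p=final)
    return int(top_k_idx[keep[choice]])   # index into the original vocabulary

logits = np.random.default_rng(1).normal(size=1000)  # toy logits over a 1000-token vocab
print(sample_next_token(logits))
```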