[nlp] Xiaosha Learns BERT (problem: mask token never seen at fine-tuning)

what is bert?

BERT: Bidirectional Encoder Representations from Transformers, i.e., the encoder of a bidirectional Transformer.

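Relationship to other word embeddings

Word2vec and similar embeddings are word-level: once trained, each word has one fixed vector. BERT instead builds representations at the sentence level, so the vector for a token depends on its context.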

why bert?

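BERT fundamentally changed the relationship between pre-trained word representations and specific downstream NLP tasks;
It uses two objectives, Masked LM and Next Sentence Prediction, to capture word-level and sentence-level representations respectively, which improves the generalization of the embedding model;
It showed the importance of bidirectional models for text representation;
It showed that pre-trained models can remove the need for many heavily engineered task-specific architectures;
The pre-training architecture differs only slightly from the final downstream architecture, which makes transfer learning straightforward;
It advanced the state of the art on 11 NLP tasks.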

bert general goal

Designed for pre-training bidirectional representations from unlabelled data
Conditions on both left and right context in all layers
The pre-trained model can be fine-tuned with one additional output layer for many tasks (e.g., NLI, QA, …)
For many tasks, no modifications to the BERT architecture are required

the architecture of bert

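BERT: every layer uses a bidirectional Transformer, trained with $P(w_i \mid w_1, \ldots, w_{i-1}, w_{i+1}, \ldots, w_n)$ as the objective.
OpenAI GPT: a left-to-right Transformer.
ELMo: two LSTMs, one left-to-right and one right-to-left, trained with $P(w_i \mid w_1, \ldots, w_{i-1})$ and $P(w_i \mid w_{i+1}, \ldots, w_n)$ as their respective objectives; the two representations are trained independently and then concatenated.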
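One way to picture the difference between the BERT and GPT rows above is the attention pattern each allows. The following is a minimal sketch, not code from either model: `attention_mask` is a hypothetical helper that builds a 0/1 matrix in which entry `[i][j]` says whether position `i` may look at position `j`. Note that in BERT the target word is hidden by corrupting the input, not by the attention mask itself.

```python
def attention_mask(n, bidirectional):
    """Return an n x n 0/1 matrix; entry [i][j] is 1 if position i may attend to position j."""
    return [[1 if bidirectional or j <= i else 0 for j in range(n)] for i in range(n)]

bert_mask = attention_mask(4, bidirectional=True)    # every position sees left and right context
gpt_mask = attention_mask(4, bidirectional=False)    # every position sees only itself and the left

for name, mask in (("BERT", bert_mask), ("GPT", gpt_mask)):
    print(name)
    for row in mask:
        print(" ".join(str(v) for v in row))
```

The GPT matrix is lower triangular, which is what "left-to-right" means in practice; the BERT matrix is all ones.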

Core idea of BERT


input & output

input representation:

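input embeddings = token embeddings + segment embeddings + position embeddings
Token Embeddings: the word vectors; the first token is the [CLS] marker, whose output can be used for downstream classification tasks;
Segment Embeddings: distinguish the two sentences, because pre-training includes not only the LM task but also a classification task that takes two sentences as input;
Position Embeddings: unlike the Transformer covered in the previous article, these are learned rather than computed from sinusoidal functions.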
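The sum above can be sketched with three small lookup tables. This is only an illustration under stated assumptions: the tables are filled with random numbers instead of learned weights, the hidden size is 4 instead of 768, and `make_table` and `embed` are hypothetical names. Real implementations typically also apply layer normalization and dropout after the sum, which is omitted here.

```python
import random

random.seed(0)
HIDDEN = 4  # toy hidden size; BERT_BASE uses 768

def make_table(num_rows, dim):
    """Random lookup table standing in for a learned embedding matrix."""
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(num_rows)]

token_table = make_table(10, HIDDEN)     # toy vocabulary of 10 token ids
segment_table = make_table(2, HIDDEN)    # sentence A = 0, sentence B = 1
position_table = make_table(8, HIDDEN)   # one learned row per position

def embed(token_ids, segment_ids):
    """Sum token, segment and position embeddings element-wise at every position."""
    vectors = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        rows = zip(token_table[tok], segment_table[seg], position_table[pos])
        vectors.append([t + s + p for t, s, p in rows])
    return vectors

vectors = embed(token_ids=[1, 5, 2, 7, 2], segment_ids=[0, 0, 0, 1, 1])
print(len(vectors), len(vectors[0]))
```

Token id 2 appears twice in the toy input but gets two different vectors, because its position and segment differ.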

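Sentence tokenization

[CLS]: start marker
[SEP]: separator between the sentences of a pair
Input is a single sentence or a sentence pair
WordPiece embeddings (English)
Character-level splitting (Chinese): yields a smaller vocabulary than word-level segmentation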
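As a rough sketch of how the markers above are laid out, the hypothetical helper `pack` below builds the token sequence and the matching segment ids for either a single sentence or a sentence pair. The example tokens are made up for illustration.

```python
def pack(tokens_a, tokens_b=None):
    """Build the token sequence and segment ids for a single sentence or a pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # sentence B and its closing [SEP]
    return tokens, segment_ids

tokens, segment_ids = pack(["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"])
print(tokens)
print(segment_ids)
```

For a single sentence, `pack(["my", "dog"])` returns only segment 0 ids and a single trailing [SEP].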

why subword?

Classic word representations cannot handle unseen or rare words well;
Character embeddings are one solution to the out-of-vocabulary (OOV) problem,
but they may be too fine-grained and miss some important information;
Subwords sit between words and characters: not too fine-grained, yet able to handle unseen or rare words;
e.g., subword = sub + word
Three commonly used algorithms:
a. Byte Pair Encoding(BPE)
b. WordPiece
c. Unigram Language Model
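To make the "subword = sub + word" example concrete, here is a minimal sketch of greedy longest-match-first splitting, the way a WordPiece vocabulary is usually applied at tokenization time. It does not show how the vocabulary is learned. The function name `wordpiece` and the toy vocabulary are assumptions for illustration; the "##" prefix marks a piece that continues a word.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter candidate
        if piece is None:
            return [unk]  # nothing matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"sub", "##word", "play", "##ing", "un", "##seen"}
print(wordpiece("subword", toy_vocab))
print(wordpiece("playing", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

With this toy vocabulary, a word never seen as a whole ("subword") is still covered by known pieces, while a word with no matching piece falls back to the unknown token.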

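pre-train (Task 1): Masked LM

Masked LM: randomly select 15% of the WordPiece tokens for masking
(1) replaced with the [MASK] token (80%)
(2) replaced with a random token (10%)
(3) left unchanged (10%)
Why bidirectional: when the model is applied to a task, it needs information from both the left and the right context, not only the left. (Important departure from previous embedding models: don’t train the model to predict the next word, but train it to predict the whole context.)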
problem: how can we prevent trivial copying via the self-attention mechanism?
solution: mask 15% of the tokens in the input sequence; train the model to predict these.
problem: masking creates a mismatch between pre-training and fine-tuning: the [MASK] token is never seen during fine-tuning.
solution:
1. do not always replace masked words with [MASK]; instead, choose 15% of the token positions at random for prediction
2. if the i-th token is chosen, replace it with:
a. the [MASK] token 80% of the time
b. a random token 10% of the time
c. the unchanged i-th token 10% of the time
3. then use T_i, the final hidden vector of the i-th token, to predict the original token with cross-entropy loss
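The 15% selection and the 80/10/10 replacement rule above can be sketched as follows. This is a simplified illustration, not the original data pipeline: `mask_tokens` is a hypothetical name, each position is selected with an independent 15% draw instead of picking an exact 15% of positions, the vocabulary is a toy list, and skipping [CLS] and [SEP] is an assumption of this sketch.

```python
import random

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted tokens, labels); labels are None where nothing is predicted."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or rng.random() >= select_prob:
            continue  # position not chosen for prediction
        labels[i] = tok  # the model must recover the original token here
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"            # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels

rng = random.Random(0)
vocab = ["my", "dog", "is", "cute", "he", "likes", "playing"]
tokens = ["[CLS]", "my", "dog", "is", "cute", "[SEP]"] * 200  # 800 ordinary tokens
corrupted, labels = mask_tokens(tokens, vocab, rng)
selected = [i for i, label in enumerate(labels) if label is not None]
print(len(selected) / 800)  # fraction of ordinary tokens chosen for prediction
```

The loss would then be computed only at the positions in `selected`, comparing the model's prediction there with `labels[i]`.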
why masking?

if we always used the [MASK] token, the model would not have to learn good representations for other words.
if we always used the [MASK] token or a random word, the model would learn that the observed word is never correct.
if we always used the [MASK] token or the observed word, the model would be biased toward trivially copying.

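pre-train (Task 2): Next Sentence Prediction (NSP)

Purpose: let the model learn relationships between sentence pairs, as needed for question answering, inference, and similar tasks
Training data construction:
50%: B is the actual sentence that follows A (labeled IsNext)
50%: B is a random sentence from the corpus (labeled NotNext)
Comparison:
BERT: transfers all parameters to initialize the end-task model
Other approaches: only sentence embeddings are transferred to downstream tasks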
The final model achieves 97%-98% accuracy on NSP.
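The IsNext/NotNext construction described above might look like the sketch below. It is an illustration under assumptions: `make_nsp_pairs` is a hypothetical name, the documents are toy lists of sentence strings, each pair is labeled with an independent 50% draw, and the random sentence is taken from a different document so that it cannot accidentally be the true next sentence.

```python
import random

def make_nsp_pairs(documents, rng):
    """Build (sentence_a, sentence_b, label) triples, roughly half IsNext and half NotNext."""
    pairs = []
    for doc_idx, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                pairs.append((doc[i], doc[i + 1], "IsNext"))
            else:
                # pick a random sentence from some other document
                others = [d for j, d in enumerate(documents) if j != doc_idx]
                pairs.append((doc[i], rng.choice(rng.choice(others)), "NotNext"))
    return pairs

documents = [["a1", "a2", "a3"], ["b1", "b2", "b3"], ["c1", "c2"]]
pairs = make_nsp_pairs(documents, random.Random(0))
for a, b, label in pairs:
    print(a, b, label)
```

Keeping sentences grouped by document is what makes both cases possible: the true next sentence comes from the same document, the random one from elsewhere.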

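Note: the authors stress that the choice of corpus is critical: use a document-level corpus rather than a sentence-level one, so that the model can extract features from long contiguous sequences.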

Fine-tuning

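The self-attention mechanism in the Transformer allows BERT to model many downstream tasks by swapping out the appropriate inputs and outputs.
Plug the task-specific inputs and outputs into BERT and fine-tune all parameters end to end.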
Hyperparameters:
Batch size: 16, 32
Learning rate (Adam): 5e-5, 3e-5, 2e-5
Number of epochs: 2, 3, 4
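The values listed above define a small search grid. A minimal sketch of enumerating it is shown below; the dictionary keys are arbitrary names chosen for this example, and how each configuration is trained and scored on a development set is left out.

```python
from itertools import product

batch_sizes = [16, 32]
learning_rates = [5e-5, 3e-5, 2e-5]
epoch_counts = [2, 3, 4]

# one dictionary per combination of the three hyperparameters
grid = [
    {"batch_size": b, "learning_rate": lr, "epochs": e}
    for b, lr, e in product(batch_sizes, learning_rates, epoch_counts)
]
print(len(grid))  # 2 * 3 * 3 = 18 configurations
print(grid[0])
```

Because the grid is this small, trying every combination on a downstream task is usually affordable.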

Parameters

Model sizes:
BERT_BASE (L = 12, H = 768, A = 12, total parameters = 110M)
BERT_LARGE (L = 24, H = 1024, A = 16, total parameters = 340M)
In all cases we set the feed-forward/filter size to be 4H, i.e., 3072 for H = 768 and 4096 for H = 1024.
L: the number of layers (Transformer blocks)
H: the dimensionality of the hidden layer
A: the number of self-attention heads
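The totals above can be roughly reproduced from L and H alone. The sketch below is a back-of-the-envelope count, not an official formula: `bert_param_count` is a hypothetical helper, and the vocabulary size of 30522, the 512 positions, the 2 segment types, the layer-normalization weights, and the pooler layer are assumptions taken from the commonly released English uncased configuration, none of which appear in the notes above. Pre-training output heads are not counted.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522, max_positions=512, num_segments=2):
    """Count weights and biases of a BERT encoder plus pooler (assumed layout, see lead-in)."""
    layer_norm = 2 * hidden                                   # scale and bias
    embeddings = (vocab_size + max_positions + num_segments) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden) + layer_norm   # Q, K, V and output projections
    ffn = hidden * 4 * hidden + 4 * hidden                    # H -> 4H
    ffn += 4 * hidden * hidden + hidden + layer_norm          # 4H -> H
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * (attention + ffn) + pooler

base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"BERT_BASE  ~ {base / 1e6:.0f}M")
print(f"BERT_LARGE ~ {large / 1e6:.0f}M")
```

The number of heads A does not enter the count, since the heads only split the H-dimensional projections. Under these assumptions the sketch lands close to the 110M and 340M figures quoted above.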

code

others

What each BERT layer learns

ACL 2019：What does BERT learn about the structure of language?
https://hal.inria.fr/hal-02131630/document
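Lower layers capture phrase-level structural information
Surface features are captured in the lower layers (3, 4), syntactic features in the middle layers (6-9), and semantic features in the upper layers (9-12)
Subject-verb agreement is handled in the middle layers (8, 9)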

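BERT variants

RoBERTa
Static masking -> dynamic masking
Removes the sentence-pair NSP task; the input is multiple consecutive sentences
More data, larger batch size, longer training
ALBERT
Parameter reduction (slimming down):
Insert a small dimension E between the vocabulary V and the hidden layer H
Share parameters across all layers: attention and FFN
SOP replaces NSP: the negative sample is two sentences from the same document in reversed order
BERT predicts the 15% of tokens that are masked; ALBERT predicts n-gram spans, which carry more complete semantic information
Training sequence length: ALBERT uses 512 for 90% of inputs, BERT uses 128 for 90%
Compared with BERT large: H: 1024 -> 4096, L: 24 -> 12; narrow and deep -> wide and shallow
Baidu ERNIE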
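The ALBERT trick of inserting a small dimension E between the vocabulary V and the hidden size H is easiest to see as arithmetic: a V x H embedding table becomes a V x E table followed by an E x H projection. The sketch below uses illustrative values only; V = 30000 and E = 128 are assumptions not stated in these notes, and `embedding_params` is a hypothetical helper.

```python
def embedding_params(vocab_size, hidden, bottleneck=None):
    """Embedding weights with or without a small intermediate dimension."""
    if bottleneck is None:
        return vocab_size * hidden                            # direct V x H table
    return vocab_size * bottleneck + bottleneck * hidden      # V x E table, then E x H projection

V, H, E = 30000, 4096, 128  # illustrative values, see lead-in
direct = embedding_params(V, H)
factorized = embedding_params(V, H, E)
print(direct, factorized)
print(f"reduction: {direct / factorized:.1f}x")
```

The saving grows with H, which is why the factorization matters most for the wide configuration (H = 4096) mentioned above.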

bert VS GPT

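Encoder vs. decoder
More data: BooksCorpus and Wikipedia vs. BooksCorpus only
[SEP] and [CLS] are used during pre-training (GPT introduces them only at fine-tuning)
More words per batch
Fine-tuning learning rate: chosen from (5e-5, 4e-5, 3e-5, and 2e-5) vs. a fixed 5e-5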