Pre-trained BERT Study Notes - Storm*Rage's Blog (CSDN)

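Attention in the Transformer

Compared with the encoder, the decoder has one additional sub-layer, Encoder-Decoder Attention. The decoder's two attention sub-layers compute weights over the output and the input, respectively:

Self-Attention: the relationship between the token currently being translated and the tokens already translated;
Encoder-Decoder Attention: the relationship between the token currently being translated and the encoded feature vectors.
[Figure: structure of the Encoder and Decoder]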

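2.4 Encoder-Decoder Attention
In the decoder, each Transformer block has one more sub-layer than in the encoder: encoder-decoder attention. In encoder-decoder attention, $Q$ comes from the decoder's previous output, while $K$ and $V$ come from the encoder's output.

Decoding in machine translation is sequential: when decoding the $k$-th feature vector, the model can only see the decoding results up to and including position $k-1$. The multi-head attention used under this constraint is called masked multi-head attention.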
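The masking rule can be sketched in plain Python. This is a minimal illustration rather than the code of any particular library; the helper names `causal_mask` and `masked_softmax` are made up for this example. It assumes the decoder input is shifted right by one, so row i holds only previously generated tokens and may attend to columns j <= i.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax that gives zero weight to masked-out positions."""
    out = []
    for row, allowed in zip(scores, mask):
        # Masked positions get -inf so that exp() maps them to 0.
        masked = [s if ok else -math.inf for s, ok in zip(row, allowed)]
        top = max(masked)
        exps = [math.exp(s - top) for s in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

scores = [[0.0] * 4 for _ in range(4)]  # uniform raw scores for 4 positions
weights = masked_softmax(scores, causal_mask(4))
```

With uniform scores, the first row puts all of its weight on position 0, while the last row spreads its weight evenly over all four positions.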

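BERT consists of three main layers: the embedding layer, the encoder layer, and the prediction layer.

3.2 Encoder layer
The encoder layer is essentially the same as the Transformer encoder. It is pre-trained on two self-supervised tasks, MLM and NSP.

3.2.1 Masked Language Model
Masked Language Model (MLM) training randomly masks some words in the input corpus and then predicts each masked word from its context, much like a cloze test.
In the BERT experiments, 15% of the WordPiece tokens are randomly selected for masking. During training a sentence is fed to the model many times, but a selected token is not always replaced by the mask symbol: once a token has been selected, it is replaced with [MASK] 80% of the time, replaced with a random other token 10% of the time, and left unchanged 10% of the time.
The reason is this: if a selected token were always replaced with [MASK], the model would never see it as an input during pre-training and would then meet unfamiliar words during fine-tuning, where [MASK] does not occur. Random tokens are added so that the Transformer has to maintain a distributed representation of every input token; otherwise the model could simply memorize that a given [MASK] stands for a particular word. Because a token is replaced with a random one only $15\% \times 10\% = 1.5\%$ of the time, the negative effect is negligible. Finally, since only 15% of the tokens are predicted in each pass, the model converges relatively slowly.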
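The 80% / 10% / 10% rule can be sketched as follows. This is a simplified illustration under stated assumptions, not BERT's actual data pipeline: the function name `mask_tokens` and the toy vocabulary are invented for the example, special tokens such as [CLS] and [SEP] are not handled, and a "random" token may happen to equal the original.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted, labels); labels[i] is None where no prediction is made."""
    if rng is None:
        rng = random.Random()
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                      # not selected for prediction
        labels[i] = tok                   # the model must recover this token
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK           # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokens = ["the", "cat", "sat", "on", "the", "mat"] * 2000
corrupted, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
selected = [i for i, lab in enumerate(labels) if lab is not None]
mask_share = sum(corrupted[i] == MASK for i in selected) / len(selected)
```

Over a long token sequence, roughly 15% of positions end up selected and roughly 80% of those carry [MASK]; positions that were not selected are always left untouched.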

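3.2.2 Next Sentence Prediction
The Next Sentence Prediction (NSP) task is to decide whether sentence B is the sentence that follows sentence A, outputting 'IsNext' if so and 'NotNext' otherwise. Training examples are generated by drawing pairs of consecutive sentences from the corpus: for 50% of the examples the drawn pair is kept and labeled IsNext, and for the other 50% the second sentence is replaced by a random sentence from the corpus and labeled NotNext. This relationship is encoded in the [CLS] symbol.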
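A minimal sketch of this sampling scheme is shown below. The function name `make_nsp_example` is hypothetical, and as a simplification the negative sentence is drawn from the same toy document rather than from elsewhere in a large corpus.

```python
import random

def make_nsp_example(sentences, index, rng):
    """Build one (sentence_a, sentence_b, label) triple from consecutive sentences."""
    sentence_a = sentences[index]
    if rng.random() < 0.5:
        # Keep the true next sentence.
        return sentence_a, sentences[index + 1], "IsNext"
    # Otherwise draw a random sentence that is neither A nor the true next one.
    candidates = [j for j in range(len(sentences)) if j not in (index, index + 1)]
    return sentence_a, sentences[rng.choice(candidates)], "NotNext"

doc = ["s0", "s1", "s2", "s3", "s4"]
rng = random.Random(0)
examples = [make_nsp_example(doc, i % 4, rng) for i in range(1000)]
is_next_share = sum(label == "IsNext" for _, _, label in examples) / len(examples)
```

About half of the generated pairs are labeled IsNext, matching the 50/50 split described above.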

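3.3 Prediction layer
The prediction layer is a linear fully connected layer followed by softmax normalization. For downstream tasks, BERT can be treated as a feature-extraction encoder and used flexibly according to the task. BERT is applied in four scenarios:

Sentence-pair classification, such as sentence similarity and textual entailment
Single-sentence classification, such as sentiment classification
QA tasks, such as reading comprehension: the question and document are packed as a sentence pair, and the model outputs the start and end positions of the answer
Sequence labeling, such as NER: a class is predicted at each position
For the NSP task, the conditional probability is $P = \mathrm{softmax}(CW^T)$, where $C$ is the output vector of the [CLS] symbol and $W$ is a learnable weight matrix. For other tasks, predictions are made from the corresponding BERT outputs; adding one output layer on top of BERT is enough to fine-tune it for a specific task.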
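The formula $P = \mathrm{softmax}(CW^T)$ can be worked through with toy numbers. The vector and matrix below are invented for illustration (a hidden size of 3 and two classes); which row of $W$ corresponds to IsNext is not specified here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nsp_probabilities(c, w):
    """c: [CLS] vector of length H; w: 2 x H weight matrix. Returns softmax(C W^T)."""
    logits = [sum(ci * wi for ci, wi in zip(c, row)) for row in w]
    return softmax(logits)

c = [0.5, -1.0, 2.0]                      # toy [CLS] vector, H = 3
w = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]    # toy weights, one row per class
p = nsp_probabilities(c, w)               # logits are 2.5 and -1.0
```

Each row of $W$ produces one logit as a dot product with $C$, and the softmax turns the two logits into probabilities that sum to 1.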

Here Tok denotes the individual tokens, $E$ denotes an embedding vector, and $T_i$ denotes the feature vector of the $i$-th token after processing by BERT.

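The fine-tuning tasks include:

1. Sentence-pair classification
MNLI: given a premise, infer the relationship between a hypothesis and that premise. There are three possible relationships: entailment, contradiction, and neutral.
QQP: based on Quora; decide whether two Quora questions have the same meaning.
QNLI: decide whether a text contains the answer to a question, similar to locating the relevant paragraph in reading comprehension.
STS-B: predict the similarity of two sentences on a five-level scale.
MRPC: decide whether two sentences are equivalent.
RTE: similar to MNLI, but a binary classification of whether entailment holds.
SWAG: choose, from four candidate sentences, the one most likely to follow the preceding sentence.
2. Single-sentence classification
SST-2: sentiment analysis of movie reviews.
CoLA: judge whether a sentence is linguistically acceptable.
3. Question answering
SQuAD v1.1: given a sentence (usually a question) and a passage of text, output the answer to the question, similar to short-answer reading comprehension.
4. Named entity recognition
CoNLL-2003 NER: classify each word in a sentence as Person, Organization, Location, Miscellaneous, or Other (not a named entity).
IV. Source code analysis
The code analyzed is the PyTorch-based HuggingFace Transformers library. Source: https://github.com/huggingface/transformers. The entry class is BertModel.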

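4.1 Entry point and overall architecture
When fine-tuning BERT for a downstream task, one usually first constructs a BertModel, which then extracts features from the input sentence and produces the outputs.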

class BertModel(BertPreTrainedModel):

    def __init__(self, config):
        super().__init__(config)
        self.config = config

        self.embeddings = BertEmbeddings(config)  # embedding input layer
        self.encoder = BertEncoder(config)  # encoder layers
        self.pooler = BertPooler(config)  # pooler output layer ([CLS] output)

        self.init_weights()

    # Return the word_embeddings table of the embedding layer
    def get_input_embeddings(self):
        return self.embeddings.word_embeddings

    # Replace word_embeddings, e.g. to initialize them from other data
    def set_input_embeddings(self, value):
        self.embeddings.word_embeddings = value
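The constructor does three main things:

Reads the configuration config, including parameters such as vocab_size, num_attention_heads, and num_hidden_layers, which usually live in the configuration file bert_config.json
Constructs the embedding, encoder, and pooler objects, corresponding to the embedding layer, encoder layer, and prediction layer
Initializes the weights, using the pre-trained model when one is loaded, and performs multi-head pruning (prune_heads) where configured.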
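For orientation, the configuration parameters can be pictured as a plain dictionary. The values below are those published for the BERT-base (uncased) checkpoint and are shown only as an example; other checkpoints use different values, and the variable names are chosen for this illustration.

```python
# Values as published for the BERT-base (uncased) checkpoint; other checkpoints differ.
bert_base_config = {
    "vocab_size": 30522,
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "type_vocab_size": 2,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
}

# Each attention head works on hidden_size / num_attention_heads dimensions.
head_size = bert_base_config["hidden_size"] // bert_base_config["num_attention_heads"]
```

The hidden size must be divisible by the number of attention heads; here each head gets 64 dimensions.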
Feature extraction from the input sentence, and the resulting outputs, are implemented as follows:

def forward(
    self,
    input_ids=None,
    attention_mask=None,
    token_type_ids=None,
    position_ids=None,
    head_mask=None,
    inputs_embeds=None,
    encoder_hidden_states=None,
    encoder_attention_mask=None,
    output_attentions=None,
    output_hidden_states=None,
    return_dict=None,
):
    ...  # preprocessing, elided here; it builds extended_attention_mask and encoder_extended_attention_mask
    # Embedding layer: word_embedding, position_embedding, and token_type_embedding
    embedding_output = self.embeddings(
        input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
    )
    # Encoder layers: produce the encoding at every position, plus (when enabled)
    # all intermediate hidden states and all intermediate attention distributions
    encoder_outputs = self.encoder(
        embedding_output,
        attention_mask=extended_attention_mask,
        head_mask=head_mask,
        encoder_hidden_states=encoder_hidden_states,
        encoder_attention_mask=encoder_extended_attention_mask,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
    sequence_output = encoder_outputs[0]
    # Encoded vector at the [CLS] position
    pooled_output = self.pooler(sequence_output)

    if not return_dict:
        # Return the per-position encodings, the [CLS] encoding, and any
        # intermediate hidden states and attention distributions
        return (sequence_output, pooled_output) + encoder_outputs[1:]

    return BaseModelOutputWithPooling(
        last_hidden_state=sequence_output,
        pooler_output=pooled_output,
        hidden_states=encoder_outputs.hidden_states,
        attentions=encoder_outputs.attentions,
    )
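Extracting features from the input sentence involves the following steps:

Embedding layer: embeds input_ids, position_ids, and token_type_ids
Encoder layer: the embedded result passes through multiple Transformer encoder layers to produce the output. Every encoder layer has essentially the same structure: multi-head self-attention and feed-forward sub-layers, each followed by a residual connection and layer-norm
Pooler layer: applies a linear fully connected layer to the [CLS] position and uses the result as the output for the whole sequence.
Four results are returned:

sequence_output: the encoded output at every position, one vector per position
pooled_output: the encoded output at the [CLS] position after one Linear layer and an activation. [CLS] is commonly used to represent the whole sentence
hidden_states: the hidden states of all intermediate layers; saved only when enabled in config
attentions: the attention distributions of all intermediate layers; likewise saved only when enabled in config.
4.2 Embedding layer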
class BertEmbeddings(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

        # LayerNorm and dropout. LayerNorm applies a learnable scale and bias
        # after normalizing, so it has trainable parameters
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

        self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))

    def forward(self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None):
        # Get input_shape: [batch, seq_length]
        if input_ids is not None:
            input_shape = input_ids.size()
        else:
            input_shape = inputs_embeds.size()[:-1]

        seq_length = input_shape[1]
        # By default position_ids follow token order: 0 .. seq_length - 1
        if position_ids is None:
            position_ids = self.position_ids[:, :seq_length]
        # By default token_type_ids are all 0
        if token_type_ids is None:
            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device)
        # Turn ids into vectors by embedding lookup
        if inputs_embeds is None:
            inputs_embeds = self.word_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)
        # The final embedding is the plain sum of the three; any weighting is
        # absorbed into the trainable embedding parameters themselves
        embeddings = inputs_embeds + position_embeddings + token_type_embeddings
        # After normalization and dropout we have the final input vectors
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings
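The embedding layer does the following:

Looks up the vector for each id in three embedding tables: word_embeddings, position_embeddings, and token_type_embeddings. All three are learned during training.
Adds the three embedding vectors directly to obtain the total embedding.
Applies normalization and dropout to the total embedding.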
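These three steps can be tried out on toy sizes. The sketch below uses standard PyTorch modules (`nn.Embedding`, `nn.LayerNorm`) with made-up dimensions and randomly initialized tables, so it shows only the shapes and the data flow, not trained BERT embeddings; dropout is omitted because it is the identity at inference time.

```python
import torch
from torch import nn

torch.manual_seed(0)
vocab_size, max_positions, type_vocab_size, hidden_size = 100, 16, 2, 8

word_emb = nn.Embedding(vocab_size, hidden_size)
pos_emb = nn.Embedding(max_positions, hidden_size)
type_emb = nn.Embedding(type_vocab_size, hidden_size)
layer_norm = nn.LayerNorm(hidden_size, eps=1e-12)

input_ids = torch.tensor([[5, 17, 42, 8]])          # [batch=1, seq_length=4]
token_type_ids = torch.tensor([[0, 0, 1, 1]])       # sentence A = 0, sentence B = 1
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)  # [[0, 1, 2, 3]]

# Look up the three tables and add the results element-wise.
summed = word_emb(input_ids) + pos_emb(position_ids) + type_emb(token_type_ids)
embeddings = layer_norm(summed)                     # [1, 4, 8]
```

Every token ends up with a single vector of size hidden_size, which is what the encoder layers consume.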
4.3 Encoder layer
class BertEncoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)])

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        output_attentions=False,
        output_hidden_states=False,
        return_dict=False,
    ):
        all_hidden_states = () if output_hidden_states else None
        all_attentions = () if output_attentions else None
        # Iterate over all layers; every layer in BERT has the same structure
        for i, layer_module in enumerate(self.layer):
            # Save the hidden state entering each layer
            if output_hidden_states:
                all_hidden_states = all_hidden_states + (hidden_states,)

            if getattr(self.config, "gradient_checkpointing", False):

                def create_custom_forward(module):
                    def custom_forward(*inputs):
                        return module(*inputs, output_attentions)

                    return custom_forward

                # Run this layer's self-attention and feed-forward computation
                # to get the hidden output
                layer_outputs = torch.utils.checkpoint.checkpoint(
                    create_custom_forward(layer_module),
                    hidden_states,
                    attention_mask,
                    head_mask[i],
                    encoder_hidden_states,
                    encoder_attention_mask,
                )
            else:
                layer_outputs = layer_module(
                    hidden_states,
                    attention_mask,
                    head_mask[i],
                    encoder_hidden_states,
                    encoder_attention_mask,
                    output_attentions,
                )
            hidden_states = layer_outputs[0]
            # Save each layer's attention distribution (not saved by default)
            if output_attentions:
                all_attentions = all_attentions + (layer_outputs[1],)
        # Save the output of the last layer
        if output_hidden_states:
            all_hidden_states = all_hidden_states + (hidden_states,)

        if not return_dict:
            return tuple(v for v in [hidden_states, all_hidden_states, all_attentions] if v is not None)
        return BaseModelOutput(
            last_hidden_state=hidden_states, hidden_states=all_hidden_states, attentions=all_attentions
        )
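The encoder is composed of several structurally identical BertLayer sub-layers. It iterates over them, runs each layer's self-attention and feed-forward computation, and (when requested) saves each layer's hidden_state and attention distribution.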
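The bookkeeping of this loop is easy to miss: the hidden state is recorded before each layer and once more after the last one, so N layers yield N + 1 saved states. The toy sketch below reproduces only that control flow; `run_encoder` and the add-one "layers" are stand-ins invented for the example.

```python
def run_encoder(hidden, layers, output_hidden_states=True):
    """Apply layers in order; optionally collect each layer's input plus the final output."""
    all_hidden_states = () if output_hidden_states else None
    for layer in layers:
        if output_hidden_states:
            all_hidden_states += (hidden,)   # state entering this layer
        hidden = layer(hidden)
    if output_hidden_states:
        all_hidden_states += (hidden,)       # output of the last layer
    return hidden, all_hidden_states

# Three toy "layers" that each add 1 to every element.
layers = [lambda h: [x + 1 for x in h]] * 3
last, states = run_encoder([0.0, 0.0], layers)
```

Here `states` holds four entries for three layers: the embedding output followed by the output of each layer.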

4.3.1 The BertLayer sub-layer
class BertLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Multi-head self-attention layer
        self.attention = BertAttention(config)
        self.is_decoder = config.is_decoder
        if self.is_decoder:
            # In a decoder, cross-attention reuses the same BertAttention class as self-attention
            self.crossattention = BertAttention(config)
        # Two feed-forward fully connected layers, then a residual connection and LayerNorm
        self.intermediate = BertIntermediate(config)
        self.output = BertOutput(config)

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        output_attentions=False,
    ):
        # Self-attention, with attention_mask and head_mask
        self_attention_outputs = self.attention(
            hidden_states, attention_mask, head_mask, output_attentions=output_attentions,
        )
        # Hidden-state output
        attention_output = self_attention_outputs[0]
        # Attention distribution
        outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights
        # A decoder runs cross-attention after self-attention
        if self.is_decoder and encoder_hidden_states is not None:
            cross_attention_outputs = self.crossattention(
                attention_output,
                attention_mask,
                head_mask,
                encoder_hidden_states,
                encoder_attention_mask,
                output_attentions,
            )
            attention_output = cross_attention_outputs[0]
            outputs = outputs + cross_attention_outputs[1:]  # add cross attentions if we output attention weights
        # Feed-forward and LayerNorm
        intermediate_output = self.intermediate(attention_output)
        layer_output = self.output(intermediate_output, attention_output)
        # Output the hidden state and the attention distributions
        outputs = (layer_output,) + outputs
        return outputs
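BertLayer implements the following:

Multi-head self-attention, with support for attention_mask and head_mask
When used as a decoder, one cross-attention layer after self-attention, which lets the encoder and decoder information interact
Feed-forward fully connected layers and layerNorm normalization.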
4.3.2 BertAttention: attention computation
class BertAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.self = BertSelfAttention(config)
        self.output = BertSelfOutput(config)
        self.pruned_heads = set()

    def prune_heads(self, heads):
        # Prune attention heads in this layer by pruning the weight matrices directly
        if len(heads) == 0:
            return
        heads, index = find_pruneable_heads_and_indices(
            heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads
        )

        # Prune the q, k, v projections and the output dense layer
        self.self.query = prune_linear_layer(self.self.query, index)
        self.self.key = prune_linear_layer(self.self.key, index)
        self.self.value = prune_linear_layer(self.self.value, index)
        self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)

        # Update the head counts and remember which heads were pruned
        self.self.num_attention_heads = self.self.num_attention_heads - len(heads)
        self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads
        self.pruned_heads = self.pruned_heads.union(heads)

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        output_attentions=False,
    ):
        # Self-attention computation
        self_outputs = self.self(
            hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions,
        )
        # Residual connection and normalization
        attention_output = self.output(self_outputs[0], hidden_states)
        # Output the normalized hidden state and the attention probabilities
        outputs = (attention_output,) + self_outputs[1:]
        return outputs
BertAttention consists mainly of the self-attention computation followed by a residual connection and normalization.
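The code of BertSelfAttention itself is not reproduced in these notes. As a rough sketch of what the self-attention step computes, the snippet below implements standard scaled dot-product attention with an additive mask. It is not the library's implementation: the function name, the toy tensor sizes, and the use of -10000.0 as the masking value are assumptions made for illustration, and the learned query/key/value projections are left out.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, additive_mask=None):
    """q, k, v: [batch, heads, seq, head_size]. Returns (context, attention probs)."""
    scores = q @ k.transpose(-1, -2) / math.sqrt(q.size(-1))
    if additive_mask is not None:
        scores = scores + additive_mask      # large negative values hide positions
    probs = torch.softmax(scores, dim=-1)
    return probs @ v, probs

torch.manual_seed(0)
q = k = v = torch.randn(1, 2, 4, 8)          # 1 sequence, 2 heads, 4 tokens, head_size 8
# Hide the last token (for example, padding) from every query.
mask = torch.tensor([0.0, 0.0, 0.0, -10000.0]).view(1, 1, 1, 4)
context, probs = scaled_dot_product_attention(q, k, v, mask)
```

The masked position receives effectively zero attention weight, while each row of probabilities still sums to 1.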

4.3.3 BertIntermediate: fully connected layer
class BertIntermediate(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
        if isinstance(config.hidden_act, str):
            self.intermediate_act_fn = ACT2FN[config.hidden_act]
        else:
            self.intermediate_act_fn = config.hidden_act

    def forward(self, hidden_states):
        # Fully connected layer, [hidden_size, intermediate_size]
        hidden_states = self.dense(hidden_states)
        # Non-linear activation, such as gelu or relu. BERT uses gelu by default
        hidden_states = self.intermediate_act_fn(hidden_states)
        return hidden_states
This module applies a fully connected layer followed by a non-linear activation.
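For reference, the exact GELU activation is $x \cdot \Phi(x)$, where $\Phi$ is the standard normal CDF. The scalar sketch below compares it with ReLU; it is written with the standard library only and is not how the library's ACT2FN table is implemented.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    """ReLU for comparison: negative inputs are clipped to zero."""
    return max(0.0, x)

samples = [(x, gelu(x), relu(x)) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
```

Unlike ReLU, GELU lets small negative values through (slightly attenuated) rather than clipping them to zero.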

4.3.4 BertOutput: output
class BertOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        # Fully connected layer, [intermediate_size, hidden_size]
        hidden_states = self.dense(hidden_states)
        # Dropout
        hidden_states = self.dropout(hidden_states)
        # Residual connection, then normalization
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states
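The output layer is also simple: a fully connected layer, dropout, then a residual connection and layerNorm normalization produce the output hidden state.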
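The "add, then normalize" step can be checked by hand on a single vector. The sketch below is a plain-Python illustration with invented helper names; it omits the learned scale and bias that the real LayerNorm carries, as well as the dense layer and dropout.

```python
import math

def layer_norm(x, eps=1e-12):
    """Normalize one vector to zero mean and unit variance (learned scale/bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def add_and_norm(sublayer_output, sublayer_input):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(sublayer_output, sublayer_input)])

# The sum is [1.5, 1.5, 4.0, 4.0]: mean 2.75, standard deviation 1.25.
y = add_and_norm([0.5, -0.5, 1.0, 0.0], [1.0, 2.0, 3.0, 4.0])
```

The result is approximately [-1, -1, 1, 1]: the residual sum is shifted to zero mean and scaled to unit variance.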

4.4 Pooler layer output
from torch import nn

class BertPooler(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # Output at the [CLS] position (first token of every sequence)
        first_token_tensor = hidden_states[:, 0]
        # Fully connected layer + tanh activation
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output

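The pooler layer applies a fully connected layer and a tanh activation to the vector at the [CLS] position to obtain the output vector. The [CLS] vector is commonly used to represent the whole sequence.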
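A quick shape check of the pooling step, using toy sizes and a randomly initialized linear layer (so the values themselves carry no meaning):

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden_size = 8
dense = nn.Linear(hidden_size, hidden_size)

sequence_output = torch.randn(2, 5, hidden_size)   # [batch, seq_length, hidden_size]
cls_vector = sequence_output[:, 0]                 # hidden state at position 0, the [CLS] token
pooled_output = torch.tanh(dense(cls_vector))      # [batch, hidden_size], bounded by tanh
```

The sequence dimension disappears: each input sequence is summarized by one hidden_size vector whose entries lie between -1 and 1.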

https://blog.youkuaiyun.com/weixin_43886056/article/details/107960402