接上一篇。
最主要的Transformer encoder结构代码如下:
def transformer_model(input_tensor,
attention_mask=None,
hidden_size=768,
num_hidden_layers=12,
num_attention_heads=12,
intermediate_size=3072,
intermediate_act_fn=gelu,
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
initializer_range=0.02,
do_return_all_layers=False):
"""Multi-headed, multi-layer Transformer from "Attention is All You Need".
This is almost an exact implementation of the original Transformer encoder.
See the original paper:
https://arxiv.org/abs/1706.03762
Also see:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py
Args:
input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
seq_length], with 1 for positions that can be attended to and 0 in
positions that should not be.
hidden_size: int. Hidden size of the Transformer.
num_hidden_layers: int. Number of layers (blocks) in the Transformer.
num_attention_heads: int. Number of attention heads in the Transformer.
intermediate_size: int. The size of the "intermediate" (a.k.a., feed
forward) layer.
intermediate_act_fn: function. The non-linear activation function to apply
to the output of the intermediate/feed-forward layer.
hidden_dropout_prob: float. Dropout probability for the hidden layers.
attention_probs_dropout_prob: float. Dropout probability of the attention
probabilities.
initializer_range: float. Range of the initializer (stddev of truncated
normal).
do_return_all_layers: Whether to also return all layers or just the final
layer.
Returns:
float Tensor of shape [batch_size, seq_length, hidden_size], the final
hidden layer of the Transformer.
Raises:
ValueError: A Tensor shape or parameter is invalid.
"""
if hidden_size % num_attention_heads != 0:
raise ValueError(
"The hidden size (%d) is not a multiple of the number of attention "
"heads (%d)" % (hidden_size, num_attention_heads))
#multi-head是将每个head attn 连接起来,输出的维度为hidden size,因此每个head维度计算如下
attention_head_size = int(hidden_size / num_attention_heads)
input_shape = get_shape_list(input_tensor, expected_rank=3)
batch_size = input_shape[0]
seq_length = input_shape[1]
input_width = input_shape[2]
# The Transformer performs sum residuals on all layers so the input needs
# to be the same as the hidden size.
#因为要加残差加到self-attn的输出,因此要求维度相同,也就是embedding维度与hidden_size相同
if input_width != hidden_size:
raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
(input_width, hidden_size))
# We keep the representation as a 2D tensor to avoid re-shaping it back and
# forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
# the GPU/CPU but may not be free on the TPU, so we want to minimize them to
# help the optimizer.为了避免在2D和3D之间来回reshape,我们统一把所有的3D Tensor用2D来表示。
#虽然reshape在GPU/CPU上很快,但是在TPU上却不是这样,这样做的目的是为了优化TPU
prev_output = reshape_to_matrix(input_tensor)
all_layer_outputs = []
for layer_idx in range(num_hidden_layers):
with tf.variable_scope("layer_%d" % layer_idx):
layer_input = prev_output
with tf.variable_scope("attention"):
attention_heads = []
with tf.variable_scope("self"):
attention_head = attention_layer(
from_tensor=layer_input,
to_tensor=layer_input,
attention_mask=attention_mask,
num_attention_heads=num_attention_heads,
size_per_head=attention_head_size,
attention_probs_dropout_prob=attention_probs_dropout_prob,
initializer_range=initializer_range,
do_return_2d_tensor=True,
batch_size=batch_size,
from_seq_length=seq_length,
to_seq_length=seq_length)
attention_heads.append(attention_head)
attention_output = None
if len(attention_heads) == 1:
attention_output = attention_heads[0]
else:
# In the case where we have other sequences, we just concatenate
# them to the self-attention head before the projection.
attention_output = tf.concat(attention_heads, axis=-1)
# Run a linear projection of `hidden_size` then add a residual
# with `layer_input`.
with tf.variable_scope("output"):
attention_output = tf.layers.dense(
attention_output,
hidden_size,
kernel_initializer=create_initializer(initializer_range))
attention_output = dropout(attention_output, hidden_dropout_prob)
attention_output = layer_norm(attention_output + layer_input)
# The activation is only applied to the "intermediate" hidden layer.
#这步对应Feed Forward,全连接层
with tf.variable_scope("intermediate"):
intermediate_output = tf.layers.dense(
attention_output,
intermediate_size,
activation=intermediate_act_fn,
kernel_initializer=create_initializer(initializer_range))
# Down-project back to `hidden_size` then add the residual.用dense将shape变为‘hidden_size’才能加残差
with tf.variable_scope("output"):
layer_output = tf.layers.dense(
intermediate_output,
hidden_size,
kernel_initializer=create_initializer(initializer_range))
layer_output = dropout(layer_output, hidden_dropout_prob)
layer_output = layer_norm(layer_output + attention_output)
prev_output = layer_output
all_layer_outputs.append(layer_output)
if do_return_all_layers:
final_outputs = []
for layer_output in all_layer_outputs:
final_output = reshape_from_matrix(layer_output, input_shape)
final_outputs.append(final_output)
return final_outputs
else:
final_output = reshape_from_matrix(prev_output, input_shape)
return final_output
对应一步一步对应下图:

下面是multi-head attention的具体实现代码了,先来张图来回忆一下具体做法:


这个函数实现论文"Attention is all you Need"里的multi-head attention。如果from_tensor和to_tensor是同一个tensor,那么就实现Self-Attention。from_tensor的每个时刻都会attends to to_tensor,也就是用from的Query去乘以所有to的Key,得到weight,然后把所有to的Value加权求和起来。
这个函数首先把from_tensor变换成一个"query" tensor,然后把to_tensor变成"key"和"value" tensors。总共有num_attention_heads组Query、Key和Value,每一个Query,Key和Value的shape都是[batch_size, seq_length, size_per_head(768/12=64)].
然后计算query和key的内积并且除以size_per_head的平方根(8)。然后softmax变成概率,最后用概率加权value得到输出。因为有多个Head,每个Head都输出[batch_size, seq_length, size_per_head],最后把12个Head的结果concat起来,就最终得到[batch_size, seq_length, size_per_head*12=768]
实际上我们是把这8个Head的Query,Key和Value都放在一个Tensor里面的,因此实际通过transpose和reshape就达到了上面的效果。
def attention_layer(from_tensor,
to_tensor,
attention_mask=None,
num_attention_heads=1,
size_per_head=512,
query_act=None,
key_act=None,
value_act=None,
attention_probs_dropout_prob=0.0,
initializer_range=0.02,
do_return_2d_tensor=False,
batch_size=None,
from_seq_length=None,
to_seq_length=None):
"""Performs multi-headed attention from `from_tensor` to `to_tensor`.
This is an implementation of multi-headed attention based on "Attention
is all you Need". If `from_tensor` and `to_tensor` are the same, then
this is self-attention. Each timestep in `from_tensor` attends to the
corresponding sequence in `to_tensor`, and returns a fixed-with vector.
This function first projects `from_tensor` into a "query" tensor and
`to_tensor` into "key" and "value" tensors. These are (effectively) a list
of tensors of length `num_attention_heads`, where each tensor is of shape
[batch_size, seq_length, size_per_head].
Then, the query and key tensors are dot-producted and scaled. These are
softmaxed to obtain attention probabilities. The value tensors are then
interpolated by these probabilities, then concatenated back to a single
tensor and returned.
In practice, the multi-headed attention are done with transposes and
reshapes rather than actual separate tensors.
Args:
from_tensor: float Tensor of shape [batch_size, from_seq_length,
from_width].
to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width].
attention_mask: (optional) int32 Tensor of shape [batch_size,
from_seq_length, to_seq_length]. The values should be 1 or 0. The
attention scores will effectively be set to -infinity for any positions in
the mask that are 0, and will be unchanged for positions that are 1.
num_attention_heads: int. Number of attention heads.
size_per_head: int. Size of each attention head.
query_act: (optional) Activation function for the query transform.
key_act: (optional) Activation function for the key transform.
value_act: (optional) Activation function for the value transform.
attention_probs_dropout_prob: (optional) float. Dropout probability of the
attention probabilities.
initializer_range: float. Range of the weight initializer.
do_return_2d_tensor: bool. If True, the output will be of shape [batch_size
* from_seq_length, num_attention_heads * size_per_head]. If False, the
output will be of shape [batch_size, from_seq_length, num_attention_heads
* size_per_head].
batch_size: (Optional) int. If the input is 2D, this might be the batch size
of the 3D version of the `from_tensor` and `to_tensor`.
from_seq_length: (Optional) If the input is 2D, this might be the seq length
of the 3D version of the `from_tensor`.
to_seq_length: (Optional) If the input is 2D, this might be the seq length
of the 3D version of the `to_tensor`.
Returns:
float Tensor of shape [batch_size, from_seq_length,
num_attention_heads * size_per_head]. (If `do_return_2d_tensor` is
true, this will be of shape [batch_size * from_seq_length,
num_attention_heads * size_per_head]).
Raises:
ValueError: Any of the arguments or tensor shapes are invalid.
"""
def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
seq_length, width):
output_tensor = tf.reshape(
input_tensor, [batch_size, seq_length, num_attention_heads, width])
output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
return output_tensor
from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])
if len(from_shape) != len(to_shape):
raise ValueError(
"The rank of `from_tensor` must match the rank of `to_tensor`.")
if len(from_shape) == 3:
batch_size = from_shape[0]
from_seq_length = from_shape[1]
to_seq_length = to_shape[1]
elif len(from_shape) == 2:
if (batch_size is None or from_seq_length is None or to_seq_length is None):
raise ValueError(
"When passing in rank 2 tensors to attention_layer, the values "
"for `batch_size`, `from_seq_length`, and `to_seq_length` "
"must all be specified.")
# Scalar dimensions referenced here:
# B = batch size (number of sequences)
# F = `from_tensor` sequence length
# T = `to_tensor` sequence length
# N = `num_attention_heads`
# H = `size_per_head`
#压缩为2d
from_tensor_2d = reshape_to_matrix(from_tensor)
to_tensor_2d = reshape_to_matrix(to_tensor)
# `query_layer` = [B*F, N*H] F表示From,from_tensor_2d维度是[batch_size*seq_length,hidden_size]
#得到[B*F, N*H]=[batch_size*seq_length,num_attention_heads * size_per_head]
query_layer = tf.layers.dense(
from_tensor_2d,
num_attention_heads * size_per_head,
activation=query_act,
name="query",
kernel_initializer=create_initializer(initializer_range))
# `key_layer` = [B*T, N*H] T表示To
key_layer = tf.layers.dense(
to_tensor_2d,
num_attention_heads * size_per_head,
activation=key_act,
name="key",
kernel_initializer=create_initializer(initializer_range))
# `value_layer` = [B*T, N*H]
value_layer = tf.layers.dense(
to_tensor_2d,
num_attention_heads * size_per_head,
activation=value_act,
name="value",
kernel_initializer=create_initializer(initializer_range))
# `query_layer` = [B, N, F, H]
query_layer = transpose_for_scores(query_layer, batch_size,
num_attention_heads, from_seq_length,
size_per_head)
# `key_layer` = [B, N, T, H]
key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
to_seq_length, size_per_head)
# Take the dot product between "query" and "key" to get the raw
# attention scores.
# `attention_scores` = [B, N, F, T]
attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)#将key转置以后相乘
attention_scores = tf.multiply(attention_scores,
1.0 / math.sqrt(float(size_per_head)))
if attention_mask is not None:
# `attention_mask` = [B, 1, F, T]
attention_mask = tf.expand_dims(attention_mask, axis=[1])
# Since attention_mask is 1.0 for positions we want to attend and 0.0 for
# masked positions, this operation will create a tensor which is 0.0 for
# positions we want to attend and -10000.0 for masked positions.
adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0
# Since we are adding it to the raw scores before the softmax, this is
# effectively the same as removing these entirely.
attention_scores += adder
# Normalize the attention scores to probabilities.
# `attention_probs` = [B, N, F, T]
attention_probs = tf.nn.softmax(attention_scores)
# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_probs = dropout(attention_probs, attention_probs_dropout_prob)
# `value_layer` = [B, T, N, H]
value_layer = tf.reshape(
value_layer,
[batch_size, to_seq_length, num_attention_heads, size_per_head])
# `value_layer` = [B, N, T, H]
value_layer = tf.transpose(value_layer, [0, 2, 1, 3])
# `context_layer` = [B, N, F, H]
context_layer = tf.matmul(attention_probs, value_layer)
# `context_layer` = [B, F, N, H]
context_layer = tf.transpose(context_layer, [0, 2, 1, 3])
if do_return_2d_tensor:
# `context_layer` = [B*F, N*H]
context_layer = tf.reshape(
context_layer,
[batch_size * from_seq_length, num_attention_heads * size_per_head])
else:
# `context_layer` = [B, F, N*H]
context_layer = tf.reshape(
context_layer,
[batch_size, from_seq_length, num_attention_heads * size_per_head])
return context_layer
attention_scores += adder是利用attention_mask让padding的部分变成无穷小,这样padding部分softmax就接近0了(这样看前面源码计算attention_mask不太对呢,keras实现的那个更准确)。
对应着看,也很简单啦。
到这里,模型主要部分算是结束了。后面就是pre-training和finetune了。
预训练
虽然goole提供了很多预训练的模型,但是我们也要学会自己通过Mask LM和Next Sentence来进行预训练,自己预训练的时候如果机器资源有限可以使用goole提供的checkpoint作为初始值来减少预训练的时间。
预训练数据格式:每一篇文档之间使用空行隔开,每一行为一句。
我们首先需要使用create_pretraining_data.py把文本文件变成TFRecord格式,便于后面的代码进行Pretraining。由于这个脚本会把整个文本文件加载到内存,因此这个文件不能太大。如果读者有很多文档要训练,比如1000万。那么我们可以把这1000万文档拆分成1万个文件,每个文件1000个文档,从而生成1000个TFRecord文件。
我们先看create_pretraining_data.py的用法:
python create_pretraining_data.py
–input_file=./sample_text.txt
–output_file=/tmp/tf_examples.tfrecord
–vocab_file=$BERT_BASE_DIR/vocab.txt
–do_lower_case=True
–max_seq_length=128
–max_predictions_per_seq=20
–masked_lm_prob=0.15
–random_seed=12345
–dupe_factor=5
- max_seq_length Token序列的最大长度
- max_predictions_per_seq 最多生成多少个MASK
- masked_lm_prob 多少比例的Token变成MASK
- dupe_factor 一个文档重复多少次
首先说一下参数dupe_factor,比如一个句子”it is a good day”,为了充分利用数据,我们可以多次随机的生成MASK,比如第一次可能生成”it is a [MASK] day”,第二次可能生成”it [MASK] a good day”。这个参数控制重复的次数。
masked_lm_prob就是论文里的参数15%。max_predictions_per_seq是一个序列最多MASK多少个Token,它通常等于max_seq_length * masked_lm_prob。这么看起来这个参数没有必要提供,但是后面的脚本也需要用到这个同样的值,而后面的脚本并没有这两个参数。
def main(_):
tf.logging.set_verbosity(tf.logging.INFO)
tokenizer = tokenization.FullTokenizer(
vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
input_files = []
for input_pattern in FLAGS.input_file.split(","):
input_files.extend(tf.gfile.Glob(input_pattern))
tf.logging.info("*** Reading from input files ***")
for input_file in input_files:
tf.logging.info(" %s", input_file)
rng = random.Random(FLAGS.random_seed)
instances = create_training_instances(
input_files, tokenizer, FLAGS.max_seq_length, FLAGS.dupe_factor,
FLAGS.short_seq_prob, FLAGS.masked_lm_prob, FLAGS.max_predictions_per_seq,
rng)
output_files = FLAGS.output_file.split(",")
tf.logging.info("*** Writing to output files ***")
for output_file in output_files:
tf.logging.info(" %s", output_file)
write_instance_to_example_files(instances, tokenizer, FLAGS.max_seq_length,
FLAGS.max_predictions_per_seq, output_files)
create_pretraining_data.py的main函数如上,输入文本文件列表是input_files,通过函数create_training_instances构建训练的instances,然后调用write_instance_to_example_files以TFRecord格式写到output_files。
直接来看create_instances_from_document:
def create_instances_from_document(
all_documents, document_index, max_seq_length, short_seq_prob,
masked_lm_prob, max_predictions_per_seq, vocab_words, rng):
"""Creates `TrainingInstance`s for a single document."""
document = all_documents[document_index]
# Account for [CLS], [SEP], [SEP]
max_num_tokens = max_seq_length - 3
# We *usually* want to fill up the entire sequence since we are padding
# to `max_seq_length` anyways, so short sequences are generally wasted
# computation. However, we *sometimes*
# (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
# sequences to minimize the mismatch between pre-training and fine-tuning.
# The `target_seq_length` is just a rough target however, whereas
# `max_seq_length` is a hard limit.
#我们通常希望Token序列长度为最大的max_seq_length,否则padding后的计算是无意义的,浪费计
# 算资源。但是有的时候我们有希望生成一些短的句子,因为在实际应用中会有短句,如果都是
# 长句子,会出现pre-training and fine-tuning之间的mismatch,所有我们以short_seq_prob == 0.1 的
#概率来生成短句子
target_seq_length = max_num_tokens
if rng.random() < short_seq_prob:
target_seq_length = rng.randint(2, max_num_tokens)
代码有点长,但是逻辑很简单,那么算法首先找到一个chunk,它会不断往chunk加入一个句子的所有Token,使得chunk里的token数量大于等于target_seq_length。通常我们期望target_seq_length为max_num_tokens(128-3),这样padding的尽量少,训练的效率高。但是有时候我们也需要生成一些短的序列,否则会出现训练与实际使用不匹配的问题。
找到一个chunk之后,比如这个chunk有5个句子,那么我们随机的选择一个切分点,比如3。把前3个句子当成句子A,后两个句子当成句子B。这是两个句子A和B有关系的样本(is_random_next=False)。为了生成无关系的样本,我们还以50%的概率把B用随机从其它文档抽取的句子替换掉,这样就得到无关系的样本(is_random_next=True)。如果是这种情况,后面两个句子需要放回去,以便在下一层循环中能够被再次利用。
有了句子A和B之后,我们就可以填充tokens和segment_ids,这里会加入特殊的[CLS]和[SEP]。接下来使用create_masked_lm_predictions来随机的选择某些Token,把它变成[MASK]。
create_masked_lm_predictions也很简单,就不详细说了。
下面说一下get_masked_lm_output:
未完待续。。。
参考:
https://jalammar.github.io/
https://codewithzichao.github.io/2020/07/04/NLP-BERT%E6%BA%90%E7%A0%81%E8%A7%A3%E8%AF%BB/
http://fancyerii.github.io/2019/03/09/bert-codes/
本文深入探讨Transformer编码器的实现细节,包括多头注意力机制的数学原理和代码实现。通过逐层解析,展示了如何从输入序列计算注意力权重并生成上下文向量。此外,介绍了预训练阶段的数据处理,包括创建训练实例、生成TFRecord文件以及预训练目标的设定。预训练旨在减小预训练与微调之间的差异,并为BERT模型的高效学习奠定基础。
2684

被折叠的 条评论
为什么被折叠?



