How the loss of BERT's MLM task works


License: CC 4.0 BY-SA

Tags: #bert #deep-learning #natural-language-processing

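Summary: BERT's pre-training tasks are MLM and NSP; this post focuses on MLM. In this task 15% of the tokens are masked, the sequence is encoded by the Transformer layers, and the tokens at the masked positions are predicted, which amounts to multi-class classification over the whole vocabulary. The cross-entropy between the logits and the one-hot labels is computed and then weight-averaged to give the loss.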


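BERT pre-training has two tasks, MLM and NSP. MLM works like a cloze test: 15% of the tokens in a sentence are masked, and the bidirectional Transformer layers (with feed-forward, residual add and layer norm) produce a contextual embedding for every token. The model then predicts the token at each masked position. This prediction is a multi-class classification problem in which the number of classes is the vocabulary size. The embedding at each masked position goes through a small transform and is projected to one logit per vocabulary entry, and a softmax turns those logits into probabilities. The label is the one-hot encoding of the true token at the masked position over the whole vocabulary. Taking the cross-entropy between the predicted distribution and the label, and then a weighted average over the masked positions (weight 1 for real predictions, 0 for padding), gives the MLM loss.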
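Written as a formula, the loss described above looks roughly as follows. The notation is my own and not from the source code: $h_i$ is the transformed hidden vector at masked position $i$, $E$ the embedding table, $b$ the output bias, $y_i$ the one-hot label, $t_i$ the true token id, $w_i$ the label weight, and $V$ the vocabulary size.

```latex
z_i = E\,h_i + b, \qquad
\ell_i = -\sum_{v=1}^{V} y_{i,v}\,\log \mathrm{softmax}(z_i)_v
       = -\log \mathrm{softmax}(z_i)_{t_i}, \qquad
\mathcal{L}_{\mathrm{MLM}} = \frac{\sum_i w_i\,\ell_i}{\sum_i w_i + 10^{-5}}
```

Because $y_i$ is one-hot, the sum over the vocabulary keeps only the log-probability of the true token. The small constant in the denominator guards against division by zero when every weight is 0.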

Walkthrough of the get_masked_lm_output() method in the source code:

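1. Input input_tensor: after gather_indexes() picks out the masked positions, the shape is [batch * maskednums, embed_size]

2. Dense transform + layer norm: [batch * maskednums, 768]

3. logits: the embedding table [30k, 768] is reused as the projection matrix, giving logits of shape [batch * maskednums, 30k]. After the softmax these are the probabilities of each masked token over the roughly 30,000 vocabulary entries, so this is a 30,000-way classification

4. labels: one-hot encoding, [batch * maskednums, 30k]

5. Cross-entropy: [batch * maskednums]

6. loss: a weighted average that reduces everything to a single real number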
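The steps above can be sketched in plain NumPy with toy sizes. This is only an illustration under stated assumptions, not the BERT implementation: the function name `masked_lm_loss` and all sizes are made up, and step 2 (dense transform + layer norm) is skipped, so `hidden` stands in for the already transformed vectors.

```python
import numpy as np

def masked_lm_loss(hidden, embedding_table, output_bias, label_ids, label_weights):
    """NumPy sketch of steps 3-6 (hypothetical helper, not from the BERT repo)."""
    # Step 3: reuse the embedding table as the projection matrix.
    # hidden: [n_predictions, hidden_size], embedding_table: [vocab_size, hidden_size]
    logits = hidden @ embedding_table.T + output_bias
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Step 4: one-hot labels, [n_predictions, vocab_size].
    one_hot = np.eye(embedding_table.shape[0])[label_ids]
    # Step 5: cross-entropy per predicted position, [n_predictions].
    per_example_loss = -(log_probs * one_hot).sum(axis=-1)
    # Step 6: weighted average; padding predictions carry weight 0.0.
    loss = (label_weights * per_example_loss).sum() / (label_weights.sum() + 1e-5)
    return loss, per_example_loss, log_probs

# Toy sizes (assumptions): 6 flattened prediction slots, hidden size 8, vocab 20.
rng = np.random.default_rng(0)
n_predictions, hidden_size, vocab_size = 6, 8, 20
hidden = rng.normal(size=(n_predictions, hidden_size))
embedding_table = rng.normal(size=(vocab_size, hidden_size))
output_bias = np.zeros(vocab_size)
label_ids = np.array([3, 7, 1, 0, 0, 0])                  # last three are padding
label_weights = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])

loss, per_example_loss, log_probs = masked_lm_loss(
    hidden, embedding_table, output_bias, label_ids, label_weights)
print(loss, per_example_loss.shape, log_probs.shape)
```

Multiplying by the one-hot matrix and summing is the same as reading `log_probs[i, label_ids[i]]` for each row, and the padded slots do not influence the loss because their weight is 0. The original TensorFlow implementation follows.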

def get_masked_lm_output(bert_config, input_tensor, output_weights, positions,
                         label_ids, label_weights):
  """Get loss and log probs for the masked LM."""
  input_tensor = gather_indexes(input_tensor, positions)

  with tf.variable_scope("cls/predictions"):
    # We apply one more non-linear transformation before the output layer.
    # This matrix is not used after pre-training.
    with tf.variable_scope("transform"):
      input_tensor = tf.layers.dense(
          input_tensor,
          units=bert_config.hidden_size,
          activation=modeling.get_activation(bert_config.hidden_act),
          kernel_initializer=modeling.create_initializer(
              bert_config.initializer_range))
      input_tensor = modeling.layer_norm(input_tensor)

    # The output weights are the same as the input embeddings, but there is
    # an output-only bias for each token.
    output_bias = tf.get_variable(
        "output_bias",
        shape=[bert_config.vocab_size],
        initializer=tf.zeros_initializer())
    logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    log_probs = tf.nn.log_softmax(logits, axis=-1)

    label_ids = tf.reshape(label_ids, [-1])
    label_weights = tf.reshape(label_weights, [-1])

    one_hot_labels = tf.one_hot(
        label_ids, depth=bert_config.vocab_size, dtype=tf.float32)

    # The `positions` tensor might be zero-padded (if the sequence is too
    # short to have the maximum number of predictions). The `label_weights`
    # tensor has a value of 1.0 for every real prediction and 0.0 for the
    # padding predictions.
    per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1])
    numerator = tf.reduce_sum(label_weights * per_example_loss)
    denominator = tf.reduce_sum(label_weights) + 1e-5
    loss = numerator / denominator

  return (loss, per_example_loss, log_probs)
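The function above starts by calling gather_indexes(), which this post does not show. A minimal NumPy sketch of what such a gather step might look like is given below. It is based on my understanding of the helper and should be treated as an assumption rather than the original code; the name `gather_indexes_np` and the toy tensor are hypothetical.

```python
import numpy as np

def gather_indexes_np(sequence_tensor, positions):
    """Pick the vectors at `positions` for each example and flatten the result.

    sequence_tensor: [batch, seq_length, width]
    positions:       [batch, n_predictions] integer token positions
    returns:         [batch * n_predictions, width]
    """
    batch_size, seq_length, width = sequence_tensor.shape
    # Offset of each example's first token in the flattened [batch * seq_length] view.
    flat_offsets = (np.arange(batch_size) * seq_length).reshape(-1, 1)
    flat_positions = (positions + flat_offsets).reshape(-1)
    flat_sequence = sequence_tensor.reshape(batch_size * seq_length, width)
    return flat_sequence[flat_positions]

# Toy input: 2 examples, 4 tokens each, width 3.
seq = np.arange(2 * 4 * 3).reshape(2, 4, 3)
positions = np.array([[1, 3], [0, 2]])   # masked positions per example
gathered = gather_indexes_np(seq, positions)
print(gathered.shape)  # (4, 3)
```

The flattened output is why label_ids and label_weights are reshaped to one dimension before the loss is computed: every row of the logits then lines up with exactly one label and one weight.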