BERT Code Walkthrough (Part 1)

c-minus

Category: NLP

Article link: https://blog.youkuaiyun.com/cpluss/article/details/88418176

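This walkthrough covers the PyTorch version of BERT. It mirrors the TensorFlow version but is simpler; once you understand this one, the TF code is readable too. Repository: https://github.com/huggingface/pytorch-pretrained-BERT. The main content is in the file pytorch_pretrained_bert/modeling.py.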

BertModel: The Flow in Detail

We start from the forward function of BertModel.

Step 1: Prepare the inputs

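# Reshape attention_mask from (batch_size, to_seq_length) to (batch_size, 1, 1, to_seq_length)
# so that it broadcasts over the head and from_seq_length dimensions of the attention scores.
extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility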

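# In the original mask, 1 marks real tokens and 0 marks padding. The line below flips this to
# 0 for real tokens and -10000 for padding. Why? The result is added to the raw attention scores
# right before the softmax, so padded positions end up with a weight of almost exactly zero.
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0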

# Feed input_ids and token_type_ids into the embeddings layer to build the input for the next layer
embedding_output = self.embeddings(input_ids, token_type_ids)
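
To see why the mask is rewritten as 0 / -10000, here is a minimal sketch in plain Python (lists stand in for tensors, so it runs without PyTorch). The scores, the mask values, and the helper names `softmax` and `additive_mask` are made up for illustration and are not part of the library.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_mask(attention_mask):
    """Map 1 (real token) to 0.0 and 0 (padding) to -10000.0."""
    return [(1.0 - m) * -10000.0 for m in attention_mask]

# Hypothetical raw attention scores of one query over four key positions.
scores = [2.0, 1.0, 0.5, 3.0]
# Assumed mask: the last two positions are padding.
attention_mask = [1, 1, 0, 0]

bias = additive_mask(attention_mask)
# The mask is added to the scores, not multiplied with them.
masked_scores = [s + b for s, b in zip(scores, bias)]
probs = softmax(masked_scores)
print(probs)  # the two padded positions get a weight of 0.0
```

Note that the padded position with the largest raw score (3.0) still ends up with zero weight: after adding -10000 its exponential underflows to 0, and the remaining probability mass is shared among the real tokens only.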

The BertEmbeddings Layer in Detail

# Inputs are input_ids and token_type_ids, both of shape (batch_size, seq_length)
#.....................................................................
# Build position_ids.
# For a sequence of length seq_length, position_ids is [0, 1, 2, ..., seq_length - 1].
# position_ids exists to build the Position Embeddings described in the paper.
seq_length = input_ids.size(1)
position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)
# Expand to the same shape as input_ids
position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
#.....................................................................


#....................................................................
# If token_type_ids is None, the whole input is treated as sentence A by default (see the paper walkthrough for what sentence A means).
if token_type_ids is None:
    token_type_ids = torch.zeros_like(input_ids)

#....................................................................


#....................................................................
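# These three lines look up the three kinds of embeddings from their inputs (the three embeddings in the paper).
# self.word_embeddings is an nn.Embedding module, i.e. a learned lookup table that maps each id to its vector; the other two work the same way.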
words_embeddings = self.word_embeddings(input_ids)
position_embeddings = self.position_embeddings(position_ids)
token_type_embeddings = self.token_type_embeddings(token_type_ids)
#....................................................................


#.....................................................................
# A key step in the paper: the three embeddings are summed element-wise to form the representation of each token.
embeddings = words_embeddings + position_embeddings + token_type_embeddings
#.....................................................................

#.........................................................................
# Pass the result through the LayerNorm and dropout layers to get the final output, then return it
embeddings = self.LayerNorm(embeddings)
embeddings = self.dropout(embeddings)
return embeddings
#.........................................................................
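
The lookup-and-sum logic above can be sketched in plain Python, with nested lists standing in for the nn.Embedding tables. The table contents, the sizes, and the helper `lookup` are invented for illustration; a real model uses learned weights, a hidden size of several hundred, and adds LayerNorm and dropout afterwards.

```python
def lookup(table, ids):
    """Return one row of `table` per id, like an embedding lookup."""
    return [table[i] for i in ids]

# Hypothetical tiny tables with hidden_size = 2.
word_table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]  # vocabulary of 4
position_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # up to 3 positions
token_type_table = [[10.0, 10.0], [20.0, 20.0]]                # sentence A / sentence B

input_ids = [2, 0, 3]        # one sequence of length 3
token_type_ids = [0, 0, 1]   # first two tokens in sentence A, last in sentence B
position_ids = list(range(len(input_ids)))  # [0, 1, ..., seq_length - 1]

words = lookup(word_table, input_ids)
positions = lookup(position_table, position_ids)
token_types = lookup(token_type_table, token_type_ids)

# Element-wise sum of the three embeddings for every token.
embeddings = [
    [w + p + t for w, p, t in zip(w_row, p_row, t_row)]
    for w_row, p_row, t_row in zip(words, positions, token_types)
]
print(embeddings)
```

Each token ends up with a single vector of length hidden_size, so the output keeps the shape (seq_length, hidden_size) per sequence no matter how many kinds of embedding are combined.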

The BertLayerNorm Layer in Detail

# LayerNorm normalizes over the last (hidden) dimension of each token; its differences from BatchNorm are well documented online.
u = x.mean(-1, keepdim=True)
s = (x - u).pow(2).mean(-1, keepdim=True)
x = (x - u) / torch.sqrt(s + self.variance_epsilon)
return self.weight * x + self.bias
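# The code above implements the following formula, where x is a vector, u is its mean (a scalar per vector), and the denominator is the standard deviation (see any probability text for the details):
#     y = weight * (x - u) / sqrt(s + variance_epsilon) + bias
# variance_epsilon is a very small number added so that the denominator is never 0 when the variance s is 0.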
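
As a quick check of the formula, here is a minimal plain-Python sketch of the same computation for a single vector. The function name `layer_norm`, the sample inputs, and the default `eps=1e-12` are assumptions made for this illustration.

```python
import math

def layer_norm(x, weight, bias, eps=1e-12):
    """Normalize one vector to zero mean and unit variance, then scale and shift."""
    u = sum(x) / len(x)                         # mean
    s = sum((v - u) ** 2 for v in x) / len(x)   # (biased) variance
    normed = [(v - u) / math.sqrt(s + eps) for v in x]
    return [w * n + b for w, n, b in zip(weight, normed, bias)]

ones = [1.0] * 4
zeros = [0.0] * 4

# A hypothetical hidden vector for one token.
out = layer_norm([1.0, 2.0, 3.0, 4.0], ones, zeros)
print(out)

# A constant vector has variance 0; eps keeps the division well defined.
flat = layer_norm([5.0, 5.0, 5.0, 5.0], ones, zeros)
print(flat)
```

With weight set to all ones and bias to all zeros, the output has mean 0 and variance close to 1. The constant-vector case shows why eps is needed: without it the denominator would be exactly 0.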
