BERT Code Implementation

Table of Contents

1. BERT theory

2. Code implementation

   2.1 Building the input data format

   2.2 Defining the BERT encoder class

   2.3 BERT's two tasks

      2.3.1 Task 1: Masked Language Modeling (MLM)

      2.3.2 Task 2: Next Sentence Prediction (NSP)

3. Putting the code together

4. Personal notes on the key concepts


 

1. BERT theory

BERT stands for Bidirectional Encoder Representations from Transformers. Paper: [1810.04805] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (arxiv.org)

BERT is a pre-training model proposed by Google AI in October 2018. Essentially, BERT is the encoder part of the Transformer model, with some improvements made to the encoder.

The encoder part in the figure below is the basic structure of BERT.

[Figure: the Transformer architecture]

  

2. Code implementation

import torch
from torch import nn
import dltools  # the tutorial's helper library (d2l-style building blocks)

2.1 Building the input data format

def get_tokens_and_segments(tokens_a, tokens_b=None):
    # '<cls>' is the classification token
    # BERT takes a pair of sentences as one input, but a single sentence can
    # also be passed on its own; inputs shorter than the fixed length can be
    # padded (a sketch of the padding step follows the test below)
    # Start with the first sentence, tokens_a
    tokens = ['<cls>'] + tokens_a + ['<sep>']  # consumed by the token embedding layer
    # segments marks which sentence each token belongs to; 0 = first sentence
    segments = [0] * (len(tokens_a) + 2)
    if tokens_b is not None:
        tokens += tokens_b + ['<sep>']
        segments += [1] * (len(tokens_b) + 1)
    return tokens, segments


# Test the function above
get_tokens_and_segments([1, 2, 3], [4, 5, 6])

(['<cls>', 1, 2, 3, '<sep>', 4, 5, 6, '<sep>'], [0, 0, 0, 0, 0, 1, 1, 1, 1])
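
The comments in get_tokens_and_segments mention padding shorter inputs up to a fixed length. Here is a minimal sketch of that step; the '<pad>' token, the helper name pad_tokens_and_segments, and max_len are illustrative assumptions, not part of the original code:

def pad_tokens_and_segments(tokens, segments, max_len):
    # append '<pad>' up to max_len; padded positions get segment id 0
    num_pad = max_len - len(tokens)
    return tokens + ['<pad>'] * num_pad, segments + [0] * num_pad

pad_tokens_and_segments(*get_tokens_and_segments(['a', 'b'], ['c']), max_len=8)

(['<cls>', 'a', 'b', '<sep>', 'c', '<sep>', '<pad>', '<pad>'], [0, 0, 0, 0, 1, 1, 0, 0])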

2.2 Defining the BERT encoder class

class BERTEncoder(nn.Module):
    # since the feed-forward network's ffn_num_outputs equals num_hiddens,
    # it does not need to appear as a separate constructor argument
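    # --- the rest of this class below is a sketch in the d2l style this
    # tutorial follows; the exact constructor signature and
    # dltools.EncoderBlock (a standard Transformer encoder block) are
    # assumptions, not the original author's code ---
    def __init__(self, vocab_size, num_hiddens, norm_shape, ffn_num_input,
                 ffn_num_hiddens, num_heads, num_layers, dropout,
                 max_len=1000, key_size=768, query_size=768, value_size=768):
        super().__init__()
        # token embedding: vocabulary id -> hidden vector
        self.token_embedding = nn.Embedding(vocab_size, num_hiddens)
        # segment embedding: distinguishes sentence 0 from sentence 1
        self.segment_embedding = nn.Embedding(2, num_hiddens)
        # a stack of Transformer encoder blocks
        self.blks = nn.Sequential()
        for i in range(num_layers):
            self.blks.add_module(f"{i}", dltools.EncoderBlock(
                key_size, query_size, value_size, num_hiddens, norm_shape,
                ffn_num_input, ffn_num_hiddens, num_heads, dropout, True))
        # unlike the vanilla Transformer, BERT learns its positional embeddings
        self.pos_embedding = nn.Parameter(torch.randn(1, max_len, num_hiddens))

    def forward(self, tokens, segments, valid_lens):
        # X shape: (batch_size, seq_len, num_hiddens)
        X = self.token_embedding(tokens) + self.segment_embedding(segments)
        X = X + self.pos_embedding.data[:, :X.shape[1], :]
        for blk in self.blks:
            X = blk(X, valid_lens)
        return X


A quick shape check on the sketch above, with hyperparameters chosen only for illustration:

vocab_size, num_hiddens, ffn_num_hiddens, num_heads = 10000, 768, 1024, 4
norm_shape, ffn_num_input, num_layers, dropout = [768], 768, 2, 0.2
encoder = BERTEncoder(vocab_size, num_hiddens, norm_shape, ffn_num_input,
                      ffn_num_hiddens, num_heads, num_layers, dropout)
tokens = torch.randint(0, vocab_size, (2, 8))
segments = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1, 1, 1]])
encoded_X = encoder(tokens, segments, None)
encoded_X.shape

torch.Size([2, 8, 768])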