Abstract: The previous post covered the data annotation process; the next step is model training, and this post trains a BERT+CRF model.
We use the BERT module of kashgari (a library that wraps BERT so models can be assembled quickly) to build our own model.
Starting from the pre-trained BERT weights, we load our own annotated data and fine-tune further.
You will need to download the Chinese BERT pre-trained model (Google's chinese_L-12_H-768_A-12 release) yourself~
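If it helps, here is a one-off download sketch. The URL is Google's official BERT release; the extraction directory simply mirrors the path passed to BERTEmbedding later in this post, so adjust it to your own setup.
# One-off download of Google's Chinese BERT checkpoint
# (official google-research/bert release URL).
import urllib.request
import zipfile

url = 'https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip'
urllib.request.urlretrieve(url, 'chinese_L-12_H-768_A-12.zip')
with zipfile.ZipFile('chinese_L-12_H-768_A-12.zip') as zf:
    zf.extractall('/pvc/train')  # matches the path used in BERTEmbedding below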
The code is fairly simple: load the data, then continue training on top of the pre-trained model.
from kashgari.tasks.seq_labeling import BLSTMCRFModel
from kashgari.embeddings import BERTEmbedding
import kashgari
from kashgari import utils
import os
# Uncomment to force CPU training (hides all GPUs from TensorFlow):
#os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
#os.environ["CUDA_VISIBLE_DEVICES"] = ""
def get_sequence_tagging_data(file_path):
    """Read 'char tag' lines; any four-field line marks a sentence boundary."""
    data_x, data_y = [], []
    with open(file_path, 'r', encoding='utf-8') as f:
        lines = f.read().splitlines()
    x, y = [], []
    for line in lines:
        rows = line.split(' ')
        if len(rows) == 4:
            # separator line: flush the sentence accumulated so far
            data_x.append(x)
            data_y.append(y)
            x, y = [], []
        else:
            x.append(rows[0])
            y.append(rows[1])
    if x:
        # keep the final sentence when the file lacks a trailing separator
        data_x.append(x)
        data_y.append(y)
    return data_x, data_y
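As a quick sanity check, here is a minimal sketch of what the loader expects. The sample content is hypothetical (the actual format comes from the annotation step in the previous post): each line is a character and its tag separated by a space, and any four-field line, such as a CoNLL-style -DOCSTART- row, acts as a sentence boundary.
# Hypothetical mini-file in the format the loader expects.
sample = (
    "刘 B-PER\n"
    "若 I-PER\n"
    "英 I-PER\n"
    "-DOCSTART- -X- -X- O\n"  # four space-separated fields -> sentence boundary
)
with open('sample_tagging.txt', 'w', encoding='utf-8') as f:
    f.write(sample)

xs, ys = get_sequence_tagging_data('sample_tagging.txt')
print(xs)  # [['刘', '若', '英']]
print(ys)  # [['B-PER', 'I-PER', 'I-PER']]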
train_x, train_y = get_sequence_tagging_data('training_data_bert_train.txt')
# train_x, train_y = get_sequence_tagging_data('../new_note.txt')
print(f"train data count: {len(train_x)}")

# model training
embedding = BERTEmbedding('/pvc/train/chinese_L-12_H-768_A-12', 40)  # sequence length 40
model = BLSTMCRFModel(embedding)
model.fit(train_x,
          train_y,
          validation_split=0.4,
          epochs=10,
          batch_size=32)
print('model_save')
model.save('../model_save/ner_model')
Finally, here is a short snippet for testing:
load_model = BLSTMCRFModel.load_model('../model_save/ner_model')
print(load_model.predict("刘若英语怎么样"))
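The predict call returns one BIO tag per input character for this single-sentence call. A small helper can then collect the tags back into entity spans; this is a minimal sketch, not part of kashgari, and the PER label in the expected output assumes the model actually tags 刘若英 as a person.
def extract_entities(chars, tags):
    # Collect BIO-tagged characters into (entity_text, label) pairs.
    entities, buf, label = [], [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith('B-'):
            if buf:
                entities.append((''.join(buf), label))
            buf, label = [ch], tag[2:]
        elif tag.startswith('I-') and label == tag[2:]:
            buf.append(ch)
        else:
            if buf:
                entities.append((''.join(buf), label))
            buf, label = [], None
    if buf:
        entities.append((''.join(buf), label))
    return entities

sentence = "刘若英语怎么样"
tags = load_model.predict(sentence)
print(extract_entities(list(sentence), tags))  # e.g. [('刘若英', 'PER')]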
That is a quick pass through the whole pipeline, from annotating the training data to training the NER model.
And with that, this NER installment is complete~ *runs away*