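This is the second article in the BERT hands-on series. It uses BERT for named entity recognition, a sequence labeling task.
1. Preparation
1.1 Environment
Python 3.7; PyTorch 1.3; transformers 2.3 (installation guide).
1.2 Data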
- Data link: https://pan.baidu.com/s/1spwmV3_07U0HA9mlde2wMg (extraction code: reic)
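The training code in section 2.1 reads a `text` field and a `labels` field from every record and uses one label per character. A minimal sketch of that record shape with a length sanity check is given here. The sample sentence and the `B-NAME` / `I-NAME` tag names are hypothetical, and `check_records` is an illustrative helper, not part of the original code.

```python
import json

# Hypothetical sample in the shape the training code appears to expect:
# a JSON list of records, each with a "text" string and a per-character "labels" list.
sample_json = '''
[
  {"text": "张三在北京", "labels": ["B-NAME", "I-NAME", "O", "O", "O"]}
]
'''


def check_records(records):
    """Return the indices of records whose label count differs from the text length."""
    bad = []
    for i, rec in enumerate(records):
        if len(rec["text"]) != len(rec["labels"]):
            bad.append(i)
    return bad


records = json.loads(sample_json)
bad_indices = check_records(records)
print(bad_indices)
```

Running a check like this before training catches misaligned records early, since the dataset class pairs characters and labels purely by position.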
2. Hands-on
2.1 Training code
import json

import numpy as np
import torch
from torch.utils.data import Dataset
from transformers import BertTokenizer

# Hyperparameters
lr = 5e-5
max_length = 256
batch_size = 8
epoches = 20
cuda = True
# cuda = False
max_grad_norm = 1
warmup_steps = 3000
train_steps = 60000

train_dataset_file_path = './data/names/train.json'
eval_dataset_file_path = './data/names/text.json'

tokenizer = BertTokenizer('./bert_model/vocab.txt')

# label.json holds a two-element list: [id2label, label2id]
with open('./data/names/label.json', mode='r', encoding='utf8') as f:
    id2label, label2id = json.load(f)


# Build the attention mask: 1 for real tokens, 0 for padding
def get_atten_mask(tokens_ids, pad_index=0):
    return list(map(lambda x: 1 if x != pad_index else 0, tokens_ids))


class NerDataSet(Dataset):
    def __init__(self, file_path):
        token_ids = []
        token_attn_mask = []
        token_seg_type = []
        labels = []
        with open(file_path, mode='r', encoding='utf8') as f:
            data_set = json.load(f)
            data_set = data_set[:5]  # only the first 5 records are used
            for data in data_set:
                text = data['text']
                # Adds [CLS]/[SEP], truncates, and pads to max_length
                tmp_token_ids = tokenizer.encode(text, max_length=max_length, pad_to_max_length=True)
                if len(text) < max_length - 2:
                    # 'O' for [CLS], the real labels, then 'O' for [SEP] and padding
                    tmp_labels = [label2id['O']] + [label2id[item] for item in data['labels']] + [label2id['O']] * (
                        max_length - len(data['labels']) - 1)
                else:
                    # Truncate the labels so [CLS] and [SEP] still fit
                    tmp_labels = [label2id['O']] + [label2id[item] for item in data['labels']][:max_length - 2] + [
                        label2id['O']]
                tmp_token_attn_mask = get_atten_mask(tmp_token_ids)
                # Single-sentence input, so every segment id is 0
                tmp_seg_type = tokenizer.create_token_type_ids_from_sequences(tmp_token_ids[1:-1])
                token_ids.append(tmp_token_ids)
                token_attn_mask.append(tmp_token_attn_mask)
                token_seg_type.append(tmp_seg_type)
                labels.append(tmp_labels)
        self.TOKEN_IDS = torch.from_numpy(np.array(token_ids)).long()
        self.TOKEN_ATTN_MASK = torch.from_numpy
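The label handling in `NerDataSet.__init__` comes down to one rule: reserve position 0 for `[CLS]`, keep at most `max_length - 2` real labels, then fill everything else (the `[SEP]` slot and padding) with the `O` label. A minimal standalone sketch of that rule, together with the attention mask, needs no tokenizer at all. The function name `align_labels` and the toy ids below (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are illustrative assumptions.

```python
def get_atten_mask(token_ids, pad_index=0):
    """1 for real tokens, 0 for padding."""
    return [1 if t != pad_index else 0 for t in token_ids]


def align_labels(label_ids, max_length, outside_id):
    """Pad or truncate per-character label ids to exactly max_length.

    Position 0 stands for [CLS]; the position after the real labels
    stands for [SEP]; all remaining positions are padding.
    """
    body = label_ids[:max_length - 2]            # leave room for [CLS] and [SEP]
    padded = [outside_id] + body + [outside_id]  # [CLS] + labels + [SEP]
    padded += [outside_id] * (max_length - len(padded))
    return padded


# Toy ids, assumed for illustration only
token_ids = [101, 11, 12, 13, 102, 0, 0, 0]
label_ids = [1, 2, 0]

mask = get_atten_mask(token_ids)
labels = align_labels(label_ids, max_length=8, outside_id=0)
print(mask)
print(labels)
```

Padded positions receive the `O` label here, as in the original code. Whether they contribute to the loss depends on how the model applies the attention mask.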

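The hyperparameters `lr`, `warmup_steps`, and `train_steps` suggest a learning-rate schedule with warmup, but the code shown does not include the optimizer or scheduler. The sketch below therefore only assumes a linear warmup followed by linear decay, a common pairing for BERT fine-tuning. It is written in plain Python to show the arithmetic, and `linear_warmup_decay` is a hypothetical name.

```python
def linear_warmup_decay(step, base_lr=5e-5, warmup_steps=3000, train_steps=60000):
    """Learning rate at a given step under an assumed linear warmup + linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup phase
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 at train_steps
    remaining = max(0, train_steps - step)
    return base_lr * remaining / max(1, train_steps - warmup_steps)


for step in (0, 1500, 3000, 31500, 60000):
    print(step, linear_warmup_decay(step))
```

With the values from the listing, the rate peaks at `5e-5` at step 3000 and reaches zero at step 60000.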
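This article walks through named entity recognition with a BERT model, covering environment setup, data preparation, model training, and evaluation, and reports accurate entity recognition on the validation set.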