[BERT Series] Named Entity Recognition

This article walks through named entity recognition with a BERT model, covering the key steps of environment setup, data preparation, model training, and evaluation, and ends with high-accuracy entity recognition results on the validation set.


This is the second article in the BERT-in-practice series; it uses BERT for named entity recognition (a sequence labeling task).

1. Preparation

1.1 Environment

  • python 3.7
  • pytorch 1.3
  • transformers 2.3 (installation tutorial); a quick version check is sketched after this list.
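
A minimal sketch for confirming the environment roughly matches; the version numbers are simply what this article assumes:

```python
# Sanity-check the library versions this article assumes.
import torch
import transformers

print('torch        :', torch.__version__)          # expected around 1.3
print('transformers :', transformers.__version__)   # expected around 2.3
print('CUDA available:', torch.cuda.is_available())
```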

1.2 Data

  • Data download: https://pan.baidu.com/s/1spwmV3_07U0HA9mlde2wMg (extraction code: reic); the record layout the code expects is sketched below.
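
The training script below reads `train.json`, `text.json`, and `label.json` from `./data/names/`. A minimal sketch of the record layout it assumes (field names come from the code; the concrete tag set and entities are only illustrative):

```python
# label.json: a two-element list [id2label, label2id]; id2label keys are string ids.
label_json = [
    {"0": "O", "1": "B-PER", "2": "I-PER"},   # id2label (hypothetical tag set)
    {"O": 0, "B-PER": 1, "I-PER": 2},         # label2id
]

# train.json: a list of samples with one BIO label per character of `text`.
train_sample = {
    "text": "张三去了北京",
    "labels": ["B-PER", "I-PER", "O", "O", "O", "O"],
}

# text.json (evaluation set): each sample also carries gold entities as (text, type) pairs.
eval_sample = {
    "text": "张三去了北京",
    "entities": [["张三", "PER"]],
}
```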

2. Implementation

2.1 Training code


import json
import logging
import time

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset, RandomSampler
from tqdm import tqdm
from transformers import (AdamW, BertConfig, BertForTokenClassification,
                          BertTokenizer, get_linear_schedule_with_warmup)

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

lr = 5e-5
max_length = 256
batch_size = 8
epoches = 20
cuda = True
# cuda = False
max_grad_norm = 1
warmup_steps = 3000
train_steps = 60000
train_dataset_file_path = './data/names/train.json'
eval_dataset_file_path = './data/names/text.json'

tokenizer = BertTokenizer('./bert_model/vocab.txt')

with open('./data/names/label.json', mode='r', encoding='utf8') as f:
    id2label, label2id = json.load(f)


# Build the attention mask: 1 for real tokens, 0 for padding
def get_atten_mask(tokens_ids, pad_index=0):
    return list(map(lambda x: 1 if x != pad_index else 0, tokens_ids))


class NerDataSet(Dataset):

    def __init__(self, file_path):
        token_ids = []
        token_attn_mask = []
        token_seg_type = []
        labels = []

        with open(file_path, mode='r', encoding='utf8') as f:
            data_set = json.load(f)
            # data_set = data_set[:5]  # uncomment to smoke-test on a few samples

        for data in data_set:
            text = data['text']
            tmp_token_ids = tokenizer.encode(text, max_length=max_length, pad_to_max_length=True)
            if len(text) < max_length - 2:
                tmp_labels = [label2id['O']] + [label2id[item] for item in data['labels']] + [label2id['O']] * (
                        max_length - len(data['labels']) - 1)
            else:
                tmp_labels = [label2id['O']] + [label2id[item] for item in data['labels']][:max_length - 2] + [
                    label2id['O']]
            tmp_token_attn_mask = get_atten_mask(tmp_token_ids)
            tmp_seg_type = tokenizer.create_token_type_ids_from_sequences(tmp_token_ids[1:-1])
            token_ids.append(tmp_token_ids)
            token_attn_mask.append(tmp_token_attn_mask)
            token_seg_type.append(tmp_seg_type)
            labels.append(tmp_labels)

        self.TOKEN_IDS = torch.from_numpy(np.array(token_ids)).long()
        self.TOKEN_ATTN_MASK = torch.from_numpy(np.array(token_attn_mask)).long()
        self.TOKEN_SEG_TYPE = torch.from_numpy(np.array(token_seg_type)).long()
        self.LABELS = torch.from_numpy(np.array(labels)).long()

    def __len__(self):
        return self.LABELS.shape[0]

    def __getitem__(self, item):
        return self.TOKEN_IDS[item], self.TOKEN_SEG_TYPE[item], \
               self.TOKEN_ATTN_MASK[item], self.LABELS[item]


def train(train_dataset, model: BertForTokenClassification, scheduler, optimizer: AdamW, batch_size=batch_size,
          device=None):
    train_sampler = RandomSampler(train_dataset)
    train_loader = DataLoader(train_dataset, sampler=train_sampler, batch_size=batch_size)
    model.train()
    tr_loss = 0.0
    tr_acc = 0
    global_step = 0
    if cuda:
        torch.cuda.empty_cache()
    for step, batch in tqdm(enumerate(train_loader)):
        # print(step)
        inputs = {
            'input_ids': batch[0].to(device),
            'token_type_ids': batch[1].to(device),
            'attention_mask': batch[2].to(device),
            'labels': batch[3].to(device)
        }
        outputs = model(**inputs)
        loss = outputs[0]
        # print(loss)
        logits = outputs[1].view(-1, len(label2id))

        tr_loss += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        scheduler.step()  # advance the LR schedule after the optimizer update
        model.zero_grad()
        # compute token-level accuracy (padding positions included)
        _, pred = logits.max(1)
        number_corr = (pred == batch[-1].to(device).view(-1)).long().sum().item()
        tr_acc += number_corr
        global_step += 1

    return tr_loss / global_step, tr_acc / (len(train_dataset) * max_length)


class NER(tuple):
    """Hashable (entity_text, entity_type) pair, used for set comparison during evaluation."""

    def __init__(self, ner):
        self.ner = ner

    def __hash__(self):
        return self.ner.__hash__()

    def __eq__(self, other):
        return self.ner == other


def get_entities(text_list, label_list):
    """Decode a character list and its BIO label list into (entity_text, entity_type) pairs."""
    result_ent = []
    buf_ent = []
    ner_clas = ''
    for i, item in enumerate(label_list):
        item = str(item)
        item = item.strip()
        if item == 'O':
            if len(buf_ent) > 0:
                result_ent.append((''.join(buf_ent), ner_clas))
                buf_ent = []
            continue

        pre_item, ner_item = item.split('-')

        if pre_item == 'B':
            if len(buf_ent) > 0:
                result_ent.append((''.join(buf_ent), ner_clas))
                buf_ent = []
            buf_ent.append(text_list[i])
            ner_clas = ner_item
        else:
            if ner_item == ner_clas:
                buf_ent.append(text_list[i])
            else:
                logger.warning('ner error: I- tag does not match the current entity type')
    return result_ent


def predict_func(text, model, device=None):
    text = text.strip()
    token_ids = tokenizer.encode(text, max_length=max_length, pad_to_max_length=True)
    token_attn_mask = get_atten_mask(token_ids)
    seq_type_ids = tokenizer.create_token_type_ids_from_sequences(token_ids[1:-1])

    token_ids = torch.from_numpy(np.array(token_ids)).unsqueeze(0).long()
    token_attn_mask = torch.from_numpy(np.array(token_attn_mask)).unsqueeze(0).long()
    seq_type_ids = torch.from_numpy(np.array(seq_type_ids)).unsqueeze(0).long()

    inputs = {
        'input_ids': token_ids.to(device),
        'token_type_ids': seq_type_ids.to(device),
        'attention_mask': token_attn_mask.to(device),
    }
    output = model(**inputs)[0]
    output = output.squeeze()
    output = output[1:len(text) + 1, :]
    _, output = output.max(1)
    label_list = list(output.cpu().numpy())
    return get_entities(list(text), [id2label[str(item)] for item in label_list])


def evalate(model: BertForTokenClassification, device=None):
    with open('./data/names/text.json', mode='r', encoding='utf8') as f:
        test_data = json.load(f)
    X, Y, Z = 1e-10, 1e-10, 1e-10
    f1, precision, recall = 0.0, 0.0, 0.0
    result_list = []
    pbar = tqdm(total=len(test_data))
    for data in test_data:
        predict_entities = predict_func(data['text'], model, device)
        predict_entities = [NER((item[0], item[1])) for item in predict_entities]

        entities = [NER((item[0], item[1])) for item in data['entities']]

        R = set(predict_entities)
        T = set(entities)

        X += len(R & T)
        Y += len(R)
        Z += len(T)
        f1, precision, recall = 2 * X / (Y + Z), X / Y, X / Z
        pbar.update()
        pbar.set_description('f1: %.5f, precision: %.5f, recall: %.5f' %
                             (f1, precision, recall))

        s = {
            'text': data['text'],
            'ent_list': list(T),
            'ent_list_pred': list(R),
            'new': list(R - T),
            'lack': list(T - R),
        }
        result_list.append(s)
        with open('./predict.json', mode='w', encoding='utf8') as f:
            json.dump(result_list, f, indent=4, ensure_ascii=False)
    pbar.close()
    with open('./predict.json', mode='w', encoding='utf8') as f:
        json.dump(result_list, f, indent=4, ensure_ascii=False)
    return f1, precision, recall

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs


if __name__ == '__main__':
    # make the classification head match the size of the label set
    config = BertConfig.from_pretrained('./bert_model/bert_config.json', num_labels=len(label2id))
    device = torch.device('cuda' if cuda else 'cpu')
    model = BertForTokenClassification.from_pretrained('./bert_model/pytorch_model.bin', config=config).to(device)

    no_decay = ['bias', 'LayerNorm.weight']
    # note: both groups below use weight_decay=0.0, so the grouping is currently a no-op;
    # raise the first group's value (e.g. 0.01) to actually apply weight decay
    optimizer_grouped_parameters = [
        {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
         'weight_decay': 0.0},
        {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
         'weight_decay': 0.0}
    ]

    optimizer = AdamW(optimizer_grouped_parameters, lr=lr, eps=1e-8)

    scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, train_steps)

    logger.info('create train dataset')
    train_dataset = NerDataSet(train_dataset_file_path)

    # logger.info('create eval dataset')
    # eval_dataset = NerDataSet(eval_dataset_file_path)

    eval_best_f1 = 0.0
    for e in range(1, epoches + 1):  # run the full number of epochs
        start_time = time.time()
        train_loss, train_acc = train(train_dataset, model, scheduler, optimizer, batch_size, device)
        # eval_acc = evalate(eval_dataset, model, batch_size, device)
        eval_result = evalate(model, device)
        end_time = time.time()
        epoch_mins, epoch_secs = epoch_time(start_time, end_time)
        logger.info('Epoch: {:02} | Time: {}m {}s'.format(e, epoch_mins, epoch_secs))
        logger.info(
            'Train Loss: {:.6f} | Eval f1: {:.6f} | Eval Pre: {:.6f} | Eval Rec: {:.6f}'.format(train_loss,
                                                                                                eval_result[0],
                                                                                                eval_result[1],
                                                                                                eval_result[2]))
        if eval_result[0] > eval_best_f1:
            eval_best_f1 = eval_result[0]
            torch.save(model.state_dict(), './models/model_{}'.format(e))
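
After training, a saved checkpoint can be loaded back for standalone inference with `predict_func`. A minimal usage sketch (the checkpoint name `./models/model_3` and the input sentence are placeholders, not values from the article):

```python
# Load a saved checkpoint and extract entities from a new sentence.
config = BertConfig.from_pretrained('./bert_model/bert_config.json', num_labels=len(label2id))
model = BertForTokenClassification(config)
model.load_state_dict(torch.load('./models/model_3', map_location='cpu'))  # hypothetical checkpoint
model.eval()

device = torch.device('cuda' if cuda else 'cpu')
model.to(device)

with torch.no_grad():
    entities = predict_func('张三去了北京', model, device)  # placeholder sentence
print(entities)  # e.g. [('张三', 'PER'), ('北京', 'LOC')], depending on the label set
```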

3. Results

  • Final results on the validation set: F1 0.9247, Precision 0.925, Recall 0.924 (how these entity-level scores are computed is sketched below).
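
These numbers are entity-level micro-averaged metrics, accumulated over the whole evaluation set in `evalate`: predictions and gold annotations are compared as sets of (entity_text, entity_type) pairs. A small self-contained illustration of the arithmetic (the entities and types are made up):

```python
# Entity-level precision / recall / F1, mirroring the arithmetic in evalate().
R = {('张三', 'PER'), ('北京', 'LOC')}       # predicted entities (hypothetical)
T = {('张三', 'PER'), ('北京大学', 'ORG')}    # gold entities (hypothetical)

X = len(R & T)                    # correctly predicted entities -> 1
precision = X / len(R)            # 0.5
recall = X / len(T)               # 0.5
f1 = 2 * X / (len(R) + len(T))    # 0.5
```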