
Goal
This article uses the pretrained BERT tokenizer BertTokenizer to encode input text as sentence pairs. The tokenizer turns each pair into token embeddings, which are passed through several network layers that output a probability distribution over the two classes "match" and "mismatch", implementing a simple sentence-pair matching task.
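For a sentence pair, BertTokenizer produces a single sequence of the form [CLS] A [SEP] B [SEP], with token_type_ids marking which segment each token belongs to. The toy sketch below (plain Python with character-level tokens instead of the real BERT vocabulary; `encode_pair` is an illustrative helper, not part of the project code) shows that layout:

```python
# Toy illustration of how a BERT tokenizer lays out a sentence pair:
# tokens:         [CLS] a1 a2 ... [SEP] b1 b2 ... [SEP] [PAD] ...
# token_type_ids:   0   0  0 ...   0    1  1 ...   1     0   ...
def encode_pair(text_a, text_b, max_length):
    tokens = ["[CLS]"] + list(text_a) + ["[SEP]"] + list(text_b) + ["[SEP]"]
    token_type_ids = [0] * (len(text_a) + 2) + [1] * (len(text_b) + 1)
    attention_mask = [1] * len(tokens)
    # Truncate to max_length, then pad, mirroring the tokenizer's behaviour
    tokens = tokens[:max_length] + ["[PAD]"] * (max_length - len(tokens))
    token_type_ids = (token_type_ids[:max_length]
                      + [0] * (max_length - len(token_type_ids)))
    attention_mask = (attention_mask[:max_length]
                      + [0] * (max_length - len(attention_mask)))
    return tokens, token_type_ids, attention_mask

tokens, type_ids, mask = encode_pair("密码重置", "怎么改密码", max_length=20)
```

The real tokenizer additionally maps tokens to vocabulary ids; this sketch only shows the segment layout that lets BERT distinguish sentence A from sentence B.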
Data Preparation
Pretrained model: bert-base-chinese
Class label file: schema.json
{
"停机保号": 0,
"密码重置": 1,
"宽泛业务问题": 2,
"亲情号码设置与修改": 3,
"固话密码修改": 4,
"来电显示开通": 5,
"亲情号码查询": 6,
"密码修改": 7,
"无线套餐变更": 8,
"月返费查询": 9,
"移动密码修改": 10,
"固定宽带服务密码修改": 11,
"UIM反查手机号": 12,
"有限宽带障碍报修": 13,
"畅聊套餐变更": 14,
"呼叫转移设置": 15,
"短信套餐取消": 16,
"套餐余量查询": 17,
"紧急停机": 18,
"VIP密码修改": 19,
"移动密码重置": 20,
"彩信套餐变更": 21,
"积分查询": 22,
"话费查询": 23,
"短信套餐开通立即生效": 24,
"固话密码重置": 25,
"解挂失": 26,
"挂失": 27,
"无线宽带密码修改": 28
}
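schema.json maps each standard intent label to an integer index. A quick sketch of reading it with the standard json module (the inline string below is a shortened stand-in for the real file, which the project reads from `config["schema_path"]`):

```python
import json

# Shortened inline sample of schema.json; the project loads the full file
schema_text = '{"停机保号": 0, "密码重置": 1, "宽泛业务问题": 2}'
schema = json.loads(schema_text)

# Reverse mapping, useful when decoding predicted indices back to labels
index_to_label = {idx: label for label, idx in schema.items()}
```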
Training data: train.json
Validation data: valid.json
Parameter Configuration
config.py
# -*- coding: utf-8 -*-
"""
Configuration parameters
"""
Config = {
    "model_path": "model_output",
    "schema_path": "../data/schema.json",
    "train_data_path": "../data/train.json",
    "valid_data_path": "../data/valid.json",
    "pretrain_model_path": r"../../../bert-base-chinese",
    "vocab_path": r"../../../bert-base-chinese/vocab.txt",
    "max_length": 20,
    "hidden_size": 256,
    "epoch": 10,
    "batch_size": 128,
    "epoch_data_size": 10000,       # number of samples drawn per epoch
    "positive_sample_rate": 0.5,    # proportion of positive samples
    "optimizer": "adam",
    "learning_rate": 1e-3,
}
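`epoch_data_size` and `positive_sample_rate` control how training pairs are sampled: with probability `positive_sample_rate`, two questions are drawn from the same class (label 1), otherwise one question from each of two different classes (label 0). A minimal sketch of that sampling, assuming a class-index-to-questions dict like the loader's `knwb` (the `random_pair` helper and the tiny `knwb` sample are illustrative, not project code):

```python
import random

def random_pair(knwb, positive_sample_rate):
    """Sample one training pair: (question_a, question_b, label)."""
    class_indices = list(knwb.keys())
    if random.random() <= positive_sample_rate:
        # Positive pair: two different questions from the same class
        p = random.choice([i for i in class_indices if len(knwb[i]) >= 2])
        s1, s2 = random.sample(knwb[p], 2)
        return s1, s2, 1
    else:
        # Negative pair: one question from each of two different classes
        p, n = random.sample(class_indices, 2)
        return random.choice(knwb[p]), random.choice(knwb[n]), 0

knwb = {0: ["停机保号", "保号停机"], 1: ["密码重置", "重置密码"]}
pairs = [random_pair(knwb, 0.5) for _ in range(100)]
```

Because sampling is random rather than an exhaustive enumeration of pairs, `epoch_data_size` caps how many pairs count as one epoch.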
Data Processing
loader.py
# -*- coding: utf-8 -*-
import json
import torch
import random
import logging
from torch.utils.data import Dataset, DataLoader
from collections import defaultdict
from transformers import BertTokenizer
"""
Data loading
"""
logging.getLogger("transformers").setLevel(logging.ERROR)


class DataGenerator:
    def __init__(self, data_path, config):
        self.config = config
        self.path = data_path
        self.tokenizer = load_vocab(config["vocab_path"])
        self.config["vocab_size"] = len(self.tokenizer.vocab)
        self.schema = load_schema(config["schema_path"])
        # Pairs are drawn by random sampling, so a fixed sample count per
        # epoch is required; otherwise sampling could continue indefinitely.
        self.train_data_size = config["epoch_data_size"]
        self.max_length = config["max_length"]
        self.data_type = None  # marks whether this is the "train" or "test" set
        self.load()

    def load(self):
        self.data = []
        self.knwb = defaultdict(list)
        with open(self.path, encoding="utf8") as f:
            for line in f:
                line = json.loads(line)
                if isinstance(line, dict):
                    # Training data: {"questions": [...], "target": "label"}
                    self.data_type = "train"
                    for question in line["questions"]:
                        self.knwb[self.schema[line["target"]]].append(question)
                else:
                    # Test data: ["question", "label"]
                    self.data_type = "test"
                    question, label = line
                    label_index = torch.LongTensor([self.schema[label]])
                    self.data.append([question, label_index])
        return


def load_vocab(vocab_path):
    return BertTokenizer(vocab_path)


def load_schema(schema_path):
    with open(schema_path, encoding="utf8") as f:
        return json.loads(f.read())
