Contents
Task description and data overview
Competition page: https://tianchi.aliyun.com/s/3bd272d942f97725286a8e44f40f3f74
Code walkthrough
Step 1: Environment setup
import pandas as pd
import codecs, gc
import numpy as np
from sklearn.model_selection import KFold
from keras_bert import load_trained_model_from_checkpoint, Tokenizer
from keras.metrics import top_k_categorical_accuracy
from keras.layers import *
from keras.callbacks import *
from keras.models import Model
import keras.backend as K
from keras.optimizers import Adam
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
To run the code on Google Colab, first upload the data to Google Drive. Then run the following code to mount Drive and set up the environment.
from google.colab import drive
drive.mount('/content/drive')
'''
Path layout:
../code #stores the code
../data #stores the data
../subs #stores the output data (submissions)
../chinese_roberta_wwm_large_ext_L-24_H-1024_A-16 #BERT model path
'''
!pip install keras-bert  # notebook cell syntax; run this before importing keras_bert
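Before reading any data it can help to confirm that the folders from the path notes actually exist on the mounted Drive. The following is only a minimal sketch: the helper `missing_dirs` and the root folder `nlp_multitask` are hypothetical names introduced for illustration, and only the four subfolder names come from the notes above.

```python
from pathlib import Path

# Folder names taken from the path notes above.
SUBDIRS = (
    "code",
    "data",
    "subs",
    "chinese_roberta_wwm_large_ext_L-24_H-1024_A-16",
)


def missing_dirs(root):
    """Return the expected subfolders that do not exist under `root`."""
    root = Path(root)
    return [name for name in SUBDIRS if not (root / name).is_dir()]


# Hypothetical project root on the mounted Drive; adjust it to your own layout.
PROJECT_ROOT = Path("/content/drive/MyDrive/nlp_multitask")
print("Missing folders:", missing_dirs(PROJECT_ROOT))
```

An empty list means every expected folder was found; anything else names what still has to be uploaded.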
Step 2: Reading the data
# For OCNLI, use content1[0:maxlentext1] + content2 as the content of the task
times_train = pd.read_csv('/data/TNEWS_train1128.csv', sep='\t', header=None, names=('id', 'content', 'label')).astype(str)
ocemo_train = pd.read_csv('/data/OCEMOTION_train1128.csv',sep='\t', header=None, names=('id', 'content', 'label')).astype(str)
ocnli_train = pd.read_csv('/data/OCNLI_train1128.csv', sep='\t', header=None, names=('id', 'content1', 'content2', 'label')).astype(str)
ocnli_train['content'] = ocnli_train['content1'] + ocnli_train['content2'] # .apply( lambda x: x[:maxlentext1] )
times_testa = pd.read_csv('/data/TNEWS_a.csv', sep='\t', header=None, names=('id', 'content')).astype(str)
ocemo_testa = pd.read_csv('/data/OCEMOTION_a.csv',sep='\t', header=None, names=('id', 'content')).astype(str)
ocnli_testa = pd.read_csv('/data/OCNLI_a.csv', sep='\t', header=None, names=('id', 'content1', 'content2')).astype(str)
ocnli_testa['content'] = ocnli_testa['content1'] + ocnli_testa['content2'] # .apply( lambda x: x[:maxlentext1] )
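The comment at the top of this step describes truncating content1 to maxlentext1 characters before appending content2, while the code as written concatenates the two columns without truncation. A minimal sketch of the truncated variant follows. It assumes a value for `maxlentext1` (the source does not give one) and uses a tiny in-memory, tab-separated sample instead of the real OCNLI file.

```python
import io

import pandas as pd

# Hypothetical value; the source does not state what maxlentext1 should be.
maxlentext1 = 5

# Tiny made-up sample in the same layout: id, content1, content2, label.
raw = "0\tabcdefghij\tXYZ\t1\n1\tabc\tQ\t0\n"

sample = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    header=None,
    names=["id", "content1", "content2", "label"],
).astype(str)

# Truncate only the first sentence, then append the second one in full.
sample["content"] = sample["content1"].str[:maxlentext1] + sample["content2"]
print(sample[["id", "content", "label"]])
```

Note that applying `x[:maxlentext1]` to the already concatenated string, as in the commented-out fragment, would truncate the combined text rather than content1 alone.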
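1) Merging the datasets
Concatenate the content and label columns of the three tasks row-wise to form the training set and its labels, and do the same for the test set. This is a simple way of turning the three tasks into a single task.
# Merge the training and test data of the three tasks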
train_df = pd.concat([times_train, ocemo_train, ocnli_train[['id','content', 'label']]], axis=0).copy()
testa_df = pd.concat([times_testa, ocemo_testa, ocnli_testa[['id', 'content']]], axis=0).copy()
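One side effect of a row-wise concat is that each frame keeps its own row index, which is why the train_df.info() output in this section reports 147453 entries with an index that only runs from 0 to 48777. The sketch below reproduces this on made-up toy frames (the rows are not competition data) and shows `reset_index(drop=True)` as one possible way to obtain unique row labels.

```python
import pandas as pd

# Toy stand-ins for the three task frames; all values are made up.
task_a = pd.DataFrame(
    {"id": ["0", "1"], "content": ["news a", "news b"], "label": ["100", "101"]}
)
task_b = pd.DataFrame({"id": ["0"], "content": ["feeling"], "label": ["sadness"]})
task_c = pd.DataFrame(
    {"id": ["0"], "content1": ["premise"], "content2": ["hypothesis"], "label": ["1"]}
)
task_c["content"] = task_c["content1"] + task_c["content2"]

# Same pattern as above: stack the shared columns row-wise.
merged = pd.concat([task_a, task_b, task_c[["id", "content", "label"]]], axis=0)
print(list(merged.index))  # index values repeat across the stacked frames

# Optional: renumber the rows so that every row has a unique label.
merged_unique = merged.reset_index(drop=True)
print(list(merged_unique.index))
```

Duplicate index values are harmless for the label encoding that follows, but they matter as soon as rows are selected with `.loc`.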
2) Label encoding
# Encode the labels with LabelEncoder, because the labels fed to the BERT classifier must start from 0
# LabelEncoder(): Encode labels with value between 0 and n_classes-1.
encode_label = LabelEncoder()
train_df['label'] = encode_label.fit_transform(train_df['label'].apply(str))
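Because a single encoder is fitted on the merged frame, the integer ids cover the labels of all three tasks at once. The following sketch illustrates the mapping on a handful of hypothetical label strings (they are placeholders, not a statement about the real label sets) and shows how `inverse_transform` recovers the original strings, which is typically needed when writing predictions back out.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels mixed from several tasks.
labels = ["102", "sadness", "0", "102", "1"]

encoder = LabelEncoder()
ids = encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; the id is the position in that list.
print(list(encoder.classes_))
print(list(ids))

# Map integer ids back to the original label strings.
restored = encoder.inverse_transform(ids)
print(list(restored))
```

Since the labels are strings, they are sorted lexicographically, so "102" comes after "1" and before "sadness".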
3) Inspecting the data
train_df.info()
'''
<class 'pandas.core.frame.DataFrame'>
Int64Index: 147453 entries, 0 to 48777
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 147453 non-null object
1 content 147453 non-null object
2 la