Python pandas (pd) topics covered:
1. Read the input file and convert it into a pandas DataFrame
# Read the input file, split fields on the delimiter, and assign column names
import pandas as pd
data = pd.read_csv('train.csv', sep = "\t", names=['label', 'msg'])
# Inspect the loaded data: (rows, columns) and the first 10 rows
print(data.shape)
print(data.head(10))
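A minimal sketch of the same read_csv call on an in-memory sample, so the result can be checked without a file on disk. The three rows in `raw` are hypothetical and only stand in for train.csv; the real file's contents are not known here. Because `names` is passed explicitly, pandas treats the first line as data rather than as a header.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample standing in for train.csv (assumption).
raw = "1\thello world\n0\tfree prize now\n1\tsee you soon\n"

# Same arguments as in the notes: tab delimiter, explicit column names.
sample = pd.read_csv(io.StringIO(raw), sep="\t", names=["label", "msg"])

print(sample.shape)   # (number of rows, number of columns)
print(sample.head())
```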
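2. Shuffle the training samples
# Randomly shuffle the rows (frac=1 keeps every row; random_state makes the order reproducible)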
data = data.sample(frac=1, random_state=42)
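A small sketch of what the shuffle does, using a hypothetical five-row frame `df`. It suggests that `sample(frac=1)` reorders rows without dropping any and keeps each label paired with its message. The `reset_index(drop=True)` call is an optional addition not present in the notes; without it the shuffled frame keeps the original index labels.

```python
import pandas as pd

# Hypothetical toy data (assumption), not the real training set.
df = pd.DataFrame({"label": [0, 1, 0, 1, 0], "msg": list("abcde")})

# Shuffle all rows, then renumber the index from 0 (optional step).
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)

# A second shuffle with the same seed yields the same order.
shuffled_again = df.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled)
```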
3. Transform a field
# Insert a space between the characters of each msg string
data['msg'] = data['msg'].apply(lambda x: ' '.join(x))
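To make the effect of this transformation concrete, here is a short sketch on a hypothetical Series `msgs`. It assumes every value is a string: joining a string iterates over its characters, so each character ends up separated by a space. A missing value (NaN is a float) would raise a TypeError in the lambda, so such rows would need handling first.

```python
import pandas as pd

# Hypothetical messages (assumption); each value must be a str.
msgs = pd.Series(["你好世界", "abc"])

# ' '.join(x) iterates over the characters of x and joins them with spaces.
spaced = msgs.apply(lambda x: " ".join(x))

print(spaced.tolist())  # ['你 好 世 界', 'a b c']
```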
4. Check the distribution of sample labels
print(data['label'].value_counts())
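As a sketch of how the output can be read, the example below uses a hypothetical label column. `value_counts()` returns the count per distinct label, sorted from most to least frequent. Passing `normalize=True` returns proportions instead, which may make class imbalance easier to spot.

```python
import pandas as pd

# Hypothetical labels (assumption): three samples of class 0, one of class 1.
labels = pd.Series([0, 0, 0, 1], name="label")

counts = labels.value_counts()                # absolute counts per label
shares = labels.value_counts(normalize=True)  # fraction of samples per label

print(counts)
print(shares)
```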
5. Split the samples into a training set and a test set by a given ratio
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = \
train_test_split(data['msg'],
data['label'],