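Introduction to text classification
Text classification is the process of automatically assigning a category to a text, based on its content, under a given classification scheme. The simplest case assigns each text to one of two categories and is called binary classification; for example, movie reviews only need to be classified as "positive" or "negative". Assigning each text to one of several categories is called multi-class classification; for example, classifying names as French, English, Spanish, and so on.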
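In general, text classification consists of the following steps:
- Definition: define the data and the classification scheme, that is, which categories to use and which data are needed.
- Data preprocessing: prepare the documents by tokenizing them, removing stop words, and so on.
- Feature extraction: reduce the dimensionality of the document matrix and extract the most useful features from the training set.
- Model training: choose a classification model and algorithm, and train the text classifier.
- Evaluation: test the classifier on the test set and assess its performance.
- Application: use the best-performing classification model to classify new documents.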
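The steps above can be sketched end to end in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not the method used later in this article: the classifier is a small multinomial Naive Bayes with add-one smoothing (chosen here only because it fits in a short example), the features are raw word counts with no dimensionality reduction, and the stop-word list, the function names, and the toy reviews are all made up for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list)
STOP_WORDS = {"a", "an", "the", "is", "was", "it", "this", "and", "of"}


def preprocess(text):
    """Preprocessing: lowercase, split into word tokens, drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def train_naive_bayes(docs, labels):
    """Training: collect per-class word counts (the features) and class frequencies."""
    word_counts = {}
    class_counts = Counter(labels)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = preprocess(doc)
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def predict(model, text):
    """Application: return the class with the highest log-probability."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in preprocess(text):
            if token not in vocab:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids log(0) for unseen class/word pairs
            score += math.log((word_counts[label][token] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, docs, labels):
    """Evaluation: fraction of documents classified correctly."""
    correct = sum(predict(model, d) == y for d, y in zip(docs, labels))
    return correct / len(docs)


# Definition: two classes ("pos", "neg") and a toy, made-up data set
train_docs = [
    "a wonderful and moving film",
    "great acting, great story",
    "this was a boring and dull movie",
    "terrible plot and awful acting",
]
train_labels = ["pos", "pos", "neg", "neg"]
test_docs = ["a great and wonderful story", "dull plot, terrible movie"]
test_labels = ["pos", "neg"]

model = train_naive_bayes(train_docs, train_labels)
test_accuracy = accuracy(model, test_docs, test_labels)
print("test accuracy:", test_accuracy)
```

Each function corresponds to one step of the list, which makes it easy to swap in a different tokenizer, feature extractor, or classifier without touching the rest.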
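Dataset overview
The Large Movie Review Dataset (aclimdb) was released by the Stanford Artificial Intelligence Laboratory in 2011. It contains 25,000 training examples and 25,000 test examples, plus about 50,000 unlabeled examples as auxiliary data. The training set and the test set each contain 12,500 positive examples (positive reviews, pos) and 12,500 negative examples (negative reviews, neg).
Directory structure of aclimdb:
Directory of the positive training examples:
This directory contains 12,500 English reviews. The text of the first review is: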
Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!
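Data preprocessing
First load the data to obtain the training data, training labels, test data, and test labels. The training and test labels can be generated while loading the positive or negative examples, as an all-ones or all-zeros array: positive examples are labeled 1 and negative examples 0.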
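As a rough sketch of the label-generation idea, the snippet below reads every .txt file from a positive and a negative folder and builds the labels with np.ones and np.zeros. The helper name load_texts_and_labels, the assumption that each review is a separate .txt file, and the temporary demo files are all hypothetical and exist only for illustration; this is not the article's own get_data function.

```python
import glob
import os
import tempfile

import numpy as np


def load_texts_and_labels(pos_dir, neg_dir):
    """Read every .txt file in the two folders and build matching labels.

    Hypothetical helper for illustration only.
    """
    def read_dir(directory):
        texts = []
        # sorted() keeps the file order deterministic across platforms
        for path in sorted(glob.glob(os.path.join(directory, "*.txt"))):
            with open(path, encoding="utf-8") as f:
                texts.append(f.read())
        return texts

    pos_texts = read_dir(pos_dir)
    neg_texts = read_dir(neg_dir)
    texts = pos_texts + neg_texts
    # Positive examples get label 1, negative examples get label 0
    labels = np.concatenate([
        np.ones(len(pos_texts), dtype=int),
        np.zeros(len(neg_texts), dtype=int),
    ])
    return texts, labels


# Tiny demo on made-up files standing in for the real review folders
with tempfile.TemporaryDirectory() as root:
    pos_dir = os.path.join(root, "pos")
    neg_dir = os.path.join(root, "neg")
    os.makedirs(pos_dir)
    os.makedirs(neg_dir)
    for name, text in [("0.txt", "great film"), ("1.txt", "loved it")]:
        with open(os.path.join(pos_dir, name), "w", encoding="utf-8") as f:
            f.write(text)
    with open(os.path.join(neg_dir, "0.txt"), "w", encoding="utf-8") as f:
        f.write("dull and slow")
    demo_texts, demo_labels = load_texts_and_labels(pos_dir, neg_dir)

print(demo_texts, demo_labels)
```

Because the texts are concatenated in the same order as the label arrays, the i-th label always belongs to the i-th text. Shuffle the two together before training if the model is sensitive to example order.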
import glob
import numpy as np
def get_data(path_neg, path_pos):