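1. Overview
Sentiment classification is a classic task in natural language processing and a typical classification problem. This section uses MindSpore to implement an RNN-based sentiment classification model that behaves as follows: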
Input: This film is terrible
Ground-truth label: Negative
Predicted label: Negative
Input: This film is great
Ground-truth label: Positive
Predicted label: Positive
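2. Data Preparation
This section uses the IMDB movie review dataset, a classic dataset for sentiment classification. It contains two classes, Positive and Negative. Sample entries are shown below: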
Review: "Quitting" may be as much about exiting a pre-ordained identity as about drug withdrawal. As a rural guy coming to Beijing, class and success must have struck this young artist face on as an appeal to separate from his roots and far surpass his peasant parents' acting success. Troubles arise, however, when the new man is too new, when it demands too big a departure from family, history, nature, and personal identity. The ensuing splits, and confusion between the imaginary and the real and the dissonance between the ordinary and the heroic are the stuff of a gut check on the one hand or a complete escape from self on the other.
Label: Negative
Review: This movie is amazing because the fact that the real people portray themselves and their real life experience and do such a good job it's like they're almost living the past over again. Jia Hongsheng plays himself an actor who quit everything except music and drugs struggling with depression and searching for the meaning of life while being angry at everyone especially the people who care for him most.
Label: Positive
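Pretrained word vectors are used to encode natural-language words and capture the semantic features of the text. This section uses GloVe word vectors as the embedding.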
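As a rough illustration of what encoding words with pretrained vectors means, the sketch below parses the plain-text GloVe format (one token per line, followed by its space-separated vector components) and looks up a vector for each token. The function names load_glove_vectors and encode, the all-zero vector for unknown words, and the tiny 3-dimensional sample are assumptions made for this example; they are not part of the MindSpore tutorial code.

```python
import io


def load_glove_vectors(lines):
    """Parse GloVe-style text lines ("token v1 v2 ...") into a vocab and a vector table."""
    vocab, vectors = {}, []
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        token, values = parts[0], parts[1:]
        vocab[token] = len(vectors)  # the token's row index in the table
        vectors.append([float(v) for v in values])
    return vocab, vectors


def encode(tokens, vocab, vectors, unk_vector):
    """Map each token to its pretrained vector, falling back to unk_vector."""
    return [vectors[vocab[t]] if t in vocab else unk_vector for t in tokens]


# Made-up 3-dimensional sample standing in for a real GloVe file.
sample = io.StringIO("film 0.1 0.2 0.3\ngreat 0.4 0.5 0.6\n")
vocab, vectors = load_glove_vectors(sample)
unk = [0.0] * len(vectors[0])  # assumption: unknown words map to zeros
encoded = encode("this film is great".split(), vocab, vectors, unk)
```

In a real model the vector table would typically be turned into an embedding weight matrix and the tokens into row indices; the lookup here only shows the idea.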
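1) Data download module
The data download module shows the download progress and saves the dataset and the pretrained word vectors to a specified path.
It uses the requests library to issue HTTP requests and the tqdm library to display the download percentage.
The file is first downloaded to a temporary file through an IO stream, then saved to the specified path, which is returned. This keeps the download safe.
tqdm and requests must be installed manually with the following command: pip install tqdm requests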
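The download flow described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the tutorial's own module: the names save_chunks_atomically and http_download, the 1 MiB chunk size, and the 30-second timeout are choices made for this example. The data is streamed into a temporary file next to the target and moved into place only after the transfer finishes.

```python
import os
import tempfile
from pathlib import Path


def save_chunks_atomically(chunks, dest_path):
    """Write byte chunks to a temp file, then move it to dest_path and return the path."""
    dest = Path(dest_path)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so the final move stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
        os.replace(tmp_name, dest)  # only a fully written file reaches dest
    except BaseException:
        if os.path.exists(tmp_name):
            os.remove(tmp_name)  # drop the partial file on any failure
        raise
    return dest


def http_download(url, dest_path, chunk_size=1024 * 1024):
    """Stream url to dest_path while showing a tqdm progress bar."""
    # Third-party libraries, imported here so the helper above works without them.
    import requests
    from tqdm import tqdm

    with requests.get(url, stream=True, timeout=30) as response:
        response.raise_for_status()
        # Content-Length may be missing; tqdm then shows a counter without a percentage.
        total = int(response.headers.get("content-length", 0)) or None
        name = os.path.basename(str(dest_path))
        with tqdm(total=total, unit="B", unit_scale=True, desc=name) as bar:

            def chunks():
                for chunk in response.iter_content(chunk_size=chunk_size):
                    bar.update(len(chunk))
                    yield chunk

            return save_chunks_atomically(chunks(), dest_path)
```

Calling http_download with a dataset URL and a target path (directory plus file name) would return the saved path once the transfer completes. Keeping the file-writing step separate from the HTTP step also makes it possible to exercise the save logic without network access.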
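2) Next, download the IMDB dataset from the URL below (a Huawei Cloud mirror; other mirrors can also be used) and save it to the specified path (file path plus file name):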
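3) Load the IMDB dataset
The downloaded IMDB dataset is a tar.gz file. We read it with Python's tarfile library and store all the data and all the labels separately. The directory layout of the extracted original IMDB dataset is as follows: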
├── aclImdb
│ ├── imdbEr.txt
│ ├── imdb.vocab
│ ├── README
│