LSTM-Based Sentiment Classification

Latest recommended article published on 2023-05-02 15:35:15

Original

License: CC 4.0 BY-SA

Article tags:

#nlp #pytorch #python

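This article walks through an LSTM-based text classifier: data loading, preprocessing, model construction, and the training run. With a simple architecture (one LSTM layer and one linear layer), it reaches about 74.56% accuracy on the IMDb dataset.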

Table of contents

LSTM-based text classification

LSTM-based text classification

Introduction

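This article covers the implementation of LSTM-based text classification and does not go into much of the underlying theory. Since I am a beginner myself, some parts of the code may be less than ideal.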

Data loading and processing

The text data used in this program has the following format:

text	label

Here text is an IMDb movie review and label is 0 or 1, where 0 means negative and 1 means positive.
First, load the data and shuffle it:

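import pandas as pd
# Assumption: shuffle() comes from scikit-learn; the original does not show its imports
from sklearn.utils import shuffle

train_data = pd.read_csv('imdb/Train.csv')
test_data = pd.read_csv('imdb/Test.csv')
# Shuffle the rows of each DataFrame
train_data = shuffle(train_data)
test_data = shuffle(test_data)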
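To make the expected file layout concrete, here is a minimal, dependency-free sketch of the same load-and-shuffle step. It uses the standard csv and random modules as a stand-in for pandas and shuffle(), and it assumes the files are comma-separated with a header row of text,label. The two sample rows and the seed value are made up for illustration and are not taken from the real data.

```python
import csv
import io
import random

# Hypothetical two-row sample in the assumed text,label layout
# (illustrative only, not taken from the real Train.csv).
sample_csv = io.StringIO(
    "text,label\n"
    '"A waste of film. Utter rubbish.",0\n'
    '"Great fun, I loved it.",1\n'
)

# Read each row as a (text, label) pair, converting the label to int.
rows = [(row["text"], int(row["label"])) for row in csv.DictReader(sample_csv)]

# A fixed seed (an arbitrary choice here) makes the shuffled order reproducible.
random.Random(42).shuffle(rows)

texts = [text for text, _ in rows]
labels = [label for _, label in rows]
print(labels)
```

Fixing the seed is optional, but it helps when comparing accuracy between runs, because the order of the training data then stays the same.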

An example of the raw text (the first review in train_data) is shown below:

I grew up (b. 1965) watching and loving the Thunderbirds. All my mates at school watched. We played “Thunderbirds” before school, during lunch and after school. We all wanted to be Virgil or Scott. No one wanted to be Alan. Counting down from 5 became an art form. I took my children to see the movie hoping they would get a glimpse of what I loved as a child. How bitterly disappointing. The only high point was the snappy theme tune. Not that it could compare with the original score of the Thunderbirds. Thankfully early Saturday mornings one television channel still plays reruns of the series Gerry Anderson and his wife created. Jonatha Frakes should hand in his directors chair, his version was completely hopeless. A waste of film. Utter rubbish. A CGI remake may be acceptable but replacing marionettes with Homo sapiens subsp. sapiens was a huge error of judgment.

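Every run of English letters in text is treated as a word, so each review becomes a word list. A dictionary word2Idx is then built from all the words in train, where the key is the word and the value is its integer index; the initial entry {'None': 0} stands for unknown words. This dictionary is then used to convert the text in train and test into integer arrays, each with a length of exactly max_len.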
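The steps above can be sketched on a toy example as follows. This is only an illustration under stated assumptions, not the article's own code: build_vocab and encode are hypothetical helper names, and truncating long reviews on the right and padding short ones with 0 are my assumptions about how the fixed length max_len is enforced.

```python
import re

def build_vocab(sentences):
    """Map each distinct word to an integer index; index 0 is reserved for unknown words."""
    word2idx = {"None": 0}
    for sentence in sentences:
        for word in re.sub("[^a-zA-Z]", " ", sentence).lower().split():
            if word not in word2idx:
                word2idx[word] = len(word2idx)
    return word2idx

def encode(sentence, word2idx, max_len):
    """Convert a sentence into a list of exactly max_len integers."""
    words = re.sub("[^a-zA-Z]", " ", sentence).lower().split()
    # Unknown words map to 0; keep at most max_len words (truncation is an assumption).
    ids = [word2idx.get(word, 0) for word in words][:max_len]
    # Right-pad with 0 up to max_len (the padding value and side are assumptions).
    return ids + [0] * (max_len - len(ids))

train_texts = ["A waste of film.", "Loved the theme tune!"]
vocab = build_vocab(train_texts)
encoded = encode("A waste of time", vocab, max_len=6)
print(vocab)
print(encoded)  # [1, 2, 3, 0, 0, 0]
```

Note that in this sketch the index 0 serves both as the unknown-word id and as the padding value, so the model cannot tell the two apart. A separate padding index would avoid that, at the cost of a slightly larger vocabulary.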

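import re

def clean_text(text):
    # Clean a sentence into a word list: replace every non-letter with a space,
    # then lowercase the result and split it on whitespace
    text = re.sub('[^a-zA-Z]', ' ', text)
    words = text.lower().split()
    return words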
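As a quick check of what this cleaning step does, the snippet below repeats clean_text so that it runs on its own and applies it to the first sentence of the example review. Digits and punctuation are dropped, which is why "(b. 1965)" is reduced to the single token "b".

```python
import re

def clean_text(text):
    # Same logic as above: non-letters become spaces, then lowercase and split.
    text = re.sub('[^a-zA-Z]', ' ', text)
    return text.lower().split()

sample = "I grew up (b. 1965) watching and loving the Thunderbirds."
words = clean_text(sample)
print(words)
# ['i', 'grew', 'up', 'b', 'watching', 'and', 'loving', 'the', 'thunderbirds']
```

One side effect worth knowing: contractions are split at the apostrophe, so "don't" becomes the two tokens "don" and "t".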

def processData(x_train, X_test):
    # Collect all words and convert the sentences to integers
    word2Idx = dict()
    word2Idx['None'] = 0  # unknown word
    train = []
    test = []
    # Build the dictionary from the words in train; the key is the word
    for sentence in x_train:
        sentence = re.sub('[^a-zA-Z]', ' ', sentence)
        words = sentence.lower().split()
        for word in words:
            if word not in word2Idx:
                word2Idx[word] = len(word2Idx)


    for i in range(len(x_train)