LSTM-Based Sentiment Classification

This article walks through an LSTM-based text classification implementation, covering data loading, preprocessing, model construction, and training. With a simple model structure (one LSTM layer followed by one linear layer), it reaches about 74.56% accuracy on the IMDb dataset.

LSTM-Based Text Classification

Introduction

This article is a hands-on implementation of LSTM-based text classification and does not cover much of the underlying theory. Since I am still a beginner myself, some parts of the code may not be entirely idiomatic.

Packages and Hyperparameters

The following packages are used in this implementation:

import re
import numpy as np
import pandas as pd
import torch as t
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
from torch.utils.data import Dataset, DataLoader
from sklearn.utils import shuffle

The following hyperparameters are used throughout:

max_len = 65           # maximum sentence length
embedding_size = 64    # length of each word's embedding vector
# LSTM parameters
input_dim = embedding_size
hidden_dim = 128
num_layer = 1
num_class = 2
lr = 0.1               # learning rate
epoch = 30
batch_size = 20
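
For reference, the introduction describes the model as one LSTM layer followed by one linear layer. Below is a minimal sketch of such a model using these hyperparameters; the class name LSTMClassifier, the batch_first layout, and classifying from the last time step's output are my own assumptions:

class LSTMClassifier(nn.Module):
    # One embedding layer, one LSTM layer, and one linear classifier,
    # matching the structure described in the introduction.
    def __init__(self, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layer, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_class)

    def forward(self, x):
        # x: (batch, max_len) integer word indices
        emb = self.embedding(x)        # (batch, max_len, embedding_size)
        out, _ = self.lstm(emb)        # (batch, max_len, hidden_dim)
        return self.fc(out[:, -1, :])  # logits from the last time step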

Data Loading and Processing

The data used in this program has the following format:

text label

where text is an IMDb movie review and label is 0 or 1: 0 means negative and 1 means positive.
First, load the data and shuffle it:

train_data = pd.read_csv('imdb/Train.csv')
test_data = pd.read_csv('imdb/Test.csv')
train_data = shuffle(train_data)
test_data = shuffle(test_data)
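
Next, pull out the text and labels as arrays. A minimal sketch, assuming the CSV columns are named text and label as in the format shown above:

x_train = train_data['text'].values
y_train = train_data['label'].values
x_test = test_data['text'].values
y_test = test_data['label'].values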

An example of the raw text (train_data['text'].iloc[0]) is shown below:

I grew up (b. 1965) watching and loving the Thunderbirds. All my mates at school watched. We played “Thunderbirds” before school, during lunch and after school. We all wanted to be Virgil or Scott. No one wanted to be Alan. Counting down from 5 became an art form. I took my children to see the movie hoping they would get a glimpse of what I loved as a child. How bitterly disappointing. The only high point was the snappy theme tune. Not that it could compare with the original score of the Thunderbirds. Thankfully early Saturday mornings one television channel still plays reruns of the series Gerry Anderson and his wife created. Jonatha Frakes should hand in his directors chair, his version was completely hopeless. A waste of film. Utter rubbish. A CGI remake may be acceptable but replacing marionettes with Homo sapiens subsp. sapiens was a huge error of judgment.

Treat every alphabetic token in text as a word and turn each review into a word list. Then build a vocabulary word2Idx from all the words in the training set, where the key is the word and the value is its index; the initial entry {'None': 0} is reserved for unknown words. Finally, use this dictionary to convert the text in both train and test into integer arrays, padding or truncating each to length max_len.

def clean_text(text):
    # Clean a sentence: keep only letters, lowercase it, split into a word list
    text = re.sub('[^a-zA-Z]', ' ', text)
    words = text.lower().split()
    return words

def processData(x_train, x_test):
    # Build the vocabulary and numericalize sentences to fixed-length arrays
    word2Idx = dict()
    word2Idx['None'] = 0  # index 0 is reserved for unknown words (and padding)
    train = []
    test = []
    # Build word2Idx from every word in the training set; keys are words
    for sentence in x_train:
        for word in clean_text(sentence):
            if word not in word2Idx:
                word2Idx[word] = len(word2Idx)

    # Map a sentence to word indices, padding/truncating to max_len
    def to_indices(sentence):
        idxs = [word2Idx.get(word, 0) for word in clean_text(sentence)]
        idxs = idxs[:max_len]
        idxs += [0] * (max_len - len(idxs))
        return idxs

    for sentence in x_train:
        train.append(to_indices(sentence))
    for sentence in x_test:
        test.append(to_indices(sentence))
    return np.array(train), np.array(test), word2Idx
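
With the sentences numericalized, they can be wrapped in a Dataset and batched with a DataLoader. A minimal sketch, assuming the arrays from above; the class name TextDataset is mine:

class TextDataset(Dataset):
    # Wraps index arrays and labels as LongTensors for the model
    def __init__(self, x, y):
        self.x = t.LongTensor(x)
        self.y = t.LongTensor(y)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

train_x, test_x, word2Idx = processData(x_train, x_test)
train_loader = DataLoader(TextDataset(train_x, y_train), batch_size=batch_size, shuffle=True)
test_loader = DataLoader(TextDataset(test_x, y_test), batch_size=batch_size)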
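Finally, a minimal sketch of a training loop tying the pieces together. The lr, epoch, and batch_size values come from the hyperparameters above; the choice of plain SGD and cross-entropy loss is my assumption (a learning rate of 0.1 is typical for SGD but would be too large for Adam):

model = LSTMClassifier(vocab_size=len(word2Idx))
optimizer = optim.SGD(model.parameters(), lr=lr)

for e in range(epoch):
    # Train for one epoch
    model.train()
    for x_batch, y_batch in train_loader:
        optimizer.zero_grad()
        logits = model(x_batch)
        loss = F.cross_entropy(logits, y_batch)
        loss.backward()
        optimizer.step()

    # Evaluate accuracy on the test set after each epoch
    model.eval()
    correct = 0
    with t.no_grad():
        for x_batch, y_batch in test_loader:
            pred = model(x_batch).argmax(dim=1)
            correct += (pred == y_batch).sum().item()
    print(f'epoch {e}: test acc = {correct / len(test_loader.dataset):.4f}')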