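LSTM-Based Text Classification
Introduction
This post walks through an implementation of LSTM-based text classification and does not cover much of the underlying theory. I am a beginner myself, so parts of the code may not be entirely sound.
Packages and hyperparameters
These are the packages I used to implement the program: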
import pandas as pd
import re
import torch.nn as nn
from torch import optim
import torch.utils.data.dataset as Dataset
import numpy as np
import torch.utils.data.dataloader as DataLoader
import torch as t
import torch.nn.functional as F
from sklearn.utils import shuffle
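These are the hyperparameters used in the program:
max_len = 65  # maximum sentence length
embedding_size = 64  # length of the embedding vector for each word
# The next few are LSTM parameters
input_dim = embedding_size
hidden_dim = 128
num_layer = 1
num_class = 2
lr = 0.1  # learning rate
epoch = 30
batch_size = 20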
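To get a feel for what these sizes imply, the arithmetic below estimates how many trainable parameters the LSTM and the output layer would have. It is only a rough sketch: it assumes a unidirectional LSTM with the PyTorch parameter layout (two bias vectors per gate) and a linear layer mapping hidden_dim to num_class. The helper names lstm_param_count and linear_param_count are made up for this illustration.

```python
# Hyperparameters copied from the listing above.
embedding_size = 64
input_dim = embedding_size
hidden_dim = 128
num_layer = 1
num_class = 2


def lstm_param_count(input_dim, hidden_dim, num_layer):
    """Count parameters of a unidirectional LSTM stack (PyTorch-style layout).

    Each layer has four gates; every gate owns an input weight matrix,
    a hidden weight matrix, and two bias vectors.
    """
    total = 0
    layer_input = input_dim
    for _ in range(num_layer):
        total += 4 * (
            hidden_dim * layer_input    # input-to-hidden weights
            + hidden_dim * hidden_dim   # hidden-to-hidden weights
            + 2 * hidden_dim            # two bias vectors
        )
        layer_input = hidden_dim  # deeper layers read the previous layer's output
    return total


def linear_param_count(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


lstm_params = lstm_param_count(input_dim, hidden_dim, num_layer)
# Assumption: the classifier head maps hidden_dim -> num_class.
head_params = linear_param_count(hidden_dim, num_class)
print(lstm_params, head_params)
```

The embedding table adds a further vocabulary size times embedding_size parameters, which depends on the dictionary built from the training data.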
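Data loading and processing
The text data used by the program has the following format:
| text | label |
|---|---|
Here text is an IMDb movie review and label is 0 or 1, where 0 means negative and 1 means positive.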
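As a small illustration of that layout, the sketch below parses a two-row CSV with the standard library. The rows are invented for the example (the first borrows a phrase from the sample review shown later), and the comma delimiter and header line are assumptions about the real Train.csv.

```python
import csv
import io

# Two illustrative rows in the assumed layout: a header line, then text,label pairs.
sample_csv = io.StringIO(
    'text,label\n'
    '"A waste of film. Utter rubbish.",0\n'
    '"Snappy theme tune, great fun.",1\n'
)

rows = list(csv.DictReader(sample_csv))
texts = [row["text"] for row in rows]
labels = [int(row["label"]) for row in rows]  # 0 = negative, 1 = positive
print(texts, labels)
```

Quoting the text field matters because reviews routinely contain commas.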
First load the data and shuffle it:
train_data = pd.read_csv('imdb/Train.csv')
test_data = pd.read_csv('imdb/Test.csv')
train_data = shuffle(train_data)
test_data = shuffle(test_data)
An example of a raw text, train_data[0], looks like this:
I grew up (b. 1965) watching and loving the Thunderbirds. All my mates at school watched. We played “Thunderbirds” before school, during lunch and after school. We all wanted to be Virgil or Scott. No one wanted to be Alan. Counting down from 5 became an art form. I took my children to see the movie hoping they would get a glimpse of what I loved as a child. How bitterly disappointing. The only high point was the snappy theme tune. Not that it could compare with the original score of the Thunderbirds. Thankfully early Saturday mornings one television channel still plays reruns of the series Gerry Anderson and his wife created. Jonatha Frakes should hand in his directors chair, his version was completely hopeless. A waste of film. Utter rubbish. A CGI remake may be acceptable but replacing marionettes with Homo sapiens subsp. sapiens was a huge error of judgment.
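Every run of English letters in text is treated as a word, which turns each text into a word list. A dictionary word2Idx is then built from all the words in train, with the word as key and its integer index as value; the initial entry {'None': 0} stands for unknown words. This dictionary is used to convert the texts in train and test into integer arrays, each forced to length max_len.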
def clean_text(text):
    # Clean a sentence and turn it into a word list
    text = re.sub('[^a-zA-Z]', ' ', text)
    words = text.lower().split()
    return words

def processData(x_train, X_test):
    # Collect all words and convert the sentences to integers
    word2Idx = dict()
    word2Idx['None'] = 0  # unknown word
    train = []
    test = []
    # Build the dictionary from the words in train; the key is the word
    for sentence in x_train:
        words = clean_text(sentence)
        for word in words:
            if word not in word2Idx:
                word2Idx[word] = len(word2Idx)
    for i in range(len(x_train)
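The listing is cut off at this point. The remaining step described earlier (mapping words to integers and forcing every sequence to max_len) might look roughly like the sketch below. This is not the original code: build_vocab and encode are hypothetical helpers, and truncating long texts and right-padding short ones with 0 is an assumption about how the length is enforced.

```python
import re

max_len = 65  # same value as the hyperparameter above


def clean_text(text):
    # Replace every non-letter with a space, lowercase, split on whitespace.
    return re.sub('[^a-zA-Z]', ' ', text).lower().split()


def build_vocab(sentences):
    # Index 0 is reserved for the 'None' (unknown word) entry.
    word2Idx = {'None': 0}
    for sentence in sentences:
        for word in clean_text(sentence):
            if word not in word2Idx:
                word2Idx[word] = len(word2Idx)
    return word2Idx


def encode(sentence, word2Idx, max_len):
    # Hypothetical helper: unknown words map to 0, long texts are truncated,
    # and short texts are right-padded with 0 (padding scheme is an assumption).
    ids = [word2Idx.get(word, 0) for word in clean_text(sentence)][:max_len]
    return ids + [0] * (max_len - len(ids))


# Tiny made-up corpus, only to exercise the helpers.
train_texts = ["A waste of film.", "Utter rubbish, a waste."]
word2Idx = build_vocab(train_texts)
encoded = encode("A total waste of time!", word2Idx, max_len)
print(encoded[:5], len(encoded))
```

One consequence of this scheme is that padding and unknown words share index 0, so the model cannot tell them apart; reserving a separate padding index would be a reasonable alternative.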

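This post presents an LSTM-based text classification implementation, covering data loading, processing, model construction, and the training run. With a simple model structure (one LSTM layer and one linear layer), it reaches about 74.56% accuracy on the IMDb dataset.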



