IMDB Dataset of 50K Movie Reviews
1. Read the data from the file
data_path = '/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv'
# read the data from the CSV file
import pandas as pd
imdb_data=pd.read_csv(data_path)
# print the shape of the data and the first 10 rows
print(imdb_data.shape)
imdb_data.head(10)
Note that imdb_data is a pandas DataFrame.
print(imdb_data.keys())
# count the number of reviews per sentiment label
print(imdb_data['sentiment'].value_counts())
2. Extract the review texts and sentiment labels from the IMDB dataset
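The tokenization step below expects the raw review strings in texts and numeric labels in labels. A minimal sketch, assuming the CSV's two columns are named review and sentiment (mapping positive to 1 and negative to 0):
# pull out the raw texts and convert the sentiment strings to binary labels
texts = imdb_data['review'].tolist()
labels = (imdb_data['sentiment'] == 'positive').astype(int).tolist()
print(len(texts), len(labels))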
3. Tokenize the raw IMDB text
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
maxlen = 100 # cap each review at 100 tokens (shorter reviews are padded, longer ones truncated)
max_words = 10000 # only consider the 10,000 most frequent words in the dataset
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts) # convert each text into a sequence of integer word indices
word_index = tokenizer.word_index # dictionary mapping each word to its integer index
print('Found %s unique tokens' % len(word_index))
data = pad_sequences(sequences, maxlen=maxlen) # pad/truncate into a 2D array of shape (num_samples, maxlen)
data = np.asarray(data).astype('int64')
labels = np.asarray(labels).astype('float32')
print(data.shape, labels.shape)
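To sanity-check the mapping, word_index can be inverted to decode a padded sequence back into its (truncated) text; an illustrative sketch:
# invert word_index so integer indices map back to words
reverse_word_index = {i: w for w, i in word_index.items()}
# decode the first review; index 0 is reserved for padding
decoded = ' '.join(reverse_word_index.get(i, '?') for i in data[0] if i != 0)
print(decoded)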
Split the data into training and test sets: the first 40,000 reviews for training, the remaining 10,000 for testing.
train_reviews = data[:40000]
train_sentiments = labels[:40000]
test_reviews = data[40000:]
test_sentiments = labels[40000:]
# show the shapes of the train and test sets
print(train_reviews.shape,train_sentiments.shape)
print(test_reviews.shape,test_sentiments.shape)
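The slice above assumes the rows of the CSV are not grouped by sentiment; if they were, both classes would need to be mixed by shuffling before the split. A minimal sketch of that precaution:
# shuffle samples and labels together before slicing into train/test
indices = np.random.permutation(len(data))
data = data[indices]
labels = labels[indices]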
4. Preprocess the embeddings
glove.6B.100d.txt contains 100-dimensional embedding vectors for 400,000 words; the first column of each line is the word itself, and the remaining 100 columns are that word's embedding vector.
Build an embedding_index dictionary whose keys are words and whose values are the corresponding embedding vectors.
import os
glove_dir = '/kaggle/input/glove6b'
embedding_index = {}
# parse the GloVe file: each line is a word followed by its 100-dim vector
with open(os.path.join(glove_dir, 'glove.6B.100d.txt')) as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = coefs
print('Found %s word vectors.' % len(embedding_index))
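As a quick sanity check, look up the vector for a common word (assuming it appears in the GloVe vocabulary):
# inspect the embedding of one sample word
vec = embedding_index.get('movie')
print(vec.shape if vec is not None else "'movie' not found")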