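Text classification of movie reviews
The code in the official tutorial has a few problems: 1. The way it references the dataset folder is wrong. 2. The vectorize_layer call does not return a result. 3. The metrics argument is written incorrectly when compiling the model. 4. At the end, new data must be converted to a tensor before it can be used for prediction.
All of these are fixed in the code in these notes.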
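Import libraries
re: regular expression operations
shutil: high-level file operations, including functions for copying and deleting files
os: miscellaneous operating system interfaces
string: common string operations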
import matplotlib.pyplot as plt
import os
import re
import shutil
import string
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import losses
os.getcwd()
'd:\\Desktop files\\Python_projects\\TensFlow_tutorial_personal'
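As a quick illustration of what the re and string modules imported above are typically used for in text preprocessing, here is a minimal sketch. The function name strip_html_and_punctuation is hypothetical and not part of TensorFlow; it assumes the text may contain literal `<br />` tags and ASCII punctuation that should be dropped.

```python
import re
import string


def strip_html_and_punctuation(text):
    """Lowercase text, replace <br /> tags with spaces, drop ASCII punctuation.

    Hypothetical helper for illustration only.
    """
    lowered = text.lower()
    # Replace HTML line-break tags with a single space.
    no_breaks = re.sub(r"<br\s*/?>", " ", lowered)
    # re.escape makes every punctuation character safe inside a character class.
    pattern = "[" + re.escape(string.punctuation) + "]"
    return re.sub(pattern, "", no_breaks)


print(strip_html_and_punctuation("Great movie!<br />Loved it."))
# great movie loved it
```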
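Download the IMDB dataset
Explore the directory structure
The goal of this tutorial is to classify movie reviews as positive or negative, which is a binary classification task.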
tf.keras.utils.get_file()
Downloads a file from a URL if it is not already in the cache.
tf.keras.utils.get_file(
    fname=None,
    origin=None,
    untar=False,
    md5_hash=None,
    file_hash=None,
    cache_subdir='datasets',
    hash_algorithm='auto',
    extract=False,
    archive_format='auto',
    cache_dir=None,
    force_download=False
)
os.path.join() joins path components, inserting the operating system's path separator ('\' on Windows) between them.
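To make the separator behaviour concrete, the sketch below joins the same components with the Windows and POSIX flavours of os.path (the standard-library modules ntpath and posixpath). The component names are taken from the paths used in these notes; the variable names are only illustrative.

```python
import ntpath      # Windows flavour of os.path
import os
import posixpath   # POSIX flavour of os.path

parts = ("aclImdb_v1", "aclImdb", "train")

windows_path = ntpath.join(*parts)
posix_path = posixpath.join(*parts)
# os.path.join uses whichever flavour matches the current operating system.
native_path = os.path.join(*parts)

print(windows_path)  # aclImdb_v1\aclImdb\train
print(posix_path)    # aclImdb_v1/aclImdb/train
print(native_path)
```

Building paths with os.path.join instead of hard-coding '\\' keeps the same notebook usable on Windows, Linux, and macOS.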
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
dataset = tf.keras.utils.get_file("aclImdb_v1", url, untar=True, cache_dir='.', cache_subdir='')
Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
84125825/84125825 ━━━━━━━━━━━━━━━━━━━━ 13s 0us/step
dataset_dir = os.path.join('.\\aclImdb_v1', 'aclImdb')
os.listdir(dataset_dir)
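The folder path can be written either as .//aclImdb_v1//aclImdb or as .\aclImdb_v1\aclImdb. The path used in the official tutorial did not find the folder; locate the extracted folder yourself and adjust the path accordingly.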
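If it is unclear where the archive was extracted, one option is to search for the folder by name instead of guessing the path. The following is a minimal sketch using only the standard library; find_dir is a hypothetical helper, and the temporary directory merely stands in for the real download location.

```python
import os
import tempfile


def find_dir(root, target_name):
    """Return the first directory named target_name under root, or None.

    Hypothetical helper for illustration only.
    """
    for current, subdirs, _files in os.walk(root):
        if target_name in subdirs:
            return os.path.join(current, target_name)
    return None


# Demo on a throwaway directory tree that mimics the layout in these notes.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "aclImdb_v1", "aclImdb", "train"))
    found = find_dir(root, "aclImdb")
    print(os.path.relpath(found, root))
```

In the notebook, a call such as find_dir('.', 'aclImdb') would search from the current working directory.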
train_dir = os.path.join(dataset_dir,'train')
os.listdir(train_dir)
# print(train_dir)
['labeledBow.feat',
'neg',
'pos',
'unsup',
'unsupBow.feat',
'urls_neg.txt',
'urls_pos.txt',
'urls_unsup.txt']
The aclImdb/train/pos and aclImdb/train/neg directories contain many text files, each of which is a single movie review.
sample_file = os.path.join(train_dir,'pos/1181_9.txt')
with open(sample_file) as f:
    print(f.read())
Rachel Griffiths writes and directs this award winning short film. A heartwarming story about coping with grief and cherishing the memory of those we've loved and lost. Although, only 15 minutes long, Griffiths manages to capture so much emotion and truth onto film in the short space of time. Bud Tingwell gives a touching performance as Will, a widower struggling to cope with his wife's death. Will is confronted by the harsh reality of loneliness and helplessness as he proceeds to take care of Ruth's pet cow, Tulip. The film displays the grief and responsibility one feels for those they have loved and lost. Good cinematography, great direction, and superbly acted. It will bring tears to all those who have lost a loved one, and survived.
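Load the dataset
Load the data and prepare it in a format suitable for training. The tf.keras.utils.text_dataset_from_directory utility expects the data directory to be structured as follows: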
main_directory/
...class_a/
......a_text_1.txt
......a_text_2.txt
...class_b/
......b_text_1.txt
......b_text_2.txt
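Before loading, it can help to check that a directory really follows this layout. The sketch below counts the .txt files in each class subdirectory using only the standard library; count_files_per_class is a hypothetical helper, and the temporary directory with class_a and class_b stands in for a real dataset.

```python
import os
import tempfile


def count_files_per_class(main_directory):
    """Map each class subdirectory to its number of .txt files.

    Hypothetical helper for illustration only.
    """
    counts = {}
    for entry in sorted(os.listdir(main_directory)):
        class_dir = os.path.join(main_directory, entry)
        if os.path.isdir(class_dir):
            counts[entry] = sum(
                name.endswith(".txt") for name in os.listdir(class_dir)
            )
    return counts


# Build a tiny directory tree in the expected layout and inspect it.
with tempfile.TemporaryDirectory() as main_directory:
    for class_name, n_files in (("class_a", 2), ("class_b", 3)):
        class_dir = os.path.join(main_directory, class_name)
        os.makedirs(class_dir)
        for i in range(n_files):
            path = os.path.join(class_dir, f"text_{i}.txt")
            with open(path, "w", encoding="utf-8") as f:
                f.write("placeholder review")
    demo_counts = count_files_per_class(main_directory)

print(demo_counts)  # {'class_a': 2, 'class_b': 3}
```

Every subdirectory is treated as a class, so any extra folder inside the training directory would show up here as an additional class.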
Remove the other folders
shutil.rmtree() deletes a directory together with all of its files and subdirectories.
remove_dir = os.path.join(train_dir,'unsup')
shutil.rmtree(remove_dir)
os.listdir(train_dir)
['labeledBow.feat',
'neg',
'pos',
'unsupBow.feat',
'urls_neg.txt',
'urls_pos.txt',
'urls_unsup.txt']
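Note that shutil.rmtree() raises FileNotFoundError if the directory no longer exists, which happens when the removal cell is run a second time. A minimal sketch of a guarded version follows; remove_dir_if_present is a hypothetical helper, and the temporary directory stands in for the real train folder.

```python
import os
import shutil
import tempfile


def remove_dir_if_present(path):
    """Delete path recursively; return True if something was removed.

    Hypothetical helper for illustration only.
    """
    if os.path.isdir(path):
        shutil.rmtree(path)
        return True
    return False


with tempfile.TemporaryDirectory() as train_dir_demo:
    unsup_demo = os.path.join(train_dir_demo, "unsup")
    os.makedirs(unsup_demo)
    first = remove_dir_if_present(unsup_demo)   # directory exists, gets removed
    second = remove_dir_if_present(unsup_demo)  # already gone, nothing to do

print(first, second)  # True False
```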
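Split the dataset into three parts: training, validation, and test.
Since a test set already exists, split the training set into a training set and a validation set.
Use the text_dataset_from_directory utility to do the split.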
batch_size = 32
seed = 42
raw_train_ds = tf.keras.utils.text_dataset_from_directory(
    'aclImdb_v1/aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed
)
Found 25000 files belonging to 2 classes.
Using 20000 files for training.
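The 20000 figure follows from validation_split=0.2 applied to 25000 files. The sketch below reproduces that arithmetic with a seeded shuffle in plain Python. It is only an illustration of the idea, not how Keras implements the split internally; split_indices is a hypothetical helper.

```python
import random


def split_indices(n_files, validation_split, seed):
    """Shuffle indices with a fixed seed, then reserve the tail for validation.

    Hypothetical helper for illustration only.
    """
    indices = list(range(n_files))
    random.Random(seed).shuffle(indices)
    n_val = int(n_files * validation_split)
    return indices[: n_files - n_val], indices[n_files - n_val:]


train_idx, val_idx = split_indices(25000, 0.2, seed=42)
print(len(train_idx), len(val_idx))        # 20000 5000
print(set(train_idx).isdisjoint(val_idx))  # True
```

Because both subsets come from the same seeded shuffle, they cannot overlap. This is why the same seed should be passed when the 'training' and 'validation' subsets are requested in separate calls.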
for text_batch, label_batch in raw_train_ds.take(1):
    for i in range(5):
        print("Review", text_batch[i])
        print(" ", label_batch[i])
Review tf.Tensor(b'"Pandemonium" is a horror movie spoof that comes off more stupid than funny. Believe me when I tell you, I love comedies. Especially comedy spoofs. "Airplane", "The Naked Gun" trilogy, "Blazing Saddles", "High Anxiety", and "Spaceballs" are some of my favorite comedies that spoof a particular genre. "Pandemonium" is not up there with those films. Most of the scenes in this movie had me sitting there in stunned silence because the movie wasn\'t all that funny. There are a few laughs in the film, but when you watch