https://github.com/FudanNLP/nlp-beginner
1. Recap
Last time I implemented textCNN, but the results were not great. My guess is that sampling train_data one example at a time was to blame; I never knew how to make several samples contribute to a single descent step. Today I finally sorted out the relationship between epoch and batch: PyTorch accumulates gradients across backward() calls, and that mechanism can be used to do mini-batch gradient descent on variable-length data (a small sketch follows the numbers below). So treat the textCNN results from the previous post as just for fun; I'll go back and redo them when I have time. Also note that an epoch only counts as finished once every sample has been seen...
So the mini-batch setup used here is:
train_data_number = 156060 * 0.8 = 124848
iter_num = 10
batch_size = int(train_data_number / iter_num) = 12484
epoch = 10
update_num = epoch * iter_num = 100
That brings the number of parameter updates down to 100; previously a single epoch already did 124848 updates, so it presumably charged head-first into some local minimum... no wonder the results were so poor.
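To make the accumulation trick concrete, here is a minimal sketch, assuming a model, criterion, optimizer, and an index-encoded train_data already exist; all names here are placeholders, not the actual code from this post:
# Sketch: mini-batch gradient descent via gradient accumulation over variable-length samples.
# model / criterion / optimizer / train_data are assumed placeholders.
def train_one_epoch(model, criterion, optimizer, train_data, batch_size):
    model.train()
    for start in range(0, len(train_data) - batch_size + 1, batch_size):
        optimizer.zero_grad()                        # clear gradients once per batch
        batch = train_data[start:start + batch_size]
        for sent_tensor, label_tensor in batch:      # samples can have different lengths
            logits = model(sent_tensor)              # forward a single sample
            loss = criterion(logits, label_tensor) / len(batch)  # scale so grads average over the batch
            loss.backward()                          # gradients accumulate in .grad
        optimizer.step()                             # one parameter update per batch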
2. textRNN
The word vectors are initialized directly from glove.6B.50d.txt in GloVe's glove.6B.zip.
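For reference (a sketch only; the actual model definition comes later), once the GloVe vectors are stacked into a matrix like the pretrained_vec built below, they can be loaded into the embedding layer with nn.Embedding.from_pretrained:
# Sketch: initializing nn.Embedding from a pretrained (vocab_size, embedding_size) matrix.
# pretrained_vec is the numpy array assembled in the code below.
import torch
import torch.nn as nn
embedding = nn.Embedding.from_pretrained(
    torch.tensor(pretrained_vec, dtype=torch.float),
    freeze=False)  # freeze=False lets the GloVe vectors be fine-tuned during training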
2.1. Code
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import random_split
import pandas as pd
import numpy as np
import random
import copy
# data-reading part, same as before
read_data = pd.read_table('../train.tsv')
data = []
data_len = read_data.shape[0]
for i in range(data_len):
    data.append([read_data['Phrase'][i].lower().split(' '), read_data['Sentiment'][i]])
word_to_ix = {}  # assign each word an index
ix_to_word = {}
word_set = set()
for sent, _ in data:
    for word in sent:
        if word not in word_to_ix:
            ix_to_word[len(word_to_ix)] = word
            word_to_ix[word] = len(word_to_ix)
            word_set.add(word)
unk = '<unk>'
ix_to_word[len(word_to_ix)] = unk
word_to_ix[unk] = len(word_to_ix)
word_set.add(unk)
torch.manual_seed(6)  # set torch's seed; affects the parameter initialization and random_split below
train_len = int(0.8 * data_len)
test_len = data_len - train_len
train_data, test_data = random_split(data, [train_len, test_len])  # split the dataset
# print(type(train_data)) # torch.utils.data.dataset.Subset
train_data = list(train_data)
test_data = list(test_data)
# hyperparameter dict, to make tuning easier
args = {
    'vocab_size': len(word_to_ix),  # number of words; the embedding layer needs this to build its vectors
    'embedding_size': 50,  # dimensionality of each word vector (number of features)
    'hidden_size': 16,
    'type_num': 5,  # number of classes
    'train_batch_size': int(train_len / 10),
    # 'test_batch_size': int(test_len / 10)
}
f = open('../glove.6B.50d.txt', 'r', encoding='utf-8')
line = f.readline()
glove_word2vec = {}
pretrained_vec = []
while line:
    line = line.split()
    word = line[0]
    if word in word_set:
        glove_word2vec[word] = [float(v) for v in line[1:]]
    line = f.readline()
unk_num = 0
# '<unk>' itself is not a token in glove.6B, so give it a fallback vector (zeros here) to avoid a KeyError below
if unk not in glove_word2vec:
    glove_word2vec[unk] = [0.0] * args['embedding_size']
for i in range(args['vocab_size']):
    if ix_to_word[i] in glove_word2vec:
        pretrained_vec.append(glove_word2vec[ix_to_word[i]])
    else:
        pretrained_vec.append(glove_word2vec[unk])
        unk_num += 1
print(unk_num, args['vocab_size'])
pretrained_vec = np.array(pretrained_vec)
train_len = int(int(train_len / args['train_batch_size']) * args['train_batch_size'])  # drop the trailing samples that don't fill a full batch; keeping them has a fairly big effect on the loss, e.g.