1. NNLM
A Neural Probabilistic Language Model
NNLM is one of the earliest and most influential neural language models in natural language processing.
Paper reference: NNLM (A Neural Probabilistic Language Model, Bengio et al., 2003)
2. Code
# %%
# code by Tae Hwan Jung @graykode
import torch
import torch.nn as nn
import torch.optim as optim

def make_batch():
    input_batch = []
    target_batch = []

    for sen in sentences:
        word = sen.split()  # space tokenizer
        input = [word_dict[n] for n in word[:-1]]  # use words (1 ~ n-1) as input
        target = word_dict[word[-1]]  # use word (n) as target; this is usually called a 'causal language model'

        input_batch.append(input)
        target_batch.append(target)

    return input_batch, target_batch

# Model
class NNLM(nn.Module):
    def __init__(self):
        super(NNLM, self).__init__()
        self.C = nn.Embedding(n_class, m)
        self.H = nn.Linear(n_step * m, n_hidden, bias=False)
        self.d = nn.Parameter(torch.ones(n_hidden))
        self.U = nn.Linear(n_hidden, n_class, bias=False)
        self.W = nn.Linear(n_step * m, n_class, bias=False)
        self.b = nn.Parameter(torch.ones(n_class))

    def forward(self, X):
        X = self.C(X)  # X : [batch_size, n_step, m]
        X = X.view(-1, n_step * m)  # [batch_size, n_step * m]
        tanh = torch.tanh(self.d + self.H(X))  # [batch_size, n_hidden]
        output = self.b + self.W(X) + self.U(tanh)  # [batch_size, n_class]
        return output

if __name__ == '__main__':
    n_step = 2  # number of steps, n-1 in paper
    n_hidden = 2  # hidden size, h in paper
    m = 2  # embedding size, m in paper

    sentences = ["i like dog", "i love coffee", "i hate milk"]

    word_list = " ".join(sentences).split()
    word_list = list(set(word_list))
    word_dict = {w: i for i, w in enumerate(word_list)}
    number_dict = {i: w for i, w in enumerate(word_list)}
    n_class = len(word_dict)  # vocabulary size

    model = NNLM()

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    input_batch, target_batch = make_batch()
    input_batch = torch.LongTensor(input_batch)
    target_batch = torch.LongTensor(target_batch)

    # Training
    for epoch in range(5000):
        optimizer.zero_grad()
        output = model(input_batch)

        # output : [batch_size, n_class], target_batch : [batch_size]
        loss = criterion(output, target_batch)
        if (epoch + 1) % 1000 == 0:
            print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss))

        loss.backward()
        optimizer.step()

    # Predict
    predict = model(input_batch).data.max(1, keepdim=True)[1]

    # Test
    print([sen.split()[:2] for sen in sentences], '->', [number_dict[n.item()] for n in predict.squeeze()])
In this code:
sentences holds three sentences. From the word_list array ['love', 'milk', 'dog', 'hate', 'i', 'coffee', 'like'] two dictionaries are built:
word_dict: {'love': 0, 'milk': 1, 'dog': 2, 'hate': 3, 'i': 4, 'coffee': 5, 'like': 6}
number_dict: {0: 'love', 1: 'milk', 2: 'dog', 3: 'hate', 4: 'i', 5: 'coffee', 6: 'like'}
n_class is 7.
Next the input data is prepared:
input_batch is [[4, 6], [4, 0], [4, 3]]
target_batch is [2, 5, 1]
All of these numbers are indices; look them up in number_dict to recover the words, e.g. 4, 6 is "i like".
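As a quick check, the same batches can be rebuilt by hand. This is a minimal sketch that hard-codes the dictionary order shown above; in the real script the order of word_list comes from list(set(...)) and can differ between runs.

# Rebuild input_batch / target_batch with the dictionary order shown above
sentences = ["i like dog", "i love coffee", "i hate milk"]
word_dict = {'love': 0, 'milk': 1, 'dog': 2, 'hate': 3, 'i': 4, 'coffee': 5, 'like': 6}

input_batch = [[word_dict[w] for w in s.split()[:-1]] for s in sentences]
target_batch = [word_dict[s.split()[-1]] for s in sentences]
print(input_batch)   # [[4, 6], [4, 0], [4, 3]]
print(target_batch)  # [2, 5, 1]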
During training, output = model(input_batch) pushes the whole batch through the model to get the scores.
This invokes the forward function of class NNLM;
the argument is input_batch.
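A small aside (a sketch assuming the training script above has already been run and no hooks are registered): calling the module and calling forward directly give the same result, because nn.Module.__call__ dispatches to forward.

# model(input_batch) goes through nn.Module.__call__, which calls forward
out1 = model(input_batch)
out2 = model.forward(input_batch)
print(torch.equal(out1, out2))  # True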
The input is processed with nn.Embedding(n_class, m).
What is that? Any NLP model that works with word vectors uses it. In plain terms:
an Embedding maps each word in the vocabulary to a vector (its weights form a lookup matrix). For example, map each element of the 3-word vocabulary W = {'优', '良', '差'} ("good", "fair", "poor") to a 5-dimensional vector:
import torch
import torch.nn as nn

# map three words to 5-dimensional vectors
embed = nn.Embedding(num_embeddings=3, embedding_dim=5)

# look up the first word
print(embed(torch.tensor([0], dtype=torch.int64)))
tensor([[ 0.2817, -0.8119, -0.8011, -0.1379, -0.5013]],
       grad_fn=<EmbeddingBackward0>)

# look up the second word
print(embed(torch.tensor([1], dtype=torch.int64)))
tensor([[-1.3766,  1.1211, -1.0014,  0.9673, -0.4131]],
       grad_fn=<EmbeddingBackward0>)

# look up the third word
print(embed(torch.tensor([2], dtype=torch.int64)))
tensor([[ 0.3854, -0.7950, -0.1810,  0.6227, -1.6562]],
       grad_fn=<EmbeddingBackward0>)
The initial values are random, but they are updated as training proceeds. In an NLP task, for example, as the network trains, the vectors for '优' (good) and '良' (fair) will gradually become more similar, while the vectors for '优' and '差' (poor) will gradually drift apart.
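Under the hood, nn.Embedding is nothing more than a trainable lookup table: embed.weight is a num_embeddings x embedding_dim matrix, and embed(torch.tensor([i])) simply returns row i of that matrix. A minimal sketch:

import torch
import torch.nn as nn

embed = nn.Embedding(num_embeddings=3, embedding_dim=5)
print(embed.weight.shape)   # torch.Size([3, 5]) - one row per word

idx = torch.tensor([1])
# the lookup is exactly row selection from the weight matrix
print(torch.equal(embed(idx), embed.weight[idx]))  # True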
Back to nn.Embedding(n_class, m): n_class is 7 and m is 2, so the embedding weight is a 7×2 lookup table. Each index in the input is replaced by its 2-dimensional vector, so each 2-word input row becomes a 2×2 matrix.
Printing X = self.C(X) gives:
tensor([[[-0.3372,  0.5522],
         [-0.0847,  0.5518]],

        [[-0.3372,  0.5522],
         [-0.4868, -0.8433]],

        [[-0.3372,  0.5522],
         [ 1.0267, -0.8303]]], grad_fn=<EmbeddingBackward0>)
This replaces the original input [[4, 6], [4, 0], [4, 3]]: inside the innermost [4, 6], the index 4 becomes its embedding [-0.3372, 0.5522], and so on. Note that index 4 ('i') maps to the same vector in all three rows.
X = X.view(-1, n_step * m) then flattens the inner two dimensions into one:
tensor([[-0.3372,  0.5522, -0.0847,  0.5518],
        [-0.3372,  0.5522, -0.4868, -0.8433],
        [-0.3372,  0.5522,  1.0267, -0.8303]], grad_fn=<ViewBackward0>)
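The shape changes can be verified in isolation. The sketch below uses a freshly initialized embedding, so the values differ from the trace above; only the shapes matter.

import torch
import torch.nn as nn

n_class, m, n_step = 7, 2, 2
C = nn.Embedding(n_class, m)                        # 7 x 2 lookup table
idx = torch.LongTensor([[4, 6], [4, 0], [4, 3]])    # [batch_size=3, n_step=2]

X = C(idx)
print(X.shape)                 # torch.Size([3, 2, 2]) - each index becomes a 2-dim vector
X = X.view(-1, n_step * m)
print(X.shape)                 # torch.Size([3, 4]) - the two word vectors concatenated per sentence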
Next, the self.d parameter:
nn.Parameter(torch.ones(n_hidden)), where n_hidden is 2, so it prints as
tensor([1., 1.], requires_grad=True)
The self.H layer:
nn.Linear(n_step * m, n_hidden, bias=False) maps the dimension from n_step * m down to n_hidden, i.e. from 4 to 2:
tensor([[-0.0315,  0.0569],
        [-0.5814, -0.4416],
        [-0.3996,  0.0209]], grad_fn=<MmBackward0>)
In other words, the first row [-0.3372, 0.5522, -0.0847, 0.5518] is turned into [-0.0315, 0.0569] by the linear transformation.
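That "linear transformation" is just a matrix multiplication: with bias=False, nn.Linear stores a weight of shape [out_features, in_features] and computes X @ weight.T. A minimal check:

import torch
import torch.nn as nn

H = nn.Linear(4, 2, bias=False)                 # weight shape: [2, 4]
X = torch.randn(3, 4)                           # [batch_size, n_step * m]
print(torch.allclose(H(X), X @ H.weight.T))     # True: Linear(X) == X @ W^T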
self.d + self.H(X) then adds 1 to every element (d is initialized to all ones):
tensor([[0.9685, 1.0569],
        [0.4186, 0.5584],
        [0.6004, 1.0209]], grad_fn=<AddBackward0>)
After the torch.tanh activation we get:
tensor([[0.7480, 0.7845],
        [0.3957, 0.5068],
        [0.5373, 0.7702]], grad_fn=<TanhBackward0>)
The final output is self.b + self.W(X) + self.U(tanh),
where self.W(X) maps n_step * m to n_class dimensions,
self.U(tanh) maps n_hidden to n_class dimensions,
and self.b is also n_class-dimensional; the three are added together (a shape check follows after the tensors below).
The three terms are, respectively:
tensor([1., 1., 1., 1., 1., 1., 1.], requires_grad=True)
tensor([[-0.2555,  0.0985, -0.1871,  0.1405,  0.3175,  0.0423, -0.0021],
        [ 0.1171, -0.0690, -0.4293, -0.1167, -0.0510, -0.3074, -0.1122],
        [-0.2951, -0.3981, -0.9634,  0.1058,  0.3930, -0.1794, -0.3963]],
       grad_fn=<MmBackward0>)
tensor([[ 0.0344,  0.6908,  0.4222, -0.2680,  0.2775, -0.1408,  0.9677],
        [ 0.0444,  0.4080,  0.2628, -0.1935,  0.1456, -0.0672,  0.5648],
        [ 0.0838,  0.5921,  0.3921, -0.3089,  0.1966, -0.0847,  0.8142]],
       grad_fn=<MmBackward0>)
Adding them gives:
tensor([[0.7789, 1.7894, 1.2352, 0.8725, 1.5950, 0.9016, 1.9656],
        [1.1616, 1.3390, 0.8335, 0.6898, 1.0946, 0.6254, 1.4526],
        [0.7887, 1.1939, 0.4287, 0.7968, 1.5896, 0.7358, 1.4179]],
       grad_fn=<AddBackward0>)
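Put together, forward implements the NNLM equation from the paper, y = b + W·x + U·tanh(d + H·x), where x is the concatenated embedding vector of the n-1 context words. A standalone shape check (randomly initialized, so the values differ from the trace above):

import torch
import torch.nn as nn

n_step, m, n_hidden, n_class = 2, 2, 2, 7
C = nn.Embedding(n_class, m)
H = nn.Linear(n_step * m, n_hidden, bias=False)
d = torch.ones(n_hidden)
U = nn.Linear(n_hidden, n_class, bias=False)
W = nn.Linear(n_step * m, n_class, bias=False)
b = torch.ones(n_class)

idx = torch.LongTensor([[4, 6], [4, 0], [4, 3]])
x = C(idx).view(-1, n_step * m)            # [3, 4]
y = b + W(x) + U(torch.tanh(d + H(x)))     # [3, 7]: one score per vocabulary word
print(y.shape)                             # torch.Size([3, 7])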
Then the loss is computed from this output and the targets tensor([2, 5, 1]):
loss = criterion(output, target_batch)
The loss value here is tensor(2.3135, grad_fn=<NllLossBackward0>).
Computing gradients from the loss and updating the parameters so that the loss keeps shrinking is the framework's job.
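For reference, CrossEntropyLoss applies log-softmax to the scores and then averages the negative log-probability of the target class over the batch. The loss can be recomputed by hand from output and target_batch (a sketch assuming the variables from the script above are in scope):

import torch
import torch.nn.functional as F

# equivalent to nn.CrossEntropyLoss()(output, target_batch)
log_probs = F.log_softmax(output, dim=1)                                  # [batch_size, n_class]
manual_loss = -log_probs[torch.arange(len(target_batch)), target_batch].mean()
print(manual_loss)   # matches criterion(output, target_batch)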
Finally, we predict on the training data itself: passing input_batch through the model gives predict:
tensor([[2],
        [5],
        [1]])
Looking these indices up in number_dict {0: 'love', 1: 'milk', 2: 'dog', 3: 'hate', 4: 'i', 5: 'coffee', 6: 'like'} yields the predicted words.
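The .max(1, keepdim=True)[1] call is simply an argmax over the class dimension; an equivalent and perhaps clearer spelling (assuming the script above has been run):

# same indices as model(input_batch).data.max(1, keepdim=True)[1]
pred_idx = torch.argmax(model(input_batch), dim=1)     # [batch_size]
print([number_dict[i.item()] for i in pred_idx])       # e.g. ['dog', 'coffee', 'milk']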
The final output is:
Epoch: 1000 cost = 0.102355
Epoch: 2000 cost = 0.015673
Epoch: 3000 cost = 0.005281
Epoch: 4000 cost = 0.002299
Epoch: 5000 cost = 0.001127
tensor([[2],
        [5],
        [1]])
[['i', 'like'], ['i', 'love'], ['i', 'hate']] -> ['dog', 'coffee', 'milk']
This article has shown in detail how to build and train a neural probabilistic language model (NNLM) with PyTorch, covering the word embeddings, the model structure, and the training loop, and demonstrated how the model predicts the next word of a text sequence.
