Character level language model - Dinosaurus Island
Welcome to Dinosaurus Island! 65 million years ago, dinosaurs existed, and in this assignment they are back. You are in charge of a special task. Leading biology researchers are creating new breeds of dinosaurs and bringing them to Earth, and your job is to give names to these dinosaurs. If a dinosaur does not like its name, it might go berserk, so choose wisely!
Luckily you have learned some deep learning and you will use it to save the day. Your assistant has collected a list of all the dinosaur names they could find and compiled them into this dataset. (Feel free to take a look by clicking the previous link.) To create new dinosaur names, you will build a character-level language model to generate new names. Your algorithm will learn the different name patterns and randomly generate new names. Hopefully this algorithm will keep you and your team safe from the dinosaurs' wrath!
By completing this assignment you will learn:
- How to store text data for processing using an RNN
- How to synthesize data by sampling predictions at each time step and passing them to the next RNN cell
- How to build a character-level text generation recurrent neural network
- Why clipping the gradients is important
We will begin by loading some functions that we have provided for you in `rnn_utils`. Specifically, you have access to functions such as `rnn_forward` and `rnn_backward`, which are equivalent to those you implemented in the previous assignment.
import numpy as np
from utils import *
import random
from random import shuffle
1 - Problem Statement
1.1 - Dataset and Preprocessing
Run the following cell to read the dataset of dinosaur names, create a list of unique characters (such as a-z), and compute the dataset and vocabulary size.
data = open('dinos.txt', 'r').read()
data= data.lower()
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print('There are %d total characters and %d unique characters in your data.' % (data_size, vocab_size))
There are 19909 total characters and 27 unique characters in your data.
The characters are a-z (26 characters) plus the "\n" (or newline character), which in this assignment plays a role similar to the `<EOS>` (or "End of sentence") token we had discussed in lecture, only here it indicates the end of the dinosaur name rather than the end of a sentence. In the cell below, we create a python dictionary (i.e., a hash table) to map each character to an index from 0-26. We also create a second python dictionary that maps each index back to the corresponding character. This will help you figure out which index corresponds to which character in the probability distribution output of the softmax layer. Below, `char_to_ix` and `ix_to_char` are the python dictionaries.
char_to_ix = { ch:i for i,ch in enumerate(sorted(chars)) }
ix_to_char = { i:ch for i,ch in enumerate(sorted(chars)) }
print(ix_to_char)
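Since `sorted(chars)` places the newline character before the letters, the printed mapping should look like `{0: '\n', 1: 'a', 2: 'b', ..., 25: 'y', 26: 'z'}`.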
1.2 - Overview of the model
Your model will have the following structure:
- Initialize parameters
- Run the optimization loop
  - Forward propagation to compute the loss function
  - Backward propagation to compute the gradients with respect to the loss function
  - Clip the gradients to avoid exploding gradients
  - Update the parameters using the gradient descent update rule
- Return the learned parameters
Figure 1: Recurrent Neural Network, similar to what you had built in the previous notebook "Building a RNN - Step by Step".
At each time step, the RNN tries to predict what the next character is, given the previous characters. The dataset $X = (x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, ..., x^{\langle T_x \rangle})$ is a list of characters in the training set, while $Y = (y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, ..., y^{\langle T_x \rangle})$ is the prediction at each time step $t$, such that $y^{\langle t \rangle} = x^{\langle t+1 \rangle}$.
2 - Building blocks of the model
In this part, you will build two important blocks of the overall model:
- Gradient clipping: to avoid exploding gradients
- Sampling: a technique used to generate characters
You will then apply these two functions to build the model.
2.1 - Clipping the gradients in the optimization loop
In this section you will implement the `clip` function that you will call inside your optimization loop. Recall that your overall loop structure usually consists of a forward pass, a cost computation, a backward pass, and a parameter update. Before updating the parameters, you will perform gradient clipping when needed, to make sure that your gradients do not "explode", meaning take on overly large values.
In the exercise below, you will implement a function `clip` that takes in a dictionary of gradients and returns a clipped version of the gradients when needed. There are different ways to clip gradients; we will use a simple element-wise clipping procedure, in which every element of the gradient vector is clipped to lie in some range [-N, N]. More generally, you will provide a `maxValue` (say 10). In this example, if any component of the gradient vector is greater than 10, it is set to 10; if any component of the gradient vector is less than -10, it is set to -10. If it is between -10 and 10, it is left alone.
Figure 2: Visualization of gradient descent with and without gradient clipping, in a case where the network is running into slight "exploding gradient" problems.
Exercise: Implement the function below to return the clipped gradients of your dictionary `gradients`. Your function takes in a maximum threshold and returns the clipped versions of your gradients. As a hint, a short example of element-wise clipping in numpy follows below.
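The snippet below is illustrative only (it is not part of the graded function); it shows that numpy's `np.clip` clips an array element-wise and works in place when you pass the `out` argument:

```python
import numpy as np

maxValue = 10
gradient = np.array([[12., -3.], [-15., 7.]])

# Clip every entry of "gradient" to the range [-maxValue, maxValue], in place.
np.clip(gradient, -maxValue, maxValue, out=gradient)

print(gradient)   # [[ 10.  -3.]
                  #  [-10.   7.]]
```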
### GRADED FUNCTION: clip
def clip(gradients, maxValue):
'''
Clips the gradients' values between minimum and maximum.
Arguments:
gradients -- a dictionary containing the gradients "dWaa", "dWax", "dWya", "db", "dby"
maxValue -- everything above this number is set to this number, and everything less than -maxValue is set to -maxValue
Returns:
gradients -- a dictionary with the clipped gradients.
'''
dWaa, dWax, dWya, db, dby = gradients['dWaa'], gradients['dWax'], gradients['dWya'], gradients['db'], gradients['dby']
### START CODE HERE ###
# clip to mitigate exploding gradients, loop over [dWax, dWaa, dWya, db, dby]. (≈2 lines)
for gradient in [dWax, dWaa, dWya, db, dby]:
None
### END CODE HERE ###
gradients = {"dWaa": dWaa, "dWax": dWax, "dWya": dWya, "db": db, "dby": dby}
return gradients
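For reference, here is one way the blanks above could be filled in, assuming the same interface as the stub (a sketch of a possible solution, not necessarily the official one):

```python
def clip(gradients, maxValue):
    """Clips each gradient in the dictionary to the range [-maxValue, maxValue], in place."""
    dWaa, dWax, dWya, db, dby = gradients['dWaa'], gradients['dWax'], gradients['dWya'], gradients['db'], gradients['dby']

    # Element-wise clipping of each gradient to mitigate exploding gradients.
    for gradient in [dWax, dWaa, dWya, db, dby]:
        np.clip(gradient, -maxValue, maxValue, out=gradient)

    gradients = {"dWaa": dWaa, "dWax": dWax, "dWya": dWya, "db": db, "dby": dby}
    return gradients
```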
2.2 - Sampling
Now assume that your model is trained. You would like to generate new text (characters). The process of generation is explained in the picture below:
Figure 3: In this picture, we assume the model is already trained. We pass in $x^{\langle 1 \rangle} = \vec{0}$ at the first time step, and have the network sample one character at a time.
Exercise: Implement the `sample` function below to sample characters. You need to carry out 4 steps:
- Step 1: Pass the network the first "dummy" input $x^{\langle 1 \rangle} = \vec{0}$ (the vector of zeros). This is the default input before we have generated any characters. We also set $a^{\langle 0 \rangle} = \vec{0}$.
- Step 2: Run one step of forward propagation to get $a^{\langle 1 \rangle}$ and $\hat{y}^{\langle 1 \rangle}$. Here are the equations:
$$a^{\langle t+1 \rangle} = \tanh(W_{ax} x^{\langle t \rangle} + W_{aa} a^{\langle t \rangle} + b)\tag{1}$$
$$z^{\langle t+1 \rangle} = W_{ya} a^{\langle t+1 \rangle} + b_y\tag{2}$$
$$\hat{y}^{\langle t+1 \rangle} = softmax(z^{\langle t+1 \rangle})\tag{3}$$
Note that $\hat{y}^{\langle t+1 \rangle}$ is a (softmax) probability vector (its entries are between 0 and 1 and sum to 1). $\hat{y}^{\langle t+1 \rangle}_i$ represents the probability that the character indexed by "i" is the next character. We have provided a `softmax()` function that you can use.
- Step 3: Carry out sampling: pick the next character's index according to the probability distribution specified by $\hat{y}^{\langle t+1 \rangle}$. This means that if $\hat{y}^{\langle t+1 \rangle}_i = 0.16$, you will pick the index "i" with 16% probability. To implement it, you can use `np.random.choice`.
Here is an example of how to use `np.random.choice()`:
np.random.seed(0)
p = np.array([0.1, 0.0, 0.7, 0.2])
index = np.random.choice([0, 1, 2, 3], p = p.ravel())
This means that you will pick the `index` according to the distribution:

$$P(index = 0) = 0.1,\ P(index = 1) = 0.0,\ P(index = 2) = 0.7,\ P(index = 3) = 0.2$$
- Step 4: The last step to implement in `sample()` is to overwrite the variable `x`, which currently stores $x^{\langle t \rangle}$, with the value of $x^{\langle t+1 \rangle}$. You will represent $x^{\langle t+1 \rangle}$ by creating a one-hot vector corresponding to the character you have chosen as your prediction. You will then forward propagate $x^{\langle t+1 \rangle}$ in Step 1 and keep repeating the process until you get a "\n" character, indicating that you have reached the end of the dinosaur name.
# GRADED FUNCTION: sample
def sample(parameters, char_to_ix, seed):
"""
Sample a sequence of characters according to a sequence of probability distributions output of the RNN
Arguments:
parameters -- python dictionary containing the parameters Waa, Wax, Wya, by, and b.
char_to_ix -- python dictionary mapping each character to an index.
seed -- used for grading purposes. Do not worry about it.
Returns:
indices -- a list of length n containing the indexes of the sampled characters.
"""
# Retrieve parameters and relevant shapes from "parameters" dictionary
Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
vocab_size = by.shape[0]
n_a = Waa.shape[1]
### START CODE HERE ###
# Step 1: Create the one-hot vector x for the first character (initializing the sequence generation). (≈1 line)
x = None
# Step 1': Initialize a_prev as zeros (≈1 line)
a_prev = None
# Create an empty list of indices, this is the list which will contain the list of indexes of the characters to generate (≈1 line)
indices = None
# Idx is a flag to detect a newline character, we initialize it to -1
idx = -1
# Loop over time-steps t. At each time-step, sample a character from a probability distribution and append
# its index to "indexes". We'll stop if we reach 50 characters (which should be very unlikely with a well
# trained model), which helps debugging and prevents entering an infinite loop.
counter = 0
newline_character = char_to_ix['\n']
while (idx != newline_character and counter != 50):
# Step 2: Forward propagate x using the equations (1), (2) and (3)
a = None
z = None
y = None
# for grading purposes
np.random.seed(counter+seed)
# Step 3: Sample the index of a character within the vocabulary from the probability distribution y
idx = None
# Append the index to "indices"
None
# Step 4: Overwrite the input character as the one corresponding to the sampled index.
x = None
x[None] = None
# Update "a_prev" to be "a"
a_prev = None
# for grading purposes
seed += 1
counter +=1
### END CODE HERE ###
if (counter == 50):
indices.append(char_to_ix['\n'])
return indices
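For reference, here is one possible completion of `sample()`. It assumes the parameter shapes described in the stub and the `softmax()` helper provided with the assignment; treat it as a sketch rather than the definitive graded solution:

```python
def sample(parameters, char_to_ix, seed):
    """Samples a sequence of character indices from the trained RNN, one character at a time."""
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    vocab_size = by.shape[0]
    n_a = Waa.shape[1]

    # Step 1: the first input x<1> and the previous hidden state a<0> are zero vectors.
    x = np.zeros((vocab_size, 1))
    a_prev = np.zeros((n_a, 1))

    indices = []                         # indices of the sampled characters
    idx = -1                             # index of the most recently sampled character
    counter = 0
    newline_character = char_to_ix['\n']

    while (idx != newline_character and counter != 50):
        # Step 2: one step of forward propagation, equations (1), (2) and (3).
        a = np.tanh(np.dot(Wax, x) + np.dot(Waa, a_prev) + b)
        z = np.dot(Wya, a) + by
        y = softmax(z)

        np.random.seed(counter + seed)   # for reproducibility / grading

        # Step 3: sample the next character's index from the distribution y.
        idx = np.random.choice(list(range(vocab_size)), p=y.ravel())
        indices.append(idx)

        # Step 4: overwrite x with the one-hot vector of the sampled character.
        x = np.zeros((vocab_size, 1))
        x[idx] = 1

        a_prev = a
        seed += 1
        counter += 1

    if (counter == 50):
        indices.append(char_to_ix['\n'])

    return indices
```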
3 - Building the language model
It is time to build the character-level language model for text generation.
3.1 - Gradient descent
In this section you will implement a function performing one step of stochastic gradient descent (with clipped gradients). You will go through the training examples one at a time, so the optimization algorithm will be stochastic gradient descent. As a reminder, here are the steps of a common optimization loop for an RNN:
- Forward propagate through the RNN to compute the loss
- Backward propagate through time to compute the gradients of the loss with respect to the parameters
- Clip the gradients if necessary
- Update your parameters using gradient descent
Exercise: Implement this optimization process (one step of stochastic gradient descent).
We provide you with the following functions:
def rnn_forward(X, Y, a_prev, parameters):
""" Performs the forward propagation through the RNN and computes the cross-entropy loss.
It returns the loss' value as well as a "cache" storing values to be used in the backpropagation."""
....
return loss, cache
def rnn_backward(X, Y, parameters, cache):
""" Performs the backward propagation through time to compute the gradients of the loss with respect
to the parameters. It also returns all the hidden states."""
...
return gradients, a
def update_parameters(parameters, gradients, learning_rate):
""" Updates parameters using the Gradient Descent Update Rule."""
...
return parameters
# GRADED FUNCTION: optimize
def optimize(X, Y, a_prev, parameters, learning_rate = 0.01):
"""
Execute one step of the optimization to train the model.
Arguments:
X -- list of integers, where each integer is a number that maps to a character in the vocabulary.
Y -- list of integers, exactly the same as X but shifted one index to the left.
a_prev -- previous hidden state.
parameters -- python dictionary containing:
Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
b -- Bias, numpy array of shape (n_a, 1)
by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
learning_rate -- learning rate for the model.
Returns:
loss -- value of the loss function (cross-entropy)
gradients -- python dictionary containing:
dWax -- Gradients of input-to-hidden weights, of shape (n_a, n_x)
dWaa -- Gradients of hidden-to-hidden weights, of shape (n_a, n_a)
dWya -- Gradients of hidden-to-output weights, of shape (n_y, n_a)
db -- Gradients of bias vector, of shape (n_a, 1)
dby -- Gradients of output bias vector, of shape (n_y, 1)
a[len(X)-1] -- the last hidden state, of shape (n_a, 1)
"""
### START CODE HERE ###
# Forward propagate through time (≈1 line)
loss, cache = None
# Backpropagate through time (≈1 line)
gradients, a = None
# Clip your gradients between -5 (min) and 5 (max) (≈1 line)
gradients = None
# Update parameters (≈1 line)
parameters = None
### END CODE HERE ###
return loss, gradients, a[len(X)-1]
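A possible completion of `optimize()`, assuming `rnn_forward`, `rnn_backward`, `clip`, and `update_parameters` behave as described above (again, a sketch rather than the official solution):

```python
def optimize(X, Y, a_prev, parameters, learning_rate=0.01):
    """One step of stochastic gradient descent on a single training example, with gradient clipping."""
    # Forward propagate through time to get the loss and cached values.
    loss, cache = rnn_forward(X, Y, a_prev, parameters)

    # Backpropagate through time to get the gradients and the hidden states.
    gradients, a = rnn_backward(X, Y, parameters, cache)

    # Clip the gradients to the range [-5, 5] to avoid exploding gradients.
    gradients = clip(gradients, 5)

    # Apply one gradient descent update to the parameters.
    parameters = update_parameters(parameters, gradients, learning_rate)

    return loss, gradients, a[len(X) - 1]
```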
3.2 - Training the model
Given the dataset of dinosaur names, we use each line of the dataset (one name) as one training example. Every 100 steps of stochastic gradient descent, you will sample 10 randomly chosen names to see how the algorithm is doing. Remember to shuffle the dataset, so that stochastic gradient descent visits the examples in random order.
Exercise: Follow the instructions and implement `model()`. When `examples[index]` contains one dinosaur name (a string), to create an example (X, Y), you can use this:
index = j % len(examples)
X = [None] + [char_to_ix[ch] for ch in examples[index]]
Y = X[1:] + [char_to_ix["\n"]]
Note that we use `index = j % len(examples)`, where `j = 1....num_iterations`, to make sure that `examples[index]` is always a valid statement (`index` is smaller than `len(examples)`). The first entry of `X` being `None` will be interpreted by `rnn_forward()` as setting $x^{\langle 0 \rangle} = \vec{0}$. Further, this ensures that `Y` is equal to `X` but shifted one step to the left, and with an additional "\n" appended to signify the end of the dinosaur name.
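As a concrete illustration, here is what the two lines above produce for a hypothetical name "mura" (chosen purely for this example; it need not be in the dataset):

```python
# Hypothetical example, assuming examples[index] == "mura":
X = [None, char_to_ix['m'], char_to_ix['u'], char_to_ix['r'], char_to_ix['a']]
Y = [char_to_ix['m'], char_to_ix['u'], char_to_ix['r'], char_to_ix['a'], char_to_ix['\n']]
# Y is X shifted one step to the left, with "\n" marking the end of the name.
```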
# GRADED FUNCTION: model
def model(data, ix_to_char, char_to_ix, num_iterations = 35000, n_a = 50, dino_names = 7, vocab_size = 27):
"""
Trains the model and generates dinosaur names.
Arguments:
data -- text corpus
ix_to_char -- dictionary that maps the index to a character
char_to_ix -- dictionary that maps a character to an index
num_iterations -- number of iterations to train the model for
n_a -- number of units of the RNN cell (dimension of the hidden state)
dino_names -- number of dinosaur names you want to sample at each iteration.
vocab_size -- number of unique characters found in the text, size of the vocabulary
Returns:
parameters -- learned parameters
"""
# Retrieve n_x and n_y from vocab_size
n_x, n_y = vocab_size, vocab_size
# Initialize parameters
parameters = initialize_parameters(n_a, n_x, n_y)
# Initialize loss (this is required because we want to smooth our loss, don't worry about it)
loss = get_initial_loss(vocab_size, dino_names)
# Build list of all dinosaur names (training examples).
with open("dinos.txt") as f:
examples = f.readlines()
examples = [x.lower().strip() for x in examples]
# Shuffle list of all dinosaur names
shuffle(examples)
# Initialize the hidden state of your RNN
a_prev = np.zeros((n_a, 1))
# Optimization loop
for j in range(num_iterations):
### START CODE HERE ###
# Use the hint above to define one training example (X,Y) (≈ 2 lines)
index = None
X = None
Y = None
# Perform one optimization step: Forward-prop -> Backward-prop -> Clip -> Update parameters
# Choose a learning rate of 0.01
curr_loss, gradients, a_prev = None
### END CODE HERE ###
# Use a latency trick to keep the loss smooth. It happens here to accelerate the training.
loss = smooth(loss, curr_loss)
# Every 2000 iterations, generate "n" characters using sample() to check if the model is learning properly
if j % 2000 == 0:
print('Iteration: %d, Loss: %f' % (j, loss) + '\n')
# The number of dinosaur names to print
seed = 0
for name in range(dino_names):
# Sample indexes and print them
sampled_indexes = sample(parameters, char_to_ix, seed)
print_sample(sampled_indexes, ix_to_char)
seed += 1 # To get the same result for grading purposed, increment the seed by one.
print('\n')
return parameters
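The blanks inside the training loop above could be filled in along the following lines (a sketch that assumes the `optimize()` function from section 3.1 and a learning rate of 0.01):

```python
# Inside the `for j in range(num_iterations):` loop above (sketch, not the graded reference):
index = j % len(examples)                                  # cycle through the shuffled names
X = [None] + [char_to_ix[ch] for ch in examples[index]]    # None -> x<0> = zero vector
Y = X[1:] + [char_to_ix["\n"]]                             # X shifted left, plus the end-of-name token

# One optimization step: forward prop -> backward prop -> clip -> update parameters.
curr_loss, gradients, a_prev = optimize(X, Y, a_prev, parameters, learning_rate=0.01)
```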
Run the following cell. You should observe your model outputting random-looking characters at the first iteration. After a few thousand iterations, your model should learn to generate reasonable-looking names.
parameters = model(data, ix_to_char, char_to_ix)
Conclusion
You can see that your algorithm has started to generate plausible dinosaur names towards the end of training. At first, it was generating random characters, but towards the end you could see dinosaur names with cool endings. Feel free to run the algorithm even longer and play with the hyperparameters to see if you can get even better results. Our implementation generated some really cool names like `maconucon`, `marloralus` and `macingsersaurus`. Your model hopefully also learned that dinosaur names tend to end in "saurus", "don", "aura", "tor", etc.
If your model generates some non-cool names, don't blame the model entirely: not all actual dinosaur names sound cool. (For example, `dromaeosauroides` is an actual dinosaur name and is in the training set.) But this model should give you a set of candidates from which you can pick the coolest!
This assignment used a relatively small dataset, so that you could train an RNN quickly on a CPU. Training a model of the English language requires a much bigger dataset, usually needs much more computation, and could run for many hours on GPUs. We ran our dinosaur name model for quite some time, and so far our favorite name is the great, undefeatable, and fierce: Mangosaurus!
4 - Writing like Shakespeare
The rest of this notebook is optional and is not graded, but we hope you'll do it anyway since it's quite fun and informative.
A similar (but more complicated) task is to generate Shakespeare poems. Instead of learning from a dataset of dinosaur names, you can use a collection of Shakespearean poems. Using LSTM cells, you can learn longer-term dependencies that span many characters in the text, e.g., where a character appearing somewhere in a sequence can influence what should be a different character much later in the sequence. These long-term dependencies were less important with dinosaur names, since the names were quite short.
Let’s become poets!
We have implemented a Shakespeare poem generator with Keras. Run the following cell to load the required packages and the model. This may take a few minutes.
from __future__ import print_function
from keras.callbacks import LambdaCallback
from keras.models import Model, load_model, Sequential
from keras.layers import Dense, Activation, Dropout, Input, Masking
from keras.layers import LSTM
from keras.utils.data_utils import get_file
from keras.preprocessing.sequence import pad_sequences
from shakespeare_utils import *
import sys
import io
To save you some time, we have already trained a model for ~1000 epochs on a collection of Shakespearean poems called ["The Sonnets"](shakespeare.txt).
Let's train the model for one more epoch. When it finishes training for an epoch (this also takes a few minutes), you can run `generate_output`, which will prompt you for an input (fewer than 40 characters). The poem will start with your sentence, and our RNN-Shakespeare will complete the rest of the poem for you! For example, try "Forsooth this maketh no sense" (do not enter the quotation marks). Depending on whether you include a space at the end, your results might also differ, so try it both ways, and try other inputs as well.
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)
model.fit(x, y, batch_size=128, epochs=1, callbacks=[print_callback])
# Run this cell to try with different inputs without having to re-train the model
generate_output()
The RNN-Shakespeare model is very similar to the one you built for dinosaur names. The only major differences are:
- LSTMs instead of the basic RNN to capture longer-range dependencies
- The model is a deeper, stacked LSTM model (2 layers), as sketched below
- Keras is used instead of raw Python/numpy, to simplify the code
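To make these differences concrete, below is a minimal sketch of a 2-layer stacked LSTM character model in Keras. The number of units (128), the sequence length `Tx`, and the vocabulary size are illustrative assumptions, not the exact configuration used by `shakespeare_utils`:

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

Tx = 40            # assumed length of each input character sequence
vocab_size = 38    # assumed number of distinct characters in the corpus

model = Sequential([
    # First LSTM layer returns the full sequence so the second LSTM can consume it.
    LSTM(128, return_sequences=True, input_shape=(Tx, vocab_size)),
    # Second (stacked) LSTM layer returns only its final hidden state.
    LSTM(128),
    # Softmax over the vocabulary predicts the next character.
    Dense(vocab_size, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam')
```

Stacking works because the first LSTM returns its full sequence of hidden states (`return_sequences=True`), which the second LSTM then processes.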
If you want to learn more, you can also check out the Keras team's text generation implementation on GitHub: https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py.
Congratulations on finishing this notebook!