import numpy as np
from utils import *
import random
import pprint
import copy
data = open('dinos.txt', 'r').read()
data = data.lower()
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print('There are %d total characters and %d unique characters in your data.' % (data_size, vocab_size))
There are 19909 total characters and 27 unique characters in your data.
- The characters are a-z (26 characters) plus the "\n" (or newline character).
- In this assignment, the newline character "\n" plays a role similar to the <EOS> (or "End of sentence") token discussed in lecture.
- Here, "\n" indicates the end of the dinosaur name rather than the end of a sentence.
- char_to_ix: In the cell below, you'll create a Python dictionary (i.e., a hash table) to map each character to an index from 0-26.
- ix_to_char: Then, you'll create a second Python dictionary that maps each index back to the corresponding character.
- This will help you figure out which index corresponds to which character in the probability distribution output of the softmax layer.
chars = sorted(chars)
print(chars)
['\n', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(ix_to_char)
{ 0: '\n',
1: 'a',
2: 'b',
3: 'c',
4: 'd',
5: 'e',
6: 'f',
7: 'g',
8: 'h',
9: 'i',
10: 'j',
11: 'k',
12: 'l',
13: 'm',
14: 'n',
15: 'o',
16: 'p',
17: 'q',
18: 'r',
19: 's',
20: 't',
21: 'u',
22: 'v',
23: 'w',
24: 'x',
25: 'y',
26: 'z'}
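As a quick optional sanity check (not part of the graded cells), you can round-trip a name through the two dictionaries. The sample name below is only an illustration, not something read from dinos.txt:
# Optional sanity check (not a graded cell): encode a sample name with char_to_ix
# and decode it back with ix_to_char. The name is arbitrary, not taken from dinos.txt.
name = "tyrannosaurus" + "\n"                      # "\n" marks the end of the name
indices = [char_to_ix[ch] for ch in name]
print(indices)                                      # integers in 0-26, ending in 0 for "\n"
decoded = "".join(ix_to_char[i] for i in indices)
assert decoded == name                              # the two mappings are inverses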
1.2 - Overview of the Model
Your model will have the following structure:
- Initialize parameters
- Run the optimization loop
- Forward propagation to compute the loss function
- Backward propagation to compute the gradients with respect to the loss function
- Clip the gradients to avoid exploding gradients
- Using the gradients, update your parameters with the gradient descent update rule.
- Return the learned parameters
Figure 1: Recurrent Neural Network, similar to what you built in the previous notebook "Building a Recurrent Neural Network - Step by Step."
- At each time-step, the RNN tries to predict what the next character is, given the previous characters.
- $X = (x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, \dots, x^{\langle T_x \rangle})$ is a list of characters from the training set.
- $Y = (y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, \dots, y^{\langle T_x \rangle})$ is the same list of characters but shifted one character forward.
- At every time step $t$, $y^{\langle t \rangle} = x^{\langle t+1 \rangle}$. The prediction at time $t$ is the same as the input at time $t+1$ (see the short example below).
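To make the one-step shift concrete, here is a small illustration (not a graded cell). The sample name is arbitrary, and the exact way the assignment builds X and Y later may differ slightly:
# Illustration of the one-step shift between X and Y (not a graded cell).
# The sample name is arbitrary; it is not read from dinos.txt.
name = "turiasaurus"
X = [char_to_ix[ch] for ch in name]
Y = X[1:] + [char_to_ix["\n"]]     # y<t> = x<t+1>; the final label is the newline (end of name)
print(list(zip(X, Y))[:3])         # each input index is paired with the index that follows it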
2.1 - Clipping the Gradients in the Optimization Loop
In this section you will implement the clip function that you will call inside your optimization loop.
Exploding gradients
- When gradients are very large, they're called "exploding gradients."
- Exploding gradients make the training process more difficult, because the updates may be so large that they "overshoot" the optimal values during back propagation.
Recall that your overall loop structure usually consists of:
- forward pass,
- cost computation,
- backward pass,
- parameter update.
Before updating the parameters, you will perform gradient clipping to make sure that your gradients are not "exploding."
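Here is a small, self-contained illustration (not the graded code) of clipping a deliberately large gradient before a gradient descent update. The shapes, threshold, and learning rate are arbitrary choices for the example:
import numpy as np

# Illustration only: clip a large gradient, then apply the gradient descent update.
np.random.seed(0)
Waa = np.random.randn(5, 5)                     # a toy parameter matrix
dWaa = np.random.randn(5, 5) * 100              # a deliberately "exploding" gradient

np.clip(dWaa, -5, 5, out=dWaa)                  # clip before updating
Waa -= 0.01 * dWaa                              # the update now stays bounded
print(np.abs(dWaa).max())                       # <= 5 after clipping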
Gradient clipping
In the exercise below, you will implement a function clip that takes in a dictionary of gradients and returns a clipped version of the gradients, if needed.
- There are different ways to clip gradients.
- You will use a simple element-wise clipping procedure, in which every element of the gradient vector is clipped to fall within the range [-N, N].
- For example, if N = 10 (illustrated in the snippet after this list):
- The range is [-10, 10]
- If any component of the gradient vector is greater than 10, it is set to 10.
- If any component of the gradient vector is less than -10, it is set to -10.
- If any components are between -10 and 10, they keep their original values.
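A short snippet (not a graded cell) showing this element-wise rule with N = 10 on a small array:
import numpy as np

# Element-wise clipping with N = 10: values outside [-10, 10] are set to the boundary.
v = np.array([[-15.0, 3.0], [8.0, 42.0]])
np.clip(v, -10, 10, out=v)
print(v)   # [[-10.   3.]
           #  [  8.  10.]]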
Figure 2: Visualization of gradient descent with and without gradient clipping, in a case where the network is running into "exploding gradient" problems.
Exercise 1 - clip
Return the clipped gradients of your dictionary gradients.
- Your function takes in a maximum threshold and returns the clipped versions of the gradients.
- You can check out numpy.clip for more info.
- You will need to use the argument "out = ...".
- Using the "out" parameter allows you to update a variable "in-place" (see the short demonstration after this list).
- If you don't use the "out" argument, the clipped result is stored in the loop variable "gradient" but the gradient variables dWax, dWaa, dWya, db, dby are not updated.
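A quick demonstration (not a graded cell) of the difference the "out" argument makes:
import numpy as np

# Without out=, np.clip returns a new array and g is left unchanged;
# with out=g, the clipped values are written back into g in place.
g = np.array([20.0, -30.0, 5.0])
clipped = np.clip(g, -10, 10)      # g is still [ 20. -30.   5.]
print(g)
np.clip(g, -10, 10, out=g)         # g is now [ 10. -10.   5.]
print(g)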
# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED FUNCTION: clip
def clip(gradients, maxValue):
    '''
    Clips the gradients' values between minimum and maximum.

    Arguments:
    gradients -- a dictionary containing the gradients "dWaa", "dWax", "dWya", "db", "dby"
    maxValue -- everything above this number is set to this number, and everything less than -maxValue is set to -maxValue

    Returns:
    gradients -- a dictionary with the clipped gradients.
    '''
    gradients = copy.deepcopy(gradients)
    dWaa, dWax, dWya, db, dby = gradients['dWaa'], gradients['dWax'], gradients['dWya'], gradients['db'], gradients['dby']

    ### START CODE HERE ###
    # Clip to mitigate exploding gradients, loop over [dWax, dWaa, dWya, db, dby]. (≈2 lines)
    for gradient in [dWax, dWaa, dWya, db, dby]:
        np.clip(gradient, -maxValue, maxValue, out=gradient)
    ### END CODE HERE ###

    gradients = {"dWaa": dWaa, "dWax": dWax, "dWya": dWya, "db": db, "dby": dby}
    return gradients
# Test with a max value of 10
def clip_test(target, mValue):
    print(f"\nGradients for mValue={mValue}")
    np.random.seed(3)
    dWax = np.random.randn(5, 3) * 10
    dWaa = np.random.randn(5, 5) * 10
    dWya = np.random.randn(2, 5) * 10
    db = np.random.randn(5, 1) * 10
    dby = np.random.randn(2, 1) * 10
    gradients = {"dWax": dWax, "dWaa": dWaa, "dWya": dWya, "db": db, "dby": dby}

    gradients2 = target(gradients, mValue)
    print("gradients[\"dWaa