import numpy as np
from utils import *
import random
import pprint
import copy
data = open('dinos.txt', 'r').read()
data = data.lower()
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print('There are %d total characters and %d unique characters in your data.' % (data_size, vocab_size))
There are 19909 total characters and 27 unique characters in your data.
- The characters are a-z (26 characters) plus the "\n" (or newline character).
- In this assignment, the newline character "\n" plays a role similar to the <EOS> (or "End of sentence") token discussed in lecture.
- Here, "\n" indicates the end of the dinosaur name rather than the end of a sentence.
- char_to_ix: In the cell below, you'll create a Python dictionary (i.e., a hash table) to map each character to an index from 0-26.
- ix_to_char: Then, you'll create a second Python dictionary that maps each index back to the corresponding character.
- This will help you figure out which index corresponds to which character in the probability distribution output of the softmax layer.
chars = sorted(chars)
print(chars)
['\n', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(ix_to_char)
{ 0: '\n',
1: 'a',
2: 'b',
3: 'c',
4: 'd',
5: 'e',
6: 'f',
7: 'g',
8: 'h',
9: 'i',
10: 'j',
11: 'k',
12: 'l',
13: 'm',
14: 'n',
15: 'o',
16: 'p',
17: 'q',
18: 'r',
19: 's',
20: 't',
21: 'u',
22: 'v',
23: 'w',
24: 'x',
25: 'y',
26: 'z'}
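As a quick optional sanity check (not part of the graded cells), you can round-trip a name through the two dictionaries. The sample name below is only an illustration, not something read from dinos.txt:
# Optional sanity check (not a graded cell): encode a sample name with char_to_ix
# and decode it back with ix_to_char. The name is arbitrary, not taken from dinos.txt.
name = "tyrannosaurus" + "\n"                      # "\n" marks the end of the name
indices = [char_to_ix[ch] for ch in name]
print(indices)                                      # integers in 0-26, ending in 0 for "\n"
decoded = "".join(ix_to_char[i] for i in indices)
assert decoded == name                              # the two mappings are inverses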
1.2 - Overview of the Model
Your model will have the following structure:
- Initialize parameters
- Run the optimization loop
- Forward propagation to compute the loss function
- Backward propagation to compute the gradients with respect to the loss function
- Clip the gradients to avoid exploding gradients
- Using the gradients, update your parameters with the gradient descent update rule.
- Return the learned parameters
Figure 1: Recurrent Neural Network, similar to what you built in the previous notebook "Building a Recurrent Neural Network - Step by Step."
- At each time-step, the RNN tries to predict what the next character is, given the previous characters.
- $X = (x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, \dots, x^{\langle T_x \rangle})$ is a list of characters from the training set.
- $Y = (y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, \dots, y^{\langle T_x \rangle})$ is the same list of characters but shifted one character forward.
- At every time step $t$, $y^{\langle t \rangle} = x^{\langle t+1 \rangle}$. The prediction at time $t$ is the same as the input at time $t+1$ (see the short example below).
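To make the one-step shift concrete, here is a small illustration (not a graded cell). The sample name is arbitrary, and the exact way the assignment builds X and Y later may differ slightly:
# Illustration of the one-step shift between X and Y (not a graded cell).
# The sample name is arbitrary; it is not read from dinos.txt.
name = "turiasaurus"
X = [char_to_ix[ch] for ch in name]
Y = X[1:] + [char_to_ix["\n"]]     # y<t> = x<t+1>; the final label is the newline (end of name)
print(list(zip(X, Y))[:3])         # each input index is paired with the index that follows it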
2.1 - Clipping the Gradients in the Optimization Loop
In this section you will implement the clip function that you will call inside your optimization loop.
Exploding gradients
- When gradients are very large, they're called "exploding gradients."
- Exploding gradients make the training process more difficult, because the updates may be so large that they "overshoot" the optimal values during back propagation.
Recall that your overall loop structure usually consists of:
- forward pass,
- cost computation,
- backward pass,
- parameter update.
Before updating the parameters, you will perform gradient clipping to make sure that your gradients are not "exploding."
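Here is a small, self-contained illustration (not the graded code) of clipping a deliberately large gradient before a gradient descent update. The shapes, threshold, and learning rate are arbitrary choices for the example:
import numpy as np

# Illustration only: clip a large gradient, then apply the gradient descent update.
np.random.seed(0)
Waa = np.random.randn(5, 5)                     # a toy parameter matrix
dWaa = np.random.randn(5, 5) * 100              # a deliberately "exploding" gradient

np.clip(dWaa, -5, 5, out=dWaa)                  # clip before updating
Waa -= 0.01 * dWaa                              # the update now stays bounded
print(np.abs(dWaa).max())                       # <= 5 after clipping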
Gradient clipping
In the exercise below, you will implement a function clip that takes in a dictionary of gradients and returns a clipped version of the gradients, if needed.
- There are different ways to clip gradients.
- You will use a simple element-wise clipping procedure, in which every element of the gradient vector is clipped to fall within the range [-N, N].
- For example, if N = 10 (illustrated in the snippet after this list):
- The range is [-10, 10]
- If any component of the gradient vector is greater than 10, it is set to 10.
- If any component of the gradient vector is less than -10, it is set to -10.
- If any components are between -10 and 10, they keep their original values.
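A short snippet (not a graded cell) showing this element-wise rule with N = 10 on a small array:
import numpy as np

# Element-wise clipping with N = 10: values outside [-10, 10] are set to the boundary.
v = np.array([[-15.0, 3.0], [8.0, 42.0]])
np.clip(v, -10, 10, out=v)
print(v)   # [[-10.   3.]
           #  [  8.  10.]]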
Figure 2: Visualization of gradient descent with and without gradient clipping, in a case where the network is running into "exploding gradient" problems.
Exercise 1 - clip
Return the clipped gradients of your dictionary gradients.
- Your function takes in a maximum threshold and returns the clipped versions of the gradients.
- You can check out numpy.clip for more info.
- You will need to use the argument "out = ...".
- Using the "out" parameter allows you to update a variable "in-place" (see the short demonstration after this list).
- If you don't use the "out" argument, the clipped result is stored in the loop variable "gradient" but the gradient variables dWax, dWaa, dWya, db, dby are not updated.
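A quick demonstration (not a graded cell) of the difference the "out" argument makes:
import numpy as np

# Without out=, np.clip returns a new array and g is left unchanged;
# with out=g, the clipped values are written back into g in place.
g = np.array([20.0, -30.0, 5.0])
clipped = np.clip(g, -10, 10)      # g is still [ 20. -30.   5.]
print(g)
np.clip(g, -10, 10, out=g)         # g is now [ 10. -10.   5.]
print(g)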
# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED FUNCTION: clip
def clip(gradients, maxValue):
    '''
    Clips the gradients' values between minimum and maximum.

    Arguments:
    gradients -- a dictionary containing the gradients "dWaa", "dWax", "dWya", "db", "dby"
    maxValue -- everything above this number is set to this number, and everything less than -maxValue is set to -maxValue

    Returns:
    gradients -- a dictionary with the clipped gradients.
    '''
    gradients = copy.deepcopy(gradients)
    dWaa, dWax, dWya, db, dby = gradients['dWaa'], gradients['dWax'], gradients['dWya'], gradients['db'], gradients['dby']

    ### START CODE HERE ###
    # Clip to mitigate exploding gradients, loop over [dWax, dWaa, dWya, db, dby]. (≈2 lines)
    for gradient in [dWax, dWaa, dWya, db, dby]:
        np.clip(gradient, -maxValue, maxValue, out=gradient)
    ### END CODE HERE ###

    gradients = {"dWaa": dWaa, "dWax": dWax, "dWya": dWya, "db": db, "dby": dby}
    return gradients
# Test with a max value of 10
def clip_test(target, mValue):
    print(f"\nGradients for mValue={mValue}")
    np.random.seed(3)
    dWax = np.random.randn(5, 3) * 10
    dWaa = np.random.randn(5, 5) * 10
    dWya = np.random.randn(2, 5) * 10
    db = np.random.randn(5, 1) * 10
    dby = np.random.randn(2, 1) * 10
    gradients = {"dWax": dWax, "dWaa": dWaa, "dWya": dWya, "db": db, "dby": dby}

    gradients2 = target(gradients, mValue)
    print("gradients[\"dWaa