Recurrent Neural Networks Tutorial, Part 2: Notes

Latest recommended article published 2023-04-07 09:45:31

Tags:

#RNN #Theano

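This article walks through implementing a recurrent neural network (RNN) language model from scratch with Python and Theano. It covers the model structure, forward propagation, loss calculation, and training with stochastic gradient descent (SGD) and backpropagation through time (BPTT).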

Note: RECURRENT NEURAL NETWORKS TUTORIAL, PART 2 – IMPLEMENTING A RNN WITH PYTHON, NUMPY AND THEANO

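This translation is a bit rough and focuses on the network structure. It is mainly meant as study notes and will be taken down on request if it infringes.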

This tutorial consists of the following parts:
1. Introduction To RNNs
2. Implementing a RNN using Python and Theano
3. Understanding the Backpropagation Through Time (BPTT) algorithm and the vanishing gradient problem
4. Implementing a GRU/LSTM RNN

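This is the second part of the RNN Tutorial.
Code to follow along is on Github.
In this part we implement a complete recurrent neural network from scratch in Python, and then optimize the implementation with Theano, a library that can run the computation on a GPU.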

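Language Modeling

Our goal is to build a language model using an RNN. Given a sentence of m words, a language model lets us predict the probability of that sentence (roughly, how likely it is to be a correct sentence):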
$\begin{aligned} P(w_1,...,w_m)=\prod_{i=1}^{m}P(w_i\mid w_1,..., w_{i-1})\end{aligned}$
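In other words, the probability of a sentence is the product of the probabilities of each word given the words that came before it. So the probability of the sentence "He went to buy some chocolate" is the probability of "chocolate" given "He went to buy some", multiplied by the probability of "some" given "He went to buy", and so on.
Why is that useful? Why would we want to assign a probability to a sentence?
First, such a model can be used as a scoring mechanism. For example, a machine translation system typically generates multiple candidates for an input sentence, and a language model can be used to pick the most probable one. Intuitively, the most probable sentence is likely to be the most grammatically correct. Similar scoring is used in speech recognition.
But solving the language modeling problem also has a cool side effect. Because we can predict the probability of the next word given the preceding words, we can generate new text. In that sense it is a generative model. Given an existing sequence of words, we sample the next word from the predicted probabilities and repeat the process until we have a full sentence. Andrej Karpathy has a great post that demonstrates what language models are capable of.
Note that in the equation above the probability of each word is conditioned on all previous words. In practice, many models have a hard time representing such long-term dependencies because of computational or memory constraints, and they are typically limited to looking at only a few of the previous words. RNNs can, in theory, capture such long-term dependencies, but in practice it is more complicated. This is covered later.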
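
To make the chain-rule factorization concrete, here is a minimal sketch that multiplies per-word conditional probabilities into a sentence probability. The probability values are made up purely for illustration, and the names `conditional_probs` and `sentence_log_prob` are hypothetical. Summing logs instead of multiplying raw probabilities avoids numerical underflow on long sentences.

```python
import math

# Hypothetical values of P(w_i | w_1, ..., w_{i-1}) for a toy sentence.
# These numbers are invented for illustration only.
conditional_probs = [
    ("He", 0.10),
    ("went", 0.05),
    ("to", 0.40),
    ("buy", 0.02),
    ("some", 0.20),
    ("chocolate", 0.01),
]

def sentence_log_prob(cond_probs):
    """Sum of log conditional probabilities, i.e. the log of their product."""
    return sum(math.log(p) for _, p in cond_probs)

log_p = sentence_log_prob(conditional_probs)
sentence_prob = math.exp(log_p)
print("log P(sentence) = %.4f" % log_p)
print("P(sentence)     = %.3e" % sentence_prob)
```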

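Training Data and Preprocessing

Training a language model requires text to learn from. Fortunately, we don't need any labels, just raw text. I downloaded 15,000 longish reddit comments from a dataset available on Google's BigQuery. Text generated by our model will sound like reddit commenters (hopefully)! However, the data first needs some preprocessing to get it into the right format.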

Tokenizing Text

We use NLTK's sent_tokenize to split the comments into sentences and word_tokenize to split the sentences into words.

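Removing Infrequent Words

The larger the vocabulary, the slower the model trains. In addition, because we see few contextual examples for infrequent words, we cannot learn to use them correctly. Really understanding a word requires seeing it in many different contexts.
In our code we keep only the most common words (8000 by default, but this can be changed). Every word that is not in the vocabulary is replaced by UNKNOWN_TOKEN. UNKNOWN_TOKEN is itself part of the vocabulary and is predicted like any other word. When generating text we can replace it again, for example with a randomly sampled word that is not in the vocabulary, or we can keep generating sentences until we get one that contains no unknown token.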
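
The vocabulary truncation described above could be sketched as follows. This is an illustrative version, not the tutorial's exact code: the helper names `build_vocab` and `replace_rare_words` and the toy sentences are assumptions, and only `collections.Counter` from the standard library is used.

```python
from collections import Counter

UNKNOWN_TOKEN = "UNKNOWN_TOKEN"

def build_vocab(tokenized_sentences, vocabulary_size):
    """Keep the (vocabulary_size - 1) most frequent words plus UNKNOWN_TOKEN."""
    counts = Counter(w for sent in tokenized_sentences for w in sent)
    most_common = [w for w, _ in counts.most_common(vocabulary_size - 1)]
    index_to_word = most_common + [UNKNOWN_TOKEN]
    word_to_index = {w: i for i, w in enumerate(index_to_word)}
    return index_to_word, word_to_index

def replace_rare_words(tokenized_sentences, word_to_index):
    """Map every out-of-vocabulary word to UNKNOWN_TOKEN."""
    return [[w if w in word_to_index else UNKNOWN_TOKEN for w in sent]
            for sent in tokenized_sentences]

# Toy corpus (hypothetical); a vocabulary of size 3 keeps "the" and "sat".
toy_sentences = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird", "flew"]]
index_to_word, word_to_index = build_vocab(toy_sentences, vocabulary_size=3)
cleaned = replace_rare_words(toy_sentences, word_to_index)
print(cleaned)
```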

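Prepending Special Start and End Tokens

We also want to know where a sentence starts and ends. To do this we add a special SENTENCE_START token at the beginning and a special SENTENCE_END token at the end of each sentence. This lets us ask: given that the first token is SENTENCE_START, what is the next word (the actual first word of the sentence)?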

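Building the Training Matrices

The goal of training is to predict the next word.
The labels y are therefore the inputs x shifted by one position: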

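import numpy as np

# Create the training data: y is x shifted by one position.
# dtype=object because sentences differ in length (recent NumPy versions reject ragged arrays otherwise).
X_train = np.asarray([[word_to_index[w] for w in sent[:-1]] for sent in tokenized_sentences], dtype=object)
y_train = np.asarray([[word_to_index[w] for w in sent[1:]] for sent in tokenized_sentences], dtype=object)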
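
As a small illustration of the shift between inputs and labels, the sketch below builds x and y for a single toy sentence. The sentence and the mapping `toy_word_to_index` are hypothetical and stand in for the real vocabulary.

```python
SENTENCE_START = "SENTENCE_START"
SENTENCE_END = "SENTENCE_END"

# One toy sentence, already wrapped in the start and end tokens.
sentence = [SENTENCE_START, "he", "went", "home", SENTENCE_END]

# Hypothetical word-to-index mapping built from this single sentence.
toy_word_to_index = {w: i for i, w in enumerate(sentence)}

# Input: every token except the last. Label: every token except the first.
x = [toy_word_to_index[w] for w in sentence[:-1]]
y = [toy_word_to_index[w] for w in sentence[1:]]

# At each position t, y[t] is the word that follows x[t].
for x_t, y_t in zip(x, y):
    print("%s -> %s" % (sentence[x_t], sentence[y_t]))
```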

Building the RNN

vocabulary size $C=8000$
hidden layer size $H=100$

$\begin{aligned}s_t&=\tanh(Ux_t+Ws_{t-1})\\o_t&=\mathrm{softmax}(Vs_t)\end{aligned}$

$\begin{aligned}x_t & \in \mathbb{R}^{8000} \\ o_t & \in \mathbb{R}^{8000}\\s_t& \in \mathbb{R}^{100} \\ U & \in \mathbb{R}^{100 \times 8000} \\ V & \in \mathbb{R}^{8000 \times 100} \\ W & \in \mathbb{R}^{100 \times 100} \\ \end{aligned}$

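$U,V,W$ are the network parameters we want to learn from the data. In total we need to learn $2HC + H^2$ parameters, which is 1,610,000 for $C=8000$ and $H=100$. Note that $x_t$ is a one-hot vector, so multiplying it with $U$ is the same as selecting one column of $U$, and the full multiplication never has to be computed. The most expensive operation is therefore $Vs_t$. This is why we want to keep the vocabulary size small.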
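
The parameter count and the column-selection shortcut can be checked numerically. The sketch below uses a random matrix in place of a trained $U$ and an arbitrary word index, both chosen only for illustration.

```python
import numpy as np

C, H = 8000, 100  # vocabulary size and hidden layer size from the text

# U is H x C, V is C x H, W is H x H.
num_params = 2 * H * C + H ** 2
print("parameters to learn:", num_params)

# Multiplying U by a one-hot vector selects a single column of U.
rng = np.random.default_rng(0)
U = rng.standard_normal((H, C))
word_index = 42            # arbitrary word index, for illustration
x_t = np.zeros(C)
x_t[word_index] = 1.0

full_product = U.dot(x_t)  # costs O(H * C)
column = U[:, word_index]  # costs O(H)
print("same result:", np.allclose(full_product, column))
```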

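Initialization

Initializing $U,V$ and $W$ is a bit tricky. We can't simply set them to 0, because that would lead to symmetric calculations in all layers. They must be initialized randomly. Because proper initialization has a large impact on training results, this topic has been studied a lot. The best initialization depends on the activation function ($\tanh$ in our case), and one recommended approach is to initialize the weights randomly in the interval
$\left[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}\right]$, where n is the number of incoming connections from the previous layer (see the code example). This may sound complicated (and it is), but don't worry: as long as the parameters are initialized to small random values, it typically works fine.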

class RNNNumpy:

    def __init__(self, word_dim, hidden_dim=100, bptt_truncate=4):
        # Assign instance variables
        self.word_dim = word_dim
        self.hidden_dim = hidden_dim
        self.bptt_truncate = bptt_truncate
        # Randomly initialize the network parameters
        self.U = np.random.uniform(-np.sqrt(1./word_dim), np.sqrt(1./word_dim), (hidden_dim, word_dim))
        self.V = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (word_dim, hidden_dim))
        self.W = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (hidden_dim, hidden_dim))

Forward Propagation

Next, we implement forward propagation.

def forward_propagation(self, x):
    # The total number of time steps
    T = len(x)
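    # During forward propagation we save all hidden states in s because we need them later.
    # We add one extra element for the initial hidden state, s[-1], which is set to 0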
    s = np.zeros((T + 1, self.hidden_dim))
    s[-1] = np.zeros(self.hidden_dim)
    # The outputs at each time step. Again, we save them for later.
    o = np.zeros((T, self.word_dim))
    # For each time step...
    for t in np.arange(T):
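        # Note that we index U by x[t]; this is the same as multiplying U by a one-hot vector
        s[t] = np.tanh(self.U[:,x[t]] + self.W.dot(s[t-1]))
        # softmax is not defined in this snippet; it is assumed to be a helper available in scope
        o[t] = softmax(self.V.dot(s[t]))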
    return [o, s]

RNNNumpy.forward_propagation = forward_propagation
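
The forward pass above calls a `softmax` helper that these notes do not define. A minimal sketch of such a helper is shown below, assuming the input is a 1-D NumPy vector of scores; the original tutorial's helper may differ in detail. Subtracting the maximum before exponentiating keeps the computation stable for large scores without changing the result.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector of scores."""
    shifted = z - np.max(z)   # avoids overflow in exp for large scores
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

scores = np.array([1.0, 2.0, 3.0])
probs = softmax(scores)
print(probs, probs.sum())

# Large scores would overflow a naive implementation but are fine here.
large_probs = softmax(np.array([1000.0, 1001.0, 1002.0]))
print(large_probs)
```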

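We return not only the calculated outputs but also the hidden states, which are used later to compute the gradients. Each $o_t$ is a vector of probabilities over the words in the vocabulary, but sometimes all we want is the next word with the highest probability. That operation is predict: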

def predict(self, x):
    # Perform forward propagation and return index of the highest score
    o, s = self.forward_propagation(x)
    # o has shape (T, word_dim); argmax over axis 1 picks the most likely word at each time step
    return np.argmax(o, axis=1)

RNNNumpy.predict = predict

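Calculating the Loss

We use the cross-entropy loss. With $N$ training examples (words in the text), true labels $y$ and predictions $o$, it is: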
$\begin{aligned}L(y,o)=-\frac{1}{N} \sum_{n \in N} y_{n} \log o_{n} \end{aligned}$

def calculate_total_loss(self, x, y):
    L = 0
    # For each sentence...
    for i in np.arange(len(y)):
        o, s = self.forward_propagation(x[i])
        # We only care about our prediction of the "correct" words, i.e. the probability assigned to the true word at each time step
        correct_word_predictions = o[np.arange(len(y[i])), y[i]]
        # Add to the loss based on how off we were
        L += -1 * np.sum(np.log(correct_word_predictions))
    return L

def calculate_loss(self, x, y):
    # Divide the total loss by the number of training examples (words)
    N = sum(len(y_i) for y_i in y)
    return self.calculate_total_loss(x,y)/N

RNNNumpy.calculate_total_loss = calculate_total_loss
RNNNumpy.calculate_loss = calculate_loss

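What loss should we expect for random predictions? With $C$ words in the vocabulary, each word should be predicted with probability $1/C$ on average, which gives: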
$L=-\frac{1}{N} N \log\frac{1}{C}=\log{C}$
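
The $\log C$ baseline can be verified directly. The sketch below builds a perfectly uniform prediction matrix (an idealized stand-in for a freshly initialized model) together with arbitrary target indices, then applies the same indexing trick as calculate_total_loss. The variable `num_words` is a hypothetical sequence length.

```python
import numpy as np

C = 8000          # vocabulary size
num_words = 50    # hypothetical number of words in the sequence

# Uniform predictions: every word gets probability 1/C at every time step.
o = np.full((num_words, C), 1.0 / C)
# Arbitrary target word indices; any choice gives the same loss here.
y = np.arange(num_words) % C

# Pick the probability assigned to the correct word at each time step.
correct_word_predictions = o[np.arange(num_words), y]
loss = -np.sum(np.log(correct_word_predictions)) / num_words

print("cross-entropy of uniform predictions: %f" % loss)
print("log(C):                               %f" % np.log(C))
```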

# Limit to 1000 examples to save time
print("Expected Loss for random predictions: %f" % np.log(vocabulary_size))
print("Actual loss: %f" % model.calculate_loss(X_train[:1000], y_train[:1000]))

Expected Loss for random predictions: 8.987197
Actual loss: 8.987440

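Training the RNN with SGD and BPTT

Finally it is time for SGD (stochastic gradient descent). The idea is to iterate over the training examples and, at each step, nudge the parameters in a direction that reduces the error. These directions are given by the gradients of the loss: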
$\frac{\partial L}{\partial U},\frac{\partial L}{\partial V},\frac{\partial L}{\partial W}$

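SGD also needs a learning rate, which controls how big a step we take at each iteration. SGD is a very popular optimization method, so there is a lot of research on how to improve it, for example with batching, parallelism and adaptive learning rates (such as AdaDelta). Even though the basic idea is simple, implementing SGD efficiently can become quite complex. (The original post links to further reading on SGD here.) This post implements a simple, easy-to-understand version of SGD.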
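
To show the update rule in isolation, here is a toy sketch of per-example gradient descent on a one-parameter model. The data, the model y = w * x and the squared-error loss are assumptions chosen for illustration and are unrelated to the RNN. As in the training loop of this post, the examples are visited in a fixed order rather than shuffled.

```python
# Toy data generated by y = 2 * x (hypothetical).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0               # the single parameter to learn
learning_rate = 0.01

for epoch in range(200):
    for x_i, y_i in zip(xs, ys):
        # Gradient of the squared error (w * x_i - y_i)^2 with respect to w
        grad = 2 * (w * x_i - y_i) * x_i
        # Step against the gradient, scaled by the learning rate
        w -= learning_rate * grad

print("learned w = %f" % w)
```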
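Next comes BPTT. It is not explained in detail in this part; a later part of the tutorial covers it, so the code is simply listed here. For a general introduction to backpropagation check out this and this post. For now, treat bptt as a function that takes a training example (x, y) and returns the gradients
$\frac{\partial L}{\partial U},\frac{\partial L}{\partial V},\frac{\partial L}{\partial W}$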

def bptt(self, x, y):
    T = len(y)
    # Perform forward propagation
    o, s = self.forward_propagation(x)
    # We accumulate the gradients in these variables
    dLdU = np.zeros(self.U.shape)
    dLdV = np.zeros(self.V.shape)
    dLdW = np.zeros(self.W.shape)
    delta_o = o
    delta_o[np.arange(len(y)), y] -= 1.
    # For each output backwards...
    for t in np.arange(T)[::-1]:
        dLdV += np.outer(delta_o[t], s[t].T)
        # Initial delta calculation
        delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2))
        # Backpropagation through time (for at most self.bptt_truncate steps)
        for bptt_step in np.arange(max(0, t-self.bptt_truncate), t+1)[::-1]:
            # print "Backpropagation step t=%d bptt step=%d " % (t, bptt_step)
            dLdW += np.outer(delta_t, s[bptt_step-1])              
            dLdU[:,x[bptt_step]] += delta_t
            # Update delta for next step
            delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step-1] ** 2)
    return [dLdU, dLdV, dLdW]

RNNNumpy.bptt = bptt

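Gradient Checking

The idea follows directly from the definition of the derivative: perturb a parameter $\theta$ by a small amount $h$, estimate the gradient numerically, and compare it with the gradient computed by BPTT: $\frac{\partial L}{\partial \theta} \approx \frac{L(\theta+h)-L(\theta-h)}{2h}$.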
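
A minimal sketch of gradient checking is given below. Instead of the full RNN it uses a small stand-in loss, the sum of tanh(Wx), whose analytic gradient is easy to write down. The function names and the test matrix are hypothetical; the same central-difference loop could be applied to each of U, V and W with the model's loss.

```python
import numpy as np

def loss_fn(W, x):
    """Stand-in loss: the sum of tanh(W x)."""
    return np.sum(np.tanh(W.dot(x)))

def analytic_grad(W, x):
    """Gradient of loss_fn with respect to W, derived by hand."""
    s = np.tanh(W.dot(x))
    return np.outer(1 - s ** 2, x)

def numerical_grad(f, W, h=1e-5):
    """Central-difference estimate (f(W + h) - f(W - h)) / (2h), one entry at a time."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        original = W[idx]
        W[idx] = original + h
        plus = f(W)
        W[idx] = original - h
        minus = f(W)
        W[idx] = original  # restore the parameter
        grad[idx] = (plus - minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
W = rng.uniform(-0.5, 0.5, (3, 4))
x = rng.uniform(-1, 1, 4)

num = numerical_grad(lambda M: loss_fn(M, x), W)
ana = analytic_grad(W, x)
max_abs_diff = np.max(np.abs(num - ana))
print("max abs difference: %.2e" % max_abs_diff)
```

Because every parameter requires two extra loss evaluations, a check like this is expensive and is usually run on a model with a small vocabulary.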

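SGD Implementation

The implementation has two steps: 1. a function sgd_step that calculates the gradients and performs the update for one batch; 2. an outer loop that iterates over the training set and adjusts the learning rate.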

# Performs one step of SGD.
def numpy_sdg_step(self, x, y, learning_rate):
    # Calculate the gradients
    dLdU, dLdV, dLdW = self.bptt(x, y)
    # Change parameters according to gradients and learning rate
    self.U -= learning_rate * dLdU
    self.V -= learning_rate * dLdV
    self.W -= learning_rate * dLdW

RNNNumpy.sgd_step = numpy_sdg_step

import sys
from datetime import datetime

# Outer SGD Loop
# - model: The RNN model instance
# - X_train: The training data set
# - y_train: The training data labels
# - learning_rate: Initial learning rate for SGD
# - nepoch: Number of times to iterate through the complete dataset
# - evaluate_loss_after: Evaluate the loss after this many epochs
def train_with_sgd(model, X_train, y_train, learning_rate=0.005, nepoch=100, evaluate_loss_after=5):
    # We keep track of the losses so we can plot them later
    losses = []
    num_examples_seen = 0
    for epoch in range(nepoch):
        # Optionally evaluate the loss
        if (epoch % evaluate_loss_after == 0):
            loss = model.calculate_loss(X_train, y_train)
            losses.append((num_examples_seen, loss))
            time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            print("%s: Loss after num_examples_seen=%d epoch=%d: %f" % (time, num_examples_seen, epoch, loss))
            # Reduce the learning rate if the loss increased
            if (len(losses) > 1 and losses[-1][1] > losses[-2][1]):
                learning_rate = learning_rate * 0.5
                print("Setting learning rate to %f" % learning_rate)
            sys.stdout.flush()
        # For each training example...
        for i in range(len(y_train)):
            # One SGD step
            model.sgd_step(X_train[i], y_train[i], learning_rate)
            num_examples_seen += 1

Let's get a feel for how long one training step takes.

np.random.seed(10)
model = RNNNumpy(vocabulary_size)
%timeit model.sgd_step(X_train[10], y_train[10], 0.005)

Training the Network with Theano and the GPU

Just like the rest of this post, the code is also available on Github.

Reference post
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/