Neural Networks: A Clear Explanation and a Simple Implementation (Implementing a Neural Network from Scratch in Python – An Introduction)

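This post shows how to implement a simple 3-layer neural network from scratch, covering dataset generation, a logistic regression baseline, the network architecture, and parameter learning.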


Get the code: To follow along, all the code is also available as an iPython notebook on Github.

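In this post we will implement a simple 3-layer neural network from scratch.
We won't derive all the math that's required, but I will try to give an intuitive explanation of what we are doing, and I will point to resources where you can read up on the details.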

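Here I'm assuming that you are familiar with basic calculus and machine learning concepts, e.g. you know what classification and regularization are. Ideally you also know a bit about how optimization techniques like gradient descent work. But even if you're not familiar with any of the above, this post could still turn out to be interesting :)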

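But why implement a neural network from scratch at all? Even if you plan on using neural network libraries like PyBrain in the future, implementing a network from scratch at least once is an extremely valuable exercise. It helps you understand how neural networks work, and that is essential for designing effective models.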

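One thing to note is that the code examples here aren't terribly efficient. They are meant to be easy to understand. In an upcoming post I will explore how to write an efficient neural network implementation using Theano. (Update: now available)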

Generating a dataset

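Let's start by generating a dataset we can play with. Fortunately, scikit-learn has some useful dataset generators, so we don't need to write the code ourselves. We will go with the make_moons function.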

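import numpy as np
# Import the submodules explicitly so that the name `sklearn` is bound below
import sklearn.datasets
import sklearn.linear_model
import matplotlib.pyplot as plt
# IPython/Jupyter magic: render plots inline (omit when running as a plain script)
%matplotlib inline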

# Generate a dataset and plot it
np.random.seed(0)
X, y = sklearn.datasets.make_moons(200, noise=0.20)
plt.scatter(X[:,0], X[:,1], s=40, c=y, cmap=plt.cm.Spectral)

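The dataset we generated has two classes, plotted as red and blue points. You can think of the blue dots as male patients and the red dots as female patients, with the x- and y-axis being medical measurements.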

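Our goal is to train a machine learning classifier that predicts the correct class (male or female) given the x- and y-coordinates. Note that the data is not linearly separable: we can't draw a straight line that separates the two classes. This means that linear classifiers, such as logistic regression, won't be able to fit the data unless you hand-engineer non-linear features (such as polynomials) that work well for the given dataset.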
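
To make the point about hand-engineered features concrete, here is a minimal sketch on a toy problem that is not part of the original post. The four XOR-style points in `X_toy` cannot be separated by a straight line in their original coordinates, but a single engineered feature (the product of the two coordinates) makes a simple threshold sufficient. The names `X_toy`, `y_toy`, and `product_feature` are illustrative assumptions.

```python
import numpy as np

# Toy XOR-style data (an assumption for illustration, not the moons dataset):
# points in opposite quadrants share a class, so no straight line separates them.
X_toy = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y_toy = np.array([1, 1, 0, 0])

# Hand-engineered non-linear feature: the product of the two coordinates.
product_feature = X_toy[:, 0] * X_toy[:, 1]

# In the new feature space a single threshold (a linear rule) is enough.
pred_toy = (product_feature > 0).astype(int)
print(np.array_equal(pred_toy, y_toy))
```

The catch is that the right feature had to be found by hand for this particular dataset, which is the work a hidden layer is meant to take over.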

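In fact, that's one of the major advantages of neural networks. You don't need to worry about feature engineering. The hidden layer of a neural network will learn features for you.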

Logistic Regression

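To demonstrate the point, let's train a logistic regression classifier. Its input will be the x- and y-values and its output the predicted class (0 or 1). To make our life easy we use the logistic regression class from scikit-learn.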

Note: the code is in the file simple_classification.py on GitHub.
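
The snippets in this post call a helper named plot_decision_boundary that is not defined here; it lives in the accompanying code. The block below is only a sketch of what such a helper might look like, not the author's exact implementation. It assumes the module-level arrays `X` and `y` from the dataset step, and the split into a separate `decision_grid` function (a hypothetical name) is my own choice so that the grid logic can be checked without drawing anything.

```python
import numpy as np

def decision_grid(pred_func, X, h=0.01, pad=0.5):
    """Evaluate pred_func on a regular grid that covers the points in X."""
    x_min, x_max = X[:, 0].min() - pad, X[:, 0].max() + pad
    y_min, y_max = X[:, 1].min() - pad, X[:, 1].max() + pad
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    # Predict the class for every grid point, then restore the grid shape
    Z = pred_func(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    return xx, yy, Z

def plot_decision_boundary(pred_func):
    """Draw predicted class regions plus the training points.

    Assumes X and y are the module-level arrays created earlier in the post.
    """
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing
    xx, yy, Z = decision_grid(pred_func, X)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral)
```

Any function that maps an array of shape (n, 2) to n class labels can be passed as `pred_func`, which is why the post wraps its classifiers in a lambda.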

# Train the logistic regression classifier
clf = sklearn.linear_model.LogisticRegressionCV()
clf.fit(X, y)

# Plot the decision boundary
plot_decision_boundary(lambda x: clf.predict(x))
plt.title("Logistic Regression")

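The graph shows the decision boundary learned by our logistic regression classifier. It separates the data as well as it can using a straight line, but it is unable to capture the "moon shape" of our data.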

Training a Neural Network

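Let's now build a 3-layer neural network with one input layer, one hidden layer, and one output layer.
The number of nodes in the input layer is determined by the dimensionality of our data, 2.
Similarly, the number of nodes in the output layer is determined by the number of classes we have, also 2. (Because we only have 2 classes we could actually get away with a single output node predicting 0 or 1, but having 2 makes it easier to extend the network to more classes later on.)
The input to the network will be x- and y-coordinates and its output will be two probabilities, one for class 0 ("female") and one for class 1 ("male").
[Figure: diagram of the network with its input, hidden, and output layers]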

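We can choose the dimensionality (the number of nodes) of the hidden layer. The more nodes we put into the hidden layer, the more complex the functions we will be able to fit. But higher dimensionality comes at a cost.
First, more computation is required to make predictions and to learn the network parameters.
A larger number of parameters also means we become more prone to overfitting our data.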

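How do we choose the size of the hidden layer? While there are some general guidelines and recommendations, it always depends on your specific problem and is more of an art than a science. We will play with the number of nodes in the hidden layer later on and see how it affects our output.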

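We also need to pick an activation function for our hidden layer. The activation function transforms the inputs of the layer into its outputs. A nonlinear activation function is what allows us to fit nonlinear hypotheses. Common choices for activation functions are tanh, the sigmoid function, or ReLUs.
We will use tanh, which performs quite well in many scenarios. A nice property of these functions is that their derivative can be computed using the original function value.
For example, the derivative of $\tanh x$ is $1-\tanh^2 x$. This is useful because it allows us to compute $\tanh x$ once and re-use its value later on to get the derivative.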
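
As a quick sanity check of that property, the sketch below computes the derivative from the stored activation value and compares it with a central finite difference. The sample points and the step size `h` are arbitrary choices made for illustration.

```python
import numpy as np

z = np.linspace(-2.0, 2.0, 5)

# Compute tanh once and keep the result around
a = np.tanh(z)

# Derivative obtained by re-using the stored activation: 1 - tanh^2(z)
grad_from_activation = 1.0 - a ** 2

# Independent check: central finite difference of tanh itself
h = 1e-6
grad_numeric = (np.tanh(z + h) - np.tanh(z - h)) / (2 * h)

print(np.allclose(grad_from_activation, grad_numeric, atol=1e-6))
```

This is the same expression that appears later in the backpropagation step as `1 - np.power(a1, 2)`.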

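Because we want our network to output probabilities, the activation function for the output layer will be the softmax, which is simply a way to convert raw scores to probabilities.
If you're familiar with the logistic function, you can think of softmax as its generalization to multiple classes.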
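
One way to see the "generalization" claim: with two classes and the second score fixed at 0, the softmax probability of the first class equals the logistic function of the first score. The sketch below checks this numerically; the helper names `softmax` and `logistic` are defined here for illustration only.

```python
import numpy as np

def softmax(scores):
    """Convert a 1-D array of raw scores into probabilities."""
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores)

def logistic(z):
    """The logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-z))

for z in (-2.0, 0.0, 3.0):
    two_class = softmax(np.array([z, 0.0]))
    # First softmax output vs. logistic(z): the two values should agree
    print(z, two_class[0], logistic(z))
```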

How our network makes predictions

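Our network makes predictions using forward propagation, which is just a bunch of matrix multiplications and the application of the activation functions we defined above.
If x is the 2-dimensional input to our network, then we calculate our prediction $y'$ (also two-dimensional) as follows: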

$z_1 = xW_1 + b_1$
$a_1 = \tanh(z_1)$
$z_2 = a_1W_2 + b_2$
$a_2 = y' = \mathrm{softmax}(z_2)$

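$z_i$ is the input of layer i and $a_i$ is the output of layer i after applying the activation function.
$W_1, b_1, W_2, b_2$ are the parameters of our network, which we need to learn from our training data. You can think of them as matrices transforming data between layers of the network.
Looking at the matrix multiplications above, we can figure out the dimensionality of these matrices:
if we use 500 nodes for our hidden layer, then $W_1 \in \mathbb{R}^{2\times500}, b_1 \in \mathbb{R}^{500}, W_2 \in \mathbb{R}^{500\times2}, b_2 \in \mathbb{R}^{2}$.
Now you see why we have more parameters if we increase the size of the hidden layer.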
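
The shapes are easy to confirm in code. The following is a small sketch with random, untrained weights and a made-up input batch; `forward` and `count_params` are illustrative helper names, not functions from the original post.

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Forward propagation: tanh hidden layer followed by a softmax output."""
    z1 = x.dot(W1) + b1
    a1 = np.tanh(z1)
    z2 = a1.dot(W2) + b2
    exp_scores = np.exp(z2)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

def count_params(n_in, n_hidden, n_out):
    """Total number of entries in W1, b1, W2, and b2."""
    return n_in * n_hidden + n_hidden + n_hidden * n_out + n_out

rng = np.random.RandomState(0)
n_hidden = 500
W1 = rng.randn(2, n_hidden) / np.sqrt(2)
b1 = np.zeros((1, n_hidden))
W2 = rng.randn(n_hidden, 2) / np.sqrt(n_hidden)
b2 = np.zeros((1, 2))

x_batch = rng.randn(4, 2)           # four made-up 2-dimensional inputs
probs = forward(x_batch, W1, b1, W2, b2)
print(probs.shape)                  # (4, 2): one probability pair per input
print(count_params(2, 500, 2))      # 2502
print(count_params(2, 3, 2))        # 17
```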

Learning the Parameters

Learning the parameters of our network means finding parameters ($W_1, b_1, W_2, b_2$) that minimize the error on our training data. But how do we define the error?
We call the function that measures our error the loss function.
A common choice with the softmax output is the categorical cross-entropy loss (also known as negative log likelihood).

If we have N training examples and C classes, then the loss for our prediction y' with respect to the true labels y is given by:

$L(y, y') = -\frac{1}{N} \sum_{n \in N} \sum_{i \in C} y_{n,i} \log y'_{n,i}$

The formula looks complicated, but all it really does is sum over our training examples and add to the loss if we predicted the incorrect class. The further apart the two probability distributions y (the correct labels) and y' (our predictions) are, the greater our loss will be. By finding parameters that minimize the loss, we maximize the likelihood of our training data.

We can use **gradient descent** to find the minimum. I will implement the most vanilla version of gradient descent, also called batch gradient descent with a fixed learning rate. Variations such as **SGD (stochastic gradient descent) or minibatch gradient descent** typically perform better in practice.

As an input, gradient descent needs the gradients (vector of derivatives) of the loss function with respect to our parameters:

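$\frac{\partial{L}}{\partial{W_1}}, \frac{\partial{L}}{\partial{b_1}}, \frac{\partial{L}}{\partial{W_2}}, \frac{\partial{L}}{\partial{b_2}}$

To calculate these gradients we use the famous **backpropagation algorithm**, which is a way to efficiently calculate the gradients starting from the output. I won't go into detail on how backpropagation works, but there are many excellent explanations floating around the web.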
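
Since the derivation is skipped, a numerical gradient check is a handy way to gain some trust in backpropagation formulas. The sketch below is an illustration rather than code from the original post: `loss_and_grads` computes the mean cross-entropy loss and its backpropagated gradients for the tanh/softmax network described above (without regularization), and `numerical_grad` estimates the same gradients by finite differences. Both function names, the toy data, and the step size are assumptions.

```python
import numpy as np

def loss_and_grads(X, y, W1, b1, W2, b2):
    """Mean cross-entropy loss and its gradients via backpropagation."""
    n = X.shape[0]
    # Forward propagation
    a1 = np.tanh(X.dot(W1) + b1)
    z2 = a1.dot(W2) + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    loss = -np.mean(np.log(probs[np.arange(n), y]))
    # Backpropagation: gradient of the mean loss w.r.t. the scores z2
    delta3 = probs.copy()
    delta3[np.arange(n), y] -= 1
    delta3 /= n
    dW2 = a1.T.dot(delta3)
    db2 = np.sum(delta3, axis=0, keepdims=True)
    delta2 = delta3.dot(W2.T) * (1 - a1 ** 2)   # tanh'(z1) = 1 - a1^2
    dW1 = X.T.dot(delta2)
    db1 = np.sum(delta2, axis=0, keepdims=True)
    return loss, {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}

def numerical_grad(f, param, h=1e-5):
    """Central-difference estimate of d f() / d param (param is edited in place)."""
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + h
        f_plus = f()
        param[idx] = old - h
        f_minus = f()
        param[idx] = old          # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

rng = np.random.RandomState(0)
X_toy = rng.randn(5, 2)
y_toy = rng.randint(0, 2, size=5)
params = {"W1": rng.randn(2, 3), "b1": np.zeros((1, 3)),
          "W2": rng.randn(3, 2), "b2": np.zeros((1, 2))}

loss, grads = loss_and_grads(X_toy, y_toy, **params)
loss_only = lambda: loss_and_grads(X_toy, y_toy, **params)[0]
max_err = max(np.abs(numerical_grad(loss_only, params[k]) - grads[k]).max()
              for k in params)
print("largest mismatch between backprop and numerical gradient:", max_err)
```

If the backpropagation formulas are right, the reported mismatch should be tiny compared with the gradient values themselves.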

Implementation:

We start by defining some useful variables and parameters for gradient descent:

num_examples = len(X) # training set size
nn_input_dim = 2 # input layer dimensionality
nn_output_dim = 2 # output layer dimensionality

# Gradient descent parameters (I picked these by hand)
epsilon = 0.01 # learning rate for gradient descent
reg_lambda = 0.01 # regularization strength

First let's implement the loss function we defined above. We use this to evaluate how well our model is doing:

# Helper function to evaluate the total loss on the dataset
def calculate_loss(model):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation to calculate our predictions
    z1 = X.dot(W1) + b1
    a1 = np.tanh(z1)
    z2 = a1.dot(W2) + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    # Calculating the loss
    correct_logprobs = -np.log(probs[range(num_examples), y])
    data_loss = np.sum(correct_logprobs)
    # Add regularization term to loss (optional)
    data_loss += reg_lambda/2 * (np.sum(np.square(W1)) + np.sum(np.square(W2)))
    return 1./num_examples * data_loss
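
A side note on the softmax lines used here: `np.exp` overflows for very large scores. With this post's small network and data that is unlikely to matter, but a common refinement is to subtract each row's maximum before exponentiating, which leaves the probabilities unchanged. The sketch below shows that variant; `stable_softmax` is a hypothetical helper, not part of the original code.

```python
import numpy as np

def stable_softmax(z2):
    """Row-wise softmax that subtracts the row maximum before exponentiating."""
    shifted = z2 - np.max(z2, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

small_scores = np.array([[1.0, 2.0]])
large_scores = small_scores + 1000.0   # np.exp(1001.0) alone would overflow

# Shifting every score in a row by the same constant does not change the result
print(np.allclose(stable_softmax(small_scores), stable_softmax(large_scores)))
print(np.all(np.isfinite(stable_softmax(large_scores))))
```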

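We also implement a helper function to calculate the output of the network. It does forward propagation as defined above and returns the class with the highest probability.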

# Helper function to predict an output (0 or 1)
def predict(model, x):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation
    z1 = x.dot(W1) + b1
    a1 = np.tanh(z1)
    z2 = a1.dot(W2) + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    return np.argmax(probs, axis=1)

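Finally, here is the function that trains our neural network. It implements batch gradient descent using the backpropagation we mentioned above.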

# This function learns parameters for the neural network and returns the model.
# - nn_hdim: Number of nodes in the hidden layer
# - num_passes: Number of passes through the training data for gradient descent
# - print_loss: If True, print the loss every 1000 iterations
def build_model(nn_hdim, num_passes=20000, print_loss=False):

    # Initialize the parameters to random values. We need to learn these.
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim)
    b1 = np.zeros((1, nn_hdim))
    W2 = np.random.randn(nn_hdim, nn_output_dim) / np.sqrt(nn_hdim)
    b2 = np.zeros((1, nn_output_dim))

    # This is what we return at the end
    model = {}

    # Gradient descent. For each batch...
    for i in range(0, num_passes):

        # Forward propagation
        z1 = X.dot(W1) + b1
        a1 = np.tanh(z1)
        z2 = a1.dot(W2) + b2
        exp_scores = np.exp(z2)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

        # Backpropagation
        delta3 = probs
        delta3[range(num_examples), y] -= 1
        dW2 = (a1.T).dot(delta3)
        db2 = np.sum(delta3, axis=0, keepdims=True)
        delta2 = delta3.dot(W2.T) * (1 - np.power(a1, 2))
        dW1 = np.dot(X.T, delta2)
        db1 = np.sum(delta2, axis=0)

        # Add regularization terms (b1 and b2 don't have regularization terms)
        dW2 += reg_lambda * W2
        dW1 += reg_lambda * W1

        # Gradient descent parameter update
        W1 += -epsilon * dW1
        b1 += -epsilon * db1
        W2 += -epsilon * dW2
        b2 += -epsilon * db2

        # Assign new parameters to the model
        model = { 'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}

        # Optionally print the loss.
        # This is expensive because it uses the whole dataset, so we don't want to do it too often.
        if print_loss and i % 1000 == 0:
            print("Loss after iteration %i: %f" % (i, calculate_loss(model)))

    return model
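
The training function above uses the whole dataset for every update. For the minibatch variant mentioned earlier, one possible way to draw the batches is sketched below; `iterate_minibatches` and the batch size of 32 are assumptions for illustration. Inside the training loop, `X[batch_idx]` and `y[batch_idx]` would then take the place of `X` and `y` for each update.

```python
import numpy as np

def iterate_minibatches(num_examples, batch_size, rng):
    """Yield index arrays that together cover one shuffled pass over the data."""
    order = rng.permutation(num_examples)
    for start in range(0, num_examples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.RandomState(0)
batches = list(iterate_minibatches(200, 32, rng))

# 200 examples in batches of 32: six full batches and one final batch of 8
print(len(batches), [len(b) for b in batches])
```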

A network with a hidden layer of size 3

Let's see what happens if we train a network with a hidden layer size of 3:

# Build a model with a 3-dimensional hidden layer
model = build_model(3, print_loss=True)

# Plot the decision boundary
plot_decision_boundary(lambda x: predict(model, x))
plt.title("Decision Boundary for hidden layer size 3")

[Figure: decision boundary for hidden layer size 3]

This looks pretty good. Our neural network was able to find a decision boundary that successfully separates the classes.
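
To put a number on "pretty good", one option is to compare the predictions with the labels. The helper below is a small sketch (the name `accuracy` is an assumption); with the objects from this post it could be called as `accuracy(predict(model, X), y)`.

```python
import numpy as np

def accuracy(predicted, actual):
    """Fraction of examples whose predicted class matches the true class."""
    return np.mean(np.asarray(predicted) == np.asarray(actual))

# Tiny made-up example: three of the four predictions are correct
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```

Keep in mind that this measures accuracy on the training data, which says little about how the model behaves on unseen points.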

Varying the hidden layer size

In the example above we picked a hidden layer size of 3. Let's now get a sense of how varying the hidden layer size affects the result.

plt.figure(figsize=(16, 32))
hidden_layer_dimensions = [1, 2, 3, 4, 5, 20, 50]
for i, nn_hdim in enumerate(hidden_layer_dimensions):
    plt.subplot(5, 2, i+1)
    plt.title('Hidden Layer size %d' % nn_hdim)
    model = build_model(nn_hdim)
    plot_decision_boundary(lambda x: predict(model, x))
plt.show()

[Figure: decision boundaries for hidden layer sizes 1, 2, 3, 4, 5, 20, and 50]

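We can see that a hidden layer of low dimensionality nicely captures the general trend of our data. Higher dimensionalities are prone to overfitting. They are "memorizing" the data as opposed to fitting the general shape. If we were to evaluate our model on a separate test set (and you should!), the model with a smaller hidden layer size would likely perform better because it generalizes better. We could counteract overfitting with stronger regularization, but picking the correct size for the hidden layer is a much more "economical" solution.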
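
Evaluating on a separate test set first requires holding some data out. The sketch below shows one simple way to split indices with NumPy; `train_test_split_indices` and the 25% test fraction are assumptions, not part of the original post. Note that `build_model` and `calculate_loss` as written read the global `X`, `y`, and `num_examples`, so those would need to point at the training subset before training.

```python
import numpy as np

def train_test_split_indices(num_examples, test_fraction, rng):
    """Return (train_idx, test_idx) as disjoint, shuffled index arrays."""
    order = rng.permutation(num_examples)
    num_test = int(round(num_examples * test_fraction))
    return order[num_test:], order[:num_test]

rng = np.random.RandomState(0)
train_idx, test_idx = train_test_split_indices(200, 0.25, rng)
print(len(train_idx), len(test_idx))  # 150 50
```

Comparing accuracy on `X[test_idx]` across hidden layer sizes would then show whether the larger networks really generalize worse on this dataset.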

The original English post is at:
http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch/
