









Introducing the cross-entropy cost function
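For sigmoid outputs the cross-entropy cost is C = -(1/n) Σ_x [y ln a + (1 - y) ln(1 - a)]. A minimal NumPy sketch (the function name is illustrative, not from the book's code):

```python
import numpy as np

def cross_entropy_cost(a, y):
    """Cross-entropy cost C = -(1/n) * sum(y*ln(a) + (1-y)*ln(1-a)).

    a: network outputs in (0, 1); y: target labels (0 or 1).
    np.nan_to_num guards against 0 * log(0) producing NaN.
    """
    return -np.mean(np.nan_to_num(y * np.log(a) + (1 - y) * np.log(1 - a)))

# The cost is small when the output is close to the target,
# and large when the output is confidently wrong:
print(cross_entropy_cost(np.array([0.99]), np.array([1.0])))  # ~0.01
print(cross_entropy_cost(np.array([0.5]), np.array([1.0])))   # ~0.69
```

Unlike the quadratic cost, its gradient with respect to the weights does not carry a σ'(z) factor, so learning does not slow down when the neuron saturates.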























>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data[:1000], 400, 10, 0.5,
...         evaluation_data=test_data, lmbda = 0.1,
...         monitor_evaluation_cost=True, monitor_evaluation_accuracy=True,
...         monitor_training_cost=True, monitor_training_accuracy=True)
This time, however, accuracy on the test data keeps increasing:
The peak accuracy is also higher, showing that regularization reduces overfitting.
Now with the full 50,000-image training set:
Same parameters: 30 epochs, learning rate 0.5, mini-batch size 10.
λ must change, because n grows from 1,000 to 50,000.

>>> net.SGD(training_data, 30, 10, 0.5,
...         evaluation_data=test_data, lmbda = 5.0,
...         monitor_evaluation_accuracy=True, monitor_training_accuracy=True)
The results are much better: accuracy on the test set rises, and the gap between the two curves shrinks dramatically.
With 100 neurons in the hidden layer:
>>> net = network2.Network([784, 100, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data, 30, 10, 0.5, lmbda=5.0,
...         evaluation_data=validation_data,
...         monitor_evaluation_accuracy=True)
Final accuracy on the test set reaches 97.92%, a big improvement over the 30-neuron hidden layer.
After tuning the hyper-parameters (learning rate 0.1, λ = 5.0), only 30 epochs are needed for the accuracy to pass 98%, reaching 98.04%.
Adding regularization not only reduces overfitting, it also helps avoid getting stuck in local minima and makes experiments easier to reproduce.
Why does regularization reduce overfitting?
Consider a simple data set.

Implementing an improved version of the neural network algorithm to recognize handwritten digits:
Recall the earlier, original version: network.py
We improved it in the following ways:
Cost function: cross-entropy
Regularization: L1, L2
Softmax layer
Weight initialization: Gaussian with standard deviation 1/sqrt(n_in)
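A sketch of the 1/sqrt(n_in) initialization (network2's default weight initializer works this way; the function name here is illustrative):

```python
import numpy as np

def default_weight_init(sizes):
    """Biases: standard Gaussians. Weights: Gaussians scaled by
    1/sqrt(n_in), so a neuron's weighted input z stays O(1)
    regardless of how many inputs feed into it."""
    rng = np.random.default_rng(0)
    biases = [rng.standard_normal((y, 1)) for y in sizes[1:]]
    weights = [rng.standard_normal((y, x)) / np.sqrt(x)
               for x, y in zip(sizes[:-1], sizes[1:])]
    return biases, weights

b, w = default_weight_init([784, 30, 10])
print([m.shape for m in w])  # [(30, 784), (10, 30)]
```

Without the 1/sqrt(n_in) factor, z would have standard deviation around sqrt(784 + 1) ≈ 28, saturating the sigmoid and slowing early learning.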


>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 10.0, lmbda = 1000.0,
...         evaluation_data=validation_data, monitor_evaluation_accuracy=True)
Result:
Epoch 0 training complete
Accuracy on evaluation data: 1030 / 10000
Epoch 1 training complete
Accuracy on evaluation data: 990 / 10000
Epoch 2 training complete
Accuracy on evaluation data: 1009 / 10000
...
Epoch 27 training complete
Accuracy on evaluation data: 1009 / 10000
Epoch 28 training complete
Accuracy on evaluation data: 983 / 10000
Epoch 29 training complete
Accuracy on evaluation data: 967 / 10000
No better than random guessing!
A neural network has many adjustable factors:
Network architecture: number of layers, number of neurons per layer
Method of initializing the weights and biases
Cost function
Regularization: L1, L2
Sigmoid output or softmax?
Use dropout?
Training set size
Mini-batch size
>>> net = network2.Network([784, 10])
>>> net.SGD(training_data[:1000], 30, 10, 10.0, lmbda = 1000.0, \
... evaluation_data=validation_data[:100], \
... monitor_evaluation_accuracy=True)
Epoch 0 training complete
Accuracy on evaluation data: 10 / 100
Epoch 1 training complete
Accuracy on evaluation data: 10 / 100
Epoch 2 training complete
Accuracy on evaluation data: 10 / 100
...
This gives much faster feedback: each epoch used to take around 10 seconds, now it takes under 1 second.
λ was set to 1000 before; since the training set shrank from 50,000 to 1,000 examples, λ must shrink in proportion to keep the weight decay per step the same, giving λ = 20.0.
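The rescaling works because SGD with L2 regularization multiplies each weight by (1 - ηλ/n) per step, so keeping λ/n constant keeps the decay constant. A quick check with the values from the runs above:

```python
# Keep the per-step weight-decay factor (1 - eta*lambda/n) constant
# when shrinking the training set from 50,000 to 1,000 examples.
eta = 10.0
n_full, lmbda_full = 50000, 1000.0
n_small = 1000
lmbda_small = lmbda_full * n_small / n_full
print(lmbda_small)  # 20.0

decay_full = 1 - eta * lmbda_full / n_full
decay_small = 1 - eta * lmbda_small / n_small
print(decay_full, decay_small)  # 0.8 0.8
```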
>>> net = network2.Network([784, 10])
>>> net.SGD(training_data[:1000], 30, 10, 10.0, lmbda = 20.0, \
... evaluation_data=validation_data[:100], \
... monitor_evaluation_accuracy=True)
Result:
Epoch 0 training complete
Accuracy on evaluation data: 12 / 100
Epoch 1 training complete
Accuracy on evaluation data: 14 / 100
Epoch 2 training complete
Accuracy on evaluation data: 25 / 100
Epoch 3 training complete
Accuracy on evaluation data: 18 / 100
Perhaps the learning rate η = 10.0 is too low? Should it be higher?
Increase it to 100:
>>> net = network2.Network([784, 10])
>>> net.SGD(training_data[:1000], 30, 10, 100.0, lmbda = 20.0, \
... evaluation_data=validation_data[:100], \
... monitor_evaluation_accuracy=True)
Result:
Epoch 0 training complete
Accuracy on evaluation data: 10 / 100
Epoch 1 training complete
Accuracy on evaluation data: 10 / 100
Epoch 2 training complete
Accuracy on evaluation data: 10 / 100
Epoch 3 training complete
Accuracy on evaluation data: 10 / 100
The results are terrible. Perhaps the learning rate should be lower instead? Try η = 1.0:
>>> net = network2.Network([784, 10])
>>> net.SGD(training_data[:1000], 30, 10, 1.0, lmbda = 20.0, \
... evaluation_data=validation_data[:100], \
... monitor_evaluation_accuracy=True)
Much better:
Epoch 0 training complete
Accuracy on evaluation data: 62 / 100
Epoch 1 training complete
Accuracy on evaluation data: 42 / 100
Epoch 2 training complete
Accuracy on evaluation data: 43 / 100
Epoch 3 training complete
Accuracy on evaluation data: 61 / 100
Suppose the other parameters are held fixed: 30 epochs, mini-batch size 10, λ = 5.0.
Experiment with learning rates 0.025, 0.25, 2.5.
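One way to organize such a sweep (sweep_learning_rates is a hypothetical helper; a real run would build a network2.Network and call net.SGD with the fixed parameters above):

```python
def sweep_learning_rates(train_fn, etas=(0.025, 0.25, 2.5)):
    """Call train_fn once per candidate learning rate and collect
    whatever it returns, e.g. a per-epoch accuracy curve."""
    return {eta: train_fn(eta) for eta in etas}

# Stand-in training function; a real one would run something like
# net.SGD(training_data, 30, 10, eta, lmbda=5.0, ...) and return
# the monitored accuracy list.
curves = sweep_learning_rates(lambda eta: [eta])
print(sorted(curves))  # [0.025, 0.25, 2.5]
```

Plotting the three cost curves side by side makes the effect visible: too large an η overshoots and oscillates, too small an η learns needlessly slowly.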




>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
...         evaluation_data=validation_data, monitor_evaluation_accuracy=True)
Result: 96.48%
Add a hidden layer:
>>> net = network2.Network([784, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
...         evaluation_data=validation_data, monitor_evaluation_accuracy=True)
Result: 96.90%
Add another hidden layer:
>>> net = network2.Network([784, 30, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
...         evaluation_data=validation_data, monitor_evaluation_accuracy=True)
Result: 96.57%
Why does adding a layer lower the accuracy?
The length of each bar represents ∂C/∂b, the rate at which the cost changes with respect to the neuron's bias:



The gradient is much smaller in the first hidden layer than in the second: ∥δ¹∥ = 0.07 versus ∥δ²∥ = 0.31.
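One way to see why the earlier layer's gradient is smaller: backpropagation multiplies the error by σ'(z) at every layer, and σ' never exceeds 0.25. A small numeric illustration (assuming weights of order 1, so the σ' factors dominate):

```python
import numpy as np

def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)

# sigma'(z) peaks at z = 0 with value exactly 0.25, so each extra
# sigmoid layer shrinks the backpropagated gradient by at least 4x
# (more if the neuron is saturated):
print(sigmoid_prime(0.0))  # 0.25
print(0.25 ** 3)           # after 3 layers: 0.015625
```

This is the vanishing gradient problem: earlier layers learn much more slowly than later ones, which is why naively stacking sigmoid layers can hurt.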








A CNN's architecture is quite different: the input layer is a two-dimensional grid of neurons.
>>> import network3
>>> from network3 import Network
>>> from network3 import ConvPoolLayer, FullyConnectedLayer, SoftmaxLayer
>>> training_data, validation_data, test_data = network3.load_data_shared()
>>> mini_batch_size = 10
>>> net = Network([
        FullyConnectedLayer(n_in=784, n_out=100),
        SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(training_data, 60, mini_batch_size, 0.1,
            validation_data, test_data)
Result: 97.80% accuracy (last time: 98.04%).
This time: no regularization; last time: regularization.
This time: softmax; last time: sigmoid + cross-entropy.
Add a convolutional layer:
>>> net = Network([
        ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
                      filter_shape=(20, 1, 5, 5),
                      poolsize=(2, 2)),
        FullyConnectedLayer(n_in=20*12*12, n_out=100),
        SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(training_data, 60, mini_batch_size, 0.1,
            validation_data, test_data)
Accuracy: 98.78%, a significant improvement over the last run.
Add another convolutional layer (two in total):
>>> net = Network([
        ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
                      filter_shape=(20, 1, 5, 5),
                      poolsize=(2, 2)),
        ConvPoolLayer(image_shape=(mini_batch_size, 20, 12, 12),
                      filter_shape=(40, 20, 5, 5),
                      poolsize=(2, 2)),
        FullyConnectedLayer(n_in=40*4*4, n_out=100),
        SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(training_data, 60, mini_batch_size, 0.1,
            validation_data, test_data)
Accuracy: 99.06% (a new record).
Replace sigmoid with rectified linear units:
f(z) = max(0, z)
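A one-line sketch of the activation; its derivative is 1 for z > 0, so it does not saturate the way sigmoid does:

```python
import numpy as np

def relu(z):
    """Rectified linear unit: f(z) = max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

print(relu(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 3.]
```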
>>> from network3 import ReLU
>>> net = Network([
        ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
                      filter_shape=(20, 1, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        ConvPoolLayer(image_shape=(mini_batch_size, 20, 12, 12),
                      filter_shape=(40, 20, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        FullyConnectedLayer(n_in=40*4*4, n_out=100, activation_fn=ReLU),
        SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(training_data, 60, mini_batch_size, 0.03,
            validation_data, test_data, lmbda=0.1)
Accuracy: 99.23%, a slight improvement over the 99.06% with sigmoid.
Expand the training set: shift each image up, down, left, and right by one pixel.
Total training set: 50,000 × 5 = 250,000 images.
$ python expand_mnist.py
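A minimal sketch of the displacement idea (shift_image is illustrative; expand_mnist.py's actual implementation may differ in details):

```python
import numpy as np

def shift_image(img, dx, dy):
    """Shift a 28x28 image by (dx, dy) pixels, zero-filling the
    pixels vacated by the shift."""
    shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
    if dy > 0:
        shifted[:dy, :] = 0
    elif dy < 0:
        shifted[dy:, :] = 0
    if dx > 0:
        shifted[:, :dx] = 0
    elif dx < 0:
        shifted[:, dx:] = 0
    return shifted

img = np.zeros((28, 28))
img[14, 14] = 1.0
# One original plus four one-pixel shifts -> 5 variants per image.
variants = [img] + [shift_image(img, dx, dy)
                    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))]
print(len(variants), variants[1][14, 15])  # 5 1.0
```

The labels stay the same, so this is a cheap way to tell the network that digit identity is invariant to small translations.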
>>> expanded_training_data, _, _ = network3.load_data_shared(
        "../data/mnist_expanded.pkl.gz")
>>> net = Network([
        ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
                      filter_shape=(20, 1, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        ConvPoolLayer(image_shape=(mini_batch_size, 20, 12, 12),
                      filter_shape=(40, 20, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        FullyConnectedLayer(n_in=40*4*4, n_out=100, activation_fn=ReLU),
        SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(expanded_training_data, 60, mini_batch_size, 0.03,
            validation_data, test_data, lmbda=0.1)
Result: 99.37%
Insert another 100-neuron fully connected hidden layer:
>>> net = Network([
        ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
                      filter_shape=(20, 1, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        ConvPoolLayer(image_shape=(mini_batch_size, 20, 12, 12),
                      filter_shape=(40, 20, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        FullyConnectedLayer(n_in=40*4*4, n_out=100, activation_fn=ReLU),
        FullyConnectedLayer(n_in=100, n_out=100, activation_fn=ReLU),
        SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(expanded_training_data, 60, mini_batch_size, 0.03,
            validation_data, test_data, lmbda=0.1)
Result: 99.43%, not a big improvement.
It may be overfitting.
Add dropout to the fully connected layers:
>>> expanded_training_data, _, _ = network3.load_data_shared(
        "../data/mnist_expanded.pkl.gz")
>>> net = Network([
        ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
                      filter_shape=(20, 1, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        ConvPoolLayer(image_shape=(mini_batch_size, 20, 12, 12),
                      filter_shape=(40, 20, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        FullyConnectedLayer(
            n_in=40*4*4, n_out=1000, activation_fn=ReLU, p_dropout=0.5),
        FullyConnectedLayer(
            n_in=1000, n_out=1000, activation_fn=ReLU, p_dropout=0.5),
        SoftmaxLayer(n_in=1000, n_out=10, p_dropout=0.5)],
        mini_batch_size)
>>> net.SGD(expanded_training_data, 40, mini_batch_size, 0.03,
            validation_data, test_data)
Result: 99.60%, a significant improvement.
Epochs: reduced to 40.
The fully connected hidden layers now have 1,000 neurons each.
Ensemble of networks: train several networks and have them vote on the answer; this sometimes improves accuracy further.
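A minimal sketch of the voting step, assuming each trained network has produced a predicted digit for the same image (ensemble_vote is a hypothetical helper):

```python
from collections import Counter

def ensemble_vote(predictions):
    """Majority vote across several networks' predicted digits.
    Counter.most_common(1) returns the single most frequent label."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical networks disagree on one image:
print(ensemble_vote([7, 7, 9]))  # 7
```

The ensemble helps because independently trained networks tend to make different mistakes, and the majority is right more often than any single member.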
Why apply dropout only to the final fully connected layers?
The convolutional layers themselves already resist overfitting: the shared weights force each convolution filter to learn from the entire image.
Why can some of the difficulties of deep learning be overcome here?
CNNs greatly reduce the number of parameters.
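A back-of-the-envelope count for the first layer above, 20 shared 5×5 filters versus a fully connected layer of the same width:

```python
# Convolutional layer: 20 feature maps, each a 5x5 shared filter
# plus one shared bias.
conv_params = 20 * (5 * 5 + 1)

# Fully connected comparison: every one of the 28*28 input pixels
# wired to each of 20 hidden neurons, plus a bias per neuron.
fc_params = 28 * 28 * 20 + 20

print(conv_params, fc_params)  # 520 15700
```

Roughly 30x fewer parameters, and that gap widens in later layers; fewer parameters means faster training and less capacity to overfit.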
Dropout reduces overfitting.
Rectified linear units replace sigmoid, easing the problem of different layers learning at very different rates (vanishing gradients).
GPUs compute much faster: each update is cheap, so many more updates can be run.
How deep are today's deep neural networks (how many layers)?
Up to twenty-some layers.

activation: f(weight w × input x + bias b) = output a
Multiple inputs:
Multiple hidden layers:
Reconstructions:
The hidden layer becomes the input layer; update in the reverse direction, using the old weights and new biases:
Back to the original input layer:
Compare the computed values with the original inputs, minimize the error, and keep updating iteratively:
Forward pass: given the weights, use the input to predict the neurons' activations, i.e., the output probabilities: p(a|x; w)
In the backward pass:
The activations are fed back into the network to predict the original data x; the RBM tries to estimate the probability of x given the activations a: p(x|a; w)
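A toy sketch of the two passes (the sizes and variable names are illustrative; a real RBM would also update w from the difference between the data statistics and the reconstruction statistics):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy RBM: 6 visible units, 3 hidden units.
w = rng.standard_normal((6, 3)) * 0.1
b_hid = np.zeros(3)   # hidden biases
b_vis = np.zeros(6)   # visible biases
x = rng.integers(0, 2, size=6).astype(float)  # one binary data vector

# Forward pass: p(a | x; w) -- hidden activation probabilities,
# then a sampled binary hidden state.
p_a = sigmoid(x @ w + b_hid)
a = (rng.random(3) < p_a).astype(float)

# Backward pass: p(x | a; w) -- reconstruct the visible layer using
# the SAME weights (transposed) and the visible biases.
p_x = sigmoid(a @ w.T + b_vis)
print(p_x.shape)  # (6,)
```

The key point the notes make is weight sharing between the two directions: the same w is used forward (x to a) and backward (a to x), with only the biases differing.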

Encoding:
784 (input) ----> 1000 ----> 500 ----> 250 ----> 100 ----> 30
Even though 1000 > 784, this is not trivial copying: a sigmoid/binary unit carries less information than a real number.
Decoding:
784 (output) <---- 1000 <---- 500 <---- 250 <---- 100 <---- 30
Uses: dimensionality reduction, image search (compression), data compression, information retrieval.
scikit-neuralnetwork:
The iris data set:
https://en.wikipedia.org/wiki/Iris_flower_data_set
https://github.com/aigamedev/scikit-neuralnetwork
Example:
import logging
logging.basicConfig()  # sknn reports training progress via the logging module

import numpy as np
from sknn.mlp import Classifier, Layer
from sklearn import cross_validation
from sklearn import datasets

# Load the iris data set: 150 samples, 4 features, 3 classes.
iris = datasets.load_iris()
# iris.data.shape, iris.target.shape

# Hold out 40% of the data for testing.
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

# One hidden layer of 100 rectifier units, followed by the output layer.
nn = Classifier(
    layers=[
        Layer("Rectifier", units=100),
        Layer("Linear")],
    learning_rate=0.02,
    n_iter=10)
nn.fit(X_train, y_train)

y_pred = nn.predict(X_test)
score = nn.score(X_test, y_test)
# print("y_test", y_test)
# print("y_pred", y_pred)
print("score", score)
Forward pass: given these pixels, should the weights send a stronger signal to "elephant" or to "dog"?
Backward pass: given "elephant" and "dog", what distribution of pixels should I expect?
Discriminative learning: map inputs to outputs, separating several classes of points.