This chapter introduces several techniques for training deep neural networks and implements them in Python (mainly CNNs); it then surveys the deep learning techniques recently used in image recognition, speech recognition, and other applications; finally it closes with an outlook on the future of neural networks and AI.
Introducing convolutional networks
之前的手写识别,用了如下的层间全连接的神经网络:
具体来说,输入层有
28∗28
个神经元,对应着
784(=28∗28)
个像素;训练这个神经网络的biases和weights,使得这个网络正确地判别数字。
The earlier fully-connected network already reaches about 98% accuracy, but a fully-connected architecture is short-sighted: it ignores the spatial structure of the images. For example, it treats two adjacent pixels the same way as two pixels that are far apart.
Convolutional neural networks (CNNs) are an architecture that exploits this spatial structure. It makes the network faster to train and is currently widely used for image recognition.
A CNN is built from three ideas: local receptive fields, shared weights, and pooling.
Local receptive fields
In a CNN the input image is treated as a two-dimensional array, for example a 28×28 array:
As before, each pixel forms one neuron of the input layer. Unlike before, each neuron of the first hidden layer is not connected to every input neuron; instead it is connected to a small region of input neurons. In the example below, that region is 5×5:
This region is called the local receptive field. Each connection has a weight, and each hidden neuron also has a bias.
Repeating this for each receptive field gives the first neuron of the first hidden layer:
the second neuron:
and so on, building up the complete first hidden layer.
In this example, with a 28×28 input image and 5×5 local receptive fields, we obtain a 24×24 (= 28−5+1) hidden layer.
The example above slides the local receptive field one pixel at a time; a stride of two or more pixels is also sometimes used. A quick size calculation is sketched below.
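As a small check of the arithmetic, here is a minimal sketch (the function name is illustrative, not from the book's code) of how the hidden-layer size follows from the input size, the receptive-field size, and the stride:

```python
# Minimal sketch: hidden-layer width/height from input size, local receptive
# field size, and stride (illustrative helper, not part of network3.py).
def conv_output_size(input_size, field_size, stride=1):
    return (input_size - field_size) // stride + 1

print(conv_output_size(28, 5))            # 24 -> a 24x24 hidden layer
print(conv_output_size(28, 5, stride=2))  # 12 -> sliding the field 2 pixels at a time
```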
Shared weights and biases
In the example above, each hidden neuron has one bias and a 5×5 array of weights connected to its local receptive field, and the same weights and bias are used for all 24×24 hidden neurons. The output of the j,k-th hidden neuron is

σ(b + Σ_{l=0}^{4} Σ_{m=0}^{4} w_{l,m} a_{j+l, k+m})        (125)

where σ is the activation function, b is the shared bias, w_{l,m} is the 5×5 array of shared weights, and a_{x,y} is the input activation at position x, y.
This means that every neuron in the first hidden layer detects exactly the same feature. For this reason the map from the input layer to the hidden layer is called a feature map, its weights are called the shared weights, and its bias the shared bias. Together, the shared weights and shared bias are referred to as a kernel or filter.
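To make equation (125) concrete, here is a minimal numpy sketch (not the Theano code used later) that computes a single 24×24 feature map from a 28×28 input using one 5×5 set of shared weights and one shared bias:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feature_map(a, w, b):
    """a: 28x28 input activations, w: 5x5 shared weights, b: shared bias."""
    out = np.zeros((24, 24))
    for j in range(24):
        for k in range(24):
            # Equation (125): the same w and b are used for every hidden neuron
            out[j, k] = sigmoid(b + np.sum(w * a[j:j+5, k:k+5]))
    return out

a = np.random.rand(28, 28)          # stand-in for an MNIST image
w = np.random.randn(5, 5) * 0.1
b = 0.0
print(feature_map(a, w, b).shape)   # (24, 24)
```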
For image recognition we need more than one feature map, so a complete convolutional layer consists of several feature maps:
The example above has 3 feature maps, each defined by a 5×5 set of shared weights plus one shared bias, so this network can detect 3 different features.
The big advantage of shared weights and biases is that they greatly reduce the number of parameters compared with a fully-connected model, which speeds up training; a rough count follows.
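As a rough illustration of the savings (assuming, for the sake of the comparison, 20 feature maps in the convolutional layer and a fully-connected first layer of 30 hidden neurons as in the earlier chapters):

```python
# Parameter counts under the stated assumptions.
conv_params = 20 * (5 * 5 + 1)   # 20 feature maps, each 5x5 shared weights + 1 shared bias = 520
fc_params = 784 * 30 + 30        # 784 inputs fully connected to 30 hidden neurons, plus biases = 23550
print(conv_params, fc_params)
```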
One last point: this structure is called convolutional because expression (125) has the form of a convolution.
As for convolution itself, I still remember my digital-signal-processing teacher sketching superposed waves, and my signals-and-systems teacher's chained-bell analogy and the area-shifting method for computing convolutions. Seen that way, this expression really does have the flavour of sliding a window across the input and summing over the overlap.
Pooling layers
Pooling layers usually follow the convolutional layers; their role is to simplify the information in the convolutional layer's output.
Concretely, a pooling layer condenses each feature map output by the convolutional layer into a smaller feature map.
For example, each 2×2 block of neurons can be condensed by taking the maximum of their outputs; this is called max-pooling:
Where each feature map previously had 24×24 neurons, only 12×12 remain.
Combining everything described above gives the following structure:
Max-pooling can be thought of as the network asking whether a given feature is found anywhere in a region of the image; it keeps only that rough location and discards the exact positional information.
Commonly used pooling methods include max-pooling and L2 pooling (taking the square root of the sum of the squares of the activations in each 2×2 block); both are sketched below.
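A minimal numpy sketch of both pooling variants on a single 24×24 feature map (illustrative only; the Theano code later uses max_pool_2d):

```python
import numpy as np

def pool_2x2(fmap, mode="max"):
    """Condense each non-overlapping 2x2 block of `fmap` to a single number."""
    h, w = fmap.shape
    blocks = fmap.reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))              # max-pooling
    return np.sqrt((blocks ** 2).sum(axis=(1, 3)))  # L2 pooling

fmap = np.random.rand(24, 24)
print(pool_2x2(fmap, "max").shape)  # (12, 12)
print(pool_2x2(fmap, "l2").shape)   # (12, 12)
```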
Putting it all together
Finally we add 10 output neurons, one for each digit from 0 to 9; the complete network looks like this:
Note that the output layer is fully connected to the pooling layer; the figure above simplifies this.
The goal of this architecture is the same as before: use the training data to learn weights and biases so that the network classifies its input well. Training still uses backpropagation and SGD, although some of the details change.
I'll leave the following problem for now and work through it later.
Problem: Backpropagation in a convolutional network
The core equations of backpropagation in a network with fully-connected layers are (BP1)-(BP4) (link). Suppose we have a network containing a convolutional layer, a max-pooling layer, and a fully-connected output layer, as in the network discussed above. How are the equations of backpropagation modified?
Convolutional neural networks in practice
This section shows how a CNN is applied to the handwritten-digit example. The full code appears in the next section; here it is used as a library for experiments and demonstrations, and the Theano library is used to handle backpropagation in the CNN.
The experiments use the various techniques introduced earlier, such as different activation functions, a softmax output, dropout, and so on.
This also answers the question raised in the previous chapter: given the vanishing and exploding gradient problems, why can we train these networks at all?
The answer is not that these problems are avoided, but that several techniques are used to mitigate them:
- Using convolutional layers greatly reduces the number of parameters in those layers, which makes learning easier.
- Using powerful regularization techniques (dropout, and the convolutional layers themselves) reduces overfitting.
- Using ReLU instead of sigmoid speeds up learning.
- Using GPUs.
There are other details as well: enough training data to avoid overfitting; the right cost function to avoid a learning slowdown; good weight initialization to avoid slowdown and neuron saturation; artificially expanding the training data; and so on. A sketch of how the code in the next section is driven follows.
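For reference, this is how the code from the next section is used in the book's experiments: a ConvPoolLayer, a fully-connected layer, and a softmax output layer, trained with mini-batch SGD. The hyperparameters here (60 epochs, learning rate 0.1, 100 hidden neurons, 20 feature maps) are illustrative:

```python
import network3
from network3 import Network, ConvPoolLayer, FullyConnectedLayer, SoftmaxLayer

training_data, validation_data, test_data = network3.load_data_shared()
mini_batch_size = 10
net = Network([
    ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
                  filter_shape=(20, 1, 5, 5),       # 20 feature maps, 5x5 filters
                  poolsize=(2, 2)),
    FullyConnectedLayer(n_in=20*12*12, n_out=100),  # 12x12 = pooled output per feature map
    SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
net.SGD(training_data, 60, mini_batch_size, 0.1, validation_data, test_data)
```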
The code for our convolutional networks
The full code is below; I haven't polished the details yet.
"""network3.py
~~~~~~~~~~~~~~
A Theano-based program for training and running simple neural
networks.
Supports several layer types (fully connected, convolutional, max
pooling, softmax), and activation functions (sigmoid, tanh, and
rectified linear units, with more easily added).
When run on a CPU, this program is much faster than network.py and
network2.py. However, unlike network.py and network2.py it can also
be run on a GPU, which makes it faster still.
Because the code is based on Theano, the code is different in many
ways from network.py and network2.py. However, where possible I have
tried to maintain consistency with the earlier programs. In
particular, the API is similar to network2.py. Note that I have
focused on making the code simple, easily readable, and easily
modifiable. It is not optimized, and omits many desirable features.
This program incorporates ideas from the Theano documentation on
convolutional neural nets (notably,
http://deeplearning.net/tutorial/lenet.html ), from Misha Denil's
implementation of dropout (https://github.com/mdenil/dropout ), and
from Chris Olah (http://colah.github.io ).
"""
#### Libraries
# Standard library
import cPickle
import gzip
# Third-party libraries
import numpy as np
import theano
import theano.tensor as T
from theano.tensor.nnet import conv
from theano.tensor.nnet import softmax
from theano.tensor import shared_randomstreams
from theano.tensor.signal import downsample
# Activation functions for neurons
def linear(z): return z
def ReLU(z): return T.maximum(0.0, z)
from theano.tensor.nnet import sigmoid
from theano.tensor import tanh
#### Constants
GPU = True
if GPU:
print "Trying to run under a GPU. If this is not desired, then modify "+\
"network3.py\nto set the GPU flag to False."
try: theano.config.device = 'gpu'
except: pass # it's already set
theano.config.floatX = 'float32'
else:
print "Running with a CPU. If this is not desired, then the modify "+\
"network3.py to set\nthe GPU flag to True."
#### Load the MNIST data
def load_data_shared(filename="../data/mnist.pkl.gz"):
f = gzip.open(filename, 'rb')
training_data, validation_data, test_data = cPickle.load(f)
f.close()
def shared(data):
"""Place the data into shared variables. This allows Theano to copy
the data to the GPU, if one is available.
"""
shared_x = theano.shared(
np.asarray(data[0], dtype=theano.config.floatX), borrow=True)
shared_y = theano.shared(
np.asarray(data[1], dtype=theano.config.floatX), borrow=True)
return shared_x, T.cast(shared_y, "int32")
return [shared(training_data), shared(validation_data), shared(test_data)]
#### Main class used to construct and train networks
class Network(object):
def __init__(self, layers, mini_batch_size):
"""Takes a list of `layers`, describing the network architecture, and
a value for the `mini_batch_size` to be used during training
by stochastic gradient descent.
"""
self.layers = layers
self.mini_batch_size = mini_batch_size
self.params = [param for layer in self.layers for param in layer.params]
self.x = T.matrix("x")
self.y = T.ivector("y")
init_layer = self.layers[0]
init_layer.set_inpt(self.x, self.x, self.mini_batch_size)
for j in xrange(1, len(self.layers)):
prev_layer, layer = self.layers[j-1], self.layers[j]
layer.set_inpt(
prev_layer.output, prev_layer.output_dropout, self.mini_batch_size)
self.output = self.layers[-1].output
self.output_dropout = self.layers[-1].output_dropout
def SGD(self, training_data, epochs, mini_batch_size, eta,
validation_data, test_data, lmbda=0.0):
"""Train the network using mini-batch stochastic gradient descent."""
training_x, training_y = training_data
validation_x, validation_y = validation_data
test_x, test_y = test_data
# compute number of minibatches for training, validation and testing
num_training_batches = size(training_data)/mini_batch_size
num_validation_batches = size(validation_data)/mini_batch_size
num_test_batches = size(test_data)/mini_batch_size
# define the (regularized) cost function, symbolic gradients, and updates
l2_norm_squared = sum([(layer.w**2).sum() for layer in self.layers])
cost = self.layers[-1].cost(self)+\
0.5*lmbda*l2_norm_squared/num_training_batches
grads = T.grad(cost, self.params)
updates = [(param, param-eta*grad)
for param, grad in zip(self.params, grads)]
# define functions to train a mini-batch, and to compute the
# accuracy in validation and test mini-batches.
i = T.lscalar() # mini-batch index
train_mb = theano.function(
[i], cost, updates=updates,
givens={
self.x:
training_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
self.y:
training_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
})
validate_mb_accuracy = theano.function(
[i], self.layers[-1].accuracy(self.y),
givens={
self.x:
validation_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
self.y:
validation_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
})
test_mb_accuracy = theano.function(
[i], self.layers[-1].accuracy(self.y),
givens={
self.x:
test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
self.y:
test_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
})
self.test_mb_predictions = theano.function(
[i], self.layers[-1].y_out,
givens={
self.x:
test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
})
# Do the actual training
best_validation_accuracy = 0.0
for epoch in xrange(epochs):
for minibatch_index in xrange(num_training_batches):
iteration = num_training_batches*epoch+minibatch_index
if iteration % 1000 == 0:
print("Training mini-batch number {0}".format(iteration))
cost_ij = train_mb(minibatch_index)
if (iteration+1) % num_training_batches == 0:
validation_accuracy = np.mean(
[validate_mb_accuracy(j) for j in xrange(num_validation_batches)])
print("Epoch {0}: validation accuracy {1:.2%}".format(
epoch, validation_accuracy))
if validation_accuracy >= best_validation_accuracy:
print("This is the best validation accuracy to date.")
best_validation_accuracy = validation_accuracy
best_iteration = iteration
if test_data:
test_accuracy = np.mean(
[test_mb_accuracy(j) for j in xrange(num_test_batches)])
print('The corresponding test accuracy is {0:.2%}'.format(
test_accuracy))
print("Finished training network.")
print("Best validation accuracy of {0:.2%} obtained at iteration {1}".format(
best_validation_accuracy, best_iteration))
print("Corresponding test accuracy of {0:.2%}".format(test_accuracy))
#### Define layer types
class ConvPoolLayer(object):
"""Used to create a combination of a convolutional and a max-pooling
layer. A more sophisticated implementation would separate the
two, but for our purposes we'll always use them together, and it
simplifies the code, so it makes sense to combine them.
"""
def __init__(self, filter_shape, image_shape, poolsize=(2, 2),
activation_fn=sigmoid):
"""`filter_shape` is a tuple of length 4, whose entries are the number
of filters, the number of input feature maps, the filter height, and the
filter width.
`image_shape` is a tuple of length 4, whose entries are the
mini-batch size, the number of input feature maps, the image
height, and the image width.
`poolsize` is a tuple of length 2, whose entries are the y and
x pooling sizes.
"""
self.filter_shape = filter_shape
self.image_shape = image_shape
self.poolsize = poolsize
self.activation_fn=activation_fn
# initialize weights and biases
n_out = (filter_shape[0]*np.prod(filter_shape[2:])/np.prod(poolsize))
self.w = theano.shared(
np.asarray(
np.random.normal(loc=0, scale=np.sqrt(1.0/n_out), size=filter_shape),
dtype=theano.config.floatX),
borrow=True)
self.b = theano.shared(
np.asarray(
np.random.normal(loc=0, scale=1.0, size=(filter_shape[0],)),
dtype=theano.config.floatX),
borrow=True)
self.params = [self.w, self.b]
def set_inpt(self, inpt, inpt_dropout, mini_batch_size):
self.inpt = inpt.reshape(self.image_shape)
conv_out = conv.conv2d(
input=self.inpt, filters=self.w, filter_shape=self.filter_shape,
image_shape=self.image_shape)
pooled_out = downsample.max_pool_2d(
input=conv_out, ds=self.poolsize, ignore_border=True)
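        # dimshuffle broadcasts the per-feature-map bias over the mini-batch and spatial dimensions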
self.output = self.activation_fn(
pooled_out + self.b.dimshuffle('x', 0, 'x', 'x'))
self.output_dropout = self.output # no dropout in the convolutional layers
class FullyConnectedLayer(object):
def __init__(self, n_in, n_out, activation_fn=sigmoid, p_dropout=0.0):
self.n_in = n_in
self.n_out = n_out
self.activation_fn = activation_fn
self.p_dropout = p_dropout
# Initialize weights and biases
self.w = theano.shared(
np.asarray(
np.random.normal(
loc=0.0, scale=np.sqrt(1.0/n_out), size=(n_in, n_out)),
dtype=theano.config.floatX),
name='w', borrow=True)
self.b = theano.shared(
np.asarray(np.random.normal(loc=0.0, scale=1.0, size=(n_out,)),
dtype=theano.config.floatX),
name='b', borrow=True)
self.params = [self.w, self.b]
def set_inpt(self, inpt, inpt_dropout, mini_batch_size):
self.inpt = inpt.reshape((mini_batch_size, self.n_in))
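        # Scale by (1 - p_dropout) at evaluation time so expected activations match dropout training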
self.output = self.activation_fn(
(1-self.p_dropout)*T.dot(self.inpt, self.w) + self.b)
self.y_out = T.argmax(self.output, axis=1)
self.inpt_dropout = dropout_layer(
inpt_dropout.reshape((mini_batch_size, self.n_in)), self.p_dropout)
self.output_dropout = self.activation_fn(
T.dot(self.inpt_dropout, self.w) + self.b)
def accuracy(self, y):
"Return the accuracy for the mini-batch."
return T.mean(T.eq(y, self.y_out))
class SoftmaxLayer(object):
def __init__(self, n_in, n_out, p_dropout=0.0):
self.n_in = n_in
self.n_out = n_out
self.p_dropout = p_dropout
# Initialize weights and biases
self.w = theano.shared(
np.zeros((n_in, n_out), dtype=theano.config.floatX),
name='w', borrow=True)
self.b = theano.shared(
np.zeros((n_out,), dtype=theano.config.floatX),
name='b', borrow=True)
self.params = [self.w, self.b]
def set_inpt(self, inpt, inpt_dropout, mini_batch_size):
self.inpt = inpt.reshape((mini_batch_size, self.n_in))
self.output = softmax((1-self.p_dropout)*T.dot(self.inpt, self.w) + self.b)
self.y_out = T.argmax(self.output, axis=1)
self.inpt_dropout = dropout_layer(
inpt_dropout.reshape((mini_batch_size, self.n_in)), self.p_dropout)
self.output_dropout = softmax(T.dot(self.inpt_dropout, self.w) + self.b)
def cost(self, net):
"Return the log-likelihood cost."
return -T.mean(T.log(self.output_dropout)[T.arange(net.y.shape[0]), net.y])
def accuracy(self, y):
"Return the accuracy for the mini-batch."
return T.mean(T.eq(y, self.y_out))
#### Miscellanea
def size(data):
"Return the size of the dataset `data`."
return data[0].get_value(borrow=True).shape[0]
def dropout_layer(layer, p_dropout):
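    # Zero each activation independently with probability p_dropout via a binomial mask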
srng = shared_randomstreams.RandomStreams(
np.random.RandomState(0).randint(999999))
mask = srng.binomial(n=1, p=1-p_dropout, size=layer.shape)
return layer*T.cast(mask, theano.config.floatX)
Recent progress in image recognition
This part surveys some image-recognition research from the previous couple of years.
Other approaches to deep neural nets
This part briefly introduces RNNs, LSTMs, deep belief nets, generative models, and Boltzmann machines.
On the future of neural networks
This part looks ahead to the future of neural networks.
With that, after a little over a month I have finally finished the book. It felt a bit inefficient; probably I have been too comfortable lately.
While reading I focused mainly on the theory and only skimmed most of the code. Next I plan to collect the papers referenced in the book, and then look for material for the next stage of study.