Lecture 8: Deep Learning Software
Lecture 9: CNN Architectures
AlexNet

VGGNet

GoogLeNet
- 22 total layers with weights (including each parallel layer in an Inception module)
- “Inception module”: design a good local network topology (“network within a network”) and then stack these modules on top of each other
Naive Inception module:
The naive Inception module is too computationally expensive.
Apply parallel filter operations on the input from the previous layer:
- Multiple receptive field sizes for convolution (1x1, 3x3, 5x5)
- Pooling operation (3x3)
Concatenate all filter outputs together depth-wise
What is the problem with this?
Computational complexity
Solution: “bottleneck” layers that use 1x1 convolutions to reduce feature depth
- Preserves spatial dimensions, reduces depth!
- Projects depth to a lower dimension (each output is a combination of the input feature maps, e.g. projecting down to 32 feature maps)
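As a hedged illustration, here is a PyTorch-style sketch of an Inception module with 1x1 bottlenecks; the branch channel counts are made up for illustration and are not GoogLeNet's exact configuration:

    import torch
    import torch.nn as nn

    class InceptionModule(nn.Module):
        """Inception module with 1x1 "bottleneck" convolutions.
        Channel counts are illustrative, not GoogLeNet's exact numbers."""
        def __init__(self, in_ch):
            super().__init__()
            # 1x1 branch
            self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)
            # 1x1 bottleneck (reduce depth) -> 3x3 conv
            self.branch3 = nn.Sequential(
                nn.Conv2d(in_ch, 32, kernel_size=1),
                nn.Conv2d(32, 64, kernel_size=3, padding=1),
            )
            # 1x1 bottleneck (reduce depth) -> 5x5 conv
            self.branch5 = nn.Sequential(
                nn.Conv2d(in_ch, 16, kernel_size=1),
                nn.Conv2d(16, 32, kernel_size=5, padding=2),
            )
            # 3x3 max-pool -> 1x1 projection
            self.branch_pool = nn.Sequential(
                nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
                nn.Conv2d(in_ch, 32, kernel_size=1),
            )

        def forward(self, x):
            # every branch preserves spatial size; concatenate depth-wise
            return torch.cat([self.branch1(x), self.branch3(x),
                              self.branch5(x), self.branch_pool(x)], dim=1)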
ResNet
What happens when we keep stacking deeper layers on a “plain” CNN?
56-layer model performs worse on both training and test error
-> The deeper model performs worse, but it’s not caused by overfitting!
The deeper network does worse on both the training and test sets, so overfitting is not the cause.
The problem lies in optimization: deeper networks are harder to optimize.
- The deeper model should be able to perform at least as well as the shallower model.
- A solution by construction is copying the learned layers from the shallower model and setting the additional layers to identity mappings; see the residual block sketch below.
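A minimal PyTorch-style sketch of the basic residual block that realizes this, with illustrative channel counts (downsampling/projection shortcuts omitted):

    import torch.nn as nn
    import torch.nn.functional as F

    class BasicBlock(nn.Module):
        """Residual block: output = F(x) + x.
        If the conv layers learn F(x) = 0, the block is an identity mapping,
        so a deeper model can do no worse than a shallower one."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)

        def forward(self, x):
            out = F.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return F.relu(out + x)   # skip connection adds the identity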
Training ResNet in practice (see the sketch after this list):
- Batch Normalization after every CONV layer
- Xavier/2 initialization from He et al.
- SGD + Momentum (0.9)
- Learning rate: 0.1, divided by 10 when validation error plateaus
- Mini-batch size 256
- Weight decay of 1e-5
- No dropout used
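A hedged sketch wiring these hyperparameters into a PyTorch training setup; `model`, `train_one_epoch`, `evaluate`, and `num_epochs` are hypothetical placeholders:

    import torch

    # "Xavier/2" init from He et al. corresponds to Kaiming initialization
    for m in model.modules():
        if isinstance(m, torch.nn.Conv2d):
            torch.nn.init.kaiming_normal_(m.weight)

    optimizer = torch.optim.SGD(model.parameters(),
                                lr=0.1,            # initial learning rate
                                momentum=0.9,
                                weight_decay=1e-5)
    # divide the learning rate by 10 when validation error plateaus
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='min', factor=0.1)

    for epoch in range(num_epochs):
        train_one_epoch(model, optimizer)   # mini-batch size 256
        val_error = evaluate(model)
        scheduler.step(val_error)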
Lecture 10: Recurrent Neural Networks
RNN
LSTM
Lecture 11: Detection and Segmentation
R-CNN (object detection)
Lecture 12: Visualizing and Understanding
Lecture 13: Generative Models
Generative models are a form of unsupervised learning (no external labels needed).
Given training data, generate new samples from same distribution
Want to learn p_model(x) similar to p_data(x)
- PixelRNN/CNN — Explicit density estimation: explicitly define and solve for p_model(x)
- Variational Autoencoders (VAE) — Explicit density with an intractable likelihood: optimize a lower bound on p_model(x)
- GAN — Implicit density estimation: learn a model that can sample from p_model(x) without explicitly defining it

PixelRNN and PixelCNN
Variational Autoencoders (VAE)
Background: autoencoders. After training, throw away the decoder; the encoder's features can initialize a supervised model.
Autoencoders can reconstruct data, and learn features that capture factors of variation in the training data. Can we generate new images from an autoencoder?

We want to estimate the true parameters of this generative model.
How should we represent this model?
Choose prior p(z) to be simple, e.g. Gaussian.
Reasonable for latent attributes, e.g. pose, how much smile.
Conditional p(x|z) is complex (generates image) => represent with neural network
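A minimal PyTorch-style sketch of this representation, assuming MNIST-sized images; the layer sizes and names are illustrative:

    import torch
    import torch.nn as nn

    latent_dim, hidden_dim, img_dim = 20, 400, 784   # illustrative sizes (e.g. MNIST)

    # prior p(z): simple unit Gaussian over latent attributes (pose, smile, ...)
    prior = torch.distributions.Normal(torch.zeros(latent_dim),
                                       torch.ones(latent_dim))

    # conditional p(x|z) is complex => represent it with a neural network (decoder)
    decoder = nn.Sequential(
        nn.Linear(latent_dim, hidden_dim),
        nn.ReLU(),
        nn.Linear(hidden_dim, img_dim),
        nn.Sigmoid(),        # per-pixel Bernoulli means
    )

    z = prior.sample()       # sample a latent code
    x_mean = decoder(z)      # parameters of p(x|z)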

Let's look at computing the bound (forward pass) for a given minibatch of input data
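A sketch of that forward pass, continuing the PyTorch-style setup above and assuming an `encoder` network that outputs the mean and log-variance of q(z|x):

    import torch
    import torch.nn.functional as F

    def elbo(x, encoder, decoder):
        """Lower bound L = E_q[log p(x|z)] - KL(q(z|x) || p(z)) for a minibatch x."""
        # q(z|x): encoder outputs mean and log-variance of a diagonal Gaussian
        mu, logvar = encoder(x)
        # reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # reconstruction term E_q[log p(x|z)] (per-pixel Bernoulli likelihood)
        x_mean = decoder(z)
        recon = -F.binary_cross_entropy(x_mean, x, reduction='sum')
        # KL between two Gaussians has a closed form
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon - kl    # maximize this bound during training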

Diagonal prior on z => independent latent variables.
Different dimensions of z encode interpretable factors of variation.
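After training, generation uses only the decoder: sample z from the prior and decode. A sketch (reusing the illustrative `decoder` and `latent_dim` above) of sweeping one dimension of z to visualize the factor it encodes:

    import torch

    z = torch.randn(1, latent_dim)             # sample a latent code from the prior
    for val in torch.linspace(-3, 3, steps=7):
        z_sweep = z.clone()
        z_sweep[0, 0] = val                    # vary one dimension, hold the rest fixed
        img = decoder(z_sweep)                 # each dimension can encode one factor
                                               # of variation (e.g. smile, pose)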

Generative Adversarial Networks (GANs)
GANs: don’t work with any explicit density function!
Instead, take game-theoretic approach: learn to generate from training distribution through a 2-player game
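Concretely, the 2-player game is the minimax objective from the original GAN paper:

    min_G max_D  E_{x ~ p_data}[ log D(x) ] + E_{z ~ p(z)}[ log(1 - D(G(z))) ]

The discriminator D is trained to output 1 on real data and 0 on generated samples; the generator G is trained to fool D.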

Aside: Jointly training two networks is challenging, can be unstable. Choosing objectives with better loss landscapes helps training, is an active area of research.

Structurally, GANs are clever and simple (despite some controversy over similarity to ideas in earlier classic work [6, 7]), and very easy to understand.
The whole model has only two components: 1. the generator G; 2. the discriminator D.
Generative models have a long history, so the generator itself is nothing new.
The generator G's goal is to produce a distribution of fake samples as close as possible to the real samples.
Previously, without a discriminator D, the generator was trained by measuring, at each iteration, the difference between the current generated samples and the real samples (turning that difference into a loss) and optimizing the parameters.
The discriminator D changes this: D's goal is to distinguish generated samples from real samples as accurately as possible,
and the generator G's training objective changes from minimizing the "generated-vs-real difference" to weakening the discriminator D's ability to tell them apart (the training objective now contains the discriminator D's output).
The overall framework of the GAN model is shown in the figure below:
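A minimal alternating-training sketch in PyTorch style; `G`, `D`, `opt_G`, `opt_D`, `loader`, and `noise_dim` are assumed to be defined elsewhere. The generator uses the common non-saturating loss (maximize log D(G(z))), which gives a better loss landscape early in training than the raw minimax form mentioned in the aside above:

    import torch
    import torch.nn.functional as F

    # G, D, opt_G, opt_D, loader and noise_dim are assumed/hypothetical
    for real in loader:                        # minibatch of real samples
        n = real.size(0)
        z = torch.randn(n, noise_dim)
        fake = G(z)

        # 1) discriminator step: push D(real) -> 1 and D(fake) -> 0
        d_loss = F.binary_cross_entropy(D(real), torch.ones(n, 1)) \
               + F.binary_cross_entropy(D(fake.detach()), torch.zeros(n, 1))
        opt_D.zero_grad(); d_loss.backward(); opt_D.step()

        # 2) generator step (non-saturating): push D(G(z)) -> 1 to fool D
        g_loss = F.binary_cross_entropy(D(fake), torch.ones(n, 1))
        opt_G.zero_grad(); g_loss.backward(); opt_G.step()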

Summary:
