Lecture I: Introduction to Deep Learning
Three Steps for Deep Learning
- Define a set of functions (Neural Network)
- Goodness of a function
- Pick the best function
Softmax layer as the output layer.
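A minimal NumPy sketch of what a softmax output layer computes (the max subtraction is a standard numerical-stability trick, not something from the notes):

```python
import numpy as np

def softmax(z):
    """Turn raw output scores z into probabilities that sum to 1."""
    z = z - np.max(z)            # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

# Example: three class scores -> a probability distribution over classes
print(softmax(np.array([3.0, 1.0, 0.2])))
```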
FAQ: how many layers? How many neurons for each layer?
Trial and Error + Intuition
Gradient Descent
- Pick an initial value for w
  - random (good enough)
  - RBM pre-training
- Compute ∂L/∂w
- Update: w ← w − η ∂L/∂w, where η is called the "learning rate"
- Repeat until ∂L/∂w is sufficiently small
But gradient descent never guarantees a global minimum.
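A minimal sketch of the update rule above on a toy one-dimensional loss (the loss L(w) = (w − 3)² and the learning rate are illustrative, not from the lecture):

```python
import numpy as np

def grad(w):
    """Gradient of the toy loss L(w) = (w - 3)^2, so dL/dw = 2 * (w - 3)."""
    return 2.0 * (w - 3.0)

w = np.random.randn()     # pick an initial value for w (random is good enough here)
eta = 0.1                 # learning rate

for step in range(100):
    g = grad(w)           # compute dL/dw
    w = w - eta * g       # w <- w - eta * dL/dw
    if abs(g) < 1e-6:     # stop once the gradient is sufficiently small
        break

print(w)  # converges to a minimum (w = 3 here); in general only a local one
```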
Modularization
Deep → Modularization
Each basic classifier can have sufficient training examples
Shared by the following classifiers as modules
The modularization is automatically learned from the data
Lecture II: Tips for Training DNN
Do not always blame overfitting
- Reason for overfitting: Training data and testing data can be different
- Panacea for Overfitting: Have more training data or Create more training data
Different approaches for different problems
Choosing a proper loss
- Square error (mse): ∑ᵢ (yᵢ − ŷᵢ)²
- Cross entropy (categorical_crossentropy): −∑ᵢ ŷᵢ ln yᵢ
- When using a softmax output layer, choose cross entropy
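A small NumPy sketch of both losses for a single one-hot example, with ŷ as the target and y as the softmax output, matching the formulas above (the values are made up):

```python
import numpy as np

y_hat = np.array([0.0, 1.0, 0.0])           # one-hot target
y = np.array([0.1, 0.7, 0.2])               # softmax output of the network

mse = np.sum((y - y_hat) ** 2)               # square error
cross_entropy = -np.sum(y_hat * np.log(y))   # cross entropy (pairs well with softmax)

print(mse, cross_entropy)
```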
Mini-batch
- Mini-batch is faster
- Randomly initialize network parameters
- Pick the 1st batch, update parameters once
- Pick the 2nd batch, update parameters once
- …
- Until all mini-batches have been picked (one epoch finished)
- Repeat the above process for more epochs (see the sketch below)
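A sketch of one possible mini-batch loop, assuming placeholder data and a placeholder update_parameters step (neither is defined in the notes):

```python
import numpy as np

# Placeholder data: 1000 examples with 20 features each (illustrative only)
X = np.random.randn(1000, 20)
Y = np.random.randn(1000, 1)

batch_size = 32
num_epochs = 5

for epoch in range(num_epochs):                    # repeat the whole process
    order = np.random.permutation(len(X))          # shuffle the examples each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        x_batch, y_batch = X[idx], Y[idx]          # pick the next mini-batch
        # update_parameters(x_batch, y_batch)      # one gradient update per batch (placeholder)
    # all mini-batches picked -> one epoch finished
```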
New activation functions
- Vanishing Gradient Problem
  - Older remedy: RBM pre-training
- Rectified Linear Unit (ReLU)
  - Fast to compute
  - Biological reason
  - Equivalent to infinitely many sigmoids with different biases
  - The network becomes a thinner linear network
  - A special case of Maxout
  - Handles the vanishing gradient problem
- ReLU variants (e.g., Leaky ReLU), sketched below
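A NumPy sketch of ReLU and one common variant (Leaky ReLU); the slope value alpha is an illustrative choice:

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: max(0, z). Cheap to compute and does not saturate for z > 0."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """A ReLU variant: keeps a small slope alpha for z < 0 so the gradient is never exactly zero."""
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))
print(leaky_relu(z))
```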
Adaptive Learning Rate
Popular & Simple Idea: Reduce the learning rate by some factor every few epochs
- η_t = η / √(t+1)
Adagrad
- Original: w ← w − η ∂L/∂w
- Adagrad: w ← w − η_w ∂L/∂w, where η_w = η / √(∑_{i=0}^{t} (g_i)²)
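A scalar sketch of the Adagrad update above, reusing the toy loss L(w) = (w − 3)²; the small epsilon is a standard safeguard against division by zero, not part of the formula in the notes:

```python
import numpy as np

eta = 0.1        # base learning rate
w = 0.0          # a single parameter, for simplicity
sum_g2 = 0.0     # running sum of squared gradients: sum_{i=0}^{t} (g_i)^2

def grad(w):
    return 2.0 * (w - 3.0)   # gradient of the toy loss L(w) = (w - 3)^2

for t in range(100):
    g = grad(w)
    sum_g2 += g ** 2
    w -= eta / (np.sqrt(sum_g2) + 1e-8) * g   # per-parameter learning rate shrinks over time

print(w)
```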
Momentum
- Movement = negative of ∂L/∂w + momentum (the movement of the last step)
- Adam = RMSProp (Advanced Adagrad) + Momentum
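A scalar sketch of the momentum idea (the coefficient 0.9 is a common but illustrative choice); Adam combines a similar momentum term with an RMSProp-style adaptive learning rate:

```python
eta = 0.1          # learning rate
beta = 0.9         # momentum coefficient (illustrative)
w = 0.0
movement = 0.0     # movement of the last step

def grad(w):
    return 2.0 * (w - 3.0)   # gradient of the toy loss L(w) = (w - 3)^2

for step in range(100):
    movement = beta * movement - eta * grad(w)   # negative gradient + momentum
    w += movement

print(w)
```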
Early Stopping
Weight Decay
- Original: w ← w − η ∂L/∂w
- Weight decay: w ← 0.99w − η ∂L/∂w
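A tiny sketch of the weight-decay update above; note that even with a zero gradient the weight keeps shrinking toward zero:

```python
def weight_decay_update(w, grad_w, eta=0.01, decay=0.99):
    """Shrink w slightly toward zero, then apply the usual gradient step."""
    return decay * w - eta * grad_w

# Example: a weight with zero gradient still decays toward zero
w = 1.0
for _ in range(10):
    w = weight_decay_update(w, grad_w=0.0)
print(w)   # about 0.99**10 ≈ 0.904
```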
Dropout
- Training:
- Each neuron has a p% chance to drop out
- The structure of the network is changed.
- Using the new network for training
- Testing:
- If the dropout rate at training is p%, multiply all the weights by (1 − p%)
- Dropout is a kind of ensemble
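A NumPy sketch of the train/test asymmetry described above (most libraries instead use the equivalent "inverted dropout", which rescales at training time):

```python
import numpy as np

def dropout_train(a, p):
    """Training: drop each neuron's activation with probability p; the network structure changes."""
    mask = (np.random.rand(*a.shape) >= p).astype(a.dtype)
    return a * mask

def dropout_test(w, p):
    """Testing: keep every neuron but multiply the weights by (1 - p)."""
    return w * (1.0 - p)

a = np.ones(8)
print(dropout_train(a, p=0.5))   # roughly half the activations are zeroed
print(dropout_test(a, p=0.5))    # all values scaled by 0.5
```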
Lecture III: Variants of Neural Networks
Convolutional Neural Network (CNN)
- The convolution layer is not fully connected
- The convolution layer shares weights
- Learning: gradient descent
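A hedged Keras sketch of a small CNN (input shape, filter counts, and layer sizes are illustrative); the Conv2D layer provides the local connectivity and weight sharing, and the whole model is trained by gradient descent like any other network:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),                       # e.g., grayscale images
    layers.Conv2D(16, kernel_size=3, activation='relu'),   # local, weight-sharing filters
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),                # softmax output layer
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
```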
Recurrent Neural Network (RNN)
Long Short-term Memory (LSTM)
- Gated Recurrent Unit (GRU): simpler than LSTM
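A hedged Keras sketch of a recurrent model (sequence length, feature size, and units are illustrative); swapping the LSTM layer for a GRU layer gives the simpler variant:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(50, 32)),            # 50 time steps, 32 features per step
    layers.LSTM(64),                         # replace with layers.GRU(64) for the simpler GRU
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
```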
Lecture IV: Next Wave
Supervised Learning
Ultra Deep Network
Worry about training first!
These ultra deep networks have special structure
An ultra deep network is effectively an ensemble of many networks with different depths
Ensemble: 6 layers, 4 layers, or 2 layers
FractalNet
Residual Network
Highway Network
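A hedged Keras sketch of a residual block (layer sizes are illustrative); the skip connection y = F(x) + x is what lets an ultra deep network behave like an ensemble of shallower paths:

```python
from tensorflow import keras
from tensorflow.keras import layers

def residual_block(x, units):
    """Compute F(x) with two Dense layers and add the input back: y = F(x) + x."""
    h = layers.Dense(units, activation='relu')(x)
    h = layers.Dense(units)(h)
    return layers.Activation('relu')(layers.Add()([h, x]))

inputs = keras.Input(shape=(64,))
x = residual_block(inputs, 64)     # units must match the input width so the addition works
x = residual_block(x, 64)
outputs = layers.Dense(10, activation='softmax')(x)
model = keras.Model(inputs, outputs)
```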
Attention Model
Attention-based Model
Attention-based Model v2
Reinforcement Learning
- Agent learns to take actions to maximize expected reward.
- Difficulties of Reinforcement Learning
- It may be better to sacrifice immediate reward to gain more long-term reward
- Agent’s actions affect the subsequent data it receives
Unsupervised Learning
- Image: Realizing what the World Looks Like
- Text: Understanding the Meaning of Words
- Audio: Learning human language without supervision