4. Neural Networks and Deep Learning
4.1 Recap of neural networks
Smallest model: if at least two of Employed, Owns House, and Balance Account are yes → Get Credit is yes
Given that we represent yes and no by 1 and 0, this becomes: if (Employed + Owns House + Balance Account) > 1.5 → Get Credit is yes
The model is an assembly of interconnected nodes and weighted links
The output node sums up its input values, weighted by its links
The output is compared against some threshold t
This is simple for a single-layer perceptron. But for a multi-layer network, the desired output Yi of the hidden neurons is not known.
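As a minimal illustration, the threshold rule above can be written as a tiny perceptron-style function; the attribute values, unit weights, and threshold below are purely illustrative.

```python
# Minimal sketch of the credit example as a single threshold unit.
# Attribute values (1 = yes, 0 = no), weights, and the threshold are illustrative.

def perceptron(inputs, weights, threshold):
    """Return 1 (Get Credit = yes) if the weighted sum of the inputs exceeds the threshold."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum > threshold else 0

# Employed = 1, Owns House = 1, Balance Account = 0  ->  1 + 1 + 0 = 2 > 1.5  ->  yes
print(perceptron([1, 1, 0], weights=[1, 1, 1], threshold=1.5))  # prints 1
```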
4.2 The backpropagation algorithm
Sketch of the Backpropagation Algorithm
- Present an example to the ANN
- Compute the error at the output layer
- Distribute the error to the hidden layer according to the weights, i.e., according to the contribution of the previous neurons to the result
- Adjust the weights so that the error is minimized (adjustment factor: learning rate; use gradient descent)
- Repeat until the input layer is reached
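A toy sketch of this procedure in plain numpy, hedged: the XOR data set, the single hidden layer of sigmoid units, and the chosen learning rate are illustrative and not part of the original example.

```python
import numpy as np

# Toy backpropagation for one hidden layer of sigmoid units, learning XOR.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
lr = 1.0                                      # learning rate (adjustment factor)

for _ in range(20000):
    # feed-forward: push the prediction through the network
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # error at the output layer, then distributed to the hidden layer
    # according to the weights of the links (chain rule)
    out_err = (out - y) * out * (1 - out)
    h_err = (out_err @ W2.T) * h * (1 - h)

    # gradient descent: adjust the weights so that the error is minimized
    W2 -= lr * h.T @ out_err
    b2 -= lr * out_err.sum(axis=0)
    W1 -= lr * X.T @ h_err
    b1 -= lr * h_err.sum(axis=0)

print(out.round(2))  # typically close to [[0], [1], [1], [0]]
```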
4.2.1 Training in Batches
In theory, one could present the examples one after the other
In practice: use batches
- Neural network gets to see a number of examples (batch)
- All examples in the batch are predicted by the network
- Errors are accumulated
- Weights in the neural network are adapted (backpropagated) after predicting all examples in the batch
Pros: fewer model updates, faster convergence
Cons: may not find global optimum, harder to handle imbalanced datasets
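A minimal sketch of batch training with Keras (assuming TensorFlow is installed); the random data, layer sizes, and batch size are illustrative. The batch_size argument controls how many examples are predicted before the accumulated error is backpropagated.

```python
import numpy as np
from tensorflow import keras

X = np.random.rand(1000, 20)                                 # illustrative data
y = (X.sum(axis=1) > 10).astype("float32").reshape(-1, 1)    # illustrative binary target

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])

# weights are updated once per batch of 32 examples, not after every single example
model.fit(X, y, batch_size=32, epochs=5)
```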
4.2.2 Backpropagation Algorithm
Important notions:
Predictions are pushed forward through the network (“feed-forward neural network”)
Errors are pushed backwards through the network (“backpropagation”)
Properties of ANNs and Backpropagation
Non-linear activation function: May approximate any arbitrary function, even with one hidden layer
Convergence
Convergence may take time; a higher learning rate gives faster convergence
Gradient Descent Strategy
Danger of ending up in a local optimum: use momentum to prevent getting stuck
A lower learning rate increases the probability of finding the global optimum
4.2.3 Learning Rate, Momentum, and Local Minima
Learning rate: how much do we adapt the weights with each step
0: no adaptation, keep the previous weights
1: forget everything learned so far, use only the weights that are best for the current example
Lower: slower convergence, less overfitting
Higher: faster convergence, more overfitting
Momentum: how much do we keep the direction of the previous weight adaptation
Small: allows changes of direction soon
High: keeps changing in the same direction for longer
Lower: more precise convergence, but may get stuck in a local minimum
Higher: less precise convergence, but does not get stuck as easily
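The following toy sketch illustrates the effect of momentum on a one-dimensional function with a local and a global minimum; the function, learning rate, and momentum value are illustrative.

```python
# Gradient descent with momentum on f(w) = w**4 - 3*w**2 + w, which has a
# local minimum near w = 1.13 and its global minimum near w = -1.30.
def grad(w):
    return 4 * w**3 - 6 * w + 1           # derivative of f

w, velocity = 2.0, 0.0
lr, momentum = 0.01, 0.9                  # learning rate and momentum

for _ in range(200):
    velocity = momentum * velocity - lr * grad(w)  # keep part of the previous direction
    w = w + velocity

print(round(w, 3))  # with momentum, the update rolls past the local minimum and
                    # settles near the global minimum at w ≈ -1.3; with momentum = 0,
                    # plain gradient descent gets stuck near w ≈ 1.13
```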
Local Learning Rates
Observation: not all parameters change equally often, e.g., in text classification, the input neuron weights for infrequent words are rarely updated
AdaGrad (Duchi et al., 2011)
maintain a history of gradient changes for each parameter
adapt the learning rate locally (per parameter)
AdaDelta (Zeiler, 2012)
restricts the accumulation of past gradients to a fixed window per parameter
Bottom line: optimization functions often have a large impact.
Optimization functions adjust the parameters of a neural network in order to find values that minimize the loss function.
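A hedged sketch of the AdaGrad idea in plain numpy: each parameter gets its own effective learning rate, derived from its accumulated gradient history; all concrete values below are illustrative.

```python
import numpy as np

def adagrad_step(w, grad, grad_sq_sum, lr=0.1, eps=1e-8):
    grad_sq_sum += grad ** 2                         # per-parameter gradient history
    w -= lr * grad / (np.sqrt(grad_sq_sum) + eps)    # local (per-parameter) learning rate
    return w, grad_sq_sum

w = np.array([1.0, 1.0])                  # two parameters of a toy quadratic loss
acc = np.zeros_like(w)

for t in range(100):
    grad = np.array([2 * w[0], 0.0])      # parameter 1 receives a gradient every step ...
    if t % 10 == 0:
        grad[1] = 2 * w[1]                # ... parameter 2 only rarely (think: infrequent word)
    w, acc = adagrad_step(w, grad, acc)

# the rarely updated parameter keeps a larger effective learning rate,
# because its accumulated gradient history is smaller
print(w.round(3), acc.round(3))
```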
ANNs vs. SVMs
ANNs have arbitrary decision boundaries – and keep the data as it is
SVMs have linear decision boundaries – and transform the data first
Recap: Feature Subset Selection & PCA
Idea: reduce the dimensionality of high dimensional data
Feature Subset Selection: focus on relevant attributes
PCA: create new attributes
In both cases
We assume that the data can be described with fewer variables
Without losing much information
4.3 AutoEncoders
4.3.1 Auto encoders
Auto encoders use the same example as input and output, i.e., they train a model to predict an example from itself (using fewer variables)
Similar to PCA, but while PCA provides only a linear transformation, ANNs can also create non-linear transformations
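A minimal autoencoder sketch in Keras (assuming TensorFlow is installed); the random data and layer sizes are illustrative. The same data is used as input and as output.

```python
import numpy as np
from tensorflow import keras

X = np.random.rand(1000, 50)              # illustrative data with 50 attributes

autoencoder = keras.Sequential([
    keras.Input(shape=(50,)),
    keras.layers.Dense(10, activation="relu", name="encoder"),  # compress to 10 hidden variables
    keras.layers.Dense(50, activation="sigmoid"),               # reconstruct the 50 attributes
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=10, batch_size=32)                 # predict each example from itself
```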
4.3.2 Denoising Auto Encoders
Instead of training with identical input and output, denoising auto encoders add random noise to the input and keep the output clean
Result: A model that learns to remove noise from an instance
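Continuing the autoencoder sketch above (the noise level is illustrative): only the input is corrupted, the target stays the clean example.

```python
X_noisy = X + np.random.normal(0.0, 0.1, size=X.shape)   # add random noise to the input
autoencoder.fit(X_noisy, X, epochs=10, batch_size=32)    # ... but predict the clean example
```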
4.3.3 Stacked (Denoising) Auto Encoders
Stacked Auto Encoders contain several hidden layers
Hidden layers capture more complex hidden variables and/or denoising patterns. They are often trained consecutively:
First: train an auto encoder with one hidden layer
Second: train a second one-layer neural net (first hidden layer as input & original as output)
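A hedged continuation of the sketch above: the trained encoder layer is reused to compute the hidden representation, and a second net is trained with that hidden layer as input and the original as output; layer sizes are illustrative.

```python
# reuse the trained encoder layer to compute the first hidden representation
encoder = keras.Sequential([keras.Input(shape=(50,)), autoencoder.get_layer("encoder")])
H = encoder.predict(X)                              # 10 hidden values per example

second = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dense(5, activation="relu"),       # second, smaller hidden layer
    keras.layers.Dense(50, activation="sigmoid"),   # reconstruct the original attributes
])
second.compile(optimizer="adam", loss="mse")
second.fit(H, X, epochs=10, batch_size=32)          # first hidden layer as input, original as output
```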
4.3.4 Auto Encoders for Outlier Detection
Also known as Replicator Neural Networks (Hawkins et al., 2002)
Train an autoencoder that captures the patterns in the data
Encode and decode each data point, measure deviation
The deviation serves as the outlier score
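Continuing the sketch: the reconstruction error of the trained autoencoder can serve as the outlier score.

```python
reconstructed = autoencoder.predict(X)
outlier_score = np.mean((X - reconstructed) ** 2, axis=1)   # deviation per data point
top_outliers = np.argsort(outlier_score)[-10:]              # indices of the 10 most deviating points
```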
4.3.5 From Classifiers to Feature Detectors
What does a particular neuron in the hidden layer do?
Adjusting the weights, a hidden neuron may give a strong signal for a horizontal line in the top row (ignoring everything else), a strong signal for a dark area in the top left corner, … -> hidden neurons act as detectors for vertical lines / horizontal lines / circles / …
Challenges: Positional variance, Color variance, …
4.4 Deep Learning
4.4.1 Memorizing vs. Learning
The goal of training a neural network is to learn a generalized model that also classifies unseen examples correctly
The opposite of generalization is memorization
Model learns training examples “by heart”
Lower performance on unseen examples
Indicator: performance increases on the training set while it decreases on the test set
Hidden layers define the capacity of a neural network
Roughly: how complex can the patterns be that are stored
a more complex task needs more capacity
Too low capacity: underfitting
Patterns that, e.g., separate classes have a certain complexity. We need enough hidden neuron connections to “store” those patterns
Too high capacity: overfitting
Examples can be identified by certain combinations of features. With enough hidden neuron connections, we can learn those combinations instead of generalized patterns
4.4.2 Regularization with Dropout
ANNs, and in particular deep ANNs, tend to overfit
Regularization
Randomly deactivate hidden neurons when training an example
E.g., factor α=0.4: deactivate neurons randomly with probability 0.4
Result: Learned model is more robust, less overfit
For classification: use all hidden neurons
Problem: activation levels will be higher! → Correction: multiply each output with (1 − α), i.e., the expected fraction of active neurons during training
(1 − α) = 1 − 0.4 = 0.6
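A minimal dropout sketch in Keras (assuming TensorFlow is installed); layer sizes are illustrative. Note that Keras implements "inverted" dropout: the surviving activations are already rescaled during training, so no manual correction is needed at prediction time.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(50,)),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dropout(0.4),                    # deactivate hidden neurons with probability 0.4
    keras.layers.Dense(1, activation="sigmoid"),
])
# during training, 40% of the hidden activations are randomly set to zero;
# at prediction time, all hidden neurons are used
```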
4.5 Network Architectures
4.5.1 Architectures: Convolutional Neural Networks
Special architecture for image processing
Problem: imagine a 4k resolution picture (3840x2160)
Treating each pixel as an input: ~8M input neurons. Connecting that to a hidden layer of the same size: 8M² ≈ 64 trillion weights to learn
This is hardly practical…
Solution: Convolutional neural networks
Two parts: Convolution layer + Pooling layer
Stacks of those are usually used
1. Convolution layer
Each neuron is connected to a small n x n square of the input neurons, i.e., the number of connections grows linearly, not quadratically
Use different neurons for detecting different features
They can share their weights (intuition: a horizontal line looks the same everywhere)
2. Pooling layer (aka subsampling layer)
Use only the maximum value of a neighborhood of neurons
Think: downsizing a picture
The number of neurons is divided by four with each 2x2 pooling layer
The 4K picture revisited (3840x2160):
Treating each pixel as an input: ~8M input neurons. Connecting that to a hidden layer of the same size: 8M² ≈ 64 trillion weights to learn
Number of connections (weights to be learned) in the first convolutional layer: assume each hidden neuron is connected to a 16x16 square and we learn 256 hidden features (i.e., 256 layers of convolutional neurons): 16 x 16 x 256 x 8M ≈ 524 billion weights
But: neurons for the same hidden feature share their weights. Thus, it's just 16 x 16 x 256 ≈ 65k weights
In practice, several layers are used.
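A minimal convolution + pooling stack in Keras (assuming TensorFlow is installed); the 16x16 window and the 256 feature maps mirror the calculation above, while the input resolution is reduced for the example.

```python
from tensorflow import keras

cnn = keras.Sequential([
    keras.Input(shape=(128, 128, 3)),
    keras.layers.Conv2D(256, kernel_size=(16, 16), activation="relu"),  # shared weights: 16*16*3*256 (+ 256 biases)
    keras.layers.MaxPooling2D(pool_size=(2, 2)),   # keep only the maximum of each 2x2 neighborhood
    keras.layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
    keras.layers.MaxPooling2D(pool_size=(2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),  # e.g., 10 image classes
])
cnn.summary()  # the convolutional layers need far fewer weights than a fully connected layer would
```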
Google’sGoogLeNet(Inception) - Current state of the art in image classification
Can be used as a pre-trained network
4.5.2 Possible Implications
Face Detection, Autonomous Driving, …
Using ANNs for Time Series Prediction
Long Short Term Memory Networks (LSTM)
Word2vec
word2vec is similar to an auto encoder for words
Training set: a text corpus
Training task variants:
- Continuous bag of words (CBOW): predict a word from the surrounding words
- Skip-Gram: predict the surrounding words of a word
word2vec creates an n-dimensional vector for each word
Each word becomes a point in a vector space
Properties:
Similar words are positioned close to each other
Similar relations have the same direction
Arithmetic is possible in the vector space
king − man + woman ≈ queen
This allows for finding analogies:
king:man ↔ queen:woman
knee:leg ↔ elbow:forearm
Hillary Clinton:Democrat ↔ Donald Trump:Republican
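A hedged sketch with gensim (assuming the library is installed and the pre-trained word2vec-google-news-300 vectors can be downloaded, which is a large file):

```python
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")   # pre-trained word2vec vectors (large download)

# king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# the expected top answer is 'queen'
```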
BERT
Learns a representation of words in context
Unlike word2vec, there is not just one fixed representation per word; the representation depends on the context
Larger training corpus required
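A hedged sketch with the Hugging Face transformers library (assuming it and PyTorch are installed); the model name bert-base-uncased and the example sentences are illustrative. The same word "bank" gets different vectors depending on its context.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

for sentence in ["He sat on the river bank.", "She works at the bank."]:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # one contextual vector per token; the vector for "bank" differs between the two sentences
    print(sentence, outputs.last_hidden_state.shape)
```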
4.6 Transfer Learning with Neural Networks
Reusing Pre-trained Networks
The output of a network can be used as an input to yet another classifier (neural network or other)
Think: a multi-label image classifier as an auto-encoder
Example: predict movie genre from poster
Using an image classifier trained for object recognition
In many cases, the last or second-to-last layer is reused
Fine-tuning on the task at hand often leads to advantages, i.e., use the trained network, add a new classification layer, and present examples of the new task -> referred to as "transfer learning"
Pre-trained neural networks can be (re)used for other tasks
They can also be retrained, using the pre-training as initialization. Sometimes, individual layers are frozen.
Rationale: some valuable information has already been learned
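A minimal transfer-learning sketch in Keras (assuming TensorFlow is installed): a network pre-trained for object recognition is reused, its layers are frozen, and a new classification layer is added for the task at hand. The model choice (MobileNetV2), the number of genres, and the placeholder variables posters and genres are illustrative.

```python
from tensorflow import keras

# pre-trained image classifier, without its original output layer
base = keras.applications.MobileNetV2(weights="imagenet", include_top=False,
                                      input_shape=(224, 224, 3), pooling="avg")
base.trainable = False                              # freeze the pre-trained layers

model = keras.Sequential([
    base,
    keras.layers.Dense(5, activation="softmax"),    # new output layer, e.g., 5 movie genres
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# model.fit(posters, genres, epochs=5)              # fine-tune on the task at hand (placeholder data)
```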