Logistic regression
- This is a perceptron with a sigmoid activation
- It actually computes the probability that the input belongs to class 1
- Decision boundaries may be obtained by comparing the probability to a threshold
- These boundaries will be lines (hyperplanes in higher dimensions)
- The sigmoid perceptron is a linear classifier
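A minimal sketch (NumPy) of the points above: the sigmoid of the affine score $w_0 + w^T x$ gives the probability of class 1, and thresholding that probability at 0.5 is equivalent to testing the sign of the linear score, i.e. a hyperplane boundary. The weights, bias, and input below are made-up illustrative values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([2.0, -1.0])   # weight vector (hypothetical values)
w0 = 0.5                    # bias (hypothetical value)

def p_class1(x):
    # probability that input x belongs to class 1
    return sigmoid(w0 + w @ x)

def classify(x, threshold=0.5):
    # thresholding the probability at 0.5 is the same as checking
    # the sign of w0 + w.x, so the decision boundary is a hyperplane
    return int(p_class1(x) >= threshold)

x = np.array([1.0, 1.0])
print(p_class1(x), classify(x))
```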
Estimating the model
- Given: training data $\left(X_{1}, y_{1}\right),\left(X_{2}, y_{2}\right), \ldots,\left(X_{N}, y_{N}\right)$
- $X$ are vectors, $y$ are binary class labels (encoded as $\pm 1$ so that the likelihood below takes a single compact form)
- Total probability of data
$$P\left(\left(X_{1}, y_{1}\right),\left(X_{2}, y_{2}\right), \ldots,\left(X_{N}, y_{N}\right)\right)=\prod_{i} P\left(X_{i}, y_{i}\right)=\prod_{i} P\left(y_{i} \mid X_{i}\right) P\left(X_{i}\right)=\prod_{i} \frac{1}{1+e^{-y_{i}\left(w_{0}+w^{T} X_{i}\right)}} P\left(X_{i}\right)$$
- Likelihood
$$P(\text{Training data})=\prod_{i} \frac{1}{1+e^{-y_{i}\left(w_{0}+w^{T} X_{i}\right)}} P\left(X_{i}\right)$$
- Log likelihood
$$\log P(\text{Training data})=\sum_{i} \log P\left(X_{i}\right)-\sum_{i} \log \left(1+e^{-y_{i}\left(w_{0}+w^{T} X_{i}\right)}\right)$$
- Maximum Likelihood Estimate
$$w_{0}, w=\underset{w_{0}, w}{\operatorname{argmax}} \log P(\text{Training data})$$
- Equivalently (note argmin rather than argmax; see the sketch after this list)
$$w_{0}, w=\underset{w_{0}, w}{\operatorname{argmin}} \sum_{i} \log \left(1+e^{-y_{i}\left(w_{0}+w^{T} X_{i}\right)}\right)$$
- Identical to minimizing the KL divergence between the desired output and the actual output $\frac{1}{1+e^{-\left(w_{0}+w^{T} X_{i}\right)}}$
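A sketch of this estimation as plain gradient descent on the loss $\sum_i \log\left(1+e^{-y_i(w_0+w^T X_i)}\right)$, with labels encoded as $y_i \in \{-1,+1\}$. The synthetic data, learning rate, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# two Gaussian clouds as a toy dataset
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)),    # class -1
               rng.normal(+1.0, 1.0, (50, 2))])   # class +1
y = np.concatenate([-np.ones(50), np.ones(50)])

w = np.zeros(2)
w0 = 0.0
lr = 0.1

for _ in range(500):
    z = y * (w0 + X @ w)            # margins y_i (w0 + w.x_i)
    g = -y / (1.0 + np.exp(z))      # per-sample derivative of log(1 + exp(-z_i)) w.r.t. the score
    w  -= lr * (X.T @ g) / len(y)   # gradient step for w
    w0 -= lr * g.mean()             # gradient step for w0

loss = np.log1p(np.exp(-y * (w0 + X @ w))).sum()
print(w0, w, loss)
```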
MLP
Separable case
- The rest of the network may be viewed as a transformation that maps data from non-linearly separable classes into linearly separable features
- We can now attach any linear classifier above it for perfect classification
- Need not be a perceptron
- Could even train an SVM on top of the features!
- If the network structure is insufficient, it will still attempt to transform the inputs into linearly separable features
- Will fail to separate exactly, but will try to minimize error
- The network up to the second-to-last layer is a non-linear function $f(X)$ that maps the input space $X$ into the feature space where the classes are maximally linearly separable
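A hedged sketch of the separable-case idea above: train a small MLP on XOR-like data, then treat everything below the output layer as the feature transform $f(X)$ and fit a separate linear classifier (here scikit-learn's LinearSVC) on the frozen features. The architecture, synthetic data, and training settings are illustrative assumptions.

```python
import torch
import torch.nn as nn
from sklearn.svm import LinearSVC

torch.manual_seed(0)
# XOR-style data: the classes are not linearly separable in the input space
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([0., 1., 1., 0.])
X = X.repeat(50, 1) + 0.1 * torch.randn(200, 2)
y = y.repeat(50)

feature_net = nn.Sequential(nn.Linear(2, 8), nn.Tanh(),
                            nn.Linear(8, 2), nn.Tanh())   # f(X): input -> feature space
head = nn.Linear(2, 1)                                    # linear classifier on top
opt = torch.optim.Adam(list(feature_net.parameters()) + list(head.parameters()), lr=0.05)

for _ in range(500):
    opt.zero_grad()
    logits = head(feature_net(X)).squeeze(1)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, y)
    loss.backward()
    opt.step()

# The learned features should now be (nearly) linearly separable,
# so any linear classifier, e.g. a linear SVM, can be attached on top.
with torch.no_grad():
    feats = feature_net(X).numpy()
svm = LinearSVC().fit(feats, y.numpy())
print("linear SVM accuracy on learned features:", svm.score(feats, y.numpy()))
```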
Lower layers
- Manifold hypothesis: For separable classes, the classes are linearly separable on a non-linear manifold
- Layers sequentially “straighten” the data manifold
- The “feature extraction” layer transforms the data such that the posterior probability may now be modelled by a logistic
Weight as a template
- In high dimensional space, all vectors are more or less the same length
- Which means all $x$ lie (approximately) on the surface of a sphere
- The perceptron fires if the input is within a specified angle of the weight
- Represents a convex region on the surface of the sphere!
- The network is a Boolean function over these regions
- Neuron fires if the input vector is close enough to the weight vector
- If the input pattern matches the weight pattern closely enough
- The perceptron is a correlation filter!
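A small sketch of the correlation-filter view: if inputs have (roughly) constant length, the pre-activation $w^T x$ measures the cosine of the angle between the input and the weight "template", and the unit fires when that angle is small enough. The template, noise level, and angle threshold below are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=64)
w /= np.linalg.norm(w)                 # the weight "template", unit length

def fires(x, angle_threshold_deg=45.0):
    x = x / np.linalg.norm(x)          # high-dimensional inputs ~ constant length
    cos_angle = w @ x                  # correlation between input and template
    return cos_angle >= np.cos(np.radians(angle_threshold_deg))

close_to_template = w + 0.1 * rng.normal(size=64)   # input resembling the template
unrelated = rng.normal(size=64)                     # random, unrelated input
print(fires(close_to_template), fires(unrelated))   # typically True, False
```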
Autoencoder
- The lowest layers of a network detect significant features in the signal
- The signal could be (partially) reconstructed using these features
- Will retain all the significant components of the signal
Simplest autoencoder
- This is just PCA!
- The autoencoder finds the direction of maximum energy
- Simply varying the hidden representation will result in an output that lies along the major axis
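A sketch of the "this is just PCA" point: a linear autoencoder with a single hidden unit, trained by gradient descent to reconstruct its input, ends up with a decoder direction aligned (up to sign) with the first principal component, i.e. the direction of maximum energy. The synthetic 2-D data, learning rate, and iteration count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# 2-D data with most of its energy along the direction [2, 1]
X = rng.normal(size=(500, 1)) * np.array([2.0, 1.0]) + 0.1 * rng.normal(size=(500, 2))

w_enc = rng.normal(size=2)            # encoder weights: hidden h_i = w_enc . x_i
w_dec = rng.normal(size=2)            # decoder weights: reconstruction = h_i * w_dec
lr = 0.01

for _ in range(2000):
    h = X @ w_enc                     # hidden representation, shape (N,)
    X_hat = np.outer(h, w_dec)        # reconstruction, shape (N, 2)
    err = X_hat - X
    w_dec -= lr * (err.T @ h) / len(X)            # gradient of mean squared error w.r.t. decoder
    w_enc -= lr * (X.T @ (err @ w_dec)) / len(X)  # gradient w.r.t. encoder

# Compare the learned decoder direction with the first principal component (up to sign)
pc1 = np.linalg.svd(X - X.mean(0), full_matrices=False)[2][0]
print(w_dec / np.linalg.norm(w_dec), pc1)
```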
Terminology
- Encoder
- The “Analysis” net which computes the hidden representation
- Decoder
- The “Synthesis” net which recomposes the data from the hidden representation
Nonlinearity
- When the hidden layer has a linear activation the decoder represents the best linear manifold to fit the data
- Varying the hidden value will move along this linear manifold
- When the hidden layer has non-linear activation, the net performs nonlinear PCA
- The decoder represents the best non-linear manifold to fit the data
- Varying the hidden value will move along this non-linear manifold
- The model is specific to the training data
- Varying the hidden layer value only generates data along the learned manifold
- Any input will result in an output along the learned manifold
- But may not generalize beyond the manifold
- For unseen inputs the behaviour may be unintuitive; there is no constraint on how they are mapped
- The decoder can only generate data on the manifold that the training data lie on
- This also makes it an excellent “generator” of the distribution of the training data
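A sketch (PyTorch) of the manifold remark above: a nonlinear autoencoder with a 1-D bottleneck is fit to points lying on a curve; afterwards, sweeping the hidden value through the decoder generates outputs along the learned manifold. The network sizes, the sine-curve data, and the sweep range are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
t = torch.rand(500, 1) * 4 - 2
data = torch.cat([t, torch.sin(2 * t)], dim=1)     # points on a 1-D curve in 2-D

encoder = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))
decoder = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 2))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-2)

for _ in range(2000):
    opt.zero_grad()
    recon = decoder(encoder(data))
    loss = ((recon - data) ** 2).mean()            # reconstruction error
    loss.backward()
    opt.step()

# "Varying the hidden value moves along the learned manifold":
with torch.no_grad():
    h = torch.linspace(-2, 2, 9).unsqueeze(1)      # a sweep of hidden values (range is illustrative)
    generated = decoder(h)                         # outputs should lie near the training curve
print(generated)
```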
Dictionary-based techniques
- The decoder represents a source-specific generative dictionary
- Exciting it will produce typical data from the source!
Signal separation
- Separation: Identify the combination of entries from both dictionaries that compose the mixed signal
- Given mixed signal and source dictionaries, find excitation that best recreates mixed signal
- Simple backpropagation
- The intermediate results (the outputs of the individual decoders) are the separated signals
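A sketch (PyTorch) of the separation step described above: with two fixed decoders standing in for dictionaries pre-trained on the individual sources, only the excitations are optimized by backpropagation so that the sum of the decoder outputs recreates the mixed signal; the individual decoder outputs are then the separated-signal estimates. The decoder shapes and the placeholder "mixed" signal are illustrative assumptions, not trained models.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim, hidden = 128, 8
decoder1 = nn.Sequential(nn.Linear(hidden, 64), nn.Tanh(), nn.Linear(64, dim))  # source-1 dictionary (stand-in)
decoder2 = nn.Sequential(nn.Linear(hidden, 64), nn.Tanh(), nn.Linear(64, dim))  # source-2 dictionary (stand-in)
for p in list(decoder1.parameters()) + list(decoder2.parameters()):
    p.requires_grad_(False)                      # dictionaries stay fixed; only excitations are learned

mixed = torch.randn(dim)                         # the observed mixture (placeholder signal)
z1 = torch.zeros(hidden, requires_grad=True)     # excitation of the source-1 decoder
z2 = torch.zeros(hidden, requires_grad=True)     # excitation of the source-2 decoder
opt = torch.optim.Adam([z1, z2], lr=0.05)

for _ in range(1000):
    opt.zero_grad()
    s1, s2 = decoder1(z1), decoder2(z2)          # intermediate results = separated-signal estimates
    loss = ((s1 + s2 - mixed) ** 2).mean()       # excitations that best recreate the mixed signal
    loss.backward()
    opt.step()

separated1, separated2 = decoder1(z1).detach(), decoder2(z2).detach()
print(loss.item())
```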