Linear Autoencoder
The optimal linear autoencoder projects the data onto the principal subspace, which is the same subspace recovered by the SVD (i.e., PCA).
At the optimum, the decoder is the transpose of the encoder.
When the number of data points is very large, computing the SVD directly may not be practical because its computational cost is too high.
The decoder and encoder are only unique up to an invertible transformation: replacing the decoder $V$ by $VA$ and the encoder $W$ by $A^{-1}W$ leaves the reconstruction unchanged, where $A$ is an arbitrary regular (invertible) matrix.
The learned decoder and encoder are not necessarily aligned with the eigenbasis, but they do span the principal subspace of the data.
The real distinction is between spanning the principal subspace with some arbitrary set of basis vectors versus recovering the actual orthogonal basis, usually the eigenbasis.
The point is that a linear autoencoder trained with stochastic gradient descent can identify the principal subspace.
It finds a compressed representation of the data.
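A minimal sketch of this claim, on synthetic data: the snippet below trains a linear autoencoder with plain stochastic gradient descent (manual gradients, NumPy only) and compares the subspace spanned by the learned decoder with the top-$k$ singular subspace from the SVD. The comparison uses projection matrices, which are invariant to the invertible re-parameterization mentioned above, so the learned basis does not need to be aligned with the eigenbasis. All sizes, learning rates, and step counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 500 points in R^10 that mostly live in a 3-D subspace.
n, d, k = 500, 10, 3
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]          # true 3-D subspace
X = rng.normal(size=(n, k)) @ basis.T + 0.01 * rng.normal(size=(n, d))
X -= X.mean(axis=0)                                        # center the data

# Linear autoencoder: code z = x W^T, reconstruction x_hat = z V^T.
W = 0.1 * rng.normal(size=(k, d))   # encoder
V = 0.1 * rng.normal(size=(d, k))   # decoder
lr = 0.05

for step in range(2000):
    batch = X[rng.choice(n, size=64, replace=False)]       # stochastic mini-batch
    Z = batch @ W.T
    E = Z @ V.T - batch                                     # reconstruction error
    grad_V = 2 / len(batch) * E.T @ Z
    grad_W = 2 / len(batch) * V.T @ E.T @ batch
    V -= lr * grad_V
    W -= lr * grad_W

# Principal subspace from the SVD of the centered data matrix.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P_svd = Vt[:k].T @ Vt[:k]                                   # projector onto top-k subspace

# Projector onto the span of the learned decoder columns.  Projection matrices
# are unchanged by an invertible re-parameterization V -> V A, so this comparison
# does not require the learned basis to be the eigenbasis.
P_ae = V @ np.linalg.inv(V.T @ V) @ V.T

print("subspace mismatch:", np.linalg.norm(P_ae - P_svd))   # ~0 when training succeeds
```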
An L1 penalty can be applied to encourage sparsity, since L1 regularization tends to produce sparse solutions.
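A hedged sketch of how such a penalty could look, reusing the encoder/decoder convention from the snippet above; `lam` is a hypothetical regularization weight, not a value from the notes.

```python
import numpy as np

def sparse_ae_loss(batch, W, V, lam=1e-3):
    """Reconstruction loss plus an L1 penalty on the code z = x W^T.
    The L1 term pushes individual code entries toward exactly zero."""
    Z = batch @ W.T
    recon = Z @ V.T
    mse = np.mean(np.sum((recon - batch) ** 2, axis=1))
    l1 = lam * np.mean(np.sum(np.abs(Z), axis=1))
    return mse + l1
```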
Autoencoders can also denoise: the signal typically lies on some low-dimensional manifold while white noise does not, so the autoencoder can identify the signal subspace, which is robust under noise.
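A small illustration of that idea, under the assumption that the signal subspace is linear: projecting noisy observations onto the estimated principal subspace (taken here from the SVD, standing in for a trained linear autoencoder) removes the noise component that lies outside the subspace. The dimensions and noise level below are made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)

# Clean signals live in a 2-D subspace of R^20; observations add white noise.
d, k, n = 20, 2, 200
B = np.linalg.qr(rng.normal(size=(d, k)))[0]
clean = rng.normal(size=(n, k)) @ B.T
noisy = clean + 0.3 * rng.normal(size=(n, d))

# Estimate the signal subspace from the noisy data and project onto it.
_, _, Vt = np.linalg.svd(noisy - noisy.mean(axis=0), full_matrices=False)
P = Vt[:k].T @ Vt[:k]
denoised = noisy @ P

print("noisy error:   ", np.mean((noisy - clean) ** 2))
print("denoised error:", np.mean((denoised - clean) ** 2))   # much smaller
```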
Exercise
KL Divergence: also called relative entropy, it measures how one probability distribution diverges from a second, reference probability distribution.
In the Bayesian view of $D_{KL}(P \| Q)$: $Q$ is the prior distribution that is used to approximate the true distribution $P$.
Definition: $D_{KL}(P \| Q)$ is the expectation of the logarithmic difference between $P$ and $Q$, where the expectation is taken with respect to $P$: $D_{KL}(P \| Q) = \mathbb{E}_{x \sim P}\left[\log P(x) - \log Q(x)\right]$.
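A quick numerical illustration of this definition for discrete distributions; the distributions P and Q below are arbitrary examples.

```python
import numpy as np

# D_KL(P || Q) = sum_x P(x) * (log P(x) - log Q(x))
P = np.array([0.5, 0.3, 0.2])   # "true" distribution
Q = np.array([0.4, 0.4, 0.2])   # approximating distribution

kl_pq = np.sum(P * (np.log(P) - np.log(Q)))
kl_qp = np.sum(Q * (np.log(Q) - np.log(P)))

print(kl_pq, kl_qp)   # both non-negative, and asymmetric: D_KL(P||Q) != D_KL(Q||P)
```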
Transposed Convolution: in contrast with the many-to-one operation of convolution, a transposed convolution establishes a one-to-many operation.
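A rough 1-D illustration of that difference, not tied to any particular framework: the sketch below implements a plain "valid" convolution and its transpose. The transposed version scatters each input element across several output positions, which is the one-to-many behaviour and is why it is commonly used for upsampling.

```python
import numpy as np

def conv1d(x, w):
    """Ordinary 'valid' convolution (cross-correlation): many-to-one,
    each output element is a weighted sum of several input elements."""
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(len(x) - k + 1)])

def conv1d_transpose(y, w):
    """Transposed convolution: one-to-many, each input element is scattered
    across several output positions, enlarging the signal."""
    k = len(w)
    out = np.zeros(len(y) + k - 1)
    for i, v in enumerate(y):
        out[i:i + k] += v * w
    return out

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, 0.5])
y = conv1d(x, w)                  # length 3: the convolution shrinks the input
print(y, conv1d_transpose(y, w))  # length 4: maps back to the original size
```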
Tutorial
ELBO: the evidence lower bound, also called the variational lower bound.
- Problem set-up: given observations X (the training data), we want to estimate some latent variable Z (a distribution with parameters such as a mean or covariance).
- Goal & Challenge: we want to find P(Z|X) P ( Z | X ) (the true posterior distribution), but often it’s impossible, thus instead we estimate Pθ(Z) P θ ( Z ) with some parameter θ θ to approximate P(Z|X) P ( Z | X )
- KL divergence: the difference between the true and the approximating probability distributions is measured by the KL divergence.
- Representation of log-likelihood of training data:
$\log P(x) = \mathcal{L} + D_{KL}(q_\theta(Z) \| p(z|x))$, where $q_\theta(Z)$ is the approximation and $p(z|x)$ is the true posterior distribution.
- $\mathcal{L}$ is the ELBO.
- Note that the KL divergence is always non-negative, so $\mathcal{L}$ is a lower bound on $\log P(x)$; a small numerical check follows below.
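A minimal numerical check of the identity $\log P(x) = \mathcal{L} + D_{KL}(q_\theta(Z) \| p(z|x))$ on a toy discrete model with a binary latent variable; the prior, likelihood, and approximate posterior values below are arbitrary choices for the demo.

```python
import numpy as np

# Tiny discrete latent-variable model: z in {0, 1}, one fixed observation x.
p_z = np.array([0.6, 0.4])            # prior p(z)
p_x_given_z = np.array([0.2, 0.7])    # likelihood p(x | z) for the observed x

p_x = np.sum(p_z * p_x_given_z)       # evidence p(x)
p_z_given_x = p_z * p_x_given_z / p_x # true posterior p(z | x)

q = np.array([0.5, 0.5])              # an arbitrary approximate posterior q(z)

# ELBO = E_q[log p(x, z) - log q(z)]
elbo = np.sum(q * (np.log(p_z * p_x_given_z) - np.log(q)))
# KL(q(z) || p(z | x))
kl = np.sum(q * (np.log(q) - np.log(p_z_given_x)))

print(np.log(p_x), elbo + kl)   # identical up to floating point
print(elbo <= np.log(p_x))      # ELBO is a lower bound because KL >= 0
```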