Notes on Deep Learning by Goodfellow, Bengio, and Courville, Part I
Information theory
• Likely events should have low information content, and in the extreme case, events that are guaranteed to happen should have no information content whatsoever.
• Less likely events should have higher information content.
• Independent events should have additive information.
In order to satisfy all three of these properties, we define the self-information of an event x = x to be I(x) = −log P(x).
The Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution: H(x) = E_{x~P}[I(x)] = −E_{x~P}[log P(x)].
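A minimal sketch (assuming NumPy; the probabilities are illustrative) of self-information and Shannon entropy for a Bernoulli variable:

```python
import numpy as np

def self_information(p):
    """I(x) = -log P(x), in nats."""
    return -np.log(p)

def bernoulli_entropy(p):
    """Shannon entropy H(x) = E[I(x)] of a Bernoulli(p) variable, in nats."""
    return p * self_information(p) + (1 - p) * self_information(1 - p)

print(self_information(1.0))   # -0.0, i.e. zero: a guaranteed event carries no information
print(self_information(0.01))  # ~4.6: rare events carry more information
print(bernoulli_entropy(0.5))  # ~0.693 (log 2): maximal uncertainty for a fair coin flip
```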
The Kullback-Leibler (KL) divergence measures how different two distributions P(x) and Q(x) over the same random variable are: D_KL(P||Q) = E_{x~P}[log P(x) − log Q(x)]. Because the KL divergence is non-negative and measures the difference between two distributions, it is often conceptualized as measuring some sort of distance between these distributions. However, it is not a true distance measure because it is not symmetric: D_KL(P||Q) ≠ D_KL(Q||P) in general.
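A small check (assuming NumPy; P and Q are arbitrary illustrative distributions) that the KL divergence is non-negative but not symmetric:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * (log P(x) - log Q(x)), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * (np.log(p) - np.log(q)))

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.4, 0.4, 0.2])

print(kl_divergence(P, Q))  # >= 0
print(kl_divergence(Q, P))  # also >= 0, but a different value: D_KL(P||Q) != D_KL(Q||P)
```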
Numerical Computation
Rounding error:
• Underflow occurs when numbers near zero are rounded to zero.
• Overflow occurs when numbers with large magnitude are approximated as ∞ or −∞.
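A quick illustration of both failure modes (assuming NumPy and float64):

```python
import numpy as np

# Overflow: exp of a large number exceeds the range of float64 and becomes inf.
print(np.exp(1000.0))   # inf (NumPy also emits an overflow warning)

# Underflow: exp of a very negative number is rounded to exactly zero.
print(np.exp(-1000.0))  # 0.0

# A later log or division then produces -inf/nan rather than a useful value.
print(np.log(np.exp(-1000.0)))  # -inf instead of -1000
```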
One example of a function that must be stabilized against underflow and overflow is the softmax function, softmax(x)_i = exp(x_i) / Σ_j exp(x_j).
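A minimal sketch of the standard stabilization trick (assuming NumPy): subtracting max(x) before exponentiating leaves the softmax unchanged but prevents overflow, and guarantees at least one nonzero term in the denominator:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax: softmax(x)_i = exp(x_i) / sum_j exp(x_j)."""
    z = x - np.max(x)   # shifting by a constant does not change the softmax
    e = np.exp(z)       # the largest exponent is now 0, so exp cannot overflow
    return e / np.sum(e)

x = np.array([1000.0, 1001.0, 1002.0])
print(softmax(x))       # well-defined; a naive exp(x)/sum(exp(x)) would give nan here
```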
Poor conditioning: conditioning refers to how rapidly a function changes with respect to small changes in its inputs. For f(x) = A^(-1) x, the condition number of A is max_{i,j} |λ_i / λ_j|; when this is large, matrix inversion is especially sensitive to error in the input.
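A small sketch (assuming NumPy; the matrices are illustrative) comparing a well-conditioned and a poorly conditioned matrix:

```python
import numpy as np

well = np.array([[2.0, 0.0],
                 [0.0, 1.0]])    # eigenvalues 2 and 1 -> condition number 2
poor = np.array([[1.0, 0.0],
                 [0.0, 1e-8]])   # eigenvalues 1 and 1e-8 -> condition number 1e8

print(np.linalg.cond(well))      # 2.0
print(np.linalg.cond(poor))      # ~1e8: inverting this matrix amplifies input error enormously
```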
Gradient-based optimization: gradient descent reduces f(x) by taking steps in the direction of the negative gradient, x' = x − ε ∇_x f(x), where ε is the learning rate.
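A minimal gradient descent sketch (assuming NumPy; the quadratic objective and learning rate are illustrative choices):

```python
import numpy as np

def f(x):
    return 0.5 * np.dot(x, x)    # f(x) = 1/2 ||x||^2, minimized at x = 0

def grad_f(x):
    return x                     # gradient of f

x = np.array([3.0, -4.0])
epsilon = 0.1                    # learning rate
for _ in range(100):
    x = x - epsilon * grad_f(x)  # x' = x - epsilon * grad f(x)

print(x, f(x))                   # x close to [0, 0], f(x) close to 0
```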
Beyond the Gradient: Jacobian and Hessian Matrices
The Hessian H(f)(x) collects all the second partial derivatives, H(f)(x)_{i,j} = ∂²f(x)/∂x_i∂x_j; equivalently, the Hessian is the Jacobian of the gradient.
Most of the functions we encounter in the context of deep learning have a symmetric
Hessian almost everywhere.
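A sketch (assuming NumPy; the function f(x) = x0²·x1 + sin(x1) and evaluation point are illustrative) that builds the Hessian as the Jacobian of an analytic gradient via finite differences and checks its symmetry:

```python
import numpy as np

def grad_f(x):
    """Analytic gradient of f(x) = x0^2 * x1 + sin(x1)."""
    return np.array([2.0 * x[0] * x[1], x[0] ** 2 + np.cos(x[1])])

def hessian_via_jacobian_of_gradient(grad, x, h=1e-6):
    """Hessian = Jacobian of the gradient, approximated by central differences."""
    n = x.size
    H = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        H[:, j] = (grad(x + e) - grad(x - e)) / (2.0 * h)
    return H

H = hessian_via_jacobian_of_gradient(grad_f, np.array([1.0, 2.0]))
print(H)                               # approximately [[4., 2.], [2., -sin(2)]]
print(np.allclose(H, H.T, atol=1e-5))  # True: this Hessian is symmetric
```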
The (directional) second derivative tells us how well we can expect a gradient descent step to perform. We can make a second-order Taylor series approximation to the function f(x) around the current point x^(0):
f(x) ≈ f(x^(0)) + (x − x^(0))^T g + (1/2) (x − x^(0))^T H (x − x^(0)),
where g is the gradient and H is the Hessian at x^(0). Substituting the gradient descent step x = x^(0) − εg shows that, when g^T H g is positive, the step size that minimizes this approximation is ε* = g^T g / (g^T H g).
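A sketch (assuming NumPy; H, b, and x^(0) are illustrative) of using this approximation to pick the step size ε* for a quadratic, where the second-order model is exact:

```python
import numpy as np

# Quadratic objective f(x) = 1/2 x^T H x + b^T x, whose Taylor expansion is exact.
H = np.array([[3.0, 0.0],
              [0.0, 1.0]])
b = np.array([1.0, -2.0])

def f(x):
    return 0.5 * x @ H @ x + b @ x

x0 = np.array([1.0, 1.0])
g = H @ x0 + b                     # gradient at x0

eps_star = (g @ g) / (g @ H @ g)   # optimal step along -g under the second-order model
print(f(x0), f(x0 - eps_star * g)) # f decreases after the step
```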
The second derivative can be used to determine whether a critical point is a local maximum, a local minimum, or a saddle point. In one dimension, f'(x) = 0 with f''(x) > 0 indicates a local minimum, f''(x) < 0 a local maximum, and f''(x) = 0 is inconclusive. In more than one dimension, the test examines the eigenvalues of the Hessian at the critical point, and it is not necessary to have an eigenvalue of 0 in order to get a saddle point: it is only necessary to have both positive and negative eigenvalues.
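A small sketch (assuming NumPy) classifying the critical point of the standard saddle example f(x, y) = x² − y² by the eigenvalues of its Hessian:

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a critical point at the origin (gradient [2x, -2y] vanishes).
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])        # Hessian of f, constant everywhere

eigvals = np.linalg.eigvalsh(H)    # eigenvalues of a symmetric matrix
if np.all(eigvals > 0):
    print("local minimum")
elif np.all(eigvals < 0):
    print("local maximum")
elif np.any(eigvals > 0) and np.any(eigvals < 0):
    print("saddle point")          # printed here: eigenvalues are +2 and -2
else:
    print("test inconclusive")     # some eigenvalue is zero, the rest share a sign
```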
Constrained Optimization
We may wish to find the maximal or minimal value of f(x) for values of x in some set S.
The Karush–Kuhn–Tucker (KKT) approach provides a very general solution to constrained optimization, generalizing the method of Lagrange multipliers. With the KKT approach, we introduce a new function called the generalized Lagrangian or generalized Lagrange function: describing S via equality constraints g^(i)(x) = 0 and inequality constraints h^(j)(x) ≤ 0, we define
L(x, λ, α) = f(x) + Σ_i λ_i g^(i)(x) + Σ_j α_j h^(j)(x),
where the λ_i and α_j are the KKT multipliers.
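A minimal worked example (assuming NumPy; the objective and constraint are illustrative) of the Lagrange multiplier method with a single equality constraint, minimizing f(x, y) = x² + y² subject to x + y − 1 = 0 by solving the stationarity conditions of the Lagrangian as a linear system:

```python
import numpy as np

# L(x, y, lam) = x^2 + y^2 + lam * (x + y - 1)
# Setting its gradient to zero gives:
#   2x + lam = 0
#   2y + lam = 0
#   x + y - 1 = 0
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 1.0])

x, y, lam = np.linalg.solve(A, rhs)
print(x, y, lam)   # 0.5, 0.5, -1.0: the constrained minimum is at (0.5, 0.5)
```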