Notes on Deep Learning by Goodfellow, Bengio, and Courville, Part I
Information theory
• Likely events should have low information content, and in the extreme case, events that are guaranteed to happen should have no information content whatsoever.
• Less likely events should have higher information content.
• Independent events should have additive information.
In order to satisfy all three of these properties, we define the self-information of an event x = x to be I(x) = −log P(x).
The Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution: H(x) = E_{x~P}[I(x)] = −E_{x~P}[log P(x)].
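A minimal sketch (assuming NumPy; the probabilities are illustrative) of self-information and Shannon entropy for a Bernoulli variable:

```python
import numpy as np

def self_information(p):
    """I(x) = -log P(x), in nats."""
    return -np.log(p)

def bernoulli_entropy(p):
    """Shannon entropy H(x) = E[I(x)] of a Bernoulli(p) variable, in nats."""
    return p * self_information(p) + (1 - p) * self_information(1 - p)

print(self_information(1.0))   # -0.0, i.e. zero: a guaranteed event carries no information
print(self_information(0.01))  # ~4.6: rare events carry more information
print(bernoulli_entropy(0.5))  # ~0.693 (log 2): maximal uncertainty for a fair coin flip
```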
The Kullback-Leibler (KL) divergence measures how different two distributions P(x) and Q(x) over the same random variable are: D_KL(P||Q) = E_{x~P}[log P(x) − log Q(x)]. Because the KL divergence is non-negative and measures the difference between two distributions, it is often conceptualized as measuring some sort of distance between these distributions. However, it is not a true distance measure because it is not symmetric: D_KL(P||Q) ≠ D_KL(Q||P) in general.
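A small check (assuming NumPy; P and Q are arbitrary illustrative distributions) that the KL divergence is non-negative but not symmetric:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * (log P(x) - log Q(x)), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * (np.log(p) - np.log(q)))

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.4, 0.4, 0.2])

print(kl_divergence(P, Q))  # >= 0
print(kl_divergence(Q, P))  # also >= 0, but a different value: D_KL(P||Q) != D_KL(Q||P)
```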
Numerical Computation
Rounding error:
• Underflow occurs when numbers near zero are rounded to zero.
• Overflow occurs when numbers with large magnitude are approximated as ∞ or −∞.
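A quick illustration of both failure modes (assuming NumPy and float64):

```python
import numpy as np

# Overflow: exp of a large number exceeds the range of float64 and becomes inf.
print(np.exp(1000.0))   # inf (NumPy also emits an overflow warning)

# Underflow: exp of a very negative number is rounded to exactly zero.
print(np.exp(-1000.0))  # 0.0

# A later log or division then produces -inf/nan rather than a useful value.
print(np.log(np.exp(-1000.0)))  # -inf instead of -1000
```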
One example of a function that must be stabilized against underflow and overflow is the softmax function, softmax(x)_i = exp(x_i) / Σ_j exp(x_j).
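A minimal sketch of the standard stabilization trick (assuming NumPy): subtracting max(x) before exponentiating leaves the softmax unchanged but prevents overflow, and guarantees at least one nonzero term in the denominator:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax: softmax(x)_i = exp(x_i) / sum_j exp(x_j)."""
    z = x - np.max(x)   # shifting by a constant does not change the softmax
    e = np.exp(z)       # the largest exponent is now 0, so exp cannot overflow
    return e / np.sum(e)

x = np.array([1000.0, 1001.0, 1002.0])
print(softmax(x))       # well-defined; a naive exp(x)/sum(exp(x)) would give nan here
```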
Poor conditioning: conditioning refers to how rapidly a function changes with respect to small changes in its inputs. For f(x) = A^(-1) x, the condition number of A is max_{i,j} |λ_i / λ_j|; when this is large, matrix inversion is especially sensitive to error in the input.
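A small sketch (assuming NumPy; the matrices are illustrative) comparing a well-conditioned and a poorly conditioned matrix:

```python
import numpy as np

well = np.array([[2.0, 0.0],
                 [0.0, 1.0]])    # eigenvalues 2 and 1 -> condition number 2
poor = np.array([[1.0, 0.0],
                 [0.0, 1e-8]])   # eigenvalues 1 and 1e-8 -> condition number 1e8

print(np.linalg.cond(well))      # 2.0
print(np.linalg.cond(poor))      # ~1e8: inverting this matrix amplifies input error enormously
```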
Gradient-based optimization: gradient descent reduces f(x) by taking steps in the direction of the negative gradient, x' = x − ε ∇_x f(x), where ε is the learning rate.
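A minimal gradient descent sketch (assuming NumPy; the quadratic objective and learning rate are illustrative choices):

```python
import numpy as np

def f(x):
    return 0.5 * np.dot(x, x)    # f(x) = 1/2 ||x||^2, minimized at x = 0

def grad_f(x):
    return x                     # gradient of f

x = np.array([3.0, -4.0])
epsilon = 0.1                    # learning rate
for _ in range(100):
    x = x - epsilon * grad_f(x)  # x' = x - epsilon * grad f(x)

print(x, f(x))                   # x close to [0, 0], f(x) close to 0
```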
Beyond the Gradient: Jacobian and Hessian Matrices
The Hessian H(f)(x) collects all the second partial derivatives, H(f)(x)_{i,j} = ∂²f(x)/∂x_i∂x_j; equivalently, the Hessian is the Jacobian of the gradient.
Most of the functions we encounter in the context of deep learning have a symmetric
Hessian almost everywhere.
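A sketch (assuming NumPy; the function f(x) = x0²·x1 + sin(x1) and evaluation point are illustrative) that builds the Hessian as the Jacobian of an analytic gradient via finite differences and checks its symmetry:

```python
import numpy as np

def grad_f(x):
    """Analytic gradient of f(x) = x0^2 * x1 + sin(x1)."""
    return np.array([2.0 * x[0] * x[1], x[0] ** 2 + np.cos(x[1])])

def hessian_via_jacobian_of_gradient(grad, x, h=1e-6):
    """Hessian = Jacobian of the gradient, approximated by central differences."""
    n = x.size
    H = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        H[:, j] = (grad(x + e) - grad(x - e)) / (2.0 * h)
    return H

H = hessian_via_jacobian_of_gradient(grad_f, np.array([1.0, 2.0]))
print(H)                               # approximately [[4., 2.], [2., -sin(2)]]
print(np.allclose(H, H.T, atol=1e-5))  # True: this Hessian is symmetric
```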
The (directional) second derivative tells us how well we can expect a gradient descent step to perform. We can make a second-order Taylor series approximation to the function f(x) around the current point x^(0):
f(x) ≈ f(x^(0)) + (x − x^(0))^T g + (1/2) (x − x^(0))^T H (x − x^(0)),
where g is the gradient and H is the Hessian at x^(0). Substituting the gradient descent step x = x^(0) − εg shows that, when g^T H g is positive, the step size that minimizes this approximation is ε* = g^T g / (g^T H g).
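A sketch (assuming NumPy; H, b, and x^(0) are illustrative) of using this approximation to pick the step size ε* for a quadratic, where the second-order model is exact:

```python
import numpy as np

# Quadratic objective f(x) = 1/2 x^T H x + b^T x, whose Taylor expansion is exact.
H = np.array([[3.0, 0.0],
              [0.0, 1.0]])
b = np.array([1.0, -2.0])

def f(x):
    return 0.5 * x @ H @ x + b @ x

x0 = np.array([1.0, 1.0])
g = H @ x0 + b                     # gradient at x0

eps_star = (g @ g) / (g @ H @ g)   # optimal step along -g under the second-order model
print(f(x0), f(x0 - eps_star * g)) # f decreases after the step
```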
The second derivative can be used to determine whether a critical point is a local maximum, a local minimum, or a saddle point. In one dimension, f'(x) = 0 with f''(x) > 0 indicates a local minimum, f''(x) < 0 a local maximum, and f''(x) = 0 is inconclusive. In more than one dimension, the test examines the eigenvalues of the Hessian at the critical point, and it is not necessary to have an eigenvalue of 0 in order to get a saddle point: it is only necessary to have both positive and negative eigenvalues.
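A small sketch (assuming NumPy) classifying the critical point of the standard saddle example f(x, y) = x² − y² by the eigenvalues of its Hessian:

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a critical point at the origin (gradient [2x, -2y] vanishes).
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])        # Hessian of f, constant everywhere

eigvals = np.linalg.eigvalsh(H)    # eigenvalues of a symmetric matrix
if np.all(eigvals > 0):
    print("local minimum")
elif np.all(eigvals < 0):
    print("local maximum")
elif np.any(eigvals > 0) and np.any(eigvals < 0):
    print("saddle point")          # printed here: eigenvalues are +2 and -2
else:
    print("test inconclusive")     # some eigenvalue is zero, the rest share a sign
```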
Constrained Optimization
We may wish to find the maximal or minimal value of f(x) for values of x in some set S.
The Karush–Kuhn–Tucker (KKT) approach provides a very general solution to constrained optimization, generalizing the method of Lagrange multipliers. With the KKT approach, we introduce a new function called the generalized Lagrangian or generalized Lagrange function: describing S via equality constraints g^(i)(x) = 0 and inequality constraints h^(j)(x) ≤ 0, we define
L(x, λ, α) = f(x) + Σ_i λ_i g^(i)(x) + Σ_j α_j h^(j)(x),
where the λ_i and α_j are the KKT multipliers.
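A minimal worked example (assuming NumPy; the objective and constraint are illustrative) of the Lagrange multiplier method with a single equality constraint, minimizing f(x, y) = x² + y² subject to x + y − 1 = 0 by solving the stationarity conditions of the Lagrangian as a linear system:

```python
import numpy as np

# L(x, y, lam) = x^2 + y^2 + lam * (x + y - 1)
# Setting its gradient to zero gives:
#   2x + lam = 0
#   2y + lam = 0
#   x + y - 1 = 0
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 1.0])

x, y, lam = np.linalg.solve(A, rhs)
print(x, y, lam)   # 0.5, 0.5, -1.0: the constrained minimum is at (0.5, 0.5)
```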