Information Entropy
Information entropy measures the information missing before reception, that is, the level of uncertainty of a random variable $X$.
Information entropy definition:

$$H(X) = -\sum_{i} p(x_i) \log p(x_i)$$

where $X$ is a random variable and $p(x_i)$ is the probability of $X = x_i$. When $\log$ is $\log_2$, the unit of $H(X)$ is the bit. When $\log$ is $\log_{10}$, the unit of $H(X)$ is the dit.
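As a concrete illustration, here is a minimal sketch of this definition in plain Python (the function name `entropy` and the coin examples are my own additions, not from the original text):

```python
import math

def entropy(probs, base=2):
    """H(X) = -sum_i p(x_i) * log(p(x_i)); zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A fair coin is maximally uncertain over two outcomes: exactly 1 bit.
print(entropy([0.5, 0.5]))   # 1.0
# A biased coin is more predictable, so its entropy is lower.
print(entropy([0.9, 0.1]))   # ~0.47
```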
Example
English character
$X$ is a random variable. It could be one character of $a, b, c, \ldots, x, y, z$. Assuming all 26 characters are equally likely, the information entropy of $X$ is:

$$H(X) = -\sum_{i=1}^{26} \frac{1}{26} \log_2 \frac{1}{26} = \log_2 26 \approx 4.7$$
This means the information entropy of an English character is about 4.7 bits, so 5 binary digits are enough to encode an English character.
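A quick numeric check of the 4.7-bit figure (plain Python; my own illustration):

```python
import math

# 26 equally likely letters: H(X) = log2(26)
h = math.log2(26)
print(round(h, 2))    # 4.7
print(math.ceil(h))   # 5 -> five binary digits are enough
```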
ASCII code
$X$ is a random variable. It could be one ASCII code; there are 128 ASCII codes in total. Assuming each code is equally likely, the information entropy of $X$ is:

$$H(X) = -\sum_{i=1}^{128} \frac{1}{128} \log_2 \frac{1}{128} = \log_2 128 = 7$$
This means the information entropy of an ASCII code is 7 bits, so 7 binary digits are enough to encode an ASCII code. We use a byte, which is 8 bits, to represent an ASCII code; the extra bit is used for checking.
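The same check for the ASCII case:

```python
import math

# 128 equally likely ASCII codes: H(X) = log2(128) = 7 bits exactly
print(math.log2(128))   # 7.0 -> fits in 7 bits; the byte's 8th bit is left for checking
```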
================================================
Cross Entropy in Machine Learning
In information theory, the cross entropy between two probability distributions $p$ and $q$ over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an “unnatural” probability distribution $q$, rather than the “true” distribution $p$.
Cross entropy definition:

$$S(p, q) = -\sum_{x} p(x) \log q(x)$$

where $x$ ranges over the values of the set, $p$ is the target distribution, and $q$ is the temporary, “unnatural” distribution produced by your model.
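A minimal sketch of this definition in Python (the function name and the choice of base-2 logarithm are mine; machine-learning code usually uses the natural logarithm, which only rescales the value):

```python
import math

def cross_entropy(p, q, base=2):
    """S(p, q) = -sum_x p(x) * log(q(x)); p is the target, q the estimate."""
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

# When q equals p, the cross entropy reduces to the entropy of p.
print(cross_entropy([0.5, 0.5], [0.5, 0.5]))   # 1.0
# A mismatched q costs extra bits on average.
print(cross_entropy([0.5, 0.5], [0.9, 0.1]))   # ~1.74
```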
The more similar $p$ and $q$ are, the smaller $S(p, q)$ is. So $S(\cdot)$ can be used as a training objective. TensorFlow provides tf.nn.softmax_cross_entropy_with_logits_v2() for this purpose.
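For reference, a hedged sketch of how that op is typically called, assuming a TensorFlow 1.x-style environment (in TensorFlow 2.x the same op is available as tf.nn.softmax_cross_entropy_with_logits, without the _v2 suffix); the example tensors are illustrative only:

```python
import tensorflow.compat.v1 as tf  # assumption: TensorFlow with the 1.x compatibility API

# labels: the true distribution p (one-hot); logits: raw, pre-softmax scores that produce q.
labels = tf.constant([[0.0, 1.0, 0.0, 0.0, 0.0]])
logits = tf.constant([[0.5, 2.0, 0.5, 0.5, 1.0]])

# The op applies softmax to `logits` internally, then computes -sum(p * log(q)) per example.
per_example_loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits)
loss = tf.reduce_mean(per_example_loss)  # average over the batch -> scalar training loss
```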
Example
The training instance's one-hot label is y_target = [0, 1, 0, 0, 0]. The distribution predicted by your algorithm is y_tmp = [0.1, 0.1, 0.2, 0.1, 0.5].
You want y_tmp to approximate y_target. In other words, your goal is to make y_tmp[0] smaller, y_tmp[1] greater, y_tmp[2] smaller, and so on. That is a complex task, with many terms to consider.
How about simply making S(y_target, y_tmp) smaller? One objective handles all of it. Much better.
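Plugging the numbers from this example into the definition above (base-2 logs to stay consistent with the entropy section; the "better" prediction below is my own addition for comparison):

```python
import math

def cross_entropy(p, q, base=2):
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

y_target = [0, 1, 0, 0, 0]
y_tmp    = [0.1, 0.1, 0.2, 0.1, 0.5]

# Only the index where y_target is 1 contributes: S = -log2(0.1) ~= 3.32
print(cross_entropy(y_target, y_tmp))

# A prediction that puts more mass on the correct class scores lower:
y_better = [0.05, 0.80, 0.05, 0.05, 0.05]
print(cross_entropy(y_target, y_better))       # -log2(0.8) ~= 0.32
```

Because the entries of y_tmp must sum to 1, minimizing this single number pushes y_tmp[1] up and, implicitly, the other entries down.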
Ref
Cross entropy - Wikipedia
https://en.wikipedia.org/wiki/Cross_entropy
A Friendly Introduction to Cross-Entropy Loss
https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/#entropy
Entropy - Wikipedia
https://en.wikipedia.org/wiki/Entropy
