Understanding Information Entropy

aWty_

Published 2022-09-14 19:09:07


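Copyright notice: This is an original article by the author, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.

Article link: https://blog.youkuaiyun.com/ID246783/article/details/126855789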


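Wordle

Wordle is a small game in which your goal is to guess a five-letter word. You get six attempts, and after each guess you receive feedback on how your guess relates to the target word.

For example, if my first guess is $earth$, I might get back a colored string like $\textcolor{forestgreen}{e}\textcolor{goldenrod}{a}r\textcolor{goldenrod}{t}h$.

This means the target word contains the letters $e$, $a$, and $t$; $e$ is in the correct position, while $a$ and $t$ are in the wrong positions. The uncolored $r$ and $h$ do not appear in the target.

That is roughly how the rules work.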
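
The coloring rule can be sketched in a few lines of Python. This is only an illustration, not code from the original post: the function name `feedback`, the pattern encoding (`G` for green, `Y` for yellow, `-` for grey), and the handling of repeated letters (following the usual Wordle convention, which the text above does not spell out) are all assumptions.

```python
from collections import Counter

def feedback(guess: str, target: str) -> str:
    """Return the color pattern for a guess: G = green, Y = yellow, - = grey."""
    result = ["-"] * len(guess)
    # Target letters that were not matched exactly can still produce yellows.
    remaining = Counter(t for g, t in zip(guess, target) if g != t)
    # First pass: exact matches are green.
    for i, (g, t) in enumerate(zip(guess, target)):
        if g == t:
            result[i] = "G"
    # Second pass: a misplaced letter is yellow while unmatched copies remain.
    for i, (g, t) in enumerate(zip(guess, target)):
        if g != t and remaining[g] > 0:
            result[i] = "Y"
            remaining[g] -= 1
    return "".join(result)

# "enact" is a hypothetical target that reproduces the earth example above.
print(feedback("earth", "enact"))  # GY-Y-
```

The two passes matter only when a letter repeats: greens are assigned first so that a misplaced copy of a letter cannot use up a target letter that is already matched exactly.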

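Introduction

Clearly, different guesses give us different amounts of information. We need a way to quantify how much information guessing a particular word returns, and naturally we want that amount to be as large as possible.

We can assume that every candidate target word $T$ is equally likely in a given game, so the probability that any particular $T$ is the target is $\frac 1{|N|}$, where $N$ is the set of all five-letter English words.

Think of it this way: for a guessed word $S$ and a target word $T$, the game returns a color pattern $k$. Different targets $T$ may produce the same $k$, so let $p(k)$ denote the probability that guessing $S$ yields pattern $k$. It equals $\frac{num(k)}{|N|}$, where $num(k)$ is the number of targets $T$ that produce pattern $k$ when $S$ is guessed.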
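
As a rough sketch of how $p(k)$ could be tabulated, the snippet below counts the patterns a guess produces over a word list. The eight-word list `WORDS` is a made-up stand-in for $N$, and the helper names `feedback` and `pattern_distribution` are hypothetical; the repeated-letter rule in `feedback` follows the usual Wordle convention and is an assumption.

```python
from collections import Counter
from fractions import Fraction

def feedback(guess: str, target: str) -> str:
    """Color pattern for a guess: G = green, Y = yellow, - = grey."""
    result = ["-"] * len(guess)
    remaining = Counter(t for g, t in zip(guess, target) if g != t)
    for i, (g, t) in enumerate(zip(guess, target)):
        if g == t:
            result[i] = "G"
    for i, (g, t) in enumerate(zip(guess, target)):
        if g != t and remaining[g] > 0:
            result[i] = "Y"
            remaining[g] -= 1
    return "".join(result)

def pattern_distribution(guess: str, words: list[str]) -> dict[str, Fraction]:
    """Map each pattern k to p(k) = num(k) / |N| for the given guess."""
    num = Counter(feedback(guess, target) for target in words)  # num(k)
    return {k: Fraction(count, len(words)) for k, count in num.items()}

# Tiny illustrative stand-in for N; a real run would use a full word list.
WORDS = ["earth", "enact", "heart", "later", "crane", "slate", "pound", "moist"]

for pattern, prob in sorted(pattern_distribution("earth", WORDS).items()):
    print(pattern, prob)
```

Using `Fraction` keeps the probabilities exact, which makes it easy to check that they sum to 1.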

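What we want to compute is how much value a guess $S$ gives us. This is naturally an expected-value problem:

$E(S) = \sum_{k}p(k) \times I(k)$

Here $E(S)$ is the expected value that $S$ provides, and $I(k)$ is the amount of information we gain from pattern $k$. The next step is to quantify the information that a given $k$ provides.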

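Information

Consider a pattern $k$. It is really a constraint on $T$: the target must be one of the words that satisfy it. Let $Y(k)$ denote the constraint that pattern $k$ places on the answer. Each time a pattern $k$ appears, the set of possible answers shrinks to:

$G(k) = \{ T \in N \mid T \text{ satisfies } Y(k) \}$

Clearly, the smaller $|G(k)|$ is, the more information $I(k)$ this $k$ provides.

Information theory defines the information of $k$ as follows: if the set $G(k)$ determined by $Y(k)$ satisfies $|G(k)| = \frac 12 |N|$, we say that $k$ carries $1\ \text{bit}$ of information. Likewise, if $|G(k)| = \frac{1}{2^x}|N|$, then $k$ carries $x\ \text{bit}$ of information.

In summary: $|G(k)| = \frac{1}{2^{I(k)}}|N|$

We also notice that $p(k)$ is exactly $\frac{|G(k)|}{|N|}$, which gives this concise formula:

$I(k) = -\log_2p(k)$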
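
A minimal numeric check of this formula, assuming only Python's standard `math` module (the function name `information_bits` is made up for illustration):

```python
import math

def information_bits(p: float) -> float:
    """I(k) = -log2 p(k), defined for 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)

# Halving the candidate set is worth 1 bit; cutting it to 1/8 is worth 3 bits.
print(information_bits(1 / 2))  # 1.0
print(information_bits(1 / 8))  # 3.0
```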

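A benefit of this definition: if one constraint gives you $a\ \text{bit}$ of information and another, independent constraint gives you $b\ \text{bit}$, you can add the two directly to get $(a+b)\ \text{bit}$ of information. This makes sense both mathematically and intuitively.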
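
A short worked example of this additivity, under the assumption that the two constraints are independent (so the fractions of $N$ they leave behind multiply). Suppose the first constraint keeps $\frac 14$ of $N$ and the second keeps $\frac 18$. Together they keep $\frac 14 \times \frac 18 = \frac 1{32}$, so:

$-\log_2\left(\frac 14 \times \frac 18\right) = -\log_2\frac 14 - \log_2\frac 18 = 2 + 3 = 5\ \text{bit}$

In general, with $p_a = 2^{-a}$ and $p_b = 2^{-b}$:

$-\log_2(p_a p_b) = -\log_2 p_a - \log_2 p_b = a + b$

The logarithm is what turns the multiplication of probabilities into the addition of information.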

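Information Entropy

Returning to our earlier question, the $E(S)$ we want becomes:

$E(S) = \sum_{k}p(k)I(k) = -\sum_{k}p(k)\log_2p(k)$

In information theory, this expected amount of information is called "entropy":

$H(X) = -\sum_{x \in X}p(x)\log_2p(x)$

With this, we can write a program that brute-forces $H(S)$ for every $S$ and then greedily chooses which word to guess based on $H(S)$. Someone who implemented this reports that it finds the target word in about $4$ guesses on average.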
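
The brute-force greedy idea could look roughly like the sketch below. It is not the program mentioned above, only a minimal illustration under several assumptions: the helper names are invented, guesses are drawn only from the remaining candidates, the target is assumed to be in the word list, and `feedback` uses the usual Wordle rule for repeated letters.

```python
import math
from collections import Counter

def feedback(guess: str, target: str) -> str:
    """Color pattern for a guess: G = green, Y = yellow, - = grey."""
    result = ["-"] * len(guess)
    remaining = Counter(t for g, t in zip(guess, target) if g != t)
    for i, (g, t) in enumerate(zip(guess, target)):
        if g == t:
            result[i] = "G"
    for i, (g, t) in enumerate(zip(guess, target)):
        if g != t and remaining[g] > 0:
            result[i] = "Y"
            remaining[g] -= 1
    return "".join(result)

def entropy(guess: str, candidates: list[str]) -> float:
    """H(S) = -sum_k p(k) log2 p(k) over the patterns the guess can produce."""
    counts = Counter(feedback(guess, target) for target in candidates)
    n = len(candidates)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def best_guess(candidates: list[str]) -> str:
    """Greedy choice: the candidate with the largest entropy."""
    return max(candidates, key=lambda s: entropy(s, candidates))

def solve(target: str, words: list[str], max_turns: int = 6):
    """Play one game; return the number of guesses used, or None on failure."""
    candidates = list(words)  # assumes target is one of these words
    for turn in range(1, max_turns + 1):
        guess = best_guess(candidates)
        pattern = feedback(guess, target)
        if pattern == "G" * len(target):
            return turn
        # Keep only the words consistent with the observed pattern.
        candidates = [t for t in candidates if feedback(guess, t) == pattern]
    return None

# Tiny illustrative word list; a real run would use a full dictionary.
WORDS = ["earth", "enact", "heart", "later", "crane", "slate", "pound", "moist"]
for word in WORDS:
    print(word, round(entropy(word, WORDS), 3))
print("first guess:", best_guess(WORDS))
```

Recomputing the entropy against the shrinking candidate list after every guess is what makes the strategy greedy: each turn it maximizes the expected information for the current state only, without looking ahead.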
