1. Mathematical Background
(1) Relative Entropy
Let $p(x)$ and $q(x)$ be two probability distributions over the values of $X$. The relative entropy (KL divergence) of $p$ with respect to $q$ is
$$D(p \| q)=\sum_{x} p(x) \log \frac{p(x)}{q(x)}=E_{p(x)} \log \frac{p(x)}{q(x)}$$
Relative entropy measures how far apart two probability distributions are. (Note that it is not symmetric in $p$ and $q$, so it is not a true distance metric.)
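As a quick sanity check, here is a minimal numpy sketch (not part of the original notes; the two distributions are made up):

```python
import numpy as np

def kl_divergence(p, q):
    """Relative entropy D(p || q) for discrete distributions.
    Assumes q(x) > 0 wherever p(x) > 0; terms with p(x) = 0 contribute 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))  # ≈ 0.0253
print(kl_divergence(q, p))  # ≈ 0.0258, not the same: D is asymmetric
```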
(2) Mutual Information
The mutual information of two random variables X and Y is defined as the relative entropy between the joint distribution of X, Y and the product of their marginal distributions:
$$I(X, Y)=D(P(X, Y) \| P(X) P(Y))=\sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}$$
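For intuition, a small sketch that computes I(X, Y) from a joint probability table (the table below is made up; a strongly correlated pair gives a clearly positive value):

```python
import numpy as np

def mutual_information(joint):
    """I(X, Y) from a joint probability table joint[x, y]."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # marginal P(X), column vector
    py = joint.sum(axis=0, keepdims=True)  # marginal P(Y), row vector
    mask = joint > 0
    return np.sum(joint[mask] * np.log(joint[mask] / (px * py)[mask]))

joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])   # X and Y agree 80% of the time
print(mutual_information(joint))  # ≈ 0.193 > 0
```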
(3) Information Gain
Information gain measures how much the uncertainty about the class X decreases once the value of feature A is known. The information gain $g(D, A)$ of feature A on training set D is defined as the difference between the empirical entropy $H(D)$ of D and the empirical conditional entropy $H(D \mid A)$ of D given A:
$$g(D, A)=H(D)-H(D \mid A)$$
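A direct implementation of this definition (a sketch; the toy labels and feature values are made up):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Empirical entropy H(D) of a label sequence, in bits."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature):
    """g(D, A) = H(D) - H(D | A), where A is given as a feature column."""
    labels, feature = np.asarray(labels), np.asarray(feature)
    h_d_a = sum(np.mean(feature == v) * entropy(labels[feature == v])
                for v in np.unique(feature))
    return entropy(labels) - h_d_a

labels  = ['yes', 'yes', 'no', 'no', 'yes', 'no']
feature = ['a',   'a',   'b',  'b',  'a',   'b' ]
print(information_gain(labels, feature))  # 1.0 bit: A fully determines the class
```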
(4) Probability
Conditional probability:
$$P(A \mid B)=\frac{P(A B)}{P(B)}$$
Law of total probability:
$$P(A)=\sum_{i} P\left(A \mid B_{i}\right) P\left(B_{i}\right)$$
Bayes' formula:
$$P\left(B_{i} \mid A\right)=\frac{P\left(A \mid B_{i}\right) P\left(B_{i}\right)}{\sum_{j} P\left(A \mid B_{j}\right) P\left(B_{j}\right)}$$
Applying Bayes' formula:
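A classic worked example (the numbers are made up, not from the original notes): a rare disease and an imperfect test.

```python
# B1 = has the disease, B2 = does not; A = test is positive.
p_d = 0.01        # prior P(B1)
p_pos_d = 0.99    # P(A | B1), test sensitivity
p_pos_nd = 0.05   # P(A | B2), false positive rate

# Denominator: law of total probability.
p_pos = p_pos_d * p_d + p_pos_nd * (1 - p_d)
print(p_pos_d * p_d / p_pos)  # P(B1 | A) ≈ 0.167, far below the 0.99 sensitivity
```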
(5) Thoughts Prompted by Bayes' Formula
Given some samples D, we want to compute the probability of each hypothesis $A_{1}, A_{2}, \ldots, A_{n}$ in light of those samples, i.e. $P\left(A_{i} \mid D\right)$. Since $P(D)$ does not depend on the hypothesis, and assuming a uniform prior $P\left(A_{i}\right)$ (this is what justifies the first arrow below),
$$\begin{array}{c} \max P\left(A_{i} \mid D\right)=\max \frac{P\left(D \mid A_{i}\right) P\left(A_{i}\right)}{P(D)}=\max \left(P\left(D \mid A_{i}\right) P\left(A_{i}\right)\right) \rightarrow \max P\left(D \mid A_{i}\right) \\ \Rightarrow \max P\left(A_{i} \mid D\right) \rightarrow \max P\left(D \mid A_{i}\right) \end{array}$$
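A tiny numeric check of this reduction (a sketch; the candidate coin biases are made up): with a uniform prior, maximizing the posterior picks the same hypothesis as maximizing the likelihood.

```python
import numpy as np

biases = np.array([0.3, 0.5, 0.8])   # hypotheses A_i: candidate coin biases
heads, tails = 7, 3                  # observed data D

likelihood = biases**heads * (1 - biases)**tails        # P(D | A_i)
prior = np.full(len(biases), 1 / len(biases))           # uniform P(A_i)
posterior = likelihood * prior / np.sum(likelihood * prior)

print(np.argmax(posterior) == np.argmax(likelihood))  # True
```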
2. Naive Bayes
Naive Bayes is a supervised learning algorithm that applies Bayes' theorem under the "naive" assumption that the features are mutually independent given the class.
For a given feature vector $x_{1}, x_{2}, \cdots, x_{n}$, the probability of class y is obtained from Bayes' formula:
$$P\left(y \mid x_{1}, x_{2}, \cdots, x_{n}\right)=\frac{P(y) P\left(x_{1}, x_{2}, \cdots, x_{n} \mid y\right)}{P\left(x_{1}, x_{2}, \cdots, x_{n}\right)}$$
Applying the naive independence assumption,
$$P\left(x_{i} \mid y, x_{1}, \cdots, x_{i-1}, x_{i+1}, \cdots, x_{n}\right)=P\left(x_{i} \mid y\right)$$
the class probability simplifies to
$$P\left(y \mid x_{1}, x_{2}, \cdots, x_{n}\right)=\frac{P(y) P\left(x_{1}, x_{2}, \cdots, x_{n} \mid y\right)}{P\left(x_{1}, x_{2}, \cdots, x_{n}\right)}=\frac{P(y) \prod_{i=1}^{n} P\left(x_{i} \mid y\right)}{P\left(x_{1}, x_{2}, \cdots, x_{n}\right)}$$
For a given sample, $P\left(x_{1}, x_{2}, \cdots, x_{n}\right)$ is a constant, so
$$P\left(y \mid x_{1}, x_{2}, \cdots, x_{n}\right) \propto P(y) \prod_{i=1}^{n} P\left(x_{i} \mid y\right)$$
and therefore
$$\hat{y}=\underset{y}{\arg \max }\, P(y) \prod_{i=1}^{n} P\left(x_{i} \mid y\right)$$
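A minimal from-scratch sketch of this decision rule for categorical features (the class and method names are illustrative; scores are computed in log space to avoid underflow):

```python
import numpy as np
from collections import defaultdict

class TinyNaiveBayes:
    """argmax_y P(y) * prod_i P(x_i | y), estimated by counting."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.log_prior = {c: np.log(np.mean(y == c)) for c in self.classes}
        self.log_lik = defaultdict(dict)  # class -> feature index -> {value: log P}
        for c in self.classes:
            Xc = X[y == c]
            for i in range(X.shape[1]):
                vals, counts = np.unique(Xc[:, i], return_counts=True)
                self.log_lik[c][i] = dict(zip(vals, np.log(counts / counts.sum())))
        return self

    def predict_one(self, x):
        def score(c):  # log P(y) + sum_i log P(x_i | y)
            return self.log_prior[c] + sum(
                self.log_lik[c][i].get(v, -np.inf) for i, v in enumerate(x))
        return max(self.classes, key=score)

X = np.array([[0, 1], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1])
print(TinyNaiveBayes().fit(X, y).predict_one([0, 1]))  # 0
```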
Gaussian Naive Bayes and Multinomial Naive Bayes
Estimate $P(y)$ from the samples via MAP, build a reasonable model for $P\left(x_{i} \mid y\right)$, and then classify each sample by
$$\hat{y}=\underset{y}{\arg \max }\, P(y) \prod_{i=1}^{n} P\left(x_{i} \mid y\right)$$
(1) Assume the features follow a Gaussian distribution within each class:
$$P\left(x_{i} \mid y\right)=\frac{1}{\sqrt{2 \pi} \sigma_{y}} \exp \left(-\frac{\left(x_{i}-\mu_{y}\right)^{2}}{2 \sigma_{y}^{2}}\right)$$
The parameters $\mu_{y}$ and $\sigma_{y}$ can simply be estimated by MLE (the per-class sample mean and variance).
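In practice this model is available off the shelf; a minimal usage sketch with scikit-learn's `GaussianNB` (the toy blobs are made up):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),   # class 0 centered at (0, 0)
               rng.normal(3, 1, (50, 2))])  # class 1 centered at (3, 3)
y = np.array([0] * 50 + [1] * 50)

clf = GaussianNB().fit(X, y)   # fits per-class means and variances by MLE
print(clf.predict([[0, 0], [3, 3]]))  # [0 1]
```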
(2) Assume the features follow a multinomial distribution. For each class y the parameter vector is $\theta_{y}=\left(\theta_{y 1}, \theta_{y 2}, \cdots, \theta_{y n}\right)$, where n is the number of features and $\theta_{y i}$ is the probability $P\left(x_{i} \mid y\right)$.
The smoothed MLE estimate of $\theta_{y}$ is:
$$\hat{\theta}_{y i}=\frac{N_{y i}+\alpha}{N_{y}+\alpha \cdot n}, \quad \alpha \geq 0$$
Writing T for the training samples of class y, the counts are
$$\left\{\begin{array}{l}N_{y i}=\sum_{x \in T} x_{i} \\ N_{y}=\sum_{i=1}^{n} N_{y i}\end{array}\right.$$
and the choice $\alpha=1$ corresponds to Laplace smoothing.
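These estimates translate directly into numpy (a sketch; the count matrix below is made up):

```python
import numpy as np

def multinomial_theta(X, alpha=1.0):
    """Smoothed estimates (N_yi + alpha) / (N_y + alpha * n) for one class,
    where the rows of X are the feature counts of that class's samples."""
    N_yi = X.sum(axis=0)         # N_yi: total count of feature i in class y
    N_y = N_yi.sum()             # N_y: total count over all features
    n = X.shape[1]
    return (N_yi + alpha) / (N_y + alpha * n)

X = np.array([[2, 0, 1],
              [1, 1, 0],
              [3, 0, 0]])        # three documents of one class
print(multinomial_theta(X))      # [7/11, 2/11, 2/11], sums to 1
```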
Worked application: spam email classification
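The original notes leave this as a heading; below is a minimal sketch of the usual pipeline, feeding bag-of-words counts into a multinomial model with scikit-learn (the toy corpus is made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win cash now", "free prize claim now", "meeting at noon",
         "lunch tomorrow?", "claim your free cash", "project meeting notes"]
labels = [1, 1, 0, 0, 1, 0]  # 1 = spam, 0 = ham

# Word counts are exactly the multinomial features x_i; alpha=1 is Laplace smoothing.
clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
clf.fit(texts, labels)
print(clf.predict(["free cash prize", "notes from the meeting"]))  # [1 0]
```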
Reflections on Naive Bayes
3. Bayesian Networks
(1) Concept
A Bayesian network is formed by drawing the random variables involved in a system under study in a single graph, according to whether they are conditionally independent. Also known as a directed acyclic graph (DAG) model, it is a kind of probabilistic graphical model.
Based on the topology of the graph, we examine a set of random variables $\left\{X_{1}, X_{2}, \ldots, X_{n}\right\}$ and the properties of their n conditional probability distributions.
The nodes of the DAG represent random variables, which may be observable variables, latent variables, unknown parameters, and so on. An arrow connecting two nodes indicates that the two variables are causally related (or at least not conditionally independent): the node at the tail of the arrow is the cause, the node at the head is the effect, and each such pair carries a conditional probability value.
(2) A Simple Bayesian Network
$$p(a, b, c)=p(c \mid a, b)\, p(b \mid a)\, p(a)$$
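The factorization can be evaluated directly; a sketch with made-up conditional probability tables (CPTs) for binary variables:

```python
# Made-up CPTs: p(a), p(b | a), p(c | a, b).
p_a = {0: 0.6, 1: 0.4}
p_b_a = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}   # key (b, a)
p_c_ab = {(c, a, b): 0.5 for c in (0, 1) for a in (0, 1) for b in (0, 1)}

def joint(a, b, c):
    """p(a, b, c) = p(c | a, b) p(b | a) p(a)."""
    return p_c_ab[(c, a, b)] * p_b_a[(b, a)] * p_a[a]

# A valid joint: all 8 configurations sum to 1.
print(sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)))  # 1.0
```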
(3) A Fully Connected Bayesian Network
Every pair of nodes is connected by an edge, so the chain rule factorization carries no simplification:
$$\begin{aligned} &p\left(x_{1}, \ldots, x_{K}\right)=p\left(x_{K} \mid x_{1}, \ldots, x_{K-1}\right) \cdots p\left(x_{2} \mid x_{1}\right) p\left(x_{1}\right)\\ &P\left(X_{1}=x_{1}, \ldots, X_{n}=x_{n}\right)=\prod_{i=1}^{n} P\left(X_{i}=x_{i} \mid X_{i+1}=x_{i+1}, \ldots, X_{n}=x_{n}\right) \end{aligned}$$
(4) Naive Bayes
$$P\left(x_{1}, x_{2}, x_{3}, x_{4} \mid y\right)=P\left(x_{1} \mid y\right) P\left(x_{2} \mid y\right) P\left(x_{3} \mid y\right) P\left(x_{4} \mid y\right)$$
(5) A General Bayesian Network
Some edges are missing: $x_1$ and $x_2$ are independent, while $x_6$ and $x_7$ are conditionally independent given $x_4$. The joint distribution factorizes as
$$p\left(x_{1}\right) p\left(x_{2}\right) p\left(x_{3}\right) p\left(x_{4} \mid x_{1}, x_{2}, x_{3}\right) p\left(x_{5} \mid x_{1}, x_{3}\right) p\left(x_{6} \mid x_{4}\right) p\left(x_{7} \mid x_{4}, x_{5}\right)$$
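One payoff of the missing edges is a much smaller parameterization. A quick count for binary variables (a sketch based on the factorization above):

```python
# Parent sets read off the factorization above.
parents = {1: [], 2: [], 3: [], 4: [1, 2, 3], 5: [1, 3], 6: [4], 7: [4, 5]}

# A binary node with k parents needs a CPT with 2**k free parameters.
factored = sum(2 ** len(p) for p in parents.values())
full = 2 ** len(parents) - 1   # an unrestricted joint over 7 binary variables
print(factored, full)          # 21 vs 127: the missing edges shrink the model
```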
(6) Analysis of a Real Bayesian Network
(7) Determining Conditional Independence in a Bayesian Network
(i) tail-to-tail: independent given c
From the graphical model we get:
$$P(a, b, c)=P(c)\, P(a \mid c)\, P(b \mid c)$$
hence
$$P(a, b, c) / P(c)=P(a \mid c)\, P(b \mid c)$$
Since $P(a, b \mid c)=P(a, b, c) / P(c)$, it follows that
$$P(a, b \mid c)=P(a \mid c)\, P(b \mid c)$$
That is, given c, the path between a and b is blocked: a and b are conditionally independent.
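A brute-force numeric check of this blocking (made-up CPTs; the same pattern also verifies the head-to-tail case below):

```python
import itertools

# tail-to-tail: c -> a and c -> b.
p_c = {0: 0.3, 1: 0.7}
p_a_c = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}  # key (a, c)
p_b_c = {(0, 0): 0.2, (1, 0): 0.8, (0, 1): 0.5, (1, 1): 0.5}  # key (b, c)

def joint(a, b, c):
    return p_c[c] * p_a_c[(a, c)] * p_b_c[(b, c)]

def marg(**fixed):  # marginal probability, summing out the unfixed variables
    return sum(joint(a, b, c) for a, b, c in itertools.product((0, 1), repeat=3)
               if all(dict(a=a, b=b, c=c)[k] == v for k, v in fixed.items()))

# Conditioned on c the joint factorizes; marginally a and b are dependent.
print(marg(a=0, b=0, c=0) / marg(c=0), p_a_c[(0, 0)] * p_b_c[(0, 0)])  # equal
print(marg(a=0, b=0), marg(a=0) * marg(b=0))                           # differ
```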
(ii) head-to-tail: independent given c
$$P(a, b, c)=P(a)\, P(c \mid a)\, P(b \mid c)$$
$$\begin{aligned} P(a, b \mid c) &=P(a, b, c) / P(c) \\ &=P(a)\, P(c \mid a)\, P(b \mid c) / P(c) \\ &=P(a, c)\, P(b \mid c) / P(c) \\ &=P(a \mid c)\, P(b \mid c) \end{aligned}$$
Again, given c, the path is blocked and a and b are conditionally independent.
(iii) head-to-head: independent when c is unknown
$$\begin{aligned} &P(a, b, c)=P(a)\, P(b)\, P(c \mid a, b)\\ &\sum_{c} P(a, b, c)=\sum_{c} P(a)\, P(b)\, P(c \mid a, b)\\ &\Rightarrow P(a, b)=P(a)\, P(b) \end{aligned}$$
That is, when c is unknown (unobserved), the path between a and b is blocked and they are marginally independent; once c is observed, a and b generally become dependent (the "explaining away" effect).
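A numeric demonstration of both facts (made-up CPTs; c is a noisy OR of a and b):

```python
import itertools

p_a = {0: 0.5, 1: 0.5}
p_b = {0: 0.5, 1: 0.5}
p_c_ab = {(1, a, b): 0.9 if (a or b) else 0.1 for a in (0, 1) for b in (0, 1)}
p_c_ab.update({(0, a, b): 1 - p_c_ab[(1, a, b)] for a in (0, 1) for b in (0, 1)})

def joint(a, b, c):
    """head-to-head: P(a, b, c) = P(a) P(b) P(c | a, b)."""
    return p_a[a] * p_b[b] * p_c_ab[(c, a, b)]

def marg(**fixed):
    return sum(joint(a, b, c) for a, b, c in itertools.product((0, 1), repeat=3)
               if all(dict(a=a, b=b, c=c)[k] == v for k, v in fixed.items()))

print(marg(a=1, b=1), marg(a=1) * marg(b=1))  # equal: marginally independent
pab = marg(a=1, b=1, c=1) / marg(c=1)
print(pab, (marg(a=1, c=1) / marg(c=1)) * (marg(b=1, c=1) / marg(c=1)))  # differ
```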