1. Mathematical Background
(1) Relative Entropy
Let $p(x)$ and $q(x)$ be two probability distributions over the values of $X$. The relative entropy (KL divergence) of $p$ with respect to $q$ is
$$D(p \| q)=\sum_{x} p(x) \log \frac{p(x)}{q(x)}=E_{p(x)} \log \frac{p(x)}{q(x)}$$
Relative entropy measures how far apart two probability distributions are. (Note that it is not symmetric in $p$ and $q$, so it is not a true distance metric.)
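As a quick sanity check, here is a minimal numpy sketch (not part of the original notes; the two distributions are made up):

```python
import numpy as np

def kl_divergence(p, q):
    """Relative entropy D(p || q) for discrete distributions.
    Assumes q(x) > 0 wherever p(x) > 0; terms with p(x) = 0 contribute 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))  # ≈ 0.0253
print(kl_divergence(q, p))  # ≈ 0.0258, not the same: D is asymmetric
```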
(2) Mutual Information
The mutual information of two random variables X and Y is defined as the relative entropy between the joint distribution of X, Y and the product of their marginal distributions:
$$I(X, Y)=D(P(X, Y) \| P(X) P(Y))=\sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}$$
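For intuition, a small sketch that computes I(X, Y) from a joint probability table (the table below is made up; a strongly correlated pair gives a clearly positive value):

```python
import numpy as np

def mutual_information(joint):
    """I(X, Y) from a joint probability table joint[x, y]."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # marginal P(X), column vector
    py = joint.sum(axis=0, keepdims=True)  # marginal P(Y), row vector
    mask = joint > 0
    return np.sum(joint[mask] * np.log(joint[mask] / (px * py)[mask]))

joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])   # X and Y agree 80% of the time
print(mutual_information(joint))  # ≈ 0.193 > 0
```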
(3) Information Gain
Information gain measures how much the uncertainty about the class X decreases once the value of feature A is known. The information gain $g(D, A)$ of feature A on training set D is defined as the difference between the empirical entropy $H(D)$ of D and the empirical conditional entropy $H(D \mid A)$ of D given A:
$$g(D, A)=H(D)-H(D \mid A)$$
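A direct implementation of this definition (a sketch; the toy labels and feature values are made up):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Empirical entropy H(D) of a label sequence, in bits."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature):
    """g(D, A) = H(D) - H(D | A), where A is given as a feature column."""
    labels, feature = np.asarray(labels), np.asarray(feature)
    h_d_a = sum(np.mean(feature == v) * entropy(labels[feature == v])
                for v in np.unique(feature))
    return entropy(labels) - h_d_a

labels  = ['yes', 'yes', 'no', 'no', 'yes', 'no']
feature = ['a',   'a',   'b',  'b',  'a',   'b' ]
print(information_gain(labels, feature))  # 1.0 bit: A fully determines the class
```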
(4) Probability
Conditional probability:
$$P(A \mid B)=\frac{P(A B)}{P(B)}$$
Law of total probability:
$$P(A)=\sum_{i} P\left(A \mid B_{i}\right) P\left(B_{i}\right)$$
Bayes' formula:
$$P\left(B_{i} \mid A\right)=\frac{P\left(A \mid B_{i}\right) P\left(B_{i}\right)}{\sum_{j} P\left(A \mid B_{j}\right) P\left(B_{j}\right)}$$
Applying Bayes' formula:
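A classic worked example (the numbers are made up, not from the original notes): a rare disease and an imperfect test.

```python
# B1 = has the disease, B2 = does not; A = test is positive.
p_d = 0.01        # prior P(B1)
p_pos_d = 0.99    # P(A | B1), test sensitivity
p_pos_nd = 0.05   # P(A | B2), false positive rate

# Denominator: law of total probability.
p_pos = p_pos_d * p_d + p_pos_nd * (1 - p_d)
print(p_pos_d * p_d / p_pos)  # P(B1 | A) ≈ 0.167, far below the 0.99 sensitivity
```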
(5) Thoughts Prompted by Bayes' Formula
Given some samples D, we want to compute the probability of each hypothesis $A_{1}, A_{2}, \ldots, A_{n}$ in light of those samples, i.e. $P\left(A_{i} \mid D\right)$. Since $P(D)$ does not depend on the hypothesis, and assuming a uniform prior $P\left(A_{i}\right)$ (this is what justifies the first arrow below),
$$\begin{array}{c} \max P\left(A_{i} \mid D\right)=\max \frac{P\left(D \mid A_{i}\right) P\left(A_{i}\right)}{P(D)}=\max \left(P\left(D \mid A_{i}\right) P\left(A_{i}\right)\right) \rightarrow \max P\left(D \mid A_{i}\right) \\ \Rightarrow \max P\left(A_{i} \mid D\right) \rightarrow \max P\left(D \mid A_{i}\right) \end{array}$$
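A tiny numeric check of this reduction (a sketch; the candidate coin biases are made up): with a uniform prior, maximizing the posterior picks the same hypothesis as maximizing the likelihood.

```python
import numpy as np

biases = np.array([0.3, 0.5, 0.8])   # hypotheses A_i: candidate coin biases
heads, tails = 7, 3                  # observed data D

likelihood = biases**heads * (1 - biases)**tails        # P(D | A_i)
prior = np.full(len(biases), 1 / len(biases))           # uniform P(A_i)
posterior = likelihood * prior / np.sum(likelihood * prior)

print(np.argmax(posterior) == np.argmax(likelihood))  # True
```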
2. Naive Bayes
Naive Bayes is a supervised learning algorithm that applies Bayes' theorem under the "naive" assumption that the features are mutually independent given the class.
For a given feature vector $x_{1}, x_{2}, \cdots, x_{n}$, the probability of class y is obtained from Bayes' formula:
$$P\left(y \mid x_{1}, x_{2}, \cdots, x_{n}\right)=\frac{P(y) P\left(x_{1}, x_{2}, \cdots, x_{n} \mid y\right)}{P\left(x_{1}, x_{2}, \cdots, x_{n}\right)}$$
Applying the naive independence assumption,
$$P\left(x_{i} \mid y, x_{1}, \cdots, x_{i-1}, x_{i+1}, \cdots, x_{n}\right)=P\left(x_{i} \mid y\right)$$
the class probability simplifies to
$$P\left(y \mid x_{1}, x_{2}, \cdots, x_{n}\right)=\frac{P(y) P\left(x_{1}, x_{2}, \cdots, x_{n} \mid y\right)}{P\left(x_{1}, x_{2}, \cdots, x_{n}\right)}=\frac{P(y) \prod_{i=1}^{n} P\left(x_{i} \mid y\right)}{P\left(x_{1}, x_{2}, \cdots, x_{n}\right)}$$
For a given sample, $P\left(x_{1}, x_{2}, \cdots, x_{n}\right)$ is a constant, so
$$P\left(y \mid x_{1}, x_{2}, \cdots, x_{n}\right) \propto P(y) \prod_{i=1}^{n} P\left(x_{i} \mid y\right)$$
and therefore
$$\hat{y}=\underset{y}{\arg \max }\, P(y) \prod_{i=1}^{n} P\left(x_{i} \mid y\right)$$
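A minimal from-scratch sketch of this decision rule for categorical features (the class and method names are illustrative; scores are computed in log space to avoid underflow):

```python
import numpy as np
from collections import defaultdict

class TinyNaiveBayes:
    """argmax_y P(y) * prod_i P(x_i | y), estimated by counting."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.log_prior = {c: np.log(np.mean(y == c)) for c in self.classes}
        self.log_lik = defaultdict(dict)  # class -> feature index -> {value: log P}
        for c in self.classes:
            Xc = X[y == c]
            for i in range(X.shape[1]):
                vals, counts = np.unique(Xc[:, i], return_counts=True)
                self.log_lik[c][i] = dict(zip(vals, np.log(counts / counts.sum())))
        return self

    def predict_one(self, x):
        def score(c):  # log P(y) + sum_i log P(x_i | y)
            return self.log_prior[c] + sum(
                self.log_lik[c][i].get(v, -np.inf) for i, v in enumerate(x))
        return max(self.classes, key=score)

X = np.array([[0, 1], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1])
print(TinyNaiveBayes().fit(X, y).predict_one([0, 1]))  # 0
```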
Gaussian Naive Bayes and Multinomial Naive Bayes
Estimate $P(y)$ from the samples via MAP, build a reasonable model for $P\left(x_{i} \mid y\right)$, and then classify each sample by
$$\hat{y}=\underset{y}{\arg \max }\, P(y) \prod_{i=1}^{n} P\left(x_{i} \mid y\right)$$
(1) Assume the features follow a Gaussian distribution within each class:
$$P\left(x_{i} \mid y\right)=\frac{1}{\sqrt{2 \pi} \sigma_{y}} \exp \left(-\frac{\left(x_{i}-\mu_{y}\right)^{2}}{2 \sigma_{y}^{2}}\right)$$
The parameters $\mu_{y}$ and $\sigma_{y}$ can simply be estimated by MLE (the per-class sample mean and variance).
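In practice this model is available off the shelf; a minimal usage sketch with scikit-learn's `GaussianNB` (the toy blobs are made up):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),   # class 0 centered at (0, 0)
               rng.normal(3, 1, (50, 2))])  # class 1 centered at (3, 3)
y = np.array([0] * 50 + [1] * 50)

clf = GaussianNB().fit(X, y)   # fits per-class means and variances by MLE
print(clf.predict([[0, 0], [3, 3]]))  # [0 1]
```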
(2) Assume the features follow a multinomial distribution. For each class y the parameter vector is $\theta_{y}=\left(\theta_{y 1}, \theta_{y 2}, \cdots, \theta_{y n}\right)$, where n is the number of features and $\theta_{y i}$ is the probability $P\left(x_{i} \mid y\right)$.
The smoothed MLE estimate of $\theta_{y}$ is:
$$\hat{\theta}_{y i}=\frac{N_{y i}+\alpha}{N_{y}+\alpha \cdot n}, \quad \alpha \geq 0$$
Writing T for the training samples of class y, the counts are
$$\left\{\begin{array}{l}N_{y i}=\sum_{x \in T} x_{i} \\ N_{y}=\sum_{i=1}^{n} N_{y i}\end{array}\right.$$
and the choice $\alpha=1$ corresponds to Laplace smoothing.
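These estimates translate directly into numpy (a sketch; the count matrix below is made up):

```python
import numpy as np

def multinomial_theta(X, alpha=1.0):
    """Smoothed estimates (N_yi + alpha) / (N_y + alpha * n) for one class,
    where the rows of X are the feature counts of that class's samples."""
    N_yi = X.sum(axis=0)         # N_yi: total count of feature i in class y
    N_y = N_yi.sum()             # N_y: total count over all features
    n = X.shape[1]
    return (N_yi + alpha) / (N_y + alpha * n)

X = np.array([[2, 0, 1],
              [1, 1, 0],
              [3, 0, 0]])        # three documents of one class
print(multinomial_theta(X))      # [7/11, 2/11, 2/11], sums to 1
```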
Worked application: spam email classification
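The original notes leave this as a heading; below is a minimal sketch of the usual pipeline, feeding bag-of-words counts into a multinomial model with scikit-learn (the toy corpus is made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win cash now", "free prize claim now", "meeting at noon",
         "lunch tomorrow?", "claim your free cash", "project meeting notes"]
labels = [1, 1, 0, 0, 1, 0]  # 1 = spam, 0 = ham

# Word counts are exactly the multinomial features x_i; alpha=1 is Laplace smoothing.
clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
clf.fit(texts, labels)
print(clf.predict(["free cash prize", "notes from the meeting"]))  # [1 0]
```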
Reflections on Naive Bayes
3. Bayesian Networks
(1) Concept
A Bayesian network is formed by drawing the random variables involved in a system under study in a single graph, according to whether they are conditionally independent. Also known as a directed acyclic graph (DAG) model, it is a kind of probabilistic graphical model.
Based on the topology of the graph, we examine a set of random variables $\left\{X_{1}, X_{2}, \ldots, X_{n}\right\}$ and the properties of their n conditional probability distributions.
The nodes of the DAG represent random variables, which may be observable variables, latent variables, unknown parameters, and so on. An arrow connecting two nodes indicates that the two variables are causally related (or at least not conditionally independent): the node at the tail of the arrow is the cause, the node at the head is the effect, and each such pair carries a conditional probability value.
(2) A Simple Bayesian Network
$$p(a, b, c)=p(c \mid a, b)\, p(b \mid a)\, p(a)$$
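The factorization can be evaluated directly; a sketch with made-up conditional probability tables (CPTs) for binary variables:

```python
# Made-up CPTs: p(a), p(b | a), p(c | a, b).
p_a = {0: 0.6, 1: 0.4}
p_b_a = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}   # key (b, a)
p_c_ab = {(c, a, b): 0.5 for c in (0, 1) for a in (0, 1) for b in (0, 1)}

def joint(a, b, c):
    """p(a, b, c) = p(c | a, b) p(b | a) p(a)."""
    return p_c_ab[(c, a, b)] * p_b_a[(b, a)] * p_a[a]

# A valid joint: all 8 configurations sum to 1.
print(sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)))  # 1.0
```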
(3) A Fully Connected Bayesian Network
Every pair of nodes is connected by an edge, so the chain rule factorization carries no simplification:
$$\begin{aligned} &p\left(x_{1}, \ldots, x_{K}\right)=p\left(x_{K} \mid x_{1}, \ldots, x_{K-1}\right) \cdots p\left(x_{2} \mid x_{1}\right) p\left(x_{1}\right)\\ &P\left(X_{1}=x_{1}, \ldots, X_{n}=x_{n}\right)=\prod_{i=1}^{n} P\left(X_{i}=x_{i} \mid X_{i+1}=x_{i+1}, \ldots, X_{n}=x_{n}\right) \end{aligned}$$
(4) Naive Bayes
$$P\left(x_{1}, x_{2}, x_{3}, x_{4} \mid y\right)=P\left(x_{1} \mid y\right) P\left(x_{2} \mid y\right) P\left(x_{3} \mid y\right) P\left(x_{4} \mid y\right)$$
(5) A General Bayesian Network
Some edges are missing: $x_1$ and $x_2$ are independent, while $x_6$ and $x_7$ are conditionally independent given $x_4$. The joint distribution factorizes as
$$p\left(x_{1}\right) p\left(x_{2}\right) p\left(x_{3}\right) p\left(x_{4} \mid x_{1}, x_{2}, x_{3}\right) p\left(x_{5} \mid x_{1}, x_{3}\right) p\left(x_{6} \mid x_{4}\right) p\left(x_{7} \mid x_{4}, x_{5}\right)$$
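One payoff of the missing edges is a much smaller parameterization. A quick count for binary variables (a sketch based on the factorization above):

```python
# Parent sets read off the factorization above.
parents = {1: [], 2: [], 3: [], 4: [1, 2, 3], 5: [1, 3], 6: [4], 7: [4, 5]}

# A binary node with k parents needs a CPT with 2**k free parameters.
factored = sum(2 ** len(p) for p in parents.values())
full = 2 ** len(parents) - 1   # an unrestricted joint over 7 binary variables
print(factored, full)          # 21 vs 127: the missing edges shrink the model
```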
(6) Analysis of a Real Bayesian Network
(7) Determining Conditional Independence in a Bayesian Network
(i) tail-to-tail: independent given c
From the graphical model we get:
$$P(a, b, c)=P(c)\, P(a \mid c)\, P(b \mid c)$$
hence
$$P(a, b, c) / P(c)=P(a \mid c)\, P(b \mid c)$$
Since $P(a, b \mid c)=P(a, b, c) / P(c)$, it follows that
$$P(a, b \mid c)=P(a \mid c)\, P(b \mid c)$$
That is, given c, the path between a and b is blocked: a and b are conditionally independent.
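A brute-force numeric check of this blocking (made-up CPTs; the same pattern also verifies the head-to-tail case below):

```python
import itertools

# tail-to-tail: c -> a and c -> b.
p_c = {0: 0.3, 1: 0.7}
p_a_c = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}  # key (a, c)
p_b_c = {(0, 0): 0.2, (1, 0): 0.8, (0, 1): 0.5, (1, 1): 0.5}  # key (b, c)

def joint(a, b, c):
    return p_c[c] * p_a_c[(a, c)] * p_b_c[(b, c)]

def marg(**fixed):  # marginal probability, summing out the unfixed variables
    return sum(joint(a, b, c) for a, b, c in itertools.product((0, 1), repeat=3)
               if all(dict(a=a, b=b, c=c)[k] == v for k, v in fixed.items()))

# Conditioned on c the joint factorizes; marginally a and b are dependent.
print(marg(a=0, b=0, c=0) / marg(c=0), p_a_c[(0, 0)] * p_b_c[(0, 0)])  # equal
print(marg(a=0, b=0), marg(a=0) * marg(b=0))                           # differ
```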
(ii) head-to-tail: independent given c
$$P(a, b, c)=P(a)\, P(c \mid a)\, P(b \mid c)$$
$$\begin{aligned} P(a, b \mid c) &=P(a, b, c) / P(c) \\ &=P(a)\, P(c \mid a)\, P(b \mid c) / P(c) \\ &=P(a, c)\, P(b \mid c) / P(c) \\ &=P(a \mid c)\, P(b \mid c) \end{aligned}$$
Again, given c, the path is blocked and a and b are conditionally independent.
(iii) head-to-head: independent when c is unknown
$$\begin{aligned} &P(a, b, c)=P(a)\, P(b)\, P(c \mid a, b)\\ &\sum_{c} P(a, b, c)=\sum_{c} P(a)\, P(b)\, P(c \mid a, b)\\ &\Rightarrow P(a, b)=P(a)\, P(b) \end{aligned}$$
That is, when c is unknown (unobserved), the path between a and b is blocked and they are marginally independent; once c is observed, a and b generally become dependent (the "explaining away" effect).
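A numeric demonstration of both facts (made-up CPTs; c is a noisy OR of a and b):

```python
import itertools

p_a = {0: 0.5, 1: 0.5}
p_b = {0: 0.5, 1: 0.5}
p_c_ab = {(1, a, b): 0.9 if (a or b) else 0.1 for a in (0, 1) for b in (0, 1)}
p_c_ab.update({(0, a, b): 1 - p_c_ab[(1, a, b)] for a in (0, 1) for b in (0, 1)})

def joint(a, b, c):
    """head-to-head: P(a, b, c) = P(a) P(b) P(c | a, b)."""
    return p_a[a] * p_b[b] * p_c_ab[(c, a, b)]

def marg(**fixed):
    return sum(joint(a, b, c) for a, b, c in itertools.product((0, 1), repeat=3)
               if all(dict(a=a, b=b, c=c)[k] == v for k, v in fixed.items()))

print(marg(a=1, b=1), marg(a=1) * marg(b=1))  # equal: marginally independent
pab = marg(a=1, b=1, c=1) / marg(c=1)
print(pab, (marg(a=1, c=1) / marg(c=1)) * (marg(b=1, c=1) / marg(c=1)))  # differ
```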