I've recently been interested in the EM algorithm for unsupervised learning of a Naive Bayes classifier, but even after going through quite a few references I hadn't fully understood it.
Here is a pointer to the datawhale write-up:
EM Algorithm
We know that the Q-step of the EM algorithm works with the complete data (the observed data sequence together with the latent variable sequence). Since $P(D, Z \mid \theta)$ involves the latent variable $Z$, we iteratively approximate it via the Q function, which takes the expectation of the complete-data log-likelihood with respect to the random variable $Z$.
Now, on to the main topic:
Suppose we have an unlabeled dataset $D = \{ \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N \}$, where each data point is a vector of $M$ features, $\mathbf{x}_i = (x_{i1}, x_{i2}, \dots, x_{iM})$.
Each data point has a hidden class label $y_i$, taking values $y_i \in \{1, 2, \dots, K\}$.
Under the Naive Bayes assumption, the features are conditionally independent given the class:
$$P(\mathbf{x}_i \mid y_i = k) = \prod_{m=1}^M P(x_{im} \mid y_i = k)$$
Prior probabilities: $\pi_k = P(y = k)$, satisfying $\sum\limits_{k=1}^K \pi_k = 1$.
Conditional probabilities:
$$\theta_{k,m}(x) = P(x^{(m)} = x \mid y = k),$$
the probability that feature $m$ takes the value $x$ given class $k$.
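As a concrete reference for this notation, here is a minimal sketch in Python/NumPy of one way these parameters might be stored and randomly initialized. The sizes `N, M, K, V` and the simplifying assumption that every feature shares one common value set $\{0, \dots, V-1\}$ are illustrative, not from the original setup:

```python
import numpy as np

rng = np.random.default_rng(0)

N, M, K = 500, 4, 3          # data points, features, classes (illustrative sizes)
V = 5                        # assume every feature takes values in {0, ..., V-1}

# pi[k]          = P(y = k)
# theta[k, m, x] = P(x^(m) = x | y = k)
pi = rng.dirichlet(np.ones(K))                    # random point on the simplex
theta = rng.dirichlet(np.ones(V), size=(K, M))    # each (k, m) row sums to 1
```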
EM Algorithm Steps
1. Initialize the parameters
- Randomly initialize the prior probabilities $\pi_k^{(0)}$.
- Initialize the conditional probabilities $\theta_{k,m}^{(0)}(x)$ randomly or heuristically.
2. Iterate until convergence
E-step (expectation): compute the posterior probability (responsibility) of each data point belonging to each class.
For each data point $\mathbf{x}_i$ and each class $k$, compute:
$$\gamma_{ik}^{(t)} = P(y_i = k \mid \mathbf{x}_i, \theta^{(t)}, \pi^{(t)}) = \frac{ \pi_k^{(t)} \prod_{m=1}^M \theta_{k,m}^{(t)}(x_{im}) }{ \sum_{j=1}^K \pi_j^{(t)} \prod_{m=1}^M \theta_{j,m}^{(t)}(x_{im}) }$$
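A minimal sketch of this E-step under the same illustrative arrays as above (computing in log space is a standard stability trick of mine, not from the original; it avoids underflow in the product over features):

```python
def e_step(X, pi, theta):
    """gamma[i, k] = P(y_i = k | x_i, pi, theta), for X of shape (N, M)."""
    N, M = X.shape
    lt = np.log(theta).transpose(1, 2, 0)              # (M, V, K)
    ll = lt[np.arange(M), X]                           # ll[i, m, k] = log theta[k, m, x_im]
    log_joint = np.log(pi)[None, :] + ll.sum(axis=1)   # log pi_k + sum_m log theta (N, K)
    log_joint -= log_joint.max(axis=1, keepdims=True)  # stabilize before exponentiating
    gamma = np.exp(log_joint)
    return gamma / gamma.sum(axis=1, keepdims=True)    # normalize over k
```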
M-step (maximization): update the parameters to maximize the expected log-likelihood.
- Update the prior probabilities $\pi_k$:
$$\pi_k^{(t+1)} = \frac{1}{N} \sum_{i=1}^N \gamma_{ik}^{(t)}$$
- Update the conditional probabilities $\theta_{k,m}(x)$:
$$\theta_{k,m}^{(t+1)}(x) = \frac{ \sum_{i=1}^N \gamma_{ik}^{(t)} \mathbb{I}(x_{im} = x) }{ \sum_{i=1}^N \gamma_{ik}^{(t)} }$$
where $\mathbb{I}(\cdot)$ is the indicator function, equal to 1 when $x_{im} = x$ and 0 otherwise.
Naturally, iteration stops when the change in the parameter updates falls below a preset threshold, or when the maximum number of iterations is reached.
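Continuing the sketch, the M-step updates and the outer loop with exactly this stopping rule might look as follows. The tiny smoothing term `alpha` is my addition, not part of the derivation; it only prevents $\log 0$ in the next E-step when a cell receives no responsibility mass:

```python
def m_step(X, gamma, V, alpha=1e-6):
    """Update (pi, theta) from responsibilities gamma of shape (N, K)."""
    N, M = X.shape
    K = gamma.shape[1]
    pi = gamma.sum(axis=0) / N                        # pi_k = (1/N) sum_i gamma_ik
    theta = np.zeros((K, M, V))
    for m in range(M):
        for x in range(V):
            # numerator: sum_i gamma_ik * I(x_im = x), plus smoothing
            theta[:, m, x] = gamma[X[:, m] == x].sum(axis=0) + alpha
    theta /= (gamma.sum(axis=0) + alpha * V)[:, None, None]  # divide by sum_i gamma_ik
    return pi, theta

def em(X, pi, theta, max_iter=100, tol=1e-6):
    """Alternate E- and M-steps until the parameter change drops below tol."""
    V = theta.shape[2]
    for _ in range(max_iter):
        gamma = e_step(X, pi, theta)
        new_pi, new_theta = m_step(X, gamma, V)
        delta = max(np.abs(new_pi - pi).max(), np.abs(new_theta - theta).max())
        pi, theta = new_pi, new_theta
        if delta < tol:
            break
    return pi, theta, gamma
```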
The above only covers the core quantity $P(Z \mid D, \theta)$ and is not complete; the full Q-step and M-step are as follows:
Q-step
The complete-data log-likelihood is:
$$\log P(D, Y \mid \theta, \pi) = \sum_{i=1}^N \log P(\mathbf{x}_i, y_i \mid \theta, \pi)$$
Since the class labels $y_i$ are hidden, we take the expectation:
$$Q(\theta, \pi ; \theta^{(t)}, \pi^{(t)}) = \mathbb{E}_{Y \mid D, \theta^{(t)}, \pi^{(t)}} [\log P(D, Y \mid \theta, \pi)]$$
Naively, expanding the expectation looks like:
$$Q(\theta, \pi ; \theta^{(t)}, \pi^{(t)}) = \sum_{i=1}^N \log P(\mathbf{x}_i, y_i \mid \theta, \pi) \, P(Y \mid D, \theta^{(t)}, \pi^{(t)})$$
Note that this way of writing it is wrong; it is shown only to make clear which factor the expectation multiplies in.
Expanding $P(Y \mid D)$ properly:
$$Q(\theta, \pi ; \theta^{(t)}, \pi^{(t)}) = \sum_{k = 1}^K \sum_{i=1}^N \log P(\mathbf{x}_i, y_i = k \mid \theta, \pi) \, P(y_i = k \mid \mathbf{x}_i, \theta^{(t)}, \pi^{(t)})$$
This version is correct.
That is (the iteration parameters inside $Q$ are dropped from the notation here; nothing has actually changed):
$$Q(\theta, \pi) = \sum_{i=1}^N \sum_{k=1}^K \hat{\gamma}_{ik} \log P(\mathbf{x}_i, y_i = k \mid \theta, \pi)$$
where $\hat{\gamma}_{ik} = P(y_i = k \mid \mathbf{x}_i, \theta^{(t)}, \pi^{(t)})$ is the "responsibility" computed in the E-step. This is directly analogous to the Gaussian mixture model, except that one model is discrete and the other continuous. In other words, both in the Gaussian mixture model and here, $\hat{\gamma}_{ik}$ is already the result of the expectation taken over $\gamma_{ik}$, i.e. the $\gamma_{ik}^{(t)}$ from above.
Note: the steps below still write $\gamma_{ik}$.
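Since $Q$ is just a $\gamma$-weighted sum of joint log-probabilities, it is cheap to evaluate. A sketch, reusing the illustrative arrays from the snippets above:

```python
def q_value(X, gamma, pi, theta):
    """Q(theta, pi) = sum_i sum_k gamma_ik * log P(x_i, y_i = k | theta, pi)."""
    M = X.shape[1]
    lt = np.log(theta).transpose(1, 2, 0)                              # (M, V, K)
    log_joint = np.log(pi)[None, :] + lt[np.arange(M), X].sum(axis=1)  # (N, K)
    return float((gamma * log_joint).sum())
```

Note that EM's formal monotonicity guarantee is stated for the observed-data log-likelihood $\sum_i \log \sum_k P(\mathbf{x}_i, y_i = k \mid \theta, \pi)$, which can be computed from the same `log_joint` with a log-sum-exp over $k$; watching either quantity rise across iterations is a handy sanity check on an implementation.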
Expanding the joint probability inside the Q expression:
$$P(\mathbf{x}_i, y_i = k \mid \theta, \pi) = P(\mathbf{x}_i \mid y_i = k, \theta, \pi) \, P(y_i = k \mid \theta, \pi) = \prod_{m=1}^M \theta_{k,m}(x_{im}) \cdot \pi_k$$
Note that, following the Naive Bayes notation of the book Statistical Learning Methods (统计学习方法), as early as $P(\mathbf{x}_i, y_i \mid \theta, \pi)$ this should really have been written in random-variable form, $P(X_i^{(m)} = x_{im}, Y = y_i \mid \theta, \pi)$ with $y_i \in \{1, 2, \dots, K\}$. In any case, once we take the $\log$, the product of these two factors becomes a sum of two terms, and we then solve for the two iterated parameters, $\pi_k$ and $\theta_{k,m}(x_{im})$, one from each term.
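Writing the log of the joint out explicitly (restoring the product over all $M$ features that the shorthand above suppresses):
$$\log P(\mathbf{x}_i, y_i = k \mid \theta, \pi) = \log \pi_k + \sum_{m=1}^M \log \theta_{k,m}(x_{im})$$
so the Q function separates into a $\pi$-term and a $\theta$-term that can be maximized independently.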
M-step
The procedure is as follows:
From the expected complete-data log-likelihood, the part that involves $\pi$ is:
$$\sum_{i=1}^N \sum_{k=1}^K \gamma_{ik} \log \pi_k$$
To maximize the $\pi_k$ part of $Q(\theta, \pi)$, we need to solve the following optimization problem:
$$\max_{\pi} \sum_{i=1}^N \sum_{k=1}^K \gamma_{ik} \log \pi_k$$
subject to the constraints:
$$\sum_{k=1}^K \pi_k = 1 \quad \text{and} \quad \pi_k \geq 0, \ \forall k \in \{1, 2, \dots, K\}$$
Using the method of Lagrange multipliers, introduce $\lambda$:
$$\mathcal{L} = \sum_{i=1}^N \sum_{k=1}^K \gamma_{ik} \log \pi_k + \lambda \left( 1 - \sum_{k=1}^K \pi_k \right)$$
Take the partial derivative with respect to each $\pi_k$ and set it to zero:
$$\frac{\partial \mathcal{L}}{\partial \pi_k} = \frac{\sum_{i=1}^N \gamma_{ik}}{\pi_k} - \lambda = 0$$
which gives:
$$\pi_k = \frac{1}{\lambda} \sum_{i=1}^N \gamma_{ik}$$
Using the constraint $\sum\limits_{k=1}^K \pi_k = 1$:
$$\sum_{k=1}^K \frac{1}{\lambda} \sum_{i=1}^N \gamma_{ik} = 1 \;\Rightarrow\; \frac{1}{\lambda} \sum_{i=1}^N \sum_{k=1}^K \gamma_{ik} = 1$$
Since $\sum\limits_{k=1}^K \gamma_{ik} = 1$ for every $i$:
$$\frac{1}{\lambda} N = 1 \;\Rightarrow\; \lambda = N$$
Therefore:
$$\pi_k^{(t+1)} = \frac{1}{N} \sum_{i=1}^N \gamma_{ik}$$
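This closed form can also be sanity-checked numerically: hand the constrained problem to a generic solver and compare against $\frac{1}{N}\sum_i \gamma_{ik}$. A sketch (using scipy here is my choice; any constrained optimizer would do):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
N, K = 200, 3
gamma = rng.random((N, K))
gamma /= gamma.sum(axis=1, keepdims=True)        # rows sum to 1, like real responsibilities

c = gamma.sum(axis=0)                            # c_k = sum_i gamma_ik
res = minimize(
    lambda p: -(c * np.log(p)).sum(),            # maximize sum_k c_k log pi_k
    x0=np.full(K, 1.0 / K),
    bounds=[(1e-9, 1.0)] * K,
    constraints={"type": "eq", "fun": lambda p: p.sum() - 1.0},
    method="SLSQP",
)
print(np.allclose(res.x, c / N, atol=1e-5))      # True: matches the closed form
```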
For the conditional probabilities $\theta_{k,m}(x) = P(x^{(m)} = x \mid y = k)$, the goal is to maximize, for each fixed feature $m$:
$$\sum_{i=1}^N \sum_{k=1}^K \gamma_{ik} \log \theta_{k,m}(x_{im})$$
It is easy to see the constraints (ranges of $k, m$ omitted):
$$\sum_{x} \theta_{k,m}(x) = 1 \quad \forall k, m$$
Again using Lagrange multipliers, introduce $\lambda_k$:
$$\mathcal{L} = \sum_{i=1}^N \sum_{k=1}^K \gamma_{ik} \log \theta_{k,m}(x_{im}) + \sum_{k=1}^K \lambda_k \left( 1 - \sum_{x} \theta_{k,m}(x) \right)$$
Take the partial derivative with respect to each $\theta_{k,m}(x)$ and set it to zero:
$$\frac{\partial \mathcal{L}}{\partial \theta_{k,m}(x)} = \frac{\sum_{i=1}^N \gamma_{ik} \mathbb{I}(x_{im} = x)}{\theta_{k,m}(x)} - \lambda_k = 0$$
Solving:
$$\theta_{k,m}(x) = \frac{1}{\lambda_k} \sum_{i=1}^N \gamma_{ik} \mathbb{I}(x_{im} = x)$$
Using the constraint:
$$\sum_{x} \theta_{k,m}(x) = 1 \;\Rightarrow\; \frac{1}{\lambda_k} \sum_{x} \sum_{i=1}^N \gamma_{ik} \mathbb{I}(x_{im} = x) = 1$$
Noting that for each $i$, $\sum\limits_{x} \mathbb{I}(x_{im} = x) = 1$:
$$\frac{1}{\lambda_k} \sum_{i=1}^N \gamma_{ik} = 1 \;\Rightarrow\; \lambda_k = \sum_{i=1}^N \gamma_{ik}$$
Therefore, the update formula for the conditional probabilities is:
$$\theta_{k,m}^{(t+1)}(x) = \frac{ \sum_{i=1}^N \gamma_{ik}^{(t)} \mathbb{I}(x_{im} = x) }{ \sum_{i=1}^N \gamma_{ik}^{(t)} }$$
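As a final check that the derived updates really land on the probability simplex, here is a tiny numeric verification of both formulas, reusing `m_step` from the sketch earlier in the post:

```python
rng = np.random.default_rng(2)
N, M, K, V = 200, 4, 3, 5
X = rng.integers(0, V, size=(N, M))              # synthetic categorical data
gamma = rng.random((N, K))
gamma /= gamma.sum(axis=1, keepdims=True)

pi, theta = m_step(X, gamma, V)
print(np.isclose(pi.sum(), 1.0))                 # sum_k pi_k = 1
print(np.allclose(theta.sum(axis=2), 1.0))       # sum_x theta_{k,m}(x) = 1 for all k, m
```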