Statistical Learning Methods, EM Algorithm Exercise 9.4: Unsupervised Learning for Naive Bayes

Recently I became interested in using the EM algorithm for unsupervised learning of a Naive Bayes classifier, but despite going through quite a few references I never fully understood it.
Here is a link to datawhale's writeup:
EM algorithm
We know that the Q-step of the EM algorithm works on the complete data (the observed data sequence together with the latent variable sequence). Since $P(D, Z \mid \theta)$ involves the latent variable $Z$, the $Q$ function takes the expectation with respect to $Z$ and approximates the solution iteratively.

接下来进入正题:

Suppose we have an unlabeled dataset $D = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N\}$, where each data point $\mathbf{x}_i = (x_{i1}, x_{i2}, \dots, x_{iM})$ consists of $M$ discrete features.

Each data point corresponds to a hidden class label $y_i$, taking values in $y_i \in \{1, 2, \dots, K\}$.

By the Naive Bayes assumption, the features are mutually independent given the class, i.e.
$$P(\mathbf{x}_i \mid y_i = k) = \prod_{m=1}^M P(x_{im} \mid y_i = k)$$

Prior probabilities: $\pi_k = P(y = k)$, satisfying $\sum\limits_{k=1}^K \pi_k = 1$.
Conditional probabilities: $\theta_{k,m}(x) = P(X^{(m)} = x \mid y = k)$, the probability that the $m$-th feature takes the value $x$ given class $k$.

EM Algorithm Steps

1. Initialize the parameters

  • Randomly initialize the prior probabilities $\pi_k^{(0)}$
  • Randomly, or with some heuristic, initialize the conditional probabilities $\theta_{k,m}^{(0)}(x)$ (see the sketch below)
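As a minimal illustration, suppose the features are encoded as integers $0, \dots, V-1$ with the same number of values $V$ per feature (an encoding choice of mine, not from the text). Drawing each distribution from a Dirichlet keeps the normalization constraints satisfied; the function name `init_params` is hypothetical:

```python
import numpy as np

def init_params(K, M, V, seed=0):
    """Random valid starting point: pi has shape (K,), theta has shape (K, M, V),
    where theta[k, m, v] stands for P(x^(m) = v | y = k)."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(K))                   # priors sum to 1
    theta = rng.dirichlet(np.ones(V), size=(K, M))   # each theta[k, m] sums to 1
    return pi, theta
```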

2. Iterate until convergence

E-step (expectation): compute the posterior probability (responsibility) of each class for each data point.

For each data point $\mathbf{x}_i$ and each class $k$, compute:
$$\gamma_{ik}^{(t)} = P(y_i = k \mid \mathbf{x}_i, \theta^{(t)}, \pi^{(t)}) = \frac{\pi_k^{(t)} \prod_{m=1}^M \theta_{k,m}^{(t)}(x_{im})}{\sum_{j=1}^K \pi_j^{(t)} \prod_{m=1}^M \theta_{j,m}^{(t)}(x_{im})}$$
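Continuing the sketch above (same integer encoding, `theta` of shape `(K, M, V)`), the E-step can be computed in log space to avoid underflow when $M$ is large; `e_step` is my own name:

```python
import numpy as np

def e_step(X, pi, theta):
    """Responsibilities gamma[i, k] = P(y_i = k | x_i); X is an (N, M) int array."""
    N, M = X.shape
    # log P(x_i, y_i = k) = log pi_k + sum_m log theta[k, m, x_im]
    log_joint = np.tile(np.log(pi), (N, 1))            # (N, K)
    for m in range(M):
        log_joint += np.log(theta[:, m, X[:, m]]).T    # theta[:, m, X[:, m]] is (K, N)
    log_joint -= log_joint.max(axis=1, keepdims=True)  # stabilize before exponentiating
    gamma = np.exp(log_joint)
    return gamma / gamma.sum(axis=1, keepdims=True)    # normalize over classes
```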

M-step (maximization): update the parameters to maximize the expected log-likelihood.
  • Update the prior probabilities $\pi_k$:
    $$\pi_k^{(t+1)} = \frac{1}{N} \sum_{i=1}^N \gamma_{ik}^{(t)}$$

  • Update the conditional probabilities $\theta_{k,m}(x)$ (a sketch of both updates follows):
    $$\theta_{k,m}^{(t+1)}(x) = \frac{\sum_{i=1}^N \gamma_{ik}^{(t)} \mathbb{I}(x_{im} = x)}{\sum_{i=1}^N \gamma_{ik}^{(t)}}$$
    where $\mathbb{I}(\cdot)$ is the indicator function, equal to 1 when $x_{im} = x$ and 0 otherwise.
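Under the same assumptions as the sketches above, the two updates translate directly into counts weighted by the responsibilities (`m_step` is my own name; adding a small Laplace-style smoothing term would be a common tweak to avoid zero probabilities, though it is not part of the derivation here):

```python
import numpy as np

def m_step(X, gamma, V):
    """Updated (pi, theta) from responsibilities gamma of shape (N, K)."""
    N, M = X.shape
    K = gamma.shape[1]
    pi = gamma.sum(axis=0) / N                     # pi_k = (1/N) sum_i gamma_ik
    theta = np.zeros((K, M, V))
    for m in range(M):
        for v in range(V):
            ind = (X[:, m] == v).astype(float)     # indicator I(x_im = v)
            theta[:, m, v] = gamma.T @ ind         # numerator: sum_i gamma_ik * I(...)
    theta /= gamma.sum(axis=0)[:, None, None]      # denominator: sum_i gamma_ik
    return pi, theta
```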

Naturally, iteration stops when the change in the parameter updates falls below a preset threshold, or when the maximum number of iterations is reached.
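Tying the sketches together, a minimal driver loop with this stopping rule might look as follows (`fit_em`, `tol`, and `max_iter` are illustrative choices of mine; it reuses `init_params`, `e_step`, and `m_step` from above):

```python
import numpy as np

def fit_em(X, K, V, max_iter=100, tol=1e-6):
    """Alternate E and M steps until the parameter change drops below tol."""
    M = X.shape[1]
    pi, theta = init_params(K, M, V)
    for _ in range(max_iter):
        gamma = e_step(X, pi, theta)             # E-step: responsibilities
        new_pi, new_theta = m_step(X, gamma, V)  # M-step: parameter updates
        delta = max(np.abs(new_pi - pi).max(), np.abs(new_theta - theta).max())
        pi, theta = new_pi, new_theta
        if delta < tol:                          # stop on small parameter change
            break
    return pi, theta, gamma
```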

The above only describes the core quantity $P(Z \mid D, \theta)$ and is not complete; the full Q-step and M-step are as follows:

Q-step

The complete-data log-likelihood is:
$$\log P(D, Y \mid \theta, \pi) = \sum_{i=1}^N \log P(\mathbf{x}_i, y_i \mid \theta, \pi)$$
Since the class labels $y_i$ are hidden, we take its expectation:
$$Q(\theta, \pi ; \theta^{(t)}, \pi^{(t)}) = \mathbb{E}_{Y \mid D, \theta^{(t)}, \pi^{(t)}} \left[ \log P(D, Y \mid \theta, \pi) \right]$$
Expanding the expectation naively would give:
$$Q(\theta, \pi ; \theta^{(t)}, \pi^{(t)}) = \sum_{i=1}^N \log P(\mathbf{x}_i, y_i \mid \theta, \pi) \, P(Y \mid D, \theta^{(t)}, \pi^{(t)})$$
Note that this way of writing it is wrong; it is only shown to make clear which factor the expectation weights the log-likelihood by.
Expanding $P(Y \mid D)$ further:
$$Q(\theta, \pi ; \theta^{(t)}, \pi^{(t)}) = \sum_{k=1}^K \sum_{i=1}^N \log P(\mathbf{x}_i, y_i = k \mid \theta, \pi) \, P(y_i = k \mid \mathbf{x}_i, \theta^{(t)}, \pi^{(t)})$$
This is the correct form.
That is (dropping the iteration parameters from $Q$'s argument list; nothing has actually changed):
$$Q(\theta, \pi) = \sum_{i=1}^N \sum_{k=1}^K \hat{\gamma}_{ik} \log P(\mathbf{x}_i, y_i = k \mid \theta, \pi)$$
where $\hat{\gamma}_{ik} = P(y_i = k \mid \mathbf{x}_i, \theta^{(t)}, \pi^{(t)})$ is the responsibility computed in the E-step. This is directly analogous to the Gaussian mixture model, except that one model is discrete and the other continuous. In other words, both there and here, $\hat{\gamma}_{ik}$ is already the expectation-based refresh of $\gamma_{ik}$, i.e. the $\gamma_{ik}^{(t)}$ above.
Note: the steps below still write $\gamma_{ik}$ for this quantity.

Expanding the joint probability inside the $Q$ expression:
$$P(\mathbf{x}_i, y_i = k \mid \theta, \pi) = P(\mathbf{x}_i \mid y_i = k, \theta, \pi) \, P(y_i = k \mid \theta, \pi) = \prod_{m=1}^M \theta_{k,m}(x_{im}) \cdot \pi_k$$
Note that, following the Naive Bayes notation in the book Statistical Learning Methods, $P(\mathbf{x}_i, y_i \mid \theta, \pi)$ should strictly have been written as $P(X^{(1)} = x_{i1}, \dots, X^{(M)} = x_{iM}, Y = y_i \mid \theta, \pi)$ with $y_i \in \{1, 2, \dots, K\}$. In any case, once we take the $\log$, the product of these two factors becomes a sum of two terms, and the task is then to find the two iterated parameters $\pi_k$ and $\theta_{k,m}(x_{im})$.
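To make the split explicit, substituting the factorized joint into the $Q$ function yields two decoupled sums, one involving only $\pi$ and one involving only $\theta$:
$$Q(\theta, \pi) = \sum_{i=1}^N \sum_{k=1}^K \gamma_{ik} \log \pi_k + \sum_{i=1}^N \sum_{k=1}^K \sum_{m=1}^M \gamma_{ik} \log \theta_{k,m}(x_{im})$$
so the M-step can maximize over $\pi$ and over each $\theta_{k,m}$ independently, which is exactly what the two derivations below do.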

M-step

The procedure is as follows.
From the expected complete-data log-likelihood, the term involving $\pi$ is:
$$\sum_{i=1}^N \sum_{k=1}^K \gamma_{ik} \log \pi_k$$

To maximize the $\pi_k$ part of $Q(\theta, \pi)$, we need to solve the following optimization problem:
$$\max_{\pi} \sum_{i=1}^N \sum_{k=1}^K \gamma_{ik} \log \pi_k$$
subject to the constraints:
$$\sum_{k=1}^K \pi_k = 1 \quad \text{and} \quad \pi_k \geq 0, \; \forall k \in \{1, 2, \dots, K\}$$

Using the method of Lagrange multipliers, introduce $\lambda$:
$$\mathcal{L} = \sum_{i=1}^N \sum_{k=1}^K \gamma_{ik} \log \pi_k + \lambda \left( 1 - \sum_{k=1}^K \pi_k \right)$$

Take the partial derivative with respect to each $\pi_k$ and set it to zero:
$$\frac{\partial \mathcal{L}}{\partial \pi_k} = \frac{\sum_{i=1}^N \gamma_{ik}}{\pi_k} - \lambda = 0$$
which gives:
$$\pi_k = \frac{1}{\lambda} \sum_{i=1}^N \gamma_{ik}$$

Using the constraint $\sum\limits_{k=1}^K \pi_k = 1$:
$$\sum_{k=1}^K \frac{1}{\lambda} \sum_{i=1}^N \gamma_{ik} = 1 \;\Rightarrow\; \frac{1}{\lambda} \sum_{i=1}^N \sum_{k=1}^K \gamma_{ik} = 1$$
Since $\sum\limits_{k=1}^K \gamma_{ik} = 1$ for every $i$:
$$\frac{1}{\lambda} N = 1 \;\Rightarrow\; \lambda = N$$
Therefore:
$$\pi_k^{(t+1)} = \frac{1}{N} \sum_{i=1}^N \gamma_{ik}$$

For the conditional probabilities $\theta_{k,m}(x) = P(X^{(m)} = x \mid y = k)$, the goal, for each fixed feature $m$, is to maximize:
$$\sum_{i=1}^N \sum_{k=1}^K \gamma_{ik} \log \theta_{k,m}(x_{im})$$
It is easy to see the constraints (the ranges of $k, m$ are omitted):
$$\sum_{x} \theta_{k,m}(x) = 1 \quad \forall k, m$$

Likewise, using Lagrange multipliers, introduce $\lambda_k$:
$$\mathcal{L} = \sum_{i=1}^N \sum_{k=1}^K \gamma_{ik} \log \theta_{k,m}(x_{im}) + \sum_{k=1}^K \lambda_k \left( 1 - \sum_{x} \theta_{k,m}(x) \right)$$

Take the partial derivative with respect to each $\theta_{k,m}(x)$ and set it to zero:
$$\frac{\partial \mathcal{L}}{\partial \theta_{k,m}(x)} = \frac{\sum_{i=1}^N \gamma_{ik} \mathbb{I}(x_{im} = x)}{\theta_{k,m}(x)} - \lambda_k = 0$$
Solving:
$$\theta_{k,m}(x) = \frac{1}{\lambda_k} \sum_{i=1}^N \gamma_{ik} \mathbb{I}(x_{im} = x)$$

Using the constraint:
$$\sum_{x} \theta_{k,m}(x) = 1 \;\Rightarrow\; \frac{1}{\lambda_k} \sum_{x} \sum_{i=1}^N \gamma_{ik} \mathbb{I}(x_{im} = x) = 1$$
Noting that $\sum\limits_{x} \mathbb{I}(x_{im} = x) = 1$ for each $i$:
$$\frac{1}{\lambda_k} \sum_{i=1}^N \gamma_{ik} = 1 \;\Rightarrow\; \lambda_k = \sum_{i=1}^N \gamma_{ik}$$

Therefore, the update formula for the conditional probabilities is:
$$\theta_{k,m}^{(t+1)}(x) = \frac{\sum_{i=1}^N \gamma_{ik}^{(t)} \mathbb{I}(x_{im} = x)}{\sum_{i=1}^N \gamma_{ik}^{(t)}}$$
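Putting everything together, here is a hypothetical smoke test on synthetic data drawn from a known two-class model (all names and numbers below are illustrative, not from the text); the recovered priors should roughly match the true ones, possibly with the two labels swapped, since mixture components are only identifiable up to permutation:

```python
import numpy as np

rng = np.random.default_rng(42)
true_pi = np.array([0.6, 0.4])
true_theta = np.array([[[0.9, 0.1]] * 3,   # class 0 strongly favors value 0
                       [[0.2, 0.8]] * 3])  # class 1 strongly favors value 1
y = rng.choice(2, size=200, p=true_pi)     # hidden labels (not shown to EM)
X = np.array([[rng.choice(2, p=true_theta[yi, m]) for m in range(3)] for yi in y])

pi_hat, theta_hat, gamma = fit_em(X, K=2, V=2)
print(np.round(np.sort(pi_hat), 3))        # expect roughly [0.4, 0.6]
```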
