Exponential Family Distributions and Variational Inference

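This article examines exponential family distributions: their general form as a probability density (or mass) function, and how common distributions such as the Gaussian can be rewritten in that form. It then discusses why the family is convenient, in particular how it simplifies maximum likelihood estimation and variational inference. It also explains conjugate priors and derives the connection between exponential family distributions and variational inference.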


Exponential family distributions

  1. The pdf / pmf of an exponential family distribution can be written as:

$$p(x \mid \eta) = h(x)\exp\left(T(x)^T \eta - A(\eta)\right)$$

Here $T(x)$ and $h(x)$ are functions of $x$ only, and $A(\eta)$ is a function of $\eta$ only. $T(x)$ is called the sufficient statistic and $A(\eta)$ the log-normalizer. $A(\eta)$ plays an important role in variational inference. Because the density must integrate to one:
$$\frac{\int h(x)\exp\left(T(x)^T\eta\right)dx}{\exp(A(\eta))} = 1 \quad\Rightarrow\quad A(\eta) = \log\int h(x)\exp\left(T(x)^T\eta\right)dx$$
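As a quick numerical illustration of the normalization identity, the sketch below uses the Poisson distribution, which this article does not work through. It assumes the standard mapping $h(x)=1/x!$, $T(x)=x$, $\eta=\log(\text{rate})$, $A(\eta)=e^{\eta}$; the function names and the truncation point `max_x` are made up for the example.

```python
import math

def poisson_log_normalizer(eta, max_x=200):
    """A(eta) = log sum_x h(x) exp(T(x) * eta), with h(x) = 1/x! and T(x) = x.

    The infinite sum is truncated at max_x (an assumption of this sketch).
    """
    # log-sum-exp over x = 0..max_x for numerical stability
    log_terms = [x * eta - math.lgamma(x + 1) for x in range(max_x + 1)]
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

def poisson_pmf_expfam(x, eta):
    """p(x | eta) = h(x) exp(T(x) * eta - A(eta)), using the closed form A(eta) = exp(eta)."""
    return math.exp(x * eta - math.exp(eta) - math.lgamma(x + 1))

rate = 3.5
eta = math.log(rate)

# The numerically computed log-normalizer should agree with exp(eta) = rate.
numeric_A = poisson_log_normalizer(eta)

# With that A(eta), the pmf should sum to one (up to truncation error).
total_mass = sum(poisson_pmf_expfam(x, eta) for x in range(200))
```

Computing $A(\eta)$ from the sum and comparing it with the closed form is a cheap sanity check when rewriting a distribution in this form.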

  2. Many familiar distributions belong to the exponential family, for example:

Normal, beta, Poisson, gamma, Bernoulli, chi-squared, geometric, exponential, categorical…

  3. Take the Gaussian distribution as an example:

$$p(x \mid \theta) = p(x \mid \mu, \sigma^2) = N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

  4. Example: writing the Gaussian in exponential family form amounts to replacing the mean and variance with the two parameters $\eta_1, \eta_2$:

$$\begin{aligned} N(x \mid \mu,\sigma^2) &= (2\pi\sigma^2)^{-\frac{1}{2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\\ &= \exp\left(-\frac{x^2-2x\mu+\mu^2}{2\sigma^2}-\frac{1}{2}\ln(2\pi\sigma^2)\right)\\ &= \exp\left(-\frac{1}{2\sigma^2}x^2+\frac{\mu}{\sigma^2}x-\frac{\mu^2}{2\sigma^2}-\frac{1}{2}\ln(2\pi\sigma^2)\right)\\ &= \exp\left(\begin{bmatrix} x \\ x^2 \end{bmatrix}^T\begin{bmatrix} \frac{\mu}{\sigma^2}\\ -\frac{1}{2\sigma^2} \end{bmatrix}-\frac{\mu^2}{2\sigma^2}-\frac{1}{2}\ln(2\pi\sigma^2)\right) \end{aligned}$$

Matching terms with the general form, and taking $h(x) = (2\pi)^{-1/2}$ so that the constant $-\frac{1}{2}\ln(2\pi)$ is moved out of the exponent, we obtain:
$$\begin{aligned} T(x) &= \begin{bmatrix} x \\ x^2 \end{bmatrix}\\ \eta &= \begin{bmatrix} \eta_1\\ \eta_2 \end{bmatrix} = \begin{bmatrix} \frac{\mu}{\sigma^2}\\ -\frac{1}{2\sigma^2} \end{bmatrix}\\ \theta &= \begin{bmatrix} \mu\\ \sigma^2 \end{bmatrix} = \begin{bmatrix} -\frac{\eta_1}{2\eta_2}\\ -\frac{1}{2\eta_2} \end{bmatrix}\\ A(\eta) &= -\frac{\eta_1^2}{4\eta_2}-\frac{1}{2}\ln(-2\eta_2) \end{aligned}$$
The expression for $\theta$ follows by inverting $\eta$, so the mean and variance can be written as:
$$\eta_2 = -\frac{1}{2\sigma^2} \Rightarrow \sigma^2 = -\frac{1}{2\eta_2}, \qquad \mu = \eta_1\sigma^2 = \eta_1 \cdot \frac{-1}{2\eta_2} = -\frac{\eta_1}{2\eta_2}$$
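The conversion between $(\mu, \sigma^2)$ and $(\eta_1, \eta_2)$ can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, and it assumes $h(x) = 1/\sqrt{2\pi}$, which is the choice that matches the $A(\eta)$ given above.

```python
import math

def to_natural(mu, sigma2):
    """Map (mu, sigma^2) to the natural parameters (eta1, eta2)."""
    return mu / sigma2, -1.0 / (2.0 * sigma2)

def to_mean_var(eta1, eta2):
    """Invert the map: recover (mu, sigma^2) from (eta1, eta2)."""
    return -eta1 / (2.0 * eta2), -1.0 / (2.0 * eta2)

def log_normalizer(eta1, eta2):
    """A(eta) = -eta1^2 / (4 eta2) - 0.5 * ln(-2 eta2); requires eta2 < 0."""
    return -eta1 ** 2 / (4.0 * eta2) - 0.5 * math.log(-2.0 * eta2)

def gaussian_pdf(x, mu, sigma2):
    """The usual Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def gaussian_pdf_expfam(x, eta1, eta2):
    """h(x) exp(T(x)^T eta - A(eta)) with T(x) = [x, x^2] and h(x) = 1/sqrt(2 pi)."""
    h = 1.0 / math.sqrt(2.0 * math.pi)
    return h * math.exp(x * eta1 + x * x * eta2 - log_normalizer(eta1, eta2))

mu, sigma2 = 1.5, 0.8
eta1, eta2 = to_natural(mu, sigma2)
mu_back, sigma2_back = to_mean_var(eta1, eta2)

# Both parameterizations should give the same density at any x.
density_standard = gaussian_pdf(0.3, mu, sigma2)
density_expfam = gaussian_pdf_expfam(0.3, eta1, eta2)
```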

  5. What are the benefits of the exponential family?
  • If a conditional probability can be written in the form above, many problems become easy to solve.
  • For example, solving $\underset{\eta}{\arg\max}\,[\log p(X \mid \eta)]$:

$$\begin{aligned} \underset{\eta}{\arg\max}\,[\log p(X \mid \eta)] &= \underset{\eta}{\arg\max}\left[\log \prod_{i=1}^{N} p(x_i \mid \eta)\right]\\ &= \underset{\eta}{\arg\max}\sum_{i=1}^{N}\left[\log h(x_i)+T(x_i)^T\eta-A(\eta)\right]\\ &= \underset{\eta}{\arg\max}\left[\sum_{i=1}^{N}T(x_i)^T\eta-NA(\eta)\right] \end{aligned}$$

The last step drops $\sum_i \log h(x_i)$, which does not depend on $\eta$. Calling the remaining objective $L(\eta)$ and setting its derivative to zero:
$$\frac{\partial L(\eta)}{\partial \eta} = \sum_{i=1}^{N}T(x_i) - NA'(\eta) = 0$$
that is:
$$A'(\eta) = \frac{\sum_{i=1}^{N}T(x_i)}{N}$$
So the maximum likelihood estimate is the $\eta$ at which the gradient of the log-normalizer equals the empirical mean of the sufficient statistics.
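For the Gaussian, this condition can be solved in closed form. The sketch below is a minimal illustration on synthetic data (the sample size, seed, and true parameters are arbitrary choices): it uses the fact that for $A(\eta) = -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\ln(-2\eta_2)$ the gradient is $A'(\eta) = [\mu,\ \mu^2 + \sigma^2]$, so matching it to the mean of $T(x) = [x, x^2]$ gives the familiar estimates.

```python
import numpy as np

def grad_log_normalizer(eta):
    """Gradient of A(eta) = -eta1^2 / (4 eta2) - 0.5 * ln(-2 eta2)."""
    eta1, eta2 = eta
    return np.array([
        -eta1 / (2.0 * eta2),                             # equals mu
        eta1 ** 2 / (4.0 * eta2 ** 2) - 1.0 / (2.0 * eta2),  # equals mu^2 + sigma^2
    ])

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)  # synthetic data, true sigma^2 = 2.25

# Empirical mean of the sufficient statistics T(x) = [x, x^2].
t_mean = np.array([x.mean(), (x ** 2).mean()])

# Solve A'(eta) = t_mean for (mu, sigma^2), then map to natural parameters.
mu_hat = t_mean[0]
sigma2_hat = t_mean[1] - t_mean[0] ** 2
eta_hat = np.array([mu_hat / sigma2_hat, -1.0 / (2.0 * sigma2_hat)])
```

Note that `sigma2_hat` is the biased (divide-by-N) variance, which is what maximum likelihood gives.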

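  6. Conjugacy:

$$p(\beta \mid x) \propto p(x \mid \beta)\,p(\beta)$$

If the likelihood and the prior are conjugate, the posterior belongs to the same family of distributions as the prior.

If the likelihood is an exponential family distribution, a conjugate prior (itself an exponential family distribution) can in principle always be found.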
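A standard concrete case, not taken from this article, is the Beta prior with a Bernoulli likelihood: the posterior is again a Beta distribution, and the update only adds counts to the prior parameters. The function name and the data below are made up for the illustration.

```python
def beta_bernoulli_posterior(a, b, observations):
    """Return the Beta posterior parameters after observing 0/1 outcomes.

    Prior: p ~ Beta(a, b). Likelihood: each observation ~ Bernoulli(p).
    Posterior: p ~ Beta(a + number of ones, b + number of zeros).
    """
    successes = sum(observations)
    failures = len(observations) - successes
    return a + successes, b + failures

prior_a, prior_b = 2.0, 2.0
data = [1, 0, 1, 1, 0, 1]
post_a, post_b = beta_bernoulli_posterior(prior_a, prior_b, data)

# Mean of a Beta(a, b) distribution is a / (a + b).
posterior_mean = post_a / (post_a + post_b)
```

Because prior and posterior share a family, the same function can be applied again as new data arrive, using the previous posterior as the next prior.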

  7. A useful result: $A_l'(\beta) = E_{p(x \mid \beta)}[T(x)]$

Proof. Let
$$p(x \mid \beta) = h(x)\exp\left(T(x)^T\beta - A_l(\beta)\right)$$
Since $\int p(x \mid \beta)\,dx = 1$ for every $\beta$, its derivative with respect to $\beta$ is zero:
$$\begin{aligned} 0 = \frac{\partial}{\partial \beta}\int p(x \mid \beta)\,dx &= \int \frac{\partial}{\partial \beta}\left[h(x)\exp\left(T(x)^T\beta - A_l(\beta)\right)\right]dx\\ &= \int h(x)\exp\left(T(x)^T\beta - A_l(\beta)\right)\left(T(x) - A_l'(\beta)\right)dx\\ &= \int p(x \mid \beta)\,T(x)\,dx - A_l'(\beta)\int p(x \mid \beta)\,dx\\ &= E_{p(x \mid \beta)}[T(x)] - A_l'(\beta) \end{aligned}$$
Hence $A_l'(\beta) = E_{p(x \mid \beta)}[T(x)]$.
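This identity is easy to verify numerically on a small discrete case. The sketch below uses the Bernoulli distribution as an assumed example (not one worked in this article), with $h(x)=1$, $T(x)=x$, and $A(\eta)=\log(1+e^{\eta})$, and compares a finite-difference derivative of $A$ with the expectation of $T(x)$.

```python
import math

def log_normalizer(eta):
    """Bernoulli log-normalizer: A(eta) = log(1 + exp(eta))."""
    return math.log1p(math.exp(eta))

def expected_T(eta):
    """E[T(x)] computed directly as a sum over x in {0, 1}, with T(x) = x and h(x) = 1."""
    A = log_normalizer(eta)
    return sum(x * math.exp(x * eta - A) for x in (0, 1))

def numeric_derivative(f, point, eps=1e-6):
    """Central finite difference approximation of f'(point)."""
    return (f(point + eps) - f(point - eps)) / (2.0 * eps)

eta = 0.7
lhs = numeric_derivative(log_normalizer, eta)  # A'(eta)
rhs = expected_T(eta)                          # E[T(x)]
```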

  8. Let $X$ be the data, $Z$ the set of latent variables, and $\beta$ the set of parameters.

The posterior distribution factorizes as:
$$\begin{aligned} p(\beta,Z \mid X) &= p(\beta \mid Z,X)\,p(Z \mid X)\\ &= p(Z \mid \beta,X)\,p(\beta \mid X) \end{aligned}$$
Both conditional posteriors, $p(\beta \mid Z,X)$ and $p(Z \mid \beta,X)$, are taken to be exponential family distributions.

Then:
$$p(\beta \mid Z,X) = h(\beta)\exp\left(T(\beta)^T\eta(Z,X) - A_g(\eta(Z,X))\right)$$
In variational inference we approximate $p(\beta \mid Z,X)$ with a distribution $q(\beta \mid \lambda)$ from the same family, so both share the log-normalizer $A_g$:
$$p(\beta \mid Z,X) \approx q(\beta \mid \lambda) = h(\beta)\exp\left(T(\beta)^T\lambda - A_g(\lambda)\right)$$
We then keep adjusting $\lambda$ so that $q(\beta \mid \lambda)$ moves closer to $p(\beta \mid Z,X)$, which amounts to increasing the ELBO (evidence lower bound).

The same applies to $p(Z \mid \beta, X)$, with $q(Z \mid \phi)$ in the same family and sharing the log-normalizer $A_l$:
$$\begin{aligned} p(Z \mid \beta,X) &= h(Z)\exp\left(T(Z)^T\eta(\beta,X) - A_l(\eta(\beta,X))\right)\\ &\approx q(Z \mid \phi) = h(Z)\exp\left(T(Z)^T\phi - A_l(\phi)\right) \end{aligned}$$
The ELBO is:
$$L(q) = E_{q(Z,\beta)}[\log p(X,Z,\beta)] - E_{q(Z,\beta)}[\log q(Z,\beta)]$$
With the factorized approximation $q(Z,\beta) = q(\beta \mid \lambda)\,q(Z \mid \phi)$, the ELBO becomes a function of the variational parameters:
$$L(\lambda, \phi) = E_{q(Z,\beta)}[\log p(X,Z,\beta)] - E_{q(Z,\beta)}[\log q(Z,\beta)]$$
Goal: find $\lambda$ and $\phi$ that maximize the ELBO.

Method:

  • Fix one parameter and optimize the other, then alternate.

In detail:

  • Fix $\phi$ and optimize $\lambda$:

  • $$\begin{aligned} L(\lambda, \phi) &= E_{q(Z,\beta)}[\log p(X,Z,\beta)] - E_{q(Z,\beta)}[\log q(Z,\beta)]\\ &= E_{q(Z,\beta)}[\log p(\beta \mid X,Z) + \log p(Z,X)] - E_{q(Z,\beta)}[\log q(\beta \mid \lambda)] - E_{q(Z,\beta)}[\log q(Z \mid \phi)]\\ &= E_{q(Z,\beta)}[\log p(\beta \mid X,Z)] - E_{q(Z,\beta)}[\log q(\beta \mid \lambda)] + \text{const} \end{aligned}$$
    where const collects the terms that do not depend on $\lambda$.

  • Substitute the exponential family forms of $p(\beta \mid Z,X)$ and $q(\beta \mid \lambda)$ into the expression above:

  • $$\begin{aligned} L(\lambda, \phi) &= E_{q(Z,\beta)}[\log h(\beta)] + E_{q(Z,\beta)}[T(\beta)^T\eta(Z,X)] - E_{q(Z,\beta)}[A_g(\eta(Z,X))]\\ &\quad - E_{q(Z,\beta)}[\log h(\beta)] - E_{q(Z,\beta)}[T(\beta)^T\lambda] + E_{q(Z,\beta)}[A_g(\lambda)] + \text{const}\\ &= E_{q(\beta)}[T(\beta)]^T E_{q(Z)}[\eta(Z,X)] - E_{q(Z)}[A_g(\eta(Z,X))] - E_{q(\beta)}[T(\beta)]^T\lambda + A_g(\lambda) + \text{const}\\ &= A_g'(\lambda)^T E_{q(Z)}[\eta(Z,X)] - A_g'(\lambda)^T\lambda + A_g(\lambda) + \text{const} \end{aligned}$$
    The last step uses $E_{q(\beta \mid \lambda)}[T(\beta)] = A_g'(\lambda)$, the identity proved earlier, and absorbs $E_{q(Z)}[A_g(\eta(Z,X))]$ into const because it does not depend on $\lambda$.

  • Differentiate with respect to $\lambda$ and set the result to zero:

  • $$\begin{aligned} \frac{\partial L(\lambda, \phi)}{\partial \lambda} &= A_g''(\lambda)\,E_{q(Z)}[\eta(Z,X)] - A_g'(\lambda) - A_g''(\lambda)\,\lambda + A_g'(\lambda)\\ &= A_g''(\lambda)\left(E_{q(Z)}[\eta(Z,X)] - \lambda\right) = 0 \end{aligned}$$

  • If $A_g''(\lambda) \neq 0$ (for vector-valued $\lambda$, if the Hessian $A_g''(\lambda)$ is invertible), then
    $$\lambda = E_{q(Z \mid \phi)}[\eta(Z,X)]$$
    Similarly, fixing $\lambda$ and optimizing $\phi$ gives
    $$\phi = E_{q(\beta \mid \lambda)}[\eta(\beta,X)]$$
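The alternating updates can be sketched on a small conjugate model. The example below is not the model from this article: it has no latent variables, and the two blocks being alternated are the mean $\mu$ and precision $\tau$ of a Gaussian, with an assumed prior $\mu \mid \tau \sim N(\mu_0, (\lambda_0\tau)^{-1})$, $\tau \sim \text{Gamma}(a_0, b_0)$ and a factorized $q(\mu)\,q(\tau)$. It is included only because each update has exactly the form derived above: the variational parameter of one block is set to the expected natural parameter of that block's complete conditional. All names and default values are hypothetical.

```python
import numpy as np

def cavi_gaussian_mean_precision(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, n_iters=50):
    """Coordinate ascent for q(mu) q(tau) on the model x_n ~ N(mu, 1/tau).

    Assumed prior: mu | tau ~ N(mu0, 1/(lam0 * tau)), tau ~ Gamma(a0, rate=b0).
    Returns the mean and variance of q(mu) and the shape and rate of q(tau).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    x_sum = x.sum()
    x_sq_sum = np.sum(x ** 2)

    e_tau = a0 / b0  # initial guess for E_q[tau]
    for _ in range(n_iters):
        # Fix q(tau), update q(mu).
        # p(mu | tau, x) is Gaussian with T(mu) = [mu, mu^2]; its natural
        # parameters are linear in tau, so their expectation only needs E[tau].
        eta1 = e_tau * (lam0 * mu0 + x_sum)
        eta2 = -0.5 * e_tau * (lam0 + n)
        mu_n = -eta1 / (2.0 * eta2)   # mean of q(mu)
        var_n = -1.0 / (2.0 * eta2)   # variance of q(mu)
        e_mu = mu_n
        e_mu2 = mu_n ** 2 + var_n

        # Fix q(mu), update q(tau).
        # p(tau | mu, x) is Gamma with natural parameters [a - 1, -b];
        # the rate b depends on mu only through E[mu] and E[mu^2].
        a_n = a0 + 0.5 * (n + 1)
        b_n = b0 + 0.5 * (
            x_sq_sum - 2.0 * e_mu * x_sum + n * e_mu2
            + lam0 * (e_mu2 - 2.0 * e_mu * mu0 + mu0 ** 2)
        )
        e_tau = a_n / b_n

    return mu_n, var_n, a_n, b_n

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=2000)  # synthetic data
mu_n, var_n, a_n, b_n = cavi_gaussian_mean_precision(x)
expected_precision = a_n / b_n  # E_q[tau], should be near 1 / 2.0**2
```

With this much data the prior has little influence, so `mu_n` lands close to the sample mean and `expected_precision` close to the inverse sample variance.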
