Exponential family distributions
- The pdf / pmf of an exponential family distribution can be written as:
$$
p(x \mid \eta) = h(x)\exp\left(T(x)^T \eta - A(\eta)\right)
$$
where $T(x)$ and $h(x)$ are functions of $x$ only, and $A(\eta)$ is a function of $\eta$ only. $T(x)$ is called the sufficient statistic and $A(\eta)$ the log-normalizer. $A(\eta)$ plays an important role in variational inference. Because the density must integrate to 1, $A(\eta)$ is fully determined by $h$ and $T$:
$$
\begin{aligned}
\frac{\int h(x)\exp\left(T(x)^T\eta\right)dx}{\exp\left(A(\eta)\right)} &= 1\\
A(\eta) &= \log\int h(x)\exp\left(T(x)^T\eta\right)dx
\end{aligned}
$$
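As a quick numerical illustration of the log-normalizer formula above, the sketch below uses the Poisson distribution, which is not worked out in the text: it assumes the standard choices $h(x) = 1/x!$, $T(x) = x$, $\eta = \log(\text{rate})$, for which the closed form is $A(\eta) = e^\eta$. The function names and the truncation point `max_x` are illustrative only.

```python
import math

def poisson_log_normalizer(eta, max_x=200):
    """A(eta) = log sum_x h(x) * exp(T(x) * eta), with h(x) = 1/x! and T(x) = x.

    The infinite sum is truncated at max_x terms (an assumption that is
    harmless for moderate rates) and evaluated with log-sum-exp for stability.
    """
    log_terms = [x * eta - math.lgamma(x + 1) for x in range(max_x)]
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

def poisson_pmf_expfam(x, eta):
    """Poisson pmf written as h(x) * exp(T(x) * eta - A(eta))."""
    log_h = -math.lgamma(x + 1)
    return math.exp(log_h + x * eta - poisson_log_normalizer(eta))

rate = 3.5
eta = math.log(rate)
# The numerically computed A(eta) should agree closely with exp(eta) = rate.
print(poisson_log_normalizer(eta), math.exp(eta))
```

The same pattern (sum or integrate $h(x)\exp(T(x)^T\eta)$, then take the log) applies to any member of the family; only $h$ and $T$ change.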
- Many of the distributions we have learned are exponential family distributions, for example:
Normal, beta, Poisson, gamma, Bernoulli, chi-squared, geometric, exponential, categorical…
- Take the Gaussian distribution as an example:
$$
p(x \mid \theta) = p(x \mid \mu, \sigma^2) = N(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}
$$
- Example: writing the Gaussian in exponential family form amounts to replacing the mean and variance parameters with the natural parameters $\eta_1, \eta_2$.
$$
\begin{aligned}
N(x \mid \mu,\sigma^2) &= (2\pi\sigma^2)^{-\frac{1}{2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\\
&= \exp\left(-\frac{x^2-2x\mu+\mu^2}{2\sigma^2}-\frac{1}{2}\ln(2\pi\sigma^2)\right)\\
&= \exp\left(-\frac{1}{2\sigma^2}x^2+\frac{\mu}{\sigma^2}x-\frac{\mu^2}{2\sigma^2}-\frac{1}{2}\ln(2\pi\sigma^2)\right)\\
&= \exp\left(\begin{bmatrix} x \\ x^2 \end{bmatrix}^T\begin{bmatrix} \frac{\mu}{\sigma^2}\\ -\frac{1}{2\sigma^2} \end{bmatrix}-\frac{\mu^2}{2\sigma^2}-\frac{1}{2}\ln(2\pi\sigma^2)\right)
\end{aligned}
$$
From this we read off:
$$
\begin{aligned}
T(x) &= \begin{bmatrix} x \\ x^2 \end{bmatrix}\\
\eta &= \begin{bmatrix} \eta_1\\ \eta_2 \end{bmatrix} = \begin{bmatrix} \frac{\mu}{\sigma^2}\\ -\frac{1}{2\sigma^2} \end{bmatrix}\\
\theta &= \begin{bmatrix} \mu\\ \sigma^2 \end{bmatrix} = \begin{bmatrix} \frac{-\eta_1}{2\eta_2}\\ \frac{-1}{2\eta_2} \end{bmatrix}\\
A(\eta) &= \frac{-\eta_1^2}{4\eta_2}-\frac{1}{2}\ln(-2\eta_2)
\end{aligned}
$$
Here $h(x) = \frac{1}{\sqrt{2\pi}}$, which absorbs the constant $-\frac{1}{2}\ln(2\pi)$ from the exponent, so that $A(\eta)$ contains only the terms that depend on $\eta$.
So the mean and variance can be expressed in terms of the natural parameters as:
$$
\begin{aligned}
\eta_2 = -\frac{1}{2\sigma^2} &\Rightarrow \sigma^2 = -\frac{1}{2\eta_2}\\
\mu = \eta_1\sigma^2 &= \eta_1\frac{-1}{2\eta_2} = -\frac{\eta_1}{2\eta_2}
\end{aligned}
$$
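The conversion above can be checked numerically. The following is a minimal sketch, with hypothetical function names, that maps $(\mu, \sigma^2)$ to $(\eta_1, \eta_2)$ and back, and compares the exponential family form against the usual Gaussian density. It assumes $h(x) = 1/\sqrt{2\pi}$, which is the choice consistent with $A(\eta) = -\eta_1^2/(4\eta_2) - \frac{1}{2}\ln(-2\eta_2)$.

```python
import math

def to_natural(mu, sigma2):
    """(mu, sigma^2) -> (eta1, eta2) = (mu / sigma^2, -1 / (2 sigma^2))."""
    return mu / sigma2, -1.0 / (2.0 * sigma2)

def to_mean_variance(eta1, eta2):
    """(eta1, eta2) -> (mu, sigma^2) = (-eta1 / (2 eta2), -1 / (2 eta2))."""
    return -eta1 / (2.0 * eta2), -1.0 / (2.0 * eta2)

def log_normalizer(eta1, eta2):
    """A(eta) = -eta1^2 / (4 eta2) - 0.5 * ln(-2 eta2); requires eta2 < 0."""
    return -eta1 ** 2 / (4.0 * eta2) - 0.5 * math.log(-2.0 * eta2)

def gaussian_pdf_expfam(x, eta1, eta2):
    """h(x) * exp(T(x)^T eta - A(eta)) with T(x) = [x, x^2], h(x) = 1/sqrt(2 pi)."""
    h = 1.0 / math.sqrt(2.0 * math.pi)
    return h * math.exp(x * eta1 + x * x * eta2 - log_normalizer(eta1, eta2))

def gaussian_pdf(x, mu, sigma2):
    """The usual mean/variance form of the Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

eta1, eta2 = to_natural(1.5, 0.7)
# The two densities should agree up to floating-point error.
print(gaussian_pdf_expfam(0.3, eta1, eta2), gaussian_pdf(0.3, 1.5, 0.7))
```

Note that $\eta_2$ must be negative for the density to be normalizable, which is why `log_normalizer` takes the log of $-2\eta_2$.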
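- What are the benefits of exponential family distributions?
- If a conditional probability can be written in the form above, many problems become simple to solve.
- For example, solving $\underset{\eta}{\operatorname{argmax}}\,[\log p(X \mid \eta)]$: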
$$
\begin{aligned}
\underset{\eta}{\operatorname{argmax}}\,[\log p(X \mid \eta)] &= \underset{\eta}{\operatorname{argmax}}\left[\log \prod_{i=1}^{N} p(x_i \mid \eta)\right]\\
&= \underset{\eta}{\operatorname{argmax}}\sum_{i=1}^{N}\left[\log h(x_i)+T(x_i)^T\eta-A(\eta)\right]\\
&= \underset{\eta}{\operatorname{argmax}}\sum_{i=1}^{N}T(x_i)^T\eta-NA(\eta)
\end{aligned}
$$
Let the objective above be $L(\eta)$. Setting its derivative to zero:
$$
\frac{\partial L(\eta)}{\partial \eta} = \sum_{i=1}^{N}T(x_i) - NA'(\eta) = 0
$$
That is:
$$
A'(\eta) = \frac{\sum_{i=1}^{N}T(x_i)}{N}
$$
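To make the condition $A'(\eta) = \frac{1}{N}\sum_i T(x_i)$ concrete, here is a small sketch for the Gaussian case, where $T(x) = [x, x^2]$ and $A(\eta) = -\eta_1^2/(4\eta_2) - \frac{1}{2}\ln(-2\eta_2)$. Differentiating $A$ by hand gives $\partial A/\partial\eta_1 = \mu$ and $\partial A/\partial\eta_2 = \mu^2 + \sigma^2$, so the maximum likelihood estimate reduces to matching the first two sample moments. The function names are illustrative, and the data are made up.

```python
def gaussian_mle_natural(xs):
    """Solve A'(eta) = mean of T(x) for a Gaussian, returning (eta1, eta2)."""
    n = len(xs)
    mean_t1 = sum(xs) / n                  # average of T_1(x) = x
    mean_t2 = sum(x * x for x in xs) / n   # average of T_2(x) = x^2
    mu = mean_t1
    sigma2 = mean_t2 - mean_t1 ** 2        # from mu^2 + sigma^2 = mean_t2
    return mu / sigma2, -1.0 / (2.0 * sigma2)

def grad_log_normalizer(eta1, eta2):
    """Gradient of A(eta) = -eta1^2 / (4 eta2) - 0.5 * ln(-2 eta2)."""
    d_eta1 = -eta1 / (2.0 * eta2)
    d_eta2 = eta1 ** 2 / (4.0 * eta2 ** 2) - 1.0 / (2.0 * eta2)
    return d_eta1, d_eta2

data = [1.0, 2.0, 3.0, 4.0, 5.0]  # toy data, chosen only for illustration
eta1, eta2 = gaussian_mle_natural(data)
# The gradient of A at the estimate should reproduce the sample moments of T(x).
print(grad_log_normalizer(eta1, eta2))
```

The variance recovered this way is the maximum likelihood (biased) estimate, dividing by $N$ rather than $N-1$.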
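- Conjugacy:
$$
p(\beta \mid x) \propto p(x \mid \beta)\,p(\beta)
$$
If the likelihood and the prior are conjugate, the posterior belongs to the same family of distributions as the prior.
If the likelihood is an exponential family distribution, a conjugate prior (itself an exponential family distribution) can in principle always be found.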
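A familiar instance of conjugacy, used here purely as an illustration and not taken from the text, is the Bernoulli likelihood with a Beta prior: the posterior is again a Beta distribution whose parameters are the prior parameters plus the observed counts. The sketch below compares that closed-form posterior with a brute-force normalization of likelihood times prior on a grid. All names and the toy data are assumptions.

```python
import math

def beta_pdf(p, a, b):
    """Density of Beta(a, b) at p in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))

def bernoulli_likelihood(xs, p):
    """Likelihood of i.i.d. 0/1 observations xs under success probability p."""
    k = sum(xs)
    return p ** k * (1 - p) ** (len(xs) - k)

def posterior_params(xs, a, b):
    """Conjugate update: Beta(a, b) prior -> Beta(a + successes, b + failures)."""
    k = sum(xs)
    return a + k, b + len(xs) - k

def grid_posterior(xs, a, b, grid):
    """Normalize likelihood * prior numerically on an evenly spaced grid."""
    unnorm = [bernoulli_likelihood(xs, p) * beta_pdf(p, a, b) for p in grid]
    step = grid[1] - grid[0]
    z = sum(unnorm) * step  # midpoint-rule estimate of the evidence
    return [u / z for u in unnorm]

xs = [1, 0, 1, 1, 0, 1]                       # toy observations
grid = [(i + 0.5) / 2000 for i in range(2000)]  # midpoints of 2000 bins on (0, 1)
numeric = grid_posterior(xs, 2.0, 2.0, grid)
a_post, b_post = posterior_params(xs, 2.0, 2.0)
# The largest gap between the two posteriors should be tiny.
print(max(abs(g - beta_pdf(p, a_post, b_post)) for g, p in zip(numeric, grid)))
```

The closed-form update needs only the counts of successes and failures, which is the practical payoff of conjugacy.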
- A result: $A_l'(\beta) = E_{p(x \mid \beta)}[T(x)]$
Proof: write
$$
p(x \mid \beta) = h(x)\exp\left(T(x)^T\beta - A_l(\beta)\right)
$$
Since $\int p(x \mid \beta)\,dx = 1$, differentiating both sides with respect to $\beta$ (assuming differentiation and integration can be interchanged) gives
$$
\begin{aligned}
0 = \frac{\partial}{\partial \beta}\int p(x \mid \beta)\,dx &= \int \frac{\partial}{\partial \beta}\left[h(x)\exp\left(T(x)^T\beta - A_l(\beta)\right)\right]dx\\
&= \int h(x)\exp\left(T(x)^T\beta - A_l(\beta)\right)\left(T(x) - A_l'(\beta)\right)dx\\
&= \int h(x)\exp\left(T(x)^T\beta - A_l(\beta)\right)T(x)\,dx - A_l'(\beta)\int h(x)\exp\left(T(x)^T\beta - A_l(\beta)\right)dx\\
&= E_{p(x \mid \beta)}[T(x)] - A_l'(\beta)
\end{aligned}
$$
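The identity proved above can be sanity-checked numerically. The sketch below uses the Bernoulli distribution as an assumed example (it is not derived in the text): with $T(x) = x$, $h(x) = 1$ and log-normalizer $A(\eta) = \log(1 + e^\eta)$, a finite-difference derivative of $A$ should match the expectation of $T(x)$. Function names and the step size `eps` are illustrative.

```python
import math

def log_normalizer(eta):
    """Bernoulli log-normalizer A(eta) = log(1 + exp(eta))."""
    return math.log1p(math.exp(eta))

def expected_sufficient_statistic(eta):
    """E[T(x)] = sum over x in {0, 1} of x * exp(x * eta - A(eta))."""
    return sum(x * math.exp(x * eta - log_normalizer(eta)) for x in (0, 1))

def numerical_derivative(f, eta, eps=1e-6):
    """Central finite-difference approximation of f'(eta)."""
    return (f(eta + eps) - f(eta - eps)) / (2.0 * eps)

for eta in (-2.0, 0.0, 0.7, 3.0):
    # The two columns should agree to several decimal places.
    print(numerical_derivative(log_normalizer, eta), expected_sufficient_statistic(eta))
```

Differentiating once more would give the variance of $T(x)$, which is why the second derivative of a log-normalizer is non-negative.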
- Let $X$ be the data set, $Z$ the set of latent variables, and $\beta$ the set of parameters.
The posterior distribution factorizes as:
$$
\begin{aligned}
p(\beta,Z \mid X) &= p(\beta \mid Z,X)\,p(Z \mid X)\\
&= p(Z \mid \beta,X)\,p(\beta \mid X)
\end{aligned}
$$
Suppose the two conditional posteriors $p(\beta \mid Z,X)$ and $p(Z \mid \beta,X)$ are both exponential family distributions.
Then:
$$
p(\beta \mid Z,X) = h(\beta)\exp\left(T(\beta)^T\eta(Z,X) - A_g(\eta(Z,X))\right)
$$
where $A_g$ denotes the log-normalizer of this family.
In variational inference, we want to approximate $p(\beta \mid Z,X)$ with a function $q(\beta \mid \lambda)$ from the same family:
$$
p(\beta \mid Z,X) \approx q(\beta \mid \lambda) = h(\beta)\exp\left(T(\beta)^T\lambda - A_g(\lambda)\right)
$$
Next, $\lambda$ is adjusted iteratively so that $q(\beta \mid \lambda)$ gets closer and closer to $p(\beta \mid Z,X)$, which is equivalent to increasing the ELBO.
The same holds for $p(Z \mid \beta, X)$, whose family has its own log-normalizer $A_l$:
$$
\begin{aligned}
p(Z \mid \beta,X) &= h(Z)\exp\left(T(Z)^T\eta(\beta,X) - A_l(\eta(\beta,X))\right)\\
&\approx q(Z \mid \phi) = h(Z)\exp\left(T(Z)^T\phi - A_l(\phi)\right)
\end{aligned}
$$
The ELBO is:
$$
L(q) = E_{q(Z,\beta)}[\log p(X,Z,\beta)] - E_{q(Z,\beta)}[\log q(Z,\beta)]
$$
With the factorized approximation $q(Z,\beta) = q(\beta \mid \lambda)\,q(Z \mid \phi)$, the ELBO becomes a function of the variational parameters:
$$
L(\lambda, \phi) = E_{q(Z,\beta)}[\log p(X,Z,\beta)] - E_{q(Z,\beta)}[\log q(Z,\beta)]
$$
Goal: find $\lambda$ and $\phi$ that maximize the ELBO.
Method:
- Fix one parameter and optimize the other, alternating between the two.
Procedure:
- Fix $\phi$ and optimize $\lambda$. Using $p(X,Z,\beta) = p(\beta \mid X,Z)\,p(X,Z)$ and $q(Z,\beta) = q(\beta \mid \lambda)\,q(Z \mid \phi)$:
$$
\begin{aligned}
L(\lambda, \phi) &= E_{q(Z,\beta)}[\log p(X,Z,\beta)] - E_{q(Z,\beta)}[\log q(Z,\beta)]\\
&= E_{q(Z,\beta)}[\log p(\beta \mid X,Z) + \log p(X,Z)] - E_{q(Z,\beta)}[\log q(\beta \mid \lambda)] - E_{q(Z,\beta)}[\log q(Z \mid \phi)]\\
&= E_{q(Z,\beta)}[\log p(\beta \mid X,Z)] - E_{q(Z,\beta)}[\log q(\beta \mid \lambda)] + \text{const}
\end{aligned}
$$
where the constant collects the terms that do not depend on $\lambda$.
- Substitute the exponential family forms of $p(\beta \mid Z,X)$ and $q(\beta \mid \lambda)$ into the expression above. Up to terms that do not depend on $\lambda$:
$$
\begin{aligned}
L(\lambda, \phi) &= E_{q(Z,\beta)}[\log h(\beta)] + E_{q(Z,\beta)}[T(\beta)^T\eta(Z,X)] - E_{q(Z,\beta)}[A_g(\eta(X,Z))]\\
&\quad - E_{q(Z,\beta)}[\log h(\beta)] - E_{q(Z,\beta)}[T(\beta)^T\lambda] + E_{q(Z,\beta)}[A_g(\lambda)]\\
&= E_{q(\beta)}[T(\beta)]^T E_{q(Z)}[\eta(Z,X)] - E_{q(Z)}[A_g(\eta(X,Z))] - E_{q(\beta)}[T(\beta)]^T\lambda + A_g(\lambda)\\
&= A_g'(\lambda)^T E_{q(Z)}[\eta(Z,X)] - A_g'(\lambda)^T\lambda + A_g(\lambda) + \text{const}
\end{aligned}
$$
The $\log h(\beta)$ terms cancel, and the expectations factorize because $q(Z,\beta) = q(\beta \mid \lambda)\,q(Z \mid \phi)$. The last step uses $E_{q(\beta \mid \lambda)}[T(\beta)] = A_g'(\lambda)$ and moves $E_{q(Z)}[A_g(\eta(X,Z))]$, which does not depend on $\lambda$, into the constant.
- Differentiate the expression above with respect to $\lambda$ and set the result to zero:
$$
\begin{aligned}
\frac{\partial L(\lambda, \phi)}{\partial \lambda} &= A_g''(\lambda)\,E_{q(Z)}[\eta(Z,X)] - A_g'(\lambda) - A_g''(\lambda)\,\lambda + A_g'(\lambda)\\
&= A_g''(\lambda)\left(E_{q(Z)}[\eta(Z,X)] - \lambda\right) = 0
\end{aligned}
$$
- If $A_g''(\lambda) \neq 0$ (for vector-valued $\lambda$, if the matrix $A_g''(\lambda)$ is invertible), then
$$
\lambda = E_{q(Z \mid \phi)}[\eta(Z,X)]
$$
Similarly, fixing $\lambda$ and optimizing $\phi$ gives
$$
\phi = E_{q(\beta \mid \lambda)}[\eta(X,\beta)]
$$
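The alternating updates above can be sketched on a small conjugate model. The example below is an assumption chosen for illustration, not the model from the text: data $x_i \sim N(\mu, \tau^{-1})$ with priors $\mu \mid \tau \sim N(\mu_0, (\lambda_0\tau)^{-1})$ and $\tau \sim \mathrm{Gamma}(a_0, b_0)$, and a factorized approximation $q(\mu)\,q(\tau)$. Here the two blocks are $\mu$ and $\tau$ rather than $\beta$ and $Z$, but the pattern is the same: each factor's parameters are set to the expectation, under the other factor, of the natural parameters of its conditional posterior. All names and default hyperparameters are hypothetical.

```python
def cavi_normal_gamma(xs, mu0=0.0, lam0=1e-3, a0=1e-3, b0=1e-3, n_iters=50):
    """Coordinate ascent for q(mu) = Normal(mu_n, 1/lam_n), q(tau) = Gamma(a_n, b_n)."""
    n = len(xs)
    xbar = sum(xs) / n

    # The conditional of mu given tau is Normal with mean mu_n and precision
    # (lam0 + n) * tau. Its natural parameters are linear in tau, so taking
    # the expectation under q(tau) simply replaces tau by E[tau].
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)

    # The conditional of tau given mu is Gamma with a fixed shape parameter.
    a_n = a0 + (n + 1) / 2.0
    b_n = b0 + 1.0  # arbitrary positive starting value

    for _ in range(n_iters):
        # Update q(mu) with q(tau) held fixed.
        e_tau = a_n / b_n
        lam_n = (lam0 + n) * e_tau

        # Update q(tau) with q(mu) held fixed: expected squared deviations.
        e_sq_data = sum((x - mu_n) ** 2 for x in xs) + n / lam_n
        e_sq_prior = (mu_n - mu0) ** 2 + 1.0 / lam_n
        b_n = b0 + 0.5 * (e_sq_data + lam0 * e_sq_prior)

    lam_n = (lam0 + n) * a_n / b_n
    return mu_n, lam_n, a_n, b_n

data = [1.0, 2.0, 3.0, 4.0, 5.0]  # toy data, chosen only for illustration
mu_n, lam_n, a_n, b_n = cavi_normal_gamma(data)
# With these weak priors, mu_n should sit near the sample mean and
# a_n / b_n near the inverse of the sample variance.
print(mu_n, a_n / b_n)
```

In this particular model `mu_n` and `a_n` do not change between iterations, so only `lam_n` and `b_n` are actually coupled; richer models with latent variables per data point follow the same fix-one-update-the-other loop.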