Logistic Regression Model
The Logistic Distribution
- Definition: Let $X$ be a continuous random variable. $X$ follows a logistic distribution if $X$ has the following distribution function and density function:
$$F(x)=P(X\le x)=\frac{1}{1+e^{-(x-\mu)/\gamma}}$$
$$f(x)=F'(x)=\frac{e^{-(x-\mu)/\gamma}}{\gamma\left(1+e^{-(x-\mu)/\gamma}\right)^2}$$
where $\mu$ is the location parameter and $\gamma>0$ is the shape parameter. The distribution function is centrally symmetric about the point $(\mu,\frac{1}{2})$, i.e. it satisfies $F(-x+\mu)-\frac{1}{2}=-F(x+\mu)+\frac{1}{2}$. It grows quickly near the center and slowly in the two tails; the smaller $\gamma$ is, the faster it grows near the center.
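The distribution and density functions above are easy to check numerically. A minimal sketch (function names and parameter values are my own, for illustration) evaluates both and verifies the symmetry about $(\mu,\frac{1}{2})$:

```python
import math

def logistic_cdf(x, mu=0.0, gamma=1.0):
    """F(x) = 1 / (1 + exp(-(x - mu)/gamma))."""
    return 1.0 / (1.0 + math.exp(-(x - mu) / gamma))

def logistic_pdf(x, mu=0.0, gamma=1.0):
    """f(x) = F'(x); peaks at x = mu with value 1/(4*gamma)."""
    e = math.exp(-(x - mu) / gamma)
    return e / (gamma * (1.0 + e) ** 2)

# Symmetry about (mu, 1/2): F(-x+mu) - 1/2 == -(F(x+mu) - 1/2)
mu, gamma, x = 2.0, 0.5, 1.3
lhs = logistic_cdf(-x + mu, mu, gamma) - 0.5
rhs = -(logistic_cdf(x + mu, mu, gamma) - 0.5)
print(abs(lhs - rhs) < 1e-12)  # True
```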
Binomial Logistic Regression Model
- Definition: The binomial logistic regression model is the following conditional probability distribution:
$$P(y=1|x)=\frac{\exp(w\cdot x+b)}{1+\exp(w\cdot x+b)}$$
$$P(y=0|x)=\frac{1}{1+\exp(w\cdot x+b)}$$
Here $x \in R^n$ is the input, $y \in \{0,1\}$ is the output, and $w \in R^n$ and $b \in R$ are the parameters: $w$ is called the weight vector, $b$ the bias, and $w \cdot x$ denotes the inner product of $w$ and $x$. If we augment $w=(w^{(1)},w^{(2)},...,w^{(n)},b)^T$ and $x=(x^{(1)},x^{(2)},...,x^{(n)},1)^T$, then the model can be written using $w \cdot x$ alone.
- Odds: the ratio of the probability that an event occurs to the probability that it does not, $\frac{p}{1-p}$; the log-odds is $\mathrm{logit}(p)=\log\frac{p}{1-p}$.
For the logistic regression model,
$$\log \frac{P(y=1|x)}{1-P(y=1|x)}=w \cdot x$$
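With the augmented vectors, both conditional probabilities follow from one inner product. A small sketch (the weights and input below are hypothetical) that also confirms the log-odds identity:

```python
import math

def predict_proba(w, x):
    """P(y=1|x) with augmented w=(w1,...,wn,b) and x=(x1,...,xn,1)."""
    z = sum(wi * xi for wi, xi in zip(w, x))  # z = w . x
    return math.exp(z) / (1.0 + math.exp(z))

w = [0.5, -1.2, 0.3]   # hypothetical weights; last entry plays the role of b
x = [2.0, 1.0, 1.0]    # input augmented with a trailing 1
p1 = predict_proba(w, x)
print(p1, 1.0 - p1)    # P(y=1|x) and P(y=0|x)
# The log-odds equals w . x = 0.5*2 - 1.2*1 + 0.3*1 = 0.1
print(math.log(p1 / (1.0 - p1)))
```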
Model Parameter Estimation
Let
$$P(y=1|x)=\pi(x),\qquad P(y=0|x)=1-\pi(x)$$
The likelihood function is
$$\prod_{i=1}^N[\pi(x_i)]^{y_i}[1-\pi(x_i)]^{1-y_i}$$
and the log-likelihood is
$$L(w)=\sum_{i=1}^N[y_i\log \pi(x_i)+(1-y_i)\log(1-\pi(x_i))]$$
$$=\sum_{i=1}^N\left[y_i\log\frac{\pi(x_i)}{1-\pi(x_i)}+\log(1-\pi(x_i))\right]$$
$$=\sum_{i=1}^N[y_i(w \cdot x_i)-\log(1+\exp(w \cdot x_i))]$$
Maximizing $L(w)$ gives the estimate of $w$.
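Differentiating the last expression gives $\nabla L(w)=\sum_{i=1}^N(y_i-\pi(x_i))x_i$, so the maximum can be approached by gradient ascent. A minimal sketch on made-up data (function names, learning rate, and data are illustrative assumptions, not from the text):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Maximize L(w) by gradient ascent; grad = sum_i (y_i - pi(x_i)) x_i."""
    n = len(X[0])
    w = [0.0] * n
    for _ in range(steps):
        grad = [0.0] * n
        for xi, yi in zip(X, y):
            err = yi - sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for j in range(n):
                grad[j] += err * xi[j]
        w = [wj + lr * gj for wj, gj in zip(w, grad)]
    return w

# Toy 1-D data, augmented with a trailing 1 acting as the bias component
X = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
y = [0, 0, 1, 1]
w = fit_logistic(X, y)
probs = [sigmoid(sum(a * b for a, b in zip(w, xi))) for xi in X]
print(probs)  # should increase with x, low for the 0-labels, high for the 1-labels
```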
Multinomial Logistic Regression
$$P(y=k|x)=\frac{\exp(w_k \cdot x)}{1+\sum\limits_{j=1}^{K-1}\exp(w_j \cdot x)},\qquad k=1,2,...,K-1$$
$$P(y=K|x)=\frac{1}{1+\sum\limits_{j=1}^{K-1}\exp(w_j \cdot x)}$$
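The two formulas share one normalizer, with class $K$ acting as the reference class. A sketch evaluating them together (the weights and input are hypothetical):

```python
import math

def multinomial_probs(W, x):
    """P(y=k|x) for k=1..K; W holds w_1..w_{K-1}, class K is the reference."""
    scores = [math.exp(sum(wi * xi for wi, xi in zip(wk, x))) for wk in W]
    z = 1.0 + sum(scores)                      # shared normalizer
    return [s / z for s in scores] + [1.0 / z]  # last entry is P(y=K|x)

W = [[0.2, 0.1], [-0.3, 0.4]]  # hypothetical w_1, w_2 for K = 3 classes
x = [1.0, 2.0]
p = multinomial_probs(W, x)
print(p, sum(p))  # a valid distribution over the K classes
```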
Maximum Entropy Model
The maximum entropy model is derived from the maximum entropy principle.
The Maximum Entropy Principle
The maximum entropy principle states that, when learning a probability model, among all feasible probability models the one with the largest entropy is the best model.
$$H(P)=-\sum_x P(x)\log P(x)$$
which satisfies $0 \le H(P)\le \log |X|$, where $|X|$ is the number of values $X$ can take. In other words, subject to the given constraints, the best model treats the outcomes that remain undetermined as equally likely.
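The bound $H(P)\le\log|X|$, attained by the uniform distribution, can be checked directly (the example distributions are my own):

```python
import math

def entropy(p):
    """H(P) = -sum p log p (natural log); terms with p = 0 contribute 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

uniform = [0.25] * 4           # |X| = 4, all outcomes equally likely
skewed = [0.7, 0.1, 0.1, 0.1]  # same support, concentrated
print(entropy(uniform), math.log(4))  # uniform attains the bound log|X|
print(entropy(skewed))                # strictly smaller
```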
Definition of the Maximum Entropy Model
Assume the classification model is a conditional probability distribution $P(Y|X)$. Given a training data set
$$T=\{(x_1,y_1),(x_2,y_2),...,(x_N,y_N)\}$$
the learning goal is to use the maximum entropy principle to select the best model.
First determine the empirical distributions of $P(X,Y)$ and $P(X)$:
$$\hat{P}(X=x,Y=y)=\frac{v(X=x,Y=y)}{N}$$
$$\hat{P}(X=x)=\frac{v(X=x)}{N}$$
where $v(X=x)$ is the number of samples with $X=x$, and likewise $v(X=x,Y=y)$ is the number of samples with $X=x$ and $Y=y$.
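The counts $v(\cdot)$ are just frequency tables over the training set. A sketch with a made-up data set:

```python
from collections import Counter

# Hypothetical training set T of (x, y) pairs, N = 5
T = [("sunny", 1), ("sunny", 1), ("rainy", 0), ("sunny", 0), ("rainy", 0)]
N = len(T)

joint = Counter(T)                   # v(X=x, Y=y)
marginal = Counter(x for x, _ in T)  # v(X=x)

P_hat_xy = {xy: c / N for xy, c in joint.items()}
P_hat_x = {x: c / N for x, c in marginal.items()}
print(P_hat_xy[("sunny", 1)])  # 2/5 = 0.4
print(P_hat_x["rainy"])        # 2/5 = 0.4
```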
A feature function $f(x,y)$ describes some fact between the input $x$ and the output $y$:
$$f(x,y)=\begin{cases}1 & x,y \text{ satisfy the fact}\\ 0 & \text{otherwise}\end{cases}$$
The expected value of the feature function with respect to the empirical distribution $\hat{P}(X,Y)$, denoted $E_{\hat{P}}(f)$, is
$$E_{\hat{P}}(f)=\sum_{x,y}\hat{P}(x,y)f(x,y)$$
The expected value of the feature function with respect to the model $P(Y|X)$ and the empirical distribution $\hat{P}(X)$, denoted $E_P(f)$, is
$$E_P(f)=\sum_{x,y}\hat{P}(x)P(y|x)f(x,y)$$
If the model is able to capture the information in the training data, then
$$E_{\hat{P}}(f)=E_P(f)$$
- Definition
Let the set of models satisfying all the constraint conditions be
$$C=\{P \in \mathcal{P} \mid E_{\hat{P}}(f_i)=E_P(f_i),\ i=1,2,...,n\}$$
The conditional entropy defined on the conditional probability distribution $P(Y|X)$ is
$$H(P)=-\sum_{x,y}\hat{P}(x)P(y|x)\log P(y|x)$$
The model in the set $C$ with the largest conditional entropy is called the maximum entropy model.
Learning the Maximum Entropy Model
$$\min_{P \in C}\quad -H(P)=\sum_{x,y}\hat{P}(x)P(y|x)\log P(y|x)$$
$$\text{s.t.}\quad E_{\hat{P}}(f_i)=E_P(f_i),\quad i=1,2,...,n$$
$$\sum_y P(y|x)=1$$
Introduce Lagrange multipliers $w_0,w_1,...,w_n$ and define the Lagrangian $L(P,w)$:
$$L(P,w)=-H(P)+w_0\Big(1-\sum_y P(y|x)\Big)+\sum_{i=1}^n w_i\big(E_{\hat{P}}(f_i)-E_P(f_i)\big)$$
$$=\sum_{x,y}\hat{P}(x)P(y|x)\log P(y|x)+w_0\Big(1-\sum_y P(y|x)\Big)+\sum_{i=1}^n w_i\Big(\sum_{x,y}\hat{P}(x,y)f_i(x,y)-\sum_{x,y}\hat{P}(x)P(y|x)f_i(x,y)\Big)$$
The primal problem is $\min\limits_{P \in C}\ \max\limits_w\ L(P,w)$, and its dual problem is $\max\limits_w\ \min\limits_{P \in C}\ L(P,w)$.
Setting the derivative of $L(P,w)$ with respect to $P(y|x)$ to zero:
$$\frac{\partial L(P,w)}{\partial P(y|x)}=\sum_{x,y}\hat{P}(x)\big(\log P(y|x)+1\big)-\sum_y w_0-\sum_{x,y}\Big(\hat{P}(x)\sum_{i=1}^n w_i f_i(x,y)\Big)$$
$$=\sum_{x,y}\hat{P}(x)\Big(\log P(y|x)+1-w_0-\sum_{i=1}^n w_i f_i(x,y)\Big)=0$$
Hence
$$P(y|x)=\exp\Big(\sum_{i=1}^n w_i f_i(x,y)+w_0-1\Big)=\frac{\exp\big(\sum_{i=1}^n w_i f_i(x,y)\big)}{\exp(1-w_0)}$$
Combining this with
$$\sum_y P(y|x)=1$$
we obtain
$$P_w(y|x)=\frac{1}{Z_w(x)}\exp\Big(\sum_{i=1}^n w_i f_i(x,y)\Big)$$
where
$$Z_w(x)=\sum_y \exp\Big(\sum_{i=1}^n w_i f_i(x,y)\Big)$$
Substituting this form back into $L(P,w)$, it remains only to maximize over $w$.
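The model form $P_w(y|x)=\frac{1}{Z_w(x)}\exp\big(\sum_i w_i f_i(x,y)\big)$ can be evaluated directly. A sketch with two hypothetical binary features (the features, weights, and labels are illustrative):

```python
import math

def max_ent_prob(w, feats, x, ys):
    """P_w(y|x) = exp(sum_i w_i f_i(x,y)) / Z_w(x)."""
    def score(y):
        return math.exp(sum(wi * f(x, y) for wi, f in zip(w, feats)))
    z = sum(score(y) for y in ys)  # Z_w(x): normalize over all labels
    return {y: score(y) / z for y in ys}

# Two hypothetical indicator features
feats = [lambda x, y: 1.0 if (x == "a" and y == 1) else 0.0,
         lambda x, y: 1.0 if y == 0 else 0.0]
w = [1.5, 0.5]
p = max_ent_prob(w, feats, "a", ys=[0, 1])
print(p, sum(p.values()))  # a valid conditional distribution over y
```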
Maximum Likelihood Estimation
The log-likelihood is
$$L_{\hat{P}}(P_w)=\log \prod_{x,y}P(y|x)^{\hat{P}(x,y)}=\sum_{x,y}\hat{P}(x,y)\log P(y|x)$$
$$=\sum_{x,y}\hat{P}(x,y)\sum_{i=1}^n w_i f_i(x,y)-\sum_{x,y}\hat{P}(x,y)\log Z_w(x)$$
$$=\sum_{x,y}\hat{P}(x,y)\sum_{i=1}^n w_i f_i(x,y)-\sum_x \hat{P}(x)\log Z_w(x)$$
Meanwhile the dual function is
$$\psi(w)=\sum_{x,y}\hat{P}(x)P_w(y|x)\log P_w(y|x)+\sum_{i=1}^n w_i\Big(\sum_{x,y}\hat{P}(x,y)f_i(x,y)-\sum_{x,y}\hat{P}(x)P_w(y|x)f_i(x,y)\Big)$$
$$=\sum_{x,y}\hat{P}(x,y)\sum_{i=1}^n w_i f_i(x,y)+\sum_{x,y}\hat{P}(x)P_w(y|x)\Big(\log P_w(y|x)-\sum_{i=1}^n w_i f_i(x,y)\Big)$$
$$=\sum_{x,y}\hat{P}(x,y)\sum_{i=1}^n w_i f_i(x,y)-\sum_{x,y}\hat{P}(x)P_w(y|x)\log Z_w(x)$$
$$=\sum_{x,y}\hat{P}(x,y)\sum_{i=1}^n w_i f_i(x,y)-\sum_x \hat{P}(x)\log Z_w(x)$$
Thus, for the maximum entropy model, maximizing the dual function is equivalent to maximum likelihood estimation.
Optimization Algorithms for Model Learning
- Improved iterative scaling (IIS)
- Gradient descent
- Newton's method
- Quasi-Newton methods
Improved Iterative Scaling (IIS)
We seek $\delta=(\delta_1,\delta_2,...,\delta_n)$ such that $w+\delta$ is better than $w$:
$$L(w+\delta)-L(w)=\sum_{x,y}\hat{P}(x,y)\log P_{w+\delta}(y|x)-\sum_{x,y}\hat{P}(x,y)\log P_w(y|x)$$
$$=\sum_{x,y}\hat{P}(x,y)\sum_{i=1}^n \delta_i f_i(x,y)-\sum_x \hat{P}(x)\log\frac{Z_{w+\delta}(x)}{Z_w(x)}$$
Using the inequality
$$-\log a \ge 1-a,\quad a>0$$
we get
$$L(w+\delta)-L(w)\ge \sum_{x,y}\hat{P}(x,y)\sum_{i=1}^n \delta_i f_i(x,y)+1-\sum_x\hat{P}(x)\frac{Z_{w+\delta}(x)}{Z_w(x)}$$
$$=\sum_{x,y}\hat{P}(x,y)\sum_{i=1}^n \delta_i f_i(x,y)+1-\sum_x\hat{P}(x)\sum_y P_w(y|x)\exp\sum_{i=1}^n \delta_i f_i(x,y)$$
Denote the right-hand side by
$$A(\delta|w)=\sum_{x,y}\hat{P}(x,y)\sum_{i=1}^n \delta_i f_i(x,y)+1-\sum_x\hat{P}(x)\sum_y P_w(y|x)\exp\sum_{i=1}^n \delta_i f_i(x,y)$$
so that
$$L(w+\delta)-L(w)\ge A(\delta|w)$$
However, since $\delta$ is a vector whose components are coupled through the exponential, $A(\delta|w)$ is still hard to optimize directly, so we lower-bound it further. Define
$$f^\#(x,y)=\sum_i f_i(x,y)$$
Then
$$A(\delta|w)=\sum_{x,y}\hat{P}(x,y)\sum_{i=1}^n\delta_i f_i(x,y)+1-\sum_x\hat{P}(x)\sum_y P_w(y|x)\exp\Big(f^\#(x,y)\sum_{i=1}^n\frac{\delta_i f_i(x,y)}{f^\#(x,y)}\Big)$$
Since $\frac{f_i(x,y)}{f^\#(x,y)}\ge 0$ and $\sum\limits_{i=1}^n\frac{f_i(x,y)}{f^\#(x,y)}=1$, Jensen's inequality (applied to the convex exponential) gives
$$\exp\Big(\sum_{i=1}^n\frac{f_i(x,y)}{f^\#(x,y)}\delta_i f^\#(x,y)\Big)\le\sum_{i=1}^n\frac{f_i(x,y)}{f^\#(x,y)}\exp\big(\delta_i f^\#(x,y)\big)$$
Therefore
$$A(\delta|w)\ge B(\delta|w)=\sum_{x,y}\hat{P}(x,y)\sum_{i=1}^n\delta_i f_i(x,y)+1-\sum_x\hat{P}(x)\sum_y P_w(y|x)\sum_{i=1}^n\frac{f_i(x,y)}{f^\#(x,y)}\exp\big(\delta_i f^\#(x,y)\big)$$
and hence
$$L(w+\delta)-L(w)\ge B(\delta|w)$$
Setting the partial derivative of $B$ with respect to $\delta_i$ to zero:
$$\frac{\partial B(\delta|w)}{\partial \delta_i}=\sum_{x,y}\hat{P}(x,y)f_i(x,y)-\sum_x\hat{P}(x)\sum_y P_w(y|x)f_i(x,y)\exp\big(\delta_i f^\#(x,y)\big)=0$$
yields
$$\sum_x\hat{P}(x)\sum_y P_w(y|x)f_i(x,y)\exp\big(\delta_i f^\#(x,y)\big)=E_{\hat{P}}(f_i)$$
- Algorithm
Input: feature functions $f_1,f_2,...,f_n$; empirical distribution $\hat{P}(X,Y)$; model $P_w(y|x)$
Output: optimal parameters $w^*$
(1) For all $i \in \{1,2,...,n\}$, set $w_i=0$.
(2) For each $i$, solve the equation
$$\sum_x\hat{P}(x)\sum_y P_w(y|x)f_i(x,y)\exp\big(\delta_i f^\#(x,y)\big)=E_{\hat{P}}(f_i)$$
for $\delta_i$ and update $w_i \longleftarrow w_i+\delta_i$.
(3) If not all of the $w_i$ have converged, repeat (2).
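In the special case where $f^\#(x,y)=M$ is the same constant for every $(x,y)$, the equation in step (2) has the closed-form solution $\delta_i=\frac{1}{M}\log\frac{E_{\hat{P}}(f_i)}{E_P(f_i)}$. The sketch below (my own construction, not from the text) implements one IIS sweep under that assumption, on a toy problem where exactly one indicator feature fires per label, so $M=1$:

```python
import math

def iis_step(w, feats, P_hat_xy, P_hat_x, ys, M):
    """One IIS sweep assuming f#(x,y) = M for all (x,y), so delta is closed-form."""
    def pw(x):  # current model P_w(y|x)
        s = {y: math.exp(sum(wi * f(x, y) for wi, f in zip(w, feats))) for y in ys}
        z = sum(s.values())
        return {y: v / z for y, v in s.items()}
    new_w = []
    for i, f in enumerate(feats):
        e_hat = sum(p * f(x, y) for (x, y), p in P_hat_xy.items())
        e_mod = sum(px * pw(x)[y] * f(x, y) for x, px in P_hat_x.items() for y in ys)
        new_w.append(w[i] + math.log(e_hat / e_mod) / M)
    return new_w

# Toy problem: one x value, two labels; exactly one feature fires, so f# = 1
feats = [lambda x, y: 1.0 if y == 1 else 0.0,
         lambda x, y: 1.0 if y == 0 else 0.0]
P_hat_xy = {("a", 1): 0.7, ("a", 0): 0.3}
P_hat_x = {"a": 1.0}
w = iis_step([0.0, 0.0], feats, P_hat_xy, P_hat_x, ys=[0, 1], M=1)
s1, s0 = math.exp(w[0]), math.exp(w[1])
print(s1 / (s0 + s1))  # fitted P_w(y=1|"a"), matching the empirical 0.7
```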
Quasi-Newton Method
For the maximum entropy model, the objective function is
$$\min_{w \in R^n}\quad f(w)=\sum_x\hat{P}(x)\log\sum_y\exp\Big(\sum_{i=1}^n w_i f_i(x,y)\Big)-\sum_{x,y}\hat{P}(x,y)\sum_{i=1}^n w_i f_i(x,y)$$
The gradient is
$$g(w)=\Big(\frac{\partial f(w)}{\partial w_1},\frac{\partial f(w)}{\partial w_2},...,\frac{\partial f(w)}{\partial w_n}\Big)^T$$
with
$$\frac{\partial f(w)}{\partial w_i}=\sum_{x,y}\hat{P}(x)P_w(y|x)f_i(x,y)-E_{\hat{P}}(f_i)$$
- Algorithm (BFGS)
Input: feature functions $f_1,f_2,...,f_n$; empirical distribution $\hat{P}(x,y)$; objective function $f(w)$; gradient $g(w)=\nabla f(w)$; precision requirement $\varepsilon$
Output: optimal parameters $w^*$
(1) Choose an initial point $w^{(0)}$, take $B_0$ to be a symmetric positive definite matrix, and set $k=0$.
(2) Compute $g_k=g(w^{(k)})$. If $\|g_k\|<\varepsilon$, stop and take $w^*=w^{(k)}$; otherwise go to (3).
(3) Solve $B_k p_k=-g_k$ for $p_k$.
(4) One-dimensional search: find $\lambda_k$ such that
$$f(w^{(k)}+\lambda_k p_k)=\min_{\lambda\ge 0}f(w^{(k)}+\lambda p_k)$$
(5) Set $w^{(k+1)}=w^{(k)}+\lambda_k p_k$.
(6) Compute $g_{k+1}=g(w^{(k+1)})$. If $\|g_{k+1}\|<\varepsilon$, stop and take $w^*=w^{(k+1)}$; otherwise compute
$$B_{k+1}=B_k+\frac{y_k y_k^T}{y_k^T \delta_k}-\frac{B_k \delta_k \delta_k^T B_k}{\delta_k^T B_k \delta_k}$$
where $y_k=g_{k+1}-g_k$ and $\delta_k=w^{(k+1)}-w^{(k)}$.
(7) Set $k=k+1$ and go to (3).
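Steps (1)-(7) can be sketched generically; the version below substitutes a crude backtracking search for the exact one-dimensional search in step (4), and the quadratic test problem is a made-up stand-in for the maximum entropy objective:

```python
import numpy as np

def bfgs(f, grad, w0, eps=1e-8, max_iter=100):
    """Minimal BFGS following steps (1)-(7); backtracking replaces exact line search."""
    w = np.asarray(w0, dtype=float)
    B = np.eye(len(w))                  # B_0: symmetric positive definite
    for _ in range(max_iter):
        g = grad(w)                     # (2) stop when the gradient is small
        if np.linalg.norm(g) < eps:
            break
        p = np.linalg.solve(B, -g)      # (3) solve B_k p_k = -g_k
        lam = 1.0                       # (4) crude backtracking line search
        while f(w + lam * p) > f(w) and lam > 1e-12:
            lam *= 0.5
        w_new = w + lam * p             # (5)
        y = grad(w_new) - g             # (6) BFGS update of B_k
        d = w_new - w
        if abs(y @ d) > 1e-12:
            B = B + np.outer(y, y) / (y @ d) \
                  - (B @ np.outer(d, d) @ B) / (d @ B @ d)
        w = w_new                       # (7) next iteration
    return w

# Hypothetical quadratic test problem with minimum at (1, -2)
f = lambda w: (w[0] - 1) ** 2 + 2 * (w[1] + 2) ** 2
grad = lambda w: np.array([2 * (w[0] - 1), 4 * (w[1] + 2)])
w_star = bfgs(f, grad, [0.0, 0.0])
print(w_star)  # close to [1, -2]
```

For a real maximum entropy model, `f` and `grad` would be the objective and gradient given above, with $P_w(y|x)$ recomputed from $w$ at each evaluation.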