5. Generalized Linear Models
- Exponential family
$$p(y;\eta)=b(y)\,e^{\eta^T T(y)-a(\eta)}$$
$\eta$ is called the natural parameter of the distribution; usually $T(y)=y$.
- The Bernoulli distribution belongs to the exponential family:
$$p(y;\phi)=\phi^y(1-\phi)^{1-y}=e^{y\log\phi+(1-y)\log(1-\phi)}=e^{\left[\log\left(\frac{\phi}{1-\phi}\right)\right]y+\log(1-\phi)}$$
$$\begin{aligned} T(y)&=y\\ a(\eta)&=-\log(1-\phi)=\log(1+e^{\eta})\\ b(y)&=1 \end{aligned}$$
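As a sanity check on this parameterization, inverting $\eta=\log\frac{\phi}{1-\phi}$ recovers the sigmoid function, which is exactly the hypothesis used by logistic regression:
$$e^{\eta}=\frac{\phi}{1-\phi} \quad\Longrightarrow\quad \phi=\frac{e^{\eta}}{1+e^{\eta}}=\frac{1}{1+e^{-\eta}}$$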
- The normal (Gaussian) distribution belongs to the exponential family
- The multinomial distribution belongs to the exponential family
- The Gamma and exponential distributions belong to the exponential family
- The Beta and Dirichlet distributions belong to the exponential family
Building the model
- Assumptions:
- $y \mid x;\theta \sim \text{ExponentialFamily}(\eta)$: given $x$ and $\theta$, the distribution of $y$ belongs to the exponential family with natural parameter $\eta$.
- The prediction $h(x)$ output by $h$ must satisfy $h(x) = E[y \mid x]$.
- The natural parameter $\eta$ is linearly related to the input $x$: $\eta = \theta^T x$; if $\eta$ is vector-valued, then $\eta_i = \theta_i^T x$.
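For example, combining these three assumptions with the Bernoulli derivation above immediately yields logistic regression:
$$h_\theta(x)=E[y \mid x;\theta]=\phi=\frac{1}{1+e^{-\eta}}=\frac{1}{1+e^{-\theta^T x}}$$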
- Softmax regression
Suppose there are $k$ classes; the class label is then written in vector form as $T(y) \in \mathbb{R}^{k-1}$:
$$T(1)=\begin{bmatrix}1\\0\\0\\\vdots\\0\end{bmatrix},\quad T(2)=\begin{bmatrix}0\\1\\0\\\vdots\\0\end{bmatrix},\quad T(3)=\begin{bmatrix}0\\0\\1\\\vdots\\0\end{bmatrix},\quad \dots,\quad T(k-1)=\begin{bmatrix}0\\0\\0\\\vdots\\1\end{bmatrix},\quad T(k)=\begin{bmatrix}0\\0\\0\\\vdots\\0\end{bmatrix}$$
We write the $i$-th element of the vector $T(y)$ as $(T(y))_i$.
The probabilities are written $p(y=i;\phi)=\phi_i$ and $p(y=k;\phi)=1-\sum_{i=1}^{k-1}\phi_i$:
$$\begin{aligned} p(y;\phi) &=\phi_1^{1\{y=1\}}\phi_2^{1\{y=2\}}\cdots\phi_k^{1\{y=k\}} \\ &=\phi_1^{1\{y=1\}}\phi_2^{1\{y=2\}}\cdots\phi_k^{1-\sum_{i=1}^{k-1}1\{y=i\}} \\ &=\phi_1^{(T(y))_1}\phi_2^{(T(y))_2}\cdots\phi_k^{1-\sum_{i=1}^{k-1}(T(y))_i} \\ &=\exp\left((T(y))_1\log(\phi_1)+(T(y))_2\log(\phi_2)+\dots+\left(1-\sum_{i=1}^{k-1}(T(y))_i\right)\log(\phi_k)\right) \\ &=\exp\left((T(y))_1\log\left(\frac{\phi_1}{\phi_k}\right)+(T(y))_2\log\left(\frac{\phi_2}{\phi_k}\right)+\dots+(T(y))_{k-1}\log\left(\frac{\phi_{k-1}}{\phi_k}\right)+\log(\phi_k)\right) \\ &=b(y)\exp\left(\eta^T T(y)-a(\eta)\right) \end{aligned}$$
where:
$$\begin{aligned} \eta &= \begin{bmatrix} \log(\phi_1/\phi_k)\\ \log(\phi_2/\phi_k)\\ \vdots\\ \log(\phi_{k-1}/\phi_k) \end{bmatrix},\\ a(\eta) &= -\log(\phi_k),\\ b(y) &= 1 \end{aligned}$$
Therefore:
$$\eta_i = \log\frac{\phi_i}{\phi_k}$$
From this we can derive:
$$\begin{aligned} e^{\eta_i} &= \frac{\phi_i}{\phi_k}\\ \phi_k e^{\eta_i} &= \phi_i\\ \phi_k\sum_{i=1}^{k} e^{\eta_i} &= \sum_{i=1}^{k}\phi_i = 1\\ \phi_k &= \frac{1}{\sum_{i=1}^{k} e^{\eta_i}}\\ \phi_i &= \frac{e^{\eta_i}}{\sum_{j=1}^{k} e^{\eta_j}} \end{aligned}$$
The following function, which maps $\eta$ to $\phi$, is called the Softmax function:
$$\phi_i = \frac{e^{\eta_i}}{\sum_{j=1}^{k} e^{\eta_j}}$$
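A minimal NumPy sketch of this function (the max-subtraction is a standard numerical-stability trick, not part of the derivation above):

```python
import numpy as np

def softmax(eta):
    """Map natural parameters eta (a length-k vector) to probabilities phi."""
    # Subtracting max(eta) leaves the result unchanged (it cancels in the
    # ratio) but prevents overflow in exp() for large entries.
    e = np.exp(eta - np.max(eta))
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # e.g. [0.659 0.242 0.099], sums to 1
```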
By assumption 3, $\eta_i$ is a linear function of $x$, so $\eta_i = \theta_i^T x$ for $i = 1, \dots, k-1$, where $\theta_1, \dots, \theta_{k-1} \in \mathbb{R}^{n+1}$ are the parameters of our model. The model therefore assumes that the conditional distribution of $y$ given $x$ is:
$$\begin{aligned} p(y=i \mid x;\theta) &= \phi_i \\ &= \frac{e^{\eta_i}}{\sum_{j=1}^{k} e^{\eta_j}} \\ &= \frac{e^{\theta_i^T x}}{\sum_{j=1}^{k} e^{\theta_j^T x}} \end{aligned}$$
This model, which applies to classification problems with $y \in \{1, \dots, k\}$, is called Softmax regression; it is a generalization of logistic regression. By assumption 2:
$$\begin{aligned} h_\theta(x) &= E[T(y) \mid x;\theta] \\ &= \begin{bmatrix} \phi_1\\ \phi_2\\ \vdots\\ \phi_{k-1} \end{bmatrix} \\ &= \begin{bmatrix} \frac{\exp(\theta_1^T x)}{\sum_{j=1}^{k}\exp(\theta_j^T x)}\\ \frac{\exp(\theta_2^T x)}{\sum_{j=1}^{k}\exp(\theta_j^T x)}\\ \vdots\\ \frac{\exp(\theta_{k-1}^T x)}{\sum_{j=1}^{k}\exp(\theta_j^T x)} \end{bmatrix} \end{aligned}$$
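Continuing the sketch above, the hypothesis is just the softmax of the logits $\theta_i^T x$; here $\theta$ is assumed to be stored as an $n \times k$ matrix whose columns are the $\theta_i$ (the `softmax()` function is reused from the earlier sketch):

```python
def h(theta, x):
    """Class probabilities Softmax(theta^T x).

    theta: (n, k) matrix whose columns are theta_1, ..., theta_k
           (with the convention theta_k = 0 noted after the likelihood below)
    x:     (n,) feature vector
    Returns all k probabilities; the text's h_theta(x) is the first k-1 of them.
    """
    return softmax(theta.T @ x)
```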
Finally, we fit the parameters $\theta_i$ by maximum likelihood:
$$\begin{aligned} l(\theta) &= \sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)};\theta) \\ &= \sum_{i=1}^{m} \log \prod_{l=1}^{k} \left( \frac{e^{\theta_l^T x^{(i)}}}{\sum_{j=1}^{k} e^{\theta_j^T x^{(i)}}} \right)^{1\{y^{(i)}=l\}} \end{aligned}$$
where $m$ is the number of samples and $k$ the number of classes; each class probability $\phi_i$ has a corresponding parameter vector $\theta_i$. (Note: since the probability of the $k$-th class can be obtained as $1-\sum_{i=1}^{k-1}\phi_i$, we fix $\theta_k = 0$, so that $\eta_k = 0$. This is why logistic regression, the binary case, has only a single parameter vector $\theta$.)
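A direct, unvectorized sketch of this log-likelihood, reusing `h()` from above and assuming 0-based integer labels:

```python
def log_likelihood(theta, X, y):
    """l(theta) = sum_i log p(y^(i) | x^(i); theta).

    X: (m, n) feature matrix; y: (m,) integer labels in 0..k-1
    """
    return sum(np.log(h(theta, xi)[yi]) for xi, yi in zip(X, y))
```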
- Derivative for a single sample
Expanding $l(\theta)$ further:
$$\begin{aligned} l(\theta) &= \sum_{i=1}^{m} \log \frac{\prod_{l=1}^{k}\left(e^{\theta_l^T x^{(i)}}\right)^{1\{y^{(i)}=l\}}}{\sum_{j=1}^{k} e^{\theta_j^T x^{(i)}}} \\ &= \sum_{i=1}^{m} \left\{ \sum_{l=1}^{k} \left[ 1\{y^{(i)}=l\}\,\theta_l^T x^{(i)} \right] - \log \sum_{j=1}^{k} e^{\theta_j^T x^{(i)}} \right\} \end{aligned}$$
Taking the derivative of $l(\theta)$ with respect to $\theta_d$, the parameter vector for class $d$:
$$\begin{aligned} \frac{\partial l}{\partial \theta_d} &= \sum_{i=1}^{m} \left\{ 1\{y^{(i)}=d\}\,x^{(i)} - \frac{e^{\theta_d^T x^{(i)}}}{\sum_{j=1}^{k} e^{\theta_j^T x^{(i)}}}\,x^{(i)} \right\} \\ &= \sum_{i=1}^{m} \left[ 1\{y^{(i)}=d\} - \frac{e^{\theta_d^T x^{(i)}}}{\sum_{j=1}^{k} e^{\theta_j^T x^{(i)}}} \right] x^{(i)} \\ &= \sum_{i=1}^{m} \left[ 1\{y^{(i)}=d\} - \mathrm{Softmax}(\theta^T x^{(i)})_d \right] x^{(i)} \end{aligned}$$
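This per-class gradient translates directly into code; a sketch reusing `h()` from above:

```python
def grad_theta_d(theta, X, y, d):
    """dl/d(theta_d) = sum_i [1{y^(i)=d} - Softmax(theta^T x^(i))_d] * x^(i)."""
    g = np.zeros(X.shape[1])
    for xi, yi in zip(X, y):
        g += (float(yi == d) - h(theta, xi)[d]) * xi
    return g
```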
- Derivative with respect to the parameter matrix $\theta$
Let:
$$Y = \begin{bmatrix} y^{(1)} & y^{(2)} & \cdots & y^{(m)} \end{bmatrix}$$
$Y$ is the $k \times m$ label matrix ($m$ samples, $k$ classes), where each column $y^{(i)}$ is the class indicator vector $T(y)$ of one sample.
$$X = \begin{bmatrix} —(x^{(1)})^T—\\ —(x^{(2)})^T—\\ \vdots\\ —(x^{(m)})^T— \end{bmatrix}$$
$X$ is the $m \times n$ feature matrix ($m$ samples, $n$ features).
$$\theta = \begin{bmatrix} \theta_1 & \theta_2 & \cdots & \theta_k \end{bmatrix}$$
$\theta$ is the $n \times k$ parameter matrix, with one parameter vector $\theta_i$ per class. Then ($\mathbf{E}$ being the $k \times k$ matrix of all ones):
$$l(\theta) = \sum_{i=1}^{m} \log\left[ \frac{e^{(x^{(i)})^T\theta}}{e^{(x^{(i)})^T\theta}\,\mathbf{E}} \right] y^{(i)}$$
Rewriting $l(\theta)$ in matrix form:
$$\begin{aligned} l(\theta) &= tr\left(\log[\mathrm{Softmax}(X\theta)]\,Y\right) \\ &= tr\left(Y \log[\mathrm{Softmax}(X\theta)]\right) \\ &= tr\left(Y X \theta - Y \log\left(e^{X\theta}\mathbf{E}\right)\right) \end{aligned}$$
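The role of $\mathbf{E}$ can be checked numerically: right-multiplying $e^{X\theta}$ by the all-ones matrix puts each row's sum in every entry of that row, so the quotient $\frac{e^{X\theta}}{e^{X\theta}\mathbf{E}}$ is a row-wise softmax. A small sanity-check sketch (all names here are hypothetical demo values):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 5, 3, 4
X_demo = rng.normal(size=(m, n))
theta_demo = rng.normal(size=(n, k))
E = np.ones((k, k))                    # k x k all-ones matrix
Z = np.exp(X_demo @ theta_demo)        # e^{X theta}, elementwise exp
S = Z / (Z @ E)                        # Z @ E repeats each row's sum across the row
assert np.allclose(S, Z / Z.sum(axis=1, keepdims=True))  # row-wise softmax
print(S.sum(axis=1))                   # every row sums to 1
```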
Taking the differential (note: since each column of $Y$ is an indicator vector summing to 1, $\mathbf{E}Y$ is again a matrix of all ones, which is abbreviated to $\mathbf{E}$ below):
$$\begin{aligned} d(l) &= tr(YXd(\theta)) - tr\left(Y\left(\frac{1}{e^{X\theta}\mathbf{E}} \odot d(e^{X\theta}\mathbf{E})\right)\right) \\ &= tr(YXd(\theta)) - tr\left(\left(Y^T \odot \frac{1}{e^{X\theta}\mathbf{E}}\right)^T d(e^{X\theta}\mathbf{E})\right) \\ &= tr(YXd(\theta)) - tr\left(\mathbf{E}\left(Y^T \odot \frac{1}{e^{X\theta}\mathbf{E}}\right)^T d(e^{X\theta})\right) \\ &= tr(YXd(\theta)) - tr\left(\mathbf{E} Y \left(\frac{1}{e^{X\theta}\mathbf{E}} \odot d(e^{X\theta})\right)\right) \\ &= tr(YXd(\theta)) - tr\left(\mathbf{E}\left(\frac{1}{e^{X\theta}\mathbf{E}} \odot d(e^{X\theta})\right)\right) \\ &= tr(YXd(\theta)) - tr\left(\left(\mathbf{E} \odot \frac{1}{e^{X\theta}\mathbf{E}}\right)^T d(e^{X\theta})\right) \\ &= tr(YXd(\theta)) - tr\left(\left(\frac{1}{e^{X\theta}\mathbf{E}}\right)^T \left[e^{X\theta} \odot d(X\theta)\right]\right) \\ &= tr(YXd(\theta)) - tr\left(\left(\frac{1}{e^{X\theta}\mathbf{E}} \odot e^{X\theta}\right)^T d(X\theta)\right) \\ &= tr(YXd(\theta)) - tr\left(\left(\frac{e^{X\theta}}{e^{X\theta}\mathbf{E}}\right)^T d(X\theta)\right) \\ &= tr(YXd(\theta)) - tr\left(\left(\frac{e^{X\theta}}{e^{X\theta}\mathbf{E}}\right)^T X d(\theta)\right) \\ &= tr\left(\left(Y^T - \frac{e^{X\theta}}{e^{X\theta}\mathbf{E}}\right)^T X d(\theta)\right) = tr\left(\left(\frac{\partial l}{\partial \theta}\right)^T d(\theta)\right) \end{aligned}$$
Therefore:
$$\frac{\partial l}{\partial \theta} = X^T\left[Y^T - \left(\frac{e^{X\theta}}{e^{X\theta}\mathbf{E}}\right)\right] = X^T\left[Y^T - \mathrm{Softmax}(X\theta)\right]$$
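This closed-form gradient makes batch training only a few lines. A sketch using gradient *ascent* (since $l$ is a log-likelihood to be maximized); the $1/m$ scaling, learning rate `lr`, and iteration count are arbitrary choices, not part of the derivation:

```python
import numpy as np

def fit_softmax(X, y, k, lr=0.1, iters=500):
    """Batch gradient ascent on l(theta) via dl/dtheta = X^T (Y^T - Softmax(X theta)).

    X: (m, n) features; y: (m,) integer labels in 0..k-1. Returns theta (n, k).
    """
    m, n = X.shape
    Y_T = np.eye(k)[y]                        # (m, k): rows are one-hot labels, i.e. Y^T
    theta = np.zeros((n, k))
    for _ in range(iters):
        logits = X @ theta
        Z = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable exp
        P = Z / Z.sum(axis=1, keepdims=True)  # Softmax(X theta), row-wise
        theta += (lr / m) * (X.T @ (Y_T - P))  # ascent step on the log-likelihood
    return theta
```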