6.1 Logistic Regression
The logistic distribution
Recall the perceptron:
$$f(x)=\operatorname{sign}(w \cdot x+b)$$
Questions to consider:
- Is outputting only -1 and +1 too rigid? Is such a hard decision rule really effective?
- Is a point 0.001 to the left of the hyperplane really worlds apart from a point 0.001 to its right?
Shortcomings of the perceptron:
- The perceptron updates its parameters by gradient descent, but the sign function has a discontinuity at $x=0$ and is not differentiable there.
- Because sign is not continuously differentiable, the perceptron has to strip off the outer sign function when performing gradient descent, so the quantity being optimized during training is not the quantity used for prediction.
Definition of logistic regression:
$$
\begin{aligned}
&P(Y=1 \mid x)=\frac{\exp (w \cdot x)}{1+\exp (w \cdot x)} \\
&P(Y=0 \mid x)=\frac{1}{1+\exp (w \cdot x)}
\end{aligned}
$$
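As a minimal sketch of this definition (assuming NumPy, and assuming the bias $b$ has been folded into $w$ by appending a constant 1 to $x$, a common convention not stated explicitly here):

```python
import numpy as np

def predict_proba(w, x):
    """Return (P(Y=1|x), P(Y=0|x)) for the model above.

    Assumes the bias b is folded into w via a constant-1 feature.
    """
    z = np.dot(w, x)
    p1 = 1.0 / (1.0 + np.exp(-z))  # algebraically equal to exp(z) / (1 + exp(z))
    return p1, 1.0 - p1
```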
Parameter estimation:
When learning a logistic regression model, given a training set $T=\{(x_{1}, y_{1}),(x_{2}, y_{2}), \cdots,(x_{N}, y_{N})\}$, where $x_{i} \in \mathbf{R}^{n}$ and $y_{i} \in\{0,1\}$, maximum likelihood estimation can be applied to estimate the model parameters, which yields the logistic regression model.
Let:
$$P(Y=1 \mid x)=\pi(x), \quad P(Y=0 \mid x)=1-\pi(x)$$
The likelihood function is:
$$\prod_{i=1}^{N}\left[\pi\left(x_{i}\right)\right]^{y_{i}}\left[1-\pi\left(x_{i}\right)\right]^{1-y_{i}}$$
The log-likelihood function is:
$$
\begin{aligned}
L(w) &=\sum_{i=1}^{N}\left[y_{i} \log \pi\left(x_{i}\right)+\left(1-y_{i}\right) \log \left(1-\pi\left(x_{i}\right)\right)\right] \\
&=\sum_{i=1}^{N}\left[y_{i} \log \frac{\pi\left(x_{i}\right)}{1-\pi\left(x_{i}\right)}+\log \left(1-\pi\left(x_{i}\right)\right)\right] \\
&=\sum_{i=1}^{N}\left[y_{i}\left(w \cdot x_{i}\right)-\log \left(1+\exp \left(w \cdot x_{i}\right)\right)\right]
\end{aligned}
$$
Maximizing $L(w)$ gives the estimate of $w$.
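The final line of the derivation translates directly into code. A minimal sketch (with `X` of shape (N, n) and labels `y` in {0, 1} as illustrative conventions; for large $w \cdot x_i$ a numerically stabler form would be needed):

```python
import numpy as np

def log_likelihood(w, X, y):
    """L(w) = sum_i [ y_i (w . x_i) - log(1 + exp(w . x_i)) ]."""
    z = X @ w                            # w . x_i for every sample, shape (N,)
    return np.sum(y * z - np.log1p(np.exp(z)))
```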
Differentiating the log-likelihood with respect to $w$ (note the sum over samples, and that $\frac{\exp(w \cdot x_i)}{1+\exp(w \cdot x_i)}=\pi(x_i)$):
$$
\begin{gathered}
L(w)=\sum_{i=1}^{N}\left[y_{i}\left(w \cdot x_{i}\right)-\log \left(1+\exp \left(w \cdot x_{i}\right)\right)\right] \\
\frac{\partial L(w)}{\partial w}=\sum_{i=1}^{N}\left[y_{i} \cdot x_{i}-\frac{\exp \left(w \cdot x_{i}\right)}{1+\exp \left(w \cdot x_{i}\right)} \cdot x_{i}\right]=\sum_{i=1}^{N}\left[y_{i}-\pi\left(x_{i}\right)\right] x_{i}
\end{gathered}
$$
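With this gradient, $w$ can be estimated by gradient ascent. A minimal sketch (the learning rate and iteration count are illustrative choices, not from the text):

```python
import numpy as np

def fit(X, y, lr=0.1, n_iter=1000):
    """Maximize L(w) by batch gradient ascent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ w)))  # pi(x_i) = P(Y=1 | x_i)
        w += lr * (X.T @ (y - pi))           # dL/dw = sum_i (y_i - pi(x_i)) x_i
    return w
```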
Summary:
- By outputting probabilities, logistic regression removes the drastic +1/-1 gap caused by tiny distances from the hyperplane; the probability also serves as a measure of the model's confidence.
- Logistic regression makes the model function continuously differentiable, so the training objective and the prediction objective agree.
- Logistic regression estimates its parameters by maximum likelihood.
The maximum entropy principle
What is maximum entropy?
When guessing probabilities, we treat whatever is uncertain as equally likely. Just as with a die: we know it has 6 faces, so we assign each face probability $1/6$, i.e., equal likelihood.
In other words, we lean toward the uniform distribution. Maximum entropy rests on exactly this plain idea: whatever we know, we take into account; whatever we do not know, we treat as uniformly distributed. (See the worked check below.)
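As a quick check that the dice intuition really is maximum entropy, maximize the entropy of a six-outcome distribution under only the normalization constraint (a standard Lagrange-multiplier computation, included here for illustration):
$$\max _{p}\; -\sum_{i=1}^{6} p_{i} \log p_{i} \quad \text { s.t. } \quad \sum_{i=1}^{6} p_{i}=1$$
$$\frac{\partial}{\partial p_{i}}\left[-\sum_{j=1}^{6} p_{j} \log p_{j}+\lambda\left(\sum_{j=1}^{6} p_{j}-1\right)\right]=-\log p_{i}-1+\lambda=0 \quad \Rightarrow \quad p_{i}=e^{\lambda-1}$$
Every $p_{i}$ equals the same constant, so normalization forces $p_{i}=1/6$: the uniform distribution maximizes the entropy.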
The maximum entropy model
The ultimate goal:
$$P(Y \mid X)$$
Entropy:
$$H(P)=-\sum_{x} P(x) \log P(x)$$
Substituting the target distribution into the entropy:
$$H(P)=-\sum_{x, y} P(y \mid x) \log P(y \mid x)$$
Making a small adjustment: weight each $x$ by the empirical marginal $\widetilde{P}(x)$, which gives the conditional entropy:
$$H(P)=-\sum_{x, y} \widetilde{P}(x) P(y \mid x) \log P(y \mid x)$$
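A small sketch of evaluating this adjusted entropy, with dictionary-based arguments as illustrative conventions:

```python
import math

def conditional_entropy(p_x_tilde, p_y_given_x):
    """H(P) = -sum_{x,y} P~(x) P(y|x) log P(y|x).

    p_x_tilde: dict x -> P~(x); p_y_given_x: dict x -> {y: P(y|x)}.
    """
    return -sum(px * pyx * math.log(pyx)
                for x, px in p_x_tilde.items()
                for pyx in p_y_given_x[x].values()
                if pyx > 0)
```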
Constraints
The feature function:
$$f(x, y)= \begin{cases}1, & x \text{ and } y \text{ satisfy some fact} \\ 0, & \text{otherwise}\end{cases}$$
The expectation of the feature function $f(x, y)$ with respect to the empirical distribution $\widetilde{P}(x, y)$:
$$E_{\widetilde{P}}(f)=\sum_{x, y} \widetilde{P}(x, y) f(x, y)=\sum_{x, y} \widetilde{P}(x) \widetilde{P}(y \mid x) f(x, y)$$
The expectation of $f(x, y)$ with respect to the model $P(y \mid x)$ and the empirical marginal $\widetilde{P}(x)$:
$$E_{P}(f)=\sum_{x, y} P(x, y) f(x, y)=\sum_{x, y} \widetilde{P}(x) P(y \mid x) f(x, y)$$
The constraint:
$$E_{\widetilde{P}}(f)=E_{P}(f)$$
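A sketch of computing the empirical side of this constraint from raw samples, taking $\widetilde{P}(x, y)$ as relative counts (the argument names are illustrative):

```python
from collections import Counter

def empirical_expectation(data, f):
    """E_P~(f) = sum_{x,y} P~(x,y) f(x,y), with P~(x,y) = count(x,y) / N.

    data: list of (x, y) pairs with hashable entries; f: feature function.
    """
    counts = Counter(data)
    N = len(data)
    return sum(c / N * f(x, y) for (x, y), c in counts.items())
```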
This gives a constrained optimization problem (stated with $n$ feature functions $f_{i}$), which by convention is rewritten as an equivalent minimization:
$$
\begin{array}{ll}
\max _{P \in C} & H(P)=-\sum_{x, y} \widetilde{P}(x) P(y \mid x) \log P(y \mid x) \\
\text { s.t. } & E_{\widetilde{P}}\left(f_{i}\right)-E_{P}\left(f_{i}\right)=0, \quad i=1,2, \cdots, n \\
& \sum_{y} P(y \mid x)=1
\end{array}
$$
$$
\begin{array}{ll}
\min _{P \in C} & -H(P)=\sum_{x, y} \widetilde{P}(x) P(y \mid x) \log P(y \mid x) \\
\text { s.t. } & E_{\widetilde{P}}\left(f_{i}\right)-E_{P}\left(f_{i}\right)=0, \quad i=1,2, \cdots, n \\
& \sum_{y} P(y \mid x)=1
\end{array}
$$
The method of Lagrange multipliers
$$
\begin{aligned}
L(P, w) \equiv &-H(P)+w_{0}\left(1-\sum_{y} P(y \mid x)\right)+\sum_{i=1}^{n} w_{i}\left(E_{\tilde{P}}\left(f_{i}\right)-E_{P}\left(f_{i}\right)\right) \\
=& \sum_{x, y} \tilde{P}(x) P(y \mid x) \log P(y \mid x)+w_{0}\left(1-\sum_{y} P(y \mid x)\right) \\
&+\sum_{i=1}^{n} w_{i}\left(\sum_{x, y} \tilde{P}(x, y) f_{i}(x, y)-\sum_{x, y} \tilde{P}(x) P(y \mid x) f_{i}(x, y)\right)
\end{aligned}
$$
The primal min-max problem is converted to its dual max-min problem:
$$\min _{P \in C} \max _{w} L(P, w) \rightarrow \max _{w} \min _{P \in C} L(P, w)$$
Solving the inner minimization over $P$ yields:
$$
\begin{aligned}
P_{w}(y \mid x) &=\frac{1}{Z_{w}(x)} \exp \left(\sum_{i=1}^{n} w_{i} f_{i}(x, y)\right) \\
Z_{w}(x) &=\sum_{y} \exp \left(\sum_{i=1}^{n} w_{i} f_{i}(x, y)\right)
\end{aligned}
$$
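A minimal sketch of evaluating $P_{w}(y \mid x)$ (the label set `ys` and feature list `features` are illustrative names):

```python
import math

def maxent_conditional(w, x, ys, features):
    """P_w(y|x) = exp(sum_i w_i f_i(x, y)) / Z_w(x) for each y in ys."""
    scores = {y: math.exp(sum(wi * fi(x, y) for wi, fi in zip(w, features)))
              for y in ys}
    Z = sum(scores.values())              # Z_w(x), the normalizer
    return {y: s / Z for y, s in scores.items()}
```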
Summary
- Maximum entropy insists on adding no assumptions beyond what is known, taking maximal entropy as the objective.
- Substitute the target distribution $P(y \mid x)$ into the entropy formula, then maximize it.
- Extract the constraints already present in the training set by computing feature expectations, and impose them on the optimization. The method of Lagrange multipliers gives the form of $P_{w}(y \mid x)$; an optimization algorithm then finds the parameters $w$.
6.2 Improved Iterative Scaling (IIS)
The objective to be solved is known:
$$
\begin{aligned}
P_{w}(y \mid x) &=\frac{1}{Z_{w}(x)} \exp \left(\sum_{i=1}^{n} w_{i} f_{i}(x, y)\right) \\
Z_{w}(x) &=\sum_{y} \exp \left(\sum_{i=1}^{n} w_{i} f_{i}(x, y)\right)
\end{aligned}
$$
Multiplying the model over all samples and taking the logarithm gives the log-likelihood:
$$L(w)=\sum_{x, y}\left[\tilde{P}(x, y) \sum_{i=1}^{n} w_{i} f_{i}(x, y)\right]-\sum_{x}\left[\tilde{P}(x) \ln Z_{w}(x)\right]$$
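For concreteness, a sketch of this log-likelihood with $\tilde{P}$ taken as relative counts over a list of samples (names are illustrative):

```python
import math

def maxent_log_likelihood(w, data, features, ys):
    """L(w) = sum_{x,y} P~(x,y) sum_i w_i f_i(x,y) - sum_x P~(x) ln Z_w(x).

    data: list of (x, y) samples, so P~ is given by relative counts.
    """
    N = len(data)
    score = lambda x, y: sum(wi * fi(x, y) for wi, fi in zip(w, features))
    first = sum(score(x, y) for x, y in data) / N
    second = sum(math.log(sum(math.exp(score(x, yy)) for yy in ys))
                 for x, _ in data) / N
    return first - second
```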
The core idea of IIS: at each step find an increment $\delta$ such that $L(w+\delta)>L(w)$, and keep raising the value of $L$ in this way until it reaches its maximum.
$$L(w+\delta)-L(w)=\sum_{x, y}\left[\tilde{P}(x, y) \sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right]-\sum_{x}\left[\tilde{P}(x) \ln \frac{Z_{w+\delta}(x)}{Z_{w}(x)}\right]$$
where
$$
\begin{aligned}
\frac{Z_{w+\delta}(x)}{Z_{w}(x)} &=\frac{1}{Z_{w}(x)} \sum_{y} \exp \left(\sum_{i=1}^{n}\left(w_{i}+\delta_{i}\right) f_{i}(x, y)\right) \\
&=\sum_{y} \frac{1}{Z_{w}(x)} \exp \left(\sum_{i=1}^{n} w_{i} f_{i}(x, y)\right) \exp \left(\sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right) \\
&=\sum_{y} P_{w}(y \mid x) \exp \left(\sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right)
\end{aligned}
$$
Therefore, using the inequality $-\ln \alpha \geq 1-\alpha$ for $\alpha>0$:
$$
\begin{aligned}
L(w+\delta)-L(w) &=\sum_{x, y}\left[\tilde{P}(x, y) \sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right]-\sum_{x}\left[\tilde{P}(x) \ln \frac{Z_{w+\delta}(x)}{Z_{w}(x)}\right] \\
& \geq \sum_{x, y}\left[\tilde{P}(x, y) \sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right]+1-\sum_{x} \tilde{P}(x) \sum_{y} P_{w}(y \mid x) \exp \left(\sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right)
\end{aligned}
$$
Furthermore, let $f^{*}(x, y)=\sum_{i=1}^{n} f_{i}(x, y)$. Since the coefficients $f_{i}(x, y) / f^{*}(x, y)$ are nonnegative and sum to 1, Jensen's inequality applied to the convex function $\exp$ gives:
$$\exp \left(\sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right)=\exp \left(\sum_{i=1}^{n} \frac{f_{i}(x, y)}{f^{*}(x, y)}\, \delta_{i} f^{*}(x, y)\right) \leq \sum_{i=1}^{n} \frac{f_{i}(x, y)}{f^{*}(x, y)} \exp \left(\delta_{i} f^{*}(x, y)\right)$$
Therefore
$$
\begin{aligned}
L(w+\delta)-L(w) &=\sum_{x, y}\left[\tilde{P}(x, y) \sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right]-\sum_{x}\left[\tilde{P}(x) \ln \frac{Z_{w+\delta}(x)}{Z_{w}(x)}\right] \\
& \geq \sum_{x, y}\left[\tilde{P}(x, y) \sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right]+1-\sum_{x} \tilde{P}(x) \sum_{y} P_{w}(y \mid x) \exp \left(\sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right) \\
& \geq \sum_{x, y}\left[\tilde{P}(x, y) \sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right]+1-\sum_{x} \tilde{P}(x) \sum_{y} P_{w}(y \mid x) \sum_{i=1}^{n} \frac{f_{i}(x, y)}{f^{*}(x, y)} \exp \left(\delta_{i} f^{*}(x, y)\right)
\end{aligned}
$$
Let
$$
\begin{aligned}
&A(\delta \mid w)=\sum_{x, y}\left[\tilde{P}(x, y) \sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right]+1-\sum_{x} \tilde{P}(x) \sum_{y} P_{w}(y \mid x) \exp \left(\sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right) \\
&B(\delta \mid w)=\sum_{x, y}\left[\tilde{P}(x, y) \sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right]+1-\sum_{x} \tilde{P}(x) \sum_{y} P_{w}(y \mid x) \sum_{i=1}^{n} \frac{f_{i}(x, y)}{f^{*}(x, y)} \exp \left(\delta_{i} f^{*}(x, y)\right)
\end{aligned}
$$
When $\delta=0$, we have
$$A(0 \mid w)=0, \quad B(0 \mid w)=0$$
so the lower bound $B(\delta \mid w)$ is tight at $\delta=0$, and any $\delta$ with $B(\delta \mid w)>0$ strictly increases $L$.
So, maximizing $B(\delta \mid w)$ coordinate-wise, setting $\partial B / \partial \delta_{i}=0$ reduces to solving $g\left(\delta_{i}\right)=0$ for each $i$, where
$$g\left(\delta_{i}\right)=\sum_{x, y} \tilde{P}(x) P_{w}(y \mid x) f_{i}(x, y) \exp \left(\delta_{i} f^{*}(x, y)\right)-E_{\tilde{P}}\left(f_{i}\right)$$
which can be solved by Newton's method:
$$\delta_{i}^{(k+1)}=\delta_{i}^{(k)}-\frac{g\left(\delta_{i}^{(k)}\right)}{g^{\prime}\left(\delta_{i}^{(k)}\right)}$$
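A minimal sketch of this Newton iteration (`g` and `g_prime` stand for $g$ and $g'$; the defaults are illustrative):

```python
def solve_delta(g, g_prime, delta0=0.0, tol=1e-8, max_iter=100):
    """Solve g(delta_i) = 0 by Newton's method."""
    d = delta0
    for _ in range(max_iter):
        step = g(d) / g_prime(d)
        d -= step
        if abs(step) < tol:     # stop once the update is negligible
            break
    return d
```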
Summary:
IIS finds a lower bound on the improvement of the original objective and repeatedly maximizes that lower bound, thereby pushing the objective itself upward until convergence.
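Putting the pieces together, one possible IIS outer loop, reusing `empirical_expectation`, `maxent_conditional`, and `solve_delta` from the sketches above (all names and defaults are illustrative assumptions, not from the text):

```python
import math

def iis_fit(data, features, ys, n_rounds=50):
    """Fit max-entropy weights w by IIS: per round, solve g(delta_i) = 0
    for every feature at the current w, then update all w_i at once."""
    N = len(data)
    xs = [x for x, _ in data]                                    # repeats encode P~(x)
    f_star = lambda x, y: sum(fi(x, y) for fi in features)       # f*(x, y)
    emp = [empirical_expectation(data, fi) for fi in features]   # E_P~(f_i)
    w = [0.0] * len(features)
    for _ in range(n_rounds):
        cond = {x: maxent_conditional(w, x, ys, features) for x in set(xs)}
        deltas = []
        for i, fi in enumerate(features):
            def g(d, fi=fi, i=i):
                # g(d) = sum_{x,y} P~(x) P_w(y|x) f_i(x,y) e^{d f*(x,y)} - E_P~(f_i)
                s = sum(cond[x][y] * fi(x, y) * math.exp(d * f_star(x, y))
                        for x in xs for y in ys) / N
                return s - emp[i]
            def g_prime(d, fi=fi):
                return sum(cond[x][y] * fi(x, y) * f_star(x, y)
                           * math.exp(d * f_star(x, y))
                           for x in xs for y in ys) / N
            deltas.append(solve_delta(g, g_prime))
        w = [wi + di for wi, di in zip(w, deltas)]
    return w
```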