[Notes on Li Hang's Statistical Learning Methods] Chapter 6: Logistic Regression

6.1 Logistic Regression

The logistic distribution

Recall the perceptron:

$$f(x)=\operatorname{sign}(w \cdot x+b)$$

Questions to consider:

  1. Isn't outputting only -1 and +1 too rigid? Is such a decision rule really effective?
  2. Is there really a world of difference between a point 0.001 to the left of the hyperplane and a point 0.001 to its right?

Shortcomings of the perceptron:

  1. The perceptron updates its parameters by gradient descent, but the sign function has a discontinuity at $x=0$ and is therefore not differentiable.
  2. Because sign is not continuously differentiable, the perceptron strips off the sign "shell" during gradient descent, so the function being optimized is not the function used for prediction.

Definition of logistic regression:

$$
\begin{aligned}
P(Y=1 \mid x) &= \frac{\exp (w \cdot x)}{1+\exp (w \cdot x)} \\
P(Y=0 \mid x) &= \frac{1}{1+\exp (w \cdot x)}
\end{aligned}
$$
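
As a quick illustration (a minimal sketch assuming NumPy, with a weight vector `w` that absorbs the bias $b$ via an appended constant feature), the model turns the perceptron's hard ±1 output into a smooth probability:

```python
import numpy as np

def predict_proba(w, x):
    """P(Y=1 | x) = exp(w.x) / (1 + exp(w.x)) = 1 / (1 + exp(-w.x))."""
    z = np.dot(w, x)
    return 1.0 / (1.0 + np.exp(-z))  # numerically friendlier equivalent form

# Two points a hair's breadth on either side of the hyperplane now get
# probabilities just above and below 0.5 -- no "world of difference".
w = np.array([1.0, -1.0])
print(predict_proba(w, np.array([0.001, 0.0])))   # ~0.50025
print(predict_proba(w, np.array([-0.001, 0.0])))  # ~0.49975
```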

Parameter estimation:

When learning a logistic regression model, given a training set $T=\{(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N)\}$, where $x_i \in \mathbf{R}^n$ and $y_i \in \{0, 1\}$, we can estimate the model parameters by maximum likelihood, which yields the logistic regression model.

Let:
$$P(Y=1 \mid x)=\pi(x), \quad P(Y=0 \mid x)=1-\pi(x)$$
The likelihood function is:
$$\prod_{i=1}^{N}\left[\pi(x_i)\right]^{y_i}\left[1-\pi(x_i)\right]^{1-y_i}$$
The log-likelihood function is:
$$
\begin{aligned}
L(w) &=\sum_{i=1}^{N}\left[y_i \log \pi(x_i)+(1-y_i) \log (1-\pi(x_i))\right] \\
&=\sum_{i=1}^{N}\left[y_i \log \frac{\pi(x_i)}{1-\pi(x_i)}+\log (1-\pi(x_i))\right] \\
&=\sum_{i=1}^{N}\left[y_i (w \cdot x_i)-\log (1+\exp (w \cdot x_i))\right]
\end{aligned}
$$
L(w)L(w)L(w)求极大值,得到www的估计值。

Differentiating the log-likelihood with respect to $w$:
$$
\frac{\partial L(w)}{\partial w}=\sum_{i=1}^{N}\left[y_i x_i-\frac{\exp (w \cdot x_i)}{1+\exp (w \cdot x_i)} x_i\right]=\sum_{i=1}^{N}\left[y_i-\pi(x_i)\right] x_i
$$
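
A gradient-ascent sketch built directly on this gradient (illustrative only: `X`, `y`, `lr`, and `n_iter` are assumed names, and a constant column is appended to `X` to absorb the bias $b$):

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    """Maximize the log-likelihood L(w) by batch gradient ascent.

    X: (N, d) inputs (include a column of ones for the bias);
    y: (N,) labels in {0, 1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-X @ w))   # pi(x_i) = P(Y=1 | x_i)
        grad = X.T @ (y - pi)               # sum_i (y_i - pi(x_i)) x_i
        w += lr * grad / len(y)             # average the gradient for a stable step
    return w
```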

Summary:
  1. By outputting a probability, logistic regression removes the artificial "world of difference" between +1 and -1 caused by tiny distances from the hyperplane; the probability also serves as a confidence measure for the model's output.
  2. Logistic regression makes the final model function continuously differentiable, so the training objective and the prediction function are consistent.
  3. Logistic regression estimates its parameters by maximum likelihood.

The Maximum Entropy Principle

What is maximum entropy?

When guessing probabilities, we treat the uncertain part as equally likely. Think of a die: we know it has 6 faces, so we assign each face probability $1/6$, i.e., equal likelihood.
In other words, we lean toward the uniform distribution. Maximum entropy rests on exactly this plain idea: whatever we know, we take into account; whatever we do not know, we assume to be uniformly distributed.
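
A tiny numerical check of this intuition (a sketch assuming NumPy; the skewed die probabilities are made up for illustration): the uniform die has the largest entropy.

```python
import numpy as np

def entropy(p):
    """H(P) = -sum_x P(x) log P(x), with 0 log 0 taken as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

print(entropy([1/6] * 6))                          # log 6 ~ 1.792, the maximum
print(entropy([0.5, 0.3, 0.1, 0.05, 0.03, 0.02]))  # ~1.27, lower: we "know" more
```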

The Maximum Entropy Model

The ultimate goal:
$$P(Y \mid X)$$
Entropy:
$$H(P)=-\sum_{x} P(x) \log P(x)$$
Substituting the target into the entropy:
$$H(P)=-\sum_{x, y} P(y \mid x) \log P(y \mid x)$$
With one adjustment, weighting each $x$ by its empirical frequency, the entropy becomes:
$$H(P)=-\sum_{x, y} \widetilde{P}(x) P(y \mid x) \log P(y \mid x)$$

Constraints

Feature function:
$$
f(x, y)= \begin{cases}1, & x \text{ and } y \text{ satisfy some fact} \\ 0, & \text{otherwise}\end{cases}
$$
The expectation of the feature function $f(x, y)$ with respect to the empirical distribution $\widetilde{P}(x, y)$:
$$E_{\widetilde{P}}(f)=\sum_{x, y} \widetilde{P}(x, y) f(x, y)=\sum_{x, y} \widetilde{P}(x) \widetilde{P}(y \mid x) f(x, y)$$
The expectation of the feature function $f(x, y)$ with respect to the model $P(y \mid x)$ and the empirical marginal $\widetilde{P}(x)$:
$$E_{P}(f)=\sum_{x, y} \widetilde{P}(x) P(y \mid x) f(x, y)$$
Constraint:
$$E_{\widetilde{P}}(f)=E_{P}(f)$$
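
In code, the empirical side of this constraint is just a weighted count over the training pairs (a minimal sketch; `samples` and the indicator feature below are hypothetical):

```python
from collections import Counter

def empirical_expectation(samples, f):
    """E_{P~}(f) = sum_{x,y} P~(x,y) f(x,y), with P~ read off as pair frequencies."""
    counts = Counter(samples)
    n = len(samples)
    return sum(c / n * f(x, y) for (x, y), c in counts.items())

# Hypothetical data: the feature fires when x == 'sunny' and y == 1.
samples = [('sunny', 1), ('sunny', 1), ('rainy', 0), ('sunny', 0)]
f = lambda x, y: 1.0 if (x == 'sunny' and y == 1) else 0.0
print(empirical_expectation(samples, f))  # 2/4 = 0.5
```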

Putting the objective and constraints together (and flipping the sign to restate it as an equivalent minimization):
$$
\begin{array}{ll}
\max_{P \in C} & H(P)=-\sum_{x, y} \widetilde{P}(x) P(y \mid x) \log P(y \mid x) \\
\text{s.t.} & E_{\widetilde{P}}(f_i)-E_{P}(f_i)=0, \quad i=1,2,\cdots,n \\
& \sum_{y} P(y \mid x)=1
\end{array}
$$
$$
\begin{array}{ll}
\min_{P \in C} & -H(P)=\sum_{x, y} \widetilde{P}(x) P(y \mid x) \log P(y \mid x) \\
\text{s.t.} & E_{\widetilde{P}}(f_i)-E_{P}(f_i)=0, \quad i=1,2,\cdots,n \\
& \sum_{y} P(y \mid x)=1
\end{array}
$$

The Method of Lagrange Multipliers

$$
\begin{aligned}
L(P, w) \equiv{} &-H(P)+w_0\Big(1-\sum_{y} P(y \mid x)\Big)+\sum_{i=1}^{n} w_i\big(E_{\widetilde{P}}(f_i)-E_{P}(f_i)\big) \\
={} & \sum_{x, y} \widetilde{P}(x) P(y \mid x) \log P(y \mid x)+w_0\Big(1-\sum_{y} P(y \mid x)\Big) \\
&+\sum_{i=1}^{n} w_i\Big(\sum_{x, y} \widetilde{P}(x, y) f_i(x, y)-\sum_{x, y} \widetilde{P}(x) P(y \mid x) f_i(x, y)\Big)
\end{aligned}
$$

Since $-H(P)$ is convex in $P$ and the constraints are linear, the primal min-max problem can be exchanged for its dual:
$$\min_{P \in C} \max_{w} L(P, w) \rightarrow \max_{w} \min_{P \in C} L(P, w)$$

Setting the derivative of $L(P, w)$ with respect to $P(y \mid x)$ to zero and solving the inner minimization yields the maximum entropy model:
$$
\begin{aligned}
P_w(y \mid x) &=\frac{1}{Z_w(x)} \exp \Big(\sum_{i=1}^{n} w_i f_i(x, y)\Big) \\
Z_w(x) &=\sum_{y} \exp \Big(\sum_{i=1}^{n} w_i f_i(x, y)\Big)
\end{aligned}
$$
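
Given weights, evaluating this model is a normalized exponential over $y$ for each $x$; a sketch assuming tabular $x$, $y$ and a feature array `F[x, y, i]` $= f_i(x, y)$ (names and shapes are illustrative):

```python
import numpy as np

def maxent_conditional(F, w):
    """P_w(y|x) = exp(sum_i w_i f_i(x,y)) / Z_w(x) for tabular x and y.

    F: (M, K, n) array with F[x, y, i] = f_i(x, y); w: (n,) weights.
    """
    scores = F @ w                                     # sum_i w_i f_i(x, y), shape (M, K)
    scores -= scores.max(axis=1, keepdims=True)        # stabilize exp without changing P
    unnorm = np.exp(scores)
    return unnorm / unnorm.sum(axis=1, keepdims=True)  # row sums are Z_w(x)
```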

Summary
  1. Maximum entropy insists on making no extra assumptions, taking the largest entropy as its objective.
  2. Substitute the target $P(y \mid x)$ into the entropy formula, then maximize it.
  3. Read the constraints off the training set by computing feature expectations. The method of Lagrange multipliers gives the form of $P(y \mid x)$; an optimization algorithm then finds the parameters $w$ of $P(y \mid x)$.

6.2 Improved Iterative Scaling (IIS)

The objective to be solved is known:
$$
\begin{aligned}
P_w(y \mid x) &=\frac{1}{Z_w(x)} \exp \Big(\sum_{i=1}^{n} w_i f_i(x, y)\Big) \\
Z_w(x) &=\sum_{y} \exp \Big(\sum_{i=1}^{n} w_i f_i(x, y)\Big)
\end{aligned}
$$
Taking the product over all samples and then the logarithm turns this into the log-likelihood:
$$L(w)=\sum_{x, y} \widetilde{P}(x, y) \sum_{i=1}^{n} w_i f_i(x, y)-\sum_{x} \widetilde{P}(x) \ln Z_w(x)$$
The core idea of IIS: at each step, find an increment $\delta$ such that $L(w+\delta)>L(w)$, and keep raising $L$ this way until it reaches its maximum.
$$L(w+\delta)-L(w)=\sum_{x, y} \widetilde{P}(x, y) \sum_{i=1}^{n} \delta_i f_i(x, y)-\sum_{x} \widetilde{P}(x) \ln \frac{Z_{w+\delta}(x)}{Z_w(x)}$$
where
$$
\begin{aligned}
\frac{Z_{w+\delta}(x)}{Z_w(x)} &=\frac{1}{Z_w(x)} \sum_{y} \exp \Big(\sum_{i=1}^{n}(w_i+\delta_i) f_i(x, y)\Big) \\
&=\sum_{y} \frac{1}{Z_w(x)} \exp \Big(\sum_{i=1}^{n} w_i f_i(x, y)\Big) \exp \Big(\sum_{i=1}^{n} \delta_i f_i(x, y)\Big) \\
&=\sum_{y} P_w(y \mid x) \exp \Big(\sum_{i=1}^{n} \delta_i f_i(x, y)\Big)
\end{aligned}
$$
Hence, applying the inequality $-\ln \alpha \geqslant 1-\alpha$ (valid for $\alpha>0$):
$$
\begin{aligned}
L(w+\delta)-L(w) &=\sum_{x, y} \widetilde{P}(x, y) \sum_{i=1}^{n} \delta_i f_i(x, y)-\sum_{x} \widetilde{P}(x) \ln \frac{Z_{w+\delta}(x)}{Z_w(x)} \\
& \geqslant \sum_{x, y} \widetilde{P}(x, y) \sum_{i=1}^{n} \delta_i f_i(x, y)+1-\sum_{x} \widetilde{P}(x) \sum_{y} P_w(y \mid x) \exp \Big(\sum_{i=1}^{n} \delta_i f_i(x, y)\Big)
\end{aligned}
$$

Let $f^{*}(x, y)=\sum_{i=1}^{n} f_i(x, y)$. Since the weights $f_i(x, y) / f^{*}(x, y)$ are nonnegative and sum to 1, and $\exp$ is convex, Jensen's inequality gives:
$$\exp \Big(\sum_{i=1}^{n} \delta_i f_i(x, y)\Big)=\exp \Big(\sum_{i=1}^{n} \frac{f_i(x, y)}{f^{*}(x, y)} \delta_i f^{*}(x, y)\Big) \leqslant \sum_{i=1}^{n} \frac{f_i(x, y)}{f^{*}(x, y)} \exp \big(\delta_i f^{*}(x, y)\big)$$

Therefore
$$
\begin{aligned}
L(w+\delta)-L(w) & \geqslant \sum_{x, y} \widetilde{P}(x, y) \sum_{i=1}^{n} \delta_i f_i(x, y)+1-\sum_{x} \widetilde{P}(x) \sum_{y} P_w(y \mid x) \exp \Big(\sum_{i=1}^{n} \delta_i f_i(x, y)\Big) \\
& \geqslant \sum_{x, y} \widetilde{P}(x, y) \sum_{i=1}^{n} \delta_i f_i(x, y)+1-\sum_{x} \widetilde{P}(x) \sum_{y} P_w(y \mid x) \sum_{i=1}^{n} \frac{f_i(x, y)}{f^{*}(x, y)} \exp \big(\delta_i f^{*}(x, y)\big)
\end{aligned}
$$
Let
$$
\begin{aligned}
A(\delta \mid w) &=\sum_{x, y} \widetilde{P}(x, y) \sum_{i=1}^{n} \delta_i f_i(x, y)+1-\sum_{x} \widetilde{P}(x) \sum_{y} P_w(y \mid x) \exp \Big(\sum_{i=1}^{n} \delta_i f_i(x, y)\Big) \\
B(\delta \mid w) &=\sum_{x, y} \widetilde{P}(x, y) \sum_{i=1}^{n} \delta_i f_i(x, y)+1-\sum_{x} \widetilde{P}(x) \sum_{y} P_w(y \mid x) \sum_{i=1}^{n} \frac{f_i(x, y)}{f^{*}(x, y)} \exp \big(\delta_i f^{*}(x, y)\big)
\end{aligned}
$$
δ=0\delta=0δ=0, 有
A(δ∣w)=0B(δ∣w)=0 \begin{aligned} &A(\delta \mid w)=0 \\ &B(\delta \mid w)=0 \end{aligned} A(δw)=0B(δw)=0
Maximizing $B(\delta \mid w)$, set $\partial B / \partial \delta_i=0$ for each $i$; this reduces to solving $g(\delta_i)=0$, which Newton's method handles:
$$
\begin{aligned}
g(\delta_i) &=\sum_{x, y} \widetilde{P}(x) P_w(y \mid x) f_i(x, y) \exp \big(\delta_i f^{*}(x, y)\big)-E_{\widetilde{P}}(f_i) \\
g(\delta_i) &=0 \\
\delta_i^{(k+1)} &=\delta_i^{(k)}-\frac{g\big(\delta_i^{(k)}\big)}{g^{\prime}\big(\delta_i^{(k)}\big)}
\end{aligned}
$$
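
Putting the pieces together, the inner Newton loop can be sketched as follows (assuming the same tabular setup as above: `F[x, y, i]` $= f_i(x, y)$, the empirical marginal `p_tilde_x`, the current model `P_w`, and precomputed empirical expectations `E_tilde`; all names are illustrative). After each such step one would set $w \leftarrow w+\delta$, recompute $P_w$, and repeat until convergence.

```python
import numpy as np

def iis_deltas(F, p_tilde_x, P_w, E_tilde, n_newton=50, tol=1e-10):
    """Solve g(delta_i) = 0 for each feature by Newton's method (one IIS step)."""
    M, K, n = F.shape
    f_star = F.sum(axis=2)               # f*(x, y) = sum_i f_i(x, y)
    weight = p_tilde_x[:, None] * P_w    # P~(x) P_w(y | x), shape (M, K)
    delta = np.zeros(n)
    for i in range(n):
        for _ in range(n_newton):
            e = np.exp(delta[i] * f_star)
            g = np.sum(weight * F[:, :, i] * e) - E_tilde[i]    # g(delta_i)
            g_prime = np.sum(weight * F[:, :, i] * f_star * e)  # g'(delta_i)
            if abs(g) < tol or g_prime == 0.0:
                break
            delta[i] -= g / g_prime      # Newton update from the formula above
    return delta

```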
Summary:

IIS finds a lower bound on the improvement of the original objective and repeatedly maximizes that lower bound, thereby pushing the objective itself upward.
