Machine Learning: Logistic Regression

This article describes how the logistic regression algorithm works, covering its target function, gradient computation, and parameter update rule, and shows how maximum likelihood estimation is used to derive them.


Algorithm description:

Logistic Regression Algorithm
Initialize $\omega_0$
For $t = 0, 1, 2, \cdots$

      1. Compute the gradient direction:

$$\nabla E_{in}(\omega_t) = \frac{1}{N}\sum_{n=1}^{N} \theta(-y_n \omega_t^T x_n)(-y_n x_n)$$

      2. Update:

$$\omega_{t+1} \leftarrow \omega_t - \eta \nabla E_{in}(\omega_t)$$

Until $\nabla E_{in}(\omega_{t+1}) = 0$, or until enough iterations have been performed.
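To make this loop concrete, here is a minimal NumPy sketch of the algorithm above (function and variable names such as `logistic_regression_gd` are illustrative, not from the original text); `theta` is the logistic function introduced later in the post.

```python
import numpy as np

def theta(s):
    """Logistic function theta(s) = 1 / (1 + e^{-s})."""
    return 1.0 / (1.0 + np.exp(-s))

def logistic_regression_gd(X, y, eta=0.1, max_iter=1000, tol=1e-6):
    """Batch gradient descent for logistic regression.

    X : (N, d+1) array of inputs (with a leading column of 1s for the bias).
    y : (N,) array of labels in {-1, +1}.
    """
    N, d1 = X.shape
    w = np.zeros(d1)                                   # initialize w_0
    for t in range(max_iter):
        # gradient: (1/N) * sum_n theta(-y_n w^T x_n) * (-y_n x_n)
        grad = np.mean(theta(-y * (X @ w))[:, None] * (-y[:, None] * X), axis=0)
        if np.linalg.norm(grad) < tol:                 # gradient (approximately) zero
            break
        w = w - eta * grad                             # update step
    return w
```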

The target function here is $f(x) = P(+1 \mid x) \in [0, 1]$, used for binary classification: when $f(x) > 0.5$, predict $+1$; when $f(x) < 0.5$, predict $-1$.

Derivation:


Logistic function:

$$\theta(s) = \frac{e^s}{1 + e^s} = \frac{1}{1 + e^{-s}}$$

Its curve looks as follows:

[Figure: plot of the logistic (sigmoid) function]

Properties of this function:

  • Domain: $(-\infty, +\infty)$
  • Range: $(0, 1)$
  • Smooth, monotonic, and sigmoid-shaped over its domain
  • $\theta(s) = 1 - \theta(-s)$
  • $\frac{d\theta(s)}{ds} = \theta(s)(1 - \theta(s))$

In logistic regression, the logistic function is applied to a linear score of the input:
$$h(x) = \frac{1}{1 + \exp(-\omega^T x)}$$
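As a quick check of the definitions above, the following minimal NumPy sketch (names such as `sigmoid`, `hypothesis`, and `predict` are illustrative) implements $\theta(s)$ and $h(x)$, verifies the two properties numerically, and applies the 0.5 threshold mentioned earlier for classification.

```python
import numpy as np

def sigmoid(s):
    """Logistic function theta(s) = 1 / (1 + e^{-s})."""
    return 1.0 / (1.0 + np.exp(-s))

def hypothesis(w, x):
    """h(x) = theta(w^T x); w and x are 1-D arrays of the same length."""
    return sigmoid(np.dot(w, x))

def predict(w, x):
    """Classify as +1 when h(x) > 0.5, else -1."""
    return 1 if hypothesis(w, x) > 0.5 else -1

# Numerically check theta(s) = 1 - theta(-s) and theta'(s) = theta(s)(1 - theta(s))
s = np.linspace(-5, 5, 11)
assert np.allclose(sigmoid(s), 1 - sigmoid(-s))
eps = 1e-6
numeric_grad = (sigmoid(s + eps) - sigmoid(s - eps)) / (2 * eps)
assert np.allclose(numeric_grad, sigmoid(s) * (1 - sigmoid(s)), atol=1e-6)
```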

Next, we derive the parameter update rule for logistic regression from the principle of maximum likelihood.

The target function can be written as
$$f(x) = P(+1 \mid x) \;\Leftrightarrow\; P(y \mid x) = \begin{cases} f(x) & \text{for } y = +1 \\ 1 - f(x) & \text{for } y = -1 \end{cases}$$

Suppose we have a data set $D = \{(x_1, \bigcirc), (x_2, \times), \cdots, (x_N, \times)\}$, where $\bigcirc$ denotes label $+1$ and $\times$ denotes label $-1$.

Then the likelihood that a hypothesis $h$ generates the data set $D$ is
$$P(x_1)h(x_1) \times P(x_2)(1 - h(x_2)) \times \cdots \times P(x_N)(1 - h(x_N))$$

The probability that the target function $f$ generates $D$ is typically large (this is the idea behind maximum likelihood), so when $h \approx f$, the probability that $h$ generates $D$ is also very large, i.e.,
$$g \approx \mathop{\arg\max}\limits_h \; \text{likelihood}(h)$$

Here $h(x) = \theta(\omega^T x)$, and $1 - h(x) = h(-x)$, so
$$\begin{aligned} \text{likelihood}(h) &= P(x_1)h(x_1) \times P(x_2)(1 - h(x_2)) \times \cdots \times P(x_N)(1 - h(x_N)) \\ &= P(x_1)h(x_1) \times P(x_2)h(-x_2) \times \cdots \times P(x_N)h(-x_N) \\ &= P(x_1)h(y_1 x_1) \times P(x_2)h(y_2 x_2) \times \cdots \times P(x_N)h(y_N x_N) \end{aligned}$$

For any hypothesis $h$, the terms $P(x_i)$ are the same, so
$$\text{likelihood}(h) \propto \prod_{n=1}^{N} h(y_n x_n)$$

Expressing $h$ in terms of $\omega$, we have
$$\mathop{\max}\limits_\omega \; \text{likelihood}(h) \propto \prod_{n=1}^{N} \theta(y_n \omega^T x_n)$$

Taking the logarithm (turning the product into a sum), negating (turning maximization into minimization), and averaging, we get
$$\begin{aligned} &\mathop{\min}\limits_\omega \frac{1}{N}\sum_{n=1}^{N} -\ln \theta(y_n \omega^T x_n) \\ =\; &\mathop{\min}\limits_\omega \frac{1}{N}\sum_{n=1}^{N} \ln\left(1 + \exp(-y_n \omega^T x_n)\right) \\ =\; &\mathop{\min}\limits_\omega \frac{1}{N}\underbrace{\sum_{n=1}^{N} \mathrm{err}(\omega, x_n, y_n)}_{E_{in}(\omega)} \end{aligned}$$

This is the error measure used in logistic regression, the cross-entropy error:
$$\mathrm{err}(\omega, x, y) = \ln\left(1 + \exp(-y \omega^T x)\right)$$
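For reference, a minimal sketch of computing the in-sample cross-entropy error (the helper name `cross_entropy_error` and the row-per-sample layout of `X` are my own assumptions):

```python
import numpy as np

def cross_entropy_error(w, X, y):
    """In-sample error E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n * w^T x_n)).

    X : (N, d+1) array, each row is an input x_n (with a leading 1 for the bias).
    y : (N,) array of labels in {-1, +1}.
    """
    scores = X @ w                              # w^T x_n for every sample
    return np.mean(np.log1p(np.exp(-y * scores)))
```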

Since $E_{in}(\omega)$ is convex, its minimum is attained where $\nabla E_{in}(\omega) = 0$. To compute the gradient, write
$$E_{in}(\omega) = \frac{1}{N}\sum_{n=1}^{N} \ln\Big(\underbrace{1 + \exp(\overbrace{-y_n \omega^T x_n}^{\bigcirc})}_{\Delta}\Big)$$

$$\begin{aligned} \frac{\partial E_{in}(\omega)}{\partial \omega_i} &= \frac{1}{N}\sum_{n=1}^{N} \left(\frac{\partial \ln(\Delta)}{\partial \Delta}\right)\left(\frac{\partial (1 + \exp(\bigcirc))}{\partial \bigcirc}\right)\left(\frac{\partial (-y_n \omega^T x_n)}{\partial \omega_i}\right) \\ &= \frac{1}{N}\sum_{n=1}^{N} \left(\frac{1}{\Delta}\right)\big(\exp(\bigcirc)\big)(-y_n x_{n,i}) \\ &= \frac{1}{N}\sum_{n=1}^{N} \left(\frac{\exp(\bigcirc)}{1 + \exp(\bigcirc)}\right)(-y_n x_{n,i}) \\ &= \frac{1}{N}\sum_{n=1}^{N} \theta(\bigcirc)(-y_n x_{n,i}) \end{aligned}$$

That is,
$$\nabla E_{in}(\omega) = \frac{1}{N}\sum_{n=1}^{N} \theta(-y_n \omega^T x_n)(-y_n x_n) = 0$$

This equation has no closed-form solution. If we view $\theta(\cdot)$ as a weight on $-y_n x_n$, the gradient is a weighted average of the terms $-y_n x_n$ with weights $\theta(\cdot)$, so $\nabla E_{in}(\omega) = 0$ requires all weights $\theta(\cdot) = 0$:
1. All $\theta(\cdot) = 0$ holds if and only if $y_n \omega^T x_n \gg 0$ for every $n$, i.e. the data set is linearly separable; once the data set is not linearly separable, the gradient cannot be exactly zero.
2. The weight condition $\theta(\cdot) = 0$ is a nonlinear equation in $\omega$, so a closed-form solution is not easy to obtain.

Therefore, the parameters are updated by iterative optimization: gradient descent is used to minimize the objective,
$$\omega_{t+1} \leftarrow \omega_t - \eta \nabla E_{in}(\omega_t)$$

Practical implementation:

The feature matrix of the data set $D$ is
$$x = \underbrace{\begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_{11} & x_{21} & \cdots & x_{n1} \\ x_{12} & x_{22} & \cdots & x_{n2} \\ \vdots & \vdots & \vdots & \vdots \\ x_{1d} & x_{2d} & \cdots & x_{nd} \end{bmatrix}}_{(d+1) \times N}$$

The corresponding label vector is:

$$y = \underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}}_{N \times 1}$$

The gradient can then be computed as:

$$A = \underbrace{\theta\big(-y \mathbin{.\!*} (\overbrace{\omega^T x}^{1 \times N})\big)}_{1 \times N}, \qquad b = \underbrace{-y \mathbin{.\!*} x}_{(d+1) \times N}$$

where $.*$ denotes element-wise multiplication (the labels broadcast over the rows of $x$), so that, consistent with the averaged gradient above,

$$\nabla E_{in}(\omega) = \frac{1}{N}\Big(\underbrace{A_1}_{\text{scalar}}\underbrace{b_1}_{(d+1) \times 1} + A_2 b_2 + \cdots + A_N b_N\Big) = \frac{1}{N}\underbrace{b}_{(d+1) \times N} \underbrace{\begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_N \end{bmatrix}}_{N \times 1}$$
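A minimal sketch of this vectorized computation, assuming the $(d+1) \times N$ column-per-sample layout of $x$ used above (the function name `gradient` is illustrative):

```python
import numpy as np

def gradient(w, X, y):
    """Vectorized gradient of E_in.

    X : (d+1, N) array, one column per sample (first row all 1s).
    y : (N,) array of labels in {-1, +1}.
    w : (d+1,) weight vector.
    """
    scores = w @ X                        # row vector of w^T x_n, shape (N,)
    A = 1.0 / (1.0 + np.exp(y * scores))  # theta(-y_n * w^T x_n), shape (N,)
    b = -y * X                            # (d+1, N) matrix whose columns are -y_n x_n
    return (b @ A) / X.shape[1]           # (1/N) * b A, shape (d+1,)
```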

In practice, linear regression is often used to obtain an initial weight vector, which is then refined with PLA, pocket, or logistic regression; logistic regression usually performs better than pocket.
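A minimal sketch of such an initialization, assuming the standard pseudo-inverse solution of linear regression (the name `linear_regression_init` and the row-per-sample layout of `X` are my own):

```python
import numpy as np

def linear_regression_init(X, y):
    """Initial weights from linear regression: w = pseudo-inverse(X) @ y.

    X : (N, d+1) array of inputs (with a leading column of 1s for the bias).
    y : (N,) array of labels in {-1, +1}.
    """
    return np.linalg.pinv(X) @ y

# The returned weights can then serve as w_0 for PLA / pocket / logistic regression.
```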

Using stochastic gradient descent (SGD):

The gradient above is computed by summing the gradients at all data points and then averaging. This average can be approximated by the gradient at a single randomly chosen point, i.e.,
$$\omega_{t+1} \leftarrow \omega_t + \eta \underbrace{\theta(-y_n \omega_t^T x_n)(y_n x_n)}_{-\nabla \mathrm{err}(\omega_t,\, x_n,\, y_n)}$$

SGD embodies the idea of online learning: every time a new data point arrives, the parameters can be updated once; a minimal sketch of one such update follows below.
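This sketch performs a single stochastic-gradient step (the name `sgd_update` is illustrative, and the example $(x_n, y_n)$ is assumed to be drawn at random or to arrive online):

```python
import numpy as np

def sgd_update(w, x_n, y_n, eta=0.1):
    """One stochastic-gradient update from a single example (x_n, y_n).

    w   : (d+1,) current weight vector.
    x_n : (d+1,) one input vector (with a leading 1 for the bias).
    y_n : scalar label in {-1, +1}.
    """
    theta_val = 1.0 / (1.0 + np.exp(y_n * np.dot(w, x_n)))  # theta(-y_n w^T x_n)
    return w + eta * theta_val * y_n * x_n                   # w <- w + eta * theta(.) * y_n x_n
```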

Pros: low computational cost; well suited to large data sets and online-learning settings.
Cons: unstable (individual updates are noisy).
