Machine Learning, Advanced (Part 1): Regression

This post goes through regression models in detail, covering linear regression, locally weighted linear regression, logistic regression, and Softmax regression. It discusses objective functions, gradient descent, regularization, evaluation, and the loss functions and optimization methods of the different models, such as batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.


Linear Regression

1. Objective function
   $$J(\theta)=\frac{1}{2}\sum_{i=1}^n\bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)^2=\frac{1}{2}(X\theta-y)^T(X\theta-y)$$
2. Solution methods (a minimal code sketch of both follows this list)
   Closed form (normal equation): $\theta=(X^TX)^{-1}X^Ty$
   Gradient descent: since $\frac{\partial J(\theta)}{\partial \theta_j}=\sum_{i=1}^n\bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)x_j^{(i)}$, the update rule is $\theta_j:=\theta_j-\alpha\sum_{i=1}^n\bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)x_j^{(i)}$
3. Regularization

   | | L0 | L1 | L2 | Elastic Net |
   |---|---|---|---|---|
   | Form | $\sum I_{\{\theta_i \neq 0\}}$ | $\sum \vert\theta_i\vert$ | $\sum \theta_i^2$ | $\rho\sum\vert\theta_i\vert+(1-\rho)\sum\theta_i^2,\ \rho\in[0,1]$ |
   | Meaning | the number of non-zero parameters in the model | the sum of the absolute values of the parameters (the L1 norm) | the sum of the squared parameters (the L2 norm is the square root of this sum) | a weighted combination of the L1 and L2 penalties |
   | Tendency | | tends to keep only a few non-zero features, driving the rest exactly to 0 | keeps more features, but shrinks their weights toward 0 | |
   | Typical use | | when only a few features really matter, L1 is the better choice; besides acting as a regularizer, it is also very useful for feature selection | when most features contribute, and contribute roughly equally, L2 is more appropriate | |
4. $R^2$ evaluation
   $$R^2=\frac{\sum_{i=1}^n(\hat{y}_i-\overline{y})^2}{\sum_{i=1}^n(y_i-\overline{y})^2}=1-\frac{\sum_{i=1}^n\hat{\epsilon}_i^2}{\sum_{i=1}^n(y_i-\overline{y})^2}=1-\frac{\sum_{i=1}^n(\hat{y}_i-y_i)^2}{\sum_{i=1}^n(y_i-\overline{y})^2}$$
   where $\hat{y}_i$ is the fitted value and $(x_i,y_i)$ is a sample. The larger $R^2$ is, the better the fit. The first and second expressions agree because of the decomposition $\sum_{i=1}^n(y_i-\overline{y})^2=\sum_{i=1}^n(\hat{y}_i-\overline{y})^2+\sum_{i=1}^n\hat{\epsilon}_i^2$, which holds for the ordinary least-squares fit with an intercept term. (The sketch after this list also computes $R^2$.)
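A minimal NumPy sketch of the normal equation, its L2-regularized (ridge) variant, and the $R^2$ computation above. The synthetic data, the variable names, and the regularization strength `lam` are illustrative assumptions, not part of the original post:

```python
import numpy as np

# Synthetic data purely for illustration (sizes and noise level are arbitrary assumptions).
rng = np.random.default_rng(0)
n, m = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])   # prepend an intercept column
theta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.1, size=n)

# Normal equation theta = (X^T X)^{-1} X^T y; lstsq avoids forming an explicit inverse.
theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# L2-regularized (ridge) closed form: theta = (X^T X + lam * I)^{-1} X^T y.
lam = 0.1
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# R^2 = 1 - SSE / SST, matching the formula above.
y_hat = X @ theta_ols
sse = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - sse / sst
print(theta_ols, theta_ridge, r_squared)
```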

Gradient Descent

| | Batch gradient descent | Stochastic gradient descent | Mini-batch gradient descent |
|---|---|---|---|
| Update rule | $\theta_j:=\theta_j-\alpha\sum_{i=1}^n\bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)x_j^{(i)}$ | $\theta_j:=\theta_j-\alpha\bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)x_j^{(i)}$ | $\theta_j:=\theta_j-\alpha\sum_{i=1}^m\bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)x_j^{(i)},\ m<n$ |
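A minimal sketch of all three variants for linear regression. The function names, the $1/n$ (or per-batch) scaling of the step, and the hyperparameters `alpha`, `epochs`, and `batch_size` are assumptions for readability rather than part of the table above:

```python
import numpy as np

def batch_gd(X, y, alpha=0.01, epochs=500):
    """Batch gradient descent: every step uses all n samples."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (X @ theta - y)            # sum_i (h(x_i) - y_i) x_i
        theta -= alpha * grad / len(y)          # dividing by n keeps alpha comparable across variants
    return theta

def sgd(X, y, alpha=0.01, epochs=50, seed=0):
    """Stochastic gradient descent: every step uses a single sample."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            theta -= alpha * (X[i] @ theta - y[i]) * X[i]
    return theta

def minibatch_gd(X, y, alpha=0.01, epochs=100, batch_size=16, seed=0):
    """Mini-batch gradient descent: every step uses m < n samples."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(y))
        for start in range(0, len(y), batch_size):
            batch = idx[start:start + batch_size]
            grad = X[batch].T @ (X[batch] @ theta - y[batch])
            theta -= alpha * grad / len(batch)
    return theta
```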

Locally Weighted Linear Regression

1. Objective function
   $$J(\theta)=\sum_{i=1}^n w^{(i)}\bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)^2$$
   Here $w^{(i)}$ is a per-sample weight. With a Gaussian kernel, $w^{(i)}=\exp\!\left(-\frac{(x^{(i)}-x)^2}{2\tau^2}\right)$, where $\tau$ is the bandwidth.
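A minimal sketch of prediction at a single query point, solving the weighted least-squares problem in closed form as $\theta=(X^TWX)^{-1}X^TWy$. The function name `lwlr_predict` and the default bandwidth value are assumptions:

```python
import numpy as np

def lwlr_predict(x_query, X, y, tau=0.5):
    """Locally weighted linear regression prediction at one query point x_query.

    Weights use the Gaussian kernel w_i = exp(-||x_i - x_query||^2 / (2 tau^2)),
    and theta solves the weighted normal equations (X^T W X) theta = X^T W y.
    X is expected to already contain an intercept column if one is desired.
    """
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta
```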

Logistic Regression

1. Log-linear model
   Log-odds (logit): $\mathrm{logit}(p)=\log\frac{p}{1-p}=\log\frac{h_{\theta}(x)}{1-h_{\theta}(x)}=\theta^Tx$

2. Sigmoid function
   $$g(z)=\frac{1}{1+e^{-z}}$$
   $$g'(z)=g(z)\bigl(1-g(z)\bigr)$$

3. Parameter estimation
   Assume $P(y=1\mid x;\theta)=h_{\theta}(x)$ and $P(y=0\mid x;\theta)=1-h_{\theta}(x)$, so that
   $$p(y\mid x;\theta)=\bigl(h_{\theta}(x)\bigr)^y\bigl(1-h_{\theta}(x)\bigr)^{1-y}$$
   The log-likelihood is then
   $$\ell(\theta)=\log L(\theta)=\log\prod_{i=1}^n p\bigl(y^{(i)}\mid x^{(i)};\theta\bigr)=\sum_{i=1}^n\Bigl[y^{(i)}\log h(x^{(i)})+\bigl(1-y^{(i)}\bigr)\log\bigl(1-h(x^{(i)})\bigr)\Bigr]$$
   Taking the partial derivative with respect to $\theta_j$:
   $$\begin{aligned}\frac{\partial \ell(\theta)}{\partial \theta_j}&=\sum_{i=1}^n\left(\frac{y^{(i)}}{h(x^{(i)})}-\frac{1-y^{(i)}}{1-h(x^{(i)})}\right)\frac{\partial h(x^{(i)})}{\partial \theta_j}\\&=\sum_{i=1}^n\left(\frac{y^{(i)}}{g(\theta^Tx^{(i)})}-\frac{1-y^{(i)}}{1-g(\theta^Tx^{(i)})}\right)\frac{\partial g(\theta^Tx^{(i)})}{\partial \theta_j}\\&=\sum_{i=1}^n\left(\frac{y^{(i)}}{g(\theta^Tx^{(i)})}-\frac{1-y^{(i)}}{1-g(\theta^Tx^{(i)})}\right)g(\theta^Tx^{(i)})\bigl(1-g(\theta^Tx^{(i)})\bigr)\frac{\partial\,\theta^Tx^{(i)}}{\partial \theta_j}\\&=\sum_{i=1}^n\bigl(y^{(i)}-g(\theta^Tx^{(i)})\bigr)x_j^{(i)}\end{aligned}$$

4. Loss function
   The loss is the negative log-likelihood: $J(\theta)=-\log L(\theta)$

5. Gradient descent (a minimal code sketch follows this list)
   Because we maximize the log-likelihood, the sign is "+": gradient ascent on $\ell(\theta)$, equivalently gradient descent on $J(\theta)$.
   Batch gradient descent
   $$\theta_j:=\theta_j+\alpha\sum_{i=1}^n\bigl(y^{(i)}-g(\theta^Tx^{(i)})\bigr)x_j^{(i)}$$
   Stochastic gradient descent
   $$\theta_j:=\theta_j+\alpha\bigl(y^{(i)}-g(\theta^Tx^{(i)})\bigr)x_j^{(i)}$$
   Mini-batch gradient descent
   $$\theta_j:=\theta_j+\alpha\sum_{i=1}^m\bigl(y^{(i)}-g(\theta^Tx^{(i)})\bigr)x_j^{(i)},\quad m<n$$
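A minimal sketch of the batch update for logistic regression; the function names and the $1/n$ scaling of the step are assumptions added for readability:

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_batch_ascent(X, y, alpha=0.1, epochs=1000):
    """Batch update theta_j := theta_j + alpha * sum_i (y_i - g(theta^T x_i)) x_ij,
    i.e. gradient ascent on the log-likelihood above."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (y - sigmoid(X @ theta))   # vector of partial derivatives over all j
        theta += alpha * grad / len(y)          # the 1/n factor is a convention, not in the formula
    return theta
```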

| | Linear regression | Logistic regression |
|---|---|---|
| $h_\theta(x^{(i)})$ | $h_\theta(x^{(i)})=\theta^Tx^{(i)}$ | $h_\theta(x^{(i)})=\frac{1}{1+e^{-\theta^Tx^{(i)}}}$ |
| Assumption | $\epsilon$ (in $y=\theta^Tx+\epsilon$) follows a Gaussian distribution, which belongs to the exponential family | $y$ follows a Bernoulli (binomial) distribution, which belongs to the exponential family |

For distributions in the exponential family, the gradient descent updates all take a similar form.
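A small sketch of that shared form: one generic update $\theta:=\theta+\alpha\sum_i\bigl(y^{(i)}-h(\theta^Tx^{(i)})\bigr)x^{(i)}$, where only the hypothesis $h$ changes between the two models. The function name `glm_gradient_step` is an assumption:

```python
import numpy as np

def glm_gradient_step(theta, X, y, h, alpha=0.01):
    """One generic step theta := theta + alpha * sum_i (y_i - h(theta^T x_i)) x_i."""
    return theta + alpha * X.T @ (y - h(X @ theta))

# Linear regression: h is the identity (the usual "-" update on the squared error
# is the same as this "+" form, because y - h flips the sign of the residual).
linear_step = lambda theta, X, y: glm_gradient_step(theta, X, y, lambda z: z)

# Logistic regression: h is the sigmoid.
logistic_step = lambda theta, X, y: glm_gradient_step(theta, X, y, lambda z: 1.0 / (1.0 + np.exp(-z)))
```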

Softmax Regression

1. The softmax function
   $$\mathrm{softmax}(z_k)=\frac{\exp(z_k)}{\sum_{i=1}^K \exp(z_i)}$$
   $$\begin{aligned}\frac{\partial\,\mathrm{softmax}(z_k)}{\partial z_k}&=\frac{\exp(z_k)\sum_{i=1}^K \exp(z_i)-\exp(z_k)\exp(z_k)}{\bigl(\sum_{i=1}^K \exp(z_i)\bigr)^2}\\&=\mathrm{softmax}(z_k)\bigl(1-\mathrm{softmax}(z_k)\bigr)\end{aligned}$$
   $$\frac{\partial\,\mathrm{softmax}(z_k)}{\partial z_j}=\frac{-\exp(z_k)\exp(z_j)}{\bigl(\sum_{i=1}^K \exp(z_i)\bigr)^2}=-\mathrm{softmax}(z_k)\,\mathrm{softmax}(z_j)\quad(j\neq k)$$
   $$\frac{\partial\log\mathrm{softmax}(z_k)}{\partial z_k}=\frac{\partial\bigl(z_k-\log\sum_{i=1}^K \exp(z_i)\bigr)}{\partial z_k}=1-\mathrm{softmax}(z_k)$$
   $$\left(\text{equivalently, }\frac{\partial\log\mathrm{softmax}(z_k)}{\partial z_k}=\frac{1}{\mathrm{softmax}(z_k)}\frac{\partial\,\mathrm{softmax}(z_k)}{\partial z_k}=1-\mathrm{softmax}(z_k)\right)$$
2. $K$-class classification
   For class $k$, the parameter vector is $\theta_k=(\theta_{k1},\dots,\theta_{km})^T$, where $m$ is the dimension of the data $x$; stacked together, $\Theta$ is a $K\times m$ matrix. The label of the $i$-th sample $x^{(i)}$ is a one-hot vector $\bm{y}^{(i)}=(y_1^{(i)},\dots,y_K^{(i)})$.
   Assume
   $$P(y=k\mid x;\theta)=\frac{\exp(\theta_k^Tx)}{\sum_{l=1}^K \exp(\theta_l^Tx)},\quad k=1,2,\dots,K$$
   and write the vector of predicted probabilities as
   $$(\hat{\bm{y}}^{(i)})^T=\bigl(\hat{y}_1^{(i)},\dots,\hat{y}_K^{(i)}\bigr)=\bigl(P(y=1\mid x^{(i)};\theta),\dots,P(y=K\mid x^{(i)};\theta)\bigr)$$
3. Log-likelihood
   $$\begin{aligned}\log L(\theta)&=\log\prod_{i=1}^n p\bigl(y^{(i)}\mid x^{(i)};\theta\bigr)\\&=\log\prod_{i=1}^n\prod_{k=1}^K\bigl(P(y=k\mid x^{(i)};\theta)\bigr)^{y_k^{(i)}}\\&=\log\prod_{i=1}^n\prod_{k=1}^K\left(\frac{\exp(\theta_k^Tx^{(i)})}{\sum_{l=1}^K \exp(\theta_l^Tx^{(i)})}\right)^{y_k^{(i)}}\\&\left(=\sum_{i=1}^n\sum_{k=1}^K y_k^{(i)}\log\hat{y}_k^{(i)}=\sum_{i=1}^n(\bm{y}^{(i)})^T\log\hat{\bm{y}}^{(i)}\right)\\&=\sum_{i=1}^n\sum_{k=1}^K y_k^{(i)}\left(\theta_k^Tx^{(i)}-\log\sum_{l=1}^K \exp(\theta_l^Tx^{(i)})\right)\end{aligned}$$
   Taking the partial derivative with respect to $\theta_j$ (using $\sum_{k=1}^K y_k^{(i)}=1$ for one-hot labels):
   $$\frac{\partial\log L(\theta)}{\partial \theta_j}=\sum_{i=1}^n\left(y_j^{(i)}-\frac{\exp(\theta_j^Tx^{(i)})}{\sum_{l=1}^K \exp(\theta_l^Tx^{(i)})}\right)x^{(i)}$$

4. Loss function
   As before, the loss is the negative log-likelihood: $J(\theta)=-\log L(\theta)$

5. Gradient descent (a minimal code sketch of the batch update follows this list)
   Batch gradient descent
   $$\theta_j:=\theta_j+\alpha\sum_{i=1}^n\left(y_j^{(i)}-\frac{\exp(\theta_j^Tx^{(i)})}{\sum_{l=1}^K \exp(\theta_l^Tx^{(i)})}\right)x^{(i)}$$
   Stochastic gradient descent
   $$\theta_j:=\theta_j+\alpha\left(y_j^{(i)}-\frac{\exp(\theta_j^Tx^{(i)})}{\sum_{l=1}^K \exp(\theta_l^Tx^{(i)})}\right)x^{(i)}$$
   Mini-batch gradient descent
   $$\theta_j:=\theta_j+\alpha\sum_{i=1}^m\left(y_j^{(i)}-\frac{\exp(\theta_j^Tx^{(i)})}{\sum_{l=1}^K \exp(\theta_l^Tx^{(i)})}\right)x^{(i)},\quad m<n$$
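A minimal sketch of the batch update for softmax regression. The function name `softmax_batch_ascent`, the max-subtraction trick for numerical stability, and the $1/n$ scaling are assumptions added for readability:

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax; subtracting the row max avoids overflow without changing the result."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def softmax_batch_ascent(X, Y, alpha=0.1, epochs=500):
    """Batch gradient ascent on the softmax log-likelihood.

    X is n x m (features), Y is n x K (one-hot labels), Theta is K x m.
    The gradient for row j is sum_i (y_j^(i) - P(y=j|x^(i))) x^(i), as derived above.
    """
    n, m = X.shape
    K = Y.shape[1]
    Theta = np.zeros((K, m))
    for _ in range(epochs):
        P = softmax(X @ Theta.T)              # n x K predicted class probabilities
        Theta += alpha * (Y - P).T @ X / n    # the 1/n factor is a convention, not in the formula
    return Theta
```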
