Linear Regression
- Objective function
$$J(\theta)=\frac{1}{2}\sum_{i=1}^n(h_{\theta}(x^{(i)})-y^{(i)})^2=\frac{1}{2}(X\theta-y)^T(X\theta-y)$$
- Solution methods
Normal equation: $\theta=(X^TX)^{-1}X^Ty$
Gradient descent: $\frac{\partial J(\theta)}{\partial \theta_j}=\sum_{i=1}^n(h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)}$, giving the update $\theta_j := \theta_j-\alpha\sum_{i=1}^n(h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)}$
- Regularization
| | L0 penalty | L1 penalty | L2 penalty | Elastic Net |
|---|---|---|---|---|
| Form | $\sum I_{\{\theta_i \neq 0\}}$ | $\sum \vert\theta_i \vert$ | $\sum \theta_i^2$ | $\rho \sum \vert\theta_i \vert+(1-\rho)\sum \theta_i^2,\ \rho \in[0,1]$ |
| Meaning | the number of nonzero parameters in the model | the sum of the absolute values of the parameters (the L1 norm) | the sum of the squared parameters (the squared L2 norm) | a weighted combination of the L1 and L2 penalties |
| Tendency | | L1 tends to keep a small number of features and drive the rest exactly to zero | L2 keeps more features but shrinks all of them toward zero | |
| When to use | | L1 is a good choice when only a few of the features matter; beyond regularization, it is also very useful for feature selection | L2 is more suitable when most features contribute, and contribute roughly equally | |
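A minimal sketch of how these penalties are used in practice, via scikit-learn's `Ridge` (L2), `Lasso` (L1), and `ElasticNet`; the data is synthetic and purely illustrative, with only two informative features (the setting where L1 shines):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
theta_true = np.array([3.0, -2.0] + [0.0] * 8)  # only 2 of 10 features matter
y = X @ theta_true + 0.1 * rng.normal(size=100)

for model in (Ridge(alpha=1.0),                       # L2 penalty
              Lasso(alpha=0.1),                       # L1 penalty
              ElasticNet(alpha=0.1, l1_ratio=0.5)):   # rho = 0.5 mix of L1 and L2
    model.fit(X, y)
    nonzero = int(np.sum(np.abs(model.coef_) > 1e-6))
    print(type(model).__name__, "nonzero coefficients:", nonzero)
```

Typically `Lasso` zeroes out most of the eight irrelevant coefficients, while `Ridge` merely shrinks them.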
- $R^2$ evaluation
$$R^2=\frac{\sum_{i=1}^n (\hat{y}_i- \overline{y})^2}{\sum_{i=1}^n (y_i- \overline{y})^2}=1-\frac{\sum_{i=1}^n\hat{\epsilon}_i^2}{\sum_{i=1}^n (y_i- \overline{y})^2}=1-\frac{\sum_{i=1}^n(\hat{y}_i-y_i)^2}{\sum_{i=1}^n (y_i- \overline{y})^2}$$
$$\left(\sum_{i=1}^n (y_i- \overline{y})^2\ \ge\ \sum_{i=1}^n (\hat{y}_i- \overline{y})^2 +\sum_{i=1}^n\hat{\epsilon}_i^2\right)$$
Here $\hat{y}_i$ is the fitted value, $x_i, y_i$ are the sample values, and $\hat{\epsilon}_i=y_i-\hat{y}_i$ is the residual. The larger $R^2$, the better the fit. The equality in parentheses holds when $\theta$ is the least-squares solution with an intercept term, since the residuals are then orthogonal to the fitted values; in that case the three expressions for $R^2$ coincide.
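As a quick check on the formulas above, a small NumPy sketch that fits $\theta$ by the normal equation and computes $R^2$ in both the explained-variance and residual forms (synthetic data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=50)])  # intercept column
y = X @ np.array([1.0, 2.0]) + 0.5 * rng.normal(size=50)

theta = np.linalg.solve(X.T @ X, X.T @ y)  # normal equation: (X^T X) theta = X^T y
y_hat = X @ theta

ss_tot = np.sum((y - y.mean()) ** 2)
r2_explained = np.sum((y_hat - y.mean()) ** 2) / ss_tot
r2_residual = 1 - np.sum((y - y_hat) ** 2) / ss_tot
print(r2_explained, r2_residual)  # identical for OLS with an intercept
```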
Gradient Descent
| | Batch gradient descent | Stochastic gradient descent | Mini-batch gradient descent |
|---|---|---|---|
| Update | $\theta_j := \theta_j-\alpha\sum_{i=1}^n(h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)}$ | $\theta_j := \theta_j-\alpha (h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)}$ | $\theta_j := \theta_j-\alpha\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)},\ m<n$ |
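The three variants differ only in how many samples enter each update. A minimal NumPy sketch under the linear-regression objective above (learning rate, batch size, and data are illustrative):

```python
import numpy as np

def gradient_step(theta, X, y, alpha):
    """One update theta := theta - alpha * X^T (X theta - y) over the given rows."""
    return theta - alpha * X.T @ (X @ theta - y)

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -1.0, 0.5])
theta = np.zeros(3)
m = 20  # mini-batch size, m < n

for epoch in range(100):
    idx = rng.permutation(len(X))
    # Batch GD passes all n rows at once; SGD passes a single row (m = 1);
    # mini-batch GD, shown here, passes m rows at a time.
    for start in range(0, len(X), m):
        batch = idx[start:start + m]
        theta = gradient_step(theta, X[batch], y[batch], alpha=0.01)

print(theta)  # converges toward [1, -1, 0.5]
```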
Locally Weighted Linear Regression
- Objective function
$$J(\theta)=\sum_{i=1}^nw^{(i)}(h_{\theta}(x^{(i)})-y^{(i)})^2$$
$w$ is the sample weight; if it is Gaussian, then $w^{(i)}=\exp\left(-\frac{(x^{(i)}-x)^2}{2\tau^2}\right)$, where $\tau$ is the bandwidth and $x$ is the query point.
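Because the weights depend on the query point $x$, locally weighted regression re-solves a weighted least-squares problem for every prediction. A sketch, assuming the standard closed form $\theta=(X^TWX)^{-1}X^TWy$ with the Gaussian weights above:

```python
import numpy as np

def lwlr_predict(x_query, X, y, tau=0.5):
    """Predict at x_query via weighted least squares with Gaussian weights."""
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)                                     # n x n weight matrix
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # (X^T W X) theta = X^T W y
    return x_query @ theta
```

Smaller `tau` fits more locally; larger `tau` approaches ordinary linear regression.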
Logistic Regression
- Log-linear model
Log odds: $\operatorname{logit}(p)=\log\frac{p}{1-p}=\log\frac{h_{\theta}(x)}{1-h_{\theta}(x)}=\theta^Tx$
- Sigmoid function
$$g(z)=\frac{1}{1+e^{-z}}$$
$$g'(z)=g(z)(1-g(z))$$
- Parameter estimation
Assume $P(y=1|x;\theta)=h_{\theta}(x)$ and $P(y=0|x;\theta)=1-h_{\theta}(x)$;
then $p(y|x;\theta)=(h_{\theta}(x))^y(1-h_{\theta}(x))^{1-y}$.
The log-likelihood is therefore
$$\log L(\theta)=\log \prod_{i=1}^n p(y^{(i)}|x^{(i)};\theta)=\sum_{i=1}^n \left[y^{(i)}\log h(x^{(i)})+(1-y^{(i)})\log(1-h(x^{(i)}))\right]$$
Taking the partial derivative with respect to $\theta_j$:
$$\begin{aligned} \frac{\partial \ell(\theta)}{\partial \theta_j} &= \sum_{i=1}^n\left(\frac{y^{(i)}}{h(x^{(i)})}- \frac{1-y^{(i)}}{1-h(x^{(i)})}\right)\frac{\partial h(x^{(i)})}{\partial \theta_j}\\ &=\sum_{i=1}^n\left(\frac{y^{(i)}}{g(\theta^Tx^{(i)})}- \frac{1-y^{(i)}}{1-g(\theta^Tx^{(i)})}\right)\frac{\partial g(\theta^Tx^{(i)})}{\partial \theta_j}\\ &=\sum_{i=1}^n\left(\frac{y^{(i)}}{g(\theta^Tx^{(i)})}- \frac{1-y^{(i)}}{1-g(\theta^Tx^{(i)})}\right)g(\theta^Tx^{(i)})(1-g(\theta^Tx^{(i)}))\frac{\partial \theta^Tx^{(i)}}{\partial \theta_j}\\ &=\sum_{i=1}^n \left(y^{(i)}-g(\theta^Tx^{(i)})\right)x_j^{(i)} \end{aligned}$$
- Loss function
The loss is the negative log-likelihood: $J(\theta)=-\log L(\theta)$
- Gradient descent
Batch gradient descent:
$$\theta_j := \theta_j+\alpha\sum_{i=1}^n(y^{(i)}-g(\theta^Tx^{(i)}))x_j^{(i)}$$
Stochastic gradient descent:
$$\theta_j := \theta_j+\alpha(y^{(i)}-g(\theta^Tx^{(i)}))x_j^{(i)}$$
Mini-batch gradient descent:
$$\theta_j := \theta_j+\alpha\sum_{i=1}^m(y^{(i)}-h_\theta (x^{(i)}))x_j^{(i)},\quad m<n$$
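Putting the pieces together, a compact batch-gradient-ascent sketch for logistic regression, following the update above (learning rate and iteration count are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, iters=1000):
    """Ascend the log-likelihood: theta := theta + alpha * X^T (y - g(X theta))."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta += alpha * X.T @ (y - sigmoid(X @ theta))
    return theta
```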
| | Linear regression | Logistic regression |
|---|---|---|
| $h_\theta (x^{(i)})$ | $h_\theta (x^{(i)})=\theta^Tx^{(i)}$ | $h_\theta (x^{(i)})=\frac{1}{1+e^{-\theta^Tx^{(i)}}}$ |
| Assumption | $\epsilon$ (in $y=\theta^Tx+\epsilon$) follows a Gaussian distribution, which is in the exponential family | $y$ follows a Bernoulli (binomial) distribution, which is in the exponential family |
For exponential-family distributions, gradient descent takes this same general form.
Softmax Regression
- Softmax function
$$\mathrm{softmax}(z_k)=\frac{\exp(z_k)}{\sum_{i=1}^K \exp(z_i)}$$
$$\begin{aligned} \frac{\partial\, \mathrm{softmax}(z_k)}{\partial z_k}&=\frac{\exp(z_k)\sum_{i=1}^K \exp(z_i)-\exp(z_k)\exp(z_k)}{(\sum_{i=1}^K \exp(z_i))^2}\\ &=\mathrm{softmax}(z_k)(1-\mathrm{softmax}(z_k)) \end{aligned}$$
$$\frac{\partial\, \mathrm{softmax}(z_k)}{\partial z_j}=\frac{-\exp(z_k)\exp(z_j)}{(\sum_{i=1}^K \exp(z_i))^2}=-\mathrm{softmax}(z_k)\,\mathrm{softmax}(z_j),\quad j\neq k$$
$$\frac{\partial \log \mathrm{softmax}(z_k)}{\partial z_k}=\frac{\partial (z_k-\log \sum_{i=1}^K \exp(z_i))}{\partial z_k}=1-\mathrm{softmax}(z_k)$$
$$\left(\frac{\partial \log \mathrm{softmax}(z_k)}{\partial z_k}=\frac{1}{\mathrm{softmax}(z_k)}\frac{\partial\, \mathrm{softmax}(z_k)}{\partial z_k}=1-\mathrm{softmax}(z_k)\right)$$
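A small sketch of the softmax function, with a finite-difference check of the two derivative identities above (subtracting `max(z)` before exponentiating is the usual stability trick and leaves the output unchanged):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift-invariant, avoids overflow
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])
s = softmax(z)
eps = 1e-6
for j in range(3):
    dz = np.zeros(3); dz[j] = eps
    numeric = (softmax(z + dz)[0] - s[0]) / eps
    analytic = s[0] * (1 - s[0]) if j == 0 else -s[0] * s[j]
    print(j, numeric, analytic)  # each pair agrees to ~1e-6
```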
- $K$-class classification
For class $k$, the parameter vector is $\theta_k=(\theta_{k,1},\dots,\theta_{k,m})^T$, where $m$ is the dimension of the data $x$; stacking these rows, $\Theta$ is a $K\times m$ matrix. The label of the $i$-th sample $x^{(i)}$ is the one-hot vector $\bm{y}^{(i)}=(y_1^{(i)},\dots,y_K^{(i)})$.
Assume:
$$P(y=k|x;\theta)=\frac{\exp(\theta_k^Tx)}{\sum_{l=1}^K \exp(\theta_l^Tx)},\quad k=1,2,\dots,K$$
Write $(\hat{\bm{y}}^{(i)})^T=(\hat{y}^{(i)}_1,\dots,\hat{y}^{(i)}_K)=(P(y=1|x^{(i)};\theta),\dots,P(y=K|x^{(i)};\theta))$.
- Log-likelihood
$$\begin{aligned} \log L(\theta)&=\log \prod_{i=1}^n p(y^{(i)}|x^{(i)};\theta)\\ &=\log \prod_{i=1}^n \prod_{k=1}^K P(y=k|x^{(i)};\theta)^{y_k^{(i)}}\\ &=\log \prod_{i=1}^n \prod_{k=1}^K \left(\frac{\exp(\theta_k^Tx^{(i)})}{\sum_{l=1}^K \exp(\theta_l^Tx^{(i)})} \right)^{y_k^{(i)}}\\ &\left(=\sum_{i=1}^n \sum_{k=1}^K y_k^{(i)} \log \hat{y}^{(i)}_k =\sum_{i=1}^n (\bm{y}^{(i)})^T\log\hat{\bm{y}}^{(i)} \right)\\ &=\sum_{i=1}^n \sum_{k=1}^K y_k^{(i)} \left(\theta_k^Tx^{(i)}-\log\sum_{l=1}^K \exp(\theta_l^Tx^{(i)})\right) \end{aligned}$$
Taking the partial derivative with respect to $\theta_j$ (note that the $\log\sum_{l}\exp(\theta_l^Tx^{(i)})$ term contributes for every sample, not only those with $y_j^{(i)}=1$, and that $\sum_k y_k^{(i)}=1$ for one-hot labels):
$$\frac{\partial \log L(\theta)}{\partial \theta_j}=\sum_{i=1}^n \left(y_j^{(i)}-\frac{\exp(\theta_j^Tx^{(i)})}{\sum_{l=1}^K \exp(\theta_l^Tx^{(i)})}\right)(x^{(i)})^T=\sum_{i=1}^n \left(y_j^{(i)}-\hat{y}_j^{(i)}\right)(x^{(i)})^T$$
- Loss function
The loss is again the negative log-likelihood: $J(\theta)=-\log L(\theta)$
- Gradient descent
Batch gradient descent:
$$\theta_j := \theta_j+\alpha\sum_{i=1}^n \left(y_j^{(i)}-\frac{\exp(\theta_j^Tx^{(i)})}{\sum_{l=1}^K \exp(\theta_l^Tx^{(i)})}\right) (x^{(i)})^T$$
Stochastic gradient descent:
$$\theta_j := \theta_j+\alpha \left(y_j^{(i)}-\frac{\exp(\theta_j^Tx^{(i)})}{\sum_{l=1}^K \exp(\theta_l^Tx^{(i)})}\right) (x^{(i)})^T$$
Mini-batch gradient descent:
$$\theta_j := \theta_j+\alpha\sum_{i=1}^m \left(y_j^{(i)}-\frac{\exp(\theta_j^Tx^{(i)})}{\sum_{l=1}^K \exp(\theta_l^Tx^{(i)})}\right) (x^{(i)})^T,\quad m<n$$
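Finally, a mini-batch training sketch for softmax regression. It assumes one-hot labels `Y` of shape `(n, K)` and uses the matrix form of the update above, $\Theta := \Theta + \alpha\,(Y_B-\hat{Y}_B)^TX_B$ over each batch $B$ (the `1/m` scaling and the hyperparameters are illustrative choices):

```python
import numpy as np

def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))  # row-wise stable softmax
    return E / E.sum(axis=1, keepdims=True)

def fit_softmax(X, Y, alpha=0.1, iters=500, m=32):
    """Mini-batch ascent on the softmax log-likelihood; Theta is K x d."""
    n, d = X.shape
    K = Y.shape[1]
    Theta = np.zeros((K, d))
    rng = np.random.default_rng(0)
    for _ in range(iters):
        batch = rng.choice(n, size=m, replace=False)
        Y_hat = softmax_rows(X[batch] @ Theta.T)              # predicted probabilities
        Theta += alpha / m * (Y[batch] - Y_hat).T @ X[batch]  # (y - y_hat) x^T
    return Theta
```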