Logistic Regression Study Notes
These notes are for my own study and understanding only.
Logistic Regression
Logistic regression is used to solve binary classification problems.
A regression model's output is continuous; a classification model's output is discrete.
- Logistic regression = linear regression + sigmoid function (see the code sketch right after this list)
- Linear regression: $z = w x + b$
- Sigmoid function: $y = \frac{1}{1+e^{-z}} = \frac{1}{1+e^{-(w x + b)}}$
- Logistic regression loss function (the smaller the loss, the better the model; training is exactly the optimization process that minimizes this loss):
  $$C = -\left[y \ln a + (1-y) \ln (1-a)\right]$$
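To make the "linear regression + sigmoid" composition concrete, here is a minimal NumPy sketch of the prediction step (the helper names `sigmoid` and `predict_proba` are my own, not from the original notes):

```python
import numpy as np

def sigmoid(z):
    """Map a linear score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """Logistic regression prediction: sigmoid applied to the linear model w*x + b."""
    z = np.dot(x, w) + b
    return sigmoid(z)

# One sample with two features
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.3])
b = 0.1
print(predict_proba(x, w, b))  # a probability between 0 and 1
```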
Loss function:
$$\text{cost} = \begin{cases} -\log(\hat{p}) & \text{if } y = 1 \\ -\log(1-\hat{p}) & \text{if } y = 0 \end{cases}$$
The loss for a single sample can then be written in the unified form:
$$\text{cost} = -y\log(\hat{p}) - (1-y)\log(1-\hat{p})$$
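As a quick sanity check (my own addition, not part of the original notes), the unified expression reproduces both branches of the piecewise definition:

```python
import numpy as np

def cost_piecewise(y, p_hat):
    """Piecewise cross-entropy cost for a single sample."""
    return -np.log(p_hat) if y == 1 else -np.log(1.0 - p_hat)

def cost_unified(y, p_hat):
    """Unified form: -y*log(p_hat) - (1-y)*log(1-p_hat)."""
    return -y * np.log(p_hat) - (1 - y) * np.log(1.0 - p_hat)

for y in (0, 1):
    for p_hat in (0.1, 0.5, 0.9):
        assert np.isclose(cost_piecewise(y, p_hat), cost_unified(y, p_hat))
print("piecewise and unified forms agree")
```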
Loss over all samples (summed and averaged):
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log\left(\hat{p}^{(i)}\right) + \left(1-y^{(i)}\right)\log\left(1-\hat{p}^{(i)}\right)\right]$$
Writing the prediction as $\hat{p}^{(i)} = \sigma\left(X_b^{(i)}\theta\right)$, where $X_b$ is the sample matrix with a leading column of ones and $\theta$ contains the intercept $\theta_0$ together with the feature weights:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log\left(\sigma\left(X_b^{(i)}\theta\right)\right) + \left(1-y^{(i)}\right)\log\left(1-\sigma\left(X_b^{(i)}\theta\right)\right)\right]$$
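A vectorized NumPy sketch of this loss (the function and variable names are mine; `X_b` is assumed to already contain the leading column of ones):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_J(theta, X_b, y):
    """Average cross-entropy loss J(theta) over all m samples."""
    p_hat = sigmoid(X_b.dot(theta))
    m = len(y)
    return -np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat)) / m
```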
The expression above has no closed-form analytical solution, but it is a convex function (any local minimum is also the global minimum), so it can be minimized with gradient descent.
Gradient Descent
Restating the objective to minimize:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log\left(\sigma\left(X_b^{(i)}\theta\right)\right) + \left(1-y^{(i)}\right)\log\left(1-\sigma\left(X_b^{(i)}\theta\right)\right)\right]$$
$$\nabla J(\theta) = \begin{pmatrix} \frac{\partial J(\theta)}{\partial \theta_0} \\ \frac{\partial J(\theta)}{\partial \theta_1} \\ \vdots \\ \frac{\partial J(\theta)}{\partial \theta_n} \end{pmatrix}$$
First, the derivative of the sigmoid function:
$$\sigma(t) = \frac{1}{1+e^{-t}} = \left(1+e^{-t}\right)^{-1}$$
$$\sigma'(t) = -\left(1+e^{-t}\right)^{-2}\cdot e^{-t}\cdot(-1) = \left(1+e^{-t}\right)^{-2}\cdot e^{-t}$$
Now go one layer out and differentiate $\log\sigma(t)$:
$$\begin{aligned} (\log \sigma(t))' &= \frac{1}{\sigma(t)}\cdot\sigma'(t) = \frac{1}{\sigma(t)}\cdot\left(1+e^{-t}\right)^{-2}\cdot e^{-t} \\ &= \frac{1}{\left(1+e^{-t}\right)^{-1}}\cdot\left(1+e^{-t}\right)^{-2}\cdot e^{-t} = \left(1+e^{-t}\right)^{-1}\cdot e^{-t} \end{aligned}$$
$$\begin{aligned} (\log \sigma(t))' &= \left(1+e^{-t}\right)^{-1}\cdot e^{-t} \\ &= \frac{e^{-t}}{1+e^{-t}} = \frac{1+e^{-t}-1}{1+e^{-t}} = 1-\frac{1}{1+e^{-t}} \\ &= 1-\sigma(t) \end{aligned}$$
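A quick numerical check of this identity (my own addition): compare a finite-difference derivative of $\log\sigma(t)$ with $1-\sigma(t)$:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

t = np.linspace(-5, 5, 11)
eps = 1e-6
# Central finite difference of log(sigmoid(t))
numeric = (np.log(sigmoid(t + eps)) - np.log(sigmoid(t - eps))) / (2 * eps)
analytic = 1 - sigmoid(t)
print(np.allclose(numeric, analytic, atol=1e-6))  # True
```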
Applying the chain rule with $\frac{\partial\left(X_b^{(i)}\theta\right)}{\partial\theta_j} = X_j^{(i)}$, the first term of $J(\theta)$ gives:

$$\frac{d\left(y^{(i)} \log \sigma\left(X_b^{(i)}\theta\right)\right)}{d\theta_j} = y^{(i)}\left(1-\sigma\left(X_b^{(i)}\theta\right)\right)\cdot X_j^{(i)}$$
Similarly, for $\log(1-\sigma(t))$:

$$\begin{aligned} (\log(1-\sigma(t)))' &= \frac{1}{1-\sigma(t)}\cdot(-1)\cdot\sigma'(t) = -\frac{1}{1-\sigma(t)}\cdot\left(1+e^{-t}\right)^{-2}\cdot e^{-t} \\ &= -\frac{1+e^{-t}}{e^{-t}}\cdot\left(1+e^{-t}\right)^{-2}\cdot e^{-t} \\ &= -\left(1+e^{-t}\right)^{-1} = -\sigma(t) \end{aligned}$$
So the second term of $J(\theta)$ gives:

$$\frac{d\left(\left(1-y^{(i)}\right)\log\left(1-\sigma\left(X_b^{(i)}\theta\right)\right)\right)}{d\theta_j} = \left(1-y^{(i)}\right)\cdot\left(-\sigma\left(X_b^{(i)}\theta\right)\right)\cdot X_j^{(i)}$$
Combining the two terms together with the leading $-\frac{1}{m}$ factor:

$$\begin{aligned} \frac{\partial J(\theta)}{\partial \theta_j} &= \frac{1}{m}\sum_{i=1}^{m}\left(\sigma\left(X_b^{(i)}\theta\right)-y^{(i)}\right) X_j^{(i)} \\ &= \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right) X_j^{(i)} \end{aligned}$$
where $\hat{y}^{(i)}$ is the model's predicted value $\sigma\left(X_b^{(i)}\theta\right)$.
$$\nabla J(\theta) = \begin{pmatrix} \partial J/\partial\theta_0 \\ \partial J/\partial\theta_1 \\ \partial J/\partial\theta_2 \\ \vdots \\ \partial J/\partial\theta_n \end{pmatrix} = \frac{1}{m}\cdot\begin{pmatrix} \sum_{i=1}^{m}\left(\sigma\left(X_b^{(i)}\theta\right)-y^{(i)}\right) \\ \sum_{i=1}^{m}\left(\sigma\left(X_b^{(i)}\theta\right)-y^{(i)}\right)\cdot X_1^{(i)} \\ \sum_{i=1}^{m}\left(\sigma\left(X_b^{(i)}\theta\right)-y^{(i)}\right)\cdot X_2^{(i)} \\ \vdots \\ \sum_{i=1}^{m}\left(\sigma\left(X_b^{(i)}\theta\right)-y^{(i)}\right)\cdot X_n^{(i)} \end{pmatrix} = \frac{1}{m}\cdot\begin{pmatrix} \sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right) \\ \sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)\cdot X_1^{(i)} \\ \sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)\cdot X_2^{(i)} \\ \vdots \\ \sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)\cdot X_n^{(i)} \end{pmatrix}$$
Vectorizing the above: the $j$-th entry is the dot product of the error vector $\sigma(X_b\theta)-y$ with the $j$-th column of $X_b$, so the whole gradient collapses into a single matrix product.
$$\nabla J(\theta) = \frac{1}{m}\cdot X_b^T\cdot\left(\sigma\left(X_b\theta\right)-y\right)$$
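A vectorized NumPy sketch of this gradient (function and variable names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_J(theta, X_b, y):
    """Vectorized gradient: (1/m) * X_b^T (sigma(X_b theta) - y)."""
    m = len(y)
    return X_b.T.dot(sigmoid(X_b.dot(theta)) - y) / m
```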
Gradient Descent
An optimization algorithm, best suited to functions with a single extremum. For functions without a unique extremum, it can be run several times from randomized starting points.
The learning rate $\eta$ affects how quickly the optimum is reached; if it is chosen badly, the optimum may not be reached at all. $\eta$ is a hyperparameter of gradient descent.
Goal (using linear regression as the example): make $J(\theta) = \operatorname{MSE}(y, \hat{y})$ as small as possible:
$$\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}-\hat{y}^{(i)}\right)^2 = \frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}-\theta_0-\theta_1 X_1^{(i)}-\theta_2 X_2^{(i)}-\ldots-\theta_n X_n^{(i)}\right)^2$$
$$\nabla J(\theta) = \begin{pmatrix} \partial J/\partial\theta_0 \\ \partial J/\partial\theta_1 \\ \partial J/\partial\theta_2 \\ \vdots \\ \partial J/\partial\theta_n \end{pmatrix} = \frac{1}{m}\cdot\begin{pmatrix} \sum_{i=1}^{m} 2\left(y^{(i)}-X_b^{(i)}\theta\right)\cdot(-1) \\ \sum_{i=1}^{m} 2\left(y^{(i)}-X_b^{(i)}\theta\right)\cdot\left(-X_1^{(i)}\right) \\ \sum_{i=1}^{m} 2\left(y^{(i)}-X_b^{(i)}\theta\right)\cdot\left(-X_2^{(i)}\right) \\ \vdots \\ \sum_{i=1}^{m} 2\left(y^{(i)}-X_b^{(i)}\theta\right)\cdot\left(-X_n^{(i)}\right) \end{pmatrix} = \frac{2}{m}\cdot\begin{pmatrix} \sum_{i=1}^{m}\left(X_b^{(i)}\theta-y^{(i)}\right) \\ \sum_{i=1}^{m}\left(X_b^{(i)}\theta-y^{(i)}\right)\cdot X_1^{(i)} \\ \sum_{i=1}^{m}\left(X_b^{(i)}\theta-y^{(i)}\right)\cdot X_2^{(i)} \\ \vdots \\ \sum_{i=1}^{m}\left(X_b^{(i)}\theta-y^{(i)}\right)\cdot X_n^{(i)} \end{pmatrix}$$
$$\theta_i = \theta_i - \eta\,\frac{\partial J\left(\theta_0,\theta_1,\cdots,\theta_n\right)}{\partial\theta_i}$$
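Putting the MSE gradient and the update rule together, here is a minimal gradient-descent loop for the linear-regression example (the data and all names are my own illustration, not from the original notes):

```python
import numpy as np

def mse_gradient(theta, X_b, y):
    """Gradient of the MSE loss: (2/m) * X_b^T (X_b theta - y)."""
    m = len(y)
    return 2.0 / m * X_b.T.dot(X_b.dot(theta) - y)

def gradient_descent(X_b, y, eta=0.01, n_iters=10_000):
    """Repeatedly apply theta <- theta - eta * gradient."""
    theta = np.zeros(X_b.shape[1])
    for _ in range(n_iters):
        theta -= eta * mse_gradient(theta, X_b, y)
    return theta

# Tiny example: data generated from y = 4 + 3x plus noise
rng = np.random.default_rng(0)
x = 2 * rng.random(100)
y = 4 + 3 * x + rng.normal(0, 0.1, 100)
X_b = np.c_[np.ones(100), x]      # prepend a column of ones for theta_0
print(gradient_descent(X_b, y))   # approximately [4, 3]
```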
A small worked example of gradient descent (with a Python program):
Gradient descent 1
Gradient descent 2
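Since the linked worked example is not reproduced in these notes, here is a self-contained sketch that trains the logistic regression model from this post with gradient descent (the synthetic data and all names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, eta=0.1, n_iters=5000):
    """Minimize J(theta) by gradient descent using the vectorized gradient."""
    X_b = np.c_[np.ones(len(X)), X]      # add the intercept column
    theta = np.zeros(X_b.shape[1])
    m = len(y)
    for _ in range(n_iters):
        grad = X_b.T.dot(sigmoid(X_b.dot(theta)) - y) / m
        theta -= eta * grad
    return theta

# Synthetic, linearly separable binary data: label 1 when 2*x1 - x2 + 0.5 > 0
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = (2 * X[:, 0] - X[:, 1] + 0.5 > 0).astype(float)

theta = train_logistic_regression(X, y)
probs = sigmoid(np.c_[np.ones(len(X)), X].dot(theta))
print(theta, np.mean((probs >= 0.5) == y))   # accuracy should be close to 1
```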