Overview of Logistic Regression
It takes input samples with multiple feature values and outputs a predicted class.
Advantages: low computational cost; easy to understand and implement.
Disadvantages: prone to underfitting; classification accuracy may be low.
The Sigmoid function is used to turn the output into a binary classification. Its formula is:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
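As a quick sanity check (my own sketch, not part of the original program), the following NumPy snippet shows that the Sigmoid squashes any real input into the interval (0, 1):

```python
import numpy as np

def sigmoid(inx):
    return 1.0 / (1 + np.exp(-inx))

# Large negative inputs approach 0, zero maps to exactly 0.5,
# large positive inputs approach 1.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ≈ [4.5e-05, 0.5, 0.99995]
```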

Suppose there are $m$ samples, each with $n$ features (in the code below, feature 0 is a constant bias term $x_0 = 1$), and the samples belong to two classes, labeled 0 and 1:
| | Feature 0 | Feature 1 | Feature 2 | … | Feature $n$ |
|---|---|---|---|---|---|
| Sample 1 | $x_0^{(1)}$ | $x_1^{(1)}$ | $x_2^{(1)}$ | … | $x_n^{(1)}$ |
| Sample 2 | $x_0^{(2)}$ | $x_1^{(2)}$ | $x_2^{(2)}$ | … | $x_n^{(2)}$ |
| Sample 3 | $x_0^{(3)}$ | $x_1^{(3)}$ | $x_2^{(3)}$ | … | $x_n^{(3)}$ |
| … | … | … | … | … | … |
| Sample $m$ | $x_0^{(m)}$ | $x_1^{(m)}$ | $x_2^{(m)}$ | … | $x_n^{(m)}$ |
For a given sample, the hypothesis (prediction) function is:
$$h_\theta(x) = \sigma(\theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n) = \sigma(\theta^T X)$$
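A minimal sketch of this hypothesis function for a single sample. The parameter vector `theta` and the sample `x` below are made-up illustrative values, not taken from the original text; `x[0] = 1` plays the role of the bias feature $x_0$:

```python
import numpy as np

def sigmoid(inx):
    return 1.0 / (1 + np.exp(-inx))

theta = np.array([0.5, -1.2, 2.0])  # hypothetical model parameters
x = np.array([1.0, 0.3, 0.8])       # hypothetical sample; x[0] = 1 is the bias term

h = sigmoid(theta @ x)  # h_theta(x) = sigma(theta^T x)
print(h)                # a probability in (0, 1); predict class 1 if h >= 0.5, else class 0
```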
Define the cost function as follows:
$$Cost(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)), & y = 1 \\ -\log(1 - h_\theta(x)), & y = 0 \end{cases}$$
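To see why this is a sensible cost, here is a small numeric check (my own illustration, with made-up prediction values): when the true label is 1, a confident correct prediction costs almost nothing, while a confident wrong prediction is penalized heavily.

```python
import numpy as np

# Hypothetical predictions h_theta(x) for a sample whose true label is y = 1.
for h in (0.99, 0.5, 0.01):
    print(h, -np.log(h))
# 0.99 -> ~0.01  (almost correct, tiny cost)
# 0.5  -> ~0.69
# 0.01 -> ~4.6   (confidently wrong, large cost)
```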
Since the true class $y^{(i)} \in \{0, 1\}$, the two cases can be merged into a single cost function:
$$J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$$
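A minimal vectorized sketch of this cost (my own helper, not part of the original program; `X`, `y`, and `weights` are assumed to be plain NumPy arrays rather than `np.mat` matrices):

```python
import numpy as np

def cost(X, y, weights):
    """Cross-entropy cost J(theta), averaged over the m samples.

    X: (m, n) feature matrix, y: (m,) array of 0/1 labels, weights: (n,) parameters.
    """
    h = 1.0 / (1 + np.exp(-(X @ weights)))  # h_theta(x^(i)) for every sample, shape (m,)
    m = y.shape[0]
    return float(-np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m)
```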
To compute the gradient of the cost function, we first need the derivative of the Sigmoid function:
$$\begin{aligned} \sigma'(x) &= \left(\frac{1}{1 + e^{-x}}\right)' = \frac{-(1 + e^{-x})'}{(1 + e^{-x})^2} = \frac{-(-x)'\,e^{-x}}{(1 + e^{-x})^2} \\ &= \frac{e^{-x}}{(1 + e^{-x})^2} = \left(\frac{1}{1 + e^{-x}}\right)\left(\frac{e^{-x}}{1 + e^{-x}}\right) \\ &= \sigma(x)\left(\frac{e^{-x}}{1 + e^{-x}}\right) = \sigma(x)\left(\frac{1 + e^{-x} - 1}{1 + e^{-x}}\right) \\ &= \sigma(x)\left(1 - \frac{1}{1 + e^{-x}}\right) \\ &= \sigma(x)\,(1 - \sigma(x)) \end{aligned}$$
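As an optional numerical check (my own, not from the original), the identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ can be compared against a central finite-difference approximation:

```python
import numpy as np

def sigmoid(inx):
    return 1.0 / (1 + np.exp(-inx))

x = np.linspace(-5, 5, 11)
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # central difference
analytic = sigmoid(x) * (1 - sigmoid(x))                     # sigma(x) * (1 - sigma(x))
print(np.max(np.abs(numeric - analytic)))                    # should be close to 0
```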
Taking the partial derivative of the cost function with respect to $\theta$, i.e. the gradient, gives for each parameter $\theta_j$:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
Knowing the gradient of the cost function with respect to the parameters $\theta$, we can update $\theta$ iteratively.
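For reference, the standard gradient descent update (which the code below implements, with learning rate $\alpha$) is:

$$\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$$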
| | Feature 0 | Feature 1 | Feature 2 | … | Feature $n$ | Prediction | True value |
|---|---|---|---|---|---|---|---|
| Sample 1 | $x_0^{(1)}$ | $x_1^{(1)}$ | $x_2^{(1)}$ | … | $x_n^{(1)}$ | $h_\theta(x^{(1)})$ | $y^{(1)}$ |
| Sample 2 | $x_0^{(2)}$ | $x_1^{(2)}$ | $x_2^{(2)}$ | … | $x_n^{(2)}$ | $h_\theta(x^{(2)})$ | $y^{(2)}$ |
| Sample 3 | $x_0^{(3)}$ | $x_1^{(3)}$ | $x_2^{(3)}$ | … | $x_n^{(3)}$ | $h_\theta(x^{(3)})$ | $y^{(3)}$ |
| … | … | … | … | … | … | … | … |
| Sample $m$ | $x_0^{(m)}$ | $x_1^{(m)}$ | $x_2^{(m)}$ | … | $x_n^{(m)}$ | $h_\theta(x^{(m)})$ | $y^{(m)}$ |
For example, to compute the gradient for a single model parameter $\theta_1$, the expanded expression is:

$$\frac{\partial J(\theta)}{\partial \theta_1} = \frac{1}{m} \left( x_1^{(1)}(h_\theta(x^{(1)}) - y^{(1)}) + x_1^{(2)}(h_\theta(x^{(2)}) - y^{(2)}) + x_1^{(3)}(h_\theta(x^{(3)}) - y^{(3)}) + \dots + x_1^{(m)}(h_\theta(x^{(m)}) - y^{(m)}) \right)$$
The following Python code implements this gradient descent algorithm:
```python
def grad_descent(data_arr, class_labels):
    data_mat = np.mat(data_arr)                    # convert the feature array into a matrix
    labels_mat = np.mat(class_labels).transpose()  # transpose: column vector of labels
    m, n = np.shape(data_mat)                      # m samples, each with n features
    alpha = 0.1                                    # learning rate
    max_cycles = 1000                              # number of iterations
    weights = np.ones((n, 1))                      # initialize all weights to 1
    for k in range(max_cycles):                    # each update sweeps the whole dataset, costly for large data
        h = sigmoid(data_mat * weights)            # predictions h_theta(x^(i)) for all samples
        error = h - labels_mat                     # prediction errors h_theta(x^(i)) - y^(i)
        grad = data_mat.transpose() * error / m    # gradient: 1/m * sum((h_theta(x^(i)) - y^(i)) * x^(i))
        weights = weights - alpha * grad           # step opposite to the gradient
    return np.array(weights)                       # weights is an np.matrix; convert back to ndarray
```
`h = sigmoid(data_mat * weights)` computes all $h_\theta(x^{(i)})$ at once; the resulting matrix has shape (m, 1).
`error = (h - labels_mat)` computes all $h_\theta(x^{(i)}) - y^{(i)}$ at once; the result is again an (m, 1) matrix.
`data_mat.transpose()` transposes the training data matrix, changing its shape from (m, n) to (n, m).
`grad = data_mat.transpose() * error / m` computes the gradients of all model parameters $\theta$ at once, i.e. $\frac{1}{m}\sum_{i=1}^{m}\left[h_\theta(x^{(i)}) - y^{(i)}\right]x^{(i)}$, with shape (n, 1).
`weights = weights - alpha * grad` updates the model parameters $\theta$ using the gradient and the learning rate.
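To make these shapes concrete, here is a tiny standalone check (my own illustration, with made-up sizes m = 4 and n = 3):

```python
import numpy as np

m, n = 4, 3                                    # 4 samples, 3 features (made-up sizes)
data_mat = np.mat(np.random.rand(m, n))        # (m, n) feature matrix
labels_mat = np.mat([0, 1, 1, 0]).transpose()  # (m, 1) label column
weights = np.ones((n, 1))                      # (n, 1) parameter column

h = 1.0 / (1 + np.exp(-(data_mat * weights)))  # (m, 1) predictions
error = h - labels_mat                         # (m, 1) errors
grad = data_mat.transpose() * error / m        # (n, 1) gradients

print(h.shape, error.shape, grad.shape)        # (4, 1) (4, 1) (3, 1)
```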
The complete code is below. It uses scikit-learn to generate a dataset and then classifies the data points with logistic regression:
```python
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
import numpy as np


def load_dataset(n_samples):
    X, y = make_blobs(n_samples=n_samples, n_features=2, centers=2, random_state=2, cluster_std=2)
    bias = np.ones((n_samples, 1))           # constant bias feature x0 = 1
    X = np.concatenate((bias, X), axis=1)    # prepend the bias column to the features
    return X, y


def sigmoid(inx):
    return 1.0 / (1 + np.exp(-inx))


def grad_descent(data_arr, class_labels):
    data_mat = np.mat(data_arr)                    # convert the feature array into a matrix
    labels_mat = np.mat(class_labels).transpose()  # transpose: column vector of labels
    m, n = np.shape(data_mat)                      # m samples, each with n features
    alpha = 0.1                                    # learning rate
    max_cycles = 1000                              # number of iterations
    weights = np.ones((n, 1))                      # initialize all weights to 1
    for k in range(max_cycles):                    # each update sweeps the whole dataset, costly for large data
        h = sigmoid(data_mat * weights)            # predictions h_theta(x^(i)) for all samples
        error = h - labels_mat                     # prediction errors h_theta(x^(i)) - y^(i)
        grad = data_mat.transpose() * error / m    # gradient: 1/m * sum((h_theta(x^(i)) - y^(i)) * x^(i))
        weights = weights - alpha * grad           # step opposite to the gradient
    return np.array(weights)                       # weights is an np.matrix; convert back to ndarray


def train_model():
    X, y = load_dataset(100)
    minx = np.min(X[:, 1])
    maxx = np.max(X[:, 1])
    weights = grad_descent(X, y)
    fx = np.arange(minx, maxx, 0.1)
    fy = (-weights[0] - weights[1] * fx) / weights[2]  # decision boundary: w0*x0 + w1*x1 + w2*x2 = 0
    plt.plot(fx, fy, 'b-')                             # plot the decision boundary
    plt.plot(X[y == 0][:, 1], X[y == 0][:, 2], 'ro')   # class 0 samples as red circles
    plt.plot(X[y == 1][:, 1], X[y == 1][:, 2], 'gs')   # class 1 samples as green squares
    plt.show()


if __name__ == '__main__':
    train_model()
```
After running the program, the plotted classification decision boundary looks like this: