Notes
Description
What is logistic regression?
What is logistic regression used for?
What are the related terms?
Decision boundary, the Sigmoid function, the cost function
Note that although the name says "regression", this is a classification algorithm.
(It got the name because its computation resembles linear regression.)
Decision boundary: the main goal of logistic regression is to obtain a decision boundary. Based on the patterns in the dataset, we first construct a hypothesis function and obtain its coefficients through training. Finally, a test example is classified by checking which side of this boundary it falls on.
Sigmoid function: since the labels in our dataset are 0 or 1, the output of the hypothesis function has to be turned into a yes/no decision. If we simply reused a linear function, the linear hypothesis could not classify correctly near the boundary when the training data are spread far apart.
The figure below illustrates the problem that arises when a linear function is used for classification.
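As a quick numerical illustration of the same issue, here is a minimal sketch with a hypothetical 1-D toy dataset and a plain least-squares fit; the data and the linear_threshold helper are my own illustration, not the author's example:

import numpy as np

# Toy 1-D classification data: feature value vs. label 0/1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

def linear_threshold(x, y):
    # Fit y ~ w*x + b by least squares, then classify with h(x) >= 0.5:
    # return the x value where the fitted line crosses 0.5
    A = np.stack([x, np.ones_like(x)], axis=1)
    (w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return (0.5 - b) / w

print(linear_threshold(x, y))    # ~3.5, close to the true boundary

# Add one far-away positive example: the fitted line tilts and the threshold
# shifts to roughly 4.5, so the positive example at x = 4 is now misclassified
x2 = np.append(x, 30.0)
y2 = np.append(y, 1.0)
print(linear_threshold(x2, y2))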
The sigmoid function chosen:
s = \frac{1}{1 + e^{-z}}
So the hypothesis function is:
h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}
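For concreteness, a minimal NumPy sketch of this hypothesis; the sigmoid and hypothesis helpers and the parameter values are illustrative assumptions, not the class defined later in the Code section:

import numpy as np

def sigmoid(z):
    # Maps any real number into (0, 1)
    return 1 / (1 + np.exp(-z))

def hypothesis(theta, x):
    # h_theta(x) = sigmoid(theta^T x); theta and x are 1-D arrays of the same length
    return sigmoid(np.dot(theta, x))

theta = np.array([-3.0, 1.0, 1.0])   # hypothetical trained parameters (boundary: x1 + x2 = 3)
x = np.array([1.0, 2.5, 2.0])        # x[0] = 1 is the bias feature
print(hypothesis(theta, x))          # ~0.82, i.e. >= 0.5, so predict class 1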
Cost function: if we keep computing the cost the way linear regression does, the cost function may turn out to be non-convex. Its surface would then have many small dips, so gradient descent could fall into some local optimum and fail to give the result we want.
To avoid this, we pick a different cost function. Since the labels only take the values 0 and 1, the cost can be computed directly from the 0/1 labels of the training set. Observing that the sigmoid output lies in (0, 1), we can simply use the part of the log function defined on (0, 1).
The cost function can be chosen as:
\mathrm{cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)), & y = 1 \\ -\log(1 - h_\theta(x)), & y = 0 \end{cases}
Merging the two cases makes the computation easier:
\mathrm{cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))
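As a quick sanity check, a small sketch (my own illustration) showing that the merged expression reproduces both piecewise cases:

import numpy as np

def cost(h, y):
    # Merged logistic cost for a single example: -y*log(h) - (1-y)*log(1-h)
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

# y = 1: only the -log(h) branch contributes
print(cost(0.9, 1))   # ~0.105, small cost for a confident correct prediction
print(cost(0.1, 1))   # ~2.303, large cost for a confident wrong prediction
# y = 0: only the -log(1-h) branch contributes
print(cost(0.1, 0))   # ~0.105
print(cost(0.9, 0))   # ~2.303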
Multi-class classification
To generalize to multi-class classification, simply treat the class of interest as the positive set and everything else as the negative set, run binary logistic regression on it, and repeat this procedure once for each class.
To make a prediction, feed the test example into every one of the binary classifiers obtained this way and pick the class with the highest confidence.
\text{class } i : \max_i \left( h_\theta^{(i)}(x) \right)
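A minimal one-vs-all sketch (my own illustration with hypothetical per-class weight vectors, not the LogisticRegression class defined below):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical parameters of three binary classifiers (one row per class),
# each trained with its own class as the positive set and the rest as negative
thetas = np.array([[-1.0,  2.0, -1.0],    # classifier for class 0
                   [-1.0, -1.0,  2.0],    # classifier for class 1
                   [ 0.5, -1.0, -1.0]])   # classifier for class 2

x = np.array([1.0, 1.5, 0.2])             # test example, x[0] = 1 is the bias feature

scores = sigmoid(thetas @ x)              # h_theta^(i)(x) for every classifier i
print(scores)                             # confidence of each binary classifier
print(int(np.argmax(scores)))             # pick the class with the highest confidence -> 0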
Key points
Decision boundary
Hypothesis function:
h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}
Cost function:
\mathrm{cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))
Partial derivative of the cost function (it turns out to have the same form as in linear regression):
\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
Gradient descent algorithm
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)
(j = 0, 1, 2, \ldots, n)
Note: when j = 0, x_0 = 1, i.e. the bias (intercept) term of the hypothesis.
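Putting the key points together, a minimal sketch of one vectorized gradient-descent step; this is my own summary, assuming X carries a bias column x_0 = 1 (the full implementation follows in the Code section):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_step(theta, X, y, alpha):
    # X: (m, n) design matrix with X[:, 0] = 1 (the bias column x_0), y: (m,) labels in {0, 1}
    m = X.shape[0]
    h = sigmoid(X @ theta)           # h_theta(x^(i)) for every example
    grad = (X.T @ (h - y)) / m       # (1/m) * sum (h - y) * x_j, for every j at once
    return theta - alpha * grad      # theta_j := theta_j - alpha * dJ/dtheta_j

# Tiny hypothetical example
X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0])
theta = np.zeros(2)
for _ in range(1000):
    theta = gradient_step(theta, X, y, alpha=0.1)
print(theta, sigmoid(X @ theta))     # predictions move toward the 0/0/1 labels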
Code
Python implementation
Most of the code is the same as in linear regression; only the cost function and a few small details differ, and the Sigmoid() function is written out separately.
import numpy as np


class LogisticRegression:
    '''
    Logistic regression
    # ! Places that differ from linear regression are marked with '# !'
    Attributes:
        X - training set (scale the features to a suitable range, one example per column)
        Y - training labels (values are 0 or 1)
        W - matrix of the hypothesis-function weights (w1, w2, w3, ...) (W_j corresponds to theta_j, j = 1, 2, 3, ...)
        b - bias parameter of the hypothesis function (the weight for x0 = 1) (b corresponds to theta_0)
        learning_rate - learning rate
        num_iter - number of iterations
        costs - recorded cost values (optional)
    Usage:
        lg = LogisticRegression()
        lg.init(X_train, Y_train)
        lg.train(0.001, 2000)
        predicted = lg.predict(X_test)
    '''
    X = 0
    Y = 0
    W = 0
    b = 0
    learning_rate = 0
    num_iter = 0
    costs = []

    # Sigmoid function (written out separately, as noted above)
    @staticmethod
    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    # Initialize variables
    def init(self, X, Y):
        '''
        Load the training set and set some initial values
        Parameters:
            X - training set
            Y - training labels
        '''
        self.X = X
        self.Y = Y
        self.W = np.zeros(shape=(X.shape[0], 1))
        self.b = 0
        self.costs = []

    # Partial derivatives of the cost function J
    # h(x) = sigmoid(W^T x + b)
    def partial_derivative(self):
        '''
        Evaluate (via the hand-derived formula) the derivative term of the gradient-descent update
        Returns:
            dW, db - partial derivatives with respect to the hypothesis-function parameters
        '''
        m = self.X.shape[1]
        # Hypothesis function (forward pass)
        # ! Unlike linear regression, this is used for classification, so the hypothesis differs.
        #   (The training set X should be preprocessed into a suitable range, e.g. -0.5~0.5 or 0~1,
        #    otherwise learning may fail.)
        # Feature scaling: in theory and in practice, it is best to scale the features into the region where
        #   the sigmoid changes quickly; otherwise the scale mismatches the learning rate and it is hard to
        #   converge to the optimum. (For example, image intensities in 0~255: without scaling, most sigmoid
        #   outputs are very close to 1; even without vanishing gradients, learning becomes extremely slow.)
        H = self.sigmoid(np.dot(self.W.T, self.X) + self.b)
        # Compute and record the cost (optional, just to observe how gradient descent is doing)
        # ! Simply using sum(h - y) as the cost works in linear regression, but not in logistic regression:
        #   since the hypothesis h is non-linear, the cost could be non-convex, and we would only find a
        #   local optimum rather than the global one.
        cost = (-1 / m) * np.sum(self.Y * np.log(H) + (1 - self.Y) * np.log(1 - H))
        self.costs.append(cost)
        # Partial derivatives (backward pass)
        # ! The cost is computed differently from linear regression (to keep it convex), so the derivative
        #   is derived from a different expression.
        # ? It was unclear why Andrew Ng (lecture 50) obtains the same derivative as linear regression.
        # ?! This GitHub notebook gives the proof; fortunately the result does match linear regression:
        #    https://github.com/halfrost/Halfrost-Field/blob/master/contents/Machine_Learning/Logistic_Regression.ipynb
        dW = 1 / m * np.dot(self.X, (H - self.Y).T)
        db = 1 / m * np.sum(H - self.Y)
        return dW, db

    # Gradient descent
    # temp0 = W - alpha * partial_derivative(J0(W, b))
    # temp1 = b - alpha * partial_derivative(J1(W, b))
    # ...
    def gradient_descent(self):
        '''
        Run gradient descent: W_j = W_j - alpha * partial_derivative(J_j(W_j, b)), j = 1, 2, 3, ...
        '''
        for i in range(self.num_iter):
            dW, db = self.partial_derivative()
            # Gradient descent: update the parameters W and b
            self.W = self.W - self.learning_rate * dW
            self.b = self.b - self.learning_rate * db

    # Start training
    def train(self, learning_rate=0, num_iter=0):
        '''
        Start training
        Parameters:
            learning_rate - learning rate
            num_iter - number of iterations
        '''
        self.learning_rate = learning_rate
        self.num_iter = num_iter
        self.gradient_descent()

    # Prediction
    def predict(self, X):
        '''
        Predict on the dataset X
        Parameters:
            X - test set
        Returns:
            predicted - predictions for the test set X
        '''
        # Plug the learned parameters W and b into the hypothesis for the test set
        # ! Unlike linear regression, the test data go through the hypothesis function and are then binarized
        predicted = self.sigmoid(np.dot(self.W.T, X) + self.b)
        # Binarize the result
        predicted = np.round(predicted)
        predicted = predicted.astype(int)
        return predicted
Test
The test set consists of images of the digits 0 and 1.
# Digit classification
# In this example, we only recognize the digit 0
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import os
from logistic_regression import LogisticRegression

if __name__ == '__main__':
    # Dataset paths
    train_dataset_path = '../datasets/digital_datasets/train_images/'
    test_dataset_path = '../datasets/digital_datasets/test_images/'

    def load_dataset(dataset_path):
        images = []
        targets = []
        path_root = os.listdir(dataset_path)
        for path_root_dir in path_root:  # directories under the dataset path
            path_root_x = dataset_path + path_root_dir + '/'
            path_root_root = os.listdir(path_root_x)  # files under each subdirectory
            train_sets_path = [path_root_x + filename for filename in path_root_root]
            for path in train_sets_path:
                image = Image.open(path)  # load the image
                image_array = np.array(image)  # convert to an array
                image_array_ravel = image_array.ravel()  # flatten
                image_array_ravel_scale = image_array_ravel / 255  # feature scaling
                images.append(image_array_ravel_scale)
                l = path.split('/')
                targets.append(1.0 if l[-2] == '0' else 0.0)  # images not in the digit-0 directory get label 0
        X = np.stack(images)  # training data
        Y = np.array(targets, ndmin=2).T  # labels
        return X.T, Y.T

    # Load the data
    train_set_x, train_set_y = load_dataset(train_dataset_path)
    test_set_x, test_set_y = load_dataset(test_dataset_path)

    # Logistic regression
    N = 2000  # number of iterations
    lr = LogisticRegression()
    lr.init(train_set_x, train_set_y)
    lr.train(0.003, N)
    predicted = lr.predict(test_set_x)

    # Show the predictions, the ground truth and the accuracy
    print('Predictions: ', end='')
    print(predicted)
    print('Test labels: ', end='')
    print(test_set_y)
    print(f'Accuracy: {np.mean(np.equal(test_set_y, predicted)) * 100}%')

    # Plot the cost over the iterations
    plt.plot([x for x in range(N)], lr.costs)
    plt.show()
Training set
The training and test data were copied directly from someone else.
I have put them on GitHub: digit images
Appendix
Derivation of the logistic regression gradient
First, differentiate the logistic (Sigmoid) function:
\sigma'(x) = \left( \frac{1}{1 + e^{-x}} \right)' = \frac{-(1 + e^{-x})'}{(1 + e^{-x})^2} = \frac{-\left( 1' + (e^{-x})' \right)}{(1 + e^{-x})^2} = \frac{0 - (-x)' e^{-x}}{(1 + e^{-x})^2} = \frac{e^{-x}}{(1 + e^{-x})^2}
= \left( \frac{1}{1 + e^{-x}} \right) \left( \frac{e^{-x}}{1 + e^{-x}} \right) = \sigma(x) \left( \frac{1 + e^{-x} - 1}{1 + e^{-x}} \right) = \sigma(x) \left( \frac{1 + e^{-x}}{1 + e^{-x}} - \frac{1}{1 + e^{-x}} \right) = \sigma(x) \left( 1 - \sigma(x) \right)
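A quick numerical check of this identity (my own sketch, not part of the original notes):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = 0.7                                                       # arbitrary point
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)   # finite-difference derivative
analytic = sigmoid(x) * (1 - sigmoid(x))                      # sigma(x) * (1 - sigma(x))
print(numeric, analytic)                                      # the two values agree to many decimals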
Then, using the result above together with the chain rule for composite functions:
\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \left( -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] \right)
= -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \frac{\partial}{\partial \theta_j} \log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right) \frac{\partial}{\partial \theta_j} \log\left(1 - h_\theta(x^{(i)})\right) \right]
= -\frac{1}{m} \sum_{i=1}^{m} \left[ \frac{y^{(i)} \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)})}{h_\theta(x^{(i)})} + \frac{\left(1 - y^{(i)}\right) \frac{\partial}{\partial \theta_j} \left(1 - h_\theta(x^{(i)})\right)}{1 - h_\theta(x^{(i)})} \right]
= -\frac{1}{m} \sum_{i=1}^{m} \left[ \frac{y^{(i)} \frac{\partial}{\partial \theta_j} \sigma(\theta^T x^{(i)})}{h_\theta(x^{(i)})} + \frac{\left(1 - y^{(i)}\right) \frac{\partial}{\partial \theta_j} \left(1 - \sigma(\theta^T x^{(i)})\right)}{1 - h_\theta(x^{(i)})} \right]
= -\frac{1}{m} \sum_{i=1}^{m} \left[ \frac{y^{(i)} \sigma(\theta^T x^{(i)}) \left(1 - \sigma(\theta^T x^{(i)})\right) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{h_\theta(x^{(i)})} - \frac{\left(1 - y^{(i)}\right) \sigma(\theta^T x^{(i)}) \left(1 - \sigma(\theta^T x^{(i)})\right) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{1 - h_\theta(x^{(i)})} \right]
= -\frac{1}{m} \sum_{i=1}^{m} \left[ \frac{y^{(i)} h_\theta(x^{(i)}) \left(1 - h_\theta(x^{(i)})\right) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{h_\theta(x^{(i)})} - \frac{\left(1 - y^{(i)}\right) h_\theta(x^{(i)}) \left(1 - h_\theta(x^{(i)})\right) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{1 - h_\theta(x^{(i)})} \right]
= -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \left(1 - h_\theta(x^{(i)})\right) x_j^{(i)} - \left(1 - y^{(i)}\right) h_\theta(x^{(i)}) x_j^{(i)} \right]
= -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \left(1 - h_\theta(x^{(i)})\right) - \left(1 - y^{(i)}\right) h_\theta(x^{(i)}) \right] x_j^{(i)}
= -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} - y^{(i)} h_\theta(x^{(i)}) - h_\theta(x^{(i)}) + y^{(i)} h_\theta(x^{(i)}) \right] x_j^{(i)}
= -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} - h_\theta(x^{(i)}) \right] x_j^{(i)}
= \frac{1}{m} \sum_{i=1}^{m} \left[ h_\theta(x^{(i)}) - y^{(i)} \right] x_j^{(i)}
So, fortunately, the partial derivative has essentially the same form as in linear regression, and the linear-regression gradient expression can be reused for the logistic regression computation.
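To double-check this result numerically, a small sketch (my own, not part of the original notes) comparing the analytic gradient above with a central finite-difference approximation of J:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def J(theta, X, y):
    # Logistic-regression cost: -(1/m) * sum [ y*log(h) + (1-y)*log(1-h) ]
    m = X.shape[0]
    h = sigmoid(X @ theta)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m

def analytic_grad(theta, X, y):
    # (1/m) * sum (h - y) * x_j, the expression derived above
    m = X.shape[0]
    return X.T @ (sigmoid(X @ theta) - y) / m

rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])   # 5 examples: bias column + 2 features
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
theta = rng.normal(size=3)

eps = 1e-6
numeric = np.array([(J(theta + eps * e, X, y) - J(theta - eps * e, X, y)) / (2 * eps)
                    for e in np.eye(3)])
print(np.max(np.abs(numeric - analytic_grad(theta, X, y))))  # very small (~1e-9 or below)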
An additional property of the sigmoid function
\mathrm{cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)), & y = 1 \\ -\log(1 - h_\theta(x)), & y = 0 \end{cases}
If you plot these costs over the (0, 1) range of the sigmoid output, you can see that the closer the prediction is to the true label, the smaller the cost; conversely, the cost shoots up, which is how the learning algorithm gets penalized for wrong predictions.
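For example, for a y = 1 example the -log(h) branch behaves like this (a quick numerical illustration):

import numpy as np

h = np.array([0.999, 0.9, 0.5, 0.1, 0.001])   # predicted probabilities for a y = 1 example
print(-np.log(h))                              # ~[0.001, 0.105, 0.693, 2.303, 6.908]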