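Machine Learning Algorithms in Practice: Logistic Regression
Algorithm Principles
The logistic regression model is a classification model defined by the following conditional probability distributions. It can be used for binary or multiclass classification.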
$$P(Y=k \mid x)=\frac{\exp \left(w_{k} \cdot x\right)}{1+\sum_{k=1}^{K-1} \exp \left(w_{k} \cdot x\right)}, \quad k=1,2, \cdots, K-1$$
$$P(Y=K \mid x)=\frac{1}{1+\sum_{k=1}^{K-1} \exp \left(w_{k} \cdot x\right)}$$
Here $x$ is the input feature vector, $w$ denotes the feature weights, $Y$ is the class, and $K$ is the total number of classes.
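The two class-probability formulas can be evaluated directly. The following is a minimal sketch using only the standard library; the function name `class_probabilities` and the example weights are hypothetical, chosen purely for illustration. Class $K$ is treated as the reference class with score 0, which is what the "1" in the denominator corresponds to.

```python
import math

def class_probabilities(weights, x):
    """Return [P(Y=1|x), ..., P(Y=K|x)] given K-1 weight vectors.

    `weights` holds w_1 ... w_{K-1}; class K is the reference class
    whose score is fixed at 0 (so its numerator is exp(0) = 1).
    """
    scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in weights]
    scores.append(0.0)  # reference class K
    # Subtracting the maximum score leaves the ratios unchanged
    # but keeps exp() from overflowing.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical weights for K = 3 classes and 2 features.
W = [[1.0, -0.5], [0.2, 0.3]]
probs = class_probabilities(W, [2.0, 1.0])
print(probs, sum(probs))
```

The probabilities sum to 1 by construction, and the ratio of any class probability to that of class $K$ equals $\exp(w_k \cdot x)$.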
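The logistic regression model derives from the logistic distribution, whose distribution function $F(x)$ is an S-shaped curve. Logistic regression models the log-odds of the output as a linear function of the input.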
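To make the log-odds statement concrete, dividing the probability of class $k$ by the probability of the reference class $K$ (both given by the formulas at the start of this section) cancels the shared denominator. This short worked step is not in the original text but follows directly from those two formulas:

```latex
\log \frac{P(Y=k \mid x)}{P(Y=K \mid x)}
= \log \exp\left(w_{k} \cdot x\right)
= w_{k} \cdot x, \qquad k=1,2,\cdots,K-1
```

So the log-odds of each class relative to class $K$ is linear in $x$.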
The logistic function (also called the sigmoid function) has the form
$$\operatorname{logi}(z)=\frac{1}{1+e^{-z}}$$
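As a small illustration, the logistic function can be written in a few lines of Python. This is a sketch rather than any library's implementation; the name `sigmoid` and the two-branch form are choices made here so that very large negative inputs do not overflow `math.exp`.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z)), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, exp(-z) can overflow, so use the equivalent form
    # exp(z) / (1 + exp(z)), where exp(z) is at most 1.
    ez = math.exp(z)
    return ez / (1.0 + ez)

for z in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    print(z, sigmoid(z))
```

The function maps any real number into the interval from 0 to 1 and satisfies sigmoid(z) + sigmoid(-z) = 1, which is why it can be read as a probability.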
Model Parameter Estimation
Take the binary classification model as an example.
Suppose the training dataset is $T=\{(x_1,y_1),(x_2,y_2),\ldots,(x_N,y_N)\}$, where $x_i\in \mathbb{R}^n$ and $y_i\in\{0,1\}$. Estimating the model parameters by maximum likelihood then yields the logistic regression model.
In the binary case,
$$P(Y=1 \mid x)=\frac{\exp(w\cdot x)}{1+\exp(w\cdot x)}$$
$$P(Y=0 \mid x)=\frac{1}{1+\exp(w\cdot x)}$$
Let $P(Y=1 \mid x)=\pi(x)$, so that $P(Y=0 \mid x)=1-\pi(x)$.
The likelihood function is
$$\prod_{i=1}^{N}\pi(x_i)^{y_i} \cdot [1-\pi(x_i)]^{1-y_i}$$
The log-likelihood function is
$$\begin{aligned}
L(w) &= \log\left\{ \prod_{i=1}^{N}\pi(x_i)^{y_i} \cdot [1-\pi(x_i)]^{1-y_i}\right\} \\
&= \sum_{i=1}^{N}\left\{ y_i \log\pi(x_i)+(1-y_i)\log[1-\pi(x_i)] \right\} \\
&= \sum_{i=1}^{N}\left\{ y_i \log \frac{\pi(x_i)}{1-\pi(x_i)}+\log[1-\pi(x_i)] \right\} \\
&= \sum_{i=1}^{N}\left\{ y_i (w \cdot x_i) - \log\left(1+ \exp (w \cdot x_i)\right) \right\}
\end{aligned}$$
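Differentiating the log-likelihood with respect to $w$ gives the gradient that iterative optimizers work with. This is a standard step that the original text does not spell out; it starts from $L(w)=\sum_{i=1}^{N}\left[y_i (w \cdot x_i) - \log\left(1+\exp(w \cdot x_i)\right)\right]$ and uses $\pi(x_i)=\exp(w \cdot x_i)/(1+\exp(w \cdot x_i))$:

```latex
\nabla_{w} L(w)
= \sum_{i=1}^{N}\left[ y_i\, x_i - \frac{\exp(w \cdot x_i)}{1+\exp(w \cdot x_i)}\, x_i \right]
= \sum_{i=1}^{N}\left( y_i - \pi(x_i) \right) x_i
```

Setting this gradient to zero has no closed-form solution in general, which is why numerical methods are needed.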
Maximizing $L(w)$ gives the estimate of the model parameters $w$.
Logistic regression is usually fitted with gradient descent or quasi-Newton methods.
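To illustrate the gradient-based approach, here is a minimal sketch of batch gradient ascent on the log-likelihood, written with the standard library only. The function names, the learning rate, and the iteration count are assumptions made for this example, and the six toy points are the same ones used in the sklearn demo later in this post. No regularization is applied, so the resulting weights are not expected to match sklearn's.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_likelihood(w, X, y):
    """L(w) = sum_i [ y_i * (w . x_i) - log(1 + exp(w . x_i)) ]."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(w_j * x_j for w_j, x_j in zip(w, x_i))
        # log(1 + exp(z)) written so that exp() never overflows
        total += y_i * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total

def fit_gradient_ascent(X, y, lr=0.05, n_iter=200):
    """Maximize L(w) with plain batch gradient ascent (no regularization)."""
    w = [0.0] * len(X[0])
    for _ in range(n_iter):
        grad = [0.0] * len(w)
        for x_i, y_i in zip(X, y):
            err = y_i - sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i)))
            for j, x_j in enumerate(x_i):
                grad[j] += err * x_j  # gradient term (y_i - pi(x_i)) * x_i
        w = [w_j + lr * g_j for w_j, g_j in zip(w, grad)]
    return w

# Toy data; the leading 1.0 in each row is a bias feature so w[0] acts as w0.
X = [[1.0, -1, -2], [1.0, -2, -1], [1.0, -3, -2],
     [1.0, 1, 3], [1.0, 2, 1], [1.0, 3, 2]]
y = [0, 0, 0, 1, 1, 1]

w = fit_gradient_ascent(X, y)
print("weights:", w)
print("log-likelihood:", log_likelihood(w, X, y))
```

Because this toy dataset is linearly separable, the unregularized log-likelihood has no finite maximizer and the weights keep growing with more iterations. sklearn's `LogisticRegression` applies an L2 penalty by default, which keeps its solution finite.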
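Algorithm in Practice
We use the sklearn library, first on a small constructed dataset and then on the iris dataset, to perform classification and prediction. The environment is Jupyter Notebook.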
Step 1: Import libraries
## Basic numerical library
import numpy as np
## Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
## Logistic regression model
from sklearn.linear_model import LogisticRegression
Step 2: Train the model
## Demo: classification with LogisticRegression
## Construct the dataset
x_fearures = np.array([[-1, -2], [-2, -1], [-3, -2], [1, 3], [2, 1], [3, 2]])
y_label = np.array([0, 0, 0, 1, 1, 1])
## Create the logistic regression model
lr_clf = LogisticRegression()
## Fit the model to the constructed dataset
lr_clf = lr_clf.fit(x_fearures, y_label)  # the fitted linear (log-odds) function is w0 + w1*x1 + w2*x2
Step 3: Inspect the model parameters
## Weights w of the fitted model
print('the weight of Logistic Regression:',lr_clf.coef_)
## Intercept w0 of the fitted model
print('the intercept(w0) of Logistic Regression:',lr_clf.intercept_)
## Output
the weight of Logistic Regression: [[ 0.73462087 0.6947908 ]]
the intercept(w0) of Logistic Regression: [-0.03643213]
Step 4: Visualize the data and the model
## Plot the constructed sample points
plt.figure()
plt.scatter(x_fearures[:,0],x_fearures[:,1], c=y_label, s=50, cmap='viridis')
plt.title('Dataset')
plt.show()
# Plot the decision boundary
plt.figure()
plt.scatter(x_fearures[:,0],x_fearures[:,1], c=y_label, s=50, cmap='viridis')
plt.title('Dataset')
nx, ny = 200, 100
x_min, x_max = plt.xlim()
y_min, y_max = plt.ylim()
x_grid, y_grid = np.meshgrid(np.linspace(x_min, x_max, nx),np.linspace(y_min, y_max, ny))
z_proba = lr_clf.predict_proba(np.c_[x_grid.ravel(), y_grid.ravel()])
z_proba = z_proba[:, 1].reshape(x_grid.shape)
plt.contour(x_grid, y_grid, z_proba, [0.5], linewidths=2., colors='blue')
plt.show()
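The 0.5-probability contour drawn above is the set of points where the linear score w0 + w1*x1 + w2*x2 equals zero, which is a straight line. The sketch below computes points on that line using the weights printed in Step 3, copied by hand; a different sklearn version may give slightly different values. The helper name `boundary_x2` is made up for this example.

```python
# Weights and intercept copied from the Step 3 output above.
w1, w2 = 0.73462087, 0.6947908
w0 = -0.03643213

def boundary_x2(x1: float) -> float:
    """x2 on the decision boundary w0 + w1*x1 + w2*x2 = 0 for a given x1."""
    return -(w0 + w1 * x1) / w2

for x1 in (-3.0, 0.0, 3.0):
    x2 = boundary_x2(x1)
    # The linear score is zero on the boundary, so P(Y=1|x) = 0.5 there.
    print(f"x1={x1:+.1f}  x2={x2:+.4f}  score={w0 + w1 * x1 + w2 * x2:+.1e}")
```

Points on one side of this line get a predicted probability above 0.5 for class 1, and points on the other side get a probability below 0.5.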
### Visualize predictions for new samples
plt.figure()
## New point 1
x_fearures_new1 = np.array([[0, -1]])
plt.scatter(x_fearures_new1[:,0],x_fearures_new1[:,1], s=50)
plt.annotate('New point 1',xy=(0,-1),xytext=(-2,0),color='blue',arrowprops=dict(arrowstyle='-|>',connectionstyle='arc3',color='red'))
## New point 2
x_fearures_new2 = np.array([[1, 2]])
plt.scatter(x_fearures_new2[:,0],x_fearures_new2[:,1], s=50)
plt.annotate('New point 2',xy=(1,2),xytext=(-1.5,2.5),color='red',arrowprops=dict(arrowstyle='-|>',connectionstyle='arc3',color='red'))
## Training samples
plt.scatter(x_fearures[:,0],x_fearures[:,1], c=y_label, s=50, cmap='viridis')
plt.title('Dataset')
# Plot the decision boundary
plt.contour(x_grid, y_grid, z_proba, [0.5], linewidths=2., colors='blue')
plt.show()
Step 5: Model prediction
## Use the trained model to predict the classes of the two new points
y_label_new1_predict=lr_clf.predict(x_fearures_new1)
y_label_new2_predict=lr_clf.predict(x_fearures_new2)
print('The New point 1 predict class:\n',y_label_new1_predict)
print('The New point 2 predict class:\n',y_label_new2_predict)
## Logistic regression is a probabilistic model (p = p(y=1|x,\theta), as introduced earlier), so we can use predict_proba to get the class probabilities
y_label_new1_predict_proba=lr_clf.predict_proba(x_fearures_new1)
y_label_new2_predict_proba=lr_clf.predict_proba(x_fearures_new2)
print('The New point 1 predict Probability of each class:\n',y_label_new1_predict_proba)
print('The New point 2 predict Probability of each class:\n',y_label_new2_predict_proba)
## Output
The New point 1 predict class:
[0]
The New point 2 predict class:
[1]
The New point 1 predict Probability of each class:
[[ 0.67507358 0.32492642]]
The New point 2 predict Probability of each class:
[[ 0.11029117 0.88970883]]
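These probabilities can be checked by hand from the binary model formula. The sketch below plugs the weights and intercept printed in Step 3 into the sigmoid; the values are copied from that output, and the local function `predict_proba` is a stand-in written for this example, not sklearn's method.

```python
import math

# Weights and intercept copied from the Step 3 output above.
w = [0.73462087, 0.6947908]
w0 = -0.03643213

def predict_proba(x):
    """Return [P(Y=0|x), P(Y=1|x)] for the binary logistic model."""
    z = w0 + sum(w_j * x_j for w_j, x_j in zip(w, x))
    p1 = 1.0 / (1.0 + math.exp(-z))
    return [1.0 - p1, p1]

print(predict_proba([0, -1]))  # new point 1
print(predict_proba([1, 2]))   # new point 2
```

The results should agree closely with the `predict_proba` output listed above, which confirms that the fitted model is just the sigmoid applied to w0 + w1*x1 + w2*x2.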
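Classifying the Iris Dataset in Practice
Step 1: Import libraries
## Basic libraries
import numpy as np
import pandas as pd
## Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
This time we try the method on the iris dataset. The dataset contains 5 variables: 4 feature variables and 1 target class variable. There are 150 samples. The target variable is the class of the flower; every sample belongs to one of three species of the genus Iris: Iris setosa (Iris-setosa), Iris versicolor (Iris-versicolor), and Iris virginica (Iris-virginica). The four features recorded for each flower are sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm); these morphological features were historically used to identify the species.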
Variable | Description |
---|---|
sepal length | Sepal length (cm) |
sepal width | Sepal width (cm) |
petal length | Petal length (cm) |
petal width | Petal width (cm) |
target | Iris class: 'setosa' (0), 'versicolor' (1), 'virginica' (2) |
Step 2: Load the data
## Load the iris data bundled with sklearn and convert it to a pandas DataFrame
from sklearn.datasets import load_iris
data = load_iris()  # load the dataset (features, labels, and metadata)
iris_target = data.target  # class labels
iris_features = pd.DataFrame(data=data.data, columns=data.feature_names)  # features as a DataFrame
## Use .info() to view overall information about the data
iris_features.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
sepal length (cm) 150 non-null float64
sepal width (cm) 150 non-null float64
petal length (cm) 150 non-null float64
petal width (cm) 150 non-null float64
dtypes: float64(4)
memory usage: 4.8 KB
## Take a quick look at the data with .head() (first rows) or .tail() (last rows)
iris_features.head()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
## The corresponding class labels, where 0, 1, 2 stand for 'setosa', 'versicolor', 'virginica'
iris_target
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
## Use value_counts to count the samples in each class
pd.Series(iris_target).value_counts()
2 50
1 50
0 50
dtype: int64
## Summary statistics of the features
iris_features.describe()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
## Merge the labels and the features
iris_all = iris_features.copy()  ## copy (deep by default) so the original data is not modified
iris_all['target'] = iris_target
## Pairwise scatter plots of the features, colored by label
sns.pairplot(data=iris_all,diag_kind='hist', hue= 'target')
plt.show()
## Box plot of each feature, grouped by class
for col in iris_features.columns:
    sns.boxplot(x='target', y=col, saturation=0.5,
                palette='pastel', data=iris_all)
    plt.title(col)
    plt.show()
# Plot a 3D scatter of the first three features
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111, projection='3d')
iris_all_class0 = iris_all[iris_all['target']==0].values
iris_all_class1 = iris_all[iris_all['target']==1].values
iris_all_class2 = iris_all[iris_all['target']==2].values
# 'setosa'(0), 'versicolor'(1), 'virginica'(2)
ax.scatter(iris_all_class0[:,0], iris_all_class0[:,1], iris_all_class0[:,2],label='setosa')
ax.scatter(iris_all_class1[:,0], iris_all_class1[:,1], iris_all_class1[:,2],label='versicolor')
ax.scatter(iris_all_class2[:,0], iris_all_class2[:,1], iris_all_class2[:,2],label='virginica')
plt.legend()
plt.show()
## To evaluate the model properly, split the data into a training set and a test set; train on the training set and check performance on the test set.
from sklearn.model_selection import train_test_split
## Select the samples of classes 0 and 1 (excluding class 2)
iris_features_part=iris_features.iloc[:100]
iris_target_part=iris_target[:100]
## 80%/20% split: the test set is 20% of the data
x_train,x_test,y_train,y_test=train_test_split(iris_features_part,iris_target_part,test_size=0.2,random_state=2020)
## Import the logistic regression model from sklearn
from sklearn.linear_model import LogisticRegression
## Define the logistic regression model
clf=LogisticRegression(random_state=0,solver='lbfgs')
## Train the model on the training set
clf.fit(x_train,y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=0, solver='lbfgs', tol=0.0001,
verbose=0, warm_start=False)
## Weights w of the fitted model
print('the weight of Logistic Regression:',clf.coef_)
## Intercept w0 of the fitted model
print('the intercept(w0) of Logistic Regression:',clf.intercept_)
the weight of Logistic Regression: [[ 0.45244919 -0.81010583 2.14700385 0.90450733]]
the intercept(w0) of Logistic Regression: [-6.57504448]
## Use the trained model to predict on the training set and the test set
train_predict=clf.predict(x_train)
test_predict=clf.predict(x_test)
from sklearn import metrics
## Evaluate with accuracy: the fraction of predictions that are correct
print('The accuracy of the Logistic Regression is:',metrics.accuracy_score(y_train,train_predict))  # training set
print('The accuracy of the Logistic Regression is:',metrics.accuracy_score(y_test,test_predict))  # test set
## Confusion matrix (counts of each combination of true and predicted class)
confusion_matrix_result=metrics.confusion_matrix(y_test,test_predict)  # rows: true labels, columns: predicted labels
print('The confusion matrix result:\n',confusion_matrix_result)
## Visualize the result as a heat map
plt.figure(figsize=(8,6))
sns.heatmap(confusion_matrix_result,annot=True,cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
The accuracy of the Logistic Regression is: 1.0
The accuracy of the Logistic Regression is: 1.0
The confusion matrix result:
[[ 9 0]
[ 0 11]]
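## 80%/20% split on all three classes: the test set is 20% of the data
x_train,x_test,y_train,y_test=train_test_split(iris_features,iris_target,test_size=0.2,random_state=2020)
## Define the logistic regression model
clf=LogisticRegression(random_state=0,solver='lbfgs')
## Train the model on the training set
clf.fit(x_train,y_train)
## Weights w of the fitted model
print('the weight of Logistic Regression:\n',clf.coef_)
## Intercepts w0 of the fitted model
print('the intercept(w0) of Logistic Regression:\n',clf.intercept_)
## Since this is a 3-class problem, we get three sets of logistic regression parameters, one per class; combined, they implement the three-class classifier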
the weight of Logistic Regression:
[[-0.43538857 0.87888013 -2.19176678 -0.94642091]
[-0.39434234 -2.6460985 0.76204684 -1.35386989]
[-0.00806312 0.11304846 2.52974343 2.3509289 ]]
the intercept(w0) of Logistic Regression:
[ 6.30620875 8.25761672 -16.63629247]
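To show how three sets of parameters combine into one prediction, the sketch below scores a single sample with the weights printed above. It assumes the one-vs-rest scheme indicated by `multi_class='ovr'` in the model repr shown earlier: one sigmoid per class, then normalization. Newer sklearn versions use a multinomial (softmax) formulation by default with `lbfgs`, in which case both the weights and the combination rule would differ. The function name `ovr_probabilities` is made up for this example.

```python
import math

# Rows copied from the printed coef_ and intercept_ above (classes 0, 1, 2).
coef = [[-0.43538857, 0.87888013, -2.19176678, -0.94642091],
        [-0.39434234, -2.6460985, 0.76204684, -1.35386989],
        [-0.00806312, 0.11304846, 2.52974343, 2.3509289]]
intercept = [6.30620875, 8.25761672, -16.63629247]

def ovr_probabilities(x):
    """One-vs-rest: a sigmoid per class, then normalize so the row sums to 1."""
    scores = [b + sum(w_j * x_j for w_j, x_j in zip(w, x))
              for w, b in zip(coef, intercept)]
    raw = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    total = sum(raw)
    return [r / total for r in raw]

# First row of the iris features shown by .head(): a setosa sample (class 0).
probs = ovr_probabilities([5.1, 3.5, 1.4, 0.2])
predicted = max(range(3), key=lambda k: probs[k])
print(probs, predicted)
```

The predicted class is the one with the largest normalized probability, which for this sample is class 0 (setosa).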
## Use the trained model to predict on the training set and the test set
train_predict=clf.predict(x_train)
test_predict=clf.predict(x_test)
## Logistic regression is a probabilistic model (p = p(y=1|x,\theta), as introduced earlier), so we can use predict_proba to get the class probabilities
train_predict_proba=clf.predict_proba(x_train)
test_predict_proba=clf.predict_proba(x_test)
print('The test predict Probability of each class:\n',test_predict_proba)
## The first column is the predicted probability of class 0, the second of class 1, and the third of class 2.
## Evaluate with accuracy: the fraction of predictions that are correct
print('The accuracy of the Logistic Regression is:',metrics.accuracy_score(y_train,train_predict))  # training set
print('The accuracy of the Logistic Regression is:',metrics.accuracy_score(y_test,test_predict))  # test set
The test predict Probability of each class:
[[ 1.32525870e-04 2.41745142e-01 7.58122332e-01]
[ 7.02970475e-01 2.97026349e-01 3.17667822e-06]
[ 3.37367886e-02 7.25313901e-01 2.40949311e-01]
[ 5.66207138e-03 6.53245545e-01 3.41092383e-01]
[ 1.06817066e-02 6.72928600e-01 3.16389693e-01]
[ 8.98402870e-04 6.64470713e-01 3.34630884e-01]
[ 4.06382037e-04 3.86192249e-01 6.13401369e-01]
[ 1.26979439e-01 8.69440588e-01 3.57997319e-03]
[ 8.75544317e-01 1.24437252e-01 1.84312617e-05]
[ 9.11209514e-01 8.87814689e-02 9.01671605e-06]
[ 3.86067682e-04 3.06912689e-01 6.92701243e-01]
[ 6.23261939e-03 7.19220636e-01 2.74546745e-01]
[ 8.90760124e-01 1.09235653e-01 4.22292409e-06]
[ 2.32339490e-03 4.47236837e-01 5.50439768e-01]
[ 8.59945211e-04 4.22804376e-01 5.76335679e-01]
[ 9.24814068e-01 7.51814638e-02 4.46852786e-06]
[ 2.01307999e-02 9.35166320e-01 4.47028801e-02]
[ 1.71215635e-02 5.07246971e-01 4.75631465e-01]
[ 1.83964097e-04 3.17849048e-01 6.81966988e-01]
[ 5.69461042e-01 4.30536566e-01 2.39269631e-06]
[ 8.26025475e-01 1.73971556e-01 2.96936737e-06]
[ 3.05327704e-04 5.15880492e-01 4.83814180e-01]
[ 4.69978972e-03 2.90561777e-01 7.04738434e-01]
[ 8.61077168e-01 1.38915993e-01 6.83858427e-06]
[ 6.99887637e-04 2.48614010e-01 7.50686102e-01]
[ 5.33421842e-02 8.31557126e-01 1.15100690e-01]
[ 2.34973018e-02 3.54915328e-01 6.21587370e-01]
[ 1.63311193e-03 3.48301765e-01 6.50065123e-01]
[ 7.72156866e-01 2.27838662e-01 4.47157219e-06]
[ 9.30816593e-01 6.91640361e-02 1.93708074e-05]]
The accuracy of the Logistic Regression is: 0.958333333333
The accuracy of the Logistic Regression is: 0.8
## Confusion matrix
confusion_matrix_result=metrics.confusion_matrix(y_test,test_predict)  # rows: true labels, columns: predicted labels
print('The confusion matrix result:\n',confusion_matrix_result)
## Visualize the result as a heat map
plt.figure(figsize=(8,6))
sns.heatmap(confusion_matrix_result,annot=True,cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
The confusion matrix result:
[[10 0 0]
[ 0 7 3]
[ 0 3 7]]
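The confusion matrix contains more than the overall accuracy. As a rough sketch, per-class precision and recall can be read off it with plain Python; the matrix below is copied from the output above, and the variable names are chosen for this example. Because this particular matrix is symmetric, the numbers are the same whichever axis holds the true labels.

```python
# Confusion matrix copied from the output above
# (rows and columns both ordered as classes 0, 1, 2).
cm = [[10, 0, 0],
      [0, 7, 3],
      [0, 3, 7]]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(n_classes)) / total

for k in range(n_classes):
    row_sum = sum(cm[k])                               # samples on row k
    col_sum = sum(cm[r][k] for r in range(n_classes))  # samples on column k
    # With rows as true labels and columns as predictions,
    # recall = diagonal / row sum and precision = diagonal / column sum.
    recall = cm[k][k] / row_sum
    precision = cm[k][k] / col_sum
    print(f"class {k}: precision={precision:.2f} recall={recall:.2f}")

print("accuracy:", accuracy)
```

The accuracy computed this way matches the 0.8 test accuracy reported above, and the per-class numbers show that all errors come from confusing classes 1 and 2, while class 0 is separated perfectly.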