Machine Learning Algorithms (1): Prediction with Logistic Regression (Alibaba Cloud Tianchi)

Latest recommended article published on 2023-02-26 21:10:57

License: CC 4.0 BY-SA

Tags: #machine-learning #artificial-intelligence #python #algorithms

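This article introduces the concept, strengths, and weaknesses of logistic regression, and uses Python's sklearn library to walk through the whole process from building a dataset to training a model. It covers constructing the data, training the model, inspecting the parameters, visualizing the data, predicting new samples, and applying the method to the iris dataset. The focus is on binary and three-class logistic regression models, along with the difference between predict and predict_proba.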

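This article has three parts: the first covers the theory behind logistic regression, the second is a hands-on demo, and the third is a logistic regression classification exercise on the iris dataset. I reproduced the second and third parts in a Jupyter notebook after working through the training camp.
The functions used in the demo and the iris project, along with the questions that came up in practice, are annotated throughout the article, so it should be a good fit for learning together.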

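I. Theory and introduction
1. Logistic regression (LR), despite the word "regression" in its name, is a classification algorithm: its output is a discrete class label.
Logistic regression is used across many fields and is one of the most common machine learning algorithms.
2. The dependent variable in logistic regression can be binary or multiclass, but the binary case is more common and easier to interpret, so binary logistic regression is what is used most often in practice.
3. Strengths and weaknesses of the logistic regression model:
Strengths: simple to implement and easy to understand; low computational cost, fast, and light on storage.
Weaknesses: prone to underfitting, so classification accuracy may be limited.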
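
To make the "regression in name, classification in practice" point concrete, here is a minimal sketch of how a linear score is turned into a probability and then into a discrete label. The weights `w`, the bias `w0`, and the sample points are hypothetical values chosen only for illustration; they are not fitted to any data in this article.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias, chosen only for illustration
w = np.array([1.0, 1.0])
w0 = 0.0

samples = np.array([[-2.0, -1.0], [0.0, 0.0], [2.0, 1.0]])
scores = samples @ w + w0            # linear part: w0 + w1*x1 + w2*x2
probs = sigmoid(scores)              # interpreted as p(y=1 | x)
labels = (probs >= 0.5).astype(int)  # threshold at 0.5 to get a discrete class

print(probs)
print(labels)
```

The linear part is the same as in linear regression; the sigmoid and the 0.5 threshold are what make the result a class label rather than a continuous value.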

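II. Demo
1. Import the required libraries

import numpy as np     # basic Python numerics library; most machine learning algorithms rely on it for numerical computation
import matplotlib.pyplot as plt
import seaborn as sns   # seaborn wraps matplotlib so plots can be made by passing parameters directly
from sklearn.linear_model import LogisticRegression  # sklearn provides many common machine learning algorithms; here we import its logistic regression model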

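2. Train the model
(1) First construct the dataset

x = np.array([[-1,-2],[-2,-1],[-3,-2],[1,3],[2,1],[3,2]])  # 2-D array holding the samples: one row per sample, two features
y = np.array([0,0,0,1,1,1])  # class labels: 0 for the first three samples, 1 for the last three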

(2) Create the logistic regression model and fit it

lr_clf = LogisticRegression()
lr_clf = lr_clf.fit(x,y)  # fits the linear part w0 + w1*x1 + w2*x2 of the logistic model

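3. Inspect the fitted parameters

print('w=',lr_clf.coef_)
print('w0=',lr_clf.intercept_)

Note: coef_ and intercept_
coef_ holds the coefficients of the first-order terms (w1, w2), and intercept_ is the intercept (w0).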
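
As a quick check of what these two attributes mean, the sketch below recomputes the model's linear score by hand and compares it with sklearn's own `decision_function`. It assumes the six-point demo dataset with labels 0 and 1; the names `w`, `w0`, and `manual_scores` are introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Demo dataset (labels assumed to be 0 and 1)
x = np.array([[-1, -2], [-2, -1], [-3, -2], [1, 3], [2, 1], [3, 2]])
y = np.array([0, 0, 0, 1, 1, 1])
lr_clf = LogisticRegression().fit(x, y)

w = lr_clf.coef_[0]        # shape (2,): one weight per feature
w0 = lr_clf.intercept_[0]  # scalar intercept

# Linear score w0 + w1*x1 + w2*x2 computed by hand
manual_scores = x @ w + w0
print(lr_clf.coef_.shape, lr_clf.intercept_.shape)
print(np.allclose(manual_scores, lr_clf.decision_function(x)))
```

For a binary problem coef_ has a single row, which is why the code indexes it with `[0]`.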

4. Data visualization
(1) Plot the constructed sample points

plt.figure()    # create a figure
plt.scatter(x[:,0],x[:,1],c=y,s=20,cmap='viridis')
plt.title('dataset')      # set the figure title
plt.show()

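Notes:
matplotlib.pyplot.scatter(x, y, s=20, c='b', marker='o', cmap=None, norm=None, vmin=None, vmax=None, alpha=None, linewidths=None, verts=None, hold=None, **kwargs)
(This is the signature from an older Matplotlib release; newer releases no longer accept verts and hold.)
x, y: arrays of shape (n,), the coordinates of the points to plot.
s: marker size, a scalar or an array of shape (n,); optional, default 20.
c: a color or a sequence of colors; optional, default blue 'b'. c should not be a single numeric RGB or RGBA sequence, because that cannot be distinguished from an array of values to be colormapped; to give all points one RGB or RGBA color, pass a 2-D array with a single row.
marker: MarkerStyle, the marker style; optional, default 'o'.
cmap: a Colormap instance or a registered colormap name; used only when c is an array of floats. If not given, it falls back to image.cmap; optional, default None.
norm: a Normalize instance that scales the data values to the range 0-1; also used only when c is an array of floats. Default None.
vmin, vmax: scalars used to normalize the data values; ignored when a norm instance is passed. Optional, default None.
alpha: scalar between 0 and 1, the transparency; optional, default None.
linewidths: the width of the marker edge lines; default None.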

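(2) Plot the decision boundary

plt.figure()    # create a figure
plt.scatter(x[:,0],x[:,1],c=y,s=20,cmap='viridis')
plt.title('dataset')      # set the figure title

# read the axis limits before calling plt.show(), while the scatter figure is still the current one
nx, ny = 200, 100
x_min, x_max = plt.xlim()
y_min, y_max = plt.ylim()
x_grid, y_grid = np.meshgrid(np.linspace(x_min, x_max, nx),np.linspace(y_min, y_max, ny))

z_proba = lr_clf.predict_proba(np.c_[x_grid.ravel(), y_grid.ravel()])     # ravel() flattens each grid to 1-D; np.c_ stacks the two as columns, giving one (x, y) point per row
z_proba = z_proba[:, 1].reshape(x_grid.shape)     # probability of the second class, reshaped back to the grid
plt.contour(x_grid, y_grid, z_proba, [0.5], linewidths=2., colors='blue')     # the decision boundary is the p = 0.5 contour

plt.show()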

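Notes:
[1] xlim/ylim: get or set the display range of the x/y axis.
Call signature: plt.xlim(xmin, xmax); called with no arguments, as in the code above, it returns the current (xmin, xmax).
xmin: the minimum value on the x axis
xmax: the maximum value on the x axis
[2] np.meshgrid builds coordinate grid matrices from two 1-D arrays.
[3] numpy.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None)
returns evenly spaced numbers over the specified interval.
[4] predict_proba returns an array with n rows and k columns; the value in row i, column j is the predicted probability that the i-th sample belongs to the j-th label, and each row sums to 1.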
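
The shape bookkeeping behind meshgrid, ravel, np.c_, and reshape is easier to follow on a tiny grid. The sketch below uses a 4-by-3 grid and a stand-in function in place of `predict_proba` (an assumption made so the example needs no trained model).

```python
import numpy as np

nx, ny = 4, 3  # small grid so the shapes are easy to read
x_grid, y_grid = np.meshgrid(np.linspace(-1, 1, nx), np.linspace(0, 1, ny))
print(x_grid.shape)   # rows follow y, columns follow x: (ny, nx)

# One (x, y) point per row, ready to pass to a model
points = np.c_[x_grid.ravel(), y_grid.ravel()]
print(points.shape)   # (nx * ny, 2)

# Stand-in for predict_proba(points)[:, 1]: any function with one value per point
values = points[:, 0] + points[:, 1]

# Back to the grid shape that plt.contour expects for Z
z = values.reshape(x_grid.shape)
print(z.shape)
```

Because ravel and reshape use the same ordering, each entry of `z` lines up with the grid point it was computed from.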

(3) Plot the new samples to be predicted

plt.figure()
## new point 1
x_new1 = np.array([[0, -1]])
plt.scatter(x_new1[:,0],x_new1[:,1], s=20)
# the annotation text is passed positionally, because newer Matplotlib versions renamed the keyword s to text
plt.annotate('New point 1',xy=(0,-1),xytext=(-2,0),color='blue',arrowprops=dict(arrowstyle='-|>',connectionstyle='arc3',color='red'))

## new point 2
x_new2 = np.array([[1, 2]])
plt.scatter(x_new2[:,0],x_new2[:,1], s=20)
plt.annotate('New point 2',xy=(1,2),xytext=(-1.5,2.5),color='red',arrowprops=dict(arrowstyle='-|>',connectionstyle='arc3',color='red'))

## training samples
plt.scatter(x[:,0],x[:,1], c=y, s=50, cmap='viridis')
plt.title('Dataset')

# plot the decision boundary
plt.contour(x_grid, y_grid, z_proba, [0.5], linewidths=2., colors='blue')

plt.show()

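Notes:
[1] plt.annotate() adds a text annotation: plt.annotate(text, xy=(x,y), xytext=(l1,l2), ...)
xy is the coordinate of the point being annotated
xytext is the coordinate where the annotation text is placed
[2] plt.contour
plt.contour([X, Y,] Z, [levels], **kwargs)
Parameters:
X, Y: the coordinates of the values in Z.
X and Y must both be 2-D with the same shape as Z, or both be 1-D such that len(X) == M is the number of columns in Z and len(Y) == N is the number of rows in Z.
Z: the height values over which the contour is drawn.
levels: int or array-like; determines the number and positions of the contour lines/regions.
Returns:
c: QuadContourSet

5. Model prediction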

## use the trained model to predict the two new samples
y_new1_predict = lr_clf.predict(x_new1)
y_new2_predict = lr_clf.predict(x_new2)

print('The New point 1 predict class:\n',y_new1_predict)
print('The New point 2 predict class:\n',y_new2_predict)

## logistic regression is a probabilistic model (p = p(y=1|x,\theta)), so predict_proba can be used to get the predicted probabilities
y_new1_predict_proba = lr_clf.predict_proba(x_new1)
y_new2_predict_proba = lr_clf.predict_proba(x_new2)

print('The New point 1 predict Probability of each class:\n',y_new1_predict_proba)
print('The New point 2 predict Probability of each class:\n',y_new2_predict_proba)
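
The link between the linear score and the probabilities returned by predict_proba can be checked directly in the binary case. This is a sketch under the assumption that the demo dataset uses labels 0 and 1 and that the model is fitted with sklearn's default settings; `p_class1` is a name introduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.array([[-1, -2], [-2, -1], [-3, -2], [1, 3], [2, 1], [3, 2]])
y = np.array([0, 0, 0, 1, 1, 1])
lr_clf = LogisticRegression().fit(x, y)

x_new = np.array([[0, -1], [1, 2]])   # the two new points from the demo

scores = lr_clf.decision_function(x_new)   # w0 + w1*x1 + w2*x2 for each point
p_class1 = 1.0 / (1.0 + np.exp(-scores))   # sigmoid of the score
proba = lr_clf.predict_proba(x_new)        # columns follow lr_clf.classes_

print(np.allclose(proba[:, 1], p_class1))    # column 1 is the sigmoid of the score
print(np.allclose(proba.sum(axis=1), 1.0))   # each row sums to 1
print(lr_clf.predict(x_new))
```

predict then simply reports the class whose probability is larger, which in the binary case is the same as checking whether the score is positive.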

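III. Logistic regression classification on the iris dataset
1. Import the required libraries

import numpy as np  # fundamental package for scientific computing in Python
import pandas as pd  # fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool
import matplotlib.pyplot as plt  # plotting
import seaborn as sns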

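2. Load and inspect the data
(1)

from sklearn.datasets import load_iris      # the iris dataset bundled with sklearn
data = load_iris()   # load the dataset; the features are in data.data
iris_target = data.target  # labels corresponding to the samples

### New
iris_features = pd.DataFrame(data=data.data, columns=data.feature_names) # convert to a DataFrame with pandas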

Note: pd.DataFrame
A DataFrame cell can hold numbers, strings, and so on, much like an Excel sheet, and a DataFrame can have column names (columns) and row names (index).

For example:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(3, 3), index=list('abc'), columns=list('ABC'))
print(df1)

#           A         B         C
# a -0.612978  0.237191  0.312969
# b -1.281485  1.135944  0.162456
# c  2.232905  0.200209  0.028671

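The first argument is the data stored in the DataFrame, the second argument, index, is the row names mentioned above, and the third argument, columns, is the column names.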

(2)

iris_features.info()
iris_features.head()

output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
sepal length (cm)    150 non-null float64
sepal width (cm)     150 non-null float64
petal length (cm)    150 non-null float64
petal width (cm)     150 non-null float64
dtypes: float64(4)
memory usage: 4.8 KB
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

iris_features.tail()
iris_target

output:
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3
149                5.9               3.0                5.1               1.8

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

iris_features.describe()   # describe() summarizes the statistics of each feature

(4)
pd.Series().value_counts()

# New!
# use value_counts to count the number of samples in each class
pd.Series(iris_target).value_counts()

output:
2 50
1 50
0 50
dtype: int64

3. Data visualization

4. Step 4: Training and prediction for binary classification (logistic regression model)

### New
# use train_test_split from sklearn.model_selection to split the data into training and test sets
from sklearn.model_selection import train_test_split

## select the samples of classes 0 and 1 (excluding class 2)
iris_features_part = iris_features.iloc[:100]    # binary task: keep only the first 100 samples (classes 0 and 1), with all four features
iris_target_part = iris_target[:100]

# New
# the test set is 20% of the data (80%/20% split)
x_train, x_test, y_train, y_test = train_test_split(iris_features_part, iris_target_part, test_size = 0.2, random_state = 2020)  # split both features and labels 80/20
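
To see what the split actually produces, the sketch below rebuilds the 100-sample subset and checks the resulting sizes, and that fixing random_state makes the split reproducible. The variable names `features`, `target`, and `x_train2` are introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data = load_iris()
features = pd.DataFrame(data.data, columns=data.feature_names).iloc[:100]
target = data.target[:100]   # only classes 0 and 1 remain

x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=2020)
print(x_train.shape, x_test.shape)   # 80 training rows and 20 test rows, 4 features each

# The same random_state gives the same split again
x_train2, _, _, _ = train_test_split(
    features, target, test_size=0.2, random_state=2020)
print(x_train.index.equals(x_train2.index))
```

Slicing the first 100 rows works here because the iris samples are stored in class order, 50 per class.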

from sklearn.linear_model import LogisticRegression    # note the capital L and R
clf = LogisticRegression()
clf.fit(x_train,y_train)

output:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

print('w',clf.coef_)          # feature weights, labeled as in the demo above
print('w0',clf.intercept_)    # intercept

output:
w [[-0.3498687 -1.40196797 2.07800144 0.94498658]]
w0 [-0.24430856]

train_predict = clf.predict(x_train)
test_predict = clf.predict(x_test)

New:

from sklearn import metrics
## evaluate the model with accuracy (the fraction of predictions that are correct)
print('The accuracy of the Logistic Regression is:',metrics.accuracy_score(y_train,train_predict))   # training set
print('The accuracy of the Logistic Regression is:',metrics.accuracy_score(y_test,test_predict))     # test set

## confusion matrix (counts of each combination of true and predicted labels)
confusion_matrix_result = metrics.confusion_matrix(y_test,test_predict)   # sklearn expects (y_true, y_pred): rows are true labels, columns are predictions
print('The confusion matrix result:\n',confusion_matrix_result)

output:
The accuracy of the Logistic Regression is: 1.0
The accuracy of the Logistic Regression is: 1.0
The confusion matrix result:
[[ 9  0]
 [ 0 11]]
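
Since the result above is perfect, it does not show how accuracy and the confusion matrix relate when there are mistakes. The sketch below uses a small set of hypothetical labels (not taken from the iris run) with two errors.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted labels, for illustration only
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])

acc = metrics.accuracy_score(y_true, y_pred)
manual_acc = np.mean(y_true == y_pred)   # fraction of matching positions

# Rows are true labels, columns are predicted labels
cm = metrics.confusion_matrix(y_true, y_pred)
print(acc, manual_acc)
print(cm)

# Correct predictions sit on the diagonal, so accuracy = trace / total
print(np.trace(cm) / cm.sum())
```

Off-diagonal entries count the errors: row 0, column 1 is a class-0 sample predicted as class 1, and row 1, column 0 is the reverse.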

5. Step 5: Classification with three classes
(1) Split into training and test sets

from sklearn.model_selection import train_test_split

# the test set is 20% of the data (80%/20% split)
x_train, x_test, y_train, y_test = train_test_split(iris_features, iris_target, test_size = 0.2, random_state = 2020)  # split both features and labels 80/20

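(2) Train the logistic regression model

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(x_train,y_train)

Because this is a three-class problem, we obtain three sets of logistic regression parameters here (one row of coef_ and one entry of intercept_ per class), and combining the three gives the three-class classifier. With the one-vs-rest setting shown in the earlier output (multi_class='ovr'), each row is a separate binary logistic regression; newer sklearn versions default to a single multinomial model, which still stores one row of parameters per class.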

print('w',clf.coef_)          # feature weights, one row per class
print('w0',clf.intercept_)    # intercepts, one per class

output:
w [[ 0.34073704 1.43117941 -2.10890793 -0.98898823]
[ 0.50070416 -1.77865627 0.57929301 -1.54058129]
[-1.58892803 -1.1412023 2.12145168 2.49728038]]
w0 [ 0.23196362 1.28831332 -1.22005974]
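
The "one set of parameters per class" structure can be verified from the shapes of the fitted attributes. The sketch below fits on the full iris dataset rather than the 80/20 split, and sets max_iter=1000 to avoid convergence warnings; both choices are assumptions made to keep the example short, so the numbers will differ from the output above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)   # max_iter raised as an assumption

print(clf.coef_.shape)        # one row per class, one column per feature
print(clf.intercept_.shape)   # one intercept per class

# Each class gets its own linear score for every sample
scores = X @ clf.coef_.T + clf.intercept_
print(scores.shape)
print(np.allclose(scores, clf.decision_function(X)))
```

The predicted class for a sample is the one whose score is highest in that sample's row.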

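(3) Key point: predict vs. predict_proba
Difference from the binary case:

# difference from the binary case
train_predict = clf.predict(x_train)
test_predict = clf.predict(x_test)

train_predict_proba = clf.predict_proba(x_train)
test_predict_proba = clf.predict_proba(x_test)

# Key point!
print(test_predict)   # returns labels: the predicted class of each test sample
print(test_predict_proba)  # returns probabilities: one row per sample, holding the probabilities of the 3 classes; each row sums to 1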

output:
[2 0 1 1 1 2 2 1 0 0 2 2 0 2 2 0 1 1 2 0 0 2 2 0 2 1 2 1 0 0]
[[ 3.74241320e-04 2.44230158e-01 7.55395600e-01]
[ 7.97564600e-01 2.02164769e-01 2.70630857e-04]
[ 4.35253921e-02 5.48952293e-01 4.07522315e-01]
[ 1.22743283e-02 6.72460074e-01 3.15265598e-01]
[ 1.52441289e-02 5.25066151e-01 4.59689720e-01]
[ 9.19066428e-04 4.91385182e-01 5.07695752e-01]
[ 8.87295883e-04 3.58230937e-01 6.40881767e-01]
[ 9.19390677e-02 7.71310908e-01 1.36750024e-01]
[ 8.68941663e-01 1.30810922e-01 2.47414874e-04]
[ 8.72269265e-01 1.27663175e-01 6.75604667e-05]
[ 1.11270432e-03 3.45807870e-01 6.53079426e-01]
[ 2.79007402e-03 2.72235984e-01 7.24973942e-01]
[ 8.58581052e-01 1.41374521e-01 4.44274209e-05]
[ 4.01178389e-03 3.72260284e-01 6.23727932e-01]
[ 1.13293692e-03 2.87714289e-01 7.11152774e-01]
[ 9.00538322e-01 9.94109506e-02 5.07270775e-05]
[ 1.48929099e-02 6.44540727e-01 3.40566363e-01]
[ 4.02704045e-02 6.33957862e-01 3.25771734e-01]
[ 4.36200437e-04 2.92038638e-01 7.07525161e-01]
[ 6.69237871e-01 3.30337965e-01 4.24164182e-04]
[ 8.52827223e-01 1.47072874e-01 9.99030838e-05]
[ 5.13048221e-04 4.58004707e-01 5.41482245e-01]
[ 7.68482919e-03 2.38009694e-01 7.54305476e-01]
[ 8.64909820e-01 1.34958747e-01 1.31433680e-04]
[ 1.93857946e-03 2.85359825e-01 7.12701596e-01]
[ 5.95765146e-02 5.98693618e-01 3.41729868e-01]
[ 4.41058308e-02 3.80460459e-01 5.75433710e-01]
[ 5.56207212e-03 5.00435713e-01 4.94002215e-01]
[ 8.15112823e-01 1.84721086e-01 1.66090178e-04]
[ 9.21462358e-01 7.82887728e-02 2.48869077e-04]]

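predict/predict_proba:
# In sklearn, predict(X) generally returns the predicted labels, one per sample, while predict_proba(X) returns probabilities, with one column per label.
# predict: returns a 1-D array of length n whose i-th value is the predicted label of the i-th sample.
# predict_proba: returns an array with n rows and k columns whose entry in row i, column j is the predicted probability that the i-th sample has label j; each row sums to 1.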
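
The relationship between the two methods can be stated in one line: the label from predict is the class with the highest probability in the corresponding row of predict_proba. A minimal sketch, again fitting on the full iris dataset with max_iter=1000 as assumptions for brevity:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)   # max_iter raised as an assumption

labels = clf.predict(X)        # shape (n,): one label per sample
proba = clf.predict_proba(X)   # shape (n, k): one column per class

# Pick the most probable column in each row, then map it to a class label
from_proba = clf.classes_[np.argmax(proba, axis=1)]

print(labels.shape, proba.shape)
print(np.array_equal(labels, from_proba))
print(np.allclose(proba.sum(axis=1), 1.0))
```

Going through `clf.classes_` matters when the labels are not 0, 1, 2, ..., because the column index is then not the label itself.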

(4)

from sklearn import metrics

print('The test predict Probability of each class:\n',test_predict_proba)
## the first column is the predicted probability of class 0, the second of class 1, and the third of class 2

## evaluate the model with accuracy (the fraction of predictions that are correct)
print('The accuracy of the Logistic Regression is:',metrics.accuracy_score(y_train,train_predict))   # training set
print('The accuracy of the Logistic Regression is:',metrics.accuracy_score(y_test,test_predict))     # test set

The test predict Probability of each class:
[[ 3.74241320e-04 2.44230158e-01 7.55395600e-01]
[ 7.97564600e-01 2.02164769e-01 2.70630857e-04]
[ 4.35253921e-02 5.48952293e-01 4.07522315e-01]
[ 1.22743283e-02 6.72460074e-01 3.15265598e-01]
[ 1.52441289e-02 5.25066151e-01 4.59689720e-01]
[ 9.19066428e-04 4.91385182e-01 5.07695752e-01]
[ 8.87295883e-04 3.58230937e-01 6.40881767e-01]
[ 9.19390677e-02 7.71310908e-01 1.36750024e-01]
[ 8.68941663e-01 1.30810922e-01 2.47414874e-04]
[ 8.72269265e-01 1.27663175e-01 6.75604667e-05]
[ 1.11270432e-03 3.45807870e-01 6.53079426e-01]
[ 2.79007402e-03 2.72235984e-01 7.24973942e-01]
[ 8.58581052e-01 1.41374521e-01 4.44274209e-05]
[ 4.01178389e-03 3.72260284e-01 6.23727932e-01]
[ 1.13293692e-03 2.87714289e-01 7.11152774e-01]
[ 9.00538322e-01 9.94109506e-02 5.07270775e-05]
[ 1.48929099e-02 6.44540727e-01 3.40566363e-01]
[ 4.02704045e-02 6.33957862e-01 3.25771734e-01]
[ 4.36200437e-04 2.92038638e-01 7.07525161e-01]
[ 6.69237871e-01 3.30337965e-01 4.24164182e-04]
[ 8.52827223e-01 1.47072874e-01 9.99030838e-05]
[ 5.13048221e-04 4.58004707e-01 5.41482245e-01]
[ 7.68482919e-03 2.38009694e-01 7.54305476e-01]
[ 8.64909820e-01 1.34958747e-01 1.31433680e-04]
[ 1.93857946e-03 2.85359825e-01 7.12701596e-01]
[ 5.95765146e-02 5.98693618e-01 3.41729868e-01]
[ 4.41058308e-02 3.80460459e-01 5.75433710e-01]
[ 5.56207212e-03 5.00435713e-01 4.94002215e-01]
[ 8.15112823e-01 1.84721086e-01 1.66090178e-04]
[ 9.21462358e-01 7.82887728e-02 2.48869077e-04]]
The accuracy of the Logistic Regression is: 0.95
The accuracy of the Logistic Regression is: 0.933333333333