Hands-On Data Analysis, Chapter 3: Model Building and Evaluation (Task05)

Article link: https://blog.youkuaiyun.com/m0_61762695/article/details/128759646

I. Modeling

1.1 Import the required packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image

Make plots render Chinese text correctly

plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese labels
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly
plt.rcParams['figure.figsize'] = (10, 6)  # default figure size

1.2 Load and inspect the data

1.2.1 Load the training data

train = pd.read_csv('train.csv')
train.shape
train.head()

1.2.2 Load the cleaned data

data = pd.read_csv('clear_data.csv')
print(data)

1.3 Model building (sklearn)

1.3.1 Split the dataset

from sklearn.model_selection import train_test_split
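# Take the features X and the target y
X = data
y = train['survived']

# Split the dataset, keeping the class ratio of y the same in both parts
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
# Check the shapes
X_train.shape    # (668, 11)
X_test.shape     # (223, 11)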
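
To make the effect of `stratify=y` concrete, here is a minimal sketch on synthetic labels (the arrays `X_demo` and `y_demo` are made up for illustration and are not the Titanic data). It assumes the default 25% test split, which is also what turns 891 rows into the 668/223 split above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 negatives and 40 positives.
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.arange(100).reshape(-1, 1)

# With neither test_size nor train_size given, 25% of the rows go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

print(X_tr.shape, X_te.shape)    # row counts of the two parts
print(y_tr.mean(), y_te.mean())  # share of positives in each part
```

Because the split is stratified, the share of positives is the same in both parts, so the test score is not distorted by a different class balance.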

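1.3.2 Creating and using models

# Import the models
from sklearn.linear_model import LogisticRegression   # logistic regression
from sklearn.ensemble import RandomForestClassifier   # random forest

# Logistic regression with default parameters
lr = LogisticRegression()
lr.fit(X_train, y_train)    # fit on the training set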


# Accuracy on the training set and the test set
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(lr.score(X_test, y_test)))



# Logistic regression with an adjusted parameter (a larger C means weaker regularization)
lr2 = LogisticRegression(C=100)
lr2.fit(X_train, y_train)
print("Training set score: {:.2f}".format(lr2.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(lr2.score(X_test, y_test)))

# Random forest classifier with default parameters
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

print("Training set score: {:.2f}".format(rfc.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(rfc.score(X_test, y_test)))

# Random forest classifier with specified parameters
rfc2 = RandomForestClassifier(n_estimators=100, max_depth=5)
rfc2.fit(X_train, y_train)

print("Training set score: {:.2f}".format(rfc2.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(rfc2.score(X_test, y_test)))

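[Question] Why can a linear model be used for classification, and what is the mathematical relationship behind it?

Answer: The linear model produces a score z = w·x + b. Passing that score through a nonlinear function (in logistic regression, the sigmoid σ(z) = 1 / (1 + e^(-z))) maps it into (0, 1), so it can be read as an approximate probability of the positive class. Thresholding that probability, by default at 0.5, assigns the sample to the expected label.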
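
The relationship can be checked numerically. The sketch below fits a binary `LogisticRegression` on synthetic data (`make_classification` stands in for the Titanic features, which is an assumption made only for illustration) and rebuilds the probabilities by hand from `coef_` and `intercept_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def sigmoid(z):
    """Map a real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Synthetic binary problem (illustrative stand-in for the real data).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

# Linear part: z = w . x + b
z = X_demo @ clf.coef_.ravel() + clf.intercept_[0]

# Nonlinear part: probability of class 1, then a 0.5 threshold.
p_manual = sigmoid(z)
labels_manual = (p_manual > 0.5).astype(int)

# Compare with what scikit-learn reports.
p_sklearn = clf.predict_proba(X_demo)[:, 1]
print(np.allclose(p_manual, p_sklearn))
print(np.array_equal(labels_manual, clf.predict(X_demo)))
```

A probability above 0.5 is the same condition as z > 0, which is why the decision boundary of logistic regression is still a straight line (a hyperplane) in feature space.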

1.3.3 Output the model predictions

# Predict labels for the training data
pred = lr.predict(X_train)
pred[:10]                # array([0, 1, 1, 1, 0, 0, 1, 0, 1, 1])

# Predict class probabilities
pred_proba = lr.predict_proba(X_train)
pred_proba[:5]
# array([[0.60870022, 0.39129978],
#        [0.17725433, 0.82274567],
#        [0.40750365, 0.59249635],
#        [0.18925851, 0.81074149],
#        [0.87973912, 0.12026088]])
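
To see how the two outputs relate, the sketch below copies the five probability rows printed above into a plain array and derives labels from them. It assumes the columns follow `lr.classes_`, taken here to be `[0, 1]` (not survived, survived).

```python
import numpy as np

# The first five rows of pred_proba, copied from the output above.
pred_proba_head = np.array([
    [0.60870022, 0.39129978],
    [0.17725433, 0.82274567],
    [0.40750365, 0.59249635],
    [0.18925851, 0.81074149],
    [0.87973912, 0.12026088],
])

# Each row is a distribution over the two classes, so it sums to 1.
row_sums = pred_proba_head.sum(axis=1)

# The predicted label is the column with the larger probability.
labels = pred_proba_head.argmax(axis=1)

print(row_sums)
print(labels)  # [0 1 1 1 0]
```

The derived labels match the first five entries of `pred[:10]` shown above.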

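II. Evaluation

Model evaluation: measure how well the model generalizes, so that its generalization ability can be improved.

Cross-validation: split the data (training set, test set) several times and score the model on each split; a common scheme is k-fold cross-validation.

2.1 Cross-validation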

from sklearn.model_selection import cross_val_score

lr = LogisticRegression(C=100)
scores = cross_val_score(lr, X_train, y_train, cv=10)   # 10-fold cross-validation
scores
# array([0.85074627, 0.74626866, 0.74626866, 0.80597015, 0.88059701,
#        0.8358209 , 0.76119403, 0.8358209 , 0.74242424, 0.75757576])

# Mean cross-validation score: 0.80
print("Average cross-validation score: {:.2f}".format(scores.mean()))
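
What `cross_val_score` does can be sketched by hand. For a classifier and an integer `cv`, scikit-learn uses stratified folds without shuffling, so the loop below should reproduce its scores. The data is synthetic and `max_iter=1000` is an assumption added only to keep the solver from stopping early.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary problem (illustrative stand-in for X_train, y_train).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=10).split(X_demo, y_demo):
    # Fit a fresh model on 9 folds, score it on the held-out fold.
    fold_model = LogisticRegression(C=100, max_iter=1000)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

# The one-line equivalent.
auto_scores = cross_val_score(
    LogisticRegression(C=100, max_iter=1000), X_demo, y_demo, cv=10
)

print(np.allclose(manual_scores, auto_scores))
print("mean accuracy over 10 folds: {:.2f}".format(manual_scores.mean()))
```

Every sample is used for validation exactly once, which is why the mean of the fold scores is a steadier estimate than a single train/test split.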

2.2 Confusion matrix

# Import the confusion matrix function
from sklearn.metrics import confusion_matrix

# Train the model
lr = LogisticRegression(C=100)
lr.fit(X_train, y_train)

# Model predictions
pred = lr.predict(X_train)

# Confusion matrix (rows: true labels, columns: predicted labels)
confusion_matrix(y_train, pred)

# array([[355,  57],
#        [ 83, 173]], dtype=int64)


# Precision, recall and f1-score
from sklearn.metrics import classification_report
print(classification_report(y_train, pred))
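
The numbers in the classification report follow directly from the confusion matrix. As a worked example, the sketch below takes the matrix printed above and computes the metrics for the positive class (survived = 1) by hand.

```python
import numpy as np

# Confusion matrix from above: rows are true labels, columns are predictions.
cm = np.array([[355, 57],
               [83, 173]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all predictions that are correct
precision = tp / (tp + fp)           # of those predicted positive, how many are positive
recall = tp / (tp + fn)              # of the true positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.790 precision=0.752 recall=0.676 f1=0.712
```

Recall is noticeably lower than precision here: the model misses 83 of the 256 true positives, which plain accuracy does not show.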

2.3 ROC curve

from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, lr.decision_function(X_test))
plt.plot(fpr, tpr, label="ROC Curve")
plt.xlabel("FPR")
plt.ylabel("TPR (recall)")
# Find the threshold closest to 0
close_zero = np.argmin(np.abs(thresholds))
plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10, label="threshold zero", fillstyle="none", c='k', mew=2)
plt.legend(loc=4)

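decision_function(): returns a NumPy array with one signed score per sample in X_test. The sign tells which side of the separating hyperplane the sample falls on (positive for the positive class, negative for the negative class), and the magnitude tells how far from the hyperplane it lies, which reflects how confident the classifier is in that prediction.

roc_curve(): returns the increasing false positive rates fpr, the increasing true positive rates tpr, and thresholds, the decreasing thresholds on the decision function that were used to compute fpr and tpr. `thresholds[0]` represents no instances being predicted positive; it is an artificial value placed above every score (np.inf in recent scikit-learn releases).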
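
These return values are easier to read on a tiny example. The sketch below uses four hand-made scores in place of `lr.decision_function(X_test)` (the labels and scores are invented for illustration) and also computes the area under the curve with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and signed decision scores for four samples.
y_true = np.array([0, 0, 1, 1])
scores = np.array([-2.0, 0.5, -0.5, 3.0])

fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

print("fpr:", fpr)                # non-decreasing, from 0 to 1
print("tpr:", tpr)                # non-decreasing, from 0 to 1
print("thresholds:", thresholds)  # decreasing; the first one lies above every score
print("AUC:", auc)                # 0.75: 3 of the 4 (positive, negative) pairs are ranked correctly
```

Each point on the curve corresponds to one threshold: samples scoring at or above it are predicted positive, so lowering the threshold can only raise both fpr and tpr.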