Model Evaluation: Steps, scikit-learn Functions, and Examples
1. Data Splitting (Train-Test Split)
- Function: train_test_split
- When to use: split the data into a training set and a test set, so that performance is measured on data the model has never seen rather than on data it may have overfit.
- Purpose: ensures the model's performance is validated on unseen data.
- Example:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
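For classification data, the split can additionally preserve class proportions; a small sketch using the `stratify` argument (not part of the original snippet, shown here on the iris dataset):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class ratio identical in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), len(X_test))  # 150 samples -> 120 train / 30 test
```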
2. Model Training and Prediction
- Functions: fit, predict (estimator methods)
- When to use: fit the model on the training set, then generate predictions on the test set.
- Purpose: produces the predictions (y_pred) that the evaluation metrics below compare against the true labels.
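A minimal sketch of this step, consistent with the full iris example later in the document (fit learns parameters from the training data, predict labels the test data):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit() estimates the model parameters from the training set
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# predict() applies the fitted model to unseen test data
y_pred = model.predict(X_test)
print(y_pred.shape)  # one prediction per test sample
```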
3. Computing Evaluation Metrics
- Functions: accuracy_score, classification_report, confusion_matrix
- When to use: quantify model performance and break classification results down into detailed metrics such as precision and recall.
- Purpose: gives a complete picture of the model's accuracy and of potential weaknesses such as per-class bias.
- Example:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
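As a quick consistency check, accuracy can also be recovered from the confusion matrix: the diagonal counts the correctly classified samples. A small sketch with made-up labels (not from the original example):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# hypothetical true and predicted labels, for illustration only
y_test = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class
acc_from_cm = np.trace(cm) / cm.sum()  # correct predictions / all predictions

assert acc_from_cm == accuracy_score(y_test, y_pred)
print(cm)
```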
4. Hyperparameter Tuning with Cross-Validation
- Function: GridSearchCV
- When to use: search for the best hyperparameter combination instead of relying on manual trial and error.
- Purpose: improves generalization and reduces the risk of overfitting.
- Example:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2']}
# liblinear supports both the l1 and l2 penalties; the default lbfgs solver rejects l1
grid_search = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=5)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
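Beyond best_estimator_, the fitted search object records every combination it tried in cv_results_; a sketch of inspecting it (run here on the full iris data, with liblinear so that both penalties in the grid are valid):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2']}

# liblinear handles both the l1 and l2 penalties in the grid
grid_search = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=5)
grid_search.fit(X, y)

# cv_results_ holds the mean cross-validated score for every parameter combination
for params, score in zip(grid_search.cv_results_['params'],
                         grid_search.cv_results_['mean_test_score']):
    print(params, round(score, 3))
print("best:", grid_search.best_params_)
```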
5. Cross-Validation
- Function: cross_val_score
- When to use: assess how stable the model's performance is across different subsets of the data.
- Purpose: reduces the influence of a single random train-test split on the results.
- Example:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print("Cross-Validation Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
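cross_val_score returns one value of a single metric per fold; the related cross_validate function can report several metrics at once, which helps when accuracy alone is not enough. A sketch on the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# cross_validate accepts a list of scoring names and also reports timing
results = cross_validate(model, X, y, cv=5, scoring=['accuracy', 'f1_macro'])
print(sorted(results.keys()))  # fit_time, score_time, test_accuracy, test_f1_macro
print(results['test_accuracy'].mean().round(3))
```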
Complete Evaluation Example (Iris Dataset)
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
iris = datasets.load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
param_grid = {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2']}
# liblinear supports both penalties; the default lbfgs solver would fail on l1
grid_search = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Score:", grid_search.best_score_)
cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print("Overall Cross-Validation Accuracy:", np.mean(cv_scores))
Example Output
Accuracy: 0.9666666666666667
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         9
           1       1.00      0.93      0.96        15
           2       0.92      1.00      0.96        12

    accuracy                           0.97        36
   macro avg       0.97      0.98      0.97        36
weighted avg       0.97      0.97      0.97        36
Best Parameters: {'C': 1, 'penalty': 'l2'}
Best Cross-Validation Score: 0.9666666666666666
Overall Cross-Validation Accuracy: 0.9533333333333334
Key Takeaways
- Data splitting: keeps evaluation honest by measuring performance on data the model did not see during training.
- Evaluation metrics: combine accuracy, the classification report, and the confusion matrix for a complete picture of model behavior.
- Tuning and cross-validation: optimize hyperparameters with grid search and cross-validation to ensure the model generalizes.
- Complete workflow: from data splitting to final evaluation, the steps form a closed validation loop.
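The closed-loop workflow above can also be expressed as a single scikit-learn Pipeline, so that any preprocessing (here an assumed StandardScaler step, not part of the original example) is re-fit inside each cross-validation fold instead of leaking information from the held-out data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# scaling is fitted inside each CV fold, avoiding leakage from validation folds
pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=200))])

# the 'clf__' prefix routes the parameter to the pipeline step named 'clf'
param_grid = {'clf__C': [0.1, 1, 10]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
print("Test accuracy:", search.score(X_test, y_test))
```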