### Plotting a Learning Curve for a Machine Learning Model
Plotting a learning curve helps you understand how a model performs and whether it is overfitting or underfitting. A healthy learning curve shows only a small gap between the training and validation scores, with both curves leveling off as more data is added[^2].
Python's `scikit-learn` library provides a convenient way to build such a plot. Here is a concrete example:
#### Creating a Learning Curve with Scikit-Learn
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters:
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.
    title : string
        Title for the chart.
    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.
    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.
    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y values plotted.
    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
    n_jobs : integer, optional
        Number of jobs to run in parallel.
    train_sizes : array-like, shape (n_ticks,), dtype float or int
        Relative or absolute numbers of training examples that will be used
        to generate the learning curve. If the dtype is float, it is regarded
        as a fraction of the maximum size of the training set (that is
        determined by the selected validation method), i.e. it has to be
        within (0, 1]. Otherwise it is interpreted as absolute sizes of the
        training sets.
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    # shuffle=True keeps the smallest training subsets from containing only a
    # single class (the iris samples are ordered by class)
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes,
        shuffle=True, random_state=0)
    # Mean and standard deviation of the scores across the CV folds
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    # Shade one standard deviation around each mean score
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
# Load the data and prepare the plot
data = load_iris()
X, y = data.data, data.target
title = "Learning Curves (Logistic Regression)"
estimator = LogisticRegression(max_iter=200)
plot_learning_curve(estimator, title, X, y, cv=5)
plt.show()
```
This snippet defines a `plot_learning_curve()` function that generates the learning curve for a given estimator (here, logistic regression). Input parameters such as `train_sizes` control which subset sizes are used for training and evaluation. The resulting figure shows how two scores evolve as the number of training samples grows: the training score and the cross-validation score. Ideally, the two curves end up close together and flatten out as more data is added.
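To turn the visual rule of thumb from the top of this section (small gap, both curves leveling off) into a quick numeric check, you can inspect the scores returned by `learning_curve` directly. The sketch below does that; the 0.1 gap threshold and 0.7 score threshold are illustrative assumptions for this example, not values prescribed by scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_iris(return_X_y=True)
estimator = LogisticRegression(max_iter=200)

# Same call the plotting function above makes internally
sizes, train_scores, test_scores = learning_curve(
    estimator, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5),
    shuffle=True, random_state=0)

# Mean scores over the CV folds at the largest training-set size
final_train = train_scores[-1].mean()
final_cv = test_scores[-1].mean()
gap = final_train - final_cv

# The thresholds below are illustrative assumptions, not scikit-learn defaults
if gap > 0.1:
    print(f"gap={gap:.2f}: large train/CV gap, likely overfitting")
elif final_cv < 0.7:
    print(f"CV score={final_cv:.2f}: both scores low, likely underfitting")
else:
    print(f"gap={gap:.2f}, CV score={final_cv:.2f}: curves look healthy")
```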
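If you are on a recent scikit-learn release (roughly 1.2 or later), the built-in `LearningCurveDisplay` helper produces a comparable plot with far less code. The following is a minimal sketch of that alternative, reusing the same iris data and estimator; check your installed version before relying on it.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LearningCurveDisplay

X, y = load_iris(return_X_y=True)
estimator = LogisticRegression(max_iter=200)

# Draws the training and cross-validation curves in a single call
display = LearningCurveDisplay.from_estimator(
    estimator, X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
    score_type="both",             # plot both the training and CV scores
    shuffle=True, random_state=0)  # same shuffling caveat as above
display.ax_.set_title("Learning Curves (Logistic Regression)")
plt.show()
```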