The learning_curve function in the sklearn.model_selection package

wenlish

Category: Python. Tags: sklearn, python, artificial intelligence

Article link: https://blog.youkuaiyun.com/m0_48520385/article/details/121050184

The call signature of this function is:

learning_curve(estimator, X, y, *, groups=None,
               train_sizes=np.linspace(0.1, 1.0, 5), cv=None,
               scoring=None, exploit_incremental_learning=False,
               n_jobs=None, pre_dispatch="all", verbose=0, shuffle=False,
               random_state=None, error_score=np.nan, return_times=False,
               fit_params=None)

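What the function does: it determines cross-validated training and test scores for different training set sizes. A cross-validation generator splits the whole dataset k times into a training set and a test set. Subsets of the training set with varying sizes are used to train the estimator, and for each subset size a score is computed on the training subset and on the test set. Afterwards, the scores from the k runs are averaged for each training subset size. Note that the function itself returns the per-fold scores; averaging across the k folds is left to the caller.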
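As a minimal sketch of the procedure described above, the following calls learning_curve with its default settings and averages the per-fold scores by hand. The synthetic dataset from make_classification and the choice of n_neighbors=5 are assumptions made purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Defaults: train_sizes=np.linspace(0.1, 1.0, 5) and 5-fold cross-validation.
train_sizes_abs, train_scores, test_scores = learning_curve(
    KNeighborsClassifier(n_neighbors=5), X, y
)

# Each score array has one row per training-set size and one column per fold.
print(train_scores.shape, test_scores.shape)

# Average over the folds (axis=1) to get one value per training-set size.
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)
for n, tr, te in zip(train_sizes_abs, train_mean, test_mean):
    print(f"{int(n):4d} samples: train={tr:.3f}  test={te:.3f}")
```

The averaged values are what one would normally plot against train_sizes_abs to draw the learning curve.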
Parameters
estimator : the estimator to evaluate, for example a KNN classifier
X : array-like of shape (n_samples, n_features). Training vector, where n_samples is the number of samples and n_features is the number of features.
y : array-like of shape (n_samples,) or (n_samples, n_outputs). Target relative to X for classification or regression; None for unsupervised learning.
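train_sizes : array-like of shape (n_ticks,), default=np.linspace(0.1, 1.0, 5)
Relative or absolute numbers of training examples used to generate the learning curve. If the dtype is float, each value is interpreted as a fraction of the maximum training set size (which is determined by the selected validation method), so it must lie within (0, 1]. Otherwise the values are interpreted as absolute training set sizes. Note that for classification the number of samples usually has to be large enough to contain at least one sample from each class.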
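cv : int, cross-validation generator or an iterable, optional
Determines the cross-validation splitting strategy. Possible values:
– None, to use the default 5-fold cross-validation;
– an integer, to specify the number of folds K in a (Stratified)KFold;
– a CV splitter object, for example ShuffleSplit(n_splits=10, test_size=0.2, random_state=0) from sklearn.model_selection;
– an iterable that yields (train, test) splits as arrays of indices.
For None or an integer, StratifiedKFold is used when the estimator is a classifier and y is binary or multiclass; otherwise KFold is used.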
n_jobs : int, default=None
Number of jobs to run in parallel. Training the estimator and computing the score are parallelized over the different training and test sets. None means 1 unless in a joblib.parallel_backend context; -1 means using all processors.
verbose : int, default=0. Controls the verbosity: the higher the value, the more messages are printed. The default of 0 prints no messages.
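To make the difference between relative and absolute train_sizes values concrete, here is a small sketch. It assumes 100 synthetic samples and an explicit KFold(n_splits=5), so that every training fold holds exactly 80 samples; the dataset and n_neighbors=3 are illustrative choices, not part of the original article.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, learning_curve
from sklearn.neighbors import KNeighborsClassifier

# 100 samples with 5-fold KFold: each training fold holds 80 samples.
X, y = make_classification(n_samples=100, n_features=10, random_state=0)
cv = KFold(n_splits=5)
knn = KNeighborsClassifier(n_neighbors=3)

# Floats are fractions of the maximum training set size (80 here).
sizes_from_fractions, _, _ = learning_curve(
    knn, X, y, cv=cv, train_sizes=[0.1, 0.5, 1.0]
)

# Integers are taken as absolute numbers of training samples.
sizes_from_counts, _, _ = learning_curve(
    knn, X, y, cv=cv, train_sizes=[10, 30, 60]
)

print(sizes_from_fractions)  # fractions of 80
print(sizes_from_counts)     # counts passed through unchanged
```

With these assumptions the fractions 0.1, 0.5 and 1.0 translate to 8, 40 and 80 samples, while the integer sizes are used as given.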
The remaining parameters (groups, scoring, exploit_incremental_learning, pre_dispatch, shuffle, random_state, error_score, return_times and fit_params) are not covered in detail here; interested readers can consult the official documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html#sklearn.model_selection.learning_curve
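The different forms accepted by the cv parameter can be compared with a short sketch. The data and estimator settings below are assumptions for illustration only; the number of columns in test_scores shows how many splits each form produced.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, ShuffleSplit, learning_curve
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=120, n_features=10, random_state=0)
knn = KNeighborsClassifier(n_neighbors=3)
sizes = [0.5, 1.0]

# 1) An integer: number of folds.
_, _, scores_int = learning_curve(knn, X, y, cv=3, train_sizes=sizes)

# 2) A CV splitter object such as ShuffleSplit.
splitter = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
_, _, scores_splitter = learning_curve(knn, X, y, cv=splitter, train_sizes=sizes)

# 3) An iterable of (train_indices, test_indices) pairs.
splits = list(KFold(n_splits=4).split(X))
_, _, scores_iterable = learning_curve(knn, X, y, cv=splits, train_sizes=sizes)

# Columns correspond to splits: 3, 10 and 4 respectively.
print(scores_int.shape, scores_splitter.shape, scores_iterable.shape)
```

A splitter object is usually the more convenient choice; an explicit list of index pairs is mainly useful when the splits are computed elsewhere.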
Return values
train_sizes_abs : array of shape (n_unique_ticks,). Numbers of training examples that have been used to generate the learning curve. Note that the number of ticks might be less than n_ticks because duplicate entries are removed.
train_scores : array of shape (n_ticks, n_cv_folds). Scores on the training sets.
test_scores : array of shape (n_ticks, n_cv_folds). Scores on the test set.
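The following sketch illustrates the return values, including the removal of duplicate ticks. It assumes 100 synthetic samples and KFold(n_splits=5), so the maximum training size is 80 and the fractions 0.5 and 0.505 both truncate to 40 samples; all of these choices are illustrative.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, learning_curve
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=10, random_state=0)

with warnings.catch_warnings():
    # scikit-learn warns when duplicate sizes are dropped; silence it here.
    warnings.simplefilter("ignore")
    train_sizes_abs, train_scores, test_scores = learning_curve(
        KNeighborsClassifier(n_neighbors=3), X, y,
        cv=KFold(n_splits=5),
        train_sizes=[0.5, 0.505, 1.0],  # 0.5 and 0.505 both map to 40 samples
    )

# Three ticks were requested, but only two unique sizes remain.
print(train_sizes_abs)
print(train_scores.shape, test_scores.shape)

# Mean and standard deviation over the folds for each remaining size.
test_mean = test_scores.mean(axis=1)
test_std = test_scores.std(axis=1)
for n, m, s in zip(train_sizes_abs, test_mean, test_std):
    print(f"{int(n):3d} samples: test score {m:.3f} +/- {s:.3f}")
```

In this case the score arrays have one row per unique tick, so their first dimension matches the length of train_sizes_abs rather than the length of the train_sizes argument.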
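The function can also return fit_times and score_times (only when return_times=True). They are not described here; interested readers can consult the official documentation at the URL given above.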
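For completeness, a brief sketch of the extra return values: with return_times=True the function returns five arrays instead of three. The dataset and estimator settings are again assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, learning_curve
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=10, random_state=0)

# return_times=True appends fit_times and score_times to the result tuple.
train_sizes_abs, train_scores, test_scores, fit_times, score_times = learning_curve(
    KNeighborsClassifier(n_neighbors=3), X, y,
    cv=KFold(n_splits=5),
    train_sizes=[0.5, 1.0],
    return_times=True,
)

# Timings are in seconds and have the same layout as the score arrays.
print(fit_times.shape, score_times.shape)
print(fit_times.mean(axis=1))
```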
Reference: https://www.cnblogs.com/xiaohua92/p/5525788.html