训练集(train set), 验证集(validation set), 测试集(test set)

转载博文地址:原博文地址

在有监督(supervise)的机器学习中,数据集常被分成2~3个,即:训练集(train set) 验证集(validation set) 测试集(test set)。

http://blog.sina.com.cn/s/blog_4d2f6cf201000cjx.html

一般需要将样本分成独立的三部分训练集(train set),验证集(validation set)和测试集(test set)。其中训练集用来估计模型,验证集用来确定网络结构或者控制模型复杂程度的参数,而测试集则检验最终选择最优的模型的性能如何。一个典型的划分是训练集占总样本的50%,而其它各占25%,三部分都是从样本中随机抽取。
样本少的时候,上面的划分就不合适了。常用的是留少部分做测试集。然后对其余N个样本采用K折交叉验证法。就是将样本打乱,然后均匀分成K份,轮流选择其中K-1份训练,剩余的一份做验证,计算预测误差平方和,最后把K次的预测误差平方和再做平均作为选择最优模型结构的依据。特别的K取N,就是留一法(leave one out)。

http://www.cppblog.com/guijie/archive/2008/07/29/57407.html

这三个名词在机器学习领域的文章中极其常见,但很多人对他们的概念并不是特别清楚,尤其是后两个经常被人混用。Ripley, B.D(1996)在他的经典专著Pattern Recognition and Neural Networks中给出了这三个词的定义。
Training set: A set of examples used for learning, which is to fit the parameters [i.e., weights] of the classifier. 
Validation set: A set of examples used to tune the parameters [i.e., architecture, not weights] of a classifier, for example to choose the number of hidden units in a neural network. 
Test set: A set of examples used only to assess the performance [generalization] of a fully specified classifier.
显然,training set是用来训练模型或确定模型参数的,如ANN中权值等; validation set是用来做模型选择(model selection),即做模型的最终优化及确定的,如ANN的结构;而 test set则纯粹是为了测试已经训练好的模型的推广能力。当然,test set这并不能保证模型的正确性,他只是说相似的数据用此模型会得出相似的结果。但实际应用中,一般只将数据集分成两类,即training set 和test set,大多数文章并不涉及validation set。
Ripley还谈到了Why separate test and validation sets?
1. The error rate estimate of the final model on validation data will be biased (smaller than the true error rate) since the validation set is used to select the final model.
2. After assessing the final model with the test set, YOU MUST NOT tune the model any further.

http://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set

Step 1) Training: Each type of algorithm has its own parameter options (the number of layers in a Neural Network, the number of trees in a Random Forest, etc). For each of your algorithms, you must pick one option. That’s why you have a validation set.

Step 2) Validating: You now have a collection of algorithms. You must pick one algorithm. That’s why you have a test set. Most people pick the algorithm that performs best on the validation set (and that's ok). But, if you do not measure your top-performing algorithm’s error rate on the test set, and just go with its error rate on the validation set, then you have blindly mistaken the “best possible scenario” for the “most likely scenario.” That's a recipe for disaster.

Step 3) Testing: I suppose that if your algorithms did not have any parameters then you would not need a third step. In that case, your validation step would be your test step. Perhaps Matlab does not ask you for parameters or you have chosen not to use them and that is the source of your confusion.

My Idea is that those option in neural network toolbox is for avoiding overfitting. In this situation the weights are specified for the training data only and don't show the global trend. By having a validation set, the iterations are adaptable to where decreases in the training data error cause decreases in validation data and increases in validation data error; along with decreases in training data error, this demonstrates the overfitting phenomenon.

http://blog.sciencenet.cn/blog-397960-666113.html

http://stackoverflow.com/questions/2976452/whats-is-the-difference-between-train-validation-and-test-set-in-neural-networ

for each epoch
for each training data instance
propagate error through the network
adjust the weights
calculate the accuracy over training data
for each validation data instance
calculate the accuracy over the validation data
if the threshold validation accuracy is met
exit training
else
continue training

Once you're finished training, then you run against your testing set and verify that the accuracy is sufficient.

Training Set: this data set is used to adjust the weights on the neural network.

Validation Set: this data set is used to minimize overfitting. You're not adjusting the weights of the network with this data set, you're just verifying that any increase in accuracy over the training data set actually yields an increase in accuracy over a data set that has not been shown to the network before, or at least the network hasn't trained on it (i.e. validation data set). If the accuracy over the training data set increases, but the accuracy over then validation data set stays the same or decreases, then you're overfitting your neural network and you should stop training.

Testing Set: this data set is used only for testing the final solution in order to confirm the actual predictive power of the network.

Validating set is used in the process of training. Testing set is not. The Testing set allows

1)to see if the training set was enough and 
2)whether the validation set did the job of preventing overfitting. If you use the testing set in the process of training then it will be just another validation set and it won't show what happens when new data is feeded in the network.

 

 

Training set: A set of examples used for learning, that is to fit the parameters [i.e., weights] of the classifier.

Validation set: A set of examples used to tune the parameters [i.e., architecture, not weights] of a classifier, for example to choose the number of hidden units in a neural network.

Test set: A set of examples used only to assess the performance [generalization] of a fully specified classifier.

The error surface will be different for different sets of data from your data set (batch learning). Therefore if you find a very good local minima for your test set data, that may not be a very good point, and may be a very bad point in the surface generated by some other set of data for the same problem. Therefore you need to compute such a model which not only finds a good weight configuration for the training set but also should be able to predict new data (which is not in the training set) with good error. In other words the network should be able to generalize the examples so that it learns the data and does not simply remembers or loads the training set by overfitting the training data.

The validation data set is a set of data for the function you want to learn, which you are not directly using to train the network. You are training the network with a set of data which you call the training data set. If you are using gradient based algorithm to train the network then the error surface and the gradient at some point will completely depend on the training data set thus the training data set is being directly used to adjust the weights. To make sure you don't overfit the network you need to input the validation dataset to the network and check if the error is within some range. Because the validation set is not being using directly to adjust the weights of the netowork, therefore a good error for the validation and also the test set indicates that the network predicts well for the train set examples, also it is expected to perform well when new example are presented to the network which was not used in the training process.

Early stopping is a way to stop training. There are different variations available, the main outline is, both the train and the validation set errors are monitored, the train error decreases at each iteration (backprop and brothers) and at first the validation error decreases. The training is stopped at the moment the validation error starts to rise. The weight configuration at this point indicates a model, which predicts the training data well, as well as the data which is not seen by the network . But because the validation data actually affects the weight configuration indirectly to select the weight configuration. This is where the Test set comes in. This set of data is never used in the training process. Once a model is selected based on the validation set, the test set data is applied on the network model and the error for this set is found. This error is a representative of the error which we can expect from absolutely new data for the same problem.


在机器学习中,数据集通常被划分为三个部分:训练集train set)、验证集validation set)和测试集test set)。每一部分在模型的开发和评估过程中扮演着不同的角色。 ### 训练集的作用 训练集是用于训练模型的数据集。具体来说,模型通过学习训练集中的数据特征和标签之间的关系来调整其内部参数。这个过程类似于学生通过学习教材内容来掌握知识。训练集的质量和多样性对模型的性能有重要影响,因为模型只能从训练数据中学习到规律,而无法处理训练数据中未出现的模式。 ### 验证集的作用 验证集用于在训练过程中评估模型的性能,并调整模型的超参数(如学习率、正则化系数等)。超参数不是通过训练自动学习的,而是需要手动设置的。验证集帮助选择最佳的超参数组合,以确保模型在未见过的数据上表现良好。这个过程类似于考试前的模拟测试,通过模拟测试的结果来调整复习策略。 ### 测试集的作用 测试集用于最终评估模型的性能。它完全独立于训练集验证集,确保模型的评估结果不受训练过程的影响。测试集的目的是提供一个无偏见的评估,反映模型在实际应用中的表现。测试集的误差通常被视为模型泛化能力的一个估计,即模型在新数据上的表现。 ### 数据划分比例 通常情况下,数据集的划分比例为训练集:验证集:测试集 = 0.6:0.2:0.2。这种划分方式确保了模型有足够的数据进行训练,同时保留了足够的数据用于验证和测试。当然,具体的划分比例可以根据数据集的大小和任务的需求进行调整。 ### 防止过拟合 将数据集划分为训练集验证集测试集的主要原因之一是为了防止过拟合(overfitting)。过拟合是指模型在训练数据上表现很好,但在新数据上表现较差的现象。如果使用训练集进行测试,模型可能会过度适应训练数据中的噪声和细节,导致评估结果过于乐观。通过使用独立的验证集测试集,可以更准确地评估模型的泛化能力,从而避免过拟合。 ### 代码示例 以下是一个简单的Python代码示例,展示如何使用`scikit-learn`库将数据集划分为训练集验证集测试集: ```python from sklearn.model_selection import train_test_split import numpy as np # 假设X是特征数据,y是标签数据 X, y = np.arange(100).reshape((50, 2)), range(50) # 首先将数据集划分为训练集和临时集(验证集+测试集) X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42) # 再将临时集划分为验证集测试集 X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42) print("训练集大小:", len(X_train)) print("验证集大小:", len(X_val)) print("测试集大小:", len(X_test)) ``` 上述代码首先将数据集划分为训练集和临时集,其中临时集包含验证集测试集。然后,临时集被进一步划分为验证集测试集。最终,数据集被划分为训练集验证集测试集,比例为0.6:0.2:0.2。 ### 总结 训练集验证集测试集在机器学习中各自承担着不同的职责。训练集用于模型的学习,验证集用于超参数的选择和模型的调优,测试集用于最终评估模型的性能。合理的数据划分有助于提高模型的泛化能力,防止过拟合,确保模型在实际应用中的稳定性[^1]。
评论
成就一亿技术人!
拼手气红包6.0元
还能输入1000个字符
 
红包 添加红包
表情包 插入表情
 条评论被折叠 查看
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值