Andrew Ng's machine learning lecture notes (11)

License: CC 4.0 BY-SA

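This article describes how to select the best model by splitting the data into a training set, a cross-validation set, and a test set, and discusses the high bias and high variance problems and how to address them.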

Choosing a model

After we get a model, we sometimes wonder how we can improve it. To do so, we can divide our data set into three parts: first the training set (60%), second the cross-validation set (20%), and third the test set (20%).
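The 60/20/20 split can be sketched in a few lines of NumPy. This is only an illustration: the function name `split_dataset`, the shuffling step, and the toy data are assumptions, not something these notes prescribe.

```python
import numpy as np

def split_dataset(X, y, train_frac=0.6, cv_frac=0.2, seed=0):
    """Shuffle the examples, then split them into training, CV and test parts."""
    m = X.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(m)                 # shuffle so each part is a random sample
    n_train = int(round(train_frac * m))
    n_cv = int(round(cv_frac * m))
    train, cv, test = np.split(idx, [n_train, n_train + n_cv])
    return (X[train], y[train]), (X[cv], y[cv]), (X[test], y[test])

# Toy data (hypothetical): 100 examples with one feature.
X = np.arange(100, dtype=float).reshape(-1, 1)
y = 2.0 * X[:, 0] + 1.0

(train_X, train_y), (cv_X, cv_y), (test_X, test_y) = split_dataset(X, y)
print(len(train_X), len(cv_X), len(test_X))
```

Shuffling before splitting matters when the data are stored in some order (for example sorted by label); otherwise the three parts would not come from the same distribution.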

For linear regression

Suppose that we have fitted several models on the same training set. Which model should we choose? We can use the cost function to estimate each model's error on the cross-validation set and then choose the model with the smallest error. The test set is then used to estimate the error of the chosen model on data that played no part in the choice.
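As a rough sketch of this procedure, assume the candidate models are polynomials of different degree fitted with `np.polyfit`, and that the cost is the unregularized squared-error cost. The helper names and the synthetic data below are made up for illustration.

```python
import numpy as np

def squared_error_cost(theta, x, y):
    """J(theta) = 1/(2m) * sum((h(x) - y)^2), without a regularization term."""
    residual = np.polyval(theta, x) - y
    return float(residual @ residual) / (2 * len(y))

def select_degree(x_train, y_train, x_cv, y_cv, degrees):
    """Fit one polynomial per degree on the training set; keep the lowest CV cost."""
    best = None
    for d in degrees:
        theta = np.polyfit(x_train, y_train, d)        # learn theta on training data only
        j_cv = squared_error_cost(theta, x_cv, y_cv)   # compare models on the CV set
        if best is None or j_cv < best[1]:
            best = (d, j_cv, theta)
    return best

# Synthetic data (hypothetical): a noisy quadratic, split 60/20/20.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(0, 0.1, 200)
x_train, x_cv, x_test = x[:120], x[120:160], x[160:]
y_train, y_cv, y_test = y[:120], y[120:160], y[160:]

degree, j_cv, theta = select_degree(x_train, y_train, x_cv, y_cv, range(1, 7))
j_test = squared_error_cost(theta, x_test, y_test)     # reported once, after the choice
print(degree, j_cv, j_test)
```

The test cost is computed only for the selected model, so it is not biased by the selection step the way the cross-validation cost is.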

For logistic regression

The procedure is the same as above, except that the cost function should be defined as follows:
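The formula itself is not shown in these notes. Assuming the usual course notation, where $h_\theta(x)$ is the sigmoid hypothesis and $m_{cv}$ is the number of cross-validation examples (both symbols are assumptions here), the logistic regression cost on the cross-validation set would read:

```latex
J_{cv}(\theta) = -\frac{1}{m_{cv}} \sum_{i=1}^{m_{cv}}
  \left[ y_{cv}^{(i)} \log h_\theta\!\left(x_{cv}^{(i)}\right)
       + \left(1 - y_{cv}^{(i)}\right) \log\!\left(1 - h_\theta\!\left(x_{cv}^{(i)}\right)\right) \right]
```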

and the error for the test set is as follows:
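One common choice for the test error in classification, given here as an assumption since the notes do not show the formula, is the misclassification rate with a 0.5 decision threshold, where $m_{test}$ is the number of test examples:

```latex
\mathrm{err}\left(h_\theta(x), y\right) =
\begin{cases}
  1 & \text{if } h_\theta(x) \ge 0.5 \text{ and } y = 0, \text{ or } h_\theta(x) < 0.5 \text{ and } y = 1 \\
  0 & \text{otherwise}
\end{cases}

\text{Test error} = \frac{1}{m_{test}} \sum_{i=1}^{m_{test}}
  \mathrm{err}\left(h_\theta\!\left(x_{test}^{(i)}\right), y_{test}^{(i)}\right)
```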

High bias or high variance problem?

A high bias problem means that the model is underfitting, while a high variance problem means that the model is overfitting.

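So, how can we tell? We should consider which element leads to the problem, for example the degree of the polynomial. We can plot J(theta) of the held-out data (cross-validation or test) and of the training data against that element in the same figure, and work out whether there is a bias problem or a variance problem. If both errors are high and close to each other, the model has high bias; if the training error is low but the held-out error is much higher, the model has high variance.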
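The comparison can be sketched by tabulating both costs for each polynomial degree. The function `errors_by_degree` and the noisy sine data are illustrative assumptions; the two columns could equally be passed to a plotting library.

```python
import numpy as np

def errors_by_degree(x_train, y_train, x_cv, y_cv, degrees):
    """Return (degree, training cost, CV cost) for each polynomial degree."""
    def cost(theta, x, y):
        r = np.polyval(theta, x) - y
        return float(r @ r) / (2 * len(y))

    rows = []
    for d in degrees:
        theta = np.polyfit(x_train, y_train, d)   # fit on the training set only
        rows.append((d, cost(theta, x_train, y_train), cost(theta, x_cv, y_cv)))
    return rows

# Synthetic data (hypothetical): a noisy sine with a small training set.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + rng.normal(0, 0.2, 60)
x_train, y_train = x[:20], y[:20]
x_cv, y_cv = x[20:], y[20:]

rows = errors_by_degree(x_train, y_train, x_cv, y_cv, range(1, 9))
for d, j_train, j_cv in rows:
    print(f"degree={d}  J_train={j_train:.4f}  J_cv={j_cv:.4f}")
```

Reading the table: at low degree both costs tend to be high together (the bias regime), while at high degree the training cost keeps shrinking and any gap to the cross-validation cost points towards variance.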

We can now summarize how to choose a good model as follows:

If we are suffering from a high bias problem, adding more data is not likely to help, while if we are suffering from a high variance problem, adding more data is likely to help.

When we have a model and want to check whether it has a high bias (underfitting) or a high variance (overfitting) problem, we should plot J(theta) of the validation set and of the training set against the number of training examples. In practice, remember that the validation set should remain the same, and that we should learn a new theta each time we increase the number of training examples.
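A minimal sketch of such a learning curve, assuming a straight-line model fitted with `np.polyfit` and synthetic quadratic data (both are assumptions chosen so that the model underfits):

```python
import numpy as np

def learning_curve(x_train, y_train, x_cv, y_cv, degree=1, start=2):
    """Training and validation cost as the number of training examples grows."""
    def cost(theta, x, y):
        r = np.polyval(theta, x) - y
        return float(r @ r) / (2 * len(y))

    sizes, j_train, j_cv = [], [], []
    for m in range(start, len(x_train) + 1):
        # Learn a new theta from the first m training examples only.
        theta = np.polyfit(x_train[:m], y_train[:m], degree)
        sizes.append(m)
        j_train.append(cost(theta, x_train[:m], y_train[:m]))  # cost on those m examples
        j_cv.append(cost(theta, x_cv, y_cv))                   # validation set never changes
    return sizes, j_train, j_cv

# Synthetic data (hypothetical): a quadratic target fitted with a straight line.
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 80)
y = x**2 + rng.normal(0, 0.05, 80)
x_train, y_train = x[:60], y[:60]
x_cv, y_cv = x[60:], y[60:]

sizes, j_train, j_cv = learning_curve(x_train, y_train, x_cv, y_cv)
for m, jt, jc in list(zip(sizes, j_train, j_cv))[::10]:
    print(f"m={m:2d}  J_train={jt:.4f}  J_cv={jc:.4f}")
```

With a high bias model like this one, the two curves are expected to flatten out close to each other at a relatively high cost, which is why more data would not help much; with a high variance model, a persistent gap between the curves would be the sign to look for.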