1. Training error, test error, and the bias-variance tradeoff
Our data is a sample, and what we often need to do is resample it; cross-validation is one such resampling method.
A lower training error does not guarantee a lower test error: the test error can get higher if we overfit.
Model complexity: in a linear model, for example, it is the number of features, that is, the number of coefficients we fit in the model. Low complexity means a small number of features or predictors; high complexity means a large number.
Think of fitting polynomial models of higher and higher degree.
As the figure shows, the training error keeps shrinking as model complexity grows, because the model matches the training data better and better; the test error, by contrast, falls to a minimum and then starts rising again because of overfitting. The figure is a good illustration of overfitting itself.
Bias and variance are the components of prediction error. The bias is how far off, on average, the model is from the truth; the variance is how much the estimate varies around its average. In other words, when we do not fit very hard we end up farther from the truth, so the bias is large, while the variance is small because the number of features is small.
To find the right point, the right degree, or more generally the right model complexity, cross-validation is very useful.
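To make this concrete, here is a minimal simulation sketch of the two error curves. Everything in it is invented for illustration: the data-generating curve sin(x), the noise level, the seed, and the degree range are all assumptions, not part of the course material.

set.seed(1)
n=100
x=runif(n,-2,2)
y=sin(x)+rnorm(n,sd=0.3)            # toy data: truth is sin(x) plus noise
train=sample(n,n/2)                 # random half for training
train.err=test.err=rep(0,10)
for(d in 1:10){
  fit=lm(y~poly(x,d),subset=train)
  train.err[d]=mean((y[train]-predict(fit,data.frame(x=x[train])))^2)
  test.err[d]=mean((y[-train]-predict(fit,data.frame(x=x[-train])))^2)
}
plot(1:10,train.err,type="b",ylim=range(c(train.err,test.err)),xlab="degree",ylab="MSE")
lines(1:10,test.err,type="b",col="red")   # test error on the held-out half

The black training curve should fall monotonically with the degree, while the red test curve should turn back up once the fit starts chasing noise: exactly the U shape described above.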
2. The validation-set approach
When our data set is fairly small, we resort to cross-validation methods.
First set aside part of the data, use the rest as training data, and then test the fitted model on the part that was set aside.
This figure, for example, shows twofold validation: the data is split into two halves, the blue half for training and the orange half for testing.
As the left panel shows, the MSE is already small once the degree is at least 2. The right panel exposes the weakness of twofold cross-validation: each time we redraw the two halves for training and testing, the curves keep roughly the same shape, but the MSE wanders anywhere from 16 to 24, so the variability is large. How many folds are appropriate, then? K = 5 to 10 works well.
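A sketch of that right-panel experiment on the Auto data is easy to write; the seed, the number of splits, and the axis limits below are arbitrary choices, not taken from the course.

require(ISLR)
set.seed(2)
plot(0,0,type="n",xlim=c(1,10),ylim=c(14,30),xlab="degree",ylab="validation MSE")
for(s in 1:10){
  train=sample(nrow(Auto),nrow(Auto)/2)   # a fresh random 50/50 split each time
  mse=rep(0,10)
  for(d in 1:10){
    fit=lm(mpg~poly(horsepower,d),data=Auto,subset=train)
    mse[d]=mean((Auto$mpg-predict(fit,Auto))[-train]^2)   # MSE on the held-out half
  }
  lines(1:10,mse,col=s)   # ten curves: same shape, different levels
}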
3. K-fold cross-validation
Split the data into K parts: one part serves as the validation set and the remaining K-1 parts serve as the training set.
In this figure, for instance, the first part is the validation set and the other four parts are pooled to fit the model, which is then tested on the validation set; next the second part becomes the validation set and the same steps are repeated, and so on, for five rounds in total.
The figure makes two points. First, when the degree is 2, the MSE is small and the ten curves nearly coincide, which settles the model-complexity question. Second, with K = 10, i.e. tenfold CV, the variability is very small and the ten runs stay essentially identical; compared with the earlier twofold figure, this confirms that 5 to 10 folds is a good choice (a from-scratch sketch of the procedure follows the notes below).
(1) Since each training set is only (K-1)/K as big as the original training set, the estimates of prediction error will typically be biased upward.
(2) This bias is minimized when K = n (LOOCV), but this estimate has high variance, as noted earlier.
(3) K = 5 or 10 provides a good compromise for this bias-variance tradeoff.
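Here is the promised from-scratch sketch of the K-fold procedure. The helper name kfold.cv is made up, and the model formula is hard-coded to the Auto example; in practice cv.glm from the boot package (used below) does this for you.

require(ISLR)
kfold.cv=function(K,d,data=Auto){
  folds=sample(rep(1:K,length.out=nrow(data)))   # shuffle fold labels 1..K over the rows
  errs=rep(0,K)
  for(k in 1:K){
    fit=lm(mpg~poly(horsepower,d),data=data[folds!=k,])   # train on the other K-1 folds
    pred=predict(fit,data[folds==k,])
    errs[k]=mean((data$mpg[folds==k]-pred)^2)             # validate on fold k
  }
  mean(errs)   # average of the K validation MSEs
}
kfold.cv(K=10,d=2)   # e.g. a tenfold estimate for the quadratic fit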
R code for LOOCV (leave-one-out cross-validation):
require(ISLR)
require(boot)
?cv.glm   # cv.glm in the boot package performs cross-validation; see its help page for details
plot(mpg~horsepower,data=Auto)
## LOOCV
glm.fit=glm(mpg~horsepower, data=Auto)
cv.glm(Auto,glm.fit)$delta   # pretty slow (doesn't use formula (5.2) on page 180); delta is the prediction error
## Let's write a simple function that uses formula (5.2)
loocv=function(fit){
  h=lm.influence(fit)$h            # leverage h_i of each observation
  mean((residuals(fit)/(1-h))^2)   # formula (5.2): LOOCV error from a single fit
}
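For reference, formula (5.2) that this function implements is a standard identity for least squares, with $h_i$ the leverage of observation $i$:

$$\mathrm{CV}_{(n)}=\frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i-\hat{y}_i}{1-h_i}\right)^2$$

so the full leave-one-out error comes from one fit rather than n refits.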
## Now we try it out
loocv(glm.fit)
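## LOOCV error for polynomial fits of degree 1 to 5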
cv.error=rep(0,5)
degree=1:5
for(d in degree){
glm.fit=glm(mpg~poly(horsepower,d), data=Auto)
cv.error[d]=loocv(glm.fit)
}
plot(degree,cv.error,type="b")
## 10-fold CV
cv.error10=rep(0,5)
for(d in degree){
glm.fit=glm(mpg~poly(horsepower,d), data=Auto)
cv.error10[d]=cv.glm(Auto,glm.fit,K=10)$delta[1]
}
lines(degree,cv.error10,type="b",col="red")
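If you run both loops, the red tenfold curve should track the LOOCV curve closely on the Auto data, which is the practical point of the K = 5 or 10 compromise: nearly the same error estimate at a fraction of the refitting cost.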