Introduction
Resampling methods (e.g. cross-validation and the bootstrap) involve repeatedly drawing samples from a training set and refitting a model of interest on each sample, in order to obtain additional information about the fitted model. They can provide:
- Estimates of test-set prediction error (cross-validation)
- Standard errors and bias of estimated parameters (bootstrap; a sketch follows this list)
- Confidence intervals for a target parameter (bootstrap)
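To make the bootstrap idea concrete, here is a minimal sketch that estimates the standard error of a statistic by resampling with replacement. The exponential data and the choice of the median as the statistic are made up for illustration.

```python
import numpy as np

def bootstrap_se(data, statistic, n_boot=1000, seed=0):
    """Estimate the S.E. of `statistic` by resampling `data` with replacement."""
    rng = np.random.default_rng(seed)
    n = len(data)
    # Each bootstrap replicate: draw n indices with replacement, recompute the statistic
    reps = np.array([statistic(data[rng.integers(0, n, size=n)])
                     for _ in range(n_boot)])
    return reps.std(ddof=1)

data = np.random.default_rng(1).exponential(scale=2.0, size=200)
print(bootstrap_se(data, np.median))  # bootstrap S.E. of the sample median
```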
Cross-Validation
The training error rate often is quite different from the test error rate, and in particular the former can dramatically underestimate the latter.
- Model Complexity Low: High bias, Low variance
- Model Complexity High: Low bias, High variance
Prediction Error Estimates
- Use a large test set directly (often unavailable)
- Apply a mathematical adjustment to the training error (a small computation sketch follows the formulas):
- $C_p = \frac{1}{n}\left(SSE_d + 2d\hat{\sigma}^2\right)$
- $AIC = \frac{1}{n\hat{\sigma}^2}\left(SSE_d + 2d\hat{\sigma}^2\right)$
- $BIC = \frac{1}{n\hat{\sigma}^2}\left(SSE_d + \log(n)\,d\hat{\sigma}^2\right)$
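These adjustments are cheap to compute once a model is fit; a small sketch, with made-up values for $SSE_d$, $n$, $d$, and $\hat{\sigma}^2$:

```python
import numpy as np

def cp_aic_bic(sse_d, n, d, sigma2_hat):
    """C_p, AIC, and BIC adjustments for a model with d predictors.

    sigma2_hat is an estimate of the error variance (e.g. from the full model)."""
    cp = (sse_d + 2 * d * sigma2_hat) / n
    aic = (sse_d + 2 * d * sigma2_hat) / (n * sigma2_hat)
    bic = (sse_d + np.log(n) * d * sigma2_hat) / (n * sigma2_hat)
    return cp, aic, bic

# Hypothetical values for illustration only
print(cp_aic_bic(sse_d=120.0, n=100, d=5, sigma2_hat=1.1))
```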
- CV: a class of methods that estimate the test error rate by holding out a subset of the training observations from the fitting process, and then applying the statistical learning method to those held-out observations.
The Validation Set Approach
Randomly split the observations into two halves: one half forms the training set, the other the validation (hold-out) set.
Drawbacks
- The validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set.
- Only a subset of the observations is used to fit the model.
- As a result, the validation set error rate tends to overestimate the test error rate for the model fit on the entire data set.
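A minimal sketch of the validation set approach on synthetic data, assuming scikit-learn is available; rerunning with a different `random_state` shows how much the error estimate moves from split to split:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 2 + 3 * X[:, 0] + rng.normal(size=200)

# Random 50/50 split; the estimate depends on which points land in each half
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5, random_state=42)
model = LinearRegression().fit(X_tr, y_tr)
print(mean_squared_error(y_va, model.predict(X_va)))  # validation MSE
```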
Leave-One-Out Cross-Validation
LOOCV involves splitting the set of observations into two parts. However, instead of creating two subsets of comparable size, a single observation $(x_1, y_1)$ is used for the validation set, and the remaining observations $\{(x_2, y_2), \ldots, (x_n, y_n)\}$ make up the training set.
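Implemented directly, LOOCV refits the model n times, once per held-out observation. A sketch on synthetic data using scikit-learn's LeaveOneOut splitter:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2 + 3 * X[:, 0] + rng.normal(size=100)

# n fits in total: each observation serves as the validation set exactly once
scores = cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
print(-scores.mean())  # LOOCV estimate of the test MSE
```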
In Linear Regression
$$CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2$$

- $CV_{(n)}$ becomes a weighted MSE, where $h_i$ is the leverage of the $i$-th observation; only a single fit of the model is needed.
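Because of this identity, LOOCV for least squares needs only a single fit. A sketch computing it from the hat matrix; the intercept column and synthetic data are assumptions of the example:

```python
import numpy as np

def loocv_linear(X, y):
    """LOOCV MSE for OLS via the leverage shortcut: one fit, no refitting."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
    h = np.diag(H)                         # leverages h_i
    y_hat = H @ y                          # fitted values from the single full fit
    return np.mean(((y - y_hat) / (1 - h)) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([np.ones(100), x])  # design matrix with intercept column
y = 2 + 3 * x + rng.normal(size=100)
print(loocv_linear(X, y))  # equals the average over n refitted models
```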
Drawbacks
- Estimates from each fold are highly correlated and hence their average can have high variance.
K-fold Cross-Validation
This approach involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. This procedure is repeated k times; each time, a different group of observations is treated as a validation set. This process results in k estimates of the test error. The k-fold CV estimate is computed by averaging these values. If k=n, then it is LOOCV.
$$CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} MSE_i \quad \text{or} \quad CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} Err_i$$
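A sketch of 10-fold CV on synthetic data with scikit-learn; `KFold` puts each observation in exactly one validation fold, and the fold MSEs are averaged as in the formula above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 2 + 3 * X[:, 0] + rng.normal(size=200)

kf = KFold(n_splits=10, shuffle=True, random_state=42)
mse_per_fold = -cross_val_score(LinearRegression(), X, y, cv=kf,
                                scoring="neg_mean_squared_error")
print(mse_per_fold.mean())  # CV_(k): the average of the k fold MSEs
```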