Introduction
Resampling methods (e.g. cross-validation and the bootstrap) involve repeatedly drawing samples from a training set and refitting a model of interest on each sample, in order to obtain additional information about the fitted model. They can provide:
- Estimates of test-set prediction error (cross-validation)
- Standard errors and bias of estimated parameters (bootstrap; a sketch follows this list)
- Confidence intervals for a target parameter (bootstrap)
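To make the bootstrap idea concrete, here is a minimal sketch that estimates the standard error of a statistic by resampling with replacement. The exponential data and the choice of the median as the statistic are made up for illustration.

```python
import numpy as np

def bootstrap_se(data, statistic, n_boot=1000, seed=0):
    """Estimate the S.E. of `statistic` by resampling `data` with replacement."""
    rng = np.random.default_rng(seed)
    n = len(data)
    # Each bootstrap replicate: draw n indices with replacement, recompute the statistic
    reps = np.array([statistic(data[rng.integers(0, n, size=n)])
                     for _ in range(n_boot)])
    return reps.std(ddof=1)

data = np.random.default_rng(1).exponential(scale=2.0, size=200)
print(bootstrap_se(data, np.median))  # bootstrap S.E. of the sample median
```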
Cross-Validation
The training error rate often is quite different from the test error rate, and in particular the former can dramatically underestimate the latter.
- Model Complexity Low: High bias, Low variance
- Model Complexity High: Low bias, High variance
Prediction Error Estimates
- Use a large test set directly (often unavailable)
- Apply a mathematical adjustment to the training error (a small computation sketch follows the formulas):
- $C_p = \frac{1}{n}\left(SSE_d + 2d\hat{\sigma}^2\right)$
- $AIC = \frac{1}{n\hat{\sigma}^2}\left(SSE_d + 2d\hat{\sigma}^2\right)$
- $BIC = \frac{1}{n\hat{\sigma}^2}\left(SSE_d + \log(n)\,d\hat{\sigma}^2\right)$
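These adjustments are cheap to compute once a model is fit; a small sketch, with made-up values for $SSE_d$, $n$, $d$, and $\hat{\sigma}^2$:

```python
import numpy as np

def cp_aic_bic(sse_d, n, d, sigma2_hat):
    """C_p, AIC, and BIC adjustments for a model with d predictors.

    sigma2_hat is an estimate of the error variance (e.g. from the full model)."""
    cp = (sse_d + 2 * d * sigma2_hat) / n
    aic = (sse_d + 2 * d * sigma2_hat) / (n * sigma2_hat)
    bic = (sse_d + np.log(n) * d * sigma2_hat) / (n * sigma2_hat)
    return cp, aic, bic

# Hypothetical values for illustration only
print(cp_aic_bic(sse_d=120.0, n=100, d=5, sigma2_hat=1.1))
```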
- CV: a class of methods that estimate the test error rate by holding out a subset of the training observations from the fitting process, and then applying the statistical learning method to those held-out observations.
The Validation Set Approach
Randomly split the observations into two halves: one half forms the training set, the other the validation (hold-out) set.
Drawbacks
- The validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set.
- Only a subset of the observations is used to fit the model.
- As a result, the validation set error rate tends to overestimate the test error rate for the model fit on the entire data set.
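A minimal sketch of the validation set approach on synthetic data, assuming scikit-learn is available; rerunning with a different `random_state` shows how much the error estimate moves from split to split:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 2 + 3 * X[:, 0] + rng.normal(size=200)

# Random 50/50 split; the estimate depends on which points land in each half
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5, random_state=42)
model = LinearRegression().fit(X_tr, y_tr)
print(mean_squared_error(y_va, model.predict(X_va)))  # validation MSE
```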
Leave-One-Out Cross-Validation
LOOCV involves splitting the set of observations into two parts. However, instead of creating two subsets of comparable size, a single observation $(x_1, y_1)$ is used for the validation set, and the remaining observations $\{(x_2, y_2), \ldots, (x_n, y_n)\}$ make up the training set.
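Implemented directly, LOOCV refits the model n times, once per held-out observation. A sketch on synthetic data using scikit-learn's LeaveOneOut splitter:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2 + 3 * X[:, 0] + rng.normal(size=100)

# n fits in total: each observation serves as the validation set exactly once
scores = cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
print(-scores.mean())  # LOOCV estimate of the test MSE
```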
In Linear Regression
$$CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2$$

- $CV_{(n)}$ becomes a weighted MSE, where $h_i$ is the leverage of the $i$-th observation; only a single fit of the model is needed.
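Because of this identity, LOOCV for least squares needs only a single fit. A sketch computing it from the hat matrix; the intercept column and synthetic data are assumptions of the example:

```python
import numpy as np

def loocv_linear(X, y):
    """LOOCV MSE for OLS via the leverage shortcut: one fit, no refitting."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
    h = np.diag(H)                         # leverages h_i
    y_hat = H @ y                          # fitted values from the single full fit
    return np.mean(((y - y_hat) / (1 - h)) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([np.ones(100), x])  # design matrix with intercept column
y = 2 + 3 * x + rng.normal(size=100)
print(loocv_linear(X, y))  # equals the average over n refitted models
```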
Drawbacks
- Estimates from each fold are highly correlated and hence their average can have high variance.
K-fold Cross-Validation
This approach involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. This procedure is repeated k times; each time, a different group of observations is treated as a validation set. This process results in k estimates of the test error. The k-fold CV estimate is computed by averaging these values. If k=n, then it is LOOCV.
$$CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} MSE_i \quad \text{or} \quad CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} Err_i$$
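A sketch of 10-fold CV on synthetic data with scikit-learn; `KFold` puts each observation in exactly one validation fold, and the fold MSEs are averaged as in the formula above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 2 + 3 * X[:, 0] + rng.normal(size=200)

kf = KFold(n_splits=10, shuffle=True, random_state=42)
mse_per_fold = -cross_val_score(LinearRegression(), X, y, cv=kf,
                                scoring="neg_mean_squared_error")
print(mse_per_fold.mean())  # CV_(k): the average of the k fold MSEs
```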