11. How Can Machines Learn Better? - Overfitting and Solution

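This article explains overfitting and underfitting in machine learning, analyzes the causes of overfitting, and presents five ways to avoid it: starting from a simple model, data cleaning, data hinting, regularization, and validation.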


1. What is Overfitting?

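At the end of the previous section we noted that a linear model with too much model complexity can cause overfitting. The opposite problem, underfitting, clearly exists as well.

So what are overfitting and underfitting?

Start with the names. Overfitting is "over" fitting: the fitting has been overdone. Using a linear model for classification or regression is a fitting process, so overfitting means we have fit the data too aggressively. By the same reasoning, underfitting means we have not fit it well enough.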

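So when is the fitting overdone, and when does it fall short?

Overfitting: a relatively complex model is applied to a relatively simple problem.
Underfitting: a relatively simple model is applied to a relatively complex problem.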

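The following examples illustrate this.
1. The first example is shown in Figure 1. (It uses a figure from Ng, because I could not find one in Prof. Lin's slides that shows underfit, good fit, and overfit side by side.) The left plot is an underfit, the middle plot is a good fit, and the right plot is an overfit. Judged by the fit alone, Ein is clearly larger in the left and middle plots than in the right plot. Judged by generalization, however, the left and middle plots are clearly better than the right one.
If we treat the points that the good fit misclassifies as noise, then the overfit model has been badly distorted by that noise.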

Figure 1. Underfit, Good Fit and Overfit [1]

2. Next, recall the VC dimension curve summarized earlier, shown in Figure 2.

Figure 2. Learning Curve [1]

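The figure shows that as the VC dimension grows, Ein keeps decreasing, while Eout first decreases and then increases. Overfitting and underfitting sit on opposite sides of the minimum of Eout:
- Overfitting occurs when the VC dimension is too large: Ein is very small but Eout is large. The model fits the training samples very well, but it has gone too far, so generalization suffers and Eout is high.
- Underfitting occurs when the VC dimension is too small: Ein is large, and Eout is large as well. The model does not fit even the training samples well, so although the gap between Ein and Eout is small, both errors are high.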
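The shape of this curve can be reproduced with a small experiment. The sketch below is only an illustration under stated assumptions: the polynomial degree stands in for the VC dimension, and the sine target, the noise level, and the sample sizes are made up here rather than taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, sigma):
    # Hypothetical target sin(pi * x) plus Gaussian noise, for illustration only.
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(np.pi * x) + rng.normal(0.0, sigma, n)
    return x, y

def mse(coeffs, x, y):
    # Mean squared error of the polynomial with the given coefficients.
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_train, y_train = make_data(15, 0.3)    # small training set -> E_in
x_test, y_test = make_data(5000, 0.3)    # large fresh sample -> estimate of E_out

degrees = list(range(1, 9))
e_in, e_out = [], []
for d in degrees:
    w = np.polyfit(x_train, y_train, d)  # least-squares fit of degree d
    e_in.append(mse(w, x_train, y_train))
    e_out.append(mse(w, x_test, y_test))

for d, ei, eo in zip(degrees, e_in, e_out):
    print(f"degree={d}  E_in={ei:.4f}  E_out={eo:.4f}")
```

Because each higher degree contains the lower ones as special cases, E_in can only go down as the degree rises; E_out is under no such constraint, which is where underfitting (low degree) and overfitting (high degree) show up.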

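Fixing underfitting was discussed earlier: raise the polynomial degree step by step, which raises the VC dimension, until the fit is good enough.
Overfitting is a more complicated problem and is examined in more depth below.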

3. Figure 3 below gives a more direct view of the problems that overfitting causes.

Figure 3. Cases of Overfitting [2]

In the figure, the overfit model always has a lower Ein than the good fit, but its Eout is much higher (poor generalization).

To summarize, three factors lead to overfitting:
- the data size N is too small
- there is too much noise
- the VC dimension is too large
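The three factors can be checked numerically by measuring the gap Eout - Ein while changing one factor at a time. This is a rough sketch, not the experiment from the lecture: the helper `average_gap`, the sine target, and every numeric setting are assumptions chosen for illustration, and the polynomial degree again stands in for the VC dimension.

```python
import numpy as np

def average_gap(n, sigma, degree, trials=50, seed=0):
    """Average E_out - E_in of a least-squares polynomial fit over random data sets."""
    rng = np.random.default_rng(seed)
    gaps = []
    for _ in range(trials):
        # Training sample of size n from a hypothetical noisy sine target.
        x = rng.uniform(-1.0, 1.0, n)
        y = np.sin(np.pi * x) + rng.normal(0.0, sigma, n)
        # Large fresh sample from the same distribution to estimate E_out.
        x_new = rng.uniform(-1.0, 1.0, 2000)
        y_new = np.sin(np.pi * x_new) + rng.normal(0.0, sigma, 2000)
        w = np.polyfit(x, y, degree)
        e_in = np.mean((np.polyval(w, x) - y) ** 2)
        e_out = np.mean((np.polyval(w, x_new) - y_new) ** 2)
        gaps.append(e_out - e_in)
    return float(np.mean(gaps))

gap_small_n = average_gap(n=15, sigma=0.3, degree=8)
gap_large_n = average_gap(n=200, sigma=0.3, degree=8)
gap_low_noise = average_gap(n=30, sigma=0.1, degree=8)
gap_high_noise = average_gap(n=30, sigma=0.5, degree=8)
gap_low_degree = average_gap(n=20, sigma=0.3, degree=2)
gap_high_degree = average_gap(n=20, sigma=0.3, degree=10)

print("N = 15 vs 200:      ", gap_small_n, gap_large_n)
print("sigma = 0.1 vs 0.5: ", gap_low_noise, gap_high_noise)
print("degree = 2 vs 10:   ", gap_low_degree, gap_high_degree)
```

In each pair, the setting with fewer samples, more noise, or a higher degree should show the larger gap, which matches the three bullet points.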



2. Dealing with Overfitting

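The previous section introduced overfitting, analyzed it, and identified three factors that cause it. Based on those three factors, there are five methods that help us avoid overfitting.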

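  • Start from a simple model and increase the model complexity step by step - keeps the VC dimension from getting too large
  • Data cleaning/pruning - keeps the noise from getting too large
  • Data hinting - compensates for a data size N that is too small
  • Regularization - keeps the VC dimension from getting too large
  • Validation - hold out part of the data as a validation set to estimate the model's generalization ability in advance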

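Each of the five methods is introduced below. The first three are fairly simple and are not discussed in depth here, while regularization and validation are more involved and are treated at greater length.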

1) Start from Simple Model

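As the previous chapter noted, when the VC dimension is too large, Ein decreases while Eout increases. So we start debugging with a model of degree d=1. If Ein does not meet the requirement, we raise d to 2 and debug again, and so on, until Ein meets the requirement. At that point the VC dimension is still not very large, and we end up with a model that generalizes relatively well.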
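The procedure above can be sketched as a loop over polynomial degrees. This is a minimal illustration, assuming squared error and a polynomial model; the function name `fit_simplest`, the Ein target of 0.05, and the toy data are all made up here.

```python
import numpy as np

def fit_simplest(x, y, e_in_target, max_degree=10):
    """Return (degree, weights, e_in) for the lowest degree whose E_in meets the target.

    If no degree up to max_degree meets the target, the max_degree fit is returned.
    """
    for degree in range(1, max_degree + 1):
        w = np.polyfit(x, y, degree)
        e_in = float(np.mean((np.polyval(w, x) - y) ** 2))
        if e_in <= e_in_target:
            break  # stop at the first (simplest) model that is good enough
    return degree, w, e_in

# Toy data: hypothetical sine target with a little Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 50)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.1, 50)

degree, w, e_in = fit_simplest(x, y, e_in_target=0.05)
print(f"stopped at degree {degree} with E_in = {e_in:.4f}")
```

On this toy data, a straight line and a parabola both leave a large Ein, so the search stops at degree 3 instead of jumping straight to a high-degree model.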

2) Data Cleaning/Pruning

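Data cleaning/pruning deals with samples in the training set whose labels are clearly wrong, either by cleaning them or by pruning them. The key is to find the mislabeled or noisy points accurately. Once found, they can be handled in two ways:
- correct the label, which is data cleaning;
- delete the bad sample, which is data pruning.

The handling itself is simple; the hard part is detecting which samples are noise or outliers.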
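The text does not prescribe a way to detect bad samples. One simple heuristic, shown here only as an assumption-laden sketch, is to fit a simple model and flag points whose residual is far from the typical residual, measured with the median absolute deviation. The function `prune_outliers`, the threshold `k=6.0`, and the toy data are hypothetical.

```python
import numpy as np

def prune_outliers(x, y, k=6.0):
    """Fit a line, flag points with unusually large residuals, and drop them.

    Returns the kept x, the kept y, and the indices of the flagged points.
    """
    w = np.polyfit(x, y, 1)
    residual = y - np.polyval(w, x)
    center = np.median(residual)
    # 1.4826 * MAD is a robust estimate of the standard deviation for Gaussian noise.
    scale = 1.4826 * np.median(np.abs(residual - center))
    suspect = np.abs(residual - center) > k * scale
    return x[~suspect], y[~suspect], np.flatnonzero(suspect)

# Toy data: a noisy line with three grossly wrong targets injected on purpose.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 60)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.05, 60)
bad = np.array([5, 25, 45])
y[bad] += 5.0

x_clean, y_clean, flagged = prune_outliers(x, y)
print("flagged indices:", flagged)
```

This version prunes. A cleaning variant would keep the flagged points and correct their targets instead, which needs some trusted source for the right value.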

3) Data Hinting

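Data hinting targets the case where N is not large enough. It applies simple processing or transformations to the known samples to obtain more samples. In digit classification, for example, the known digit images can be shifted or rotated slightly to produce more data and so enlarge the training set. Data obtained this way are called virtual examples.

Note that the new virtual examples may no longer be i.i.d. from the original distribution. Virtual examples should therefore be constructed so that they are plausible and stay as close as possible to i.i.d. samples from that distribution.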
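As a small illustration of virtual examples, the sketch below shifts a toy image by one pixel in each direction and keeps the label. The helpers `shift` and `virtual_examples` and the 5x5 image are made up for this example; only translation is shown, and a rotation would need an image library.

```python
import numpy as np

def shift(image, dy, dx):
    """Translate a 2-D image by (dy, dx) pixels, filling exposed edges with zeros."""
    h, w = image.shape
    out = np.zeros_like(image)
    # Part of the source image that stays inside the frame after the shift.
    src = image[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    top, left = max(0, dy), max(0, dx)
    out[top:top + src.shape[0], left:left + src.shape[1]] = src
    return out

def virtual_examples(image, label, shifts=((0, 1), (0, -1), (1, 0), (-1, 0))):
    """Return (shifted_image, label) pairs; a small shift does not change the label."""
    return [(shift(image, dy, dx), label) for dy, dx in shifts]

# Toy 5x5 "digit 1": a vertical bar in the middle column.
digit = np.zeros((5, 5), dtype=int)
digit[1:4, 2] = 1

augmented = [(digit, 1)] + virtual_examples(digit, 1)
print(len(augmented), "examples from 1 original")
```

Keeping the shifts small is what keeps the virtual examples plausible: a one-pixel shift still looks like the same digit, while a large shift could push the stroke out of the frame.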

4) Regularization

Regularization is a penalized method: a regularizer is added to the original objective as a penalty, so that an overly complex model becomes less complex.

For the discussion of regularization, see:
12. Machine Learning Foundations - How can Machine Learn Better? - Regularization
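To make the idea of a penalty concrete, here is a minimal sketch that assumes one common regularizer, the squared L2 norm of the weights (ridge regression, also called weight decay). This section does not say which regularizer is used, so this choice, the helper names, and the value `lam=0.1` are assumptions.

```python
import numpy as np

def poly_features(x, degree):
    """Map each x to the feature vector (1, x, x^2, ..., x^degree)."""
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(Z, y, lam):
    """Minimize ||Zw - y||^2 + lam * ||w||^2, solved via (Z^T Z + lam I) w = Z^T y."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

def mse(Z, y, w):
    return float(np.mean((Z @ w - y) ** 2))

# Toy data: 12 noisy points from a hypothetical sine target, fit with degree 8.
rng = np.random.default_rng(0)
degree, sigma = 8, 0.3
x_train = rng.uniform(-1.0, 1.0, 12)
y_train = np.sin(np.pi * x_train) + rng.normal(0.0, sigma, 12)
x_test = rng.uniform(-1.0, 1.0, 2000)
y_test = np.sin(np.pi * x_test) + rng.normal(0.0, sigma, 2000)
Z_train, Z_test = poly_features(x_train, degree), poly_features(x_test, degree)

w_plain = ridge_fit(Z_train, y_train, lam=0.0)  # no penalty
w_reg = ridge_fit(Z_train, y_train, lam=0.1)    # with penalty

for name, w in (("lam=0.0", w_plain), ("lam=0.1", w_reg)):
    print(name, "|w| =", round(float(np.linalg.norm(w)), 3),
          "E_in =", round(mse(Z_train, y_train, w), 4),
          "E_out =", round(mse(Z_test, y_test, w), 4))
```

The penalty trades a slightly higher Ein for smaller weights, which is the sense in which the model becomes less complex while the hypothesis set stays the same.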

5) Validation

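Validation is one of the most commonly used methods today. Part of the data is set aside in advance as a held-out set. Because that set is drawn at random, and the data met in real use should look much like it, the error on it gives an early estimate of the model's Eout in real use. This estimate serves as one of the criteria for judging whether the model is acceptable.

For the discussion of validation, see:

13. Machine Learning Foundations - How can Machine Learn Better? - Validation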
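A hold-out split can be sketched as follows. This is an illustration under assumptions: the 40/20 split, the polynomial candidates, and the toy data are made up, and retraining the chosen model on all the data afterwards is a common practice rather than something this section states.

```python
import numpy as np

# Toy data: 60 noisy points from a hypothetical sine target.
rng = np.random.default_rng(1)
n = 60
x = rng.uniform(-1.0, 1.0, n)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.2, n)

# Randomly hold out 20 points for validation; train on the remaining 40.
perm = rng.permutation(n)
val_idx, train_idx = perm[:20], perm[20:]

# Train each candidate on the training part only, then score it on the held-out part.
e_val = {}
for degree in range(1, 11):
    w = np.polyfit(x[train_idx], y[train_idx], degree)
    e_val[degree] = float(np.mean((np.polyval(w, x[val_idx]) - y[val_idx]) ** 2))

# Pick the candidate with the lowest validation error.
best_degree = min(e_val, key=e_val.get)
print("chosen degree:", best_degree, "validation error:", round(e_val[best_degree], 4))

# Retrain the chosen model on all the data before using it.
w_final = np.polyfit(x, y, best_degree)
```

The held-out points never influence the fitted weights, so the validation error behaves like an estimate of Eout, as long as it is not reused too many times to make choices.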



Summary

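1. First, we introduced the concepts of overfitting and underfitting.

2. Next, we focused on overfitting and summarized its causes:

  • the data size N is too small

  • there is too much noise

  • the VC dimension is too large

3. Finally, we discussed how to avoid overfitting as far as possible; see the solutions in Section 2.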



Reference

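[1] Machine Learning Foundations (National Taiwan University, Hsuan-Tien Lin), 13-1 What is Overfitting (10:45)

[2] Machine Learning Foundations (National Taiwan University, Hsuan-Tien Lin), 13-2 The Role of Noise and Data Size (13:36)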


