Underfitting, or high bias: the hypothesis function h maps poorly to the trend of the data.
It is usually caused by a function that is too simple or one that uses too few features.
Overfitting, or high variance: the hypothesis fits the available data but does not generalize well to predict new data.
It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.
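A quick way to see both failure modes is to fit polynomials of increasing degree to the same noisy data. A minimal numpy sketch (the data and the degrees are made up for illustration, not taken from the notes):

```python
# Sketch: low-degree fit underfits (high bias), very high degree overfits (high variance).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)  # noisy target

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, deg=degree)     # least-squares polynomial fit
    y_hat = np.polyval(coeffs, x)
    train_mse = np.mean((y_hat - y) ** 2)
    print(f"degree={degree}  training MSE={train_mse:.4f}")

# degree 1 misses the trend entirely; degree 9 drives the training error down
# with unnecessary wiggles, but will not generalize to new data.
```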
To address it:
1) Reduce the number of features:
1. Manually select which features to keep.
2. Use a model selection algorithm.
2) Regularization:
1. Keep all the features, but reduce the magnitude of the parameters θj.
2. Regularization works well when we have a lot of slightly useful features.
Regularization:
1. Regularized linear regression
Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function:

J(θ) = 1/(2m) [ Σ_{i=1..m} (hθ(x^(i)) − y^(i))² + λ Σ_{j=1..n} θj² ]

λ is the regularization parameter; it controls how heavily large parameter values are penalized.
If λ is chosen to be too large, it may smooth out the function too much and cause underfitting.
As a result, we see that the new hypothesis (depicted by the pink curve) looks like a quadratic function but fits the data better, because the extra higher-order terms end up with very small θ values.
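The modified cost is straightforward to compute; a minimal numpy sketch (the helper name regularized_cost and the shapes are my own assumptions, with X containing a leading column of ones):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """X: (m, n+1) design matrix with a leading column of ones; y: (m,)."""
    m = len(y)
    errors = X @ theta - y                      # h_theta(x) - y for every example
    penalty = lam * np.sum(theta[1:] ** 2)      # theta_0 is conventionally not penalized
    return (errors @ errors + penalty) / (2 * m)
```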
Actually, the update multiplies θj by (1 − αλ/m), which is slightly less than 1,
so it shrinks the parameter a little bit before doing the same gradient step as before.
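Written out as code, the shrink factor is easy to see; a minimal sketch (the helper gradient_step and its arguments are assumptions, not from the notes):

```python
import numpy as np

def gradient_step(theta, X, y, alpha, lam):
    """One gradient-descent step for regularized linear regression."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m         # unregularized gradient, shape (n+1,)
    shrink = 1 - alpha * lam / m             # the factor that is slightly below 1
    new_theta = theta * shrink - alpha * grad
    new_theta[0] = theta[0] - alpha * grad[0]  # theta_0 is not shrunk
    return new_theta
```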
Using regularization also takes care of any non-invertibility issues of the XᵀX matrix.
If m ≤ n, then XᵀX is non-invertible. However, when we add the term λ⋅L, where L is the (n+1)×(n+1) identity matrix with its top-left entry set to 0, XᵀX + λ⋅L becomes invertible.
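A minimal sketch of the regularized normal equation under these assumptions (the helper name normal_equation is mine; X is assumed to include the column of ones):

```python
import numpy as np

def normal_equation(X, y, lam):
    """theta = (X^T X + lam*L)^(-1) X^T y, with the intercept left unpenalized."""
    n_plus_1 = X.shape[1]
    L = np.eye(n_plus_1)
    L[0, 0] = 0                              # zero the (0,0) entry so theta_0 is not regularized
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```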
2. Regularized logistic regression
We can regularize logistic regression in the same way, by adding a penalty term to its cost function:

J(θ) = −(1/m) Σ_{i=1..m} [ y^(i) log(hθ(x^(i))) + (1 − y^(i)) log(1 − hθ(x^(i))) ] + λ/(2m) Σ_{j=1..n} θj²

The θ vector is indexed from 0 to n (holding n+1 values, θ0 through θn), and the regularization sum explicitly skips θ0.
B.t.w., regularization does not make J(θ) non-convex: the regularized logistic regression cost is still convex, so gradient descent (with an appropriate learning rate α) still converges to the global minimum when λ > 0.
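A minimal sketch of the regularized logistic regression cost and gradient, with the penalty and its gradient term skipping θ0 (function names and shapes are assumptions, not from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_and_grad(theta, X, y, lam):
    """X: (m, n+1) with a leading column of ones; y: (m,) of 0/1 labels."""
    m = len(y)
    h = sigmoid(X @ theta)
    # cross-entropy term plus a penalty on theta[1:] only
    cost = (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m
    cost += lam / (2 * m) * np.sum(theta[1:] ** 2)
    grad = X.T @ (h - y) / m
    grad[1:] += lam / m * theta[1:]          # theta_0 gets no regularization term
    return cost, grad
```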