R squared or coefficient of determination

Original article · latest recommended article published 2025-03-12 10:46:31 · 691 views

Licensed under CC 4.0 BY-SA

Tags: #machine-learning #math-foundations

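This article explores the concept of R² (the coefficient of determination) and its use in machine learning. R² evaluates a regression model's goodness of fit by comparing the model's prediction error with the variation inherent in the data. The article explains why squared error is used and why the mean matters as a baseline, and discusses where R² applies and where it falls short.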

These are my study notes on machine learning. I am writing them in English because I want to improve my writing skills. Thanks for reading, and if I have made any mistakes, please let me know.

R squared, also called the coefficient of determination, is an automated way of finding out how good a best-fit line actually is.

In data analysis and machine learning, the error is usually defined as the distance between a data point's y value (reality) and the regression line's y value (prediction).

Why use squared error?

First, we don't want any negative numbers: when we add up signed errors, positive and negative ones can cancel and the total can come out as zero even though the line fits badly, which would be absurd.
The other reason is that squaring punishes outliers. Any even power such as 4, 6, 8 or larger would also work. The larger the power, the harsher the penalty on outliers, but the more computational resources and time it consumes.
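Both reasons can be seen with a few numbers. The sketch below is only an illustration: the values are made up, and the flat prediction of 4.0 is an assumption chosen so that the signed errors cancel.

```python
# Made-up data: three actual values and a flat prediction of 4.0 for each.
ys_actual = [2.0, 4.0, 6.0]
ys_predicted = [4.0, 4.0, 4.0]

errors = [actual - predicted for actual, predicted in zip(ys_actual, ys_predicted)]

# Signed errors cancel out, even though two of the three predictions are wrong.
sum_signed = sum(errors)

# Squared errors cannot cancel, because every term is non-negative.
sum_squared = sum(e ** 2 for e in errors)

# Outlier penalty: one error of 10 next to two errors of 1 (again made up).
errors_with_outlier = [1.0, 1.0, 10.0]
squared_share = 10.0 ** 2 / sum(e ** 2 for e in errors_with_outlier)
fourth_share = 10.0 ** 4 / sum(e ** 4 for e in errors_with_outlier)

print(sum_signed, sum_squared)
print(squared_share, fourth_share)
```

The outlier's share of the total error grows as the power goes up, which is the sense in which a higher even power is tougher on outliers.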

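Why use the y-mean?

The squared error against the y-mean line represents the data's own variation, so it serves as a baseline: R squared compares the regression line against simply predicting the mean every time. Informally, variance measures how far a set of (random) numbers are spread out from their average value.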
https://en.wikipedia.org/wiki/Variance
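The link between the y-mean line and variance can be checked directly. This is a minimal sketch with made-up values: the squared error against the mean line equals the number of points times the population variance, which NumPy's `np.var` computes by default.

```python
import numpy as np

# Made-up y values, used only to illustrate the relationship.
ys = np.array([5.0, 7.0, 9.0, 11.0])

# Squared error of the data against the flat y-mean line.
y_mean = ys.mean()
se_y_mean = np.sum((ys - y_mean) ** 2)

# np.var defaults to the population variance (divides by n, not n - 1).
population_variance = np.var(ys)

print(se_y_mean, len(ys) * population_variance)
```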

Equation

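$R^2 = 1 - \frac{SE_{\hat y}}{SE_{\bar y}}$
Here $SE$ is the sum ($E$) of the squared ($S$) errors: $SE_{\hat y} = \sum_i (y_i - \hat y_i)^2$ is measured against the regression line, and $SE_{\bar y} = \sum_i (y_i - \bar y)^2$ is measured against the y-mean line.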
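The equation translates into a few lines of NumPy. The function names below (`squared_error`, `coefficient_of_determination`) are my own choices for this sketch, not part of any library, and the sample values are made up.

```python
import numpy as np

def squared_error(ys_orig, ys_line):
    """Sum of squared differences between the data and a line's y values."""
    return float(np.sum((ys_orig - ys_line) ** 2))

def coefficient_of_determination(ys_orig, ys_line):
    """R^2 = 1 - SE(regression line) / SE(y-mean line)."""
    ys_orig = np.asarray(ys_orig, dtype=np.float64)
    ys_line = np.asarray(ys_line, dtype=np.float64)
    # The baseline "line" predicts the mean of y for every point.
    y_mean_line = np.full_like(ys_orig, ys_orig.mean())
    se_regression = squared_error(ys_orig, ys_line)
    se_y_mean = squared_error(ys_orig, y_mean_line)
    return 1 - se_regression / se_y_mean

ys = np.array([1.0, 2.0, 3.0, 4.0])

# A line that passes through every point has no error at all.
r2_perfect = coefficient_of_determination(ys, ys)
# A line that only predicts the mean is no better than the baseline.
r2_mean_only = coefficient_of_determination(ys, np.full_like(ys, ys.mean()))

print(r2_perfect, r2_mean_only)  # 1.0 0.0
```

Note that the function divides by zero if every y value is identical, since the y-mean line then has no error to compare against.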

Drawback

R squared is not always a useful way to measure error; it depends on your goals. If you care about predicting exact future values, R squared is very useful. If you are interested in predicting motion or direction, R squared should not carry as much weight. Besides, R squared also depends on the variance from value to value.

If the variance is low, R squared comes out close to perfect. We can check this with some simple code: a random dataset generator whose variance we can change. The correlation parameter describes the relationship in the generated data: 'pos' or 'neg' gives a positive or negative trend, and False means the data has no correlation.

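import random
import numpy as np

def create_dataset(hm, variance, step=2, correlation=False):
    """Return hm points as (xs, ys). correlation is 'pos', 'neg', or False."""
    val = 1
    ys = []
    for _ in range(hm):
        # Uniform integer noise in [-variance, variance); variance must be a positive int.
        y = val + random.randrange(-variance, variance)
        ys.append(y)
        # Shift the baseline to create a positive or negative trend.
        if correlation == 'pos':
            val += step
        elif correlation == 'neg':
            val -= step

    xs = list(range(len(ys)))
    return np.array(xs, dtype=np.float64), np.array(ys, dtype=np.float64)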
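To see the effect of variance on R squared, here is a self-contained sketch of the experiment. `make_ys` is a hypothetical, condensed stand-in for the positive-correlation case of `create_dataset`, seeded so the run is repeatable. Fitting the line with `np.polyfit` and the sizes 40, 10 and 80 are my assumptions, not part of the original notes.

```python
import random
import numpy as np

def make_ys(n, variance, step=2, seed=0):
    """Positively correlated y values with uniform integer noise (hypothetical helper)."""
    rng = random.Random(seed)
    return np.array(
        [1 + step * i + rng.randrange(-variance, variance) for i in range(n)],
        dtype=np.float64,
    )

def r_squared_of_best_fit(ys):
    """Fit a least-squares line to (index, y) and return its R squared."""
    xs = np.arange(len(ys), dtype=np.float64)
    m, b = np.polyfit(xs, ys, 1)  # slope and intercept of the best-fit line
    line = m * xs + b
    se_line = np.sum((ys - line) ** 2)
    se_mean = np.sum((ys - ys.mean()) ** 2)
    return 1 - se_line / se_mean

# Same trend, different amounts of noise around it.
r2_low = r_squared_of_best_fit(make_ys(40, variance=10))
r2_high = r_squared_of_best_fit(make_ys(40, variance=80))

print("low variance: ", r2_low)
print("high variance:", r2_high)
```

The tight dataset should score much closer to 1 than the noisy one, even though both follow the same underlying trend.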