Going Deeper into Regression Analysis with Assumptions, Plots & Solutions

This article takes a deep look at the basic assumptions of regression analysis: linear and additive relationships, no autocorrelation, no multicollinearity, constant variance of the error terms, and normally distributed errors. By interpreting the four key regression plots (residuals vs fitted values, normal Q-Q, scale-location, and residuals vs leverage), it offers solutions, such as variable transformations and weighted least squares, for cases where these assumptions are violated.


Excerpted from the ANALYTICS VIDHYA CONTENT TEAM: https://www.analyticsvidhya.com/blog/2016/07/deeper-regression-analysis-assumptions-plots-solutions/



Analytics Vidhya Content Team, July 14, 2016

width="728" height="90" vspace="0" hspace="0" scrolling="no" allowfullscreen="true" id="aswift_4">

Introduction

All models are wrong, but some are useful – George Box

Regression analysis marks the first step in predictive modeling. No doubt, it's fairly easy to implement: neither its syntax nor its parameters create any kind of confusion. But merely running one line of code doesn't solve the purpose, and neither does looking only at R² or MSE values. Regression tells you much more than that!

In R, regression analysis returns four diagnostic plots via the plot(model_name) function. Each of these plots provides significant information, or rather an interesting story, about the data. Sadly, many beginners either fail to decipher the information or don't care what these plots say. Once you understand these plots, you'll be able to bring significant improvements to your regression model.
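
To make this concrete, here is a minimal sketch of producing the four plots. The model and R's built-in mtcars data set are my own illustrative choices, not examples from the article:

```r
# Fit a simple linear model on R's built-in mtcars data set (illustrative only)
model <- lm(mpg ~ wt + hp, data = mtcars)

# Arrange the four diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(model)           # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))  # restore the default single-plot layout
```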

For model improvement, you also need to understand regression assumptions and ways to fix them when they get violated.

In this article, I’ve explained the important regression assumptions and plots (with fixes and solutions) to help you understand the regression concept in further detail. As said above, with this knowledge you can bring drastic improvements in your models.

Note: To understand these plots, you must know the basics of regression analysis. If you are completely new to it, you can start here. Then, proceed with this article.


 

Assumptions in Regression

Regression is a parametric approach. 'Parametric' means it makes assumptions about the data for the purpose of analysis. Due to its parametric nature, regression is restrictive: it fails to deliver good results on data sets that don't fulfill its assumptions. Therefore, for a successful regression analysis, it's essential to validate these assumptions.

So, how would you check (validate) whether a data set follows all the regression assumptions? You check it using the regression plots (explained below), along with some statistical tests.

Let’s look at the important assumptions in regression analysis:

  1. There should be a linear and additive relationship between the dependent (response) variable and the independent (predictor) variable(s). A linear relationship means that the change in the response Y due to a one-unit change in X¹ is constant, regardless of the value of X¹. An additive relationship means that the effect of X¹ on Y is independent of the other variables.
  2. There should be no correlation between the residual (error) terms. The presence of such correlation is known as autocorrelation.
  3. The independent variables should not be correlated. The presence of correlation among them is known as multicollinearity.
  4. The error terms must have constant variance. This property is known as homoskedasticity; the presence of non-constant variance is referred to as heteroskedasticity.
  5. The error terms must be normally distributed.

 

What if these assumptions get violated?

Let’s dive into specific assumptions and learn about their outcomes (if violated):

1. Linear and Additive: If you fit a linear model to a non-linear, non-additive data set, the regression algorithm will fail to capture the trend mathematically, resulting in an inefficient model. It will also produce erroneous predictions on unseen data.

How to check: Look at the residuals vs fitted values plot (explained below). You can also include polynomial terms (X, X², X³) in your model to capture a non-linear effect, as in the sketch below.
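
As a sketch of the polynomial-term idea (again on the illustrative mtcars data, my assumption rather than the article's example), you can compare a plain linear fit against one with a quadratic term:

```r
# Plain linear fit versus a fit with first- and second-degree orthogonal terms
model_linear <- lm(mpg ~ wt, data = mtcars)
model_poly   <- lm(mpg ~ poly(wt, 2), data = mtcars)  # adds an X^2 effect

# F-test comparing the nested models; a small p-value hints at non-linearity
anova(model_linear, model_poly)
```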

 

2. Autocorrelation: The presence of correlation in the error terms drastically reduces the model's accuracy. This usually occurs in time series models, where the next instant depends on the previous instant. If the error terms are correlated, the estimated standard errors tend to underestimate the true standard errors.

If this happens, confidence intervals and prediction intervals become narrower. A narrower confidence interval means that a nominal 95% confidence interval has less than a 0.95 probability of containing the true coefficient value. Let's understand this narrowing with an example:

For example, suppose the least squares coefficient of X¹ is 15.02 with a standard error of 2.08 in the absence of autocorrelation. In the presence of autocorrelation, the estimated standard error shrinks to 1.20, and the interval around the coefficient narrows from (12.94, 17.10), i.e. 15.02 ± 2.08, to (13.82, 16.22), i.e. 15.02 ± 1.20.

Also, understated standard errors cause the associated p-values to be lower than they should be, which can make us incorrectly conclude that a parameter is statistically significant.

How to check: Look at the Durbin-Watson (DW) statistic, which always lies between 0 and 4. DW = 2 implies no autocorrelation, 0 < DW < 2 implies positive autocorrelation, and 2 < DW < 4 indicates negative autocorrelation. You can also plot the residuals against time and look for seasonal or correlated patterns in the residual values.
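
A minimal sketch of the DW test in R, assuming the lmtest package is installed; mtcars is not a time series, so the call is shown purely for syntax:

```r
# Durbin-Watson test via dwtest() from the 'lmtest' package
# (car::durbinWatsonTest() is an equivalent alternative)
library(lmtest)

model <- lm(mpg ~ wt + hp, data = mtcars)
dwtest(model)  # a DW statistic near 2 suggests no first-order autocorrelation
```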

 

3. Multicollinearity: This phenomenon exists when the independent variables are moderately or highly correlated. In a model with correlated variables, it becomes a tough task to figure out the true relationship of a predictor with the response variable. In other words, it becomes difficult to find out which variable is actually contributing to the prediction of the response variable.

Moreover, in the presence of correlated predictors, the standard errors tend to increase. And with large standard errors, the confidence intervals become wider, leading to less precise estimates of the slope parameters.

Also, when predictors are correlated, the estimated regression coefficient of a correlated variable depends on which other predictors are included in the model. If this happens, you'll end up concluding incorrectly that a variable strongly or weakly affects the target variable, since even dropping one correlated variable from the model changes the estimated coefficients of the others. That's not good!

How to check: You can use a scatter plot to visualize the correlation among variables. You can also compute the variance inflation factor (VIF): a VIF of 4 or below suggests no multicollinearity, whereas a value of 10 or above implies serious multicollinearity. Above all, a simple correlation table also serves the purpose.
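
A quick sketch of both checks, assuming the car package is available (the variables are my illustrative choices):

```r
# Variance inflation factors via vif() from the 'car' package
library(car)

model <- lm(mpg ~ wt + hp + disp, data = mtcars)
vif(model)  # values of roughly 10 or more flag serious multicollinearity

# A plain correlation table serves the same screening purpose
cor(mtcars[, c("wt", "hp", "disp")])
```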

 

4. Heteroskedasticity: Heteroskedasticity is the presence of non-constant variance in the error terms. Generally, non-constant variance arises in the presence of outliers or extreme leverage values; these points get too much weight and disproportionately influence the model's fit. When this phenomenon occurs, the confidence interval for out-of-sample prediction tends to be unrealistically wide or narrow.

How to check: Look at the residuals vs fitted values plot. If heteroskedasticity exists, the plot exhibits a funnel-shaped pattern (shown in the next section). You can also use the Breusch-Pagan / Cook-Weisberg test or White's general test to detect this phenomenon.
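
A minimal sketch of the Breusch-Pagan test, assuming the lmtest package (car::ncvTest() gives the closely related Cook-Weisberg score test):

```r
# Breusch-Pagan test for heteroskedasticity via bptest() from 'lmtest'
library(lmtest)

model <- lm(mpg ~ wt + hp, data = mtcars)
bptest(model)  # a small p-value points to non-constant error variance
```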

 

5. Normal Distribution of error terms: If the error terms are not normally distributed, confidence intervals may become too wide or too narrow. Once the confidence intervals become unstable, it becomes difficult to estimate coefficients based on least squares minimization. Non-normality also suggests that there are a few unusual data points which must be studied closely to build a better model.

How to check: You can look at the Q-Q plot (shown below). You can also run statistical tests of normality, such as the Kolmogorov-Smirnov test or the Shapiro-Wilk test.
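
Both tests are available in base R; here is a sketch applied to model residuals (the model itself is illustrative):

```r
# Normality checks on the residuals (base R 'stats' functions)
model <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(model)

shapiro.test(res)  # Shapiro-Wilk: a small p-value suggests non-normal errors

# Kolmogorov-Smirnov against a normal with estimated mean and sd; estimating
# the parameters from the same data makes this version only approximate
ks.test(res, "pnorm", mean(res), sd(res))
```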

 

Interpretation of Regression Plots

So far, we've learned about the important regression assumptions and the methods to detect when they get violated.

But that's not the end. You should also know the remedies for violated assumptions. In this section, I've explained the four regression plots along with the methods to overcome these violations.

 

1. Residual vs Fitted Values

[Figures: residuals vs fitted values plots, one showing a curved (non-linear) pattern and one showing a funnel-shaped (heteroskedastic) pattern]

This scatter plot shows the distribution of residuals (errors) versus fitted values (predicted values). It is one of the most important plots, and one everyone should learn to read. It reveals various useful insights, including outliers, which are labeled by their observation number to make them easy to detect.

There are two major things you should look for:

  1. If there exists any pattern (say, a parabolic shape) in this plot, consider it a sign of non-linearity in the data. It means that the model doesn't capture non-linear effects.
  2. If a funnel shape is evident in the plot, consider it a sign of non-constant variance, i.e. heteroskedasticity.

Solution: To overcome the issue of non-linearity, you can apply a non-linear transformation to the predictors, such as log(X), √X, or X². To overcome heteroskedasticity, a possible way is to transform the response variable, e.g. log(Y) or √Y. You can also use the weighted least squares method to tackle heteroskedasticity; both remedies are sketched below.
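
Here is a hedged sketch of both remedies. The data set and the weighting rule (inverse squared fitted values from a first-pass OLS fit) are illustrative choices, not prescriptions from the article:

```r
# (1) Transform the response to tame non-linearity / stabilize variance
model_log <- lm(log(mpg) ~ wt + hp, data = mtcars)

# (2) Weighted least squares, down-weighting high-variance observations;
#     the weights below are one common heuristic, not the only option
ols <- lm(mpg ~ wt + hp, data = mtcars)
w   <- 1 / fitted(ols)^2
model_wls <- lm(mpg ~ wt + hp, data = mtcars, weights = w)
```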

 

2. Normal Q-Q Plot

[Figure: normal Q-Q plot of standardized residuals]

This q-q (quantile-quantile) plot is a scatter plot which helps us validate the assumption of normally distributed errors. If the residuals come from a normal distribution, the plot shows a fairly straight line; deviations from the straight line signal departures from normality.
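
If you want to draw this plot outside of plot(model), base R provides qqnorm() and qqline(); the model below is again an illustrative sketch:

```r
# Normal Q-Q plot of the residuals drawn directly with base R graphics
model <- lm(mpg ~ wt + hp, data = mtcars)

qqnorm(residuals(model))           # sample quantiles vs theoretical normal quantiles
qqline(residuals(model), col = 2)  # reference line; points hugging it suggest normality
```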

If you are wondering what a 'quantile' is, here's a simple definition: think of quantiles as points in your data below which a certain proportion of the data falls. Quantiles are often expressed as percentiles. For example, when we say the value of the 50th percentile is 120, it means half of the data lies below 120.

Solution: If the errors are not normally distributed, a non-linear transformation of the variables (response or predictors) can bring improvement in the model.

 

3. Scale Location Plot

[Figure: scale-location plot of √|standardized residuals| versus fitted values]

This plot is used to check the assumption of equal variance (homoskedasticity). It shows how the residuals are spread along the range of fitted values. It's similar to the residuals vs fitted values plot, except that it uses the square root of the standardized residuals. Ideally, there should be no discernible pattern in the plot; the points should form a roughly constant horizontal band, implying constant error variance. If the plot shows a discernible pattern (probably a funnel shape), it implies non-constant error variance, i.e. heteroskedasticity.

Solution: Follow the solution for heteroskedasticity given in plot 1.

 

4. Residuals vs Leverage Plot

[Figure: residuals vs leverage plot with Cook's distance contours]

It is also known as the Cook's distance plot. Cook's distance identifies the points which have more influence than the others. Such influential points tend to have a sizable impact on the regression line: adding or removing them can completely change the model statistics.

But can these influential observations be treated as outliers? This question can only be answered by looking at the data. Therefore, the points marked with large Cook's distance values in this plot might require further investigation.

Solution: If the influential observations are genuine outliers and there aren't many of them, you can remove those rows. Alternatively, you can cap the outlying observations at the maximum reasonable value in the data, or treat them as missing values.
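
As a sketch of how to pull out the flagged points, Cook's distances are available via cooks.distance(); the 4/n cutoff below is a common rule of thumb, not a rule from the article:

```r
# Cook's distance for each observation, with a 4/n rule-of-thumb cutoff
model <- lm(mpg ~ wt + hp, data = mtcars)
cd <- cooks.distance(model)

influential <- which(cd > 4 / nrow(mtcars))
mtcars[influential, ]  # inspect these rows before deciding to drop or cap them
```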

 

Case Study: How I improved my regression model using log transformation

 

End Notes

You can leverage the true power of regression analysis by applying the solutions described above. Implementing these fixes in R is fairly easy. If you want to know about any specific fix in R, drop a comment and I'd be happy to help.

My motive in this article was to help you gain the underlying knowledge and insight into regression assumptions and plots. This way, you have more control over your analysis and can modify it as per your requirements.

Did you find this article useful? Have you used these fixes to improve your model's performance? Share your experience / suggestions in the comments.

You can test your skills and knowledge. Check out Live Competitions and compete with the best data scientists from all over the world.

 

Tags: advanced regression, heteroskedasticity, homoskedasticity, interpretation of regression plots, multiple regression, ordinary least squares, regression, residual plots, residual sum of squares, residuals vs leverage plot, scale-location plot, total sum of squares