Simple Linear Regression Explained: Principles and Practical Application

Article link: https://blog.youkuaiyun.com/Sarah_07/article/details/125310429

Linear Regression: Simple Linear Regression

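Linear regression is a highly interpretable model. It can tell us which factor has the greatest effect on the predicted variable, and it can estimate the predicted value for different combinations of the independent variables. On the business side, an operations colleague may want to know how more traffic or a price change affects sales; if there is enough data to train a model, a linear model can describe that relationship concretely.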

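Linear models can be broadly divided into simple linear regression, multiple linear regression, and so on. If the data violate some of the basic assumptions of linear regression, ridge regression and lasso regression are also options. I plan to post a series covering these topics one by one. This post focuses on simple linear regression.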

Basic assumptions of linear regression

$y = \beta_0 + \beta_1 x_1 + \epsilon$

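As the name suggests, linear regression requires a linear relationship between the dependent variable $y$ and the independent variables $x_i$.
The error term $\epsilon$ (estimated by the model residuals) follows a normal distribution. The next post covers how to check for normality.
The variance of $\epsilon$ is constant (homoscedasticity), that is, the spread of the residuals does not change as x grows.
The observations ( $x_i$ , $y_i$ ) are independent of one another. Time series are the most common violation of this assumption; the next post covers how to check for autocorrelation.
There is no collinearity among the independent variables. The next post covers how to detect and handle collinearity.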
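To make the model and its assumptions concrete, here is a minimal sketch that simulates data satisfying them and then fits a line. The "true" coefficients (2.0 and 0.5), the noise scale, and the sample size are made-up values for illustration only, and `np.polyfit` is used simply as a convenient least-squares fitter; it is not the method used in the rest of this post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" parameters, chosen only for illustration
beta_0, beta_1 = 2.0, 0.5
n = 5000

x = rng.uniform(0, 10, size=n)        # independently drawn observations
eps = rng.normal(0.0, 1.0, size=n)    # normal errors with constant variance
y = beta_0 + beta_1 * x + eps         # linear relationship plus noise

# np.polyfit with deg=1 returns [slope, intercept] of the least-squares line
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

print("estimated intercept:", intercept)
print("estimated slope:", slope)
print("residual mean:", residuals.mean())
```

Because the simulated data meet the assumptions, the estimates should land close to the chosen parameters, and the residuals of a fit with an intercept average out to roughly zero.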

sklearn vs. statsmodels: linear regression libraries

import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import linear_model as lm
import statsmodels.api as sm
from sklearn.datasets import fetch_california_housing

data_set = fetch_california_housing(as_frame=True)

data = data_set.frame

data.describe()

	MedInc	HouseAge	AveRooms	AveBedrms	Population	AveOccup	Latitude	Longitude	MedHouseVal
count	20640.000000	20640.000000	20640.000000	20640.000000	20640.000000	20640.000000	20640.000000	20640.000000	20640.000000
mean	3.870671	28.639486	5.429000	1.096675	1425.476744	3.070655	35.631861	-119.569704	2.068558
std	1.899822	12.585558	2.474173	0.473911	1132.462122	10.386050	2.135952	2.003532	1.153956
min	0.499900	1.000000	0.846154	0.333333	3.000000	0.692308	32.540000	-124.350000	0.149990
25%	2.563400	18.000000	4.440716	1.006079	787.000000	2.429741	33.930000	-121.800000	1.196000
50%	3.534800	29.000000	5.229129	1.048780	1166.000000	2.818116	34.260000	-118.490000	1.797000
75%	4.743250	37.000000	6.052381	1.099526	1725.000000	3.282261	37.710000	-118.010000	2.647250
max	15.000100	52.000000	141.909091	34.066667	35682.000000	1243.333333	41.950000	-114.310000	5.000010

Here we use HouseAge as the independent variable to predict MedHouseVal.

Using the sklearn library

x = data[['HouseAge']]
y= data['MedHouseVal']

# fit the linear model
model = lm.LinearRegression()
results = model.fit(x,y)
# print the intercept and the coefficient
print('Intercept: {0}  Coefficient: {1}'.format(model.intercept_, model.coef_))

Intercept: 1.7911991658938475  Coefficient: [0.0096845]
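For a single predictor, the least-squares line has a closed form: the slope is the sum of cross-deviations of x and y divided by the sum of squared deviations of x, and the intercept follows from the fact that the line passes through the two means. The sketch below illustrates that formula; the helper name `simple_ols` and the toy data are hypothetical, and this is not how sklearn implements `LinearRegression` internally.

```python
import numpy as np

def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # slope = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # the fitted line always passes through (x_mean, y_mean)
    intercept = y_mean - slope * x_mean
    return intercept, slope

# Toy data (made up): points lying exactly on y = 1 + 2x
b0, b1 = simple_ols([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)
```

Calling `simple_ols(x['HouseAge'], y)` with the `x` and `y` defined earlier should agree with the sklearn intercept and coefficient up to floating-point error.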

from sklearn.metrics import r2_score

y_pred = results.predict(x)
print("R2: {}".format(r2_score(y, y_pred)))

R2: 0.011156305266710742
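R2 compares the model's squared error with the squared error of always predicting the mean of y: R2 = 1 - SS_res / SS_tot. The following is a small sketch of that definition; the function name `r2_manual` and the toy values are made up for illustration.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
print(r2_manual(y_true, y_true))       # perfect predictions
print(r2_manual(y_true, [6.0] * 4))    # always predicting the mean of y_true
```

Perfect predictions give 1 and predicting the mean gives 0, which is why a value near 0.011 means the model is barely better than a constant.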

Using statsmodels

X_ = sm.add_constant(x)
model = sm.OLS(y,X_).fit()
model.summary()

OLS Regression Results
Dep. Variable:	MedHouseVal	R-squared:	0.011
Model:	OLS	Adj. R-squared:	0.011
Method:	Least Squares	F-statistic:	232.8
Date:	Sun, 12 Jun 2022	Prob (F-statistic):	2.76e-52
Time:	14:36:59	Log-Likelihood:	-32126.
No. Observations:	20640	AIC:	6.426e+04
Df Residuals:	20638	BIC:	6.427e+04
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
const	1.7912	0.020	90.218	0.000	1.752	1.830
HouseAge	0.0097	0.001	15.259	0.000	0.008	0.011

Omnibus:	2269.585	Durbin-Watson:	0.325
Prob(Omnibus):	0.000	Jarque-Bera (JB):	3093.615
Skew:	0.938	Prob(JB):	0.00
Kurtosis:	3.281	Cond. No.	77.8

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The OLS summary contains several important metrics, explained below.

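R-squared: R2, which measures how much of the variation in the dependent variable is explained by the independent variable. Here it is 0.011, meaning HouseAge explains only about 1% of the variation in MedHouseVal, so this is a very poor model. The range is 0 to 1.
Prob (F-statistic): The null hypothesis of the F test is that the model predicts no better than a constant alone. Here Prob < 0.05, so we reject the null hypothesis: the current model does better than predicting a constant.
coef: The variable's coefficient. The coefficient of HouseAge is 0.0097, meaning that a 1-unit increase in HouseAge is associated with an increase of 0.0097 in MedHouseVal (about $970, since this dataset expresses MedHouseVal in units of $100,000).
P>|t|: Tests whether the variable is genuinely related to the dependent variable. The null hypothesis is that the coefficient is 0; a value below 0.05 rejects the null hypothesis, meaning the coefficient is statistically significant.
Prob(Omnibus) and Prob(JB): Both test whether the residuals follow a normal distribution. Here both are effectively 0, so normality of the residuals is rejected.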
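The R-squared, t, and F-statistic entries are tied together: in simple linear regression the F-statistic equals the square of the slope's t value (in the summary above, 15.259 squared is about 232.8), and it can also be written as R2 / (1 - R2) * (n - 2). The sketch below checks these relations with plain numpy on simulated data; the data-generating values are invented for illustration, and p-values are left out to keep the example free of extra dependencies.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
x = rng.uniform(0, 50, size=n)                     # made-up predictor
y = 1.8 + 0.01 * x + rng.normal(0.0, 1.0, size=n)  # made-up weak linear signal

# Least-squares estimates of slope and intercept
sxx = np.sum((x - x.mean()) ** 2)
slope = np.sum((x - x.mean()) * (y - y.mean())) / sxx
intercept = y.mean() - slope * x.mean()

resid = y - (intercept + slope * x)
ss_res = np.sum(resid ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)

r2 = 1.0 - ss_res / ss_tot
sigma2 = ss_res / (n - 2)             # residual variance, n - 2 degrees of freedom
se_slope = np.sqrt(sigma2 / sxx)      # corresponds to "std err" of the slope
t_slope = slope / se_slope            # corresponds to the "t" column
f_stat = (ss_tot - ss_res) / sigma2   # F-statistic with 1 model degree of freedom

print("R-squared:", r2)
print("t (slope):", t_slope)
print("F-statistic:", f_stat, "t^2:", t_slope ** 2)
```

This is also why, with a single predictor, Prob (F-statistic) and the slope's P>|t| lead to the same conclusion.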