1、Recall: how do we decide that a problem is regression?
- House price prediction
- Sales forecasting
- Loan amount estimation
If the target is a continuous value, it is a regression problem.
If the target is a discrete value, it is a classification problem.
2、Linear regression
Linear regression: finding a trend that can be used for prediction.
2.1 The linear relationship model
Definition
The independent variables are the feature values; the dependent variable is the target value.
Matrices are the computational foundation of most algorithms, so next we look at the difference between matrices and arrays.
2.2 Matrices and arrays
An array can have any number of dimensions: 0-D is a single number, then 1-D, 2-D, and so on, whereas a matrix is always two-dimensional. So what distinguishes a matrix from a 2-D array?
The main difference lies in the mathematical operations. Arrays support element-wise addition and multiplication; matrices also have multiplication, but matrix multiplication follows a different rule from element-wise 2-D array multiplication. Because matrix multiplication exactly matches the computation that linear models need, matrices have not been replaced by arrays.
Matrix multiplication:
(m rows, l columns) * (l rows, n columns) = (m rows, n columns)
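The shape rule above can be checked directly in NumPy (the matrices below are toy values chosen for illustration): `@` performs matrix multiplication, while `*` on arrays is element-wise.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3): m=2, l=3
b = np.array([[1, 0],
              [0, 1],
              [1, 1]])           # shape (3, 2): l=3, n=2

# Matrix multiplication: (m, l) @ (l, n) -> (m, n)
product = a @ b                  # same as np.matmul(a, b)
print(product.shape)             # (2, 2)
print(product)                   # [[ 4  5]
                                 #  [10 11]]

# Element-wise array multiplication requires matching shapes
c = np.array([[1, 1, 1],
              [2, 2, 2]])        # shape (2, 3), same as a
print((a * c).shape)             # (2, 3) -- each entry multiplied independently
```

Note that `a * b` would raise an error here, since element-wise multiplication needs compatible shapes, while `a @ b` is exactly the (m, l) x (l, n) rule.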
3、Evaluation strategy for linear regression
Predictions always carry some error relative to the true values.
Loss function
How do we find the W in the model that makes the loss smallest? (The goal is the W value corresponding to the minimum loss.)
This requires an optimization method; for linear regression there are two, described below.
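As a minimal sketch of the loss being minimized (toy data; `squared_loss` is an illustrative helper, not an sklearn function), the least squares loss measures the total squared error of a candidate weight vector w:

```python
import numpy as np

def squared_loss(X, y, w):
    """Sum of squared errors of the linear model y_hat = Xw."""
    predictions = X @ w
    errors = predictions - y
    return float(np.sum(errors ** 2))

# Toy data: first column of X acts as the bias term, and y = 1 + 1*x exactly
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])

print(squared_loss(X, y, np.array([1.0, 1.0])))  # perfect fit -> 0.0
print(squared_loss(X, y, np.array([0.0, 1.0])))  # worse w -> larger loss (3.0)
```

Both optimization methods below search for the w that drives this quantity to its minimum.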
4、Optimization methods for least squares
4.1 Least squares via the normal equation (not required)
The normal equation solves for the W that minimizes the loss directly, in a single step.
4.2 Least squares via gradient descent (understand the process)
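The closed-form solution is w = (XᵀX)⁻¹Xᵀy. A minimal sketch on toy data (assuming XᵀX is invertible, which holds here):

```python
import numpy as np

# Toy data generated exactly by y = 1 + 2x (first column of X is the bias term)
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Normal equation: w = (X^T X)^{-1} X^T y -- the minimum in one step, no iteration
w = np.linalg.inv(X.T @ X) @ X.T @ y
print(w)  # [1. 2.], recovering intercept 1 and slope 2
```

Inverting XᵀX costs roughly O(n³) in the number of features, which is why this approach does not scale to very large problems.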
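Instead of a one-step formula, gradient descent starts from an initial w and repeatedly steps opposite the gradient of the loss. A minimal sketch (the same toy data as above; `learning_rate` and `n_iters` are illustrative choices, not tuned values):

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.1, n_iters=1000):
    """Minimize the mean squared error of y_hat = Xw by gradient steps."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(n_iters):
        # Gradient of (1/n) * sum((Xw - y)^2) with respect to w
        gradient = (2.0 / n) * X.T @ (X @ w - y)
        w -= learning_rate * gradient   # step downhill
    return w

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

print(gradient_descent(X, y))  # converges toward [1. 2.], the normal-equation answer
```

The iterates approach the same minimum the normal equation computes directly, which is the sense in which the two methods are interchangeable on small problems.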
5、sklearn linear regression APIs: normal equation and gradient descent
• sklearn.linear_model.LinearRegression
• Normal equation
• sklearn.linear_model.SGDRegressor
• Gradient descent
Example: the Boston house-price dataset
1、Fetch the Boston house-price data
2、Split the data into training and test sets
3、Standardize the training and test data
4、Predict prices with the basic linear regression model LinearRegression and the gradient-descent estimator SGDRegressor
Note that not only the training set but also the test set must be standardized, because some feature values can be very large.
Code:
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; this example targets older versions
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
# Predict house prices with linear regression
# Load the data
lb = load_boston()
# Split the dataset into training and test sets
x_train, x_test, y_train, y_test = train_test_split(lb.data, lb.target, test_size=0.25)
print(y_test.shape)  # 1-D
# Standardize (both the features and the target need standardization)
std_x = StandardScaler()
x_train = std_x.fit_transform(x_train)
x_test = std_x.transform(x_test)
# Target values
std_y = StandardScaler()
y_train = std_y.fit_transform(y_train.reshape(-1, 1))  # sklearn 0.19+ requires the array passed in to be 2-D
y_test = std_y.transform(y_test.reshape(-1, 1))
# Estimator prediction
# Prediction via the normal-equation solver
lr = LinearRegression()
lr.fit(x_train, y_train)
print(lr.coef_)
# Predict the test-set house prices
y_predict = lr.predict(x_test)
y_predict = std_y.inverse_transform(y_predict)  # undo the standardization
print('House prices:', y_predict)
print("Normal equation mean squared error:", mean_squared_error(std_y.inverse_transform(y_test), y_predict))
# Gradient-descent house-price prediction
sgd = SGDRegressor()
sgd.fit(x_train, y_train.ravel())  # SGDRegressor expects a 1-D target; ravel() avoids a shape warning
print(sgd.coef_)
# Predict the test-set house prices (reshape to 2-D for inverse_transform, then back to 1-D)
y_sgd_predict = std_y.inverse_transform(sgd.predict(x_test).reshape(-1, 1)).ravel()
print("Gradient-descent predicted price for each test-set house:", y_sgd_predict)
print("Gradient-descent mean squared error:", mean_squared_error(std_y.inverse_transform(y_test), y_sgd_predict))
Results:
(127,)
[[-0.0667716 0.11883464 -0.02109212 0.07422857 -0.20158071 0.28425706
-0.00491429 -0.35639096 0.31728699 -0.25591251 -0.19254476 0.09473498
-0.41342329]]
House prices: [[30.14398631]
[13.5315351 ]
[28.75274122]
[ 6.47823311]
[25.22448859]
[16.69318416]
[34.63247332]
[12.99957964]
[22.80302117]
[22.68245295]
[27.06273744]
[35.6204826 ]
[20.58502529]
[33.0475308 ]
[37.61394493]
[22.65278122]
[36.40168793]
[26.49284053]
[15.46708366]
[20.04963044]
[13.50323398]
[34.22907919]
[34.18505808]
[17.16928941]
[38.963182 ]
[25.32185703]
[19.76647968]
[35.32716516]
[15.94027402]
[27.8445039 ]
[11.31468901]
[28.6103759 ]
[28.74968652]
[13.33156609]
[25.25298946]
[34.53403083]
[16.66159263]
[27.79366246]
[34.51287958]
[22.22604897]
[14.82979685]
[26.2670969 ]
[16.55320494]
[13.09527707]
[29.55079535]
[14.10206423]
[27.6356477 ]
[ 0.67429689]
[13.28476962]
[32.80294259]
[17.93712138]
[32.37003663]
[12.97591161]
[ 8.3285626 ]
[19.13903778]
[22.18583957]
[23.87385875]
[18.7049566 ]
[27.4552403 ]
[40.25601439]
[23.60828539]
[ 8.1280217 ]
[19.90279593]
[21.10708476]
[15.8391661 ]
[20.02561438]
[ 4.72956896]
[27.96090097]
[20.68000445]
[36.30793577]
[36.14883611]
[17.40836088]
[35.26823418]
[15.85738584]
[36.74101594]
[20.24811305]
[24.15330852]
[ 9.73222269]
[11.28867539]
[17.42247594]
[16.7619766 ]
[19.72906581]
[24.77562727]
[26.58304416]
[23.47468482]
[28.55675215]
[30.08477362]
[17.94761363]
[23.49879431]
[13.56217905]
[25.26088436]
[ 5.80932927]
[31.71544952]
[37.02513583]
[31.05819834]
[19.61916179]
[30.72569769]
[22.75017396]
[17.36868276]
[30.39245544]
[14.80621262]
[27.2886414 ]
[20.4667433 ]
[16.05506076]
[14.3459953 ]
[ 8.47020605]
[20.97343885]
[22.75958097]
[30.21503376]
[32.14397478]
[ 9.04632779]
[20.07562617]
[35.74455623]
[19.39460212]
[16.52070398]
[18.61608306]
[19.67124238]
[26.8150728 ]
[28.84705604]
[22.63397684]
[27.61075055]
[18.82858683]
[22.8117643 ]
[24.77976585]
[18.72989181]
[36.14169274]
[ 2.08811194]]
Normal equation mean squared error: 18.494293280746486
[-0.05171736 0.08327086 -0.06848233 0.07899582 -0.13540074 0.32465736
-0.02281551 -0.29568952 0.15028145 -0.10610387 -0.17798597 0.10618241
-0.39363663]
Gradient-descent predicted price for each test-set house: [30.97151296 13.8110728 29.67379143 5.80711839 25.47941344 16.23687045
34.62748658 13.72137414 22.750925 23.09289502 26.55518314 34.75819779
20.52114746 33.07120958 36.97718922 22.79044317 36.52304098 27.22279821
16.0073122 20.35850313 12.34465343 34.82116802 33.47478087 16.80533008
38.56765819 25.14720759 20.47131611 34.89093427 16.5204105 28.38613706
12.4034583 27.79614967 28.90044242 13.79979714 25.42080544 34.05477845
16.82993534 27.95765127 33.99637458 24.44146197 15.1453002 25.96642402
15.48259095 12.49734497 28.99733915 15.20185738 27.2233481 -0.3840882
13.03994019 32.62718077 18.30398275 32.0064074 12.49832104 8.27672673
19.14817504 22.3858615 23.86616916 18.60179835 27.57253198 40.72899338
23.20252769 7.71587044 20.27238268 21.35085386 15.01394647 20.03864867
4.70018911 27.99106758 20.5634348 35.27606097 37.05940058 17.14091126
34.99795532 15.56052266 37.14358311 20.5530678 24.96006388 10.23338303
8.93194851 16.83009384 16.36896881 19.72699333 25.01136789 26.94641317
23.49325946 27.90417482 31.17771171 17.57192733 22.69888521 13.09463112
25.26639813 5.46489054 32.34103856 37.840646 31.49259841 20.09087332
31.20902053 22.72782029 17.32022944 30.13478019 15.33334573 27.10737667
20.93594052 16.6174145 15.24942697 6.99625964 21.62010449 22.56316132
31.46832526 32.34682217 9.91856666 20.20672054 36.30589409 19.11700197
17.8963358 17.83579369 19.99000555 25.48978079 28.23585845 23.131733
27.84071959 19.05329709 22.93278467 24.31837181 18.58461894 35.95343484
4.0250773 ]
Gradient-descent mean squared error: 18.46011780177345
Since sklearn 0.19, transformers (and estimator inputs) must be 2-D arrays. Pay particular attention when standardizing the target: target values are usually 1-D, so they must be reshaped to a single column first.
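That reshaping requirement can be seen in isolation (toy target values, not the Boston data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

y = np.array([10.0, 20.0, 30.0])
print(y.shape)            # (3,) -- 1-D; fit_transform would raise a ValueError

y_2d = y.reshape(-1, 1)   # -1 lets NumPy infer the number of rows
print(y_2d.shape)         # (3, 1) -- one sample per row, accepted by the scaler

scaler = StandardScaler()
y_scaled = scaler.fit_transform(y_2d)
print(scaler.inverse_transform(y_scaled).ravel())  # back to the original values
```

The same `reshape(-1, 1)` pattern appears in the Boston example above wherever `y_train` or `y_test` is passed to `std_y`.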
6、Regression performance evaluation
sklearn regression evaluation API
• sklearn.metrics.mean_squared_error
See the code in the previous section for a concrete example.
Comparison of the normal equation and gradient descent
1、Evaluating LinearRegression vs. SGDRegressor
2、Characteristics: the linear regressor is the simplest and easiest-to-use regression model. That simplicity limits its applicability to some extent; even so, when the relationships among features are unknown, a linear regressor remains the first choice for most systems.
Small datasets: LinearRegression (offers no remedy for overfitting) and others
Large datasets: SGDRegressor