Supervised Learning 7 - Regression Models - Linear Regression


License: CC 4.0 BY-SA

Tags:

#regression-models #linear-regression #supervised-learning

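This article discusses linear regression for regression problems, explains that the model's objective is to minimize the difference between predicted and true values, and illustrates the programming practice with house-price prediction for the Boston area. For performance evaluation, it introduces MAE, MSE, and R-squared as metrics of the model's predictive ability.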

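Regression problems differ from classification problems in that the prediction target is a continuous variable, such as a price or an amount of rainfall.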

Model Introduction

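To map results from the real domain into the interval (0, 1), the linear classifier introduced the logistic function. In linear regression, the prediction target is itself a real number, so the optimization objective is simpler: minimize the difference between the predicted values and the true values.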
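The contrast between the two models can be sketched in a few lines. This is only an illustration: the weights `w`, intercept `b`, and sample `x` below are hypothetical values, not taken from the article. Both models compute the same linear score; the classifier passes it through the logistic function, while the regressor uses it directly as its prediction.

```python
import math


def logistic(z):
    """Map any real number z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def linear_output(w, x, b):
    """Compute the raw linear score w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# Hypothetical weights, intercept, and sample, chosen only for illustration.
w = [0.5, -1.0]
b = 2.0
x = [4.0, 1.0]

score = linear_output(w, x, b)   # 0.5*4.0 - 1.0*1.0 + 2.0
probability = logistic(score)    # linear classifier: squashed into (0, 1)
prediction = score               # linear regressor: the score is the prediction

print("linear score:", score)
print("logistic output:", probability)
```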

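Given a set of $m$ training feature vectors $X=\langle x^1, x^2, \cdots, x^m \rangle$ and their corresponding regression targets $y=\langle y^1, y^2, \cdots, y^m \rangle$, we want the linear regression model to minimize the least-squares loss $L(w, b)$ of its predictions, where $f(w, x, b) = w^{T}x + b$. The usual optimization objective of a linear regressor is therefore given by Equation (13).
$\underset{w, b}{\operatorname{argmin}} L(w, b)=\underset{w, b}{\operatorname{argmin}} \sum_{k=1}^{m}\left(f(w, x^{k}, b)-y^{k}\right)^{2} \qquad (13)$
As before, the parameters that determine the model, namely the coefficients $w$ and the intercept $b$, can be learned either with an exact analytical algorithm or with a fast approximate method, stochastic gradient descent (Stochastic Gradient Descent, SGD).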
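Both ways of learning $w$ and $b$ can be sketched with NumPy. This is a minimal sketch under stated assumptions, not the article's implementation: the data are synthetic, and `true_w`, `true_b`, the learning rate, and the epoch count are hypothetical values picked for illustration. The analytical route solves the least-squares problem directly with `np.linalg.lstsq`; the SGD route updates the parameters one sample at a time using the gradient of the squared error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 3*x1 - 2*x2 + 5 + noise.
m = 200
X = rng.normal(size=(m, 2))
true_w = np.array([3.0, -2.0])
true_b = 5.0
y = X @ true_w + true_b + rng.normal(scale=0.1, size=m)

# Analytical solution: append a column of ones so that the intercept b
# is estimated as one extra coefficient, then solve the least-squares problem.
X_aug = np.hstack([X, np.ones((m, 1))])
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_exact, b_exact = theta[:2], theta[2]

# Stochastic gradient descent: visit the samples in random order and step
# against the gradient of the single-sample squared error (f(w, x, b) - y)^2.
w_sgd = np.zeros(2)
b_sgd = 0.0
learning_rate = 0.01  # hypothetical step size
for epoch in range(50):
    for k in rng.permutation(m):
        error = X[k] @ w_sgd + b_sgd - y[k]
        w_sgd -= learning_rate * 2.0 * error * X[k]
        b_sgd -= learning_rate * 2.0 * error

print("analytical:", w_exact, b_exact)
print("SGD:       ", w_sgd, b_sgd)
```

The analytical solution is exact for the given training data but requires solving a linear system over all samples at once; SGD only approximates it, yet each update touches a single sample, which is why it scales better to large datasets.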

Programming Practice

House-Price Prediction for the Boston Area, USA

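# Import the Boston house-price data loader from sklearn.datasets.
# Note: load_boston was removed in scikit-learn 1.2, so this needs an older release.
from sklearn.datasets import load_boston
# Load the house-price data and store it in the variable boston.
boston = load_boston()
# Print the dataset description.
print(boston.DESCR)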

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

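# Import the data splitter from sklearn.model_selection
# (the older sklearn.cross_validation module was removed in scikit-learn 0.20).
from sklearn.model_selection import train_test_split

# Import numpy under the alias np.
import numpy as np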

X = boston.data
y = boston.target

# Randomly sample 25% of the data as the test set; the rest is the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
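The split-fit-evaluate workflow, together with the MAE, MSE, and R-squared metrics mentioned in the summary, can be sketched with NumPy alone. This is a sketch under stated assumptions rather than the article's code: the data are a synthetic stand-in with the same shape as the Boston data (506 samples, 13 features), `coef` and the intercept 22.5 are hypothetical, and the split and metrics are written by hand. In scikit-learn, the corresponding helpers are `train_test_split`, `LinearRegression`, `mean_absolute_error`, `mean_squared_error`, and `r2_score`.

```python
import numpy as np

rng = np.random.default_rng(33)

# Synthetic stand-in for (X, y): 506 samples with 13 features.
# The coefficients and intercept are hypothetical, chosen for illustration.
m, n = 506, 13
X = rng.normal(size=(m, n))
coef = np.linspace(-2.0, 2.0, n)
y = X @ coef + 22.5 + rng.normal(scale=1.0, size=m)

# Randomly hold out 25% of the samples for testing; train on the rest.
indices = rng.permutation(m)
n_test = int(np.ceil(0.25 * m))
test_idx, train_idx = indices[:n_test], indices[n_test:]
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Fit a linear regressor by least squares (a ones column models the intercept).
A_train = np.hstack([X_train, np.ones((len(X_train), 1))])
theta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
A_test = np.hstack([X_test, np.ones((len(X_test), 1))])
y_pred = A_test @ theta

# Evaluation metrics on the held-out test set.
residual = y_test - y_pred
mae = np.mean(np.abs(residual))                # mean absolute error
mse = np.mean(residual ** 2)                   # mean squared error
r2 = 1.0 - np.sum(residual ** 2) / np.sum((y_test - y_test.mean()) ** 2)

print("MAE:", mae)
print("MSE:", mse)
print("R-squared:", r2)
```

MAE and MSE are measured in the units of the target (and its square), so they are hard to compare across datasets; R-squared instead reports the fraction of the test-set variance that the model explains, with 1.0 meaning a perfect fit.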
