(3) Introduction to Machine Learning and Classic Algorithms: Polynomial Regression

Latest recommended article published on 2025-02-28 11:08:58

青峰不长存

Category: Introduction to Machine Learning and Classic Algorithms. Tags: machine learning, python

Article link: https://blog.youkuaiyun.com/qq_44644355/article/details/106816280

1. Polynomial Regression

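(1) What is polynomial regression?
A: A polynomial equation of degree k is an equation whose highest-order term has power k. The regression equations covered earlier in this series all have degree 1. Equations with degree greater than 1 are called polynomial equations, and fitting such an equation to data is polynomial regression.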
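(2) Linear regression assumes the relationship in the data is linear, but much real-world data is not. The graph of a linear regression is a straight line, while the graph of a polynomial regression is a curve. When the data is not linear and we predict with linear regression anyway, the fit is poor; a polynomial model usually does better.
(3) In the figure, the data points are clearly not linear. The red line is the linear regression prediction and the green curve is the polynomial regression prediction.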
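(4) So how do we actually implement polynomial regression?
A: It is fairly simple: add extra features to the data x. Suppose the original data x has one feature x1 and the highest degree of the polynomial is 2. Then we add the feature x1^2 (in general, for a highest degree of k, we add every power of x up to k), so the data now has 2 features (x1, x1^2). We then feed this data into an ordinary linear regression and train as usual.
In the example below, the data starts with a single feature.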

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression


x = np.random.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x ** 2 + x + 2 + np.random.normal(0, 1, size=100)

lin_reg = LinearRegression()
lin_reg.fit(X, y)

y_predict = lin_reg.predict(X)
plt.scatter(x, y)
plt.plot(x, y_predict, color='r')


# The idea behind polynomial regression:
# add x^2 as an extra feature
x2 = np.hstack([X, X ** 2])
print(x2)
lin_reg2 = LinearRegression()
lin_reg2.fit(x2, y)
y_predict2 = lin_reg2.predict(x2)

plt.scatter(x, y)
# Sort x in ascending order before plotting; without sorting, the line jumps back and forth
plt.plot(np.sort(x), y_predict2[np.argsort(x)], color='g')
plt.show()
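
As a quick sanity check on the manual approach above, the fitted coefficients can be compared with the ones used to generate the data (1 for x, 0.5 for x^2, intercept 2). This is a minimal sketch rather than part of the original example. The seeded generator `np.random.default_rng(0)` and the larger sample size of 1000 are assumptions made here so that the result is reproducible and the estimates are tight.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumption: seeded generator and 1000 samples, chosen only for reproducibility
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=1000)
X = x.reshape(-1, 1)
y = 0.5 * x ** 2 + x + 2 + rng.normal(0, 1, size=1000)

# Stack x and x^2 side by side; the model stays linear in its coefficients
X_poly = np.hstack([X, X ** 2])
model = LinearRegression().fit(X_poly, y)

coef_x, coef_x2 = model.coef_
intercept = model.intercept_
print("coefficient of x:  ", coef_x)
print("coefficient of x^2:", coef_x2)
print("intercept:         ", intercept)
```

The printed values should land close to 1, 0.5 and 2. This shows that polynomial regression is still ordinary least squares, just applied to an expanded feature matrix.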

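(5) The code above adds the feature by hand. scikit-learn can do this for us more conveniently: poly = PolynomialFeatures(degree=3) automatically generates the corresponding polynomial terms, as follows: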

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures,StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


x = np.random.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x ** 2 + x + 2 + np.random.normal(0, 1, size=100)

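# degree=3 means features up to the third power are added
"""If the original data had two features x1, x2
and degree=3,
poly.transform would generate
degree 0: 1,
degree 1: x1, x2,
degree 2: x1^2, x2^2, x1*x2,
degree 3: x1^3, x2^3, x1^2 * x2, x2^2 * x1"""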
poly = PolynomialFeatures(degree=3)
poly.fit(X)
x2 = poly.transform(X)
print(x2)

lin_reg = LinearRegression()
lin_reg.fit(x2, y)
y_predict = lin_reg.predict(x2)

plt.scatter(x, y)
# Sort x in ascending order before plotting; without sorting, the line jumps back and forth
plt.plot(np.sort(x), y_predict[np.argsort(x)], color='r')
plt.show()

2. Pipeline

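(1) What is a pipeline and what is it for?
A: You can think of a pipeline as a container. When the same operation has to be applied to several different objects in sequence, we can put those objects into a pipeline and run the operation on all of them with a single call, instead of calling each one separately. Each step's output is passed on as the input of the next step.
For example, PolynomialFeatures, StandardScaler and LinearRegression all need a fit call. If we put these three estimators into a pipeline, one call to fit is enough to fit all of them in order. The code is as follows: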

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures,StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

x3 = np.random.uniform(-3, 3, size=100)
X3 = x3.reshape(-1, 1)
y3 = 0.5 * x3 ** 2 + x3 + 2 + np.random.normal(0, 1, size=100)

poly_reg = Pipeline([
    ("poly", PolynomialFeatures(degree=2)),
    ("std_scaler", StandardScaler()),
    ("lin_reg", LinearRegression())
])
# With a pipeline, one call runs all three steps in poly_reg, so we do not have to call them one by one
poly_reg.fit(X3, y3)
y_predict = poly_reg.predict(X3)
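
To make the claim concrete, the sketch below fits the three steps by hand and through a Pipeline and checks that both give the same predictions. The seeded data generator is an assumption added for reproducibility; the step names mirror the ones used above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Assumption: seeded generator so that the run is reproducible
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x ** 2 + x + 2 + rng.normal(0, 1, size=100)

# Manual version: each step is fitted on the output of the previous step
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_poly)
lin_reg = LinearRegression().fit(X_scaled, y)
manual_pred = lin_reg.predict(X_scaled)

# Pipeline version: one fit call and one predict call
pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=2)),
    ("std_scaler", StandardScaler()),
    ("lin_reg", LinearRegression()),
]).fit(X, y)
pipe_pred = pipe.predict(X)

same = np.allclose(manual_pred, pipe_pred)
print("same predictions:", same)
# Individual steps remain accessible by name
print(pipe.named_steps["lin_reg"].coef_)
```

Besides saving calls, the pipeline guarantees that the exact transformations fitted on the training data are reused at prediction time.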

3. Overfitting

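(1) What is overfitting?
A: Overfitting means the model fits the training data very well but performs poorly on the test set. (It is usually caused by too little data or by too high a polynomial degree, that is, too large a value of degree in the code above.)
If we reuse the code above but change degree to 200, we get a curve like this: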

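The curve is very steep, and for some points the predicted y differs greatly from the true y at the same x. This is the result of overfitting. Overfitting tends to produce coefficients that are very large in magnitude, either strongly positive or strongly negative. To prevent this, we usually add a penalty term to the loss function. There are two main kinds: the L1 regularization term and the L2 regularization term.
Regression with an L1 penalty is called Lasso regression, and regression with an L2 penalty is called Ridge regression.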
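
The gap between training and test error can be measured directly. The sketch below is an illustration under stated assumptions: it uses degree 10 instead of 200 to keep the fit numerically stable, a seeded data generator, and a fixed `random_state` for the split. The helper name `polynomial_regression` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Assumption: seeded generator and fixed split for reproducibility
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x ** 2 + x + 2 + rng.normal(0, 1, size=100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)


def polynomial_regression(degree):
    """Hypothetical helper: polynomial features, scaling, then least squares."""
    return Pipeline([
        ("poly", PolynomialFeatures(degree=degree)),
        ("std_scaler", StandardScaler()),
        ("lin_reg", LinearRegression()),
    ])


train_mse, test_mse = {}, {}
for degree in (1, 2, 10):
    model = polynomial_regression(degree).fit(X_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, model.predict(X_train))
    test_mse[degree] = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_mse[degree]:.3f}  "
          f"test MSE={test_mse[degree]:.3f}")
```

Training error cannot increase as the degree grows, because each higher-degree model contains the lower-degree ones. A test error that stops improving or gets worse while training error keeps falling is the usual sign of overfitting.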

4. Validation and Cross-Validation

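(1) Why use a validation set or cross-validation?
A: If we split the data only into a training set and a test set, every hyperparameter choice ends up being judged against the same test set, and the selected model can easily overfit that particular split, especially when the dataset is small. To guard against this, we run cross-validation on the training data.
(2) What is cross-validation and how is it done?
A: Put simply, split the data into a training set and a test set, then split the training set into k parts. Suppose k=3 and the parts are A, B and C. There are three combinations: (validate on A, train on B and C), (validate on B, train on A and C), and (validate on C, train on A and B). This trains three models, and the mean of their three validation scores is used as the score for that hyperparameter setting.
The code is as follows: cross_val_score(knn_clf2, x_train, y_train, cv=10) runs the cross-validation, where cv=10 means the training set is split into 10 folds.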

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

"""
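1. Split the data into two parts: training data and test data.
Then split the training set into k parts; suppose k=3:
A, B, C
This gives three combinations: validate on A and train on B, C; and so on.
The mean of the k validation scores is then used to compare hyperparameters.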
"""
digits = datasets.load_digits()
x = digits.data
y = digits.target

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)

# Use cross-validation
from sklearn.model_selection import cross_val_score
knn_clf2 = KNeighborsClassifier()
best_score2, best_k2, best_p2 = 0, 0, 0
for k in range(2, 11):
    for p in range(1, 5):
        knn_clf2 = KNeighborsClassifier(weights="distance", n_neighbors=k, p=p)
        scores = cross_val_score(knn_clf2, x_train, y_train, cv=10)
        score2 = np.mean(scores)
        if score2 > best_score2:
            best_score2, best_p2, best_k2 = score2, p, k


print("best_k2=", best_k2)
print("best_p2=", best_p2)
print("best_score2=", best_score2)
# Cross-validation never touched the test data, so the test set gives an independent final check
best_knn_clf = KNeighborsClassifier(weights="distance", n_neighbors=best_k2, p=best_p2)
best_knn_clf.fit(x_train, y_train)
accuracy = best_knn_clf.score(x_test, y_test)
print("Accuracy:", accuracy)
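
To show what cross_val_score does internally, the sketch below builds the folds by hand and compares the per-fold scores with the library result. It assumes k=3 and a fixed n_neighbors=3, both chosen only for illustration. For a classifier and an integer cv, scikit-learn uses stratified folds without shuffling, which is why StratifiedKFold is used here.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X, y = digits.data, digits.target

# Manual k-fold: train on k-1 parts, score on the held-out part
folds = StratifiedKFold(n_splits=3)
manual_scores = []
for train_idx, val_idx in folds.split(X, y):
    fold_clf = KNeighborsClassifier(n_neighbors=3)
    fold_clf.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_clf.score(X[val_idx], y[val_idx]))

# Library version of the same procedure
library_scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=3)

print("manual :", np.round(manual_scores, 4))
print("library:", np.round(library_scores, 4))
print("mean validation accuracy:", np.mean(manual_scores))
```

Note that the result of cross-validation is a set of scores, not a combined model. Once the best hyperparameters are chosen, a single model is refitted on the full training set, as the code above does with best_knn_clf.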

5. Ridge Regression and Lasso Regression

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge, Lasso
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

np.random.seed(666)
x3 = np.random.uniform(-3, 3, size=100)
X3 = x3.reshape(-1, 1)
y3 = 0.5 * x3 ** 2 + x3 + 2 + np.random.normal(0, 1, size=100)


def RidgeRegression(degree, alpha):
    return Pipeline([
        ("poly", PolynomialFeatures(degree=degree)),
        ("std_scaler", StandardScaler()),
        ("ridge_reg", Ridge(alpha=alpha))
    ])


def LassoRegression(degree, alpha):
    return Pipeline([
        ("poly", PolynomialFeatures(degree=degree)),
        ("std_scaler", StandardScaler()),
        ("lasso_reg", Lasso(alpha=alpha))
    ])


x_train, x_test, y_train, y_test = train_test_split(X3, y3, test_size=0.4)

ridge_reg = RidgeRegression(degree=2, alpha=0.01)
ridge_reg.fit(x_train, y_train)
y_predict1 = ridge_reg.predict(x_test)
mean1 = mean_squared_error(y_test, y_predict1)
print("Ridge mean squared error:", mean1)

lasso_reg = LassoRegression(degree=5, alpha=0.01)
lasso_reg.fit(x_train, y_train)
y_predict2 = lasso_reg.predict(x_test)
mean2 = mean_squared_error(y_test, y_predict2)
print("Lasso mean squared error:", mean2)
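
The effect of the penalty strength alpha can be seen by inspecting the fitted coefficients. The sketch below is an illustration under stated assumptions: the alpha values, degree=5, the seeded generator, the `include_bias=False` setting and the helper name `make_model` are all choices made here, not taken from the code above.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Assumption: seeded generator for reproducibility
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x ** 2 + x + 2 + rng.normal(0, 1, size=100)


def make_model(regressor, degree=5):
    """Hypothetical helper: same pipeline shape as above, any final regressor."""
    return Pipeline([
        ("poly", PolynomialFeatures(degree=degree, include_bias=False)),
        ("std_scaler", StandardScaler()),
        ("reg", regressor),
    ])


# Ridge (L2): a larger alpha shrinks the coefficient vector toward zero
ridge_norms = {}
for alpha in (0.01, 1, 100):
    model = make_model(Ridge(alpha=alpha)).fit(X, y)
    ridge_norms[alpha] = np.linalg.norm(model.named_steps["reg"].coef_)
    print(f"Ridge alpha={alpha}: coefficient norm = {ridge_norms[alpha]:.3f}")

# Lasso (L1): a larger alpha sets more coefficients exactly to zero
lasso_nonzero = {}
for alpha in (0.01, 0.1, 10):
    model = make_model(Lasso(alpha=alpha, max_iter=100_000)).fit(X, y)
    lasso_nonzero[alpha] = int(np.count_nonzero(model.named_steps["reg"].coef_))
    print(f"Lasso alpha={alpha}: non-zero coefficients = {lasso_nonzero[alpha]}")
```

Ridge shrinks all coefficients smoothly, while Lasso can remove terms entirely, which makes it usable as a rough form of feature selection. In practice alpha is a hyperparameter and would be chosen with cross-validation as in section 4.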