多项式拟合(polyfit)及局部加权回归(Lowess)对二维数据基础规律和离群特征学习的分析对比

最新推荐文章于 2023-12-11 14:27:06 发布

置顶山高月小水落石出

最新推荐文章于 2023-12-11 14:27:06 发布

阅读量2.7k

点赞数 2

分类专栏： # python 机器学习文章标签：统计学多项式拟合局部加权回归二维数据拟合

本文链接：https://blog.youkuaiyun.com/qq_22828175/article/details/119041432

版权

python 同时被 2 个专栏收录

17 篇文章

订阅专栏

机器学习

12 篇文章

订阅专栏

概述：
1.当计算序列中离群区间的效应系数时，左右两侧增加的非离群区间应该尽量长些，离群程度越强，增加的非离群区间应越长，
多项式和Lowess才可能不被拉起。由于无法准确控制拉起的程度，则统一不拉起是更好的。
2.当需要学习序列的基本规律时，多项式可选尽量大的次数（在拟合自变量点上不会振荡），Lowess则是较大的frac和it。
解释：
对于多项式：
1.（特别重要）当多项式拟合次数过高时，新的应变量会在除了原自变量点之外的区间发生振荡，所以在应用拟合得到的该多项式时，新的自变量不能落在这些区间，
但若落在任意原自变量点上，则永远不会发生振荡。即虽然次数越高多项式越会过拟合，但若新自变量落在原自变量点上，则新应变量只会越来越趋近原应变量取值，
但永远不会振荡。
当拟合次数较低时，在所有区间和自变量点上都不易发生振荡，函数的跟随性较差，规律性较强。当新自变量落在原自变量点上时，新因变量也会偏离原因变量较远。
由此可知，当新应变量的取值不超过原自变量的集合时，拟合次数可适当取高，因为不用担心振荡；若拟合次数较低，新应变量会偏离原应变量更远，从而较不利于反映基础规律。
2.当拟合次数较高时，多项式容易在拟合时的自变量区间两端或一端及更外侧发生振荡，当取较中部区间的应变量时，是可以不发生振荡的。
3.当拟合点数过多，次数较高时，容易出现做最小二乘时奇异值分解不收敛的情况，从而无法拟合出多项式。
当拟合点数不多，拟合次数可以超过拟合点数，即未知数个数超过方程个数，虽然无法通过求解线性方程组得到未知数，但可通过梯度下降等最优化方法得到未知数，
即多项式的系数。此时容易发生振荡，中部非振荡区间也更窄。
对于局部加权回归：
1.Lowess方法返回的是拟合值，而不是函数，所以返回值与自变量x一一对应，而不能得到更密集或更稀疏或自变量区间外的返回值。
2.frac越接近1，拟合滑动窗口越接近全长，回归值越平滑，但frac不能大于1；it数越大，重新计算权重的最近点数目越多，受最近离群点影响越小；
因为此时已经比较平滑，所以所需it数较小。
3.frac越接近0，拟合滑动窗口中最近样本数越少，回归值跟随性越强，规律性越差；因为it数不宜大于滑动窗口中的样本数，所以此时it数也应较小。
4.当序列存在一段一段的离群值时，it值可取为最长离群段的样本数+1，这样无论用哪种Lowess方法，都不会受离群值的影响。
实践：

from math import ceil
import numpy as np
from scipy import linalg
import statsmodels.api as sm
import matplotlib.pyplot as plt
import copy
plt.style.use('seaborn-white')


# Defining the bell shaped kernel function - used for plotting later on
def kernel_function(xi, x0, tau=.005):
    return np.exp(- (xi - x0) ** 2 / (2 * tau))


def lowess_bell_shape_kern(x, y, tau=.005):
    """
    lowess_bell_shape_kern(x, y, tau = .005) -> yest
    Locally weighted regression: fits a nonparametric regression curve to a scatterplot.
    The arrays x and y contain an equal number of elements; each pair
    (x[i], y[i]) defines a data point in the scatterplot. The function returns
    the estimated (smooth) values of y.
    The kernel function is the bell shaped function with parameter tau. Larger tau will result in a
    smoother curve.
    """
    n = len(x)
    yest = np.zeros(n)

    # Initializing all weights from the bell shape kernel function
    w = np.array([np.exp(- (x - x[i]) ** 2 / (2 * tau)) for i in range(n)])

    # Looping through all x-points
    for i in range(n):
        weights = w[:, i]
        b = np.array([np.sum(weights * y), np.sum(weights * y * x)])
        A = np.array([[np.sum(weights), np.sum(weights * x)],
                      [np.sum(weights * x), np.sum(weights * x * x)]])
        theta = linalg.solve(A, b)
        yest[i] = theta[0] + theta[1] * x[i]

    return yest


def lowess_ag(x, y, frac=2/3, it=3):
    """
    lowess(x, y, frac=2./3., it=3) -> yest
    Lowess smoother: Robust locally weighted regression.
    The lowess function fits a nonparametric regression curve to a scatterplot.
    The arrays x and y contain an equal number of elements; each pair
    (x[i], y[i]) defines a data point in the scatterplot. The function returns
    the estimated (smooth) values of y.
    The smoothing span is given by frac. A larger value for frac will result in a
    smoother curve. The number of robustifying iterations is given by it. The
    function will run faster with a smaller number of iterations.
    """
    n = len(x)
    r = int(ceil(frac * n))
    h = [np.sort(np.abs(x - x[i]))[r] for i in range(n)]
    w = np.clip(np.abs((x[:, None] - x[None, :]) / h), 0.0, 1.0)
    w = (1 - w ** 3) ** 3
    yest = np.zeros(n)
    delta = np.ones(n)
    for iteration in range(it):
        for i in range(n):
            weights = delta * w[:, i]
            b = np.array([np.sum(weights * y), np.sum(weights * y * x)])
            A = np.array([[np.sum(weights), np.sum(weights * x)],
                          [np.sum(weights * x), np.sum(weights * x * x)]])
            beta = linalg.solve(A, b)
            yest[i] = beta[0] + beta[1] * x[i]

        residuals = y - yest
        s = np.median(np.abs(residuals))
        delta = np.clip(residuals / (6.0 * s), -1, 1)
        delta = (1 - delta ** 2) ** 2

    return yest


lowess_sm = sm.nonparametric.lowess
tau = 0.005  # abs(tau)越大，lowess_bell_shape_kern的回归值越平滑；tau→0+，回归值跟随性越强，规律性越差。
samples_in_window = 24  # samples_in_window越接近样本全体数量，回归值越平滑；越接近1，回归值跟随性越强，规律性越差。
frac = 0.9
n = 31  # 多项式最高次数，未知数（待拟合变量）有n+1个

# 测试第一种正弦数据集
scale = 5
x_plot = np.random.uniform(low=0, high=1*np.pi, size=200*scale)
x_plot.sort()
y_plot = np.sin(x_plot * 1.5 * np.pi)
y_noise_plot = y_plot + np.random.normal(scale=1/2, size=len(x_plot))
x = np.array([x_plot[5*i] for i in range(int(len(x_plot)/scale))])
y = np.sin(x * 1.5 * np.pi)
y_noise = y + np.random.normal(scale=1/2, size=len(x))

# lowess方法返回的是拟合值，而不是函数，所以返回值与自变量x一一对应，而不能得到更密集或更稀疏的返回值
yest_bell_plot = lowess_bell_shape_kern(x_plot, y_noise_plot, tau=tau)
# frac越接近1，拟合滑动窗口越接近全长，回归值越平滑；it数越大，重新计算权重的最近点数目越多，受最近离群点影响越小；因为此时已经比较平滑，所以所需it数较小。
# frac越接近0，拟合滑动窗口中最近样本数越少，回归值跟随性越强，规律性越差；因为it数不宜大于滑动窗口中的样本数，所以此时it数也应较小。
yest_ag_plot = lowess_ag(x_plot, y_noise_plot, frac=samples_in_window/len(x), it=3)
yest_sm_plot = lowess_sm(y_noise_plot, x_plot, frac=samples_in_window/len(x), it=3, return_sorted=False)
# polyfit与poly1d返回的是函数p，所以对这个函数p赋任何x，都可以得到相应的因变量y
c = np.polyfit(x, y_noise, n)
p = np.poly1d(c)
y_poly_plot = p(x_plot)

# plot to compare three kinds of Lowess method
plt.figure(figsize=(20, 8))
plt.plot(x, y, color='g', label='sin()')
plt.scatter(x, y_noise, facecolors='none', edgecolor='darkblue', label='sin()+np.random.normal()')
plt.plot(x, y_noise, color='darkblue')
plt.fill(x[:], kernel_function(x[:], max(x)/2, tau), color='lime', alpha=0.5, label='Bell shape kernel')
plt.plot(x_plot, yest_ag_plot, color='orange', label='Lowess: A. Gramfort')
plt.plot(x_plot, yest_bell_plot, color='red', label='Lowess: bell shape kernel')
plt.plot(x_plot, yest_sm_plot, color='magenta', label='Lowess: statsmodel')
plt.plot(x_plot, y_poly_plot, color='k', label='polynomial of degree {}'.format(n))
plt.ylim(min(y_noise) - abs(min(y_noise) / 2), max(y_noise) * 1.5)  # 使y坐标的上下限关于y的最大最小值对称，且不受多项式拟合函数的振荡值影响
plt.legend()
plt.title('Sine with noise: Lowess regression comparing with polynomial')


# 测试第二种带离群值的数据集
x = [np.linspace(1, 14, 14), np.linspace(1, 38, 38)]
# x_plot使绘图时的x取值比多项式拟合时，自变量x的取值更密集，以观察到可能出现的振荡
x_plot = [np.linspace(1, int(max(x[0])), int(max(x[0]))*5), np.linspace(1, int(max(x[1])), int(max(x[1]))*5)]
y = [np.random.randn(len(x[0])), np.random.randn(len(x[1]))]
# 设置离群值
for i in range(len(x)):
    y[i][int(len(x[i])/2-2):int(len(x[i])/2+2)] = 5*(i+1)
n = [[3, 4], [3, 6]]
weights = [(max(y[i])*2-y[i])/sum(max(y[i])*2-y[i]) for i in range(len(x))]  # y[i]越大，weights越小；让离群点占较小权重，正常点占较大权重。

c = copy.deepcopy(n)  # c与n具有相同的结构；一定要用copy.deepcopy，不能用浅拷贝copy，否则后面对c的赋值也会改变n
p = copy.deepcopy(n)  # p与n具有相同的结构
yest_sm, yest_ag = [], []
for i in range(len(x)):
    for j in range(len(n[i])):
        c[i][j] = np.polyfit(x[i], y[i], n[i][j], full=True, w=weights[i])
        p[i][j] = np.poly1d(c[i][j][0])
        print('序列长度{0}个点，拟合次数为{1}，即目标函数未知数个数为{2}，即目标函数对自变量偏导数的方程个数为{2}：'.format(len(x[i]), n[i][j], n[i][j]+1))
        print('拟合函数的系数（次数由高到低）：', '\n', c[i][j][0])
        print('拟合值与实际值的MAPE：', sum(abs((p[i][j](x[i])-y[i])/y[i]))/len(x[i]))
        print('目标函数对自变量偏导数方程组的系数矩阵的秩：', c[i][j][2])
        print('目标函数对自变量偏导数方程组的系数矩阵的奇异值：', '\n', c[i][j][3])
        print('拟合的相关条件数：', c[i][j][4], '\n')
    plt.figure(figsize=(10, 6))
    plt.scatter(x[i], y[i], facecolors='none', edgecolor='darkblue', label='original_data')
    plt.plot(x[i], y[i], color='darkblue')
    plt.plot(x_plot[i], p[i][0](x_plot[i]), '-', color='g', label='polynomial of degree {}'.format(n[i][0]))
    plt.plot(x_plot[i], p[i][1](x_plot[i]), '--', color='orange', label='polynomial of degree {}'.format(n[i][1]))
    yest_sm.append(lowess_sm(y[i], x[i], frac=frac, it=5, return_sorted=False))
    yest_ag.append(lowess_ag(x[i], y[i], frac=frac, it=5))
    plt.plot(x[i], yest_sm[i], color='k', label='Lowess: statsmodel')
    plt.plot(x[i], yest_ag[i], color='r', label='Lowess: A. Gramfort')
    plt.ylim(min(y[i])-abs(min(y[i])/2), max(y[i])*1.5)  # 使y坐标的上下限关于y的最大最小值对称，且不受多项式拟合函数的振荡值影响
    plt.legend()
    plt.title('series length: {0}, degree of polyfit: {1}, {2}'.format(len(x[i]), n[i][0], n[i][1]))