-
概述:
1.当计算序列中离群区间的效应系数时,左右两侧增加的非离群区间应该尽量长些,离群程度越强,增加的非离群区间应越长,
多项式和Lowess才可能不被拉起。由于无法准确控制拉起的程度,则统一不拉起是更好的。
2.当需要学习序列的基本规律时,多项式可选尽量大的次数(在拟合自变量点上不会振荡),Lowess则是较大的frac和it。 -
解释:
-
对于多项式:
1.(特别重要)当多项式拟合次数过高时,新的应变量会在除了原自变量点之外的区间发生振荡,所以在应用拟合得到的该多项式时,新的自变量不能落在这些区间,
但若落在任意原自变量点上,则永远不会发生振荡。即虽然次数越高多项式越会过拟合,但若新自变量落在原自变量点上,则新应变量只会越来越趋近原应变量取值,
但永远不会振荡。
当拟合次数较低时,在所有区间和自变量点上都不易发生振荡,函数的跟随性较差,规律性较强。当新自变量落在原自变量点上时,新因变量也会偏离原因变量较远。
由此可知,当新应变量的取值不超过原自变量的集合时,拟合次数可适当取高,因为不用担心振荡;若拟合次数较低,新应变量会偏离原应变量更远,从而较不利于反映基础规律。
2.当拟合次数较高时,多项式容易在拟合时的自变量区间两端或一端及更外侧发生振荡,当取较中部区间的应变量时,是可以不发生振荡的。
3.当拟合点数过多,次数较高时,容易出现做最小二乘时奇异值分解不收敛的情况,从而无法拟合出多项式。
当拟合点数不多,拟合次数可以超过拟合点数,即未知数个数超过方程个数,虽然无法通过求解线性方程组得到未知数,但可通过梯度下降等最优化方法得到未知数,
即多项式的系数。此时容易发生振荡,中部非振荡区间也更窄。 -
对于局部加权回归:
1.Lowess方法返回的是拟合值,而不是函数,所以返回值与自变量x一一对应,而不能得到更密集或更稀疏或自变量区间外的返回值。
2.frac越接近1,拟合滑动窗口越接近全长,回归值越平滑,但frac不能大于1;it数越大,重新计算权重的最近点数目越多,受最近离群点影响越小;
因为此时已经比较平滑,所以所需it数较小。
3.frac越接近0,拟合滑动窗口中最近样本数越少,回归值跟随性越强,规律性越差;因为it数不宜大于滑动窗口中的样本数,所以此时it数也应较小。
4.当序列存在一段一段的离群值时,it值可取为最长离群段的样本数+1,这样无论用哪种Lowess方法,都不会受离群值的影响。 -
实践:
from math import ceil
import numpy as np
from scipy import linalg
import statsmodels.api as sm
import matplotlib.pyplot as plt
import copy
plt.style.use('seaborn-white')
# Defining the bell shaped kernel function - used for plotting later on
def kernel_function(xi, x0, tau=.005):
return np.exp(- (xi - x0) ** 2 / (2 * tau))
def lowess_bell_shape_kern(x, y, tau=.005):
"""
lowess_bell_shape_kern(x, y, tau = .005) -> yest
Locally weighted regression: fits a nonparametric regression curve to a scatterplot.
The arrays x and y contain an equal number of elements; each pair
(x[i], y[i]) defines a data point in the scatterplot. The function returns
the estimated (smooth) values of y.
The kernel function is the bell shaped function with parameter tau. Larger tau will result in a
smoother curve.
"""
n = len(x)
yest = np.zeros(n)
# Initializing all weights from the bell shape kernel function
w = np.array([np.exp(- (x - x[i]) ** 2 / (2 * tau)) for i in range(n)])
# Looping through all x-points
for i in range(n):
weights = w[:, i]
b = np.array([np.sum(weights * y), np.sum(weights * y * x)])
A = np.array([[np.sum(weights), np.sum(weights * x)],
[np.sum(weights * x), np.sum(weights * x * x)]])
theta = linalg.solve(A, b)
yest[i] = theta[0] + theta[1] * x[i]
return yest
def lowess_ag(x, y, frac=2/3, it=3):
"""
lowess(x, y, frac=2./3., it=3) -> yest
Lowess smoother: Robust locally weighted regression.
The lowess function fits a nonparametric regression curve to a scatterplot.
The arrays x and y contain an equal number of elements; each pair
(x[i], y[i]) defines a data point in the scatterplot. The function returns
the estimated (smooth) values of y.
The smoothing span is given by frac. A larger value for frac will result in a
smoother curve. The number of robustifying iterations is given by it. The
function will run faster with a smaller number of iterations.
"""
n = len(x)
r = int(ceil(frac * n))
h = [np.sort(np.abs(x - x[i]))[r] for i in range(n)]
w = np.clip(np.abs((x[:, None] - x[None, :]) / h), 0.0, 1.0)
w = (1 - w ** 3) ** 3
yest = np.zeros(n)
delta = np.ones(n)
for iteration in range(it):
for i in range(n):
weights = delta * w[:, i]
b = np.array([np.sum(weights * y), np.sum(weights * y * x)])
A = np.array([[np.sum(weights), np.sum(weights * x)],
[np.sum(weights * x), np.sum(weights * x * x)]])
beta = linalg.solve(A, b)
yest[i] = beta[0] + beta[1] * x[i]
residuals = y - yest
s = np.median(np.abs(residuals))
delta = np.clip(residuals / (6.0 * s), -1, 1)
delta = (1 - delta ** 2) ** 2
return yest
lowess_sm = sm.nonparametric.lowess
tau = 0.005 # abs(tau)越大,lowess_bell_shape_kern的回归值越平滑;tau→0+,回归值跟随性越强,规律性越差。
samples_in_window = 24 # samples_in_window越接近样本全体数量,回归值越平滑;越接近1,回归值跟随性越强,规律性越差。
frac = 0.9
n = 31 # 多项式最高次数,未知数(待拟合变量)有n+1个
# 测试第一种正弦数据集
scale = 5
x_plot = np.random.uniform(low=0, high=1*np.pi, size=200*scale)
x_plot.sort()
y_plot = np.sin(x_plot * 1.5 * np.pi)
y_noise_plot = y_plot + np.random.normal(scale=1/2, size=len(x_plot))
x = np.array([x_plot[5*i] for i in range(int(len(x_plot)/scale))])
y = np.sin(x * 1.5 * np.pi)
y_noise = y + np.random.normal(scale=1/2, size=len(x))
# lowess方法返回的是拟合值,而不是函数,所以返回值与自变量x一一对应,而不能得到更密集或更稀疏的返回值
yest_bell_plot = lowess_bell_shape_kern(x_plot, y_noise_plot, tau=tau)
# frac越接近1,拟合滑动窗口越接近全长,回归值越平滑;it数越大,重新计算权重的最近点数目越多,受最近离群点影响越小;因为此时已经比较平滑,所以所需it数较小。
# frac越接近0,拟合滑动窗口中最近样本数越少,回归值跟随性越强,规律性越差;因为it数不宜大于滑动窗口中的样本数,所以此时it数也应较小。
yest_ag_plot = lowess_ag(x_plot, y_noise_plot, frac=samples_in_window/len(x), it=3)
yest_sm_plot = lowess_sm(y_noise_plot, x_plot, frac=samples_in_window/len(x), it=3, return_sorted=False)
# polyfit与poly1d返回的是函数p,所以对这个函数p赋任何x,都可以得到相应的因变量y
c = np.polyfit(x, y_noise, n)
p = np.poly1d(c)
y_poly_plot = p(x_plot)
# plot to compare three kinds of Lowess method
plt.figure(figsize=(20, 8))
plt.plot(x, y, color='g', label='sin()')
plt.scatter(x, y_noise, facecolors='none', edgecolor='darkblue', label='sin()+np.random.normal()')
plt.plot(x, y_noise, color='darkblue')
plt.fill(x[:], kernel_function(x[:], max(x)/2, tau), color='lime', alpha=0.5, label='Bell shape kernel')
plt.plot(x_plot, yest_ag_plot, color='orange', label='Lowess: A. Gramfort')
plt.plot(x_plot, yest_bell_plot, color='red', label='Lowess: bell shape kernel')
plt.plot(x_plot, yest_sm_plot, color='magenta', label='Lowess: statsmodel')
plt.plot(x_plot, y_poly_plot, color='k', label='polynomial of degree {}'.format(n))
plt.ylim(min(y_noise) - abs(min(y_noise) / 2), max(y_noise) * 1.5) # 使y坐标的上下限关于y的最大最小值对称,且不受多项式拟合函数的振荡值影响
plt.legend()
plt.title('Sine with noise: Lowess regression comparing with polynomial')
# 测试第二种带离群值的数据集
x = [np.linspace(1, 14, 14), np.linspace(1, 38, 38)]
# x_plot使绘图时的x取值比多项式拟合时,自变量x的取值更密集,以观察到可能出现的振荡
x_plot = [np.linspace(1, int(max(x[0])), int(max(x[0]))*5), np.linspace(1, int(max(x[1])), int(max(x[1]))*5)]
y = [np.random.randn(len(x[0])), np.random.randn(len(x[1]))]
# 设置离群值
for i in range(len(x)):
y[i][int(len(x[i])/2-2):int(len(x[i])/2+2)] = 5*(i+1)
n = [[3, 4], [3, 6]]
weights = [(max(y[i])*2-y[i])/sum(max(y[i])*2-y[i]) for i in range(len(x))] # y[i]越大,weights越小;让离群点占较小权重,正常点占较大权重。
c = copy.deepcopy(n) # c与n具有相同的结构;一定要用copy.deepcopy,不能用浅拷贝copy,否则后面对c的赋值也会改变n
p = copy.deepcopy(n) # p与n具有相同的结构
yest_sm, yest_ag = [], []
for i in range(len(x)):
for j in range(len(n[i])):
c[i][j] = np.polyfit(x[i], y[i], n[i][j], full=True, w=weights[i])
p[i][j] = np.poly1d(c[i][j][0])
print('序列长度{0}个点,拟合次数为{1},即目标函数未知数个数为{2},即目标函数对自变量偏导数的方程个数为{2}:'.format(len(x[i]), n[i][j], n[i][j]+1))
print('拟合函数的系数(次数由高到低):', '\n', c[i][j][0])
print('拟合值与实际值的MAPE:', sum(abs((p[i][j](x[i])-y[i])/y[i]))/len(x[i]))
print('目标函数对自变量偏导数方程组的系数矩阵的秩:', c[i][j][2])
print('目标函数对自变量偏导数方程组的系数矩阵的奇异值:', '\n', c[i][j][3])
print('拟合的相关条件数:', c[i][j][4], '\n')
plt.figure(figsize=(10, 6))
plt.scatter(x[i], y[i], facecolors='none', edgecolor='darkblue', label='original_data')
plt.plot(x[i], y[i], color='darkblue')
plt.plot(x_plot[i], p[i][0](x_plot[i]), '-', color='g', label='polynomial of degree {}'.format(n[i][0]))
plt.plot(x_plot[i], p[i][1](x_plot[i]), '--', color='orange', label='polynomial of degree {}'.format(n[i][1]))
yest_sm.append(lowess_sm(y[i], x[i], frac=frac, it=5, return_sorted=False))
yest_ag.append(lowess_ag(x[i], y[i], frac=frac, it=5))
plt.plot(x[i], yest_sm[i], color='k', label='Lowess: statsmodel')
plt.plot(x[i], yest_ag[i], color='r', label='Lowess: A. Gramfort')
plt.ylim(min(y[i])-abs(min(y[i])/2), max(y[i])*1.5) # 使y坐标的上下限关于y的最大最小值对称,且不受多项式拟合函数的振荡值影响
plt.legend()
plt.title('series length: {0}, degree of polyfit: {1}, {2}'.format(len(x[i]), n[i][0], n[i][1]))
- 图形展示:
sin()+noise:polynomial及Lowess均可较好地学习
两侧较短的非离群区间:polynomial及Lowess均被拉起
两侧较长的非离群区间:polynomial及Lowess均可不被拉起