Machine Learning Practice 4: Regularized Linear Regression and Bias vs. Variance

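This article builds machine learning models that predict the amount of water flowing out of a dam from changes in the reservoir's water level. Using linear and polynomial regression, it diagnoses high-bias and high-variance problems and shows how tuning the regularization parameter improves the model.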


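The first half of this exercise uses the change in a reservoir's water level to predict the amount of water flowing out of a dam.
Fitting a straight line to the data set gives a linear regression fit, but a line cannot follow the data well. This is the high-bias case, also called underfitting.
Conversely, a very complex model, such as a high-degree polynomial or a deep neural network with many hidden units, may match the training data very closely, yet that is not a good fit either: such a model has high variance and overfits the data.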

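  • High bias and high variance are two different problems. They are usually diagnosed by comparing the error on the training set with the error on the validation set.
  • Early work in machine learning devoted a lot of discussion to the bias-variance trade-off, and there are many remedies to try. In the current era of deep learning and big data, training a larger network (with suitable regularization) usually reduces bias without much effect on variance, and collecting more data usually reduces variance without much effect on bias.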
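The diagnosis described in the bullets above can be tried on synthetic data before touching the dam data set. This is only a minimal sketch: the noisy sine data, the degrees 1 and 9, and the helper names `make_data`, `half_mse` and `fit_and_score` are assumptions made for the demo, and `np.polyfit` stands in for the regularized trainer used later in the post.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    """Synthetic data (an assumption for this demo): a noisy sine curve."""
    x = rng.uniform(-3, 3, n)
    y = np.sin(x) + rng.normal(0, 0.1, n)
    return x, y


x_train, y_train = make_data(15)
x_val, y_val = make_data(200)


def half_mse(coeffs, x, y):
    """Half the mean squared error, the same scaling as the cost used in this post."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2) / 2


def fit_and_score(degree):
    """Fit a polynomial on the training set, return (training error, validation error)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    return half_mse(coeffs, x_train, y_train), half_mse(coeffs, x_val, y_val)


for degree in (1, 9):
    train_err, val_err = fit_and_score(degree)
    print(f"degree={degree}: train={train_err:.4f}, validation={val_err:.4f}")
```

A training error that is already large points to high bias. A small training error paired with a much larger validation error points to high variance.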

Regularized Linear Regression

Visualizing the dataset
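import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as opt
from scipy.io import loadmat

data = loadmat('ex5data1.mat')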
# Training set
X, y = data['X'], data['y']
# Cross validation set
Xval, yval = data['Xval'], data['yval']
# Test set
Xtest, ytest = data['Xtest'], data['ytest']

X = np.insert(X, 0, 1, axis=1)
Xval = np.insert(Xval, 0, 1, axis=1)
Xtest = np.insert(Xtest, 0, 1, axis=1)


def plot_data():
    plt.figure(figsize=(6, 4))
    plt.scatter(X[:, 1:], y, c='r', marker='x')
    plt.xlabel('change in water level(x)')
    plt.ylabel('Water flowing out of the dam (y)')
    plt.grid(True)


plot_data()
plt.show()

Regularized linear regression cost function
def regularized_cost(theta, X, y, l):
    cost = ((X.dot(theta) - y.flatten()) **2).sum()/(2*len(X))
    regularized_theta = l * (theta[1:].dot(theta[1:]))/(2 * len(X))
    return cost + regularized_theta
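Written out, the function above computes the cost below. Here m is the number of training examples, n is the number of features, λ corresponds to the argument `l`, and the intercept θ_0 is left out of the penalty because the code slices `theta[1:]`.

```latex
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta\left(x^{(i)}\right) - y^{(i)}\right)^2
          + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2,
\qquad h_\theta(x) = \theta^{\top}x
```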


theta = np.ones(X.shape[1])
print(regularized_cost(theta, X, y, 1))

Regularized linear regression gradient
def regularized_gradient(theta, X, y, l):
    grad = (X.dot(theta) - y.flatten()).dot(X)
    regularized_theta = l * theta
    regularized_theta[0] = 0  # the intercept theta[0] is not regularized
    return (grad + regularized_theta) / len(X)


print(regularized_gradient(theta, X, y, 1))
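A finite-difference check is a cheap way to confirm that an analytic gradient agrees with its cost function. The sketch below is self-contained and runs on random data; the names `cost_fn`, `grad_fn` and `numerical_gradient` are illustrative and not part of the exercise. Both functions leave the intercept out of the penalty.

```python
import numpy as np


def cost_fn(theta, X, y, lam):
    """Regularized squared-error cost; theta[0] is not penalized."""
    m = len(X)
    residual = X @ theta - y
    return residual @ residual / (2 * m) + lam * (theta[1:] @ theta[1:]) / (2 * m)


def grad_fn(theta, X, y, lam):
    """Analytic gradient of cost_fn with respect to theta."""
    m = len(X)
    penalty = lam * theta / m
    penalty[0] = 0.0  # no penalty on the intercept
    return (X @ theta - y) @ X / m + penalty


def numerical_gradient(f, theta, eps=1e-6):
    """Central-difference approximation of the gradient of f at theta."""
    approx = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        approx[j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return approx


rng = np.random.default_rng(0)
X_demo = np.column_stack([np.ones(8), rng.normal(size=(8, 2))])
y_demo = rng.normal(size=8)
theta_demo = rng.normal(size=3)

analytic = grad_fn(theta_demo, X_demo, y_demo, 1.0)
numeric = numerical_gradient(lambda t: cost_fn(t, X_demo, y_demo, 1.0), theta_demo)
max_diff = np.max(np.abs(analytic - numeric))
print("largest difference:", max_diff)
```

If the penalty were also applied to `theta[0]` in the gradient but not in the cost, the first component would disagree by `lam * theta[0] / m`, which this check would expose.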
def train_linear_regularized(X, y, l):
    theta = np.zeros(X.shape[1])
    res = opt.minimize(fun=regularized_cost,
                       x0=theta,
                       args=(X, y, l),
                       method='TNC',
                       jac=regularized_gradient)
    return res.x
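Because the cost is quadratic in theta, the minimizer also has a closed form, which can serve as a cross-check on the optimizer. The sketch below is an alternative technique (the regularized normal equation), not the method used in this post; `ridge_normal_equation` and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def ridge_normal_equation(X, y, lam):
    """Solve (X^T X + lam * L) theta = X^T y, where L is the identity with L[0, 0] = 0."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0  # do not penalize the intercept
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)


# Synthetic stand-in for the water-level data (assumption for this demo).
rng = np.random.default_rng(0)
x_demo = rng.uniform(-50, 40, size=12)
X_demo = np.column_stack([np.ones(12), x_demo])
y_demo = 10 + 0.4 * x_demo + rng.normal(0, 3, size=12)

lam = 1.0
theta_closed = ridge_normal_equation(X_demo, y_demo, lam)

# At the minimizer the regularized gradient should vanish.
penalty = lam * np.array([0.0, theta_closed[1]])
grad_at_solution = ((X_demo @ theta_closed - y_demo) @ X_demo + penalty) / len(X_demo)
print(theta_closed, grad_at_solution)
```

With `lam = 0` the system reduces to ordinary least squares, and `np.linalg.solve` fails if `X.T @ X` is singular, so an iterative optimizer or `np.linalg.lstsq` is the safer choice in that case.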

Fitting linear regression

Fit the linear regression and plot the fitted line.

final_theta = train_linear_regularized(X, y, 0)
plot_data()
plt.plot(X[:, 1], X.dot(final_theta))
plt.show()

Bias-variance

Learning curves

Plot the learning curves.

def plot_learning_curve(X, y, Xval, yval, l):
    # Train on the first i examples, then record the unregularized error
    # on those i examples and on the full cross validation set.
    training_cost, cross_cost = [], []
    sizes = list(range(1, len(X) + 1))  # include the full training set
    for i in sizes:
        res = train_linear_regularized(X[:i], y[:i], l)
        training_cost_item = regularized_cost(res, X[:i], y[:i], 0)
        cross_cost_item = regularized_cost(res, Xval, yval, 0)
        training_cost.append(training_cost_item)
        cross_cost.append(cross_cost_item)

    plt.figure(figsize=(6, 4))
    plt.plot(sizes, training_cost, label='training cost')
    plt.plot(sizes, cross_cost, label='cross cost')
    plt.legend()
    plt.xlabel('Number of training examples')
    plt.ylabel('Error')
    plt.title('Learning curve for linear regression')
    plt.grid(True)

plot_learning_curve(X, y, Xval, yval, 0)
plt.show()
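To see what a high-bias learning curve looks like in numbers, the sketch below fits a straight line to synthetic quadratic data for every training-set size from 1 to m. Everything here is an assumption made for the demo: the data, the names `make_quadratic` and `half_mse`, and the use of `np.linalg.lstsq` in place of the TNC optimizer.

```python
import numpy as np

rng = np.random.default_rng(1)


def make_quadratic(n):
    """Synthetic data (assumption): y is quadratic in x, but the model is only a line."""
    x = rng.uniform(-3, 3, n)
    y = x ** 2 + rng.normal(0, 0.2, n)
    return np.column_stack([np.ones(n), x]), y


def half_mse(theta, X, y):
    residual = X @ theta - y
    return residual @ residual / (2 * len(X))


X_tr, y_tr = make_quadratic(12)
X_cv, y_cv = make_quadratic(50)

train_errors, val_errors = [], []
for i in range(1, len(X_tr) + 1):
    # Unregularized least-squares line fit on the first i examples.
    theta_i, *_ = np.linalg.lstsq(X_tr[:i], y_tr[:i], rcond=None)
    train_errors.append(half_mse(theta_i, X_tr[:i], y_tr[:i]))
    val_errors.append(half_mse(theta_i, X_cv, y_cv))

for i, (tr, cv) in enumerate(zip(train_errors, val_errors), start=1):
    print(f"m={i:2d}  train={tr:8.4f}  validation={cv:8.4f}")
```

With one or two examples a line fits the training points exactly, so the training error starts at essentially zero. When the model is too simple, the training error then climbs as examples are added, and adding more data does not bring either error down to the noise level.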

Polynomial regression

Learning Polynomial Regression

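Polynomial regression uses the following hypothesis, where x is the change in water level and p is the highest power:

$$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_p x^p$$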

def genPolyFeatures(X, power):
    
    Xpoly = X.copy()
    for i in range(2, power + 1):
        Xpoly = np.insert(Xpoly, Xpoly.shape[1], np.power(Xpoly[:,1], i), axis=1)
    return Xpoly
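A small input makes the expansion easier to picture. The sketch below uses a vectorized variant written for this illustration (`poly_features` is a hypothetical name, not the function above); it assumes the same layout, with the bias term in column 0 and x in column 1.

```python
import numpy as np


def poly_features(X, power):
    """Append x^2 ... x^power as new columns; X holds [bias, x]."""
    x = X[:, 1:2]                         # keep x as a column, shape (m, 1)
    exponents = np.arange(2, power + 1)   # 2, 3, ..., power
    return np.hstack([X, x ** exponents])  # broadcasting gives shape (m, power - 1)


X_small = np.array([[1.0, 2.0],
                    [1.0, -3.0]])
X_small_poly = poly_features(X_small, 3)
print(X_small_poly)
# rows are [1, 2, 4, 8] and [1, -3, 9, -27]
```

The higher powers grow very quickly (the dam data has water-level changes of several tens of units), which is why the features are normalized before training.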

# Compute the per-column mean and standard deviation of the training set
def get_means_std(X):
    means = np.mean(X, axis=0)
    stds = np.std(X, axis=0, ddof=1)  # ddof=1 gives the sample standard deviation
    return means, stds

# Normalize the features (the bias column is left unchanged)
def featureNormalize(myX, means, stds):
    
    X_norm = myX.copy()
    X_norm[:,1:] = X_norm[:,1:] - means[1:]
    X_norm[:,1:] = X_norm[:,1:] / stds[1:]
    return X_norm


power = 6  # expand the features up to x to the 6th power

train_means, train_stds = get_means_std(genPolyFeatures(X,power))
X_norm = featureNormalize(genPolyFeatures(X,power), train_means, train_stds)
Xval_norm = featureNormalize(genPolyFeatures(Xval,power), train_means, train_stds)
Xtest_norm = featureNormalize(genPolyFeatures(Xtest,power), train_means, train_stds)
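Note that the validation and test sets above are scaled with the training set's means and standard deviations, not their own. The sketch below illustrates the effect on synthetic data; `normalize_with`, `make_design` and the data are assumptions made for the demo.

```python
import numpy as np


def normalize_with(X, means, stds):
    """Scale every column except the bias column using the given statistics."""
    X_norm = X.astype(float)  # astype returns a copy, so X is not modified
    X_norm[:, 1:] = (X_norm[:, 1:] - means[1:]) / stds[1:]
    return X_norm


rng = np.random.default_rng(0)


def make_design(n):
    """Synthetic design matrix (assumption): columns are bias, x, x^2."""
    x = rng.uniform(-50, 40, n)
    return np.column_stack([np.ones(n), x, x ** 2])


X_train_demo = make_design(20)
X_val_demo = make_design(20)

# Statistics come from the training set only.
demo_means = X_train_demo.mean(axis=0)
demo_stds = X_train_demo.std(axis=0, ddof=1)

train_norm = normalize_with(X_train_demo, demo_means, demo_stds)
val_norm = normalize_with(X_val_demo, demo_means, demo_stds)

print("train column means:", train_norm[:, 1:].mean(axis=0))  # close to 0
print("val column means:  ", val_norm[:, 1:].mean(axis=0))    # generally not 0
```

Reusing the training statistics keeps every split in the coordinate system the model was trained in, so a given water level always maps to the same feature values.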


def plot_fit(means, stds, l):
    """Plot the fitted polynomial curve over the training data."""
    theta = train_linear_regularized(X_norm, y, l)
    x = np.linspace(-75, 55, 50)
    xmat = x.reshape(-1, 1)
    xmat = np.insert(xmat, 0, 1, axis=1)
    Xmat = genPolyFeatures(xmat, power)
    Xmat_norm = featureNormalize(Xmat, means, stds)

    plot_data()
    plt.plot(x, Xmat_norm @ theta, 'b--')

plot_fit(train_means, train_stds, 0)
plot_learning_curve(X_norm, y, Xval_norm, yval, 0)
plt.show()

Adjusting the regularization parameter

plot_fit(train_means, train_stds, 1)
plot_learning_curve(X_norm, y, Xval_norm, yval, 1)
plt.show()

Selecting λ using a cross validation set

Try a range of λ values and use the cross validation set to compare them.

lambdas = [0., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1., 3., 10.]
errors_train, errors_val = [], []
for l in lambdas:
    theta = train_linear_regularized(X_norm, y, l)
    errors_train.append(regularized_cost(theta, X_norm, y, 0))
    errors_val.append(regularized_cost(theta, Xval_norm, yval, 0))

plt.figure(figsize=(8, 5))
plt.plot(lambdas, errors_train, label='Train')
plt.plot(lambdas, errors_val, label='Cross Validation')
plt.legend()
plt.xlabel('lambda')
plt.ylabel('Error')
plt.grid(True)
plt.show()
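Once the validation errors are collected, a natural next step is to pick the λ with the lowest validation error and report the error on the held-out test set. The sketch below shows that step on synthetic data. It is an illustration under assumptions: the data, the helper names `fit_ridge`, `half_mse` and `make_split`, and the closed-form solver are not from the exercise.

```python
import numpy as np


def fit_ridge(X, y, lam):
    """Closed-form regularized fit; the intercept (column 0) is not penalized."""
    penalty = np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * penalty, X.T @ y)


def half_mse(theta, X, y):
    residual = X @ theta - y
    return residual @ residual / (2 * len(X))


rng = np.random.default_rng(0)


def make_split(n, degree=6):
    """Synthetic data (assumption): noisy sine with polynomial features 1, x, ..., x^degree."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.2, n)
    return np.vander(x, degree + 1, increasing=True), y


X_tr, y_tr = make_split(12)
X_cv, y_cv = make_split(30)
X_te, y_te = make_split(30)

candidates = [0., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1., 3., 10.]

# Validation error is always measured without the penalty term.
val_errors = [half_mse(fit_ridge(X_tr, y_tr, lam), X_cv, y_cv) for lam in candidates]
best_lambda = candidates[int(np.argmin(val_errors))]

# The test set is used once, after lambda has been chosen.
test_error = half_mse(fit_ridge(X_tr, y_tr, best_lambda), X_te, y_te)
print("best lambda:", best_lambda, "test error:", test_error)
```

Keeping the test set out of the selection loop matters: the validation error of the chosen λ is an optimistic estimate because λ was tuned on that very set.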