The algorithm demo and complete code are at the end of this article; the next installment will cover K-Means, the classic unsupervised clustering algorithm. Please credit the source if you repost, thank you.
- Brief Introduction
- Theory Overview
- Data Inventory
- Evaluation Metric
- Implementation
- Animated Demo
- Summary and Outlook
Brief Introduction
Recently I have been consolidating ML/NLP fundamentals under the guidance of senior classmate Li (李师兄). This post folds part of the first lesson's theory into my own notes as a working reproduction, implementing linear regression, batch gradient descent, and the MSE loss function using only basic numerical libraries.
Theory Overview
Model: Linear Regression
Loss: Mean Squared Error
Optimization: Batch gradient descent
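Sketching the three pieces above as formulas (notation mine; the implementation below uses the half-MSE loss so the factor of 2 cancels in the gradient, and it folds the usual $1/n$ term into the learning rate):

$$\hat{y} = Xw + b, \qquad L(w, b) = \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

$$w \leftarrow w - \eta\,X^{\top}(\hat{y} - y), \qquad b \leftarrow b - \eta\sum_{i=1}^{n}(\hat{y}_i - y_i)$$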
Data Inventory
The online Boston housing dataset.
dataset.zip: 506 rows, with 13 feature columns and 1 label column.
Fig.1 Partial feature set display
Evaluation Metric
Metric: MSE, short for Mean Squared Error.
Purpose: it measures how close the model's predictions are to the ground truth; the smaller the MSE, the closer the predictions are to the true values.
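A one-line worked example (illustrative numbers): for true values $(3, 5)$ and predictions $(2, 6)$,

$$\mathrm{MSE} = \frac{(3-2)^2 + (5-6)^2}{2} = 1$$

The implementation below additionally halves this value (half-MSE), which only rescales the metric and does not change which model is better.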
Implementation
Set up the basic required libraries
""" Import the basic requirements package """
import time
import math
import numpy as np
from sklearn.datasets import load_boston
Set up the dataset loading function
""" Dataset export function """
def read_csv():
X, y = load_boston(return_X_y=True)
print(X.shape, y.shape) # Print feature set and label set length
return X, y
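On scikit-learn 1.2 and later, load_boston has been removed; below is a minimal replacement sketch via OpenML (the mirrored dataset name and version are assumptions based on the OpenML listing, and column dtypes may need adjusting):

""" Dataset loading function for scikit-learn >= 1.2 (a sketch) """
from sklearn.datasets import fetch_openml

def read_csv_openml():
    data = fetch_openml(name="boston", version=1, as_frame=False)  # Boston housing mirror on OpenML
    X = data.data.astype(float)      # 506 x 13 feature matrix
    y = data.target.astype(float)    # 506 house-price labels
    print(X.shape, y.shape)
    return X, y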
Set up the MSE loss function
""" MSE loss function """
def mse_loss(y_true, y_pred):
return np.sum(np.power(y_true - y_pred, 2)) / y_true.shape[0] / 2
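A quick sanity check of the halving factor, reusing the toy numbers from the metric section (illustrative values only):

""" Quick check of the half-MSE factor (illustrative values) """
print(mse_loss(np.array([3.0, 5.0]), np.array([2.0, 6.0])))  # prints 0.5: plain MSE of 1.0, halved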
Set up the gradient descent function
""" Gradient descent function """
def gradient_descent(learning_rate, W, b, X_train, y_train, y_pred):
dW = np.dot(y_pred - y_train, X_train)
W = W - learning_rate * dW
db = np.sum(y_pred - y_train)
b = b - learning_rate * db
return W, b
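A variant sketch (my addition, not part of the original recipe) that keeps the 1/n factor explicit, so the update matches the half-MSE gradient exactly and the learning rate is less sensitive to dataset size:

""" Mean-normalized gradient descent (a sketch) """
def gradient_descent_mean(learning_rate, W, b, X_train, y_train, y_pred):
    n = X_train.shape[0]
    dW = np.dot(y_pred - y_train, X_train) / n  # exact d(half-MSE)/dW
    db = np.sum(y_pred - y_train) / n           # exact d(half-MSE)/db
    return W - learning_rate * dW, b - learning_rate * db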
Set up the train/validation split function
""" Trainset and Validset partition function """
def train_valid_split(X, y, split_rate):
n_split = int(X.shape[0] * (1 - split_rate))
return X[ :n_split], y[ :n_split], X[n_split: ], y[n_split: ]
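Because the split above is sequential, any ordering in the raw data leaks into the two sets. A shuffled variant (my addition) for when row order matters:

""" Shuffled train/valid split (a sketch) """
def train_valid_split_shuffled(X, y, split_rate, seed=2021):
    rng = np.random.RandomState(seed)  # reproducible permutation
    idx = rng.permutation(X.shape[0])
    X, y = X[idx], y[idx]
    n_split = int(X.shape[0] * (1 - split_rate))
    return X[:n_split], y[:n_split], X[n_split:], y[n_split:]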
Configure the linear regression model parameters
""" Linear Regression model training parameters """
lr_params = {
'learning_rate': 1e-08, # Set learning rate
'n_estimators': 10000, # Set the number of iterations
'validation_split': 0.2, # Set Set the proportion of the Validset in the Dataset
'verbose': 20, # Set how many iterations to keep the loss
'seed': 2021, # Set random seed
}
Set up the linear regression model function
""" Linear Regression model function """
# create model
def LinearRegression(learning_rate=1e-8, n_estimators=1000, validation_split=0.2, verbose=20, seed=0):
lr_params = {
'learning_rate': learning_rate,
'n_estimators': n_estimators,
'validation_split': validation_split,
'verbose': verbose,
'seed': seed,
}
return lr_params
# Fit model
def fit(lr_params, X, y):
    X_train, y_train, X_valid, y_valid = train_valid_split(X, y, lr_params['validation_split'])
    print(X_train.shape, y_train.shape, X_valid.shape, y_valid.shape, '\n')
    # Randomly initialize W and b
    np.random.seed(lr_params['seed'])
    W = np.random.rand(X.shape[1])
    b = np.random.rand()
    loss = -1
    y_pred = np.dot(X_train, W) + b
    print("[Linear Regression] [Training]")
    # Iterate n_estimators times
    for i in range(lr_params['n_estimators']):
        # Update parameters W and b
        W, b = gradient_descent(lr_params['learning_rate'], W, b, X_train, y_train, y_pred)
        train_mse = mse_loss(y_train, np.dot(X_train, W) + b)
        valid_mse = mse_loss(y_valid, np.dot(X_valid, W) + b)
        y_pred = np.dot(X_train, W) + b
        # Refresh the printed loss according to verbose: overwrite the line
        # in place (\r), keeping one permanent line every `verbose` iterations
        if i % lr_params['verbose'] != 0:
            print("\r[{:<4}] train mse_0's: {:<8.2f} valid mse_1's: {:<8.2f}".format(i, train_mse, valid_mse), end='')
        else:
            print("\r[{:<4}] train mse_0's: {:<8.2f} valid mse_1's: {:<8.2f}".format(i, train_mse, valid_mse), end='\n')
        # Early stopping judgment (jitter-prone, as discussed in the summary):
        # always accept the first 10 warm-up iterations, then stop as soon as
        # the validation MSE exceeds 10x the previously recorded value
        if (loss < 0 or loss * 10 >= valid_mse) or i < 10:
            loss = valid_mse
        else:
            print("\nEarly stopping, best iteration is:")
            print("[{:<4}] train mse_0's: {:<8.2f} valid mse_1's: {:<8.2f}".format(i-1, train_mse, loss), end='\n')
            return None
    return None
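One caveat: fit() discards W and b when it returns. A minimal prediction helper (my addition, assuming fit were changed to return W, b instead of None):

""" Prediction helper (a sketch; assumes fit returns W, b) """
def predict(W, b, X):
    return np.dot(X, W) + b  # same linear form used during training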
Set up the training main process
""" Linear Regression model training host process """
if __name__ == '__main__':
sta_time = time.time()
X, y = read_csv()
lr_model = LinearRegression(**lr_params)
fit(lr_model, X, y)
print("Time:", time.time() - sta_time)
Animated Demo
Summary and Outlook
In the ideal case, when the design matrix is positive definite or full rank, the linear-regression parameters can be solved for directly in closed form; in practice this is often infeasible, and gradient-descent variants have become the mainstream way to converge model parameters. This post converges the parameters with classic batch gradient descent. One shortcoming: the early-stopping module was not configured carefully, so a jitter on the validation set is immediately judged to be the optimum and iteration stops.
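As a footnote to the closed-form remark, a minimal sketch of the normal-equation solution (my addition; assumes X has full column rank, and uses lstsq for numerical robustness):

""" Closed-form linear regression via the normal equations (a sketch) """
import numpy as np

def normal_equation(X, y):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias column of ones
    theta, *_ = np.linalg.lstsq(Xb, y, rcond=None)  # solves min ||Xb @ theta - y||^2
    return theta[:-1], theta[-1]                    # W, b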
Complete Code
Reply "LR" to the official account to receive the complete code.
Reference Links
Boston: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html
MSE Loss: https://rohanvarma.me/Loss-Functions/
Additional code reference: https://zhuanlan.zhihu.com/p/90844957