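Problem:
The women dataset, taken from The World Almanac and Book of Facts 1975, records the height and weight of 15 women aged 30-39. Its main attributes are:
(1) height: body height; (2) weight: body weight. (Note: the units are inches and pounds.)
Build a simple linear regression model that predicts a woman's weight from her height, then evaluate and optimize the model. (Polynomial regression may be used for the optimization.)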
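The CSV file itself is not included in this post, so here is a minimal sketch for recreating it. It assumes the data matches R's built-in `women` dataset, which cites the same source, and that the file has a leading row-index column, as `index_col = 0` in the scripts below suggests. The function name `write_women_csv` is hypothetical.

```python
import csv
from pathlib import Path

# Assumed data: the values of R's built-in `women` dataset
# (height in inches, weight in pounds).
HEIGHTS = list(range(58, 73))
WEIGHTS = [115, 117, 120, 123, 126, 129, 132, 135,
           139, 142, 146, 150, 154, 159, 164]


def write_women_csv(path):
    """Write the dataset as CSV with a leading row-index column (assumed layout)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["", "height", "weight"])
        for i, (h, w) in enumerate(zip(HEIGHTS, WEIGHTS), start=1):
            writer.writerow([i, h, w])
    return path
```

Calling `write_women_csv("./Data/women.csv")` should produce a file that the `pd.read_csv` calls in the scripts can load.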
# -*- coding: utf-8 -*- #
"""
@Project :MachineLearning_exp
@File :LinearRegression.py
@Author :ZAY
@Time :2023/3/16 10:18
@Annotation : " "
"""
# packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
data = pd.read_csv(".//Data//women.csv", index_col = 0)
X = data["height"]
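# add the intercept term
X = sm.add_constant(X)
y = data["weight"]
# descriptive statistics (print them, since a bare expression shows nothing in a script)
print(data.describe())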
# draw the scatter plot
plt.scatter(data["height"], data["weight"], color = 'magenta')
# add axis labels and a title
plt.xlabel('height')
plt.ylabel('weight')
plt.title('women-height-weight_scatter')
plt.show()
# ordinary least squares (OLS) model
model = sm.OLS(y, X)
# fit the model
result = model.fit()
# print the fit summary
print(result.summary())
# fitted values on the training data
y_pre = result.predict()
print(y_pre)
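# visualize the result
plt.rcParams['font.family'] = "SimHei" # font for Chinese characters (must be installed)
plt.plot(data["height"], data["weight"], "o", color = 'magenta')
plt.plot(data["height"], y_pre)
plt.title('women-height-weight_scatter')
# save before show(); saving afterwards writes an empty figure (the Result folder must exist)
plt.savefig('.//Result//women-height-weight_LR_scatter.png')
plt.show()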
Figure: scatter plot of the data with the fitted regression line.
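To make the evaluation step of the linear model concrete, here is a minimal NumPy sketch of the closed-form least-squares fit, together with R^2, RMSE, and a prediction for a new height. It assumes the data matches R's built-in `women` dataset; the helper names `fit_simple_ols` and `r_squared` and the query height of 66.5 inches are illustrative, not part of the original scripts.

```python
import numpy as np

# Assumed data: the values of R's built-in `women` dataset.
height = np.arange(58, 73, dtype=float)
weight = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                   139, 142, 146, 150, 154, 159, 164], dtype=float)


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    x_c = x - x.mean()
    slope = np.sum(x_c * (y - y.mean())) / np.sum(x_c ** 2)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope


def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


intercept, slope = fit_simple_ols(height, weight)
fitted = intercept + slope * height
r2 = r_squared(weight, fitted)
rmse = np.sqrt(np.mean((weight - fitted) ** 2))

# Predict the weight for a height that is not in the training data.
new_height = 66.5
predicted = intercept + slope * new_height

print(f"weight = {intercept:.4f} + {slope:.4f} * height")
print(f"R^2 = {r2:.4f}, RMSE = {rmse:.4f} lb")
print(f"predicted weight at {new_height} in: {predicted:.2f} lb")
```

Under the stated data assumption the slope works out to 3.45 lb per inch with R^2 of about 0.991, so the straight line already fits well; the remaining error is what the polynomial model tries to reduce.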
Optimizing with a polynomial regression model:
# -*- coding: utf-8 -*- #
"""
@Project :MachineLearning_exp
@File :PolynomialRegression.py
@Author :ZAY
@Time :2023/3/16 10:43
@Annotation : " "
"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
data = pd.read_csv(".//Data//women.csv", index_col = 0)
X = data["height"]
y = data["weight"]
# build cubic polynomial features: height, height^2, height^3
X = np.column_stack((X, np.power(X, 2), np.power(X, 3)))
# add the intercept term
X = sm.add_constant(X)
model = sm.OLS(y, X)
result = model.fit()
print(result.summary())
y_pre = result.predict()
print(y_pre)
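# visualize the result
plt.rcParams['font.family'] = "SimHei" # font for Chinese characters (must be installed)
plt.plot(data["height"], data["weight"], "o", color = 'magenta')
plt.plot(data["height"], y_pre)
plt.title('women-height-weight_scatter')
# save before show(); saving afterwards writes an empty figure (the Result folder must exist)
plt.savefig('.//Result//women-height-weight_PR_scatter.png')
plt.show()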
Figure: scatter plot of the data with the fitted cubic polynomial curve.
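One way to check whether the extra polynomial terms are worth it is to compare adjusted R^2 across degrees, since plain R^2 can never decrease when terms are added. The sketch below does this with `np.polyfit`, assuming the data matches R's built-in `women` dataset; the helper `adjusted_r2` is illustrative and not taken from the scripts above.

```python
import numpy as np

# Assumed data: the values of R's built-in `women` dataset.
height = np.arange(58, 73, dtype=float)
weight = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                   139, 142, 146, 150, 154, 159, 164], dtype=float)


def adjusted_r2(y, y_hat, n_predictors):
    """R^2 penalized for the number of predictors (intercept excluded)."""
    n = len(y)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)


# Centering height keeps the polynomial fit numerically stable
# without changing the fitted values.
x = height - height.mean()

scores = {}
for degree in (1, 2, 3):
    coefs = np.polyfit(x, weight, degree)
    fitted = np.polyval(coefs, x)
    scores[degree] = adjusted_r2(weight, fitted, degree)
    print(f"degree {degree}: adjusted R^2 = {scores[degree]:.4f}")
```

With only 15 observations, a higher adjusted R^2 is evidence rather than proof of a better model, so it is worth reading it alongside the coefficient p-values in the fit summary.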
Model summary:
                            OLS Regression Results
==============================================================================
Dep. Variable:                 weight   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.679e+04
Date:                Thu, 16 Mar 2023   Prob (F-statistic):           2.07e-20
Time:                        11:52:53   Log-Likelihood:                 1.3441
No. Observations:                  15   AIC:                             5.312
Df Residuals:                      11   BIC:                             8.144
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       -896.7476    294.575     -3.044      0.011   -1545.102    -248.393
x1            46.4108     13.655      3.399      0.006      16.356      76.466
x2            -0.7462      0.211     -3.544      0.005      -1.210      -0.283
x3             0.0043      0.001      3.940      0.002       0.002       0.007
==============================================================================
Omnibus:                        0.028   Durbin-Watson:                   2.388
Prob(Omnibus):                  0.986   Jarque-Bera (JB):                0.127
Skew:                           0.049   Prob(JB):                        0.939
Kurtosis:                       2.561   Cond. No.                     1.25e+09
==============================================================================
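The `Cond. No.` of 1.25e+09 in the summary above points to strong collinearity among the raw powers of height. A common remedy is to center height before building the powers. The sketch below compares the two design matrices with `np.linalg.cond`, whose 2-norm condition number should correspond to the figure statsmodels reports; it assumes the data matches R's built-in `women` dataset, and the helper `cubic_design` is illustrative.

```python
import numpy as np

# Assumed data: the values of R's built-in `women` dataset.
height = np.arange(58, 73, dtype=float)
weight = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                   139, 142, 146, 150, 154, 159, 164], dtype=float)


def cubic_design(x):
    """Design matrix with columns 1, x, x^2, x^3."""
    return np.column_stack((np.ones_like(x), x, x ** 2, x ** 3))


X_raw = cubic_design(height)
X_centered = cubic_design(height - height.mean())

cond_raw = np.linalg.cond(X_raw)
cond_centered = np.linalg.cond(X_centered)

# Both matrices span the same column space, so the least-squares
# fitted values should agree up to floating-point error.
beta_raw, *_ = np.linalg.lstsq(X_raw, weight, rcond=None)
beta_centered, *_ = np.linalg.lstsq(X_centered, weight, rcond=None)
fitted_raw = X_raw @ beta_raw
fitted_centered = X_centered @ beta_centered

print(f"condition number, raw powers:      {cond_raw:.3e}")
print(f"condition number, centered powers: {cond_centered:.3e}")
print(f"max difference in fitted values:   "
      f"{np.max(np.abs(fitted_raw - fitted_centered)):.2e}")
```

Centering changes the individual coefficients and their interpretation (they now describe weight around the mean height), but not the predictions, so it is a cheap way to get more stable estimates.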
For the experiment data, send me a private message and I will do my best to provide it!