[Data Analysis Workflow]

好好学习_rich

Licensed under CC 4.0 BY-SA

Categories: Data Analysis, Linear Models. Tags: data analysis, python

Article link: https://blog.youkuaiyun.com/Four2017/article/details/128472281

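Using a concrete case, this post walks through a complete data analysis process from data import to model evaluation, covering data preprocessing, exploratory analysis, feature engineering, model training, and coefficient interpretation.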

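Ridge regression is used as the example.

Loading the dataset

This post uses a linear model, ridge regression, to walk through the workflow. It also points out the problems that arise when a linear model is a poor fit for the data, or when the features are correlated with each other.

First, import the required libraries: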

import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Load the data:

from sklearn.datasets import fetch_openml

survey = fetch_openml(data_id=534, as_frame=True)
X = survey.data[survey.feature_names]
y = survey.target.values.ravel()

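In this dataset the target is WAGE, measured in $/hour. The features include both numerical and categorical variables.

After loading the data, first inspect the data types: check whether the target and each feature are floats, strings, categories, and so on.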
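As a rough illustration of that type check, the sketch below uses a tiny hand-made DataFrame (the values are made up and `toy_X`, `toy_y` are hypothetical names, not the survey data) so that it runs without downloading anything. On the real data the analogous calls would be `X.dtypes` or `X.info()`.

```python
import pandas as pd

# Hypothetical miniature stand-in for the survey features (values are invented).
toy_X = pd.DataFrame(
    {
        "EDUCATION": [8, 12, 16],
        "AGE": [35, 28, 41],
        "SEX": pd.Categorical(["female", "male", "female"]),
        "UNION": pd.Categorical(["not_member", "member", "not_member"]),
    }
)
toy_y = pd.Series([5.1, 7.5, 13.0], name="WAGE")

# Per-column dtypes show which features are numeric and which are categorical.
print(toy_X.dtypes)
print(toy_y.dtype)

# Split the column names by type, which is useful later for preprocessing.
numeric_cols = toy_X.select_dtypes(include="number").columns.tolist()
categorical_cols = toy_X.select_dtypes(include="category").columns.tolist()
```

Separating the column names by type at this stage makes it easier to decide which columns need encoding later.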

Exploratory analysis

Only the training set is explored here, so that knowledge of the test data does not bias the analysis.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
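A small sketch of what this split does, on made-up arrays (`toy_X` and `toy_y` are hypothetical): with no `test_size` given, `train_test_split` holds out 25% of the rows, and a fixed `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 features; the target is just the row index.
toy_X = np.arange(200).reshape(100, 2)
toy_y = np.arange(100)

a_train, a_test, b_train, b_test = train_test_split(toy_X, toy_y, random_state=42)

# Repeating the call with the same random_state yields the same partition.
again = train_test_split(toy_X, toy_y, random_state=42)

print(len(a_train), len(a_test))
```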

Variable distributions and correlation analysis

Only the numerical variables are analyzed:

train_dataset = X_train.copy()
train_dataset.insert(0, "WAGE", y_train)
_ = sns.pairplot(train_dataset, kind="reg", diag_kind="kde")

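The result is shown in the figure below:

[Figure: pair plot of WAGE and the numerical features]

In the figure, WAGE clearly has a long-tailed distribution. Taking its logarithm makes it approximately normal, which helps because linear models such as ridge or Lasso regression work best when the errors are approximately normally distributed.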
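To illustrate why the logarithm helps with a long tail, here is a minimal sketch on synthetic data. It assumes a log-normally distributed stand-in for wages (`toy_wage` is hypothetical, not the survey's WAGE column) and compares the sample skewness before and after the transform.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Synthetic long-tailed "wages"; the distribution parameters are arbitrary.
toy_wage = rng.lognormal(mean=2.0, sigma=0.5, size=5000)

# Skewness near 0 indicates a symmetric distribution; large positive values
# indicate a long right tail.
skew_raw = skew(toy_wage)
skew_log = skew(np.log10(toy_wage))

print(f"skewness before log: {skew_raw:.2f}, after log: {skew_log:.2f}")
```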

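WAGE increases as EDUCATION increases. Note that the relationships shown in the plot are marginal dependences: they describe one variable without holding the others fixed. Also, EXPERIENCE and AGE are strongly linearly correlated.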

machine-learning pipeline

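To design the machine-learning pipeline, first check the data types.

Note in particular that categorical variables cannot be fed directly to a linear model. They need to be processed first, for example with one-hot encoding (binary variables can be kept as a single 0/1 column). This also avoids treating the categories as ordered values, as if one category were larger than another.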

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

categorical_columns = ["RACE", "OCCUPATION", "SECTOR", "MARR", "UNION", "SEX", "SOUTH"]  # names of the categorical variables
numerical_columns = ["EDUCATION", "EXPERIENCE", "AGE"]  # names of the numerical variables (passed through unchanged)

preprocessor = make_column_transformer(
    (OneHotEncoder(drop="if_binary"), categorical_columns),
    remainder="passthrough",
    verbose_feature_names_out=False,  # avoid to prepend the preprocessor names
)

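Modeling

Initial model

Build a ridge regression model with a very small amount of regularization, using log(WAGE) as the target. Only the categorical variables are one-hot encoded.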

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge
from sklearn.compose import TransformedTargetRegressor

model = make_pipeline(
    preprocessor,
    TransformedTargetRegressor(
        regressor=Ridge(alpha=1e-10), func=np.log10, inverse_func=sp.special.exp10
    ),
)

model.fit(X_train, y_train)  # fit the model
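As a hedged sketch of what `TransformedTargetRegressor` does here: the inner regressor is trained on `log10(y)`, and predictions are mapped back to the original scale with the inverse function. The data below is synthetic and the names (`toy_X`, `toy_y`, `toy_model`) are hypothetical.

```python
import numpy as np
from scipy.special import exp10
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic data whose log10 target is linear in the features (made-up coefficients).
toy_X = rng.normal(size=(200, 2))
log_target = 1.0 + 0.3 * toy_X[:, 0] - 0.1 * toy_X[:, 1] + rng.normal(scale=0.05, size=200)
toy_y = exp10(log_target)

toy_model = TransformedTargetRegressor(
    regressor=Ridge(alpha=1e-10), func=np.log10, inverse_func=exp10
)
toy_model.fit(toy_X, toy_y)

# The fitted inner regressor predicts on the log10 scale ...
log_scale_pred = toy_model.regressor_.predict(toy_X)
# ... while the wrapper returns predictions on the original scale.
original_scale_pred = toy_model.predict(toy_X)
```

Because the regressor is fitted on the log scale, its coefficients describe changes in log10 of the target, not in the target itself.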

Model metrics

The goodness-of-fit metric used here is the median absolute error (MedAE).

from sklearn.metrics import median_absolute_error

mae_train = median_absolute_error(y_train, model.predict(X_train))
y_pred = model.predict(X_test)
mae_test = median_absolute_error(y_test, y_pred)
scores = {
    "MedAE on training set": f"{mae_train:.2f} $/hour",
    "MedAE on testing set": f"{mae_test:.2f} $/hour",
}
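To make the metric concrete, here is a small sketch with made-up numbers comparing the median absolute error with the mean absolute error. The median is far less sensitive to a single badly predicted point.

```python
from sklearn.metrics import mean_absolute_error, median_absolute_error

# Invented values: four small errors and one large one.
toy_true = [10, 12, 8, 15, 40]
toy_pred = [11, 11, 9, 14, 20]

# The absolute errors are 1, 1, 1, 1, 20.
medae = median_absolute_error(toy_true, toy_pred)
mae = mean_absolute_error(toy_true, toy_pred)

print(f"MedAE: {medae:.2f}, MAE: {mae:.2f}")
```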

Interpreting the coefficients

Build a DataFrame of the model coefficients.

feature_names = model[:-1].get_feature_names_out()

coefs = pd.DataFrame(
    model[-1].regressor_.coef_,
    columns=["Coefficients"],
    index=feature_names,
)

coefs

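When interpreting the coefficients, it is essential to bring them onto a common scale first, because the features have different units and different orders of magnitude. The two cases are compared below.

Before scaling: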

coefs.plot.barh(figsize=(9, 7))
plt.title("Ridge model, small regularization")
plt.axvline(x=0, color=".5")
plt.xlabel("Raw coefficient values")
plt.subplots_adjust(left=0.3)
plt.show()

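The result is:

[Figure: bar chart of the raw coefficient values]

In the figure above, UNION appears to have the largest effect on WAGE, although intuition suggests that EXPERIENCE should matter most. Because the coefficients are on different scales, comparing them directly can be misleading. Based on $y=\sum_i \mathrm{coef}_i \times X_i=\sum_i (\mathrm{coef}_i \times \mathrm{std}_i)\times (X_i/\mathrm{std}_i)$, each coefficient is multiplied by the standard deviation of its feature.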

# Preprocessed training features as a DataFrame, so that per-column std can be computed
X_train_preprocessed = pd.DataFrame(
    model[:-1].transform(X_train), columns=feature_names
)

coefs = pd.DataFrame(
    model[-1].regressor_.coef_ * X_train_preprocessed.std(axis=0),
    columns=["Coefficient importance"],
    index=feature_names,
)
coefs.plot(kind="barh", figsize=(9, 7))
plt.xlabel("Coefficient values corrected by the feature's std. dev.")
plt.title("Ridge model, small regularization")
plt.axvline(x=0, color=".5")
plt.subplots_adjust(left=0.3)
plt.show()
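The identity behind this correction can be checked numerically. The sketch below uses synthetic data with two features on very different scales (`years` and `income` are hypothetical names with made-up coefficients). Multiplying the raw coefficients by each feature's standard deviation should give essentially the same values as refitting on features divided by their standard deviation.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 500

# Two synthetic features with very different orders of magnitude.
years = rng.normal(loc=13, scale=3, size=n)
income = rng.normal(loc=30000, scale=8000, size=n)
X_toy = np.column_stack([years, income])
y_toy = 0.5 * years + 0.0001 * income + rng.normal(scale=0.1, size=n)

# Fit on the raw features, then rescale the coefficients by each feature's std.
raw_fit = Ridge(alpha=1e-10).fit(X_toy, y_toy)
feature_std = X_toy.std(axis=0)
importance = raw_fit.coef_ * feature_std

# Fit on std-scaled features for comparison.
scaled_fit = Ridge(alpha=1e-10).fit(X_toy / feature_std, y_toy)

print("raw coefficients:   ", raw_fit.coef_)
print("coef * std:         ", importance)
print("coef on scaled data:", scaled_fit.coef_)
```

The raw coefficients differ by orders of magnitude only because of the units, while the rescaled values are directly comparable.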