Machine Learning Crash Course, Episode 2: Supervised Learning (Regression) and Data Processing (Hands-On)

Article link: https://blog.youkuaiyun.com/2302_80644606/article/details/141105074

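Data Preprocessing

Data preprocessing is one of the most important steps in a machine learning workflow. It covers data cleaning, feature engineering, and related tasks.

Data Cleaning

Handling missing values:

# Fill missing values with the median
# Assign the result back: calling fillna(..., inplace=True) on df['Age'] may not modify df under pandas Copy-on-Write
df['Age'] = df['Age'].fillna(df['Age'].median())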
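
To make the effect of median imputation concrete, here is a minimal sketch on a tiny made-up DataFrame. The values are hypothetical; only the `Age` column name comes from the snippet above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing ages
df = pd.DataFrame({"Age": [22.0, np.nan, 35.0, 58.0, np.nan]})

missing_before = int(df["Age"].isna().sum())

# median() skips NaN by default, so it is computed from 22, 35 and 58
age_median = df["Age"].median()

# Assign the filled column back to the DataFrame
df["Age"] = df["Age"].fillna(age_median)

missing_after = int(df["Age"].isna().sum())
print(missing_before, age_median, missing_after)
```

The median is often preferred over the mean here because it is less sensitive to extreme values.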

Detecting and handling outliers:

# Detect outliers with the IQR (interquartile range) method
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df = df[(df['Age'] >= lower_bound) & (df['Age'] <= upper_bound)]
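
The IQR rule is easier to follow with numbers. The sketch below applies the same steps to a small hypothetical `Age` column that contains one obviously extreme value. The data is invented for illustration.

```python
import pandas as pd

# Hypothetical ages; 120 is the value we expect the rule to flag
df = pd.DataFrame({"Age": [22, 25, 27, 30, 31, 33, 35, 120]})

q1 = df["Age"].quantile(0.25)
q3 = df["Age"].quantile(0.75)
iqr = q3 - q1

# Values more than 1.5 * IQR outside the quartiles count as outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

in_range = df["Age"].between(lower_bound, upper_bound)
outliers = df[~in_range]   # rows that the rule flags
filtered = df[in_range]    # rows that are kept

print(lower_bound, upper_bound)
print(outliers["Age"].tolist())
```

Inspecting `outliers` before dropping rows is a reasonable habit, since an extreme value may be a data-entry error or a legitimate observation.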

Encoding categorical features:

# Apply one-hot encoding
df = pd.get_dummies(df, columns=['Gender'], drop_first=True)
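
As a small illustration of what `drop_first=True` does, the sketch below encodes a hypothetical two-category `Gender` column. The toy values are assumptions made for this example.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "Age": [25, 32, 41, 29],
    "Gender": ["Male", "Female", "Female", "Male"],
})

# drop_first=True drops the first category (in sorted order), so a
# two-category column becomes a single indicator column
encoded = pd.get_dummies(df, columns=["Gender"], drop_first=True)

print(encoded.columns.tolist())
print(encoded["Gender_Male"].astype(int).tolist())
```

Dropping one category avoids a redundant column, which matters for linear models because the full set of indicator columns always sums to one.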

Feature Engineering

Creating new features:

df['Total_Pay'] = df['Base_Pay'] + df['Bonus']

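Feature scaling:

from sklearn.preprocessing import StandardScaler

# Standardize each column to zero mean and unit variance
# Note: when a model is trained later, fit the scaler on the training split only
# so that test-set statistics do not leak into training
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[['Age', 'Salary']])
df[['Age', 'Salary']] = scaled_features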
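
To show what standardization computes, the following sketch compares `StandardScaler` with the formula z = (x - mean) / std worked out by hand, using the population standard deviation. The toy values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data
df = pd.DataFrame({
    "Age": [20.0, 30.0, 40.0, 50.0],
    "Salary": [3000.0, 4000.0, 5000.0, 6000.0],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(df[["Age", "Salary"]])

# Manual version: subtract the column mean, divide by the column's
# population standard deviation (NumPy's default, ddof=0)
values = df[["Age", "Salary"]].to_numpy()
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, both columns are on a comparable scale even though the raw salaries are about a hundred times larger than the ages.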

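Exploratory Data Analysis (EDA)

EDA helps us better understand the characteristics of a dataset. Visualization tools make this process easier.

Visualizing with Matplotlib

Plotting a histogram: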

import matplotlib.pyplot as plt

plt.hist(df['Age'], bins=20)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

Plotting a box plot:

df.boxplot(column='Salary')
plt.title('Salary Distribution')
plt.show()

Plotting a scatter plot:

plt.scatter(df['Age'], df['Salary'])
plt.title('Age vs Salary')
plt.xlabel('Age')
plt.ylabel('Salary')
plt.show()

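Building a Simple Machine Learning Model

Now that the data has been cleaned, we can start building a machine learning model. We use linear regression as the example here.

Preparing the data

Splitting into training and test sets: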

from sklearn.model_selection import train_test_split

X = df[['Age', 'Experience']]
y = df['Salary']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
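
As a quick sanity check on the split, the sketch below runs the same call on a hypothetical 10-row DataFrame and confirms that 20% of the rows end up in the test set and that features and targets stay aligned. The toy values are invented for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 10 rows
df = pd.DataFrame({
    "Age": [22, 25, 28, 31, 34, 37, 40, 43, 46, 49],
    "Experience": [0, 2, 4, 6, 8, 10, 12, 14, 16, 18],
    "Salary": [30, 34, 38, 42, 46, 50, 54, 58, 62, 66],
})

X = df[["Age", "Experience"]]
y = df["Salary"]

# random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
# Each feature row keeps the same index label as its target value
print(list(X_train.index) == list(y_train.index))
```

The test rows are held out until evaluation, so any statistics used for preprocessing should be computed from the training rows.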

Feature scaling:

from sklearn.preprocessing import StandardSc