Processing the Titanic Dataset with pandas

Original article, last modified 2025-06-12 10:38:03

License: CC 4.0 BY-SA

Tags: #learning #pandas

First published 2025-03-25 15:08:56

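1. Project Background and Environment Setup

The Titanic dataset is a classic machine learning example, commonly used as a binary classification problem: predicting whether a passenger survived. This project uses Python for data processing and modeling, relying mainly on the following libraries:

pandas & numpy: data loading, processing, and numerical computation;

matplotlib & seaborn: data visualization, to make distributions and relationships in the data easy to see;

scikit-learn: dataset splitting, model training, hyperparameter tuning, cross-validation, and model evaluation;

joblib: saving the trained model.

In addition, to keep Chinese characters in plots from rendering as garbled boxes, the matplotlib font is set to SimHei and the minus-sign rendering is adjusted.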

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Use a font with Chinese glyphs (only needed when plot labels contain Chinese text)
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False  # Render the minus sign correctly with this font

# scikit-learn and other machine learning tools
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import joblib

# Try to import xgboost; it is optional
try:
    from xgboost import XGBClassifier
    xgb_available = True
except ImportError:
    print("xgboost is not installed; the XGBoost model will be skipped.")
    xgb_available = False

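2. Data Loading and First Look

Data exploration is the foundation of the whole project. A first look at the data structure, feature types, and missing values prepares the ground for the cleaning and feature engineering that follow.

2.1 Loading the Data

The dataset is hosted on GitHub and is read directly with pd.read_csv.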

data_url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
data = pd.read_csv(data_url)

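2.2 Inspecting Basic Information

The info(), head(), and describe() methods give a quick view of the data structure, the first few records, and summary statistics for the numeric features.

print("Basic information:")
data.info()  # info() prints directly and returns None, so it is not wrapped in print()
print("\nPreview:")
print(data.head())
print("\nSummary statistics for numeric features:")
print(data.describe())

info(): shows each column's data type and non-null count, which makes it easy to see which columns have missing values;
head(): shows sample rows, which helps judge whether cleaning is needed;
describe(): shows the mean, standard deviation, maximum, minimum, and quartiles of the numeric features, giving a sense of their distribution.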

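3. Data Visualization and Missing-Value Analysis

Visualization helps reveal patterns and anomalies in the data quickly. A heatmap of missing values shows at a glance which features have many gaps, and therefore which can be kept and which should be dropped.

plt.figure(figsize=(10, 4))
sns.heatmap(data.isnull(), cbar=False, cmap="viridis")
plt.title("Missing-value heatmap")
plt.show()

data.isnull() produces a boolean matrix marking whether each cell is empty;
sns.heatmap renders that boolean matrix, which informs the missing-value strategy used later.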
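
The heatmap gives a visual impression; a numeric summary of the same boolean matrix can make the decision more concrete. The sketch below uses a small hypothetical frame (`toy`, with made-up values) rather than the real dataset, and the names `missing_count` and `missing_ratio` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", "S", None, "Q"],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# isnull().sum() counts missing cells per column; isnull().mean() gives the share
missing_summary = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_ratio": toy.isnull().mean(),
}).sort_values("missing_ratio", ascending=False)

print(missing_summary)
```

On the real data the same two calls would be applied to `data` instead of `toy`.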

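4. Data Cleaning and Feature Engineering

Data cleaning and feature engineering are key to model performance. This step handles missing values, extracts new features, drops unhelpful features, and encodes the categorical variables.

4.1 Handling Missing Values

Age is filled with the median, so that extreme values do not distort the fill value;
Embarked is filled with the mode;
Cabin has too many missing values, so the feature is dropped outright.

# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which is deprecated in recent pandas versions and may not modify the original DataFrame
data['Age'] = data['Age'].fillna(data['Age'].median())
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])
data.drop(columns=['Cabin'], inplace=True)

Further notes:

Choosing a fill strategy: the median is insensitive to outliers, and the mode suits categorical data;

Dropping a feature: when a feature's missing rate is very high, dropping it can be more reasonable than filling it, because filling would introduce too much noise.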
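
To see why the median is preferred over the mean here, consider a minimal sketch on made-up values (the series `ages` and `ports` are hypothetical, not taken from the dataset). A single large value pulls the mean upward while the median stays near the bulk of the data.

```python
import pandas as pd

# Made-up ages with one extreme value and one missing entry
ages = pd.Series([20.0, 22.0, 24.0, 26.0, 80.0, None])

mean_age = ages.mean()      # pulled upward by the value 80
median_age = ages.median()  # stays at the middle of the observed values
ages_filled = ages.fillna(median_age)

# Made-up categorical column: the mode is the most frequent category
ports = pd.Series(["S", "S", "C", None, "Q", "S"])
mode_port = ports.mode()[0]
ports_filled = ports.fillna(mode_port)

print(f"mean={mean_age}, median={median_age}, mode={mode_port}")
```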

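4.2 Feature Extraction: Title

The title (Mr, Mrs, Miss, and so on) is extracted from each passenger's name by string splitting. Titles that occur fewer than 10 times are grouped into a single "Rare" category. Grouping low-frequency titles reduces category sparsity, which cuts noise for the model and helps generalization.

def extract_title(name):
    # Names follow the pattern "Surname, Title. Given names"
    if pd.isnull(name):
        return "Unknown"
    title = name.split(",")[1].split(".")[0].strip()
    return title

data['Title'] = data['Name'].apply(extract_title)
title_counts = data['Title'].value_counts()
# Titles seen fewer than 10 times are merged into 'Rare'
rare_titles = title_counts[title_counts < 10].index
data['Title'] = data['Title'].replace(rare_titles, 'Rare')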
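
As an alternative to the row-by-row apply above, the same extraction can be sketched with a vectorized regular expression. This is only an illustration: the sample names follow the dataset's "Surname, Title. Given names" pattern, and the threshold of 2 is chosen for the tiny sample (the article uses 10 on the full data).

```python
import pandas as pd

# A few names in the dataset's "Surname, Title. Given names" format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Allen, Mr. William Henry",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture everything between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Group titles that appear fewer than `min_count` times (hypothetical threshold)
min_count = 2
counts = titles.value_counts()
rare = counts[counts < min_count].index
grouped = titles.where(~titles.isin(rare), "Rare")

print(titles.tolist())
print(grouped.tolist())
```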

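4.3 Building a New Feature: Family Size

SibSp (number of siblings and spouses aboard) and Parch (number of parents and children aboard) are combined, plus one for the passenger, into a family-size feature. This helps capture how traveling with family relates to survival.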

data['FamilySize'] = data['SibSp'] + data['Parch'] + 1

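4.4 Dropping Unhelpful Features and Encoding Categories

Features that do not directly help prediction (PassengerId, Name, Ticket) are dropped, which also reduces computation and the risk of overfitting to identifiers. The categorical variables are then converted to numeric values with LabelEncoder so that the algorithms can process them.

data.drop(columns=['PassengerId', 'Name', 'Ticket'], inplace=True)

categorical_features = ['Sex', 'Embarked', 'Title']
le = LabelEncoder()
for col in categorical_features:
    # fit_transform refits the encoder on each column, so `le` only remembers the last one
    data[col] = le.fit_transform(data[col])

LabelEncoder: converts string labels to integers. The integer codes impose an arbitrary order on nominal categories; tree-based models tolerate this, but a linear model such as logistic regression may read the codes as an ordinal relationship.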
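
For comparison, here is a minimal sketch of integer encoding next to one-hot encoding on a hypothetical toy frame. One-hot encoding via pd.get_dummies is not what the article uses; it is shown only as a common alternative that avoids implying an order between categories.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame (values are made up)
toy = pd.DataFrame({
    "Embarked": ["S", "C", "Q", "S"],
    "Fare": [7.25, 71.28, 8.46, 8.05],
})

# Integer encoding: classes are sorted alphabetically, then numbered
le_demo = LabelEncoder()
codes = le_demo.fit_transform(toy["Embarked"])
mapping = dict(zip(le_demo.classes_.tolist(), range(len(le_demo.classes_))))

# One-hot encoding: one 0/1 indicator column per category
one_hot = pd.get_dummies(toy, columns=["Embarked"], dtype=int)

print(mapping)
print(one_hot)
```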

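5. Deeper Visual Analysis

After cleaning, further plots help uncover relationships and latent patterns in the data and show how individual features relate to passenger survival.

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Note: Sex and Embarked were label-encoded above, so their axes show integer codes
# (LabelEncoder numbers the classes in alphabetical order)

# 1. Passenger class vs. survival
sns.countplot(x='Pclass', hue='Survived', data=data, ax=axes[0, 0])
axes[0, 0].set_title("Passenger class vs. survival")

# 2. Sex vs. survival
sns.countplot(x='Sex', hue='Survived', data=data, ax=axes[0, 1])
axes[0, 1].set_title("Sex vs. survival")

# 3. Port of embarkation vs. survival
sns.countplot(x='Embarked', hue='Survived', data=data, ax=axes[1, 0])
axes[1, 0].set_title("Port of embarkation vs. survival")

# 4. Distribution of family size
sns.histplot(data['FamilySize'], kde=False, ax=axes[1, 1])
axes[1, 1].set_title("Family size distribution")

plt.tight_layout()
plt.show()

countplot shows the count in each category; splitting by hue on Survived makes it easy to see which features are related to survival;
histplot shows the distribution of a numeric variable, which helps judge whether it is skewed.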
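
Counts can be hard to compare when groups differ in size, so a per-group survival rate is a useful companion to the count plots. The sketch below uses a made-up toy frame; the column names match the dataset, but the values and the output names `survival_rate` and `passengers` are illustrative.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate
by_class = toy.groupby("Pclass")["Survived"].agg(
    survival_rate="mean",
    passengers="count",
)

print(by_class)
```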

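6. Model Training, Tuning, and Evaluation

With the data processed and explored, the next step is to build models. This example uses three: logistic regression, random forest, and XGBoost (if installed). The focus is on splitting the dataset, tuning hyperparameters, cross-validating, and evaluating the models.

6.1 Splitting the Dataset

train_test_split divides the data into a training set and a test set, here 80% for training and 20% for testing, so that the evaluation uses data the model has not seen.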

X = data.drop('Survived', axis=1)
y = data['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
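
The split above is purely random. When the classes are imbalanced, passing stratify keeps the class ratio the same in both parts. The sketch below shows this on synthetic labels (60 zeros and 40 ones, an assumption for illustration); it is an optional variation, not what the article's split does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 40% positive
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 60 + [1] * 40)

# stratify=y_demo preserves the 60/40 ratio in both train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"train positive share: {y_tr.mean():.2f}, test positive share: {y_te.mean():.2f}")
```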

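6.2 Grid Search and Cross-Validation

A train_and_evaluate function wraps the workflow: GridSearchCV searches exhaustively over the parameter combinations, and cross-validation assesses the model's stability. The function prints the best parameters, the mean cross-validation accuracy, the test-set accuracy, and the classification report, and it plots the confusion matrix.

def train_and_evaluate(model, param_grid, model_name):
    # Tune hyperparameters with grid search
    grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
    grid.fit(X_train, y_train)
    best_model = grid.best_estimator_
    print(f"\n{model_name} best parameters:", grid.best_params_)
    
    # Cross-validation score on the training set, as a check on stability
    cv_scores = cross_val_score(best_model, X_train, y_train, cv=5, scoring='accuracy')
    print(f"{model_name} mean cross-validation accuracy: {np.mean(cv_scores):.4f}")
    
    # Test-set evaluation: accuracy, classification report, and confusion matrix
    y_pred = best_model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"{model_name} test accuracy: {acc:.4f}")
    print(f"{model_name} classification report:\n", classification_report(y_test, y_pred))
    
    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
    plt.title(f"{model_name} confusion matrix")
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.show()
    
    return best_model

GridSearchCV: iterates over the preset parameter combinations to find the best one, scoring each with 5-fold cross-validation (cv=5);
cross_val_score: computes cross-validated accuracy, showing how stable the model is across different data splits;
classification_report: prints precision, recall, F1 score, and related metrics for a fuller picture of model performance;
confusion_matrix: shows how the model's predictions fall across the classes.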
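
GridSearchCV already stores the mean cross-validation score of the winning combination in best_score_, so for a deterministic estimator the separate cross_val_score call should report essentially the same number. The sketch below illustrates this on synthetic data from make_classification (an assumption; it does not use the Titanic features).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in data, not the Titanic features
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1, 10]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_demo, y_demo)

# Mean 5-fold accuracy of the best parameter combination, two ways of reading it
best_cv_from_grid = grid.best_score_
same_value = grid.cv_results_["mean_test_score"][grid.best_index_]

# Re-running cross-validation on the best estimator repeats the same folds
rerun_cv = cross_val_score(
    grid.best_estimator_, X_demo, y_demo, cv=5, scoring="accuracy"
).mean()

print(grid.best_params_, best_cv_from_grid, rerun_cv)
```

Reading best_score_ directly saves one round of fitting; the explicit cross_val_score call is still useful when a different fold scheme or metric is wanted.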

6.3 Building and Tuning the Models

Logistic regression

Logistic regression is a basic linear classifier that works well when the classes are approximately linearly separable. This example tunes C, the inverse of the regularization strength (smaller values mean stronger regularization).

# Logistic regression
lr = LogisticRegression(max_iter=1000)
lr_param_grid = {'C': [0.1, 1, 10, 50]}
best_lr = train_and_evaluate(lr, lr_param_grid, "Logistic regression")

# Random forest
rf = RandomForestClassifier(random_state=42)
rf_param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5]
}
best_rf = train_and_evaluate(rf, rf_param_grid, "Random forest")

# XGBoost (only if installed)
if xgb_available:
    # use_label_encoder is omitted: it is deprecated in recent XGBoost releases
    xgb = XGBClassifier(eval_metric='logloss', random_state=42)
    xgb_param_grid = {
        'n_estimators': [100, 200],
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.1, 0.2]
    }
    best_xgb = train_and_evaluate(xgb, xgb_param_grid, "XGBoost")
else:
    best_xgb = None

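Random forest is an ensemble of decision trees. The number of trees (n_estimators), the maximum tree depth (max_depth), and the minimum number of samples required to split a node (min_samples_split) are tuned to limit overfitting.
XGBoost is a gradient-boosted tree ensemble. The grid above tunes the number of trees (n_estimators), the maximum tree depth (max_depth), and the learning rate (learning_rate).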

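7. Model Comparison, Feature Importance, and Interpretation

Once training is complete, the models are compared on the test set, here by accuracy. For models that expose a feature_importances_ attribute (random forest and XGBoost), a feature-importance plot shows how much each feature contributes to the predictions.

# Compare model accuracy
models = {'Logistic regression': best_lr, 'Random forest': best_rf}
if best_xgb is not None:
    models['XGBoost'] = best_xgb

accuracy_dict = {}
for name, model in models.items():
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    accuracy_dict[name] = acc

print("\nTest accuracy by model:")
for name, acc in accuracy_dict.items():
    print(f"{name}: {acc:.4f}")

# Plot feature importance for models that provide it
def plot_feature_importance(model, model_name):
    if hasattr(model, 'feature_importances_'):
        importance = model.feature_importances_
        feature_names = X.columns
        indices = np.argsort(importance)  # ascending, so the top feature is drawn last (at the top)
        plt.figure(figsize=(8, 6))
        plt.title(f"{model_name} feature importance")
        plt.barh(range(len(indices)), importance[indices], align='center')
        plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
        plt.xlabel("Importance score")
        plt.show()

print("\nRandom forest feature importance:")
plot_feature_importance(best_rf, "Random forest")
if best_xgb is not None:
    print("\nXGBoost feature importance:")
    plot_feature_importance(best_xgb, "XGBoost")

The feature-importance plot shows which features contribute most to the model's decisions, which gives a basis for feature selection and further analysis.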
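
The feature_importances_ attribute is computed from the training process and exists only for tree-based models. As a complementary check, scikit-learn's permutation_importance measures how much the held-out score drops when one feature is shuffled, and it works for any fitted estimator. The sketch below runs on synthetic data (an assumption) and is not part of the article's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 4 features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the held-out set and record the accuracy drop
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)

# Feature indices ordered from most to least important
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    print(f"feature {idx}: {result.importances_mean[idx]:.4f} +/- {result.importances_std[idx]:.4f}")
```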

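8. Saving the Model for Deployment

In real projects, a trained model usually needs to be saved so that it can be loaded and deployed later. Here joblib saves the best-performing model (assumed to be the random forest) to a pkl file.

best_model = best_rf  # Assumes the random forest performed best
model_filename = "best_titanic_model.pkl"
joblib.dump(best_model, model_filename)
print(f"\nBest model saved to {model_filename}")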
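
Loading the model back is the other half of the workflow. The sketch below is a self-contained round trip on synthetic data: it trains a small model, saves it to a temporary directory (the file name `demo_model.pkl` is hypothetical), reloads it with joblib.load, and checks that predictions match.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and a small model to save
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(model, path)     # serialize the fitted model
    loaded = joblib.load(path)   # restore it as a ready-to-use estimator

# The restored model should reproduce the original predictions
predictions_match = np.array_equal(model.predict(X_demo), loaded.predict(X_demo))
print("Predictions match after reload:", predictions_match)
```

When the saved model is used later, new data must go through the same preprocessing (fills, encodings, and column order) as the training data, and the file should be loaded with a compatible scikit-learn version.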
