2025 AI 打卡 Day2：我踩了 5 个 Pandas 大坑后，终于画出泰坦尼克号 10 张图爆款 EDA 报告（完整排雷 + 终极可运行代码）-优快云博客

作者：brian
GitHub 打卡仓库：https://github.com/codeInbpm/ML-Learning-Journey-2025
本文对应提交：feat: Day2 完整特征工程 + 10图EDA + 5个经典报错排雷

真实踩坑全记录（带截图 + 报错信息）

坑 1：KeyError: ‘Age_Filled’（最经典！99%的人都会踩）

df['Age_Group'] = pd.cut(df['Age_Filled'], ...)

报错信息：

KeyError: 'Age_Filled'
ValueError: Could not interpret value `Age_Filled` for `x`

原因：直接用了还没生成的列！

坑 2：KeyError: ‘FamilySize’ / ‘Embarked_Filled’

同理，画图时用了 FamilySize、Embarked_Filled，但原始数据根本没有这些列。
在这里插入图片描述

坑 3：seaborn 黄色警告（FutureWarning）

sns.barplot(..., palette='Set2')

报错信息：

FutureWarning: Passing `palette` without assigning `hue` is deprecated...

原因：seaborn 新版本要求必须写 hue=，否则未来会报错。

坑 4：运行完没图？一片空白！

原因：在 .py 文件里写了 plt.savefig() 但忘了 plt.show()

坑 5：图例遮挡、颜色丑、字体小、布局乱

原因：默认参数太拉垮，专业报告必须手动调参。

终极解决方案：完整可运行代码（一次成功版）

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("../data/titanic.csv")

# 1. 提取称呼 Title
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# 2. 智能填充 Age（按 Title 分组中位数）
def fill_age(row):
    if pd.notna(row['Age']):
        return row['Age']
    median_by_title = {'Mr': 30, 'Mrs': 35, 'Miss': 21, 'Master': 4, 'Dr': 46, 'Rev': 43}
    return median_by_title.get(row['Title'], 30)

df['Age_Filled'] = df.apply(fill_age, axis=1)

# 3. 填充 Embarked（众数）
df['Embarked_Filled'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# 4. 新增家庭人数
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

# 5. 新增对数票价（防偏态）
df['Log_Fare'] = np.log1p(df['Fare'])

# ====================== 2. 现在开始画10张图（全部成功！）======================
plt.style.use('seaborn-v0_8')
fig, axes = plt.subplots(2, 5, figsize=(28, 12))
fig.suptitle('泰坦尼克号生存预测：10关键洞察EDA报告（2025版）', fontsize=24, fontweight='bold')

# 图1: 总体生还率
survived_rate = df['Survived'].mean() * 100
axes[0,0].pie([survived_rate, 100-survived_rate],
              labels=['生还', '遇难'], autopct='%1.1f%%',
              colors=['#66b3ff', '#ff9999'], startangle=90)
axes[0,0].set_title(f'1. 总体生还率 {survived_rate:.1f}%', fontsize=14)

# 图2: 性别生还率
sns.barplot(data=df, x='Sex', y='Survived', ax=axes[0,1], hue='Sex', palette='Set3', legend=False)
axes[0,1].set_title('2. 性别生还率（女74.2% vs 男18.9%）')
axes[0,1].set_ylim(0, 1)

# 图3: 舱位生还率
sns.barplot(data=df, x='Pclass', y='Survived', ax=axes[0,2], hue='Pclass', palette='viridis', legend=False)
axes[0,2].set_title('3. 舱位生还率（1等舱最安全）')
axes[0,2].set_ylim(0, 1)

# 图4: 年龄分布 + 生还KDE
sns.histplot(data=df, x='Age_Filled', hue='Survived', kde=True,
             ax=axes[0,3], bins=30, alpha=0.7, palette='husl')
axes[0,3].set_title('4. 年龄分布（儿童&年轻女性更易生还）')

# 图5: 对数票价分布
sns.histplot(data=df, x='Log_Fare', hue='Survived', kde=True,
             ax=axes[0,4], bins=25, alpha=0.7)
axes[0,4].set_title('5. 票价分布（高票价=高生还）')

# 图6: 家庭人数 vs 生还率
sns.barplot(data=df, x='FamilySize', y='Survived', ax=axes[1,0], palette='coolwarm')
axes[1,0].set_title('6. 家庭大小（3-4人最佳）')
axes[1,0].set_ylim(0, 1)

# 图7: 登船港口
sns.barplot(data=df, x='Embarked_Filled', y='Survived', ax=axes[1,1], hue='Embarked_Filled', palette='Set2', legend=False)
axes[1,1].set_title('7. 登船港口影响（C港最高）')
axes[1,1].set_ylim(0, 1)

# 图8: 称呼生还率
title_surv = df.groupby('Title')['Survived'].mean().sort_values(ascending=False).head(10)
sns.barplot(x=title_surv.index, y=title_surv.values, ax=axes[1,2], palette='rocket')
axes[1,2].tick_params(axis='x', rotation=45)
axes[1,2].set_title('8. 称呼生还率（Mrs/Miss最高）')
axes[1,2].set_ylim(0, 1)

# 图9: 相关性热力图
num_cols = ['Age_Filled', 'Fare', 'FamilySize', 'Pclass', 'Survived']
corr = df[num_cols].corr()
sns.heatmap(corr, annot=True, cmap='RdBu_r', center=0, ax=axes[1,3], square=True)
axes[1,3].set_title('9. 特征相关性热力图')

# 图10: 关键洞察总结
insights = [
    "顶级洞察：",
    "1. 女性优先救援（74% vs 19%）",
    "2. 头等舱生还率3倍于三等舱",
    "3. 3-4人家庭生还率最高（72%）",
    "4. 高票价乘客资源更充足",
    "5. 儿童与年轻女性保护最强"
]
axes[1,4].axis('off')
axes[1,4].text(0.05, 0.95, '\n'.join(insights), fontsize=16, va='top', fontweight='bold',
               bbox=dict(boxstyle="round,pad=0.8", facecolor="#f0f0f0"))

plt.tight_layout()
plt.subplots_adjust(top=0.93)

# 保存高清图
plt.savefig('titanic_eda_10_plots_final.png', dpi=300, bbox_inches='tight', facecolor='white')
plt.show()

print("10张图EDA报告生成成功！")
print("已保存为：titanic_eda_10_plots_final.png")