Python训练营打卡——DAY9（2025.4.28）-优快云博客

本文链接：https://blog.youkuaiyun.com/LMH2002/article/details/147576879

一、热力图和子图的绘制

首先走一遍之前完整的流程：数据预处理与可视化分析

1. 读取数据

# 1. 读取数据
import pandas as pd
data  = pd.read_csv('data.csv') # 从CSV文件加载数据集

2. 查看数据

# 2. 查看数据
data.info() # 查看数据集结构信息（列名、非空值数量、数据类型等）

输出结果：

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Id                            7500 non-null   int64  
 1   Home Ownership                7500 non-null   object 
 2   Annual Income                 5943 non-null   float64
 3   Years in current job          7129 non-null   object 
 4   Tax Liens                     7500 non-null   float64
 5   Number of Open Accounts       7500 non-null   float64
 6   Years of Credit History       7500 non-null   float64
 7   Maximum Open Credit           7500 non-null   float64
 8   Number of Credit Problems     7500 non-null   float64
 9   Months since last delinquent  3419 non-null   float64
 10  Bankruptcies                  7486 non-null   float64
 11  Purpose                       7500 non-null   object 
 12  Term                          7500 non-null   object 
 13  Current Loan Amount           7500 non-null   float64
 14  Current Credit Balance        7500 non-null   float64
 15  Monthly Debt                  7500 non-null   float64
 16  Credit Score                  5943 non-null   float64
 17  Credit Default                7500 non-null   int64  
dtypes: float64(12), int64(2), object(4)
memory usage: 1.0+ MB

data.head(5) # 显示前5行数据，初步了解数据格式

输出结果：

Id	Home Ownership	Annual Income	Years in current job	Tax Liens	Years of Credit History	Maximum Open Credit	Number of Credit Problems	Months since last delinquent	Bankruptcies	Purpose	Term	Current Loan Amount	Current Credit Balance	Monthly Debt	Credit Score	Credit Default
0	0	Own Home	482087.0	NaN	11.0	26.3	685960.0	1.0	NaN	1.0	debt consolidation	Short Term	99999999.0	47386.0	7914.0	749.0	0
1	1	Own Home	1025487.0	10+ years	15.0	15.3	1181730.0	0.0	NaN	0.0	debt consolidation	Long Term	264968.0	394972.0	18373.0	737.0	1
2	2	Home Mortgage	751412.0	8 years	11.0	35.0	1182434.0	0.0	NaN	0.0	debt consolidation	Short Term	99999999.0	308389.0	13651.0	742.0	0
3	3	Own Home	805068.0	6 years	8.0	22.5	147400.0	1.0	NaN	1.0	debt consolidation	Short Term	121396.0	95855.0	11338.0	694.0	0
4	4	Rent	776264.0	8 years	13.0	13.6	385836.0	1.0	NaN	0.0	debt consolidation	Short Term	125840.0	93309.0	7180.0	719.0	0

3. 数据预处理

# 3. 数据预处理 - 分类变量编码
# 查看需要转换的列的取值分布
print("\n当前工作年限分布:")
print(data["Years in current job"].value_counts())
print("\n房屋所有权状况分布:")
print(data["Home Ownership"].value_counts())

输出结果：

当前工作年限分布:
Years in current job
10.0 2332
2.0 705
3.0 620
0.0 563
5.0 516
1.0 504
4.0 469
6.0 426
7.0 396
8.0 339
9.0 259
Name: count, dtype: int64

房屋所有权状况分布:
Home Ownership
0 3637
1 3204
2 647
3 12
Name: count, dtype: int64

# 定义分类变量映射规则
mappings = {
    "Years in current job": {
        "10+ years": 10, "9 years": 9, "8 years": 8, "7 years": 7,
        "6 years": 6, "5 years": 5, "4 years": 4, "3 years": 3,
        "2 years": 2, "1 year": 1, "< 1 year": 0  # 将文本描述转换为数值
    },
    "Home Ownership": {
        "Home Mortgage": 0,  # 有房贷
        "Rent": 1,           # 租房
        "Own Home": 2,        # 自有住房
        "Have Mortgage": 3    # 有抵押贷款（与房贷不同）
    }
}

# 执行映射转换
data["Years in current job"] = data["Years in current job"].map(
    mappings["Years in current job"])  # 转换工作年限
data["Home Ownership"] = data["Home Ownership"].map(
    mappings["Home Ownership"])        # 转换房屋所有权

# 验证转换结果
data.info()  # 检查转换后的数据类型
data.head()  # 查看转换后的前几行数据

输出结果：

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Id                            7500 non-null   int64  
 1   Home Ownership                0 non-null      float64
 2   Annual Income                 5943 non-null   float64
 3   Years in current job          0 non-null      float64
 4   Tax Liens                     7500 non-null   float64
 5   Number of Open Accounts       7500 non-null   float64
 6   Years of Credit History       7500 non-null   float64
 7   Maximum Open Credit           7500 non-null   float64
 8   Number of Credit Problems     7500 non-null   float64
 9   Months since last delinquent  3419 non-null   float64
 10  Bankruptcies                  7486 non-null   float64
 11  Purpose                       7500 non-null   object 
 12  Term                          7500 non-null   object 
 13  Current Loan Amount           7500 non-null   float64
 14  Current Credit Balance        7500 non-null   float64
 15  Monthly Debt                  7500 non-null   float64
 16  Credit Score                  5943 non-null   float64
 17  Credit Default                7500 non-null   int64  
dtypes: float64(14), int64(2), object(2)
memory usage: 1.0+ MB

Id	Home Ownership	Annual Income	Years in current job	Tax Liens	Years of Credit History	Maximum Open Credit	Number of Credit Problems	Months since last delinquent	Bankruptcies	Purpose	Term	Current Loan Amount	Current Credit Balance	Monthly Debt	Credit Score	Credit Default
0	0	NaN	482087.0	NaN	11.0	26.3	685960.0	1.0	NaN	1.0	debt consolidation	Short Term	99999999.0	47386.0	7914.0	749.0	0
1	1	NaN	1025487.0	NaN	15.0	15.3	1181730.0	0.0	NaN	0.0	debt consolidation	Long Term	264968.0	394972.0	18373.0	737.0	1
2	2	NaN	751412.0	NaN	11.0	35.0	1182434.0	0.0	NaN	0.0	debt consolidation	Short Term	99999999.0	308389.0	13651.0	742.0	0
3	3	NaN	805068.0	NaN	8.0	22.5	147400.0	1.0	NaN	1.0	debt consolidation	Short Term	121396.0	95855.0	11338.0	694.0	0
4	4	NaN	776264.0	NaN	13.0	13.6	385836.0	1.0	NaN	0.0	debt consolidation	Short Term	125840.0	93309.0	7180.0	719.0	0

4. 连续特征相关性分析

import seaborn as sns
import matplotlib.pyplot as plt

# 定义连续型特征列
continuous_features = [
    'Annual Income', 'Years in current job', 'Tax Liens',
    'Number of Open Accounts', 'Years of Credit History',
    'Maximum Open Credit', 'Number of Credit Problems',
    'Months since last delinquent', 'Bankruptcies',
    'Current Loan Amount', 'Current Credit Balance', 'Monthly Debt',
    'Credit Score'
]

# 计算相关系数矩阵
correlation_matrix = data[continuous_features].corr()

# 设置图形参数
plt.rcParams['figure.dpi'] = 300  # 设置输出分辨率

# 绘制热力图
plt.figure(figsize=(12, 10))  # 设置画布大小
sns.heatmap(
    correlation_matrix,
    annot=True,        # 在格子中显示数值
    cmap='coolwarm',   # 使用蓝-红渐变色
    vmin=-1, vmax=1    # 固定颜色映射范围[-1,1]
)
plt.title('连续特征相关性热力图')  # 添加标题
plt.show()

输出结果：

5. 分布分析-箱线图（手动版）

# 5. 分布分析 - 箱线图（手动版）
features = ['Annual Income', 'Years in current job', 
           'Tax Liens', 'Number of Open Accounts']

plt.rcParams['figure.dpi'] = 300  # 保持分辨率设置

# 创建2x2的子图布局
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# 第一个子图（左上）
i = 0
feature = features[i]
axes[0, 0].boxplot(data[feature].dropna())  # 删除缺失值后绘制
axes[0, 0].set_title(f'{feature}箱线图')
axes[0, 0].set_ylabel(feature)

# 第二个子图（右上）
i = 1
feature = features[i]
axes[0, 1].boxplot(data[feature].dropna())
axes[0, 1].set_title(f'{feature}箱线图')
axes[0, 1].set_ylabel(feature)

# 第三个子图（左下）
i = 2
feature = features[i]
axes[1, 0].boxplot(data[feature].dropna())
axes[1, 0].set_title(f'{feature}箱线图')
axes[1, 0].set_ylabel(feature)

# 第四个子图（右下）
i = 3
feature = features[i]
axes[1, 1].boxplot(data[feature].dropna())
axes[1, 1].set_title(f'{feature}箱线图')
axes[1, 1].set_ylabel(feature)

plt.tight_layout()  # 自动调整子图间距
plt.show()

输出结果：

6. 分布分析-箱线图（循环优化版）

# 6. 分布分析 - 箱线图（循环优化版）
features = ['Annual Income', 'Years in current job',
           'Tax Liens', 'Number of Open Accounts']

plt.rcParams['figure.dpi'] = 300

# 创建子图布局（返回figure对象和axes二维数组）
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# 循环绘制每个特征
for i, feature in enumerate(features):
    row = i // 2  # 计算行索引：0//2=0, 1//2=0, 2//2=1, 3//2=1
    col = i % 2   # 计算列索引：0%2=0, 1%2=1, 2%2=0, 3%2=1
    
    # 在对应位置绘制箱线图
    axes[row, col].boxplot(data[feature].dropna())
    axes[row, col].set_title(f'{feature}箱线图')
    axes[row, col].set_ylabel(feature)

plt.tight_layout()  # 自动调整子图间距
plt.show()

输出结果：

7. enumerate()函数

最后介绍一个非常实用的函数，我们后面会经常看到

enumerate()函数：enumerate()函数返回一个迭代对象，该对象包含索引和值。

语法：enumerate(iterable, start=0)

参数：iterable -- 迭代对象，迭代对象可以是列表、元组、字典、字符串等。 start -- 索引的开始值

返回值：返回一个迭代对象，该对象包含索引和值。

这个函数之所以很有用，是因为它允许我们同时迭代一个序列，并获取每个元素的索引和值。

features = ['Annual Income', 'Years in current job', 'Tax Liens', 'Number of Open Accounts']

for i, feature in enumerate(features):
    print(f"索引 {i} 对应的特征是: {feature}")

输出结果：

索引 0 对应的特征是: Annual Income
索引 1 对应的特征是: Years in current job
索引 2 对应的特征是: Tax Liens
索引 3 对应的特征是: Number of Open Accounts

# 定义要绘制的特征
features = ['Annual Income', 'Years in current job', 'Tax Liens', 'Number of Open Accounts']

# 设置图片清晰度
plt.rcParams['figure.dpi'] = 300

# 创建一个包含 2 行 2 列的子图布局，其中
fig, axes = plt.subplots(2, 2, figsize=(12, 8))#返回一个Figure对象和Axes对象
# 这里的axes是一个二维数组，包含2行2列的子图
# 这里的fig是一个Figure对象，表示整个图形窗口
# 你可以把fig想象成一个画布，axes就是在这个画布上画的图形

# 遍历特征并绘制箱线图
for i, feature in enumerate(features):
    row = i // 2
    col = i % 2
    axes[row, col].boxplot(data[feature].dropna())
    axes[row, col].set_title(f'Boxplot of {feature}')
    axes[row, col].set_ylabel(feature)

# 调整子图之间的间距
plt.tight_layout()

# 显示图形
plt.show()

输出结果：

二、作业

更换心脏病数据集绘制热力图和单特征分布图

# 1. 读取数据
import pandas as pd
data = pd.read_csv('heart.csv')  # 加载数据集

# 2. 数据探索
print("数据集结构信息:")
data.info()  # 查看列名、数据类型和非空值数量
print("\n前5行数据:")
print(data.head(5))  # 展示数据格式

# 3. 数据预处理 - 检查分类变量（本数据集分类变量已为数值型，无需额外编码）
# 例如：sex、cp、fbs等列已经是数值标签，通常不需要转换
# 若需调整分类变量编码规则，可在此添加映射逻辑

# 4. 连续特征相关性分析
import seaborn as sns
import matplotlib.pyplot as plt

# 定义连续型特征列（根据数据集列名选择）
continuous_features = [
    'age',         # 年龄
    'trestbps',    # 静息血压
    'chol',        # 血清胆固醇
    'thalach',     # 最大心率
    'oldpeak'      # 运动诱发ST段压低值
]

# 计算相关系数矩阵
correlation_matrix = data[continuous_features].corr()

# 设置图形参数
plt.rcParams['figure.dpi'] = 300  # 高分辨率输出
plt.rcParams['font.sans-serif'] = ['SimHei']  # 支持中文显示
plt.rcParams['axes.unicode_minus'] = False   # 解决负号显示问题

# 绘制热力图
plt.figure(figsize=(10, 8))
sns.heatmap(
    correlation_matrix,
    annot=True,        # 显示数值
    cmap='coolwarm',   # 蓝-红渐变色
    vmin=-1, vmax=1,   # 固定颜色范围
    fmt=".2f"         # 数值格式保留两位小数
)
plt.title('连续特征相关性热力图')
plt.show()

# 5. 分布分析 - 箱线图（循环优化版）
features = [
    'age', 
    'trestbps', 
    'chol', 
    'oldpeak'  # 选择需要分析的4个特征
]

# 创建2x2子图
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# 循环绘制箱线图
for i, feature in enumerate(features):
    row = i // 2  # 行索引：0,0,1,1
    col = i % 2   # 列索引：0,1,0,1
    
    # 绘制并设置标题和坐标轴
    axes[row, col].boxplot(data[feature].dropna())
    axes[row, col].set_title(f'{feature}分布')
    axes[row, col].set_ylabel(feature)

plt.tight_layout()  # 自动调整子图间距
plt.show()

输出结果：

@浙大疏锦行