Python for Data Analysis, 2nd Edition: Chapter 9, Plotting and Visualization

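These notes introduce the basics of two Python visualization libraries, matplotlib and seaborn: the plotting API, working with Figure and Subplot objects, adjusting spacing, setting colors, markers, and line styles, and plotting efficiently with pandas and seaborn.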

Chapter 9: Plotting and Visualization

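Information visualization (also called plotting) is one of the most important tasks in data analysis. In some projects, an interactive visualization is the final deliverable. Python has many libraries for static and dynamic visualization; these notes focus mainly on matplotlib.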

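The matplotlib project was started by John Hunter in 2002 to provide a MATLAB-like plotting interface for Python. The matplotlib and IPython communities have collaborated to simplify interactive plotting from the IPython shell (and now the Jupyter notebook). matplotlib supports many GUI backends on all major operating systems and can export figures to common vector and raster formats: PDF, SVG, JPG, PNG, BMP, GIF, and so on.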

seaborn is a visualization library built on top of matplotlib.

9.1 A Brief matplotlib API Primer

Importing matplotlib
import matplotlib.pyplot as plt

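The book does not cover matplotlib's features in depth, but it is enough to get you started. The matplotlib gallery and documentation are the best resources for learning advanced features.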

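Figure and Subplot
  • Plots in matplotlib live inside a Figure object. You can create a new Figure with plt.figure. (In my session an empty figure is reported as <Figure size 720x432 with 0 Axes>; the size depends on the figure.figsize and figure.dpi settings.)
  • Drawing several plots (several subplots in one figure)
    • add_subplot()
    • subplots()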
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 2)
ax3 = fig.add_subplot(2, 2, 3)
ax3.plot(np.random.randn(50).cumsum(), 'k--')
ax1.hist(np.random.randn(100), bins=20, color='k', alpha=0.3)
ax2.scatter(np.arange(30), np.arange(30) + 3 * np.random.randn(30))


fig, axes = plt.subplots(2, 3)
axes


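🕸plt.subplots() creates a new Figure and returns a NumPy array containing the created subplot objects. This is very convenient because the axes array can be indexed like a two-dimensional array, for example axes[0, 1]. You can also use sharex and sharey to make the subplots share the same x-axis or y-axis. (Many advanced layouts can be built this way, but I am not going further into it here.)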
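A minimal sketch of the indexing and axis sharing described above. The random data, the seed, and the Agg backend line (only there so the snippet runs without a display) are my own assumptions, not part of the book's example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # hypothetical seed, for reproducibility only
fig, axes = plt.subplots(2, 3, sharex=True, sharey=True)

# axes is a 2-D NumPy array of Axes objects, indexed as [row, column]
axes[0, 1].hist(rng.standard_normal(200), bins=20, color="k", alpha=0.5)

# Because sharex=True, changing the x-limits on one subplot changes all of them
axes[0, 0].set_xlim(-4, 4)
```

After this runs, axes.shape is (2, 3) and every subplot reports the same x-limits.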

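🕸The table below lists the parameters of pyplot.subplots()

Parameter    Description
nrows        Number of rows of subplots
ncols        Number of columns of subplots
sharex       All subplots use the same x-axis (adjusting xlim affects all subplots)
sharey       All subplots use the same y-axis (adjusting ylim affects all subplots)
subplot_kw   Dict of keywords used to create each subplot
**fig_kw     Additional keywords used when creating the figure, e.g. plt.subplots(2, 2, figsize=(8, 6)); worth trying out
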
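Adjusting the Spacing Around Subplots
  • By default, matplotlib leaves padding around the outside of the subplots and spacing between them. The spacing is specified relative to the height and width of the figure, so it adjusts automatically when you resize the figure, whether programmatically or by dragging the GUI window. You can change the spacing with the Figure method subplots_adjust, which is also available as a top-level function (just as value_counts() in Chapter 5 is also available as a top-level pandas function).
  • pyplot.subplots_adjust()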
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
for i in range(2):
    for j in range(2):
        axes[i, j].hist(np.random.randn(500), bins=50, color='k', alpha=0.5)
plt.subplots_adjust(wspace=0, hspace=0)

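🕸In the resulting figure there is no spacing between the subplots, and the axis labels can overlap. matplotlib does not check for overlapping labels, so in a case like this you need to set the tick locations and tick labels yourself.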
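One way to handle the ticks by hand, as a sketch only: remove the spacing, then give every subplot the same small set of explicit tick locations. The tick values [-2, 0, 2] and the random data are arbitrary choices of mine.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
for ax in axes.flat:
    ax.hist(rng.standard_normal(500), bins=50, color="k", alpha=0.5)

# Remove all horizontal and vertical space between the subplots
fig.subplots_adjust(wspace=0, hspace=0)

# Explicit, sparse tick locations keep neighbouring labels from colliding
for ax in axes.flat:
    ax.set_xticks([-2, 0, 2])
```

The current spacing can be read back from fig.subplotpars.wspace and fig.subplotpars.hspace.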

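Colors, Markers, and Line Styles
  • matplotlib's plot function accepts arrays of x and y coordinates and, optionally, a string abbreviation indicating color and line style.
  • Properties accepted by ax.plot(): color, label, drawstyle, linestyle, marker, and so on.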
plt.figure()
data = np.random.randn(30).cumsum()
plt.plot(data, 'k--', label='Default')
plt.plot(data, 'k-', drawstyle='steps-post', label='steps-post')
plt.legend(loc='best')
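To make the string abbreviation concrete, here is a small sketch that draws the same data with the shorthand and with explicit keywords, then reads the properties back from the returned Line2D objects. The data and variable names are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

data = np.random.default_rng(1).standard_normal(30).cumsum()
fig, ax = plt.subplots()

# Shorthand: 'g--' means green, dashed
short, = ax.plot(data, "g--")
# The same style spelled out with keywords
explicit, = ax.plot(data + 1, linestyle="--", color="g")
# 'ko--' adds a marker: black, circle markers, dashed line
marked, = ax.plot(data + 2, "ko--")
```

The shorthand is quick for interactive work; the keyword form is easier to read in code that will be maintained.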

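Ticks and Labels

The pyplot interface is designed for interactive use and provides functions such as xlim, xticks, and xticklabels. They control the plot range, the tick locations, and the tick labels, respectively. (On an Axes object the corresponding methods are ax.set_xlim, ax.set_xticks, and ax.set_xticklabels.)

Property      Description
xlim          Plot range
xticks        Tick locations
xticklabels   Tick labels
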
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(np.random.randn(1000).cumsum())
ticks = ax.set_xticks([0, 250, 500, 750, 1000])
labels = ax.set_xticklabels(['one', 'two', 'three', 'four', 'five'],
                            rotation=30, fontsize='small')
ax.set_title('My first matplotlib plot')
ax.set_xlabel('Stages')
# The two calls above can be replaced by passing a dict of properties to ax.set:
# props = {'title': 'My first matplotlib plot', 'xlabel': 'Stages'}
# ax.set(**props)
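The range functions work as both getters and setters. A minimal sketch, with made-up data and tick labels: called with no arguments, plt.xlim() returns the current range of the active axes; ax.set_xlim and ax.get_xlim do the same on a specific Axes object.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
ax.plot(np.arange(10))

ax.set_xlim(0, 10)            # set the plot range on this Axes
ax.set_xticks([0, 5, 10])     # choose the tick locations
ax.set_xticklabels(["start", "middle", "end"], rotation=30)

current = ax.get_xlim()       # read the range back from the Axes
same_via_pyplot = plt.xlim()  # pyplot acts on the most recently created Axes
```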

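Legends
  • A legend entry comes from a plot element's label, and the legend itself is drawn by calling legend(). Pass label= when plotting and then call ax.legend() (or plt.legend()); with only one of the two, no legend appears.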
from numpy.random import randn
import matplotlib.pyplot as plt
fig = plt.figure(); ax = fig.add_subplot(1, 1, 1)
ax.plot(randn(1000).cumsum(), 'k', label='one')
ax.plot(randn(1000).cumsum(), 'k--', label='two')
ax.plot(randn(1000).cumsum(), 'k.', label='three')
ax.legend(loc='best')
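A related detail, sketched with my own random data: a line whose label starts with an underscore, such as '_nolegend_', is left out of the legend.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
fig, ax = plt.subplots()
ax.plot(rng.standard_normal(100).cumsum(), "k", label="one")
ax.plot(rng.standard_normal(100).cumsum(), "k--", label="two")
# A label beginning with an underscore is skipped when the legend is built
ax.plot(rng.standard_normal(100).cumsum(), "k.", label="_nolegend_")

legend = ax.legend(loc="best")
labels = [text.get_text() for text in legend.get_texts()]
```

Here labels holds only the entries for the first two lines.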

Annotations and Drawing on a Subplot
  • ax.annotate()

🕸Plotting directly from a Series

from datetime import datetime
import matplotlib.pyplot as plt
import pandas as pd
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)

data = pd.read_csv('examples/spx.csv', index_col=0, parse_dates=True)
spx = data['SPX']

spx.plot(ax=ax, style='k-')  # note to self: could this line cause a problem?

crisis_data = [
    (datetime(2007, 10, 11), 'Peak of bull market'),
    (datetime(2008, 3, 12), 'Bear Stearns Fails'),
    (datetime(2008, 9, 15), 'Lehman Bankruptcy')
]

for date, label in crisis_data:
    ax.annotate(label, xy=(date, spx.asof(date) + 75),
                xytext=(date, spx.asof(date) + 225),
                arrowprops=dict(facecolor='black', headwidth=4, width=2,
                                headlength=4),
                horizontalalignment='left', verticalalignment='top')

# Zoom in on 2007-2010
ax.set_xlim(['1/1/2007', '1/1/2011'])
ax.set_ylim([600, 1800])

ax.set_title('Important dates in the 2008-2009 financial crisis')

🕸File location

🕸This is very practical. I use annotations when working with time series 🙂😄😃😸
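If examples/spx.csv is not at hand, the same annotation pattern can be tried on synthetic data. This is only a sketch: the series, the event date, and the label text are hypothetical, and the line is drawn with ax.plot rather than Series.plot.

```python
from datetime import datetime

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical daily series standing in for the SPX data
index = pd.date_range("2008-01-01", periods=250, freq="B")
values = np.random.default_rng(2).standard_normal(250).cumsum() + 100
series = pd.Series(values, index=index)

fig, ax = plt.subplots()
ax.plot(series.index, series.values, "k-")

event = datetime(2008, 9, 15)
y = series.asof(event)  # last available value at or before the event date

# xy is the point the arrow points at; xytext is where the label is placed
annotation = ax.annotate("Hypothetical event",
                         xy=(event, y + 1), xytext=(event, y + 5),
                         arrowprops=dict(facecolor="black", headwidth=4,
                                         width=2, headlength=4),
                         horizontalalignment="left", verticalalignment="top")
```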

Saving Plots to File
  • plt.savefig()

plt.savefig('figpath.svg')

plt.savefig('figpath.png', dpi=400, bbox_inches='tight')  # dpi and bbox_inches are the options to note here

from io import BytesIO 
buffer = BytesIO() 
plt.savefig(buffer) 
plot_data = buffer.getvalue()

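🕸The two options used most often when saving figures are dpi, which controls the dots-per-inch resolution, and bbox_inches, which can trim the whitespace around the actual figure.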
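A small sketch of these options when writing to an in-memory buffer instead of a file. Since a buffer has no file extension, the format is passed explicitly; the plotted data is arbitrary.

```python
from io import BytesIO

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

# Raster output: dpi sets the resolution, bbox_inches='tight' trims whitespace
png_buffer = BytesIO()
fig.savefig(png_buffer, format="png", dpi=100, bbox_inches="tight")
png_bytes = png_buffer.getvalue()

# Vector output: dpi does not affect the vector elements
svg_buffer = BytesIO()
fig.savefig(svg_buffer, format="svg")
svg_bytes = svg_buffer.getvalue()
```

The resulting bytes can be written to disk, sent over a network, or embedded in a web page.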

matplotlib Configuration
plt.rc('figure', figsize=(10, 10))

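# matplotlib expects a number for font.size; the string 'small' raises a ValueError.
# The value 10 is an arbitrary example.
font_options = {'family': 'monospace',
                'weight': 'bold',
                'size': 10}  # worth studying further and tuning
plt.rc('font', **font_options)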
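The plt.rc calls above change settings for the whole session. When a setting should apply to only one figure, plt.rc_context offers a temporary alternative. A minimal sketch, with parameter values chosen only for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

default_width = plt.rcParams["lines.linewidth"]

# Settings passed to rc_context apply only inside the with block
with plt.rc_context({"lines.linewidth": 4, "figure.figsize": (10, 10)}):
    inside_width = plt.rcParams["lines.linewidth"]
    fig, ax = plt.subplots()
    line, = ax.plot([0, 1], [0, 1])

# On exit the previous values are restored
after_width = plt.rcParams["lines.linewidth"]
```

plt.rcdefaults() is the other direction: it resets every setting to matplotlib's built-in defaults.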
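  • This is a convenient way to configure figure styles when writing a paper.

  • The first argument to rc (matplotlib resource configurations) is the component you want to customize, such as 'figure', 'axes', 'xtick', 'ytick', 'grid', or 'legend'.

  • See matplotlib's configuration file matplotlibrc [1] (located in the matplotlib/mpl-data directory). 😄 Worth studying: it sets plot styles globally.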

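  • Configure properties through the rc() function. The following forms are equivalent (given import matplotlib, import matplotlib as mpl, and import matplotlib.pyplot as plt):

    mpl.rc('lines', linewidth=4, color='g')

    plt.rc('lines', linewidth=4, color='g')

    matplotlib.rc('lines', linewidth=4, color='g')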

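9.2 Plotting with pandas and seaborn

  • seaborn is a statistical graphics library created by Michael Waskom.

  • matplotlib is actually a fairly low-level tool. You assemble a plot from its basic components: the data display (that is, the plot type: line, bar, box, scatter, contour, and so on), the legend, the title, tick labels, and other annotations.

  • In pandas, we may have multiple columns of data along with row and column labels. pandas itself has built-in methods that simplify creating visualizations from DataFrame and Series objects.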

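Series.plot method arguments
Argument         Description
label            Label for the plot legend
ax               matplotlib subplot object to plot on; if nothing is passed, the active matplotlib subplot is used
style            Line style string, such as 'ko--'
alpha            Opacity (from 0 to 1)
color            Color (I like cyan)
kind             Plot type: 'line', 'bar', 'barh', 'kde', and others (df.plot() is equivalent to df.plot.line())
logy             Use logarithmic scaling on the y-axis
use_index        Use the object's index for tick labels
rot              Rotation of tick labels
xticks, yticks   Values to use for the x- and y-axis ticks
xlim, ylim       Axis limits
grid             Display the axis grid

DataFrame.plot method arguments
Argument       Description
subplots       Plot each DataFrame column in a separate subplot
sharex         Share the same x-axis
sharey         Share the same y-axis
figsize        Figure size as a tuple, e.g. (12, 8) or (8, 8)
title          Plot title
legend         Add a subplot legend
sort_columns   Plot columns in alphabetical order; by default the existing column order is used (deprecated in pandas 1.5 and removed in pandas 2.0)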
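A short sketch that exercises several of the DataFrame.plot arguments from the table at once. The data is random and the title text is made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame(rng.standard_normal((10, 4)).cumsum(0),
                  columns=["A", "B", "C", "D"],
                  index=np.arange(0, 100, 10))

# subplots=True draws each column on its own subplot and returns an array of Axes
axes = df.plot(subplots=True, sharex=True, figsize=(8, 8),
               title="One subplot per column", legend=True)
```

With subplots=True the return value is an array with one Axes per column, so each panel can still be customized afterwards.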

🕸For plotting time series, see also Chapter 11. I have read most of Chapter 11 already and still need to write up a summary. 😸

Line Plots
df = pd.DataFrame(np.random.randn(10, 4).cumsum(0),
                  columns=['A', 'B', 'C', 'D'],
                  index=np.arange(0, 100, 10))
df.plot()  # DataFrame plot; the earlier example plotted a Series

Bar Plots
  • plot.bar() and plot.barh(), drawn onto axes created with subplots()
fig, axes = plt.subplots(2, 1)
data = pd.Series(np.random.rand(16), index=list('abcdefghijklmnop'))  
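# Tip: compute frequency counts as a Series, set its index, then plot it directly
data.plot.bar(ax=axes[0], color='k', alpha=0.7)
data.plot.barh(ax=axes[1], color='k', alpha=0.7)
# Another example: with a DataFrame, the values in each row are grouped into side-by-side bars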
df = pd.DataFrame(np.random.rand(6, 4),
                  index=['one', 'two', 'three', 'four', 'five', 'six'],
                  columns=pd.Index(['A', 'B', 'C', 'D'], name='Genus'))
df
df.plot.bar()


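Stacked Bar Chart
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(6, 4),
                  index=['one', 'two', 'three', 'four', 'five', 'six'],
                  columns=pd.Index(['A', 'B', 'C', 'D'], name='Genus'))
df
# stacked=True stacks each row's values into a single bar.
# sort_columns was removed in pandas 2.0, so the columns are sorted explicitly instead.
df.sort_index(axis=1).plot.bar(figsize=(8, 8), title='stacking barchart', rot=20, stacked=True)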


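🕸Bar plots: a handy recipe is to visualize the frequency of each value in a Series with value_counts, for example s.value_counts().plot.bar(). (I used this to draw a bar chart of counts by company type.)

🕸I just tried this today and it works well, although my case was more complicated. 😢 It was probably my first attempt, so my approach was not very clear.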
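The value_counts recipe in a self-contained form. The company-type labels are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical categorical data: one company type per record
s = pd.Series(["private", "state", "private", "foreign", "private", "state"])

# value_counts returns a Series of frequencies, sorted from most to least common
counts = s.value_counts()

fig, ax = plt.subplots()
counts.plot.bar(ax=ax, color="k", alpha=0.7, rot=0)

# One bar per distinct value, with the bar height equal to its count
heights = [patch.get_height() for patch in ax.patches]
```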


🕸A small crosstab example (this draws on material from Chapter 8). The code is as follows:

tips = pd.read_csv('examples/tips.csv')
party_counts = pd.crosstab(tips['day'], tips['size'])
party_counts
# Not many 1- and 6-person parties
party_counts = party_counts.loc[:, 2:5]
# Normalize each row to sum to 1
party_pcts = party_counts.div(party_counts.sum(axis=1), axis=0)
party_pcts
plt.figure()
party_pcts.plot.bar()
import seaborn as sns
tips['tip_pct'] = tips['tip'] / (tips['total_bill'] - tips['tip'])
tips.head()
# Horizontal bars of the mean tip_pct per day; the error bars show the 95% confidence interval
sns.barplot(x='tip_pct', y='day', data=tips, orient='h')
# The same plot split by time of day via hue
sns.barplot(x='tip_pct', y='day', hue='time', data=tips, orient='h')
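The row normalization step is the part that is easy to get wrong, so here it is in isolation on a tiny made-up table. The sketch also shows that pd.crosstab can do the same normalization itself through normalize='index'.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the day and size columns of tips.csv
tips = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat", "Sat", "Sun", "Sun", "Sun"],
    "size": [2, 3, 2, 2, 4, 2, 3, 3],
})

counts = pd.crosstab(tips["day"], tips["size"])

# Divide every row by its own total so that each row sums to 1
pcts = counts.div(counts.sum(axis=1), axis=0)

# Equivalent result computed directly by crosstab
pcts_alt = pd.crosstab(tips["day"], tips["size"], normalize="index")
```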
Histograms and Density Plots

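A histogram is a kind of bar plot that gives a discretized display of value frequency. The data points are split into discrete, evenly spaced bins, and the number of data points in each bin is plotted.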

#histograms
plt.figure()
tips['tip_pct'].plot.hist(bins=50)
#______________
#density plots
tips['tip_pct'].plot.density()
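To see the binning that a histogram performs, np.histogram returns the counts and bin edges without drawing anything. A sketch on random data (not the tips dataset):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

data = np.random.default_rng(5).standard_normal(500)

# counts[i] is the number of points falling between edges[i] and edges[i + 1]
counts, edges = np.histogram(data, bins=50)

# ax.hist does the same binning and also draws the bars
fig, ax = plt.subplots()
n, bin_edges, patches = ax.hist(data, bins=50, color="k", alpha=0.5)
```

With bins=50 there are 51 edges, all bins have the same width, and the counts add up to the number of data points.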
Scatter or Point Plots
macro = pd.read_csv('examples/macrodata.csv')
data = macro[['cpi', 'm1', 'tbilrate', 'unemp']]
trans_data = np.log(data).diff().dropna()
trans_data[-5:]
plt.figure()
# Scatter plot with a fitted linear regression line
# (recent seaborn versions require x and y as keyword arguments)
sns.regplot(x='m1', y='unemp', data=trans_data)
plt.title('Changes in log %s versus log %s' % ('m1', 'unemp'))
# In exploratory data analysis it helps to look at all the scatter plots among a group
# of variables at once; this is known as a pairs plot or scatter plot matrix
sns.pairplot(trans_data, diag_kind='kde', plot_kws={'alpha': 0.2})

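🕸plot_kws passes configuration options down to the individual plotting calls on the off-diagonal elements. See the seaborn.pairplot docstring for more detailed options. (I do not need them yet.)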
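pandas also ships its own scatter plot matrix, which can serve as a rough equivalent when seaborn is not installed. This is a swapped-in alternative, not the book's method, and the data below is random, reusing the macro column names only as labels.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Random stand-in data; the column names are borrowed from macrodata.csv
rng = np.random.default_rng(6)
data = pd.DataFrame(rng.standard_normal((100, 4)),
                    columns=["cpi", "m1", "tbilrate", "unemp"])

# One scatter plot per pair of columns, with histograms on the diagonal
axes = scatter_matrix(data, alpha=0.2, diagonal="hist", figsize=(8, 8))
```

The result is a 4 x 4 array of Axes, one for each pair of variables.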

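Facet Grids and Categorical Data

One way to visualize data with many categorical variables is to use a facet grid. seaborn has a useful built-in function for this that simplifies making many kinds of faceted plots. The book calls it factorplot; it was renamed to catplot in seaborn 0.9 and the old name has since been removed, so the code below uses catplot. (This is interesting and worth trying in a data analysis project as a novel touch.)

# Continuing from the code above
sns.catplot(x='day', y='tip_pct', hue='time', col='smoker',
            kind='bar', data=tips[tips.tip_pct < 1])
sns.catplot(x='tip_pct', y='day', kind='box',
            data=tips[tips.tip_pct < 0.5])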

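You can create your own facet grids with the more general seaborn.FacetGrid class. See the seaborn documentation (https://seaborn.pydata.org/). I do not need it right now, but it deserves another look when I do more visualization work. 😙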
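To see what a facet grid does conceptually, the same idea can be sketched by hand with plain pandas and matplotlib: split the data by one categorical variable and draw one panel per group. This is not how seaborn implements FacetGrid, and the tips-like data is made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Hypothetical stand-in for the tips data
tips = pd.DataFrame({
    "day": ["Thur", "Fri", "Sat", "Sun"] * 4,
    "smoker": ["No"] * 8 + ["Yes"] * 8,
    "tip_pct": rng.uniform(0.05, 0.3, 16),
})

groups = tips.groupby("smoker")
fig, axes = plt.subplots(1, groups.ngroups, sharey=True, figsize=(8, 4))

# One panel (facet) per smoker value, each showing the mean tip_pct by day
for ax, (smoker, subset) in zip(axes, groups):
    subset.groupby("day")["tip_pct"].mean().plot.bar(ax=ax, color="k",
                                                     alpha=0.7, rot=0)
    ax.set_title(f"smoker = {smoker}")
```

seaborn's catplot and FacetGrid automate this loop and add shared legends and consistent styling.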

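9.3 Other Python Visualization Tools

With tools such as Bokeh (https://bokeh.pydata.org/en/latest/) and Plotly (https://github.com/plotly/plotly.py), it is now possible to create dynamic, interactive graphics for the web browser.

There is also Baidu's ECharts, whose charts I find really beautiful! ⛄️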

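9.4 Scientific Plotting with SciencePlots

  • Clone SciencePlots ( https://github.com/garrettj403/SciencePlots ) locally and copy all of its *.mplstyle files into matplotlib's style directory (the stylelib folder under matplotlib/mpl-data, the directory that also holds matplotlibrc) [2].
  • Then add one line to your plotting code: with plt.style.context(['science', 'no-latex']):. That is all it takes. The style sheet handles the formatting, so no other styling calls are needed.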
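The style context mechanism can be tried without installing anything by using a style that ships with matplotlib. The sketch below uses the built-in 'ggplot' style as a stand-in; 'science' and 'no-latex' are only available after the SciencePlots files are installed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

grid_before = plt.rcParams["axes.grid"]

# Inside the block the style sheet's settings are active
with plt.style.context("ggplot"):
    grid_inside = plt.rcParams["axes.grid"]
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])

# Outside the block the previous settings are back
grid_after = plt.rcParams["axes.grid"]

# Names of all styles matplotlib can currently find
available = plt.style.available
```

After the SciencePlots styles are installed, their names should show up in plt.style.available as well.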

  1. A detailed guide to matplotlib's configuration file and configuration methods

  2. scienceplot
