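Environment: Python with Jupyter Notebook
Data: this article uses the dataset from the 2023 [Teaching Competition] Financial Data Analysis Task 1: Bank Customer Product Subscription Prediction (【教学赛】金融数据分析赛题1:银行客户认购产品预测)
Competition (data) page: [Teaching Competition] Financial Data Analysis Task 1: Bank Customer Product Subscription Prediction, Tianchi Competition, Alibaba Cloud Tianchi
1. Data Exploration
1.1 Reading the data
Required libraries: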
import pandas as pd
import numpy as np
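# Load the training and test sets (both CSV files are expected in the working directory)
trian = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
# Working copy of the training set; the exploration code below refers to it as df
df = trian.copy()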
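Right after loading, it can help to confirm that the two files line up. The snippet below is only a sketch on tiny inline CSV text, not the competition files: the values are made up, the columns other than subscribe are illustrative, and check_columns is a hypothetical helper name. It assumes the test set has the same columns as the training set minus the subscribe target.

```python
import io
import pandas as pd

# Tiny stand-ins for train.csv and test.csv (illustrative values, not real data)
train_csv = io.StringIO("id,age,job,subscribe\n1,30,admin.,no\n2,45,technician,yes\n")
test_csv = io.StringIO("id,age,job\n3,28,services\n")

toy_train = pd.read_csv(train_csv)
toy_test = pd.read_csv(test_csv)

def check_columns(train_df, test_df, target="subscribe"):
    """Return the columns found in only one of the two frames, ignoring the target."""
    train_cols = set(train_df.columns) - {target}
    test_cols = set(test_df.columns)
    return sorted(train_cols ^ test_cols)

print(toy_train.shape, toy_test.shape)
mismatch = check_columns(toy_train, toy_test)
print("columns present in only one set:", mismatch)
```

An empty list means every feature column exists in both sets, so the same preprocessing can be applied to each.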
1.2 Inspecting the data
Check whether the data look normal and whether there are outliers:
View the summary statistics
print(df.describe().T)
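describe() shows the quartiles, but it does not say how many values fall far outside them. One common rule of thumb (an assumption here, not something the competition prescribes) flags values beyond 1.5 times the interquartile range. The sketch below uses a made-up toy frame, and iqr_outlier_counts is a hypothetical helper name.

```python
import pandas as pd

def iqr_outlier_counts(frame, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparing a DataFrame with a Series aligns the Series index with the columns
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# Made-up values; the single extreme age should be flagged
toy = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 50, 120],
    "duration": [100, 110, 120, 130, 140, 150, 160],
    "job": ["a", "b", "a", "b", "a", "b", "a"],
})
counts = iqr_outlier_counts(toy)
print(counts)
```

On the real data the same call would be iqr_outlier_counts(df); a flagged value is only a candidate for review, not automatically an error.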
View the data distribution (scatter plots):
# 1. Summary statistics
print(df.describe().T)
# 2. Binned view of duration (only the plotting imports appear here)
import matplotlib.pyplot as plt
import seaborn as sns
# 3. View the data distribution
# Separate the numerical and categorical variables
Nu_feature = list(df.select_dtypes(exclude=['object']).columns)
Ca_feature = list(df.select_dtypes(include=['object']).columns)
Ca_feature.remove('subscribe')  # the target is not a feature
col1 = Ca_feature
plt.figure(figsize=(20, 10))
# Training set in red, starting at slot 1 of the 4x5 grid
j = 1
for col in col1:
    plt.subplot(4, 5, j)
    plt.scatter(x=range(len(df)), y=df[col], color='red')
    plt.title(col)
    j += 1
# Test set in cyan, starting at slot 11 (this layout assumes at most 10 categorical features)
k = 11
for col in col1:
    plt.subplot(4, 5, k)
    plt.scatter(x=range(len(test)), y=test[col], color='cyan')
    plt.title(col)
    k += 1
plt.subplots_adjust(wspace=0.4, hspace=0.3)
plt.show()
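The code above mentions a binned view of duration but does not show it. A minimal sketch of one way to do it with pd.cut follows. The cut points, the bin labels and the toy values are all assumptions chosen for illustration, not the author's settings.

```python
import pandas as pd

# Made-up rows standing in for the training set
toy = pd.DataFrame({
    "duration": [30, 95, 180, 240, 610, 1200],
    "subscribe": ["no", "no", "no", "yes", "yes", "yes"],
})

# Assumed cut points; pd.cut uses right-closed intervals by default: (0, 100], (100, 300], ...
bin_edges = [0, 100, 300, 600, float("inf")]
bin_labels = ["<=100", "101-300", "301-600", ">600"]
toy["duration_bin"] = pd.cut(toy["duration"], bins=bin_edges, labels=bin_labels)

# Number of rows and share of 'yes' in each bin
summary = (
    toy.assign(is_yes=toy["subscribe"].eq("yes"))
       .groupby("duration_bin", observed=False)["is_yes"]
       .agg(["count", "mean"])
)
print(summary)
```

Passing observed=False keeps empty bins in the table, which makes gaps in the distribution visible instead of silently dropping them.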
Correlation plot (heatmap)
# 4. Correlation plot
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
cols = Ca_feature
for m in cols:
    # Fit on the train and test values together so that both sets share one label mapping
    lb.fit(pd.concat([df[m], test[m]]))
    df[m] = lb.transform(df[m])
    test[m] = lb.transform(test[m])
# Encode the target: no -> 0, yes -> 1
df['subscribe'] = df['subscribe'].map({'no': 0, 'yes': 1})
correlation_matrix = df.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, vmax=0.9, linewidths=0.05, cmap="RdGy")
plt.show()
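A heatmap is hard to read precisely, so it can be useful to also list the features by the strength of their correlation with the target. The sketch below does this on a made-up numeric frame; feature_a and feature_b are hypothetical column names, and ranking by absolute value is a choice made here so that strong negative correlations are not pushed to the bottom.

```python
import pandas as pd

# Made-up, already numeric data; subscribe is the 0/1 target
toy = pd.DataFrame({
    "duration": [1, 2, 3, 4, 5, 6],
    "feature_a": [40, 22, 35, 58, 31, 47],
    "feature_b": [2, 1, 2, 1, 2, 1],
    "subscribe": [0, 0, 0, 1, 1, 1],
})

corr = toy.corr()
# Correlation of every feature with the target, strongest first by absolute value
ranked = (
    corr["subscribe"]
    .drop("subscribe")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(ranked)
```

With the real data the same expression applies to correlation_matrix. Note that Pearson correlation on label-encoded categories depends on the arbitrary integer codes, so those values are only a rough hint.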
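Check whether the data contain missing values or 'unknown'
# The data have no NA values, but they do contain 'unknown' values
# Percentage of 'unknown' entries per column
# (run this before the label-encoding step, which overwrites the string values in test)
print(trian.isin(['unknown']).mean() * 100)
print(test.isin(['unknown']).mean() * 100)
# job, education and contact are filled with the mode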
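The mode-filling idea noted above can be sketched as a small helper. This is a variant under stated assumptions rather than the exact procedure used in this article: the mode is computed on the training set only, with the 'unknown' placeholder excluded, and the same value is then applied to both sets. fill_unknown_with_mode is a hypothetical name and the toy rows are made up.

```python
import pandas as pd

def fill_unknown_with_mode(train_df, test_df, columns, placeholder="unknown"):
    """Replace the placeholder in both frames with each column's training-set mode."""
    train_out, test_out = train_df.copy(), test_df.copy()
    for col in columns:
        # Ignore the placeholder itself when looking for the most frequent value
        known = train_out.loc[train_out[col] != placeholder, col]
        fill_value = known.mode().iloc[0]
        train_out[col] = train_out[col].replace(placeholder, fill_value)
        test_out[col] = test_out[col].replace(placeholder, fill_value)
    return train_out, test_out

toy_train = pd.DataFrame({
    "job": ["admin.", "unknown", "admin.", "services"],
    "education": ["unknown", "unknown", "basic", "basic"],
})
toy_test = pd.DataFrame({
    "job": ["unknown", "services"],
    "education": ["unknown", "basic"],
})

filled_train, filled_test = fill_unknown_with_mode(toy_train, toy_test, ["job", "education"])
print(filled_train)
print(filled_test)
```

Excluding the placeholder matters for columns such as the toy education column, where 'unknown' is as frequent as the real category and could otherwise be picked as the mode.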
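1.3 Data preprocessing
Fill the 'unknown' values in the training and test sets:
# Replace 'unknown' with the most frequent value (mode) of each training column
trian['default'] = trian['default'].replace('unknown', trian['default'].mode()[0])
trian['job'] = trian['job'].replace('unknown', trian['job'].mode()[0])
trian['education'] = trian['education'].replace('unknown', trian['education'].mode()[0])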
trian['marital'].replace(['unknown'], trian['marital'].mode(), inpl