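0 Background
Give Me Some Credit (https://www.kaggle.com/c/GiveMeSomeCredit/overview) is a Kaggle competition on credit scoring. The task is to improve credit scoring by predicting the probability that a borrower will run into financial distress within the next two years, and to use that prediction to decide whether to grant credit. The goal is a model that helps banks make the best lending decisions. Today, this
The data types are as follows:
Here, SeriousDlqin2yrs indicates whether the borrower became seriously delinquent (90 days past due or worse) within the two-year window. It is the target column, and the field to predict on the test set.
Part 1: Import the required packages and data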
import numpy as np
import pandas as pd
import os, datetime, sys, random, time
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
plt.style.use('fivethirtyeight')
%matplotlib inline
from scipy import stats, special
import seaborn as sns  # used by the count plot in Part 2
import shap  # SHAP values for model interpretation
import warnings
warnings.filterwarnings('ignore')
train_data=pd.read_csv("./GiveMeSomeCredit/cs-training.csv",encoding="utf-8")
test_data=pd.read_csv("./GiveMeSomeCredit/cs-test.csv",encoding="utf-8")
print(train_data.head())
# print(test_data.head())
The first rows of the training set print as follows:
Unnamed: 0 SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate NumberRealEstateLoansOrLines NumberOfTime60-89DaysPastDueNotWorse NumberOfDependents
0 1 1 0.766127 45 2 0.802982 9120.0 13 0 6 0 2.0
1 2 0 0.957151 40 0 0.121876 2600.0 4 0 0 0 1.0
2 3 0 0.658180 38 1 0.085113 3042.0 2 1 0 0 0.0
3 4 0 0.233810 30 0 0.036050 3300.0 5 0 0 0 0.0
4 5 0 0.907239 49 1 0.024926 63588.0 7 0 1 0 0.0
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 12 columns):
Unnamed: 0 150000 non-null int64
SeriousDlqin2yrs 150000 non-null int64
RevolvingUtilizationOfUnsecuredLines 150000 non-null float64
age 150000 non-null int64
NumberOfTime30-59DaysPastDueNotWorse 150000 non-null int64
DebtRatio 150000 non-null float64
MonthlyIncome 120269 non-null float64
NumberOfOpenCreditLinesAndLoans 150000 non-null int64
NumberOfTimes90DaysLate 150000 non-null int64
NumberRealEstateLoansOrLines 150000 non-null int64
NumberOfTime60-89DaysPastDueNotWorse 150000 non-null int64
NumberOfDependents 146076 non-null float64
dtypes: float64(4), int64(8)
memory usage: 13.7 MB
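The training set has 150,000 rows. Two columns contain null values and need further handling: MonthlyIncome (120,269 non-null, so 29,731 missing) and NumberOfDependents (146,076 non-null, so 3,924 missing).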
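The missing-value check can be sketched as below. This is a minimal illustration on a small made-up frame, not the author's cleaning step: the helper name `missing_summary` is hypothetical, and median imputation is only one possible way to handle the gaps (an assumption, since the text does not say which method is used).

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return count and percentage of missing values for columns that have any."""
    n_missing = df.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (100 * n_missing / len(df)).round(2),
    })
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Toy data standing in for train_data (values are illustrative only).
toy = pd.DataFrame({
    "MonthlyIncome": [9120.0, np.nan, 3042.0, np.nan],
    "NumberOfDependents": [2.0, 1.0, np.nan, 0.0],
    "age": [45, 40, 38, 30],
})

summary = missing_summary(toy)
print(summary)

# Assumption: fill each numeric column with its median.
filled = toy.fillna(toy.median())
```

On the real data the same helper would be called as `missing_summary(train_data)`; from the `info()` output above it should report roughly 19.8% missing for MonthlyIncome and 2.6% for NumberOfDependents.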
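Part 2: Exploratory data analysis and cleaning
For example, the 'Unnamed: 0' column is only the record ID. It carries no predictive information, so it is dropped.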
# remove id
dev_train=train_data.drop("Unnamed: 0",axis=1)
# do the same for the test set
print(test_data.info())
dev_test=test_data.drop("Unnamed: 0",axis=1)
(1) Inspect the distribution of each column
print(dev_train.describe())
SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate NumberRealEstateLoansOrLines NumberOfTime60-89DaysPastDueNotWorse NumberOfDependents
count 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 1.202690e+05 150000.000000 150000.000000 150000.000000 150000.000000 146076.000000
mean 0.066840 6.048438 52.295207 0.421033 353.005076 6.670221e+03 8.452760 0.265973 1.018240 0.240387 0.757222
std 0.249746 249.755371 14.771866 4.192781 2037.818523 1.438467e+04 5.145951 4.169304 1.129771 4.155179 1.115086
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.029867 41.000000 0.000000 0.175074 3.400000e+03 5.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.154181 52.000000 0.000000 0.366508 5.400000e+03 8.000000 0.000000 1.000000 0.000000 0.000000
75% 0.000000 0.559046 63.000000 0.000000 0.868254 8.249000e+03 11.000000 0.000000 2.000000 0.000000 1.000000
max 1.000000 50708.000000 109.000000 98.000000 329664.000000 3.008750e+06 58.000000 98.000000 54.000000 98.000000 20.000000
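A closer look shows:
SeriousDlqin2yrs: the distribution is unbalanced. The mean is 0.0668, so only about 6.7% of samples are positive and the positive/negative ratio is heavily skewed.
RevolvingUtilizationOfUnsecuredLines: the maximum (50708) is far above the 75th percentile (0.559) and the mean (6.05) is well above the median (0.154), which points to a number of extreme outliers.
age: the minimum is 0, which is an anomaly and may come from missing values. The maximum of 109 also looks suspicious.
NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate, NumberOfTime60-89DaysPastDueNotWorse: all three have a maximum of 98 and similar standard deviations (about 4.2), so they may be correlated.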
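The observations above can be checked directly. The following is a small sketch on made-up rows, assuming the column names from the dataset; the helper `summarize_anomalies` is hypothetical and only counts suspicious rows, it does not decide how to treat them.

```python
import pandas as pd

# The three past-due count columns named in the text.
late_cols = [
    "NumberOfTime30-59DaysPastDueNotWorse",
    "NumberOfTime60-89DaysPastDueNotWorse",
    "NumberOfTimes90DaysLate",
]

def summarize_anomalies(df):
    """Count rows with age == 0, rows where all past-due columns equal 98,
    and compute the pairwise correlation of the past-due columns."""
    return {
        "age_zero_rows": int((df["age"] == 0).sum()),
        "rows_all_98": int((df[late_cols] == 98).all(axis=1).sum()),
        "late_col_corr": df[late_cols].corr(),
    }

# Toy data standing in for dev_train (values are illustrative only).
toy = pd.DataFrame({
    "age": [45, 0, 38, 30],
    late_cols[0]: [2, 98, 1, 0],
    late_cols[1]: [0, 98, 0, 0],
    late_cols[2]: [0, 98, 1, 0],
})

report = summarize_anomalies(toy)
print(report["age_zero_rows"], report["rows_all_98"])
print(report["late_col_corr"])
```

If the value 98 shows up in all three columns on the same rows of the real data, that would suggest a shared sentinel code rather than a genuine count, which would also explain a high correlation between them.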
Next, visualize the data.
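# check whether positive and negative samples are balanced
fig,axes=plt.subplots(1,2,figsize=(12,6))
# pandas built-in plotting
dev_train['SeriousDlqin2yrs'].value_counts().plot.pie(explode=[0,0.1],autopct="%1.1f%%",ax=axes[0])
axes[0].set_title("SeriousDlqin2yrs")
# pass the column as x=...; recent seaborn versions reject positional data arguments
sns.countplot(x="SeriousDlqin2yrs",data=dev_train,ax=axes[1])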
axes[1].set_title("SeriousDlqin2yrs")
plt.show()
The plots show a severe class imbalance. One option is to under-sample the majority class.
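Under-sampling can be sketched with plain pandas as below. This is one simple variant (random under-sampling of the majority class), not necessarily what the author uses later; the function name `undersample_majority`, the `ratio` parameter, and the seed are all assumptions for illustration.

```python
import pandas as pd

def undersample_majority(df, target, ratio=1.0, seed=42):
    """Randomly keep at most ratio * (minority size) rows of the majority class."""
    counts = df[target].value_counts()
    minority_label = counts.idxmin()
    minority = df[df[target] == minority_label]
    majority = df[df[target] != minority_label]
    n_keep = min(len(majority), int(len(minority) * ratio))
    sampled = majority.sample(n=n_keep, random_state=seed)
    # Concatenate and shuffle so the classes are not grouped together.
    combined = pd.concat([minority, sampled])
    return combined.sample(frac=1.0, random_state=seed).reset_index(drop=True)

# Toy data standing in for dev_train: 14 negatives, 1 positive.
toy = pd.DataFrame({
    "SeriousDlqin2yrs": [0] * 14 + [1],
    "age": list(range(30, 45)),
})

balanced = undersample_majority(toy, "SeriousDlqin2yrs")
print(balanced["SeriousDlqin2yrs"].value_counts())
```

Under-sampling discards most of the negative rows, so it trades information for balance. That cost is one reason to consider reweighting the classes instead.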
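Alternatively, LightGBM provides two parameters for this, is_unbalance and scale_pos_weight. Use only one of them; LightGBM does not allow both at the same time.
is_unbalance: when set to True, the algorithm tries to rebalance the label weights automatically (using the pos/neg ratio in the training set).
scale_pos_weight: defaults to 1, which treats positive and negative labels as equally weighted. For an unbalanced dataset, the following formula is recommended:
scale_pos_weight = number of negative samples / number of positive samples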
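As a worked example of the formula, the class counts can be recovered from the describe() output above (mean of SeriousDlqin2yrs is 0.066840 over 150,000 rows). The parameter dict below is a hypothetical sketch; only scale_pos_weight is taken from the text, and "objective": "binary" is an assumption about how the model would be configured.

```python
# Class counts derived from the describe() output:
# mean of SeriousDlqin2yrs = 0.066840 over 150000 rows.
n_rows = 150000
positive_rate = 0.066840

n_pos = round(n_rows * positive_rate)  # positive (delinquent) samples
n_neg = n_rows - n_pos                 # negative samples
scale_pos_weight = n_neg / n_pos

# Hypothetical LightGBM parameter dict (assumed setup, not from the source).
params = {
    "objective": "binary",
    "scale_pos_weight": scale_pos_weight,
}

print(n_pos, n_neg, round(scale_pos_weight, 2))  # prints: 10026 139974 13.96
```

With real data the counts would normally come from `dev_train['SeriousDlqin2yrs'].value_counts()`, and the dict could then be passed to a LightGBM model, for example as `lightgbm.LGBMClassifier(**params)`.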