LightGBM in Practice on a Kaggle Competition: Give Me Some Credit


guyu1003


Category: Machine Learning. Tags: python, data analysis, machine learning, kaggle

Article link: https://blog.youkuaiyun.com/guyu1003/article/details/109092790


0 Background

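Give Me Some Credit (https://www.kaggle.com/c/GiveMeSomeCredit/overview) is a Kaggle credit-scoring competition. The task is to improve credit scoring by predicting the probability that a borrower will run into financial distress within the next two years, and to use that prediction when deciding whether to extend credit. The goal is a model that helps banks make the best lending decisions. Today this ...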

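The data types are as follows:

Note: SeriousDlqin2yrs indicates whether the borrower became seriously delinquent within the two-year window. It is the label in the training set and the field to predict for the test set.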

Part 1: Import the required packages and data

import numpy as np
import pandas as pd
import os, datetime, sys, random, time

import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns      # used for the count plot in Part 2

plt.style.use('fivethirtyeight')
%matplotlib inline

from scipy import stats, special
import shap                # model explanation (SHAP values)

import warnings
warnings.filterwarnings('ignore')

train_data=pd.read_csv("./GiveMeSomeCredit/cs-training.csv",encoding="utf-8")
test_data=pd.read_csv("./GiveMeSomeCredit/cs-test.csv",encoding="utf-8")

print(train_data.head())
# print(test_data.head())

The first rows of the training set print as follows:

	Unnamed: 0	SeriousDlqin2yrs	RevolvingUtilizationOfUnsecuredLines	age	NumberOfTime30-59DaysPastDueNotWorse	DebtRatio	MonthlyIncome	NumberOfOpenCreditLinesAndLoans	NumberOfTimes90DaysLate	NumberRealEstateLoansOrLines	NumberOfTime60-89DaysPastDueNotWorse	NumberOfDependents
0	1	1	0.766127	45	2	0.802982	9120.0	13	0	6	0	2.0
1	2	0	0.957151	40	0	0.121876	2600.0	4	0	0	0	1.0
2	3	0	0.658180	38	1	0.085113	3042.0	2	1	0	0	0.0
3	4	0	0.233810	30	0	0.036050	3300.0	5	0	0	0	0.0
4	5	0	0.907239	49	1	0.024926	63588.0	7	0	1	0	0.0

train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 12 columns):
Unnamed: 0                              150000 non-null int64
SeriousDlqin2yrs                        150000 non-null int64
RevolvingUtilizationOfUnsecuredLines    150000 non-null float64
age                                     150000 non-null int64
NumberOfTime30-59DaysPastDueNotWorse    150000 non-null int64
DebtRatio                               150000 non-null float64
MonthlyIncome                           120269 non-null float64
NumberOfOpenCreditLinesAndLoans         150000 non-null int64
NumberOfTimes90DaysLate                 150000 non-null int64
NumberRealEstateLoansOrLines            150000 non-null int64
NumberOfTime60-89DaysPastDueNotWorse    150000 non-null int64
NumberOfDependents                      146076 non-null float64
dtypes: float64(4), int64(8)
memory usage: 13.7 MB

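The training set has 150,000 rows. Two columns contain missing values and need further handling: MonthlyIncome (29,731 missing) and NumberOfDependents (3,924 missing).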
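To make the missing-value check concrete, here is a minimal sketch on a small made-up frame that stands in for train_data. The toy values and the median imputation at the end are assumptions for illustration only; the post does not say at this point how the gaps are filled.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data (values are made up for illustration).
toy = pd.DataFrame({
    "MonthlyIncome": [9120.0, np.nan, 3042.0, 3300.0, np.nan],
    "NumberOfDependents": [2.0, 1.0, np.nan, 0.0, 0.0],
    "age": [45, 40, 38, 30, 49],
})

# Count and share of missing values per column.
missing_count = toy.isnull().sum()
missing_share = toy.isnull().mean()
print(pd.DataFrame({"missing": missing_count, "share": missing_share}))

# One simple option (an assumption, not the post's method): fill each
# column's gaps with that column's median.
filled = toy.fillna(toy.median())
print(filled)
```

On the real data the same two calls, `train_data.isnull().sum()` and `train_data.isnull().mean()`, would report the gaps per column.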

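Part 2: Exploratory data analysis and cleaning

For example, the 'Unnamed: 0' column is just the record ID. It carries no useful information here, so it should be dropped.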

# Drop the ID column
dev_train=train_data.drop("Unnamed: 0",axis=1)
# Do the same for the test set
test_data.info()   # info() prints by itself; wrapping it in print() only adds "None"
dev_test=test_data.drop("Unnamed: 0",axis=1)

(1) Check the distribution of each column

print(dev_train.describe())

	SeriousDlqin2yrs	RevolvingUtilizationOfUnsecuredLines	age	NumberOfTime30-59DaysPastDueNotWorse	DebtRatio	MonthlyIncome	NumberOfOpenCreditLinesAndLoans	NumberOfTimes90DaysLate	NumberRealEstateLoansOrLines	NumberOfTime60-89DaysPastDueNotWorse	NumberOfDependents
count	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	1.202690e+05	150000.000000	150000.000000	150000.000000	150000.000000	146076.000000
mean	0.066840	6.048438	52.295207	0.421033	353.005076	6.670221e+03	8.452760	0.265973	1.018240	0.240387	0.757222
std	0.249746	249.755371	14.771866	4.192781	2037.818523	1.438467e+04	5.145951	4.169304	1.129771	4.155179	1.115086
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000e+00	0.000000	0.000000	0.000000	0.000000	0.000000
25%	0.000000	0.029867	41.000000	0.000000	0.175074	3.400000e+03	5.000000	0.000000	0.000000	0.000000	0.000000
50%	0.000000	0.154181	52.000000	0.000000	0.366508	5.400000e+03	8.000000	0.000000	1.000000	0.000000	0.000000
75%	0.000000	0.559046	63.000000	0.000000	0.868254	8.249000e+03	11.000000	0.000000	2.000000	0.000000	1.000000
max	1.000000	50708.000000	109.000000	98.000000	329664.000000	3.008750e+06	58.000000	98.000000	54.000000	98.000000	20.000000

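A closer look shows:

SeriousDlqin2yrs: the distribution is not balanced. The mean is 0.0668, so only about 6.7% of the samples are positive, a marked imbalance between positive and negative samples.

RevolvingUtilizationOfUnsecuredLines: the maximum (50708) is extreme, while the median (0.154) and the 75th percentile (0.559) are small. This points to a large number of outliers.

age: the minimum is 0, which must be an anomaly, possibly caused by a missing value. The maximum of 109 also looks suspicious.

NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate, NumberOfTime60-89DaysPastDueNotWorse: all three have a maximum of 98 and similar standard deviations (4.19, 4.17, 4.16), so they may be correlated.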
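The observations above can be checked with a few lines of pandas. The sketch below runs on a small made-up frame standing in for dev_train; the toy rows are assumptions chosen only to show the mechanics, so the printed numbers say nothing about the real data.

```python
import pandas as pd

past_due_cols = [
    "NumberOfTime30-59DaysPastDueNotWorse",
    "NumberOfTime60-89DaysPastDueNotWorse",
    "NumberOfTimes90DaysLate",
]

# Toy rows (made up) standing in for dev_train.
toy = pd.DataFrame({
    "age": [45, 0, 38, 30, 49, 62],
    "NumberOfTime30-59DaysPastDueNotWorse": [2, 0, 1, 98, 0, 98],
    "NumberOfTime60-89DaysPastDueNotWorse": [0, 0, 0, 98, 1, 98],
    "NumberOfTimes90DaysLate": [0, 0, 1, 98, 0, 98],
})

# Number of rows with the impossible age of 0.
n_age_zero = int((toy["age"] == 0).sum())

# Number of rows where all three past-due counters equal 98 at once.
all_98 = int((toy[past_due_cols] == 98).all(axis=1).sum())

# Pairwise Pearson correlation between the three counters.
corr = toy[past_due_cols].corr()

print(n_age_zero, all_98)
print(corr.round(3))
```

If the value 98 shows up in all three columns for the same rows on the real data, that alone would push their correlation up, which is worth knowing before treating the three columns as independent features.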

Next, let's move on to visual analysis.

# Check whether positive and negative samples are balanced
fig,axes=plt.subplots(1,2,figsize=(12,6))
# pandas built-in plotting
dev_train['SeriousDlqin2yrs'].value_counts().plot.pie(explode=[0,0.1],autopct="%1.1f%%",ax=axes[0])
axes[0].set_title("SeriousDlqin2yrs")
# pass the column by keyword; recent seaborn versions reject it as a positional argument
sns.countplot(x="SeriousDlqin2yrs",data=dev_train,ax=axes[1])
axes[1].set_title("SeriousDlqin2yrs")
plt.show()

The plots show that the positive and negative samples are severely imbalanced. One way to address this is undersampling.
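As an illustration of what undersampling means here, the following sketch keeps every positive row and draws an equally sized random subset of the negative rows. It is a minimal example on made-up data, and the 1:1 target ratio and the fixed random seed are assumptions rather than choices taken from the post.

```python
import pandas as pd

# Toy labels (made up): 9 negatives and 3 positives.
toy = pd.DataFrame({
    "SeriousDlqin2yrs": [0] * 9 + [1] * 3,
    "age": list(range(30, 42)),
})

pos = toy[toy["SeriousDlqin2yrs"] == 1]
neg = toy[toy["SeriousDlqin2yrs"] == 0]

# Keep every positive row and sample the same number of negative rows.
neg_sampled = neg.sample(n=len(pos), random_state=42)

# Combine the two parts, then shuffle the rows.
balanced = (
    pd.concat([pos, neg_sampled])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(balanced["SeriousDlqin2yrs"].value_counts())
```

The trade-off is that undersampling throws away most of the negative rows, which is why reweighting inside the model is often considered as an alternative.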

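LightGBM provides two parameters for this: is_unbalance and scale_pos_weight.
is_unbalance: when set to True, the algorithm tries to automatically balance the weight of the dominated label (using the positive/negative ratio in the training set).
scale_pos_weight: defaults to 1, which assumes positive and negative labels carry equal weight. For an imbalanced dataset, the following formula is recommended:
scale_pos_weight = number of negative samples / number of positive samples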
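The formula can be computed directly from the label column. The sketch below uses a made-up label array with 1 positive in 15 samples, close to the roughly 6.7% positive rate seen in describe(), where the ratio works out to about 14. The "objective" and "metric" entries in the parameter dict are assumptions for illustration. The LightGBM documentation says is_unbalance and scale_pos_weight cannot be used at the same time, so only one of them is set here.

```python
import numpy as np

# Toy labels (made up): 14 negatives and 1 positive.
y = np.array([0] * 14 + [1])

n_pos = int((y == 1).sum())
n_neg = int((y == 0).sum())

# scale_pos_weight = number of negative samples / number of positive samples
scale_pos_weight = n_neg / n_pos

# Parameter dict that could be passed to LightGBM. "objective" and "metric"
# are illustrative assumptions; is_unbalance is deliberately left out because
# it should not be combined with scale_pos_weight.
params = {
    "objective": "binary",
    "metric": "auc",
    "scale_pos_weight": scale_pos_weight,
}

print(params)
```

With the real training labels, `y` would be `dev_train["SeriousDlqin2yrs"].values`.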