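1. Application Background
Lending is a bank's most basic and most important asset business and its main source of profit, but it is also a comparatively risky asset. The risk is that if a borrower cannot repay, the bank is left with bad debt and takes a loss. Banks therefore routinely carry out extensive due diligence before deciding whether to grant a loan. This course project uses the numpy and pandas material covered in the Python class to clean data collected from the web, and then fits a logistic regression on the cleaned data to predict whether a loan should be granted.
2. Code Analysis
2.1 Data Preprocessing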
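import pandas as pd

# skiprows=1 skips the first line of the file, which precedes the header row.
loans_2007 = pd.read_csv('./LoanStats3a.csv', skiprows=1, low_memory=False)
# Keep only columns with at least half of their values present (thresh expects an integer count).
half_count = len(loans_2007) // 2
loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)
# Drop the description and URL columns.
loans_2007 = loans_2007.drop(['desc', 'url'], axis=1)
loans_2007.to_csv('./loans_2007.csv', index=False)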
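This is a first cleaning pass. Columns in which more than half of the values are missing are removed (the number of rows does not change), then desc and url, which carry no relevant information, are dropped, and the result is saved as loans_2007.csv.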
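The column filter above can be tried on a small table to see what `thresh` does. The sketch below is only an illustration: the frame `toy`, its values, and the names `min_non_null` and `kept` are made up here and are not part of the loan data.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: column "b" is mostly missing, "a" and "c" are not.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

# Keep only columns that have at least half of their values present.
min_non_null = len(toy) // 2  # integer count of required non-null values
kept = toy.dropna(thresh=min_non_null, axis=1)

print(list(kept.columns))    # column "b" is gone
print(len(toy), len(kept))   # the row count is unchanged
```

With `axis=1` the threshold is applied per column, so the step removes sparse columns and never removes rows.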
import pandas as pd
loans_2007 = pd.read_csv("loans_2007.csv")
#loans_2007.drop_duplicates()
print(loans_2007.iloc[0])
print(loans_2007.shape[1])
Print the first row and the number of columns to see which of the remaining columns are irrelevant.
id 1077501
member_id 1.2966e+06
loan_amnt 5000
funded_amnt 5000
funded_amnt_inv 4975
term 36 months
int_rate 10.65%
installment 162.87
grade B
sub_grade B2
emp_title NaN
emp_length 10+ years
home_ownership RENT
annual_inc 24000
verification_status Verified
issue_d Dec-2011
loan_status Fully Paid
pymnt_plan n
purpose credit_card
title Computer
zip_code 860xx
addr_state AZ
dti 27.65
delinq_2yrs 0
earliest_cr_line Jan-1985
inq_last_6mths 1
open_acc 3
pub_rec 0
revol_bal 13648
revol_util 83.7%
total_acc 9
initial_list_status f
out_prncp 0
out_prncp_inv 0
total_pymnt 5863.16
total_pymnt_inv 5833.84
total_rec_prncp 5000
total_rec_int 863.16
total_rec_late_fee 0
recoveries 0
collection_recovery_fee 0
last_pymnt_d Jan-2015
last_pymnt_amnt 171.62
last_credit_pull_d Nov-2016
collections_12_mths_ex_med 0
policy_code 1
application_type INDIVIDUAL
acc_now_delinq 0
chargeoff_within_12_mths 0
delinq_amnt 0
pub_rec_bankruptcies 0
tax_liens 0
Name: 0, dtype: object
52
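The output above shows that there are too many columns. Training directly on all 52 features may lead to overfitting, so the features need to be narrowed down further.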
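Reading one printed row gives only a rough impression of each column. As a possible aid for deciding what to drop, the sketch below builds a per-column summary of dtype, share of missing values, and number of distinct values. The helper `summarize_columns` and the toy values are assumptions made for this illustration; they are not part of the original project code.

```python
import numpy as np
import pandas as pd

def summarize_columns(df):
    """Return one row per column: dtype, share of missing values, distinct count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_ratio": df.isna().mean(),
        "n_unique": df.nunique(dropna=True),
    })

# Toy frame with made-up values, using a few column names from the printout.
toy = pd.DataFrame({
    "loan_amnt": [5000, 2500, 2400],
    "policy_code": [1, 1, 1],
    "emp_title": [np.nan, "teacher", np.nan],
})

summary = summarize_columns(toy)
print(summary)
```

A column with a single distinct value, or one that is mostly missing, is a natural candidate for closer inspection before training.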
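loans_2007 = loans_2007.drop(["id", "member_id", "funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1)
loans_2007 = loans_2007.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)
# These six columns are also absent from the 32-column output below, so they must be dropped as well.
loans_2007 = loans_2007.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)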
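These statements drop identifier columns such as id and member_id, along with coded fields such as the truncated zip_code. Values like these carry no useful signal for training.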
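When many columns are removed by hand, it can help to record why each one goes and to check that every name actually exists. The sketch below shows one way to do that on a toy frame. The dictionary `drop_reasons`, its labels, and the toy rows are hypothetical and list only a few of the dropped columns.

```python
import pandas as pd

# Hypothetical grouping for illustration; only a few of the dropped columns are listed.
drop_reasons = {
    "identifier": ["id", "member_id"],
    "coded field": ["zip_code"],
}

# Toy frame with made-up rows; it deliberately lacks member_id.
toy = pd.DataFrame({
    "id": [101, 102],
    "zip_code": ["860xx", "309xx"],
    "loan_amnt": [5000, 2500],
})

to_drop = [col for cols in drop_reasons.values() for col in cols]
# Report names that are not present instead of letting drop() raise a KeyError.
absent = [col for col in to_drop if col not in toy.columns]
cleaned = toy.drop(columns=to_drop, errors="ignore")

print("not found:", absent)
print("remaining:", list(cleaned.columns))
```

Passing `errors="ignore"` keeps the call from failing on a misspelled or already removed column, and the `absent` list makes such cases visible.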
print(loans_2007.iloc[0])
print(loans_2007.shape[1])
Printing again gives the following result:
loan_amnt 5000
term 36 months
int_rate 10.65%
installment 162.87
emp_length 10+ years
home_ownership RENT
annual_inc 24000
verification_status Verified
loan_status Fully Paid
pymnt_plan n
purpose credit_card
title Computer
addr_state AZ
dti 27.65
delinq_2yrs 0
earliest_cr_line Jan-1985
inq_last_6mths 1
open_acc 3
pub_rec 0
revol_bal 13648
revol_util 83.7%
total_acc 9
initial_list_status f
last_credit_pull_d Nov-2016
collections_12_mths_ex_med 0
policy_code 1
application_type INDIVIDUAL
acc_now_delinq 0
chargeoff_within_12_mths 0
delinq_amnt 0
pub_rec_bankruptcies 0
tax_liens 0
Name: 0, dtype: object
32
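The number of features is now down to 32, and little more can be trimmed. The next step is to choose the target for training. The natural target is whether a loan should be granted, which the loan status column records, so we print the counts of its values.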
print(loans_2007['loan_status'].value_counts())  # loan status
The result is as follows:
Fully Paid 33902
Charged Off 5658
Does