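Logistic regression analysis
Logistic regression is mainly used for classification, especially binary classification. Below, it is applied to a dataset on whether credit card clients pay their bills on time, in order to predict default.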
https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
Credit card default dataset
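import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split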
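## Read the dataset
# Raw string so the backslashes in the Windows path are not treated as escape sequences
credit = pd.read_excel(r"D:\Desktop\python在机器学习中的应用\default of credit card clients.xls")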
credit.head(5)
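Y: 1 means the client defaulted on the next payment, 0 means the client did not default. There are 30,000 samples in total, of which about 6,600 have Y = 1.
## Check for missing values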
credit.info()
## 6636 samples are defaults (Y = 1); the rest are non-defaults
credit["Y"].sum()
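As a quick sanity check on class balance, the two counts quoted above can be turned into the share of Y = 1 and a majority-class baseline. This is only arithmetic on the figures from the text (30,000 samples, 6,636 with Y = 1); the variable names are illustrative and not part of the original analysis.

```python
# Figures taken from the text above; the names are illustrative
n_total = 30000      # total number of samples
n_positive = 6636    # samples with Y = 1, i.e. credit["Y"].sum()

positive_rate = n_positive / n_total
# Accuracy of a trivial model that always predicts the more frequent class
majority_baseline = max(positive_rate, 1 - positive_rate)

print(f"share of Y = 1: {positive_rate:.4f}")
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
```

Because the classes are unbalanced, accuracy alone can look good for a model that never predicts Y = 1, so any classifier trained on this data is better compared against this baseline.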
## Split the dataset
trainx = ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10',
'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20',
'X21', 'X22', 'X23']
Target = ["Y"]
## Split the data into a training set and a validation set
traindata_x,valdata_x,traindata_y,valdata_y = train_test_split(credit[trainx],
credit[Target],test_size = 0.25,random_state = 1)
traindata_x.sample(5)
clf_l1_LR = LogisticRegression(penalty = 'l1',solver='liblinear')
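# penalty='l1' makes the model use an L1-norm penalty on the coefficients of the predictors, the same penalty as in lasso regression; in both cases it shrinks coefficients and can set some of them exactly to zero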
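# ravel() turns the single-column label DataFrame into the 1-D array that fit() expects
clf_l1_LR.fit(traindata_x, traindata_y.values.ravel())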
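To illustrate the effect of the L1 penalty, the sketch below fits an L1-penalised and an L2-penalised model on synthetic data and counts the coefficients that end up exactly zero. The data from `make_classification`, the value `C=0.05`, and all variable names are assumptions made for this illustration; they are not part of the credit card analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption): 20 features, only 3 of them informative
X, y = make_classification(n_samples=500, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)

# Same solver and regularisation strength; only the penalty differs
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X, y)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.05).fit(X, y)

# Count coefficients that are exactly zero under each penalty
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))

print("zero coefficients with L1:", n_zero_l1)
print("zero coefficients with L2:", n_zero_l2)
```

A smaller `C` means stronger regularisation, so lowering it would typically push more L1 coefficients to zero.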
pre_y = clf_l1_LR.predict(valdata_x)
print(metrics.classification_report(valdata_y,pre_y))
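- precision: of the samples predicted as positive, the fraction that are truly positive
- recall: of the truly positive samples, the fraction that were predicted as positive
- f1-score: the harmonic mean of precision and recall, a single combined metric
- support: the number of true samples of each class in the evaluated data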
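The metrics listed above can be computed by hand from the counts of true positives, false positives and false negatives. The following is a minimal sketch on eight hypothetical labels (not taken from the credit data), written in plain Python so that each formula is visible.

```python
# Hypothetical labels for eight samples (1 = positive class, 0 = negative class)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)                           # correct among predicted positives
recall = tp / (tp + fn)                              # found among actual positives
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
support = sum(1 for t in y_true if t == 1)           # true samples of the positive class

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")
```

The same calculation with the roles of 0 and 1 swapped gives the row for the other class in the report.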
## Plot the ROC curve
pre_y_p = clf_l1_LR.predict_proba(valdata_x)[:, 1]
fpr_LR, tpr_LR, _ = metrics.roc_curve(valdata_y, pre_y_p)
auc = metrics.auc(fpr_LR, tpr_LR)
plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_LR, tpr_LR,"r",linewidth = 2)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.title('Logistic ROC curve')
plt.text(0.2,0.8,"auc = "+str(round(auc,4)))
plt.show()
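To make the ROC curve and the AUC value less of a black box, the sketch below builds the curve by hand: it sweeps the decision threshold over the predicted scores, records the false positive rate and true positive rate at each threshold, and integrates with the trapezoidal rule. The four labels and scores are hypothetical, and the helper functions `roc_points` and `trapezoid_auc` are illustrative names; this follows the same idea as `metrics.roc_curve` and `metrics.auc` but is not their implementation.

```python
# Hypothetical true labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs; assumes both classes are present in y_true."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; a sample is predicted positive if score >= threshold
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_auc(points):
    """Area under the piecewise-linear curve through the given points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

points = roc_points(y_true, scores)
auc_value = trapezoid_auc(points)
print(points)
print("AUC:", auc_value)
```

An AUC of 0.5 corresponds to the dashed diagonal in the plot (random ranking), and values closer to 1 mean that positives tend to receive higher scores than negatives.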