Algorithm Practice DAT3: Model Building and Evaluation

License: CC 4.0 BY-SA

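This post compares seven common machine learning algorithms (logistic regression, SVM, decision tree, and others) in practice, reporting accuracy, precision, recall, F1 score, and AUC, and plots ROC curves to show how each algorithm performs on a specific dataset.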


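1. Task:
Record a score table of accuracy, precision, recall, F1 score, and AUC for seven models (logistic regression, SVM, decision tree, random forest, GBDT, XGBoost, and LightGBM), and plot their ROC curves.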

2. Evaluation metrics:
A for loop trains and evaluates the seven models one after another. The evaluation metrics are Accuracy, Precision, Recall, F1 score, and AUC.
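Accuracy: for a given test set, the ratio of correctly classified samples to the total number of samples. When the positive and negative classes are imbalanced, accuracy is a seriously flawed metric, because a classifier can score highly just by predicting the majority class.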

accuracy = (TP+TN)/(TP+TN+FN+FP)
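To make the caveat about imbalanced classes concrete, here is a minimal sketch with made-up counts (5 positives and 95 negatives, not taken from the dataset in this post). A classifier that always predicts the negative class still reaches 95% accuracy while finding none of the positives.

```python
def accuracy(tp, tn, fp, fn):
    """accuracy = (TP + TN) / (TP + TN + FN + FP)"""
    return (tp + tn) / (tp + tn + fn + fp)

# Hypothetical imbalanced test set: 5 positives, 95 negatives.
# An "always negative" classifier yields TP=0, FN=5, TN=95, FP=0.
always_negative_acc = accuracy(tp=0, tn=95, fp=0, fn=5)
positives_found = 0  # TP is zero: no positive sample was detected

print(always_negative_acc)  # 0.95
print(positives_found)      # 0
```

This is why the remaining metrics in this section are reported alongside accuracy rather than replaced by it.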

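Precision: precision is defined with respect to the predictions. It is the fraction of samples predicted positive that are actually positive, that is, the items correctly assigned to the class (TP) divided by all items that were retrieved (TP+FP).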

precision = TP/ (TP + FP)

Recall: recall is defined with respect to the actual positive samples. It is the items correctly retrieved (TP) divided by all items that should have been retrieved (TP+FN).

recall = TP/(TP+FN)

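F1 score: precision (P) and recall (R) sometimes conflict, so the two need to be considered together. The most common way to do this is the F-Measure (also called the F-Score); F1 is the harmonic mean of precision P and recall R.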

F1 = 2*P*R/(P+R)
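As a quick check of the harmonic-mean definition, the sketch below recomputes F1 from the precision and recall reported for the LR model in the output table at the end of this post. The helper name `f1_from_pr` is only illustrative and is not part of scikit-learn.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall: F1 = 2*P*R / (P + R)."""
    if precision + recall == 0:
        return 0.0  # convention chosen here for the degenerate case
    return 2 * precision * recall / (precision + recall)

# Precision and recall of the LR row in the results table of this post.
lr_precision, lr_recall = 0.630208, 0.337047
lr_f1 = f1_from_pr(lr_precision, lr_recall)
arithmetic_mean = (lr_precision + lr_recall) / 2

print(round(lr_f1, 4))            # 0.4392, matching the table's F1 score
print(round(arithmetic_mean, 4))  # 0.4836
```

The harmonic mean comes out lower than the arithmetic mean because it is pulled toward the smaller of the two values, so a model cannot reach a high F1 with good precision but poor recall.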

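ROC: short for Receiver Operating Characteristic.
ROC looks at two quantities:

True Positive Rate (TPR) = TP / [TP + FN], the probability that a positive sample is classified correctly
False Positive Rate (FPR) = FP / [FP + TN], the probability that a negative sample is misclassified as positive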

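In ROC space, each point has FPR as its x-coordinate and TPR as its y-coordinate, so the curve describes the classifier's trade-off between true positives (TP) and false positives (FP).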
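Each point in ROC space corresponds to one decision threshold on the predicted probability. The sketch below computes a single (FPR, TPR) point in plain Python. The labels and scores are invented for illustration, and `roc_point` is a hypothetical helper rather than a library function.

```python
def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if label == 1 and predicted_positive:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn)  # TP / (TP + FN)
    fpr = fp / (fp + tn)  # FP / (FP + TN)
    return fpr, tpr

# Hypothetical labels and predicted probabilities, for illustration only.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

strict = roc_point(y_true, scores, threshold=0.75)   # few positives predicted
lenient = roc_point(y_true, scores, threshold=0.35)  # many positives predicted
print(strict, lenient)
```

Lowering the threshold moves the point up and to the right: more true positives are caught, at the cost of more false positives.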

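An ROC curve is an intuitive way to show a classifier's performance, but people usually also want a single number that says how good the classifier is.
That is where the Area Under the ROC Curve (AUC) comes in. As the name suggests, AUC is the area under the ROC curve. It ranges from 0 to 1; random guessing gives about 0.5, so a useful classifier usually scores between 0.5 and 1.0, and a larger AUC means better performance.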
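The area can be computed by sweeping the threshold over the sorted scores and summing trapezoids under the resulting points. The following is a simplified sketch of that idea with invented data; it does not handle tied scores specially, and it is not how the code later in this post obtains AUC (that code uses `metrics.roc_curve` and `metrics.auc`).

```python
def roc_curve_points(y_true, scores):
    """Sweep the threshold from high to low and collect (FPR, TPR) points."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    # Rank samples by descending score (ties are not treated specially)
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical labels and predicted probabilities, for illustration only.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

auc_value = auc_trapezoid(roc_curve_points(y_true, scores))
# Negating the scores reverses the ranking, which pushes AUC below 0.5
auc_reversed = auc_trapezoid(roc_curve_points(y_true, [-s for s in scores]))
print(round(auc_value, 4), round(auc_reversed, 4))
```

Here 8 of the 9 positive/negative pairs are ranked correctly, so the AUC is 8/9. Reversing the ranking gives 1/9, which shows that values below 0.5 are possible but indicate a model that ranks worse than chance.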

3. Code implementation:

1) Import packages:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler  # feature standardization

2) 70/30 train/test split and feature standardization

data_all = pd.read_csv('./data_all.csv')
X = data_all.drop(['status'],axis=1)
y = data_all['status']
print('The shape of X:', X.shape)
print('proportion of label 1:', len(y[y==1])/len(y))
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=2018)
print('For train,proportion of label 1:', len(y_train[y_train == 1])/len(y_train))
print('For test,proportion of label 1:', len(y_test[y_test==1])/len(y_test))
print(data_all.head())
# Standardize the features
Sscaler = StandardScaler()
X_train = Sscaler.fit_transform(X_train)
# Reuse the mean and standard deviation learned on the training set
X_test = Sscaler.transform(X_test)
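Standardization learns a mean and a standard deviation per feature and rescales values with them. A common practice is to learn these statistics on the training split only and reuse them on the test split, so both splits share one scale. The sketch below imitates that on a single made-up feature column in plain Python. `fit_scaler` and `apply_scaler` are hypothetical helpers, and the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation of a feature."""
    return mean(column), pstdev(column)

def apply_scaler(column, mu, sigma):
    """Standardize values with previously learned statistics."""
    return [(x - mu) / sigma for x in column]

train_col = [10.0, 20.0, 30.0]  # hypothetical training feature
test_col = [20.0, 40.0]         # hypothetical test feature

# Learn the statistics on the training data, then reuse them on the test data
mu, sigma = fit_scaler(train_col)
test_scaled = apply_scaler(test_col, mu, sigma)

# For contrast: refitting on the test data puts it on a different scale
test_refit = apply_scaler(test_col, *fit_scaler(test_col))

print(test_scaled)
print(test_refit)
```

With the training statistics, the test value 40 lands about 2.45 standard deviations above the training mean. After refitting on the test column it becomes 1.0, so the same raw value no longer means the same thing to the model.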

3) Model construction

# Build the models
lr_model = LogisticRegression(solver='liblinear')

svm_model = SVC(probability=True)

dt_model = DecisionTreeClassifier()

rf_model = RandomForestClassifier(n_estimators=100)

gbdt_model = GradientBoostingClassifier(n_estimators=100)

xgb_model = XGBClassifier(n_estimators=100)

lgbm_model = LGBMClassifier(n_estimators=100)

4) Model evaluation

models = {'LR': lr_model,
          'SVM': svm_model,
          'DT': dt_model,
          'RF': rf_model,
          'GBDT': gbdt_model,
          'XGBoost': xgb_model,
          'LightGBM': lgbm_model}

df_result = pd.DataFrame(columns=('Model','Accuracy','Precision','Recall',
                                  'F1 score','AUC'))
row = 0
for name, clf in models.items():
    clf.fit(X_train,y_train)
    y_test_pred = clf.predict(X_test)
    
    acc = metrics.accuracy_score(y_test, y_test_pred)
    p = metrics.precision_score(y_test, y_test_pred)
    r = metrics.recall_score(y_test, y_test_pred)
    f1 = metrics.f1_score(y_test, y_test_pred)
    
    y_test_proba = clf.predict_proba(X_test)
    fpr, tpr, thresholds = metrics.roc_curve(y_test, y_test_proba[:,1])
    auc = metrics.auc(fpr, tpr)
    
    df_result.loc[row] = [name, acc, p, r, f1, auc]
    print(df_result.loc[row])
    row += 1

    # Plot this model's ROC curve
    plt.plot(fpr, tpr, lw=2, label='%s ROC curve (AUC = %0.2f)' % (name, auc))

# Draw the chance-level diagonal once, after all model curves
plt.plot([0, 1], [0, 1], lw=2, linestyle='--')

# Fix the axis ranges and label the axes
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
print(df_result)

Output:

      Model  Accuracy  Precision    Recall  F1 score       AUC
0        LR  0.783462   0.630208  0.337047  0.439201  0.755829
1       SVM  0.782761   0.709402  0.231198  0.348739  0.754259
2        DT  0.655221   0.358811  0.470752  0.407229  0.593990
3        RF  0.782060   0.673913  0.259053  0.374245  0.757208
4      GBDT  0.772950   0.600000  0.292479  0.393258  0.752277
5   XGBoost  0.776454   0.621951  0.284123  0.390057  0.764929
6  LightGBM  0.768746   0.592357  0.259053  0.360465  0.746740
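To read the table more easily, one option is to pick the best model per metric. The sketch below does that in plain Python on the values copied from the output above. The variable names are illustrative, and the ranking applies only to this run with default hyperparameters.

```python
# (model, accuracy, precision, recall, f1, auc) copied from the output table
results = [
    ("LR",       0.783462, 0.630208, 0.337047, 0.439201, 0.755829),
    ("SVM",      0.782761, 0.709402, 0.231198, 0.348739, 0.754259),
    ("DT",       0.655221, 0.358811, 0.470752, 0.407229, 0.593990),
    ("RF",       0.782060, 0.673913, 0.259053, 0.374245, 0.757208),
    ("GBDT",     0.772950, 0.600000, 0.292479, 0.393258, 0.752277),
    ("XGBoost",  0.776454, 0.621951, 0.284123, 0.390057, 0.764929),
    ("LightGBM", 0.768746, 0.592357, 0.259053, 0.360465, 0.746740),
]
columns = ("Accuracy", "Precision", "Recall", "F1 score", "AUC")

# For each metric, find the model with the highest value
best = {}
for i, col in enumerate(columns, start=1):
    best[col] = max(results, key=lambda row: row[i])[0]

for col in columns:
    print(col, "->", best[col])
```

No single model wins on every metric: LR leads on accuracy and F1, SVM on precision, the decision tree on recall, and XGBoost on AUC. Recall stays below 0.5 for all seven models.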