[Machine Learning in Practice] Kaggle Fraud Detection: Handling Extreme Imbalance Between Positive and Negative Samples in Fraud Data

机器学习司猫白

Last modified: 2025-01-18 17:02:49

First published: 2025-01-16 17:25:29

Article link: https://blog.youkuaiyun.com/2302_79308082/article/details/145177242

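This post shares some takeaways from my participation in a Kaggle credit fraud competition. Feedback and discussion are welcome, and I would be grateful for guidance from anyone working on financial fraud, either in the comments or by private message.

My homepage has more hands-on machine learning write-ups from Kaggle competitions, with source code. Feel free to drop by and discuss.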

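Task description

Use machine learning models to identify fraudulent credit card transactions, so that customers are not charged for items they did not purchase.

Dataset description

In this competition, the task is to predict the probability that an online transaction is fraudulent, as given by the binary target IsFraud.

Files
train.csv - the training set
test.csv - the test set
Sample_submission.csv - a sample submission file in the correct format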

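Modeling approach

When working with an extremely imbalanced fraud detection dataset, the skew has to be taken into account during modeling, so that the model not only recognizes the large number of normal transactions (negative samples) but also detects the small number of fraudulent ones (positive samples). This makes modeling fairly difficult, and I see two strategies worth trying.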

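Data-level approach:

Use sampling or synthetic data generation to increase the share of minority-class samples and balance the ratio of positive to negative samples.

Under-sampling: reduce the number of negative samples to balance the ratio, at the cost of possibly discarding informative negative samples.
Over-sampling: increase the number of positive (fraud) samples, either by duplicating existing ones or by synthesizing new ones. A common method is SMOTE (Synthetic Minority Over-sampling Technique), which generates new fraud samples by interpolating between existing ones.
Generative adversarial networks (GAN): a GAN can be trained to generate synthetic fraudulent transactions, which may help with the imbalance.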
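
As a rough illustration of the two resampling ideas above, the sketch below uses only NumPy on made-up data. It randomly under-samples the negatives, then creates synthetic positives by interpolating between a positive sample and one of its nearest positive neighbours, which is the core idea behind SMOTE. The function names `random_undersample` and `smote_like_oversample` are hypothetical, and this is not the imbalanced-learn implementation; in practice a maintained library is the safer choice. Resampling should only be applied to the training split, never to validation or test data.

```python
import numpy as np

def random_undersample(X, y, ratio=1.0, seed=0):
    """Keep all positives and a random subset of negatives.

    ratio = (number of negatives kept) / (number of positives).
    """
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    n_keep = min(len(neg_idx), int(ratio * len(pos_idx)))
    kept_neg = rng.choice(neg_idx, size=n_keep, replace=False)
    idx = np.concatenate([pos_idx, kept_neg])
    rng.shuffle(idx)
    return X[idx], y[idx]

def smote_like_oversample(X_pos, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between positives."""
    rng = np.random.default_rng(seed)
    n = len(X_pos)
    k = min(k, n - 1)
    # Pairwise squared distances between positive samples.
    d = ((X_pos[:, None, :] - X_pos[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)                # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]  # k nearest positives per row
    base = rng.integers(0, n, size=n_new)      # random starting points
    nb = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))               # interpolation factor in [0, 1)
    return X_pos[base] + gap * (X_pos[nb] - X_pos[base])

# Toy data: 1000 rows with 2% positives (made up for illustration).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = np.zeros(1000, dtype=int)
y[:20] = 1
X[:20] += 2.0  # shift the positives so they form their own cluster

X_under, y_under = random_undersample(X, y, ratio=1.0)
X_syn = smote_like_oversample(X[y == 1], n_new=100)
print(X_under.shape, y_under.mean(), X_syn.shape)
```

Because each synthetic row is a convex combination of two real positives, it always stays inside the region spanned by the existing fraud samples. That is also the method's main limitation when the positives are noisy or overlap heavily with the negatives.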

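Model-level approach:

Some algorithms handle extreme imbalance better than others, so the choice of model matters.

Random Forest: builds many decision trees to classify the data and supports class weights, so positive and negative samples can be weighted differently during training.
Gradient boosted trees (GBT): for example XGBoost, LightGBM, and CatBoost. These algorithms cope well with imbalanced data and expose a rich set of tunable hyperparameters.
Support vector machines (SVM): a weighted SVM in particular, where adjusting the class weights makes the model more sensitive to the minority class.
Ensemble methods (such as Bagging and Boosting): combining multiple base learners (such as decision trees) can improve robustness and the ability to predict the minority class.
Deep learning: deep neural networks (such as LSTM and CNN) perform very well on large datasets, but with a severely imbalanced dataset and a relatively small number of samples they are usually not the first choice. A simpler network, such as a plain fully connected network (ANN), is worth considering instead.

In short, the idea is to raise the weight the model gives to minority-class samples so that it becomes more sensitive to them.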
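
The class-weighting idea can be sketched with scikit-learn. The example below is a minimal illustration on synthetic data from `make_classification`; the data and all parameter values are assumptions for demonstration, not the competition setup. Both `RandomForestClassifier` and `SVC` accept a `class_weight` argument, and `'balanced'` weights each class inversely to its frequency.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced data: roughly 5% positives (illustrative only).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

# What 'balanced' means: n_samples / (n_classes * count_of_class).
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y_tr)
print(dict(zip([0, 1], weights.round(3))))

rf = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=0)
rf.fit(X_tr, y_tr)

svm = SVC(kernel='rbf', class_weight='balanced', random_state=0)
svm.fit(X_tr, y_tr)

rf_scores = rf.predict_proba(X_te)[:, 1]   # probability of class 1
svm_scores = svm.decision_function(X_te)   # signed distance to the margin
print(rf_scores.shape, svm_scores.shape)
```

Either score vector can be passed to a ranking metric such as ROC AUC. Whether weighting actually helps on a given dataset is an empirical question and should be checked on held-out data.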

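Model evaluation

On an imbalanced dataset, plain accuracy does not reflect model performance well, because the vast majority of samples are negative: a model that predicts "not fraud" for everything already scores very high. Two more suitable metrics are:

AUC (Area Under the ROC Curve): the area under the ROC curve. The closer it is to 1, the better the model separates the positive and negative classes.
PR-AUC (Precision-Recall AUC): on imbalanced datasets PR-AUC is often more informative than ROC AUC, because it focuses on the positive (minority) class rather than the negative class.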
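
The difference between these metrics is easy to see on a small hand-built example. The sketch below uses made-up labels and scores (not competition data) together with scikit-learn's `roc_auc_score` and `average_precision_score`; average precision is used here as the PR-AUC estimate, which is one common convention.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, roc_auc_score

# Hand-built example: 1000 transactions, 10 of them fraudulent (1%).
y_true = np.array([0] * 20 + [1] * 10 + [0] * 970)
# 20 normal transactions receive the highest scores, the 10 frauds come next,
# and the remaining 970 normal transactions get a low score.
y_score = np.array([0.9] * 20 + [0.8] * 10 + [0.1] * 970)

# Predicting "not fraud" for everything already gives 99% accuracy.
acc_all_negative = accuracy_score(y_true, np.zeros_like(y_true))

roc_auc = roc_auc_score(y_true, y_score)           # equals 970 / 990
pr_auc = average_precision_score(y_true, y_score)  # equals 1 / 3

print(f"accuracy (all negative): {acc_all_negative:.3f}")
print(f"ROC AUC: {roc_auc:.3f}")
print(f"PR-AUC (average precision): {pr_auc:.3f}")
```

Only 20 of the 990 negatives outrank the frauds, so ROC AUC stays close to 1. Those same 20 false alarms are twice as numerous as the 10 true frauds, which is exactly what the much lower PR-AUC reflects.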

Source code and walkthrough

1. Load the data files

import pandas as pd
import numpy as np

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
train_df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 34 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   id                  150000 non-null  int64         
 1   Time                150000 non-null  datetime64[ns]
 2   feat1               150000 non-null  float64       
 3   feat2               150000 non-null  float64       
 4   feat3               150000 non-null  float64       
 5   feat4               150000 non-null  float64       
 6   feat5               150000 non-null  float64       
 7   feat6               150000 non-null  float64       
 8   feat7               150000 non-null  float64       
 9   feat8               150000 non-null  float64       
 10  feat9               150000 non-null  float64       
 11  feat10              150000 non-null  float64       
 12  feat11              150000 non-null  float64       
 13  feat12              150000 non-null  float64       
 14  feat13              150000 non-null  float64       
 15  feat14              150000 non-null  float64       
 16  feat15              150000 non-null  float64       
 17  feat16              150000 non-null  float64       
 18  feat17              150000 non-null  float64       
 19  feat18              150000 non-null  float64       
 20  feat19              150000 non-null  float64       
 21  feat20              150000 non-null  float64       
 22  feat21              150000 non-null  float64       
 23  feat22              150000 non-null  float64       
 24  feat23              150000 non-null  float64       
 25  feat24              150000 non-null  float64       
 26  feat25              150000 non-null  float64       
 27  feat26              150000 non-null  float64       
 28  feat27              150000 non-null  float64       
 29  feat28              150000 non-null  float64       
 30  Transaction_Amount  150000 non-null  float64       
 31  IsFraud             150000 non-null  int64         
 32  hour                150000 non-null  int32         
 33  minute              150000 non-null  int32         
dtypes: datetime64[ns](1), float64(29), int32(2), int64(2)
memory usage: 37.8 MB

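The data quality is fairly good: every column is numeric and there are no missing values.
The Time field is a timestamp, so its time information can be extracted later. (The summary above appears to have been captured after the processing step below, which is why Time already shows as datetime64[ns] and the hour and minute columns are present.)

2. Data processing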

# Convert the Time column to datetime and extract hour and minute features
for df in (train_df, test_df):
    df['Time'] = pd.to_datetime(df['Time'], unit='s')  # seconds -> datetime
    df['hour'] = df['Time'].dt.hour
    df['minute'] = df['Time'].dt.minute

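Only the hour and minute are extracted from the time column. I already knew that every sample shares the same year, month, and day, so those fields carry no information and are not extracted.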
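
To make the conversion concrete, here is a tiny sketch on made-up values. With `unit='s'`, pandas interprets the numbers as seconds since 1970-01-01 00:00:00, so the `.dt` accessor can then read off the hour and minute. The frame `demo` and its values are purely illustrative.

```python
import pandas as pd

# Made-up timestamps in seconds (illustrative only).
demo = pd.DataFrame({'Time': [0, 3661, 86399]})
demo['Time'] = pd.to_datetime(demo['Time'], unit='s')
demo['hour'] = demo['Time'].dt.hour
demo['minute'] = demo['Time'].dt.minute
print(demo)
```

Here 3661 seconds becomes 01:01:01 (hour 1, minute 1) and 86399 seconds becomes 23:59:59 (hour 23, minute 59).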

Split the labels and features.

train_feature = train_df.drop(columns=['id','IsFraud','Time'])
test_feature = test_df.drop(columns=['id','Time'])

label = train_df['IsFraud']

3. Analyze the feature columns

import seaborn as sns
import matplotlib.pyplot as plt

# Compute the correlation matrix
correlation_matrix = train_feature.corr()

# Set the size of the heatmap
plt.figure(figsize=(15, 15))

# Draw the heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".1f", linewidths=0.5)

# Set the title (the matrix is computed from train_feature)
plt.title('Correlation Heatmap of train_feature')

# Show the figure
plt.show()

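[Figure: correlation heatmap of the training features]
The aim here is to use the correlations as a first pass for identifying features that are very highly correlated with each other, which are candidates for removal.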
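
Reading the candidates off a 31-by-31 heatmap by eye is error-prone, so the selection can also be done programmatically. The sketch below is one possible way to do it, shown on a toy frame: the helper `highly_correlated_columns` and the 0.95 threshold are assumptions for illustration, not something the original analysis used.

```python
import numpy as np
import pandas as pd

def highly_correlated_columns(df, threshold=0.95):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Toy frame: 'b' is almost a linear copy of 'a', 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    'a': a,
    'b': 2 * a + rng.normal(scale=0.01, size=500),
    'c': rng.normal(size=500),
})
to_drop = highly_correlated_columns(toy, threshold=0.95)
print(to_drop)
reduced = toy.drop(columns=to_drop)
```

On the real data the same call would be `highly_correlated_columns(train_feature)`, and whatever columns are dropped from `train_feature` must also be dropped from `test_feature`.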

4. Model training and evaluation

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import optuna

x = train_feature
y = label

# Split into training and hold-out sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Define the objective function for Optuna
def objective(trial):
    # Ratio of negative to positive samples in the training split
    scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
    
    # Search space
    params = {
        'objective': 'binary',  # binary classification
        'scale_pos_weight': scale_pos_weight,  # fixed at the negative/positive ratio (not tuned)
        'boosting_type': 'gbdt',  # classic GBDT
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'num_leaves': trial.suggest_int('num_leaves', 20, 150),
        'min_child_samples': trial.suggest_int('min_child_samples', 10, 100),
        'min_child_weight': trial.suggest_float('min_child_weight', 1e-3, 10.0),
        # Note: in LightGBM, subsample only takes effect when subsample_freq > 0
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
        'learning_rate': trial.suggest_float('learning_rate', 1e-4, 0.1),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-3, 10.0),
    }
    
    model = lgb.LGBMClassifier(**params, random_state=42)
    model.fit(X_train, y_train)
    
    # Predicted probability of class 1 (fraud)
    y_proba = model.predict_proba(X_test)[:, 1]

    # Compute the AUC on the hold-out set
    auc = roc_auc_score(y_test, y_proba)
    return auc

# Run the optimization
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)

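# Retrain with the best parameters. best_trial.params only holds the tuned
# values, so the fixed settings from the objective are passed again here.
best_params = study.best_trial.params
best_model = lgb.LGBMClassifier(
    objective='binary',
    boosting_type='gbdt',
    scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),
    **best_params,
    random_state=42,
)
best_model.fit(X_train, y_train)

# Evaluate
y_proba = best_model.predict_proba(X_test)[:, 1]  # probability of class 1
auc = roc_auc_score(y_test, y_proba)

print("AUC score: {:.5f}".format(auc))

Output (the exact value varies between runs, since the Optuna sampler is not seeded):

AUC score: 0.71782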

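This is the model-level technique in action: scale_pos_weight is set to the number of negative samples divided by the number of positive samples, which raises the weight of the minority class and makes the model more sensitive to it. The value is computed from the training split and held fixed; Optuna only tunes the other hyperparameters. Because study.best_trial.params contains only the tuned parameters, fixed settings such as objective and scale_pos_weight have to be passed again when the final model is refit. Also keep in mind that the hold-out split used to select the hyperparameters is the same one used for the final AUC, so the reported score is likely somewhat optimistic.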
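
The effect of this weighting can be checked by hand. The sketch below assumes the usual description of `scale_pos_weight`, namely that each positive sample's contribution to the binary log loss is multiplied by that factor. The helper `weighted_log_loss` is a hypothetical NumPy illustration, not LightGBM's internal code, and the labels are made up.

```python
import numpy as np

def weighted_log_loss(y_true, p, pos_weight=1.0):
    """Mean binary log loss with positives up-weighted by pos_weight."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), 1e-15, 1 - 1e-15)
    w = np.where(y_true == 1, pos_weight, 1.0)
    losses = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return np.sum(w * losses) / np.sum(w)

# Toy labels: 990 negatives and 10 positives (made up for illustration).
y = np.array([0] * 990 + [1] * 10)
scale_pos_weight = (y == 0).sum() / (y == 1).sum()   # 990 / 10

# Total weight carried by each class once positives are up-weighted.
pos_total = scale_pos_weight * (y == 1).sum()
neg_total = float((y == 0).sum())

# A lazy model that assigns every transaction a 1% fraud probability.
p_lazy = np.full(len(y), 0.01)
plain = weighted_log_loss(y, p_lazy)                       # every sample counts equally
weighted = weighted_log_loss(y, p_lazy, scale_pos_weight)  # both classes carry equal total weight
print(scale_pos_weight, pos_total, neg_total, plain, weighted)
```

With the ratio as the weight, the 10 positives together carry the same total weight as the 990 negatives, so the lazy model's poor fit on the frauds is penalized far more heavily than in the unweighted loss.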

Output and analyze the feature importances

# Get the feature importances
feature_importances = best_model.feature_importances_

# Build a DataFrame that pairs each feature with its importance
feature_importance_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': feature_importances
})

# Sort by importance
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Draw a horizontal bar chart of the feature importances
plt.figure(figsize=(10, 6))
plt.barh(feature_importance_df['Feature'], feature_importance_df['Importance'])
plt.gca().invert_yaxis()  # put the most important feature at the top
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Feature Importance from LightGBM Model')
plt.show()
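
The importances plotted above come from `feature_importances_`, which for LightGBM's scikit-learn wrapper defaults to split counts (`importance_type='split'`). Split counts can favour features that are used often without improving the score much, so a complementary check can be useful. The sketch below uses scikit-learn's `permutation_importance`, a different technique from the one in the original code. It is shown on synthetic data, with a `RandomForestClassifier` standing in for the LightGBM model; both are assumptions for illustration. The same call should work with any fitted scikit-learn-compatible estimator, such as `best_model` with `X_test` and `y_test`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only feature 0 drives the label; roughly 7% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 5))
y = (X[:, 0] > 1.5).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
model = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=0)
model.fit(X_tr, y_tr)

# Shuffle one column at a time and measure the drop in validation ROC AUC.
result = permutation_importance(
    model, X_val, y_val, scoring='roc_auc', n_repeats=5, random_state=0
)
ranking = np.argsort(result.importances_mean)[::-1]
for i in ranking:
    print(f"feature {i}: {result.importances_mean[i]:.4f} +/- {result.importances_std[i]:.4f}")
```

Features whose shuffling barely changes the validation AUC contribute little to the ranking quality, regardless of how often the trees split on them.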