Data Mining - Task 3: Classification Algorithms

This post builds logistic regression, SVM, decision tree, random forest, and XGBoost models and compares how the different algorithms perform on the same dataset. In the experiments, random forest achieved the highest accuracy on both the training and test sets, 77.87% and 75.21% respectively.


Preface

Build models with logistic regression, SVM, decision trees, random forest, and XGBoost. Any scoring method may be used, such as accuracy.
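The sections below only show the tree-based models, so here is a minimal sketch of the logistic regression and SVM baselines named in the task. Since the original CSV is a local file that is not included, the sketch uses a synthetic dataset from `make_classification` with the same 70/30 split; swap in your own `x_train`/`y_train` to reproduce the task setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the task dataset (the real CSV is local-only)
X, y = make_classification(n_samples=1000, n_features=20, random_state=2018)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2018)

lr = LogisticRegression(max_iter=1000)   # linear baseline
svm = SVC(kernel='rbf', C=1.0)           # RBF-kernel SVM baseline

for name, clf in [('Logistic Regression', lr), ('SVM', svm)]:
    clf.fit(x_train, y_train)
    # score() returns mean accuracy on the held-out split
    print(name, 'test accuracy =', clf.score(x_test, y_test))
```

Both models are evaluated with plain held-out accuracy, matching the "any scoring method" requirement in the task.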

Decision Tree

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score


# Preprocessed dataset from Task 2 (local file)
dataPath = 'D:\\Data\\dataAnalyse\\'
dataFile = 'dataProcessByTwoTask.csv'
data = pd.read_csv(dataPath + dataFile, encoding='gbk')

# Hold out 30% of the data as the test set
train_data, test_data = train_test_split(data, test_size=0.3, random_state=2018)
print("train_data", train_data.shape)
print("test_data", test_data.shape)


# 'status' is the target column; everything else is a feature
y_train = train_data['status']
x_train = train_data.drop(['status'], axis=1)

y_test = test_data['status']
x_test = test_data.drop(['status'], axis=1)

clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2, random_state=0)

# Cross-validated accuracy on the training split
scores_train = cross_val_score(clf, x_train, y_train)
print('Training set accuracy =', scores_train.mean())

# Cross-validated accuracy on the test split
scores_test = cross_val_score(clf, x_test, y_test)
print('Test set accuracy =', scores_test.mean())

Output:

Training set accuracy = 0.7012350276158706
Test set accuracy = 0.6691776554344605
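Note that `cross_val_score(clf, x_test, y_test)` retrains the tree on folds of the test set rather than evaluating the model fitted on the training split. A sketch of the more conventional held-out evaluation, again on synthetic data since the original CSV isn't included:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the task dataset
X, y = make_classification(n_samples=1000, random_state=2018)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2018)

clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2, random_state=0)
clf.fit(x_train, y_train)  # fit once on the training split only

# An unpruned tree memorizes the training data, so the train/test gap
# shows the overfitting that cross-validation within one split hides
print('train accuracy =', clf.score(x_train, y_train))
print('test accuracy =', clf.score(x_test, y_test))
```

With this setup the test score reflects generalization of the actual fitted model, which is what the train/test comparisons in this post are meant to show.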

Random Forest

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


# Same data loading and split as in the decision-tree section
dataPath = 'D:\\Data\\dataAnalyse\\'
dataFile = 'dataProcessByTwoTask.csv'
data = pd.read_csv(dataPath + dataFile, encoding='gbk')

train_data, test_data = train_test_split(data, test_size=0.3, random_state=2018)
print("train_data", train_data.shape)
print("test_data", test_data.shape)


y_train = train_data['status']
x_train = train_data.drop(['status'], axis=1)

y_test = test_data['status']
x_test = test_data.drop(['status'], axis=1)

# n_jobs=2 trains the trees on two cores in parallel
clf = RandomForestClassifier(n_jobs=2)
clf.fit(x_train, y_train)

# Cross-validated accuracy on the training split
scores_train = cross_val_score(clf, x_train, y_train)
print('Training set accuracy =', scores_train.mean())

# Cross-validated accuracy on the test split
scores_test = cross_val_score(clf, x_test, y_test)
print('Test set accuracy =', scores_test.mean())


Output:

Training set accuracy = 0.7787475721945868
Test set accuracy = 0.7520794391727867
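Beyond accuracy, a fitted random forest also reports which features drove its splits, which is useful for interpreting why it beats the single tree. A minimal sketch on synthetic data (the feature names `f0`..`f7` are placeholders, not columns from the original CSV):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 8 features, only 3 of which are informative
X, y = make_classification(n_samples=500, n_features=8,
                           n_informative=3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# feature_importances_ are impurity-based and sum to 1 across features
importances = pd.Series(clf.feature_importances_,
                        index=[f'f{i}' for i in range(8)])
print(importances.sort_values(ascending=False))
```

On the task dataset, replacing `X`/`y` with `x_train`/`y_train` would rank the original columns instead.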

XGBoost

from sklearn import metrics
from xgboost import XGBClassifier

# x_train / y_train / x_test / y_test come from the split above
model = XGBClassifier(learning_rate=0.01,
                      n_estimators=10,           # number of boosting trees
                      max_depth=4,               # maximum tree depth
                      min_child_weight=1,        # minimum child instance weight in a leaf
                      gamma=0.,                  # minimum loss reduction to split a leaf
                      subsample=1,               # use all samples for each tree
                      colsample_bytree=1,        # use all features for each tree
                      scale_pos_weight=1,        # weight to offset class imbalance
                      random_state=27            # random seed
                      )
model.fit(x_train, y_train)


# Predict on the held-out test set and report accuracy and AUC
y_pred = model.predict(x_test)
print("Accuracy : %.4g" % metrics.accuracy_score(y_test, y_pred)) 
y_train_proba = model.predict_proba(x_train)[:,1]
print("AUC Score (Train): %f" % metrics.roc_auc_score(y_train, y_train_proba))
y_proba = model.predict_proba(x_test)[:,1]
print("AUC Score (Test): %f" % metrics.roc_auc_score(y_test, y_proba))

Output:

Accuracy : 0.7551
AUC Score (Train): 0.807798
AUC Score (Test): 0.756977

Reference

XGBoost: Using XGBoost in Python
