##Problem Description##
The file salarytrain.dat contains over 30,000 rows of user data. Each row holds one user's personal data as comma-separated fields, in the following order:
- Age: continuous
- Workclass: 8 values
- Fnlwgt: continuous
- Education: 16 values
- Education-num: continuous
- Marital-status: 7 values
- Occupation: 14 values
- Relationship: 6 values
- Race: 5 values
- Sex: Male, Female
- Capital-gain: continuous
- Capital-loss: continuous
- Hours-per-week: continuous
- Native-country: 41 values
- Income: >50K, <=50K
The 15th field (Income) indicates whether the user's annual salary exceeds $50,000. Using the fields in salarytrain.dat, design an algorithm to predict which users in salarytest.dat earn more than $50,000 a year. The final result must satisfy:
- Generate a prediction file salaryresults.dat from the data in salarytest.dat, with one result per line (<=50K or >50K), corresponding one-to-one with the lines of salarytest.dat (see sampleresults.dat for the format).
- Run python /home/common/onboard/train3/evaluate.pyc ${predict_file} to obtain the accuracy score (accuracy = number of correct predictions / total number of predictions). The score must be above 0.8, and higher is better.
##Problem Analysis##
This is a standard prediction problem in data mining / machine learning: train a model on the training data, then evaluate its performance on the test data. The basic steps for solving it are:
- Data preprocessing (dropping missing values, encoding discrete variables, transforming continuous variables);
- Model building, for which many algorithms are available: SVM, logistic regression, GBRT, AdaBoost, XGBoost, neural networks, etc.;
- Evaluating the model's prediction accuracy or other metrics.
This write-up uses the scikit-learn library. The specific processing steps are:
- Drop rows with missing values;
- Either center the continuous variables or map them to the [0, 1] interval, compare the two transformations, and keep the better one;
- One-Hot encode the discrete variables. For a variable with n distinct values, One-Hot encoding uses n binary indicator bits, one per value; its key benefit is preserving the independence of the values (e.g. a province variable with 34 distinct values would, if encoded as single integers 1, 2, 3, ..., carry a spurious ordering);
- Train and predict with nine algorithms: SVM, logistic regression, naive Bayes, decision tree, random forest, gradient-boosted trees, AdaBoost, XGBoost, and a neural network;
- Use cross-validation for model selection and to estimate prediction accuracy.
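As a concrete illustration of the One-Hot point above, here is a minimal sketch using toy values (the `Race` values here are illustrative only, not read from the data files):

```python
import pandas as pd

# A toy discrete column with 3 distinct values
race = pd.Series(['White', 'Black', 'White', 'Asian-Pac-Islander'], name = 'Race')

# One-Hot encoding: one 0/1 indicator column per distinct value,
# so no spurious ordering is introduced between the values
onehot = pd.get_dummies(race)
print(onehot.columns.tolist())   # ['Asian-Pac-Islander', 'Black', 'White']
```

Each row has exactly one 1 among the indicator columns; `pd.get_dummies` is a convenient alternative to chaining `LabelEncoder` and `OneHotEncoder` by hand.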
###Code###
Example code: One-Hot encoding for the discrete variables, centering for the continuous variables.
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# Load required packages
import pandas as pd
from sklearn import preprocessing
import numpy as np
from sklearn.model_selection import cross_val_score
# Part 1: read the data
Salary_Data = pd.read_table('/Users/vancl/Desktop/达观数据_新兵训练营/train3/salarytrain.dat.txt',
sep = ',', header = None, na_values = ' ?', index_col = False)
Salary_Data.columns = ['Age', 'Workclass', 'Fnlwgt', 'Education', 'Education-num',
'Marital-status', 'Occupation', 'Relationship', 'Race',
'Sex', 'Capital-gain', 'Capital-loss', 'Hours-per-week',
'Native-country', 'Income']
# Variable groups
Continuous = ['Age', 'Fnlwgt', 'Education-num', 'Capital-gain', 'Capital-loss',
'Hours-per-week']
Discrete = ['Workclass', 'Education', 'Marital-status', 'Occupation', 'Relationship',
'Race', 'Sex', 'Native-country']
# Part 2: preprocess the variables
# Handle missing values (drop the affected rows)
# Null_Row = np.where(Salary_Data.isnull())[0]
Salary_Data.dropna(inplace = True)
# Binarize the target variable (note the mapping: ' <=50K' -> 1, ' >50K' -> 0)
Salary_Data.loc[Salary_Data['Income'] == ' <=50K', 'Income'] = 1
Salary_Data.loc[Salary_Data['Income'] == ' >50K', 'Income'] = 0
# Separate the continuous and discrete variables
ST_Continuous = Salary_Data[Continuous]
ST_Discrete = Salary_Data[Discrete]
# Center (standardize) the continuous variables
# (the [0, 1] alternative: preprocessing.MinMaxScaler().fit_transform(ST_Continuous))
ST_Continuous_Norm = preprocessing.scale(ST_Continuous)
# Dataframe
ST_Continuous_Norm = pd.DataFrame(ST_Continuous_Norm)
ST_Continuous_Norm.columns = Continuous
# First label-encode each discrete variable as single integers
ST_Discrete_Encoder = np.zeros((8, len(ST_Continuous_Norm)))
count = 0
for variable in Discrete:
    # Label-encode this variable
    EncoderVariable = preprocessing.LabelEncoder().fit_transform(Salary_Data[variable])
    ST_Discrete_Encoder[count] = EncoderVariable
    count = count + 1
# Transpose
ST_Discrete_Encoder = ST_Discrete_Encoder.transpose()
# One-Hot encode the label-encoded discrete data
OHE = preprocessing.OneHotEncoder()
# Fit and transform
ST_Discrete_OH_Encoder = OHE.fit_transform(ST_Discrete_Encoder).toarray()
ST_Discrete_OH_Encoder = pd.DataFrame(ST_Discrete_OH_Encoder)
# Combine the feature variables
ST_Variable = pd.concat([ST_Continuous_Norm, ST_Discrete_OH_Encoder], axis = 1)
# Target variable (reset the index so it aligns with the features without hard-coding the row count)
ST_Label = Salary_Data['Income'].astype(int).reset_index(drop = True)
# Model training and prediction
# 1. SVM
from sklearn import svm
svm_class = svm.SVC(kernel = 'rbf', gamma = 0.01, C = 1.0)
# Train with 5-fold cross-validation
Accuracy = cross_val_score(svm_class, ST_Variable, ST_Label, cv = 5)
# Report
print('The accuracy rate of SVM is: ' + str(np.mean(Accuracy)))
## SVM hyperparameter tuning (optional)
#from sklearn.model_selection import GridSearchCV
#Grid = GridSearchCV(svm.SVC(kernel = 'sigmoid'), param_grid = {'C':[0.1,1,10], 'gamma':[1,0.1,0.01]}, cv = 2)
#Grid.fit(ST_Variable, ST_Label)
# 2. Logistic regression (the liblinear solver supports the L1 penalty)
from sklearn import linear_model
Logit_class = linear_model.LogisticRegression(C = 1.0, penalty = 'l1', solver = 'liblinear', tol = 1e-6)
# Train with 5-fold cross-validation
Accuracy = cross_val_score(Logit_class, ST_Variable, ST_Label, cv = 5)
# Report
print('The accuracy rate of Logistic is: ' + str(np.mean(Accuracy)))
# 3. Naive Bayes
# Gaussian variant
from sklearn.naive_bayes import GaussianNB
Bayes_gnb_class = GaussianNB()
# Train with 5-fold cross-validation
Accuracy = cross_val_score(Bayes_gnb_class, ST_Variable, ST_Label, cv = 5)
# Report
print('The accuracy rate of Naive-Bayes is: ' + str(np.mean(Accuracy)))
# Bernoulli variant
from sklearn.naive_bayes import BernoulliNB
Bayes_bnl_class = BernoulliNB()
# Train with 5-fold cross-validation
Accuracy = cross_val_score(Bayes_bnl_class, ST_Variable, ST_Label, cv = 5)
# Report
print('The accuracy rate of Bernoulli-NB is: ' + str(np.mean(Accuracy)))
# 4. Decision tree
from sklearn import tree
Tree_class = tree.DecisionTreeClassifier()
# Train with 5-fold cross-validation
Accuracy = cross_val_score(Tree_class, ST_Variable, ST_Label, cv = 5)
# Report
print('The accuracy rate of Decision-Tree is: ' + str(np.mean(Accuracy)))
# 5. Random forest
from sklearn.ensemble import RandomForestClassifier
Randomforest_class = RandomForestClassifier(n_estimators = 10)
# Train with 5-fold cross-validation
Accuracy = cross_val_score(Randomforest_class, ST_Variable, ST_Label, cv = 5)
# Report
print('The accuracy rate of Random-Forest is: ' + str(np.mean(Accuracy)))
# 6. Gradient-boosted trees (GBRT)
from sklearn.ensemble import GradientBoostingClassifier
GBRT_class = GradientBoostingClassifier(n_estimators = 200, learning_rate = 1.0, max_depth = 3, random_state = 0)
# Train with 5-fold cross-validation
Accuracy = cross_val_score(GBRT_class, ST_Variable, ST_Label, cv = 5)
# Report
print('The accuracy rate of GBRT is: ' + str(np.mean(Accuracy)))
# 7. AdaBoost
from sklearn.ensemble import AdaBoostClassifier
# Build the model
Adaboost_class = AdaBoostClassifier(n_estimators = 500)
# Train with 5-fold cross-validation
Accuracy = cross_val_score(Adaboost_class, ST_Variable, ST_Label, cv = 5)
# Report
print('The accuracy rate of Adaboost is: ' + str(np.mean(Accuracy)))
# 8. XGBoost
import xgboost as xgb
# Parameter settings
Parameters = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
num_round = 5
# Per-fold accuracy scores
Accuracy = []
# Build the k-fold cross-validation splitter
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
KF = KFold(n_splits = 5, shuffle = True, random_state = 0)
# Loop over the folds
for train_index, test_index in KF.split(ST_Variable):
    # Split the data for this fold
    ST_Train, ST_Test = ST_Variable.iloc[train_index], ST_Variable.iloc[test_index]
    ST_Label_Train, ST_Label_Test = ST_Label.iloc[train_index], ST_Label.iloc[test_index]
    # Build the training input
    Train_Data = xgb.DMatrix(ST_Train, label = ST_Label_Train)
    # Build the test input
    Test_Data = xgb.DMatrix(ST_Test)
    # Train the model
    Xgboost_class = xgb.train(Parameters, Train_Data, num_round)
    # Predict
    ST_Test_Pred = Xgboost_class.predict(Test_Data)
    # Round predicted probabilities to 0/1 labels
    ST_Test_Pred = [int(round(item)) for item in ST_Test_Pred]
    # Accuracy on this fold
    Accuracy.append(accuracy_score(ST_Label_Test, ST_Test_Pred))
# Report
print('The accuracy rate of XGBoost is: ' + str(np.mean(Accuracy)))
# 9. Neural network
from sklearn.neural_network import MLPClassifier
Neural_class = MLPClassifier(solver = 'adam', alpha = 1e-5, hidden_layer_sizes = (10,), random_state = 0)
# Train with 5-fold cross-validation
Accuracy = cross_val_score(Neural_class, ST_Variable, ST_Label, cv = 5)
# Report
print('The accuracy rate of Neural-Network is: ' + str(np.mean(Accuracy)))
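The listing above stops at cross-validation; the task also requires writing salaryresults.dat. A minimal, self-contained sketch of that final step follows. The toy `predictions` list is a stand-in for the output of whichever model scores best, and the mapping inverts the target encoding used above (' <=50K' as 1, ' >50K' as 0):

```python
# Stand-in for model.predict(...) on the preprocessed salarytest.dat features
predictions = [1, 0, 0, 1, 1]

# Inverse of the target encoding used in the listing (' <=50K' -> 1, ' >50K' -> 0)
label_map = {1: '<=50K', 0: '>50K'}

# One result per line, matching salarytest.dat line by line
with open('salaryresults.dat', 'w') as f:
    for p in predictions:
        f.write(label_map[p] + '\n')
```

The file produced this way can then be scored with the evaluate.pyc command given in the problem statement.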
###Results###
During model building, different variable-preprocessing schemes were tried and compared; the resulting prediction accuracies are as follows: