Problem
Assignment
Steps
1. Create a classification dataset (n_samples ≥ 1000, n_features ≥ 10)
2. Split the dataset using 10-fold cross-validation
3. Train the algorithms
   - GaussianNB
   - SVC (possible C values [1e-02, 1e-01, 1e00, 1e01, 1e02], RBF kernel)
   - RandomForestClassifier (possible n_estimators values [10, 100, 1000])
4. Evaluate the cross-validated performance
   - Accuracy
   - F1-score
   - AUC ROC
5. Write a short report summarizing the methodology and the results
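One possible reading of steps 2 and 3 is that the listed C and n_estimators values are candidates to choose from inside each training fold, rather than settings to report one by one. The sketch below shows that idea (nested cross-validation) for SVC only. The inner `cv=5` and the fixed `random_state=0` are assumptions made for this illustration; they are not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

# Synthetic data at the minimum size the assignment asks for.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Inner loop: choose C using only the training part of each outer fold.
# cv=5 is an assumed setting, not something the assignment specifies.
inner_search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [1e-2, 1e-1, 1e0, 1e1, 1e2]},
    cv=5,
    scoring="accuracy",
)

# Outer loop: 10-fold estimate of how well the tuned model generalizes.
outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)
nested_scores = cross_val_score(inner_search, X, y, cv=outer_cv, scoring="accuracy")

print("per-fold accuracy:", nested_scores.round(3))
print("mean accuracy:", round(float(nested_scores.mean()), 3))
```

Because C is selected without ever seeing the outer test fold, the outer scores are not biased by the parameter search.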
Code
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Step 1: synthetic binary classification data (1000 samples, 10 features).
# Use new names so the sklearn `datasets` module is not shadowed.
X, y = datasets.make_classification(n_samples=1000, n_features=10)

# Step 2: 10-fold cross-validation with shuffling.
# sklearn.cross_validation was removed in scikit-learn 0.20;
# KFold now lives in sklearn.model_selection and is iterated via split().
kf = KFold(n_splits=10, shuffle=True)


def evaluate(title, clf, X_train, y_train, X_test, y_test):
    """Fit clf on the training fold and print its metrics on the test fold."""
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    acc = metrics.accuracy_score(y_test, pred)
    f1 = metrics.f1_score(y_test, pred)
    # AUC is computed from hard class predictions here, as in the results
    # below; passing decision scores or probabilities is the usual choice.
    auc = metrics.roc_auc_score(y_test, pred)
    print(title)
    print("acc:", acc, "f1:", f1, "auc", auc)


# Steps 3 and 4: train and evaluate every model on every fold.
for train_index, test_index in kf.split(X):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]

    evaluate("Naive Bayes", GaussianNB(), X_train, y_train, X_test, y_test)

    for C_data in [1e-02, 1e-01, 1e00, 1e01, 1e02]:
        clf = SVC(C=C_data, kernel='rbf', gamma=0.1)
        evaluate("SVM C_data %s" % C_data, clf, X_train, y_train, X_test, y_test)

    for n_data in [10, 100, 1000]:
        clf = RandomForestClassifier(n_estimators=n_data)
        evaluate("Random Forest n_data %s" % n_data, clf, X_train, y_train, X_test, y_test)
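The loop above prints one line per model per fold, which leaves the averaging to the reader. As a rough sketch of step 4, `cross_validate` can compute all three metrics and average them over the folds. The choices of C=1 and n_estimators=100 and the fixed `random_state=0` are assumptions for this example only. Note that the built-in `"roc_auc"` scorer ranks test samples by decision scores or probabilities, so its values are not directly comparable to an AUC computed from hard predictions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# One representative setting per model (assumed for illustration).
models = {
    "GaussianNB": GaussianNB(),
    "SVC (C=1)": SVC(C=1.0, kernel="rbf"),
    "RandomForest (n=100)": RandomForestClassifier(n_estimators=100, random_state=0),
}
metric_names = ["accuracy", "f1", "roc_auc"]

summary = {}
for name, model in models.items():
    # cross_validate returns one array of per-fold scores for each metric.
    result = cross_validate(model, X, y, cv=cv, scoring=metric_names)
    summary[name] = {m: float(result["test_" + m].mean()) for m in metric_names}
    print(name, {m: round(v, 3) for m, v in summary[name].items()})
```

Reporting the mean (and ideally the spread) over the ten folds gives one number per model and metric, which is easier to put in the short report than 90 separate lines.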
Results
Naive Bayes
acc: 0.93 f1: 0.9306930693069307 auc 0.930172068827531
SVM C_data 0.01
acc: 0.91 f1: 0.9032258064516129 auc 0.9117647058823529
SVM C_data 0.1
acc: 0.96 f1: 0.9591836734693878 auc 0.9607843137254901
SVM C_data 1.0
acc: 0.95 f1: 0.9494949494949494 auc 0.950580232092837
SVM C_data 10.0
acc: 0.94 f1: 0.9411764705882353 auc 0.9399759903961585
SVM C_data 100.0
acc: 0.92 f1: 0.9215686274509803 auc 0.9199679871948779
Random Forest n_data 10
acc: 0.97 f1: 0.9696969696969697 auc 0.9705882352941176
Random Forest n_data 100
acc: 0.98 f1: 0.98 auc 0.9803921568627452
Random Forest n_data 1000
acc: 0.97 f1: 0.9696969696969697 auc 0.9705882352941176
Naive Bayes
acc: 0.96 f1: 0.96 auc 0.96
SVM C_data 0.01
acc: 0.93 f1: 0.9263157894736842 auc 0.9299999999999999
SVM C_data 0.1
acc: 0.95 f1: 0.9484536082474226 auc 0.95
SVM C_data 1.0
acc: 0.94 f1: 0.9375 auc 0.94
SVM C_data 10.0
acc: 0.9 f1: 0.8936170212765958 auc 0.8999999999999999
SVM C_data 100.0
acc: 0.88 f1: 0.8749999999999999 auc 0.8799999999999999
Random Forest n_data 10
acc: 0.97 f1: 0.9702970297029702 auc 0.9699999999999999
Random Forest n_data 100
acc: 0.96 f1: 0.96 auc 0.96
Random Forest n_data 1000
acc: 0.96 f1: 0.96 auc 0.96
Naive Bayes
acc: 0.99 f1: 0.9894736842105264 auc 0.9895833333333333
SVM C_data 0.01
acc: 0.89 f1: 0.8910891089108911 auc 0.891826923076923
SVM C_data 0.1
acc: 0.97 f1: 0.967741935483871 auc 0.96875
SVM C_data 1.0
acc: 0.94 f1: 0.9347826086956522 auc 0.938301282051282
SVM C_data 10.0
acc: 0.94 f1: 0.9361702127659574 auc 0.9391025641025641
SVM C_data 100.0
acc: 0.93 f1: 0.924731182795699 auc 0.9286858974358976
Random Forest n_data 10
acc: 0.98 f1: 0.9787234042553191 auc 0.9791666666666667
Random Forest n_data 100
acc: 0.99 f1: 0.9894736842105264 auc 0.9895833333333333
Random Forest n_data 1000
acc: 0.98 f1: 0.9787234042553191 auc 0.9791666666666667
Naive Bayes
acc: 0.97 f1: 0.9719626168224299 auc 0.9727272727272727
SVM C_data 0.01
acc: 0.88 f1: 0.8775510204081634 auc 0.8909090909090909
SVM C_data 0.1
acc: 0.95 f1: 0.9532710280373831 auc 0.9525252525252524
SVM C_data 1.0
acc: 0.96 f1: 0.9629629629629629 auc 0.9616161616161615
SVM C_data 10.0
acc: 0.96 f1: 0.9629629629629629 auc 0.9616161616161615
SVM C_data 100.0
acc: 0.91 f1: 0.9142857142857144 auc 0.9141414141414141
Random Forest n_data 10
acc: 0.97 f1: 0.9719626168224299 auc 0.9727272727272727
Random Forest n_data 100
acc: 0.97 f1: 0.9719626168224299 auc 0.9727272727272727
Random Forest n_data 1000
acc: 0.97 f1: 0.9719626168224299 auc 0.9727272727272727
Naive Bayes
acc: 0.97 f1: 0.967741935483871 auc 0.96875
SVM C_data 0.01
acc: 0.89 f1: 0.8910891089108911 auc 0.891826923076923
SVM C_data 0.1
acc: 0.95 f1: 0.945054945054945 auc 0.9479166666666667
SVM C_data 1.0
acc: 0.95 f1: 0.946236559139785 auc 0.9487179487179486
SVM C_data 10.0
acc: 0.95 f1: 0.946236559139785 auc 0.9487179487179486
SVM C_data 100.0
acc: 0.93 f1: 0.924731182795699 auc 0.9286858974358976
Random Forest n_data 10
acc: 0.97 f1: 0.967741935483871 auc 0.96875
Random Forest n_data 100
acc: 0.97 f1: 0.967741935483871 auc 0.96875
Random Forest n_data 1000
acc: 0.97 f1: 0.967741935483871 auc 0.96875
Naive Bayes
acc: 0.99 f1: 0.9887640449438202 auc 0.9888888888888889
SVM C_data 0.01
acc: 0.88 f1: 0.8800000000000001 auc 0.888888888888889
SVM C_data 0.1
acc: 0.95 f1: 0.9411764705882353 auc 0.9444444444444444
SVM C_data 1.0
acc: 0.95 f1: 0.9425287356321839 auc 0.9464646464646465
SVM C_data 10.0
acc: 0.97 f1: 0.967032967032967 auc 0.9707070707070707
SVM C_data 100.0
acc: 0.87 f1: 0.853932584269663 auc 0.8676767676767676
Random Forest n_data 10
acc: 0.99 f1: 0.9887640449438202 auc 0.9888888888888889
Random Forest n_data 100
acc: 0.97 f1: 0.9655172413793104 auc 0.9666666666666667
Random Forest n_data 1000
acc: 0.99 f1: 0.9887640449438202 auc 0.9888888888888889
Naive Bayes
acc: 0.97 f1: 0.968421052631579 auc 0.9695512820512822
SVM C_data 0.01
acc: 0.92 f1: 0.9183673469387755 auc 0.920673076923077
SVM C_data 0.1
acc: 0.92 f1: 0.9111111111111111 auc 0.9174679487179487
SVM C_data 1.0
acc: 0.96 f1: 0.9574468085106383 auc 0.9591346153846154
SVM C_data 10.0
acc: 0.95 f1: 0.9484536082474228 auc 0.9503205128205129
SVM C_data 100.0
acc: 0.86 f1: 0.851063829787234 auc 0.858974358974359
Random Forest n_data 10
acc: 0.97 f1: 0.968421052631579 auc 0.9695512820512822
Random Forest n_data 100
acc: 0.97 f1: 0.968421052631579 auc 0.9695512820512822
Random Forest n_data 1000
acc: 0.97 f1: 0.968421052631579 auc 0.9695512820512822
Naive Bayes
acc: 0.97 f1: 0.970873786407767 auc 0.9703525641025642
SVM C_data 0.01
acc: 0.88 f1: 0.8695652173913044 auc 0.8846153846153846
SVM C_data 0.1
acc: 0.96 f1: 0.9607843137254902 auc 0.9607371794871795
SVM C_data 1.0
acc: 0.97 f1: 0.970873786407767 auc 0.9703525641025642
SVM C_data 10.0
acc: 0.95 f1: 0.9523809523809524 auc 0.9495192307692308
SVM C_data 100.0
acc: 0.9 f1: 0.9056603773584906 auc 0.8990384615384616
Random Forest n_data 10
acc: 0.97 f1: 0.970873786407767 auc 0.9703525641025642
Random Forest n_data 100
acc: 0.97 f1: 0.970873786407767 auc 0.9703525641025642
Random Forest n_data 1000
acc: 0.97 f1: 0.970873786407767 auc 0.9703525641025642
Naive Bayes
acc: 0.96 f1: 0.96 auc 0.96
SVM C_data 0.01
acc: 0.94 f1: 0.9375 auc 0.94
SVM C_data 0.1
acc: 0.94 f1: 0.9375 auc 0.94
SVM C_data 1.0
acc: 0.95 f1: 0.9484536082474226 auc 0.95
SVM C_data 10.0
acc: 0.95 f1: 0.9494949494949495 auc 0.95
SVM C_data 100.0
acc: 0.89 f1: 0.8952380952380952 auc 0.89
Random Forest n_data 10
acc: 0.94 f1: 0.94 auc 0.94
Random Forest n_data 100
acc: 0.96 f1: 0.96 auc 0.96
Random Forest n_data 1000
acc: 0.95 f1: 0.9504950495049505 auc 0.95
Naive Bayes
acc: 0.97 f1: 0.9714285714285713 auc 0.9704937775993576
SVM C_data 0.01
acc: 0.84 f1: 0.8222222222222222 auc 0.8490566037735849
SVM C_data 0.1
acc: 0.94 f1: 0.9400000000000001 auc 0.9433962264150944
SVM C_data 1.0
acc: 0.92 f1: 0.9199999999999999 auc 0.9233239662786029
SVM C_data 10.0
acc: 0.9 f1: 0.9019607843137256 auc 0.9020473705339221
SVM C_data 100.0
acc: 0.91 f1: 0.9090909090909092 auc 0.9138900040144521
Random Forest n_data 10
acc: 0.97 f1: 0.9714285714285713 auc 0.9704937775993576
Random Forest n_data 100
acc: 0.96 f1: 0.9615384615384616 auc 0.9610598153352067
Random Forest n_data 1000
acc: 0.97 f1: 0.9714285714285713 auc 0.9704937775993576
Analysis
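- Random forest performs very well: across the folds its accuracy is usually at least as high as that of SVC and GaussianNB.
- The SVC penalty coefficient C matters. When C is too small (0.01), regularization is strong and the model tends to underfit; when C is too large (100), regularization is weak and the model tends to overfit. Accuracy drops in both cases, and values of C from 0.1 to 10 do best in these runs.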
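One way to check the point about C is to compare training and test accuracy across the same C values. The sketch below uses `validation_curve` for that, on a freshly generated dataset with an assumed `random_state=0`, so its numbers will not match the results above. A low score on both sides points to underfitting, while a large gap between training and test accuracy points to overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
C_values = [1e-2, 1e-1, 1e0, 1e1, 1e2]

# Both returned arrays have shape (number of C values, number of folds).
train_scores, test_scores = validation_curve(
    SVC(kernel="rbf", gamma=0.1),
    X,
    y,
    param_name="C",
    param_range=C_values,
    cv=10,
    scoring="accuracy",
)

# Average over folds and print the train/test gap for each C.
for C, tr, te in zip(C_values, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"C={C:<6} train={tr:.3f} test={te:.3f} gap={tr - te:.3f}")
```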