Book information
Hands-On Machine Learning with Scikit-Learn and TensorFlow
Publisher: O'Reilly Media, Inc, USA
Paperback: 566 pages
Language: English
ISBN: 1491962291
Barcode: 9781491962299
Product dimensions: 18 x 2.9 x 23.3 cm
ASIN: 1491962291
This series of posts is a Chinese translation of the book.
Code and data downloads: https://github.com/ageron/handson-ml
1. If you have trained five different models on the exact same training data and each of them achieves 95% accuracy, is there a way to combine these models to get even better accuracy?
Answer: Yes: combine them into a voting ensemble, feeding each instance to all five models and aggregating their predictions (see the sketch under question 2).
2. What is the difference between hard voting and soft voting?
Answer: A hard voting classifier outputs the class predicted by the majority of the classifiers, while a soft voting classifier averages the classifiers' predicted class probabilities and outputs the class with the highest average probability. Soft voting requires every classifier in the ensemble to be able to estimate class probabilities.
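A minimal sketch contrasting the two voting modes; the toy dataset, the choice of models, and the hyperparameters here are illustrative assumptions, not from the book:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# probability=True makes SVC expose predict_proba, which soft voting requires
estimators = [('lr', LogisticRegression()),
              ('rf', RandomForestClassifier(random_state=42)),
              ('svc', SVC(probability=True, random_state=42))]
for voting in ('hard', 'soft'):
    clf = VotingClassifier(estimators=estimators, voting=voting)
    clf.fit(X_train, y_train)
    print(voting, 'voting accuracy:', clf.score(X_test, y_test))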
3. Can bagging, pasting, boosting, random forests, and stacking be sped up through parallelization?
Answer: Bagging, pasting, and random forests all belong to the bagging family: each predictor is trained independently of the others, so training parallelizes fully. Boosting is inherently sequential (each predictor is built to correct its predecessors), so it can hardly be parallelized. Stacking parallelizes partially: the models within one layer can be trained in parallel, but a layer can only be trained after the previous layer has finished.
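For example, scikit-learn's BaggingClassifier can spread its independent predictors over all CPU cores via n_jobs; a minimal sketch on illustrative toy data:
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
# n_jobs=-1 trains the 500 independent trees on all available cores;
# bootstrap=False would turn this bagging ensemble into pasting
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            bootstrap=True, n_jobs=-1, random_state=42)
bag_clf.fit(X, y)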
4. What is the benefit of out-of-bag evaluation?
Answer: Each predictor is evaluated on the training instances that were left out of its bootstrap sample, so the ensemble can be validated without a separate validation set; the instances that would otherwise be held out for validation remain available for training, which may improve the model.
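A minimal sketch of OOB evaluation with BaggingClassifier (toy data is illustrative; oob_score_ reports accuracy on the held-out bootstrap instances):
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
# with bootstrap sampling, each tree never sees roughly 37% of the training
# instances; oob_score=True evaluates every tree on exactly those instances
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            bootstrap=True, oob_score=True, random_state=42)
bag_clf.fit(X, y)
print('OOB accuracy estimate:', bag_clf.oob_score_)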
5. Why are Extra-Trees more random than regular random forests? What is this extra randomness good for? Are Extra-Trees faster or slower to train than random forests?
Answer: When growing a tree in a random forest, only a random subset of the features is considered at each node, and the best possible threshold is then searched for on each candidate feature. Extra-Trees (Extremely Randomized Trees) go one step further and also use random thresholds for each candidate feature instead of searching for the best ones. This extra randomness trades a bit more bias for lower variance. Extra-Trees are usually faster to train than random forests, because searching for the optimal threshold is one of the most time-consuming parts of growing a tree.
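ExtraTreesClassifier is a drop-in replacement for RandomForestClassifier, so the speed claim is easy to check; a rough timing sketch on illustrative toy data:
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
X, y = make_classification(n_samples=5000, n_features=50, random_state=42)
for cls in (RandomForestClassifier, ExtraTreesClassifier):
    clf = cls(n_estimators=100, random_state=42)
    start = time.time()
    clf.fit(X, y)  # Extra-Trees skips the optimal-threshold search at each split
    print(cls.__name__, 'training time: %.2fs' % (time.time() - start))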
6. How can you fix an AdaBoost ensemble that underfits the training data?
Answer: Increase the number of estimators, reduce the regularization of the base estimator (e.g., allow deeper trees), or slightly increase the learning rate.
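A sketch of those three knobs on AdaBoostClassifier (toy data; the specific values are illustrative):
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
# more estimators, a less-regularized base tree (max_depth=2 instead of the
# default depth-1 stump), and a slightly higher learning rate all add capacity
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),
                             n_estimators=500, learning_rate=1.5,
                             random_state=42)
ada_clf.fit(X, y)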
7. If your AdaBoost ensemble overfits the training set, how should the learning rate change?
Answer: Decrease the learning rate. You can also use early stopping: stop adding estimators once performance on a validation set stops improving.
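A sketch of validation-based early stopping, using staged_predict to score the ensemble after every boosting round (toy data; hyperparameters are illustrative):
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)
ada_clf = AdaBoostClassifier(n_estimators=500, learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train)
# staged_predict yields predictions after each boosting round, so we can
# find the round where validation accuracy peaks and cut the ensemble there
val_scores = [accuracy_score(y_val, pred) for pred in ada_clf.staged_predict(X_val)]
best_n = int(np.argmax(val_scores)) + 1
print('best number of estimators:', best_n, 'accuracy:', max(val_scores))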
8. Exercise with the MNIST data. First split the data into a training set (50,000 instances), a validation set (10,000), and a test set (10,000). Then train several different models (e.g., a random forest, Extra-Trees, an SVM), and finally combine them with soft or hard voting. Does the ensemble beat the individual models?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# fetch_mldata has been removed from scikit-learn; fetch_openml loads
# the same 70,000-image MNIST dataset from OpenML
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
from sklearn.utils import shuffle
X, y = shuffle(mnist['data'], mnist['target'], random_state=0)
X_train = X[:50000]
y_train = y[:50000]
X_validation = X[50000:60000]
y_validation = y[50000:60000]
X_test = X[60000:]
y_test = y[60000:]
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
clf1 = RandomForestClassifier(random_state=0)
clf1.fit(X_train, y_train)
print('Random Forest validation score:', clf1.score(X_validation, y_validation))
print('Random Forest test score:', clf1.score(X_test, y_test))
clf2 = ExtraTreesClassifier(random_state=0)
clf2.fit(X_train, y_train)
print('Extra Trees validation score:', clf2.score(X_validation, y_validation))
print('Extra Trees test score:', clf2.score(X_test, y_test))
# SVC is too slow on 50,000 MNIST images, so it is commented out and an
# SGD classifier trained with log loss is used as the third model instead
# clf3 = SVC(probability=True, random_state=0)
# clf3.fit(X_train, y_train)
# print('SVC validation score:', clf3.score(X_validation, y_validation))
# print('SVC test score:', clf3.score(X_test, y_test))
clf3 = SGDClassifier(loss='log', random_state=0)  # spelled 'log_loss' in scikit-learn >= 1.1
clf3.fit(X_train, y_train)
print('SGD validation score:', clf3.score(X_validation, y_validation))
print('SGD test score:', clf3.score(X_test, y_test))
from sklearn.ensemble import VotingClassifier
hard_voting_clf = VotingClassifier(
    estimators=[('rf', clf1), ('et', clf2), ('sgd', clf3)],
    voting='hard'
)
hard_voting_clf.fit(X_train, y_train)
print('hard voting validation score:', hard_voting_clf.score(X_validation, y_validation))
print('hard voting test score:', hard_voting_clf.score(X_test, y_test))
# the weak SGD model drags the ensemble down, so it is dropped for soft voting
soft_voting_clf = VotingClassifier(
    estimators=[('rf', clf1), ('et', clf2)],
    voting='soft'
)
soft_voting_clf.fit(X_train, y_train)
print('soft voting validation score:', soft_voting_clf.score(X_validation, y_validation))
print('soft voting test score:', soft_voting_clf.score(X_test, y_test))
# output
# Random Forest validation score: 0.9465
# Random Forest test score: 0.9425
# Extra Trees validation score: 0.949
# Extra Trees test score: 0.9473
# SGD validation score: 0.8769
# SGD test score: 0.8742
# hard voting validation score: 0.9506
# hard voting test score: 0.9474
# soft voting validation score: 0.9592
# soft voting test score: 0.9598
As the scores show, both voting ensembles outperform the individual models, with soft voting doing best.
9. Building on the previous exercise, use the individual models' predictions on the validation set to train a blender that combines their outputs (a stacking ensemble). Does performance improve?
# the base models' predictions on the validation set become
# the training features for the blender
pred1 = clf1.predict(X_validation)
pred2 = clf2.predict(X_validation)
pred3 = clf3.predict(X_validation)
pred_train = pd.DataFrame({"pred1" : pred1, "pred2" : pred2, "pred3" : pred3, "y" : y_validation})
blender = RandomForestClassifier(random_state=0)
blender.fit(pred_train[['pred1', 'pred2', 'pred3']], pred_train['y'])
# evaluate the full stack on the test set
pred1 = clf1.predict(X_test)
pred2 = clf2.predict(X_test)
pred3 = clf3.predict(X_test)
pred_test = pd.DataFrame({"pred1" : pred1, "pred2" : pred2, "pred3" : pred3, "y" : y_test})
print(blender.score(pred_test[['pred1', 'pred2', 'pred3']], pred_test['y']))
# output
# 0.9466
Several variants were tried, but none of them improved on the voting ensembles.
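For reference, one such variant (not necessarily one the original post tried) is to stack on the base models' class probabilities instead of their hard predictions. This sketch reuses clf1/clf2/clf3 and the data splits from above; clf3's log loss is what gives SGDClassifier a predict_proba method:
# hypothetical variant: feed class probabilities rather than class labels
# to the blender (30 features: 10 digits x 3 models)
proba_train = np.hstack([clf.predict_proba(X_validation) for clf in (clf1, clf2, clf3)])
proba_test = np.hstack([clf.predict_proba(X_test) for clf in (clf1, clf2, clf3)])
proba_blender = RandomForestClassifier(random_state=0)
proba_blender.fit(proba_train, y_validation)
print('probability blender test score:', proba_blender.score(proba_test, y_test))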