The XGBoost Model
- Boosting classifiers belong to the family of ensemble learning models. The core idea is to combine hundreds or thousands of tree models, each with low accuracy on its own, into a single model with high accuracy.
- They work iteratively: each iteration produces one new tree. There are many ways to grow a sensible tree at each step. Gradient Tree Boosting, for example, applies the idea of gradient descent: each new tree is built on top of all previously generated trees so as to take one more step toward minimizing the objective function.
- A fairly large number of trees is usually needed to reach satisfactory accuracy, and on large, complex datasets the model may require thousands of iterations; the XGBoost tool was built to make this practical.
- XGBoost stands for eXtreme Gradient Boosting. It is a C++ implementation of the Gradient Boosting Machine whose hallmark is automatically exploiting multithreaded CPU parallelism, with algorithmic refinements that further improve accuracy.
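The "take one more step toward minimizing the objective" idea can be sketched from scratch for squared-error loss, where the negative gradient is simply the residual: each new tree is fitted to what the current ensemble still gets wrong. The toy data, tree depth, learning rate, and iteration count below are illustrative choices of mine, not from the original text:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel()

# Gradient boosting with squared loss: each new tree fits the residuals
# (the negative gradient of the loss w.r.t. the current predictions)
pred = np.zeros_like(y)
learning_rate = 0.1
trees = []
for _ in range(50):
    residual = y - pred
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)
    pred += learning_rate * tree.predict(X)  # one small step along the ensemble
    trees.append(tree)

mse = np.mean((y - pred) ** 2)
```

After 50 shallow trees, the training error is far below that of predicting the mean, which is exactly the "many weak trees, one strong model" effect described above.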
import pandas as pd

# Load the Titanic dataset
titanic = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')

# Select three features and the target; copy() avoids a SettingWithCopyWarning
X = titanic[['pclass', 'sex', 'age']].copy()
y = titanic['survived']

# Impute missing ages with the column mean
X['age'] = X['age'].fillna(X['age'].mean())

# train_test_split now lives in sklearn.model_selection
# (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
from sklearn.feature_extraction import DictVectorizer

# One-hot encode categorical features; fit the vocabulary on the training
# split only, then reuse it to transform the test split
vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
X_test = vec.transform(X_test.to_dict(orient='records'))
from sklearn.ensemble import RandomForestClassifier

# Baseline: a random forest with default settings
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
y_rfc_predict = rfc.predict(X_test)
print('The accuracy of rfc on testing set is', rfc.score(X_test, y_test))
from xgboost import XGBClassifier

# XGBoost classifier with default settings, for comparison
xgbc = XGBClassifier()
xgbc.fit(X_train, y_train)
print('The accuracy of xgbc on testing set is', xgbc.score(X_test, y_test))