XGBoost is an optimized gradient boosting library that has performed remarkably well in Kaggle competitions. This post walks through installing XGBoost on Windows. My environment: Windows 7, 64-bit, Python 2.7, Anaconda2.
1. Software Installation
To use XGBoost from Python on Windows, you first need to install three pieces of software: Python, Git, and MinGW.
1.1 Installing Python and Git
For Python, download the version you want from the official Python website. For Git there are several options; one is Git for Windows (https://gitforwindows.org/), which can be installed with the default options.
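To confirm both tools installed correctly, you can check their versions from a newly opened command window (both support the standard --version flag):
- python --version
- git --version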
1.2 Downloading XGBoost
Once Git is installed, the Start menu will contain a program called Git Bash. Opening it brings up a window similar to the Windows command prompt. In this Bash window, first use cd to enter the folder where you want to keep the XGBoost source, for example:
- cd C:/Users/Administrator.ZGC-20150403SJZ
Then download xgboost with the following command:
- git clone --recursive https://github.com/dmlc/xgboost
I hit an "RPC failed" error here; running git init resolved it. Then enter the following commands to pull in the submodules:
- cd xgboost
- git submodule init
- git submodule update
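Once these finish, you can confirm the submodules were fetched with Git's built-in status command:
- git submodule status
It should print a commit hash for each submodule, including dmlc-core and rabit, which we compile below.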
1.3 Installing MinGW-W64
The next step is to compile the XGBoost code we just downloaded, which requires MinGW-W64. Download the installer, double-click it, and click Next on the first screen of the setup wizard.
In the Architecture dropdown, select x86_64 and leave the other options at their defaults.
Click Next again and the installation will complete.
I used the default install path, C:\Program Files\mingw-w64\x86_64-6.3.0-posix-seh-rt_v5-rev1. The make command and runtime libraries then live in the folder that contains mingw32-make: C:\Program Files\mingw-w64\x86_64-6.3.0-posix-seh-rt_v5-rev1\mingw64\bin. Next, add this path to the system Path environment variable.
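Alternatively, if you only need the tools available inside the current Git Bash session, you can extend PATH there directly (adjust the path to match your install location):
- export PATH=$PATH:"/c/Program Files/mingw-w64/x86_64-6.3.0-posix-seh-rt_v5-rev1/mingw64/bin"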
After completing the steps above, close and reopen the Git Bash window. To confirm that the environment variable was added successfully, type the following command in Bash:
- which mingw32-make
If it was added successfully, the command should return the full path to mingw32-make, similar to this:
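/c/Program Files/mingw-w64/x86_64-6.3.0-posix-seh-rt_v5-rev1/mingw64/bin/mingw32-make
(This output assumes the default install path used above; yours will reflect wherever you installed MinGW-W64.)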
For convenience, you can alias mingw32-make to the shorter make:
- alias make='mingw32-make'
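An alias defined this way only lasts for the current Bash session. Assuming your Git Bash sources ~/.bashrc (the default in recent Git for Windows releases), you can make it permanent with:
- echo "alias make='mingw32-make'" >> ~/.bashrc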
2. Building XGBoost
Now we can build XGBoost. First, enter the xgboost folder:
- cd F:/tools/xgboost
Build the submodules separately with the commands below, one at a time. Note that each command must finish before you type the next one.
- cd dmlc-core
- make -j4
- cd ../rabit
- make lib/librabit_empty.a -j4
- cd ..
- cp make/mingw64.mk config.mk
- make -j4
Once the last command finishes, the whole build is complete.
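As a quick sanity check (based on where the build output was placed in XGBoost versions of that era; newer versions may differ), the compiled shared library should now be in the lib subfolder:
- ls lib
You should see xgboost.dll in the listing.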
Next, install the Python xgboost package under Anaconda. Open the Anaconda Prompt, change into the python-package subfolder of the XGBoost directory, and run the setup script:
- cd xgboost/python-package
- python setup.py install
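To verify the installation, a minimal import check is enough (xgboost exposes a __version__ attribute, so this simply confirms the package loads):
import xgboost as xgb
print(xgb.__version__)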
Finally, I ran a local script that calls xgboost, and it ran successfully:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# %matplotlib inline  (only needed when running inside a Jupyter notebook)
###################1 original data##################
train_df = pd.read_csv('input/train.csv', index_col=0)
test_df = pd.read_csv('input/test.csv', index_col=0)
print("type of train_df:" + str(type(train_df)))
#print(train_df.columns)
print("shape of train_df:" + str(train_df.shape))
print("shape of test_df:" + str(test_df.shape))
train_df.head()
#print(train_df.head())
##############################2 smooth label#################################
prices = pd.DataFrame({"price":train_df["SalePrice"], "log(price+1)":np.log1p(train_df["SalePrice"])})
print("shape of prices:" + str(prices.shape))
prices.hist()
plt.show()
y_train = np.log1p(train_df.pop('SalePrice'))
print("shape of y_train:" + str(y_train.shape))
######################3 take train and test data together################
all_df = pd.concat((train_df, test_df), axis=0)
print("shape of all_df:" + str(all_df.shape))
######################4 make category data to string##########################
print(all_df['MSSubClass'].dtypes)
all_df['MSSubClass'] = all_df['MSSubClass'].astype(str)
all_df['MSSubClass'].value_counts()
print(all_df['MSSubClass'].value_counts())
#####################5 fill null#############################
all_dummy_df = pd.get_dummies(all_df)
print(all_dummy_df.head())
print(all_dummy_df.isnull().sum().sort_values(ascending=False).head())
mean_cols = all_dummy_df.mean()
print(mean_cols.head(10))
all_dummy_df = all_dummy_df.fillna(mean_cols)
print(all_dummy_df.isnull().sum().sum())
###############6 standardize numeric cols########################
numeric_cols = all_df.columns[all_df.dtypes != 'object']
print(numeric_cols)
numeric_col_means = all_dummy_df.loc[:, numeric_cols].mean()
numeric_col_std = all_dummy_df.loc[:, numeric_cols].std()
all_dummy_df.loc[:, numeric_cols] = (all_dummy_df.loc[:, numeric_cols] - numeric_col_means) / numeric_col_std
###############7 train model################################
dummy_train_df = all_dummy_df.loc[train_df.index]
dummy_test_df = all_dummy_df.loc[test_df.index]
print("shape of dummy_train_df:" + str(dummy_train_df))
print("shape of dummy_test_df:" + str(dummy_test_df))
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
X_train = dummy_train_df.values
X_test = dummy_test_df.values
from xgboost import XGBRegressor
params = [1, 2, 3, 4, 5, 6]  # candidate max_depth values
test_scores = []
for param in params:
    clf = XGBRegressor(max_depth=param)
    # RMSE from 10-fold CV; scores come back as negated MSE, hence the minus sign
    test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))
plt.plot(params, test_scores)
plt.title("max_depth vs CV Error")
plt.show()
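The plot shows which max_depth gives the lowest cross-validated RMSE. As a follow-up (a minimal sketch, not part of the original script; max_depth=5 is just an illustrative choice, read the best value off your own plot), you can refit on the full training set and generate test-set predictions, remembering to undo the log(price+1) transform with expm1:
clf = XGBRegressor(max_depth=5)  # substitute the depth that minimized CV error
clf.fit(X_train, y_train)
y_pred = np.expm1(clf.predict(X_test))  # invert the log1p applied to SalePrice
submission = pd.DataFrame({'Id': test_df.index, 'SalePrice': y_pred})
submission.to_csv('submission.csv', index=False)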