Official parameter documentation (in English):
http://xgboost.readthedocs.io/en/latest/how_to/param_tuning.html
http://xgboost.readthedocs.io/en/latest/parameter.html

A partial Chinese translation:
http://blog.youkuaiyun.com/zc02051126/article/details/46711047
1. Overview of the xgboost parameters
- Controlling overfitting (a parameter sketch follows this list)
    - Directly control model complexity
        - max_depth, min_child_weight, gamma
    - Add randomness to the trees being grown
        - subsample, colsample_bytree
        - reduce eta and increase num_round accordingly
- Handling imbalanced datasets
    - If you care about the ranking order of predictions (AUC)
        - scale_pos_weight
    - If you care about predicting reliable probabilities
        - max_delta_step
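As a concrete sketch of these knobs, here is a minimal example using the scikit-learn wrapper (xgb.XGBClassifier); the values are illustrative only, not tuned for any particular dataset:

import xgboost as xgb

# Illustrative, untuned settings: shallower trees plus larger
# min_child_weight and gamma directly limit model complexity;
# subsample and colsample_bytree add randomness; a smaller
# learning_rate (eta) is paired with more trees (n_estimators,
# i.e. num_round in the native interface).
clf = xgb.XGBClassifier(
    max_depth=4,
    min_child_weight=5,
    gamma=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    learning_rate=0.05,
    n_estimators=500,
)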
- Parameters in detail
- booster: [default=gbtree], either gbtree or gblinear; gbtree boosts tree-based models, gblinear boosts linear models
- silent: [default=0], whether to print runtime messages; 0 means print
- nthread: [default: maximum number of threads available], number of threads used at runtime
- num_pbuffer: [set automatically, no need to set it yourself], size of the prediction buffer, normally the number of training samples
- num_feature: [set automatically, no need to set it yourself], feature dimension
- eta: [default=0.3], range [0,1], learning rate: the step-size shrinkage applied at each boosting iteration
- gamma: [default=0], range [0, $\infty$], minimum loss reduction required to make a split
- max_depth: [default=6], range [0, $\infty$], maximum depth of a tree
- min_child_weight: [default=1], range [0, $\infty$], minimum sum of instance weights needed in a node; if a split would produce a node whose instance weight sum falls below this threshold, the node is not split further
- max_delta_step: [default=0], range [0, $\infty$], maximum weight estimate allowed for each tree; 0 means no constraint
- subsample: [default=1], range (0,1], fraction of the training samples randomly drawn to grow each tree
- colsample_bytree: [default=1], range (0,1], fraction of features sampled when constructing each tree
- colsample_bylevel: [default=1], range (0,1], fraction of features sampled for each split, at each level of the tree
- lambda: [default=1], L2 regularization coefficient on the weights
- alpha: [default=0], L1 regularization coefficient on the weights
- tree_method: string, [default='auto'], tree construction algorithm used by xgboost: 'auto', 'exact', 'approx', 'hist'
- lambda_bias: L2 regularization on the bias (linear booster)
- sketch_eps: [default=0.03], only used by the approximate greedy algorithm
- scale_pos_weight: [default=1], controls the balance of positive and negative weights
- updater: [default='grow_colmaker,prune'], a modular way to specify the sequence of tree updaters to run; normally does not need to be set by the user
- refresh_leaf: [default=1], parameter of the refresh updater; if 1, both tree leaves and tree node statistics are refreshed, otherwise only the node statistics
- process_type: [default='default'], the type of boosting process to run
- grow_policy: string [default='depthwise'], controls how new nodes are added: 'depthwise' splits the nodes closest to the root, 'lossguide' splits the nodes with the highest change in the loss function
- max_leaves: [default=0], maximum number of nodes to add; only relevant for the 'lossguide' grow policy
- max_bin: [default=256], only used when tree_method is 'hist'; maximum number of discrete bins for bucketing continuous features
- objective: [default=reg:linear], defines the learning task and the corresponding learning objective; the available objectives are:
    - "reg:linear": linear regression.
    - "reg:logistic": logistic regression.
    - "binary:logistic": logistic regression for binary classification, outputs probabilities.
    - "binary:logitraw": logistic regression for binary classification, outputs the raw score $w^Tx$ before the logistic transformation.
    - "count:poisson": Poisson regression for count data, outputs the mean of the Poisson distribution. In Poisson regression the default value of max_delta_step is 0.7 (used to safeguard optimization).
    - "multi:softmax": multiclass classification using the softmax objective; you must also set num_class (the number of classes).
    - "multi:softprob": same as softmax, but outputs a vector of ndata * nclass entries, which can be reshaped into an ndata-by-nclass matrix; each row holds the predicted probability of that sample belonging to each class.
    - "rank:pairwise": set XGBoost to do a ranking task by minimizing the pairwise loss.
- base_score: [default=0.5], the initial prediction score of all instances (global bias)
- eval_metric: [default depends on objective], evaluation metric(s) for the validation data; each objective has a default metric (rmse for regression, error for classification, mean average precision for ranking). Users can add multiple evaluation metrics. Python users must pass the metrics as a list of parameter pairs rather than a map, so that a later 'eval_metric' does not override an earlier one (see the sketch at the end of this list). The available metrics are:
    - "rmse": root mean square error
    - "logloss": negative log-likelihood
    - "error": binary classification error rate, computed as #(misclassified samples) / #(all samples)
    - "merror": multiclass classification error rate, computed the same way
    - "auc": area under the curve, for ranking evaluation
    - "ndcg": normalized discounted cumulative gain
    - "map": mean average precision
    - "ndcg@n", "map@n": n can be assigned as an integer to cut off the top positions in the lists for evaluation
    - "ndcg-", "map-", "ndcg@n-", "map@n-": in XGBoost, NDCG and MAP evaluate the score of a list without any positive samples as 1; by adding "-" to the metric name, XGBoost will evaluate these scores as 0, to be consistent under some conditions
- seed: [default=0], random number seed
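To tie the parameters above together, here is a minimal sketch of the native training interface; train_X/train_y and valid_X/valid_y are assumed to be already-prepared arrays, and the parameter values are illustrative only:

import xgboost as xgb

dtrain = xgb.DMatrix(train_X, label=train_y)
dvalid = xgb.DMatrix(valid_X, label=valid_y)

params = {
    'booster': 'gbtree',
    'objective': 'binary:logistic',
    'eta': 0.1,
    'max_depth': 6,
    'subsample': 0.8,
    'scale_pos_weight': 1,
    'seed': 0,
}
# Pass the metrics as a list of (name, value) pairs: in a plain dict
# a second 'eval_metric' key would overwrite the first.
plst = list(params.items()) + [('eval_metric', 'logloss'),
                               ('eval_metric', 'auc')]

bst = xgb.train(plst, dtrain, num_boost_round=300,
                evals=[(dtrain, 'train'), (dvalid, 'valid')])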
2. Basic usage of xgboost
import xgboost as xgb
# Set the parameters you need here
gbm = xgb.XGBClassifier(max_depth=3, n_estimators=300, learning_rate=0.05)
# Fit on the training set
gbm.fit(train_X, train_y)
# Predict
predictions = gbm.predict(test_X)
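A usage note: with a classification objective such as binary:logistic, gbm.predict returns class labels, while gbm.predict_proba(test_X) returns the per-class probabilities instead.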
An example from a Kaggle competition:
https://www.kaggle.com/cbrogan/xgboost-example-python/code/code
# This script shows you how to make a submission using a few
# useful Python libraries.
# It gets a public leaderboard score of 0.76077.
# Maybe you can tweak it and do better...?
import pandas as pd
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder
import numpy as np
# Load the data
train_df = pd.read_csv('../input/train.csv', header=0)
test_df = pd.read_csv('../input/test.csv', header=0)
# We'll impute missing values using the median for numeric columns and the most
# common value for string columns.
# This is based on some nice code by 'sveitser' at http://stackoverflow.com/a/25562948
from sklearn.base import TransformerMixin
class DataFrameImputer(TransformerMixin):
    def fit(self, X, y=None):
        self.fill = pd.Series([X[c].value_counts().index[0]
                               if X[c].dtype == np.dtype('O') else X[c].median()
                               for c in X],
                              index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)
feature_columns_to_use = ['Pclass','Sex','Age','Fare','Parch']
nonnumeric_columns = ['Sex']
# Join the features from train and test together before imputing missing values,
# in case their distribution is slightly different
big_X = pd.concat([train_df[feature_columns_to_use],
                   test_df[feature_columns_to_use]])
big_X_imputed = DataFrameImputer().fit_transform(big_X)
# XGBoost doesn't (yet) handle categorical features automatically, so we need to change
# them to columns of integer values.
# See http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing for more
# details and options
le = LabelEncoder()
for feature in nonnumeric_columns:
big_X_imputed[feature] = le.fit_transform(big_X_imputed[feature])
# Prepare the inputs for the model
train_X = big_X_imputed[0:train_df.shape[0]].values
test_X = big_X_imputed[train_df.shape[0]::].values
train_y = train_df['Survived']
# You can experiment with many other options here, using the same .fit() and .predict()
# methods; see http://scikit-learn.org
# This example uses the current build of XGBoost, from https://github.com/dmlc/xgboost
gbm = xgb.XGBClassifier(max_depth=3, n_estimators=300, learning_rate=0.05).fit(train_X, train_y)
predictions = gbm.predict(test_X)
# Kaggle needs the submission to have a certain format;
# see https://www.kaggle.com/c/titanic-gettingStarted/download/gendermodel.csv
# for an example of what it's supposed to look like.
submission = pd.DataFrame({ 'PassengerId': test_df['PassengerId'],
'Survived': predictions })
submission.to_csv("submission.csv", index=False)