Tuning tree depth (size) and the number of trees in XGBoost

This post runs a series of parameter-tuning experiments with XGBoost on the Otto dataset, covering the number of base classifiers (n_estimators), tree depth (max_depth), and the two combined. Each parameter setting is evaluated with 10-fold cross-validation, and the best parameter combination is identified at the end.


1. Number of trees (number of base classifiers)

# XGBoost on Otto dataset, Tune n_estimators
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot
# load data
data = read_csv('train.csv')
dataset = data.values
# split data into X and y
X = dataset[:,0:94]
y = dataset[:,94]
# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# grid search
model = XGBClassifier()
n_estimators = range(50, 400, 50)
param_grid = dict(n_estimators=n_estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, label_encoded_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
	print("%f (%f) with: %r" % (mean, stdev, param))
# plot
pyplot.errorbar(n_estimators, means, yerr=stds)
pyplot.title("XGBoost n_estimators vs Log Loss")
pyplot.xlabel('n_estimators')
pyplot.ylabel('Log Loss')
pyplot.savefig('n_estimators.png')

The grid contains 7 values of n_estimators (50 to 350 in steps of 50), and each candidate is evaluated with 10-fold cross-validation, so 7 × 10 = 70 models are trained in total.
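As a quick sanity check on that count (a small addition, not in the original code), the number of fits follows directly from the grid and the fold count:

# number of fits = number of candidate values x number of CV folds
n_candidates = len(list(range(50, 400, 50)))   # 50, 100, ..., 350 -> 7 values
n_folds = 10
print("%d candidates x %d folds = %d fits" % (n_candidates, n_folds, n_candidates * n_folds))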

The results are as follows:

Best: -0.001152 using {'n_estimators': 250}
-0.010970 (0.001083) with: {'n_estimators': 50}
-0.001239 (0.001730) with: {'n_estimators': 100}
-0.001163 (0.001715) with: {'n_estimators': 150}
-0.001153 (0.001702) with: {'n_estimators': 200}
-0.001152 (0.001702) with: {'n_estimators': 250}
-0.001152 (0.001704) with: {'n_estimators': 300}
-0.001153 (0.001706) with: {'n_estimators': 350}

The plot is shown below; the y-axis is the mean score from the 10-fold cross-validation.

Tune The Number of Trees in XGBoost

Although 250 is the best value, the scores barely change anywhere between 100 and 350.
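Because GridSearchCV refits the winning configuration on the full dataset by default (refit=True), the tuned model can be taken straight from the search object. A minimal sketch, continuing from the grid_result object above:

# grid_result.best_estimator_ is an XGBClassifier retrained with the best params
best_model = grid_result.best_estimator_
print(best_model.get_params()['n_estimators'])   # 250 for the run above
probs = best_model.predict_proba(X[:5])          # class probabilities, like any fitted classifier
print(probs.shape)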

2. Tree size (depth)

# XGBoost on Otto dataset, Tune max_depth
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot
# load data
data = read_csv('train.csv')
dataset = data.values
# split data into X and y
X = dataset[:,0:94]
y = dataset[:,94]
# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# grid search
model = XGBClassifier()
max_depth = range(1, 11, 2)
print(max_depth)
param_grid = dict(max_depth=max_depth)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold, verbose=1)
grid_result = grid_search.fit(X, label_encoded_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
	print("%f (%f) with: %r" % (mean, stdev, param))
# plot
pyplot.errorbar(max_depth, means, yerr=stds)
pyplot.title("XGBoost max_depth vs Log Loss")
pyplot.xlabel('max_depth')
pyplot.ylabel('Log Loss')
pyplot.savefig('max_depth.png')

Best: -0.001236 using {'max_depth': 5}
-0.026235 (0.000898) with: {'max_depth': 1}
-0.001239 (0.001730) with: {'max_depth': 3}
-0.001236 (0.001701) with: {'max_depth': 5}
-0.001237 (0.001701) with: {'max_depth': 7}
-0.001237 (0.001701) with: {'max_depth': 9}

Tune Max Tree Depth in XGBoost

3. Varying the number of trees and tree depth together

# XGBoost on Otto dataset, Tune n_estimators and max_depth
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot
import numpy
# load data
data = read_csv('train.csv')
dataset = data.values
# split data into X and y
X = dataset[:,0:94]
y = dataset[:,94]
# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# grid search
model = XGBClassifier()
n_estimators = [50, 100, 150, 200]
max_depth = [2, 4, 6, 8]
print(max_depth)
param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold, verbose=1)
grid_result = grid_search.fit(X, label_encoded_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
	print("%f (%f) with: %r" % (mean, stdev, param))
# plot results
scores = numpy.array(means).reshape(len(max_depth), len(n_estimators))
for i, value in enumerate(max_depth):
    pyplot.plot(n_estimators, scores[i], label='depth: ' + str(value))
pyplot.legend()
pyplot.xlabel('n_estimators')
pyplot.ylabel('Log Loss')
pyplot.savefig('n_estimators_vs_max_depth.png')

Results:

Best: -0.001141 using {'n_estimators': 200, 'max_depth': 4}
-0.012127 (0.001130) with: {'n_estimators': 50, 'max_depth': 2}
-0.001351 (0.001825) with: {'n_estimators': 100, 'max_depth': 2}
-0.001278 (0.001812) with: {'n_estimators': 150, 'max_depth': 2}
-0.001266 (0.001796) with: {'n_estimators': 200, 'max_depth': 2}
-0.010545 (0.001083) with: {'n_estimators': 50, 'max_depth': 4}
-0.001226 (0.001721) with: {'n_estimators': 100, 'max_depth': 4}
-0.001150 (0.001704) with: {'n_estimators': 150, 'max_depth': 4}
-0.001141 (0.001693) with: {'n_estimators': 200, 'max_depth': 4}
-0.010341 (0.001059) with: {'n_estimators': 50, 'max_depth': 6}
-0.001237 (0.001701) with: {'n_estimators': 100, 'max_depth': 6}
-0.001163 (0.001688) with: {'n_estimators': 150, 'max_depth': 6}
-0.001154 (0.001679) with: {'n_estimators': 200, 'max_depth': 6}
-0.010342 (0.001059) with: {'n_estimators': 50, 'max_depth': 8}
-0.001237 (0.001701) with: {'n_estimators': 100, 'max_depth': 8}
-0.001161 (0.001688) with: {'n_estimators': 150, 'max_depth': 8}
-0.001153 (0.001679) with: {'n_estimators': 200, 'max_depth': 8}

Tune The Number of Trees and Max Tree Depth in XGBoost

Suggestions:

The lines overlap, making it hard to see the relationship, but generally we can see the interaction we expect: fewer boosted trees are required with increased tree depth.
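One way around the overlapping lines (an alternative view, not part of the source article) is to render the same scores matrix as a heatmap, one cell per (max_depth, n_estimators) pair, reusing the scores array built in the plotting code above:

# heatmap of mean CV scores: rows = max_depth, columns = n_estimators
fig, ax = pyplot.subplots()
im = ax.imshow(scores, cmap='viridis')
ax.set_xticks(range(len(n_estimators)))
ax.set_xticklabels(n_estimators)
ax.set_yticks(range(len(max_depth)))
ax.set_yticklabels(max_depth)
ax.set_xlabel('n_estimators')
ax.set_ylabel('max_depth')
fig.colorbar(im, ax=ax, label='neg log loss')
fig.savefig('n_estimators_vs_max_depth_heatmap.png')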

Further, we would expect the increased complexity provided by deeper individual trees to result in greater overfitting of the training data, which would be exacerbated by having more trees, in turn resulting in a lower cross-validation score. We don't see this here, as our trees are not that deep nor do we have too many. Exploring this expectation is left as an exercise you could try yourself.
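A sketch of how that exercise could be set up (my addition, assuming the X, label_encoded_y and kfold objects from the code above are still in scope): ask GridSearchCV to also record training scores, then compare training and cross-validation log loss for deeper trees and larger ensembles. A training score that keeps improving while the cross-validation score stalls or degrades is the overfitting signal described above.

# explore overfitting: record train scores alongside CV scores for bigger models
param_grid = dict(max_depth=[6, 10, 14], n_estimators=[200, 400, 600])
grid_search = GridSearchCV(XGBClassifier(), param_grid, scoring="neg_log_loss",
                           n_jobs=-1, cv=kfold, return_train_score=True)
grid_result = grid_search.fit(X, label_encoded_y)
for train, cv, param in zip(grid_result.cv_results_['mean_train_score'],
                            grid_result.cv_results_['mean_test_score'],
                            grid_result.cv_results_['params']):
    # a train score much better than the CV score points to overfitting
    print("train %f  cv %f  gap %f  %r" % (train, cv, train - cv, param))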

https://machinelearningmastery.com/tune-number-size-decision-trees-xgboost-python/
