LightGBM

  1. Faster training speed and higher efficiency: LightGBM uses a histogram-based algorithm, bucketing continuous feature values into discrete bins, which makes training much faster (see the sketch after this list).
  2. Lower memory usage: storing discrete bins in place of continuous values takes far less memory.
  3. Higher accuracy (compared with other boosting algorithms): its leaf-wise splitting strategy produces more complex trees than level-wise splitting, which is the main source of the accuracy gain. Leaf-wise growth can sometimes overfit, but this can be curbed by setting the max_depth parameter.
  4. Large-scale data handling: compared with XGBoost, its reduced training time makes it equally capable of handling large datasets.
  5. Support for parallel learning.
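
To make point 1 concrete, here is a minimal, illustrative sketch of the bucketing idea. It is not LightGBM's actual implementation; `max_bin` here simply mirrors LightGBM's default of 255:

```python
import numpy as np

# Continuous feature values are bucketed into a fixed number of bins, so
# split finding only scans bin boundaries instead of every distinct value.
values = np.random.RandomState(0).normal(size=1000)  # a continuous feature
max_bin = 255                                        # LightGBM's default max_bin
bin_edges = np.quantile(values, np.linspace(0, 1, max_bin + 1))
binned = np.searchsorted(bin_edges[1:-1], values)    # discrete bin index per sample
print(binned.min(), binned.max())                    # indices fall in [0, max_bin - 1]
```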

Binary classification

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import KFold

params = {'objective': 'binary',        # objective function
          'boosting': 'gbdt',
          'metric': {'binary_logloss', 'auc'},  # evaluation metrics
          'learning_rate': 0.02,
          'max_depth': -1,
          'num_leaves': 38,             # strongly affects the result; larger fits more, but too large overfits
          'feature_fraction': 0.9,
          'bagging_freq': 1,
          'bagging_fraction': 0.7,
          'bagging_seed': 11,

          'min_sum_hessian_in_leaf': 6,
          'min_data_in_leaf': 50,
          'lambda_l1': 0.1,             # L1 regularization
          # 'lambda_l2': 0.001,         # L2 regularization

          'verbosity': -1,
          'nthread': 4,                 # number of threads; -1 uses all threads, more threads run faster
          'random_state': 2019,         # random seed, keeps results reproducible across runs
          # 'device': 'gpu'             # speeds up training if the GPU build of lightgbm is installed
          }
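
The templates in this post assume that `train`, `train_x`, `train_y`, `test`, `features`, and `num_round` are already defined; they never appear in the original. A minimal, hypothetical setup on synthetic data, so the binary template can run end to end:

```python
# Hypothetical setup so the template runs standalone; replace with real data.
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=20, random_state=2019)
features = ['f{}'.format(i) for i in range(X.shape[1])]
train = pd.DataFrame(X[:1500], columns=features)
test = pd.DataFrame(X[1500:], columns=features)
train_x, train_y = train[features], y[:1500]
num_round = 10000  # upper bound on boosting rounds; early stopping picks the best iteration
```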

folds = KFold(n_splits=5, shuffle=True, random_state=2019)
prob_oof = np.zeros((train_x.shape[0], ))
test_pred_prob = np.zeros((test.shape[0], ))

## train and predict
feature_importance_df = pd.DataFrame()
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_x)):
    print("fold {}".format(fold_ + 1))
    trn_data = lgb.Dataset(train_x.iloc[trn_idx], label=train_y[trn_idx])
    val_data = lgb.Dataset(train_x.iloc[val_idx], label=train_y[val_idx])


    # note: in LightGBM >= 4.0 the last two keyword arguments moved to callbacks,
    # e.g. callbacks=[lgb.log_evaluation(20), lgb.early_stopping(60)]
    clf = lgb.train(params,
                    trn_data,
                    num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=20,
                    early_stopping_rounds=60)
    prob_oof[val_idx] = clf.predict(train_x.iloc[val_idx], num_iteration=clf.best_iteration)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

    test_pred_prob += clf.predict(test[features], num_iteration=clf.best_iteration) / folds.n_splits

threshold = 0.5
result = (test_pred_prob > threshold).astype(int)  # binarize the averaged fold probabilities
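
Since `prob_oof` holds out-of-fold probabilities for every training row, it gives an honest estimate of model quality; a small sketch (assuming the `train_y` defined above):

```python
from sklearn.metrics import roc_auc_score

# Out-of-fold AUC: every row was predicted by a model that never saw it.
print('OOF AUC: %.6f' % roc_auc_score(train_y, prob_oof))
```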

Regression

params = {'objective': 'regression',
          'boosting': 'gbdt',
          'metric': 'mae',
          'learning_rate': 0.02,
          'max_depth': -1,
          'num_leaves': 38,
          'feature_fraction': 0.9,
          'bagging_freq': 1,
          'bagging_fraction': 0.7,
          'bagging_seed': 11,

          'min_sum_hessian_in_leaf': 6,
          'min_data_in_leaf': 50,
          'lambda_l1': 0.1,

          'verbosity': -1,
          'nthread': 4,
          'random_state': 2019,
          # 'device': 'gpu'
          }


def mean_absolute_percentage_error(y_true, y_pred):
    # note: undefined when y_true contains zeros
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def smape_func(preds, dtrain):
    # custom eval in lgb.train's feval signature: (name, value, is_higher_better)
    label = dtrain.get_label()  # get_label() already returns a numpy array
    epsilon = 0.1
    summ = np.maximum(0.5 + epsilon, np.abs(label) + np.abs(preds) + epsilon)
    smape = np.mean(np.abs(label - preds) / summ) * 2
    return 'smape', float(smape), False
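
`smape_func` is never wired up in the original template; to actually use it, pass it as `feval` to the classic `lgb.train` API, for example:

```python
# Sketch: plug the custom metric into training alongside the params metric.
clf = lgb.train(params, trn_data, num_round,
                valid_sets=[trn_data, val_data],
                feval=smape_func)
```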


folds = KFold(n_splits=5, shuffle=True, random_state=2019)
oof = np.zeros(train_x.shape[0])
predictions = np.zeros(test.shape[0])

train_y = np.log1p(train_y)  # log-transform the target to reduce skew; invert with expm1 after predicting
feature_importance_df = pd.DataFrame()
for fold_n, (trn_idx, val_idx) in enumerate(folds.split(train_x)):
    print("fold {}".format(fold_n + 1))
    trn_data = lgb.Dataset(train_x.iloc[trn_idx], label=train_y.iloc[trn_idx])
    val_data = lgb.Dataset(train_x.iloc[val_idx], label=train_y.iloc[val_idx])

    clf = lgb.train(params,
                    trn_data,
                    num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=200,
                    early_stopping_rounds=200)
                    
    oof[val_idx] = clf.predict(train_x.iloc[val_idx], num_iteration=clf.best_iteration)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

    predictions += clf.predict(test, num_iteration=clf.best_iteration) / folds.n_splits

from sklearn.metrics import mean_squared_error, mean_absolute_error

# note: these scores are on the log1p scale, matching the transformed train_y
print('mse %.6f' % mean_squared_error(train_y, oof))
print('mae %.6f' % mean_absolute_error(train_y, oof))

result = np.expm1(predictions)  # invert the log1p transform applied to train_y above
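
`mean_absolute_percentage_error` above is otherwise unused; one way to apply it on the original target scale (a sketch, assuming the log1p transform from earlier):

```python
# Evaluate OOF predictions after undoing the log1p transform.
mape = mean_absolute_percentage_error(np.expm1(train_y), np.expm1(oof))
print('mape %.4f%%' % mape)
```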

Multiclass classification

params = {'objective': 'multiclass',
          'num_class': 33,
          'boosting': 'gbdt',
          'metric': 'multi_logloss',
          'learning_rate': 0.02,
          'max_depth': -1,
          'num_leaves': 38,
          'feature_fraction': 0.9,
          'bagging_freq': 1,
          'bagging_fraction': 0.7,
          'bagging_seed': 11,

          'min_sum_hessian_in_leaf': 6,
          'min_data_in_leaf': 50,
          'lambda_l1': 0.1,
          # 'lambda_l2': 0.001,

          'verbosity': -1,
          'nthread': 4,
          'random_state': 2019,
          # 'device': 'gpu'
          }

folds = KFold(n_splits=5, shuffle=True, random_state=2019)
prob_oof = np.zeros((train_x.shape[0], 33))
test_pred_prob = np.zeros((test.shape[0], 33))

## train and predict
feature_importance_df = pd.DataFrame()
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_x)):
    print("fold {}".format(fold_ + 1))
    trn_data = lgb.Dataset(train_x.iloc[trn_idx], label=train_y.iloc[trn_idx])
    val_data = lgb.Dataset(train_x.iloc[val_idx], label=train_y.iloc[val_idx])

    clf = lgb.train(params,
                    trn_data,
                    num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=20,
                    early_stopping_rounds=60)
    prob_oof[val_idx] = clf.predict(train_x.iloc[val_idx], num_iteration=clf.best_iteration)


    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

    test_pred_prob += clf.predict(test[features], num_iteration=clf.best_iteration) / folds.n_splits
result = np.argmax(test_pred_prob, axis=1)
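
`np.argmax` returns class indices 0..32; if the raw labels were encoded before training, map the indices back. A sketch with a hypothetical encoder (`le` and `raw_labels` are assumptions, not part of the original template):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical: fit the encoder on the raw class values before training,
# then invert it here to recover the original labels.
le = LabelEncoder().fit(raw_labels)   # raw_labels: original class values (assumed)
result_labels = le.inverse_transform(result)
```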

Feature importance curve

import matplotlib.pyplot as plt

def plot_feature_importance(train, model):
    # assumes the first column of `train` is an id/label column and skips it
    list_feature_name = list(train.columns[1:])
    # 'split' importance counts how many times each feature is used in a split
    list_feature_importance = list(model.feature_importance(importance_type='split', iteration=-1))
    dataframe_feature_importance = pd.DataFrame(
        {'feature_name': list_feature_name, 'importance': list_feature_importance})

    print(dataframe_feature_importance)
    x = range(len(list_feature_name))
    plt.xticks(x, list_feature_name, rotation=45, fontsize=10)
    plt.plot(x, list_feature_importance)
    plt.show()
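
The cross-validation loops above accumulate `feature_importance_df` but never summarize it; a short sketch for averaging importances across folds:

```python
# Mean importance per feature across the 5 folds, highest first.
mean_importance = (feature_importance_df
                   .groupby('Feature')['importance']
                   .mean()
                   .sort_values(ascending=False))
print(mean_importance.head(20))
```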


### LightGBM tutorial

LightGBM is an efficient gradient boosting framework characterized by fast training, low memory consumption, and high accuracy. Below are notes on installation, basic usage, and parameter tuning.

#### Installation

LightGBM can be installed with Python's pip tool. Run the following command to install the latest version:

```bash
pip install lightgbm
```

If GPU support is needed, NVIDIA CUDA and other dependencies must be configured additionally; see the official documentation for the detailed steps.

#### Basic usage

In practice, models are usually built through `lightgbm.LGBMClassifier` or `lightgbm.LGBMRegressor`. The following example shows how to create and train a binary classifier:

```python
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2)

# Create an LGBMClassifier instance
model = lgb.LGBMClassifier(
    boosting_type='gbdt',
    num_leaves=31,
    max_depth=-1,
    learning_rate=0.1,
    n_estimators=100
)

# Train the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')
```

The code above shows how to load data, define model hyperparameters, and train via `.fit()`.

#### Parameter tuning

For better performance, several important parameters usually need adjustment in real projects. They fall roughly into a few categories: core parameters, learning-control parameters, and metric parameters.

##### Core parameters

- **num_leaves**: maximum number of leaves per tree, default 31. Increasing it can improve fitting capacity but also invites overfitting.
- **max_depth**: maximum tree depth; when set, it constrains tree growth together with `num_leaves`.

##### Learning-control parameters

- **learning_rate**: the step size of each iteration; smaller values help reduce overfitting risk but may lengthen convergence.
- **n_estimators**: the total number of trees to build; usually best chosen with cross-validation.

##### Metric parameters

- **objective**: the objective function type, e.g. `'binary'` for binary classification and `'regression'` for continuous targets.
- **metric**: the evaluation metric, e.g. area under the ROC curve (`auc`) or mean squared error (`l2`).

More advanced techniques involve tuning the regularization terms (such as lambda_l1/lambda_l2) and the feature and sample subsampling settings.

#### Summary

Sensible settings and tuning of LightGBM's different parameter types can significantly improve results on machine-learning tasks, and the built-in API plus third-party library support make the whole workflow convenient and efficient.
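
To make the tuning advice concrete, here is a small sketch using sklearn's `GridSearchCV` over the sklearn-style wrapper; the grid values are illustrative, not recommendations, and `X_train`/`y_train` come from the example above:

```python
from sklearn.model_selection import GridSearchCV
import lightgbm as lgb

# Small, illustrative grid over the core/learning-control parameters above.
grid = GridSearchCV(
    estimator=lgb.LGBMClassifier(boosting_type='gbdt', random_state=2019),
    param_grid={'num_leaves': [15, 31, 63],
                'learning_rate': [0.02, 0.1],
                'n_estimators': [100, 300]},
    scoring='roc_auc',
    cv=5
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```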