Theory
TITLE
CatBoost: unbiased boosting with categorical features
AUTHOR
Liudmila Prokhorenkova, Gleb Gusev, et al.
ABSTRACT
This paper presents the key algorithmic techniques behind CatBoost, a new gradient boosting toolkit. Their combination leads to CatBoost outperforming other publicly available boosting implementations in terms of quality on a variety of datasets.
Two critical algorithmic advances introduced in CatBoost are the implementation of ordered boosting, a permutation-driven alternative to the classic algorithm, and an innovative algorithm for processing categorical features. Both techniques were created to fight a prediction shift caused by a special kind of target leakage present in all currently existing implementations of gradient boosting algorithms.
In this paper, we provide a detailed analysis of this problem and demonstrate that the proposed algorithms solve it effectively, leading to excellent empirical results.
INTRODUCTION
Gradient boosting is a powerful machine-learning technique that achieves state-of-the-art results in a variety of practical tasks.
We show in this paper that all existing implementations of gradient boosting face the following statistical issue.
《Distribution inconsistency》 A prediction model $F$ obtained after several steps of boosting relies on the targets of all training examples. We demonstrate that this actually leads to a shift of the distribution of $F(x_k) \mid x_k$ for a training example $x_k$ away from the distribution of $F(x) \mid x$ for a test example $x$. This finally leads to a prediction shift of the learned model.
《Categorical feature preprocessing》 Further, there is a similar issue in standard algorithms for preprocessing categorical features. One of the most effective ways to use them in gradient boosting is converting categories to their target statistics. A target statistic is a simple statistical model itself, and it can also cause target leakage and a prediction shift.
In this paper, we propose the ordering principle to solve both problems. Relying on it, we derive ordered boosting, a modification of the standard gradient boosting algorithm which avoids target leakage, and a new algorithm for processing categorical features. Their combination is implemented as an open-source library called CatBoost (for "Categorical Boosting"), which outperforms the existing state-of-the-art implementations of gradient boosted decision trees, XGBoost and LightGBM, on a diverse set of popular machine learning tasks.
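For orientation, here is a minimal usage sketch of the resulting open-source catboost Python package; the toy data, column names, and hyperparameter values are illustrative choices of mine, not taken from the paper:

```python
import pandas as pd
from catboost import CatBoostClassifier

# Toy data: two categorical columns and one numerical column (all values made up).
train = pd.DataFrame({
    "user_region": ["us", "de", "us", "fr", "de", "us"],
    "ad_id":       ["a1", "a7", "a1", "a3", "a7", "a9"],
    "price":       [1.2, 3.4, 0.7, 2.2, 3.1, 0.9],
})
y = [1, 0, 1, 0, 0, 1]

model = CatBoostClassifier(iterations=50, learning_rate=0.1, depth=2, verbose=False)
# cat_features (here given as column indices) tells CatBoost which columns
# to handle with its own categorical-feature encoding.
model.fit(train, y, cat_features=[0, 1])
print(model.predict_proba(train)[:, 1])
```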
BACKGROUND
CatBoost is an implementation of gradient boosting, which uses binary decision trees as base predictors.
CATEGORICAL FEATURES
Related work on categorical features
A categorical feature is one with a discrete set of values, called categories, that are not comparable to each other.
《One-hot encoding》 One popular technique for dealing with categorical features in boosted trees is one-hot encoding, i.e., for each category, adding a new binary feature indicating it. However, in the case of high-cardinality features (e.g., a "user ID" feature), such a technique leads to an infeasibly large number of new features.
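As a quick illustration of the cardinality problem (synthetic data, pandas used only for convenience):

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["us", "de", "fr", "us"],   # low cardinality: 3 new binary columns
    "user_id": ["u1", "u2", "u3", "u4"],   # high cardinality: one new column per user
})
encoded = pd.get_dummies(df, columns=["country", "user_id"])
print(encoded.shape)   # (4, 7) here; with millions of user IDs the width becomes infeasible
```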
《TS featurization》 To address this issue, one can group categories into a limited number of clusters and then apply one-hot encoding. A popular method is to group categories by target statistics (TS) that estimate the expected target value in each category.
Importantly, among all possible partitions of categories into two sets, an optimal split on the training data in terms of logloss, Gini index, or MSE can be found among the thresholds for the numerical TS feature.
In LightGBM, categorical features are converted to gradient statistics at each step of gradient boosting. Though providing important information for building a tree, this approach can dramatically increase (i) computation time, since it calculates statistics for each categorical value at each step, and (ii) memory consumption, since it must store which category belongs to which node for every split based on a categorical feature. LightGBM groups tail categories into one cluster and thus loses part of the information. Besides, the authors claim that it is still better to convert categorical features with high cardinality to numerical features.
Note that TS features require calculating and storing only one number per category.
Thus, using TS as new numerical features seems to be the most efficient way of handling categorical features with minimum information loss. TS are widely used, e.g., in the click prediction task (click-through rates), where categorical features such as user, region, ad, and publisher play a crucial role. We further focus on ways to calculate TS and leave one-hot encoding and gradient statistics out of the scope of the current paper. At the same time, we believe that the ordering principle proposed in this paper is also effective for gradient statistics.
Target statistics (TS)
As discussed in Section 3.1, an effective and efficient way to deal with a categorical feature $i$ is to substitute the category $x^i_k$ of the $k$-th training example with one numeric feature equal to some target statistic (TS) $\hat{x}^i_k$. Commonly, it estimates the expected target $y$ conditioned on the category: $\hat{x}^i_k \approx E(y \mid x^i = x^i_k)$.
Greedy TS
A straightforward approach is to estimate $E(y \mid x^i = x^i_k)$ as the average value of $y$ over the training examples with the same category $x^i_k$. This estimate is noisy for low-frequency categories, and one usually smoothes it by some prior $p$:

$$\hat{x}^i_k = \frac{\sum_{j=1}^{n} I(x^i_j = x^i_k)\, y_j + a\,p}{\sum_{j=1}^{n} I(x^i_j = x^i_k) + a}$$

where $a > 0$ is a hyperparameter and $p$ is usually set to the mean target value over the dataset.
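A minimal pandas sketch of this greedy TS; the function name, default prior, and toy data are my own illustration, not from the paper:

```python
import pandas as pd

def greedy_ts(cat, y, a=1.0, prior=None):
    """Greedy target statistic with additive smoothing, following the formula above.

    Each category c is mapped to (sum of y over examples with category c + a * prior)
    / (count of examples with category c + a). Note that the statistic for example k
    is computed over the whole training set, including y_k itself -- which is exactly
    the target leakage discussed next.
    """
    if prior is None:
        prior = y.mean()                      # p: here, the global target mean
    grouped = y.groupby(cat).agg(["sum", "count"])
    ts_per_category = (grouped["sum"] + a * prior) / (grouped["count"] + a)
    return cat.map(ts_per_category)

# Toy usage with made-up data
cat = pd.Series(["red", "blue", "red", "green", "blue", "red"])
y = pd.Series([1, 0, 1, 0, 1, 0])
print(greedy_ts(cat, y, a=1.0))
```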
The problem of such a greedy approach is target leakage: the feature $\hat{x}^i_k$ is computed using $y_k$, the target of $x_k$. This leads to a conditional shift: the distribution of $\hat{x}^i \mid y$ differs between training and test examples.
The following extreme example illustrates how dramatically this may affect the generalization error of the learned model.
Suppose the $i$-th feature is categorical and all of its values are unique. Then, for every training example $k$, $\sum_{j=1}^{n} I(x^i_j = x^i_k) = 1$, so the greedy TS reduces to $\hat{x}^i_k = \frac{y_k + a\,p}{1 + a}$, a value that directly encodes the target $y_k$ of that training example.
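A toy numerical sketch of this extreme case; the binary targets and the specific threshold are my assumptions for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n, a = 1000, 1.0
y = pd.Series(rng.integers(0, 2, size=n))             # binary targets (assumption for this demo)
cat = pd.Series([f"id_{k}" for k in range(n)])         # every category value is unique
p = y.mean()

# With unique categories every count is 1, so the greedy TS collapses to (y_k + a*p) / (1 + a):
ts_train = (y + a * p) / (1 + a)

# A single threshold on this "feature" separates the two target values perfectly on training data ...
threshold = (0.5 + a * p) / (1 + a)
print(((ts_train > threshold).astype(int) == y).mean())   # -> 1.0

# ... but at test time every category is unseen, the TS falls back to the prior p for all
# examples, and the feature carries no information, so the apparent skill does not generalize.
```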