Feature engineering tricks for LightGBM and other GBDT models

How good a model is largely comes down to how the data is processed, that is, how its features are handled. Below are some feature-processing tricks, mainly aimed at GBDT models.

Source: https://www.kaggle.com/c/ieee-fraud-detection/discussion/108575

NAN Handling

If you give np.nan to LGBM, then at each tree node split it will split the non-NAN values and then send all the NANs to either the left child or the right child, depending on which is best. NANs therefore get special treatment at every node, which can lead to overfitting. By simply converting all NANs to a negative number lower than all non-NAN values (such as -999),

df[col].fillna(-999, inplace=True)
then LGBM will no longer overprocess NAN. Instead it will give it the same attention as other numbers. Try both ways and see which gives the highest CV.
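A minimal pandas sketch of the fill described above (the column name `col` and the values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [1.0, 5.0, np.nan, 3.0]})
# pick a sentinel lower than every non-NaN value, so the former NaNs
# sit at one end of every split instead of getting special treatment
df['col'] = df['col'].fillna(-999)
```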

Reducing memory usage: Label Encode / Factorize / Memory Reduction

df[col], _ = df[col].factorize()

if df[col].max() < 128: df[col] = df[col].astype('int8')
elif df[col].max() < 32768: df[col] = df[col].astype('int16')
else: df[col] = df[col].astype('int32')

for col in df.columns:
    if df[col].dtype == 'float64': df[col] = df[col].astype('float32')
    if df[col].dtype == 'int64': df[col] = df[col].astype('int32')
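The memory saving is easy to verify on toy data (the column name `card4` and its values are illustrative): factorizing an object column into small integers shrinks it considerably.

```python
import pandas as pd

df = pd.DataFrame({'card4': ['visa', 'mastercard', 'visa', 'discover'] * 1000})
before = df['card4'].memory_usage(deep=True)  # object dtype: one Python string per row

# factorize maps each distinct value to a small integer (NaN becomes -1)
df['card4'], _ = df['card4'].factorize()
df['card4'] = df['card4'].astype('int8')      # max code here is 2, so int8 is safe
after = df['card4'].memory_usage(deep=True)
```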

Handling Categorical Features

With categorical variables, you have the choice of telling LGBM that they are categorical, or you can tell LGBM to treat them as numerical (if you label encode them first). Either way, LGBM can extract the category classes. Try both ways and see which gives the highest CV. After label encoding, either cast the column to category dtype, or leave it as int to treat it as numeric:

df[col] = df[col].astype('category')
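A small sketch of the two routes side by side (the column name `card4` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'card4': ['visa', 'mastercard', 'visa']})
# label encode first; the ints alone are the "treat as numeric" route
df['card4'], _ = df['card4'].factorize()
# casting to category dtype is the "tell LGBM it is categorical" route
df['card4'] = df['card4'].astype('category')
```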

Splitting

A single (string or numeric) column can be made into two columns by splitting. For example a string column id_30 such as “Mac OS X 10_9_5” can be split into Operating System “Mac OS X” and Version “10_9_5”. Or for example number column TransactionAmt “1230.45” can be split into Dollars “1230” and Cents “45”. LGBM cannot see these pieces on its own, you need to split them.
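A sketch of both splits on toy rows, assuming the version is always the last space-separated token of `id_30`:

```python
import pandas as pd

df = pd.DataFrame({'id_30': ['Mac OS X 10_9_5', 'Windows 10'],
                   'TransactionAmt': [1230.45, 50.0]})

# split the OS string at its last space: everything before is the OS name
df['OS'] = df['id_30'].str.rsplit(' ', n=1).str[0]
df['OS_version'] = df['id_30'].str.rsplit(' ', n=1).str[1]

# split the amount into dollars (integer part) and cents
df['dollars'] = df['TransactionAmt'].astype(int)
df['cents'] = (df['TransactionAmt'] * 100 % 100).round().astype(int)
```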

Combining columns: Combining / Transforming / Interaction

Two (string or numeric) columns can be combined into one column. For example card1 and card2 can become a new column with

df['uid'] = df['card1'].astype(str)+'_'+df['card2'].astype(str)
This helps LGBM because by themselves card1 and card2 may not correlate with the target, so LGBM won't split on them at a tree node. But the interaction uid = card1_card2 may correlate with the target, and now LGBM will split on it. Numeric columns can be combined by adding, subtracting, multiplying, etc. A numeric example is

df['x1_x2'] = df['x1'] * df['x2']
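Both interactions on toy data (column values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'card1': [1111, 2222], 'card2': [10, 20],
                   'x1': [2.0, 3.0], 'x2': [5.0, 7.0]})
# string interaction: a composite id out of two individually weak columns
df['uid'] = df['card1'].astype(str) + '_' + df['card2'].astype(str)
# numeric interaction
df['x1_x2'] = df['x1'] * df['x2']
```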

Frequency Encoding

Frequency encoding is a powerful technique that allows LGBM to see whether column values are rare or common. For example, if you want LGBM to “see” which credit cards are used infrequently, try

temp = df['card1'].value_counts().to_dict()
df['card1_counts'] = df['card1'].map(temp)
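On toy data, the two lines above attach each card's occurrence count to every row, so a count of 1 flags a rarely used card:

```python
import pandas as pd

df = pd.DataFrame({'card1': [111, 111, 111, 222, 333]})
# map every value to how many times it appears in the column
temp = df['card1'].value_counts().to_dict()
df['card1_counts'] = df['card1'].map(temp)
```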
Aggregations / Group Statistics
Providing LGBM with group statistics allows LGBM to determine if a value is common or rare for a particular group. You calculate group statistics by providing pandas with 3 variables. You give it the group, variable of interest, and type of statistic. For example,

temp = df.groupby('card1')['TransactionAmt'].agg(['mean']) \
    .rename({'mean': 'TransactionAmt_card1_mean'}, axis=1).reset_index()
df = pd.merge(df, temp, on='card1', how='left')
The feature here adds to each row what the average TransactionAmt is for that row’s card1 group. Therefore LGBM can now tell if a row has an abnormal TransactionAmt for their card1 group.
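A runnable version of that group statistic on toy data (note the `reset_index()`, needed so `card1` is a column that `merge` can join on):

```python
import pandas as pd

df = pd.DataFrame({'card1': [111, 111, 222],
                   'TransactionAmt': [10.0, 30.0, 50.0]})
# group = card1, variable of interest = TransactionAmt, statistic = mean
temp = (df.groupby('card1')['TransactionAmt'].agg(['mean'])
          .rename({'mean': 'TransactionAmt_card1_mean'}, axis=1)
          .reset_index())
# broadcast the per-group mean back onto every row
df = pd.merge(df, temp, on='card1', how='left')
```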

Normalize / Standardize

You can normalize columns against themselves. For example

df[col] = ( df[col]-df[col].mean() ) / df[col].std()
Or you can normalize one column against another column. For example, if you create a group statistic (described above) giving the mean value of D3 for each week, you can then remove the time dependence with

df['D3_remove_time'] = df['D3'] - df['D3_week_mean']
The new variable D3_remove_time no longer increases as we advance in time, because we have normalized it against the effects of time.
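A toy sketch of that de-trending, using `groupby(...).transform('mean')` as one way to build the weekly mean (the `week` grouping column is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'week': [1, 1, 2, 2],
                   'D3': [10.0, 12.0, 20.0, 22.0]})  # D3 drifts up over time
# group statistic: mean of D3 per week, broadcast back to every row
df['D3_week_mean'] = df.groupby('week')['D3'].transform('mean')
# subtracting the weekly mean removes the upward drift
df['D3_remove_time'] = df['D3'] - df['D3_week_mean']
```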

Outlier handling: Outlier Removal / Relax / Smooth / PCA

Normally you want to remove anomalies from your data because they confuse your models. However, in this competition we want to find anomalies, so use smoothing techniques carefully. The idea behind these methods is to determine and remove uncommon values. For example, using the frequency encoding of a variable, you can remove all values that appear in less than 0.1% of rows by replacing them with a new value like -9999 (note that you should use a different value than what you used for NAN).
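A sketch of that relaxation, using the frequency encoding to find values below the 0.1% threshold (a string sentinel is used here because the toy column holds strings; for numeric columns use a value like -9999 as described above):

```python
import pandas as pd

# 2000 rows; 'z' appears once, i.e. in 0.05% of rows
s = pd.Series(['a'] * 1500 + ['b'] * 499 + ['z'])
# frequency encode: share of rows holding each value
freq = s.map(s.value_counts(normalize=True))
# replace values seen in under 0.1% of rows with one shared sentinel
s = s.where(freq >= 0.001, other='RARE')
```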
