Featuretools Feature Selection Guide: Three Practical Methods for Optimizing the Feature Matrix

童香莺Wyman

Published 2025-06-06 09:01:11

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/gitblog_00786/article/details/148464294

featuretools project page: https://gitcode.com/gh_mirrors/fea/featuretools

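Introduction

In machine learning projects, the quality of feature engineering directly affects model performance. Featuretools, a tool for automated feature engineering, can generate a large number of features through Deep Feature Synthesis (DFS). Not every generated feature helps the model, however. This article walks through the three feature selection methods that Featuretools provides for pruning the feature matrix and improving model performance.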

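Why Feature Selection Is Needed

Automated feature generation often produces the following kinds of features:

Features that are mostly null
Single-value features that carry no discriminating information
Redundant features that are highly correlated with one another

These features increase computational cost and can also degrade model performance. Featuretools provides a dedicated function for each case.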

Setup

First, we need an EntitySet to use as example data:

import pandas as pd
import featuretools as ft
from featuretools.selection import (
    remove_highly_correlated_features,
    remove_highly_null_features,
    remove_single_value_features
)

# Load the demo flight dataset
es = ft.demo.load_flight(nrows=50)

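Method 1: Removing Highly Null Features

Problem

When the raw data contains columns with many missing values, features generated from those columns inherit the high missing rate. Such features usually contribute little to model training.

Solution

Use the remove_highly_null_features function: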

# Generate the feature matrix
fm, features = ft.dfs(
    entityset=es,
    target_dataframe_name="trip_logs",
    cutoff_time=pd.DataFrame({
        "trip_log_id": [30, 1, 2, 3, 4],
        "time": pd.to_datetime(["2016-09-22 00:00:00"] * 5)
    }),
    trans_primitives=[],
    agg_primitives=[],
    max_depth=2
)

# Remove features whose null rate exceeds 95% (the default threshold)
clean_fm = remove_highly_null_features(fm)

Custom Threshold

The null-rate threshold can be adjusted with the pct_null_threshold parameter:

# Remove features whose null rate exceeds 20%
clean_fm = remove_highly_null_features(fm, pct_null_threshold=0.2)
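To make the effect of the threshold concrete, here is a minimal plain-Python sketch of the filtering idea. It is not the Featuretools implementation: the helper names null_fraction and drop_highly_null, the toy column names, and the handling of a column that sits exactly at the threshold are all assumptions made for illustration.

```python
import math


def null_fraction(values):
    """Return the share of entries that are None or NaN."""
    missing = sum(
        1 for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    return missing / len(values)


def drop_highly_null(matrix, pct_null_threshold=0.95):
    """Keep only columns whose null fraction is below the threshold.

    Assumption: a column exactly at the threshold is dropped.
    """
    return {
        name: values
        for name, values in matrix.items()
        if null_fraction(values) < pct_null_threshold
    }


# Hypothetical toy feature matrix: column name -> list of values.
toy_fm = {
    "distance": [100.0, 250.0, 80.0, 410.0],      # no missing values
    "dep_delay": [5.0, None, 12.0, 3.0],          # 25% missing
    "security_delay": [None, None, None, None],   # 100% missing
}

default_result = drop_highly_null(toy_fm)
strict_result = drop_highly_null(toy_fm, pct_null_threshold=0.2)

print(sorted(default_result))  # only the fully null column is gone
print(sorted(strict_result))   # the 25%-null column is gone as well
```

Lowering the threshold from 0.95 to 0.2 is what removes the partially null dep_delay column in this toy data. On a real feature matrix, a lower threshold trades coverage for cleaner inputs in the same way.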

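Method 2: Removing Single-Value Features

Problem

Some features take the same value in every sample. They carry no discriminating information and are of no use to the model.

Basic Usage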

# Remove single-value features (NaN is not counted as a value)
new_fm, new_features = remove_single_value_features(fm, features=features)

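Handling NaN Values

By default, NaN is not counted as a distinct value, so a feature that holds one value plus NaN is treated as single-valued and removed. To count NaN as a value of its own: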

new_fm, new_features = remove_single_value_features(
    fm, 
    features=features, 
    count_nan_as_value=True
)
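The difference between the two settings is easiest to see on a column that holds one value plus some missing entries. The following plain-Python sketch mimics that counting rule. The helpers distinct_count and drop_single_value and the toy columns are hypothetical and only illustrate the idea; they are not the library's code.

```python
import math


def distinct_count(values, count_nan_as_value=False):
    """Count distinct values; None/NaN counts as one value only on request."""
    seen = set()
    has_nan = False
    for v in values:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            has_nan = True
        else:
            seen.add(v)
    return len(seen) + (1 if has_nan and count_nan_as_value else 0)


def drop_single_value(matrix, count_nan_as_value=False):
    """Keep only columns that have more than one distinct value."""
    return {
        name: values
        for name, values in matrix.items()
        if distinct_count(values, count_nan_as_value) > 1
    }


# Hypothetical toy feature matrix: column name -> list of values.
toy_fm = {
    "carrier_count": [3, 3, 3, 3],           # constant
    "diverted": [0.0, None, 0.0, 0.0],       # one value plus a missing entry
    "distance": [100.0, 250.0, 80.0, 410.0],
}

ignore_nan = drop_single_value(toy_fm)
nan_counts = drop_single_value(toy_fm, count_nan_as_value=True)

print(sorted(ignore_nan))  # "diverted" is treated as single-valued and dropped
print(sorted(nan_counts))  # "diverted" now has two values and survives
```

Counting NaN as a value is useful when missingness itself may be informative, since it keeps columns whose only variation is "present versus missing".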

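Method 3: Removing Highly Correlated Features

Problem

High correlation between features means redundant information, adds computational cost, and can hurt model stability.

How It Works

Featuretools:

Computes the correlation coefficients between features
Identifies feature pairs that are highly correlated (default threshold: 95%)
Keeps the simpler feature of each pair (based on feature depth)
Keeps the feature that appears earlier when both are equally complex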
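The steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the Featuretools implementation: it uses the Pearson correlation, treats column order as a stand-in for complexity (later columns are assumed to be more complex), and assumes no column is constant. The names pearson and drop_highly_correlated and the toy columns are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def drop_highly_correlated(matrix, pct_corr_threshold=0.95):
    """Drop the later column of every highly correlated pair."""
    names = list(matrix)
    dropped = set()
    # Walk backwards from the last column (assumed to be the most complex).
    for i in range(len(names) - 1, 0, -1):
        for j in range(i - 1, -1, -1):
            corr = pearson(matrix[names[i]], matrix[names[j]])
            if abs(corr) >= pct_corr_threshold:
                dropped.add(names[i])  # the earlier column is kept
                break
    return {name: vals for name, vals in matrix.items() if name not in dropped}


# Hypothetical toy feature matrix; the last column is a negated copy of the first.
toy_fm = {
    "distance": [100.0, 250.0, 80.0, 410.0],
    "dep_delay": [30.0, 45.0, 50.0, 20.0],
    "-(distance)": [-100.0, -250.0, -80.0, -410.0],
}

result = drop_highly_correlated(toy_fm)
print(list(result))  # the negated copy is removed, the original stays
```

Because the comparison uses the absolute value of the correlation, a negated copy of a column counts as highly correlated with it even though the coefficient is close to -1.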

Basic Usage

# Generate a feature matrix that includes negate transform features
fm, features = ft.dfs(
    entityset=es,
    target_dataframe_name="trip_logs",
    trans_primitives=["negate"],
    agg_primitives=[],
    max_depth=3
)

# Remove highly correlated features
new_fm, new_features = remove_highly_correlated_features(fm, features=features)

Advanced Configuration

Adjust the correlation threshold:

new_fm, new_features = remove_highly_correlated_features(
    fm, 
    features=features, 
    pct_corr_threshold=0.9  # set the threshold to 90%
)

Restrict which features are checked:

new_fm, new_features = remove_highly_correlated_features(
    fm,
    features=features,
    features_to_check=["air_time", "distance", "flights.distance_group"]
)

Protect important features from removal:

new_fm, new_features = remove_highly_correlated_features(
    fm,
    features=features,
    features_to_keep=["air_time", "distance", "flights.distance_group"]
)
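The two options answer different questions: features_to_check limits which columns are compared at all, while features_to_keep exempts columns from removal even if they are flagged. The sketch below illustrates that distinction on precomputed pairs. The function resolve_drops, its correlated_pairs input, and the column names are hypothetical, and the use of feature-matrix order to decide which column of a pair is dropped is an assumption rather than confirmed library behavior.

```python
def resolve_drops(columns, correlated_pairs,
                  features_to_check=None, features_to_keep=None):
    """Return the columns that would be dropped.

    columns: names in feature-matrix order (later = assumed more complex).
    correlated_pairs: set of frozensets naming pairs already known to be
        highly correlated (a hypothetical, precomputed input).
    """
    check = [c for c in columns
             if features_to_check is None or c in features_to_check]
    keep = set(features_to_keep or [])
    flagged = set()
    # Compare only the columns under check, from last to first.
    for i in range(len(check) - 1, 0, -1):
        for j in range(i - 1, -1, -1):
            if frozenset((check[i], check[j])) in correlated_pairs:
                flagged.add(check[i])
                break
    # Protected columns are never dropped, even when flagged.
    return [c for c in columns if c in flagged and c not in keep]


columns = ["distance", "air_time", "-(distance)", "-(air_time)"]
pairs = {
    frozenset(("distance", "-(distance)")),
    frozenset(("air_time", "-(air_time)")),
}

drop_all = resolve_drops(columns, pairs)
drop_checked = resolve_drops(
    columns, pairs, features_to_check=["distance", "-(distance)"]
)
drop_protected = resolve_drops(
    columns, pairs, features_to_keep=["-(distance)"]
)

print(drop_all)        # both negated columns
print(drop_checked)    # only the pair that was checked is affected
print(drop_protected)  # the protected column survives
```

In practice, restricting the check is a way to save computation on wide matrices, while the keep list is a way to preserve features that matter for interpretation regardless of their correlation.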