Feature selection can be used for dimensionality reduction, which in turn can improve an estimator's performance. Below are several feature selection methods provided by sklearn:
Removing features with low variance
We can decide whether to discard a feature by looking at its variance: if a feature's variance is low, its values hardly differ across samples, so by default it cannot characterize the different samples well and should be dropped.
For boolean-valued features, we can set a variance threshold and remove every feature whose variance falls below it. A boolean feature is a Bernoulli variable, so its variance is p(1-p), where p is the probability of one of the boolean values occurring in that feature.
Code example:
sklearn.feature_selection.VarianceThreshold(threshold=0.0) # removes all features whose variance is below threshold
>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
>>> sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
>>> sel.fit_transform(X)
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])
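As a quick check of the p(1-p) formula (a minimal sketch, assuming only numpy), the per-column variances of the example above can be computed by hand; only the first column falls below the threshold of .8 * (1 - .8) = 0.16 and is therefore removed:
>>> import numpy as np
>>> X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])
>>> p = X.mean(axis=0)       # fraction of ones in each boolean column
>>> variances = p * (1 - p)  # ~0.139, ~0.222, 0.25 -> only column 0 is below 0.16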
Univariate feature selection
Univariate feature selection picks the best features based on univariate statistical tests: each feature is assigned a score, and the k highest-scoring features are kept as the result of feature selection. In sklearn, the univariate scores can be computed with the following methods:
Chi-square test (chi2)
F-test
t-test
Mutual information
Multiple hypothesis testing (used by the FPR/FDR/FWE selectors below)
The univariate feature selection functions in sklearn are as follows:
sklearn.feature_selection.SelectKBest(score_func=<function f_classif>, k=10)
#score_func: the function used to score each feature
#k: the number of features to keep
sklearn.feature_selection.SelectPercentile(score_func=<function f_classif>, percentile=10)
#percentile: the percentage of features to keep
sklearn.feature_selection.SelectFpr(score_func=<function f_classif>, alpha=0.05) #based on a false positive rate (FPR) test: keeps features whose p-value is below alpha
#alpha:The highest p-value for features to be kept
sklearn.feature_selection.SelectFdr(score_func=<function f_classif>, alpha=0.05)
#alpha: an upper bound on the expected false discovery rate
sklearn.feature_selection.SelectFwe(score_func=<function f_classif>, alpha=0.05) #based on the family-wise error rate (this involves multiple hypothesis testing)
#alpha:The highest uncorrected p-value for features to keep.
sklearn.feature_selection.GenericUnivariateSelect(score_func=<function f_classif>, mode='percentile', param=1e-05)
#mode: {'percentile', 'k_best', 'fpr', 'fdr', 'fwe'}
#param:Parameter of the corresponding mode.
For regression problems, score_func can be: f_regression, mutual_info_regression
For classification problems, score_func can be: chi2, f_classif, mutual_info_classif
Note that the F-test only estimates the degree of linear dependency between two random variables (a feature and the target), whereas mutual information can capture any kind of statistical dependency; being nonparametric, it requires more samples to be estimated accurately.
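For example, a short sketch (mirroring the usage shown in the sklearn documentation) that keeps the two best iris features according to the chi2 score:
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)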
Recursive feature elimination
The core idea of recursive feature elimination (RFE) is: use an estimator with an attribute like coef_ or feature_importances_ to obtain an importance weight for each feature, then at each iteration prune the least important feature(s), repeating until the number of remaining features equals the user-defined number.
The related functions are as follows:
sklearn.feature_selection.RFE(estimator, n_features_to_select=None, step=1, verbose=0)
#estimator: any estimator that exposes feature importances (coef_ or feature_importances_) will do
#n_features_to_select: the number of features to keep at the end
#step: the number of features to remove at each iteration
sklearn.feature_selection.RFECV(estimator, step=1, min_features_to_select=1, cv='warn', scoring=None, verbose=0, n_jobs=None) # RFE with built-in cross-validation, which selects the optimal number of features via CV.
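A minimal RFE sketch (adapted from the sklearn docstring example, assuming the synthetic make_friedman1 regression data and a linear SVR as the estimator); RFECV is used the same way, except that the number of features to keep is chosen by cross-validation:
>>> from sklearn.datasets import make_friedman1
>>> from sklearn.feature_selection import RFE
>>> from sklearn.svm import SVR
>>> X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
>>> selector = RFE(SVR(kernel="linear"), n_features_to_select=5, step=1)
>>> selector = selector.fit(X, y)
>>> selector.support_   # boolean mask of the 5 selected features
array([ True,  True,  True,  True,  True, False, False, False, False,
       False])
>>> selector.ranking_   # selected features are ranked 1
array([1, 1, 1, 1, 1, 6, 4, 3, 2, 5])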
Feature selection using SelectFromModel
# With this function, any estimator with a feature_importances_ or coef_ attribute can be used to prune features.
sklearn.feature_selection.SelectFromModel(estimator, threshold=None, prefit=False, norm_order=1, max_features=None)
#estimator: any estimator with a feature_importances_ or coef_ attribute will do
#threshold: features whose coefficient/importance is below this threshold are pruned
#prefit: whether the estimator has already been fitted; if False, fit() is called first and then transform(), otherwise transform() can be called directly
#norm_order: the order of the norm (L1, L2, ...) used to reduce coef_ to one importance value per feature when coef_ is 2-dimensional
#max_features: the maximum number of features to keep
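The two sections below use prefit=True; for the prefit=False workflow, here is a small hypothetical sketch (assuming a RandomForestClassifier on iris with threshold="median", so only features whose importance reaches the median importance are kept; for iris the two petal features typically dominate, leaving two features):
>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> selector = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0),
...                            threshold="median")  # keep features with importance >= median
>>> X_new = selector.fit_transform(X, y)            # prefit=False: fit() first, then transform()
>>> X_new.shape
(150, 2)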
L1-based feature selection
A linear model with an L1 penalty yields sparse coefficients and therefore acts as a feature selector; combining it with SelectFromModel gives the desired set of features.
Such linear models include linear_model.Lasso, linear_model.LogisticRegression, and svm.LinearSVC.
Code example:
>>> from sklearn.svm import LinearSVC
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
>>> model = SelectFromModel(lsvc, prefit=True)
>>> X_new = model.transform(X)
>>> X_new.shape
(150, 3)
Tree-based feature selection
Tree-based estimators can evaluate feature importance, and combining them with SelectFromModel achieves the same feature selection effect.
Code example:
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> clf = ExtraTreesClassifier(n_estimators=50)
>>> clf = clf.fit(X, y)
>>> clf.feature_importances_
array([ 0.04..., 0.05..., 0.4..., 0.4...])
>>> model = SelectFromModel(clf, prefit=True)
>>> X_new = model.transform(X)
>>> X_new.shape
(150, 2)
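To see which of the four iris features were kept (a small follow-up reusing the fitted model object from the example above), inspect the selector's boolean support mask; with the importances shown above, the two petal measurements are selected:
>>> model.get_support()
array([False, False,  True,  True])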
Feature selection as part of a pipeline
Feature selection and model fitting can be chained together with a Pipeline; a code example:
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

clf = Pipeline([
    ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
    ('classification', RandomForestClassifier())
])
clf.fit(X, y)