[Updating] sklearn (22): Feature selection

This article introduces the feature selection methods in sklearn, including removing low-variance features, univariate feature selection, recursive feature elimination, L1-based feature selection, and tree-based feature selection, and shows how to use them to improve model performance.


Feature selection can be used for dimensionality reduction, which in turn improves an estimator's performance. The following sections introduce several feature selection methods in sklearn:

Removing features with low variance

We can decide whether to drop a feature by examining its variance: if a feature's variance is low, its values barely differ across samples, so by default it cannot distinguish samples well and should be discarded.
For a Boolean-valued feature, we can set a variance threshold and remove every feature whose variance falls below it. A Boolean feature is a Bernoulli variable, so its variance is p(1 - p), where p is the probability of one of the two Boolean values. For example, to drop features that take the same value in more than 80% of samples, set threshold = 0.8 × (1 − 0.8) = 0.16, as in the example below.
Code example:

sklearn.feature_selection.VarianceThreshold(threshold=0.0)  # remove features whose variance is below threshold

>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
>>> sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
>>> sel.fit_transform(X)
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])
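
To see which column was dropped, inspect the fitted selector's variances_ attribute (continuing the session above; each value equals p(1 - p) for its column):

>>> sel.variances_
array([0.13888889, 0.22222222, 0.25      ])

The first column's variance, 5/36 ≈ 0.139, falls below the 0.16 threshold, which is why it was removed.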

Univariate feature selection

Univariate feature selection picks the best features based on univariate statistical tests: each feature is assigned a score, and the k highest-scoring features are kept as the result of feature selection. In sklearn, the univariate scores can be computed with the following methods:
Chi-square test (chi-square statistic)
F-test
t-test
Mutual information
Multiple hypothesis testing (used by the FPR/FDR/FWE selectors below)
The univariate feature selection functions in sklearn are as follows:

sklearn.feature_selection.SelectKBest(score_func=<function f_classif>, k=10)
# score_func: function used to score each feature
# k: number of features to keep

sklearn.feature_selection.SelectPercentile(score_func=<function f_classif>, percentile=10)
# percentile: percentage of features to keep

sklearn.feature_selection.SelectFpr(score_func=<function f_classif>, alpha=0.05)  # based on a false positive rate test
# alpha: the highest p-value for features to be kept

sklearn.feature_selection.SelectFdr(score_func=<function f_classif>, alpha=0.05)
# alpha: an upper bound on the expected false discovery rate

sklearn.feature_selection.SelectFwe(score_func=<function f_classif>, alpha=0.05)  # based on the family-wise error rate (multiple hypothesis testing)
# alpha: the highest uncorrected p-value for features to keep

sklearn.feature_selection.GenericUnivariateSelect(score_func=<function f_classif>, mode='percentile', param=1e-05)
# mode: {'percentile', 'k_best', 'fpr', 'fdr', 'fwe'}
# param: parameter of the corresponding mode

For regression problems, score_func can be f_regression or mutual_info_regression.
For classification problems, score_func can be chi2, f_classif, or mutual_info_classif.
Note that the F-test only captures the degree of linear dependency between a feature and the target, while mutual information can capture any kind of dependency, but it needs more samples to be estimated accurately.
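
For example, chi2 can be combined with SelectKBest to keep the two best features of the iris dataset (this mirrors the example in the sklearn documentation):

>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.feature_selection import chi2
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)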

Recursive feature elimination

The core idea of recursive feature elimination is: use an estimator with a coef_ or feature_importances_ attribute to obtain an importance coefficient for each feature, then prune the least important feature in each iteration, until the number of remaining features equals the user-defined number.
The related functions are as follows:

sklearn.feature_selection.RFE(estimator, n_features_to_select=None, step=1, verbose=0)
# estimator: any estimator that can compute feature importances (coef_ or feature_importances_)
# n_features_to_select: the number of features to end up with
# step: the number of features to prune in each iteration

sklearn.feature_selection.RFECV(estimator, step=1, min_features_to_select=1, cv='warn', scoring=None, verbose=0, n_jobs=None)  # RFE with cross-validation, which selects the optimal number of features via CV
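
A minimal sketch of RFE on the iris data, using a linear-kernel SVC as the ranking estimator (kernel="linear" exposes the coef_ attribute that RFE needs; support_ and ranking_ on the fitted selector show which features were kept):

>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import RFE
>>> from sklearn.svm import SVC
>>> X, y = load_iris(return_X_y=True)
>>> selector = RFE(SVC(kernel="linear"), n_features_to_select=2, step=1).fit(X, y)
>>> selector.transform(X).shape
(150, 2)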

Feature selection using SelectFromModel

# This function prunes features in combination with any estimator that has a feature_importances_ or coef_ attribute.
sklearn.feature_selection.SelectFromModel(estimator, threshold=None, prefit=False, norm_order=1, max_features=None)
# estimator: any estimator with a feature_importances_ or coef_ attribute
# threshold: features whose coefficient/importance falls below this threshold are pruned
# prefit: whether the estimator has already been trained; if False, fit() is called before transform(), otherwise transform() can be called directly
# norm_order: order of the norm used to aggregate coef_ when it is 2-dimensional
# max_features: the maximum number of features to keep
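
A small sketch of the threshold parameter: besides a number, it also accepts strings such as "median", "mean", or a scaled variant like "1.25*mean" (the random forest here is just an illustrative choice of estimator):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    threshold="median",  # keep only features at or above the median importance
)
X_new = selector.fit_transform(X, y)  # prefit=False, so the forest is fitted internally
print(X_new.shape)  # typically (150, 2): the two petal features clear the median cut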

L1-based feature selection

A linear model with an L1 penalty performs feature selection, and it can be combined with SelectFromModel to obtain the desired number of features.
Such linear models include linear_model.Lasso, linear_model.LogisticRegression, and svm.LinearSVC. With SVMs and logistic regression, the parameter C controls sparsity (the smaller C, the fewer features selected); with Lasso, the higher alpha, the fewer features selected.
Code example:

>>> from sklearn.svm import LinearSVC
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
>>> model = SelectFromModel(lsvc, prefit=True)
>>> X_new = model.transform(X)
>>> X_new.shape
(150, 3)

Tree-based feature selection

Tree-based estimators can evaluate feature importance, so they can be combined with SelectFromModel to achieve feature selection.
Code example:

>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> clf = ExtraTreesClassifier(n_estimators=50)
>>> clf = clf.fit(X, y)
>>> clf.feature_importances_  
array([ 0.04...,  0.05...,  0.4...,  0.4...])
>>> model = SelectFromModel(clf, prefit=True)
>>> X_new = model.transform(X)
>>> X_new.shape               
(150, 2)

Feature selection as part of a pipeline

Feature selection and model fitting can be chained together in a pipeline (note that LinearSVC with penalty="l1" requires dual=False). Code example:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

clf = Pipeline([
  ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
  ('classification', RandomForestClassifier())
])
clf.fit(X, y)
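
After fitting, the boolean mask of the selected features can be recovered from the pipeline step (get_support() is available on all of the selectors described above):

mask = clf.named_steps['feature_selection'].get_support()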