Feature selection can be used for dimensionality reduction, which in turn can improve an estimator's performance. Below are several feature selection methods provided by sklearn:
Removing features with low variance
We can decide whether to discard a feature by looking at its variance: if a feature's variance is low, its values hardly differ across samples, so by default it cannot characterize the different samples well and should be dropped.
For boolean-valued features, we can set a variance threshold and remove every feature whose variance falls below it. A boolean feature is a Bernoulli variable, so its variance is p(1-p), where p is the probability of one of the boolean values occurring in that feature.
Code example:
sklearn.feature_selection.VarianceThreshold(threshold=0.0) # removes all features whose variance is below threshold
>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
>>> sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
>>> sel.fit_transform(X)
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])
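As a quick check of the p(1-p) formula (a minimal sketch, assuming only numpy), the per-column variances of the example above can be computed by hand; only the first column falls below the threshold of .8 * (1 - .8) = 0.16 and is therefore removed:
>>> import numpy as np
>>> X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])
>>> p = X.mean(axis=0)       # fraction of ones in each boolean column
>>> variances = p * (1 - p)  # ~0.139, ~0.222, 0.25 -> only column 0 is below 0.16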
Univariate feature selection
Univariate feature selection picks the best features based on univariate statistical tests: each feature is assigned a score, and the k highest-scoring features are kept as the result of feature selection. In sklearn, the univariate scores can be computed with the following methods:
Chi-square test (chi2)
F-test
t-test
Mutual information
Multiple hypothesis testing (used by the FPR/FDR/FWE selectors below)
The univariate feature selection functions in sklearn are as follows:
sklearn.feature_selection.SelectKBest(score_func=<function f_classif>, k=10)
#score_func: the function used to score each feature
#k: the number of features to keep
sklearn.feature_selection.SelectPercentile(score_func=<function f_classif>, percentile=10)
#percentile: the percentage of features to keep
sklearn.feature_selection.SelectFpr(score_func=<function f_classif>, alpha=0.05) #based on a false positive rate (FPR) test: keeps features whose p-value is below alpha
#alpha:The highest p-value for features to be kept
sklearn.feature_selection.SelectFdr(score_func=<function f_classif>, alpha=0.05)
#alpha: an upper bound on the expected false discovery rate
sklearn.feature_selection.SelectFwe(score_func=<function f_classif>, alpha=0.05) #based on the family-wise error rate (this involves multiple hypothesis testing)
#alpha:The highest uncorrected p-value for features to keep.
sklearn.feature_selection.GenericUnivariateSelect(score_func=<function f_classif>, mode='percentile', param=1e-05)
#mode: {'percentile', 'k_best', 'fpr', 'fdr', 'fwe'}
#param:Parameter of the corresponding mode.
For regression problems, score_func can be: f_regression, mutual_info_regression
For classification problems, score_func can be: chi2, f_classif, mutual_info_classif
Note that the F-test only estimates the degree of linear dependency between two random variables (a feature and the target), whereas mutual information can capture any kind of statistical dependency; being nonparametric, it requires more samples to be estimated accurately.
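For example, a short sketch (mirroring the usage shown in the sklearn documentation) that keeps the two best iris features according to the chi2 score:
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)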
Recursive feature elimination
The core idea of recursive feature elimination (RFE) is: use an estimator with an attribute like coef_ or feature_importances_ to obtain an importance weight for each feature, then at each iteration prune the least important feature(s), repeating until the number of remaining features equals the user-defined number.
The related functions are as follows:
sklearn.feature_selection.RFE(estimator, n_features_to_select=None, step=1, verbose=0)
#estimator: any estimator that exposes feature importances (coef_ or feature_importances_) will do
#n_features_to_select: the number of features to keep at the end
#step: the number of features to remove at each iteration
sklearn.feature_selection.RFECV(estimator, step=1, min_features_to_select=1, cv='warn', scoring=None, verbose=0, n_jobs=None) # RFE with built-in cross-validation, which selects the optimal number of features via CV.
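A minimal RFE sketch (adapted from the sklearn docstring example, assuming the synthetic make_friedman1 regression data and a linear SVR as the estimator); RFECV is used the same way, except that the number of features to keep is chosen by cross-validation:
>>> from sklearn.datasets import make_friedman1
>>> from sklearn.feature_selection import RFE
>>> from sklearn.svm import SVR
>>> X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
>>> selector = RFE(SVR(kernel="linear"), n_features_to_select=5, step=1)
>>> selector = selector.fit(X, y)
>>> selector.support_   # boolean mask of the 5 selected features
array([ True,  True,  True,  True,  True, False, False, False, False,
       False])
>>> selector.ranking_   # selected features are ranked 1
array([1, 1, 1, 1, 1, 6, 4, 3, 2, 5])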
Feature selection using SelectFromModel
# With this function, any estimator with a feature_importances_ or coef_ attribute can be used to prune features.
sklearn.feature_selection.SelectFromModel(estimator, threshold=None, prefit=False, norm_order=1, max_features=None)
#estimator: any estimator with a feature_importances_ or coef_ attribute will do
#threshold: features whose coefficient/importance is below this threshold are pruned
#prefit: whether the estimator has already been fitted; if False, fit() is called first and then transform(), otherwise transform() can be called directly
#norm_order: the order of the norm (L1, L2, ...) used to reduce coef_ to one importance value per feature when coef_ is 2-dimensional
#max_features: the maximum number of features to keep
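The two sections below use prefit=True; for the prefit=False workflow, here is a small hypothetical sketch (assuming a RandomForestClassifier on iris with threshold="median", so only features whose importance reaches the median importance are kept; for iris the two petal features typically dominate, leaving two features):
>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> selector = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0),
...                            threshold="median")  # keep features with importance >= median
>>> X_new = selector.fit_transform(X, y)            # prefit=False: fit() first, then transform()
>>> X_new.shape
(150, 2)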
L1-based feature selection
A linear model with an L1 penalty yields sparse coefficients and therefore acts as a feature selector; combining it with SelectFromModel gives the desired set of features.
Such linear models include linear_model.Lasso, linear_model.LogisticRegression, and svm.LinearSVC.
Code example:
>>> from sklearn.svm import LinearSVC
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
>>> model = SelectFromModel(lsvc, prefit=True)
>>> X_new = model.transform(X)
>>> X_new.shape
(150, 3)
Tree-based feature selection
Tree-based estimators can evaluate feature importance, and combining them with SelectFromModel achieves the same feature selection effect.
Code example:
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> clf = ExtraTreesClassifier(n_estimators=50)
>>> clf = clf.fit(X, y)
>>> clf.feature_importances_
array([ 0.04..., 0.05..., 0.4..., 0.4...])
>>> model = SelectFromModel(clf, prefit=True)
>>> X_new = model.transform(X)
>>> X_new.shape
(150, 2)
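To see which of the four iris features were kept (a small follow-up reusing the fitted model object from the example above), inspect the selector's boolean support mask; with the importances shown above, the two petal measurements are selected:
>>> model.get_support()
array([False, False,  True,  True])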
Feature selection as part of a pipeline
Feature selection and model fitting can be chained together with a Pipeline; a code example:
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

clf = Pipeline([
    ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
    ('classification', RandomForestClassifier())
])
clf.fit(X, y)