Feature Selection Methods in Machine Learning: Principles and Code (Python)

 

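I. Feature selection: univariate feature selection

1. SelectKBest scores each feature by its relevance to the target and keeps the k highest-scoring features.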

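Analysis of variance (ANOVA)

Use f_classif for classification problems and f_regression for regression problems.

f_classif: classification tasks

Split the samples into n subsets S1, S2, ..., Sn according to the target class. For a feature to be informative, its subset means μ1, μ2, ..., μn should not all be equal.

The null hypothesis is H0: μ1 = μ2 = ... = μn. We hope to reject H0, so the larger the F statistic, the better. Here F is the ratio of the between-class variance to the within-class variance of the i-th feature xi. A large F value means xi separates the classes well, so xi is useful for predicting the class.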
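To make the F statistic concrete, here is a minimal sketch that computes the one-way ANOVA F value for each feature by hand and compares it with f_classif. The helper name anova_f and the choice of the iris dataset are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif


def anova_f(x, y):
    """One-way ANOVA F statistic for one feature x, grouped by class labels y."""
    classes = np.unique(y)
    n, k = len(x), len(classes)
    grand_mean = x.mean()
    # Between-class sum of squares: distance of each class mean from the grand mean.
    ss_between = sum(
        (y == c).sum() * (x[y == c].mean() - grand_mean) ** 2 for c in classes
    )
    # Within-class sum of squares: spread of samples around their own class mean.
    ss_within = sum(((x[y == c] - x[y == c].mean()) ** 2).sum() for c in classes)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


data = load_iris()
X, y = data.data, data.target

manual_f = np.array([anova_f(X[:, i], y) for i in range(X.shape[1])])
sklearn_f, p_values = f_classif(X, y)

print(manual_f)
print(sklearn_f)
```

The two printed arrays should agree up to floating-point error, which suggests that f_classif is scoring each feature with exactly this between-class to within-class variance ratio.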

f_regression: regression tasks

Reference: https://blog.youkuaiyun.com/jetFlow/article/details/78884619

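To compute the F value in f_regression, first compute the sample correlation coefficient between the i-th feature and the dependent variable y, where n is the number of samples:

$$r_i = \frac{\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(y_j - \bar{y})}{\sqrt{\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)^2}\,\sqrt{\sum_{j=1}^{n} (y_j - \bar{y})^2}}$$

The F value used by f_regression is then

$$F_i = \frac{r_i^2}{1 - r_i^2}\,(n - 2),$$

which follows an F(1, n-2) distribution under the null hypothesis that feature i and y are uncorrelated.

The larger the F value, the stronger the linear correlation between feature i and the dependent variable y, and features are selected on that basis.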
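The relationship between the correlation coefficient and the F value can be checked numerically. The sketch below assumes the bundled diabetes dataset as example data and the default centering behaviour of f_regression; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import f_regression

X, y = load_diabetes(return_X_y=True)
n = X.shape[0]

# Center the features and the target, then compute the sample correlation
# coefficient between every feature column and y.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
r = (Xc * yc[:, None]).sum(axis=0) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())

# Convert each correlation into an F value with (1, n - 2) degrees of freedom.
manual_f = r ** 2 / (1 - r ** 2) * (n - 2)

sklearn_f, p_values = f_regression(X, y)

print(manual_f)
print(sklearn_f)
```

Because the F value is a monotonically increasing function of the squared correlation, ranking features by F is the same as ranking them by the absolute value of their correlation with y.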
 

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Feature selection: keep the 2 features with the highest ANOVA F scores
data = load_iris()
selector = SelectKBest(f_classif, k=2)
dataK = selector.fit_transform(data.data, data.target)  # reduced matrix, shape (150, 2)
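After fitting, it is often useful to see which features were kept and what score each one received. A small self-contained sketch follows, again assuming the iris data; the names selector, mask and selected are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

data = load_iris()
selector = SelectKBest(f_classif, k=2).fit(data.data, data.target)

# Boolean mask over the original columns: True for the features that were kept.
mask = selector.get_support()
selected = [name for name, keep in zip(data.feature_names, mask) if keep]

# One F score per original feature, in the original column order.
for name, score in zip(data.feature_names, selector.scores_):
    print(f"{name}: F = {score:.2f}")
print("selected:", selected)
```

On the iris data the two petal measurements receive far higher F scores than the sepal measurements, so they are the two features retained.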

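2. Model-based feature ranking

Build a predictive model between each individual feature and the response variable, and rank the features by how well each model predicts. The Pearson correlation coefficient is in fact equivalent to the standardized regression coefficient of a single-feature linear regression. If the relationship between a feature and the response variable is nonlinear, a tree-based method can be used instead.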
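One way to realize this idea is to fit a small tree on each feature alone and rank the features by cross-validated score. The sketch below is only an illustration under assumed choices: the diabetes dataset, a depth-3 DecisionTreeRegressor, 5-fold cross-validation and the R² metric are not prescribed by the text above.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

dataset = load_diabetes()
X, y = dataset.data, dataset.target

scores = []
for i, name in enumerate(dataset.feature_names):
    # Train on a single column so the score reflects that feature alone.
    model = DecisionTreeRegressor(max_depth=3, random_state=0)
    r2 = cross_val_score(model, X[:, [i]], y, scoring="r2", cv=5).mean()
    scores.append((float(r2), name))

# Highest cross-validated R^2 first.
ranking = sorted(scores, reverse=True)
for r2, name in ranking:
    print(f"{name}: mean R^2 = {r2:.3f}")
```

Using cross-validation rather than the training score matters here, because a tree can fit noise in a single feature and would otherwise make weak features look useful.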

In many data analysis tasks, one is often confronted with very high dimensional data. Feature selection techniques are designed to find the relevant feature subset of the original features, which can facilitate clustering, classification and retrieval. The feature selection problem is essentially a combinatorial optimization problem, which is computationally expensive. Traditional feature selection methods address this issue by selecting the top-ranked features based on scores computed independently for each feature. These approaches neglect the possible correlation between different features and thus cannot produce an optimal feature subset.

Inspired by recent developments in manifold learning and L1-regularized models for subset selection, we propose here a new approach, called Multi-Cluster/Class Feature Selection (MCFS), for feature selection. Specifically, we select those features such that the multi-cluster/class structure of the data can be best preserved. The corresponding optimization problem can be solved efficiently since it only involves a sparse eigen-problem and an L1-regularized least squares problem. It is important to note that MCFS can be applied in supervised, unsupervised and semi-supervised cases.

If you find these algorithms useful, we appreciate it very much if you can cite our following works:

Deng Cai, Chiyuan Zhang, Xiaofei He, "Unsupervised Feature Selection for Multi-cluster Data", 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'10), July 2010.

Xiaofei He, Deng Cai, and Partha Niyogi, "Laplacian Score for Feature Selection", Advances in Neural Information Processing Systems 18 (NIPS'05), Vancouver, Canada, 2005.