【10】Feature Selection

This post discusses the case of infinite hypothesis sets, showing that the number of training samples required grows roughly in proportion to the VC dimension of the hypothesis class. It then introduces cross-validation methods, such as simple (hold-out) cross-validation and k-fold cross-validation, which are used for model selection and for avoiding overfitting. Finally, it discusses the importance of feature selection and covers methods such as forward search, backward search, and filter feature selection.


The Case of Infinite H

In the previous lecture we covered the case of a finite hypothesis set; in this lecture we extend those results to infinite hypothesis sets.

Let us first consider an intuitive argument. Suppose we have an infinite hypothesis set described by d parameters. When we store it on a computer in double-precision floating-point format, each number occupies 64 bits, so storing one hypothesis from the set requires 64d bits. Since each bit can only be 0 or 1, our "infinite" hypothesis set in fact contains at most k = 2^(64d) hypotheses. The set becomes finite because the computer approximates values as it stores them, effectively discretizing the continuum, and because the range of representable numbers is bounded. We raise this argument only to build intuition for how the conclusion of the previous lecture carries over to infinite hypothesis sets. Substituting k into the final formula of the previous lecture, to make the guarantee hold with probability at least 1 − δ we need

m ≥ (1/(2γ²)) · log(2k/δ) = (1/(2γ²)) · log(2^(64d+1)/δ) = O((d/γ²) · log(1/δ)) = O_{γ,δ}(d)

This formula shows that the number of training samples we need is roughly proportional to the number of parameters. Two aspects of this argument are unsatisfying: 1) an infinite hypothesis set cannot in general be faithfully represented by k = 2^(64d) hypotheses; and 2) the bound depends on the parameterization rather than on the hypothesis set itself, since the same set of hypotheses can be written with different numbers of parameters (a linear classifier with n + 1 parameters θi, for instance, can equally be written with 2n + 2 parameters ui, vi via θi = ui² − vi²). To overcome these problems, we first introduce the following definition.
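As a quick sanity check on this linear dependence on d, here is a minimal sketch (assuming the bound above; γ = 0.1 and δ = 0.05 are illustrative values, not from the lecture) that evaluates the required m for a few parameter counts:

```python
import math

def sample_bound(d, gamma=0.1, delta=0.05):
    """Sample size suggested by m >= (1/(2*gamma^2)) * log(2k/delta),
    with k = 2^(64d) hypotheses for d parameters stored as 64-bit doubles."""
    # log(2k/delta) = (64d + 1)*log(2) - log(delta), kept in log space
    # because 2^(64d) itself overflows almost immediately
    log_2k_over_delta = (64 * d + 1) * math.log(2) - math.log(delta)
    return log_2k_over_delta / (2 * gamma ** 2)

for d in (1, 10, 100):
    print(d, round(sample_bound(d)))  # grows linearly in d
```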

Given a set S = {x1, ..., xd}, we say that a hypothesis set H shatters S when H can realize every possible labeling of S; that is, for any labeling {y1, ..., yd} there exists some hypothesis h in H such that h(xi) = yi for all i = 1, ..., d.
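To make the definition concrete, here is a small sketch that tests shattering by brute-force enumeration. The hypothesis set and points are illustrative choices, not from the lecture: one-sided threshold classifiers h_t(x) = 1{x ≥ t} on the real line, with a few sampled thresholds standing in for the infinite class (only the threshold's position relative to the points matters):

```python
def shatters(H, S):
    """True if H realizes all 2^|S| labelings of the points in S."""
    realized = {tuple(h(x) for x in S) for h in H}
    return len(realized) == 2 ** len(S)

# One-sided thresholds h_t(x) = 1{x >= t}
H = [lambda x, t=t: int(x >= t) for t in (0.5, 1.5, 2.5)]

print(shatters(H, [1.0]))       # True: a single point can be labeled 0 or 1
print(shatters(H, [1.0, 2.0]))  # False: the labeling (1, 0) is never realized
```

Since this class shatters some one-point set but no two-point set, its VC dimension is 1.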

In many data analysis tasks, one is often confronted with very high-dimensional data. Feature selection techniques are designed to find the relevant feature subset of the original features, which can facilitate clustering, classification, and retrieval. The feature selection problem is essentially a combinatorial optimization problem, which is computationally expensive. Traditional feature selection methods address this issue by selecting the top-ranked features based on certain scores computed independently for each feature. These approaches neglect the possible correlation between different features and thus cannot produce an optimal feature subset. Inspired by recent developments in manifold learning and L1-regularized models for subset selection, we propose a new approach, called Multi-Cluster/Class Feature Selection (MCFS), for feature selection. Specifically, we select those features such that the multi-cluster/class structure of the data can be best preserved. The corresponding optimization problem can be solved efficiently, since it only involves a sparse eigenproblem and an L1-regularized least-squares problem. It is important to note that MCFS can be applied in supervised, unsupervised, and semi-supervised cases. If you find these algorithms useful, we would appreciate it very much if you cite the following works:

Deng Cai, Chiyuan Zhang, and Xiaofei He, "Unsupervised Feature Selection for Multi-Cluster Data," Proceedings of the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'10), July 2010.

Xiaofei He, Deng Cai, and Partha Niyogi, "Laplacian Score for Feature Selection," Advances in Neural Information Processing Systems 18 (NIPS'05), Vancouver, Canada, 2005.
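As a rough illustration of the MCFS pipeline described above, here is a minimal unsupervised sketch assembled from off-the-shelf scikit-learn pieces. It is an assumption-laden stand-in rather than the authors' released implementation: SpectralEmbedding approximates the sparse eigenproblem on the nearest-neighbor graph, Lasso plays the role of the L1-regularized least-squares step, and n_clusters, n_neighbors, and alpha are illustrative parameters:

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding
from sklearn.linear_model import Lasso

def mcfs_scores(X, n_clusters=5, n_neighbors=5, alpha=0.01):
    """MCFS-style feature scores: embed the data spectrally, then find
    features that predict the embedding under an L1 penalty."""
    # Step 1: spectral embedding of the k-NN graph (the sparse eigenproblem)
    Y = SpectralEmbedding(n_components=n_clusters,
                          affinity="nearest_neighbors",
                          n_neighbors=n_neighbors).fit_transform(X)
    # Step 2: one L1-regularized least-squares fit per embedding dimension
    W = np.column_stack([Lasso(alpha=alpha).fit(X, Y[:, k]).coef_
                         for k in range(n_clusters)])
    # Step 3: score each feature by its largest absolute coefficient
    return np.abs(W).max(axis=1)

# Usage: keep the d highest-scoring features
# scores = mcfs_scores(X)
# selected = np.argsort(scores)[::-1][:d]
```

Features that score highly are those whose sparse regressions best reproduce the multi-cluster structure, which is exactly the criterion the paper optimizes.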