sklearn Study Notes: SVM Example Summary 2 (Feature Selection with One-Way ANOVA)

Article link: https://blog.youkuaiyun.com/sqiu_11/article/details/58719935

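This post describes a feature-selection method that combines a support vector machine (SVM) with one-way analysis of variance (ANOVA), and improves classification accuracy by tuning the percentage of features that are kept.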


SVM with univariate feature selection (one-way ANOVA)

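This post belongs to the feature-selection part of machine learning, a preprocessing step applied to the data before training. Feature selection has a very large influence on the final result. The topic draws on mathematical statistics, in particular analysis of variance (ANOVA). What follows is only what I have found over the past few days; there is much more behind it, and I expect to deepen my understanding as I keep studying. Here I give just a first summary of this example and of the feature selection behind it.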
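To make the ANOVA part concrete, here is a minimal sketch of the one-way ANOVA F-value for a single feature, compared with what `feature_selection.f_classif` returns. The helper `anova_f_value` and the choice of pixel column 20 are my own assumptions for illustration; they are not part of the original example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.feature_selection import f_classif


def anova_f_value(feature, labels):
    """One-way ANOVA F-statistic of one feature, grouped by class label.

    Hypothetical helper written for illustration only.
    """
    classes = np.unique(labels)
    groups = [feature[labels == c] for c in classes]
    grand_mean = feature.mean()
    # Between-group sum of squares: spread of class means around the grand mean.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of samples around their own class mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(classes) - 1
    df_within = len(feature) - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)


digits = load_digits()
X, y = digits.data[:200], digits.target[:200]

column = 20  # arbitrary non-constant pixel, chosen only for illustration
manual_f = anova_f_value(X[:, column], y)
sklearn_f, p_values = f_classif(X[:, [column]], y)

print("manual F-value:   ", manual_f)
print("f_classif F-value:", sklearn_f[0])
```

A large F-value means the class means of that feature differ a lot relative to the scatter inside each class, which is why such features are ranked first by the selector.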
The example, with my notes, follows:

SVM-Anova

"""
=================================================
SVM-Anova: SVM with univariate feature selection
=================================================

This example shows how to perform univariate feature selection before running a
SVC (support vector classifier) to improve the classification scores.
"""
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets, feature_selection
# cross-validation; the old sklearn.cross_validation module was removed in scikit-learn 0.20
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

###############################################################################
# Import some data to play with
digits = datasets.load_digits()
y = digits.target
# Throw away data, to be in the curse of dimension settings
y = y[:200]
X = digits.data[:200]
n_samples = len(y)
X = X.reshape((n_samples, -1))
# add 200 non-informative features
X = np.hstack((X, 2 * np.random.random((n_samples, 200))))

###############################################################################
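# Create a feature-selection transform and an instance of SVM that we
# combine together to have a full-blown estimator
# feature_selection.f_classif: computes the ANOVA (analysis of variance) F-value
# of each feature for the provided samples
# feature_selection.SelectPercentile(score_func, percentile=10): keeps only the
# highest-scoring percentile of features (10% by default) and returns the selector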
transform = feature_selection.SelectPercentile(feature_selection.f_classif)
# ANOVA: Analysis of Variance
#http://baike.baidu.com/link?url=8ufVQvD2KZrWbS3VvvuhYDfw3dk8nSD84QRUNB1P864
#rW8XKSw6-P4-xGIHVkAEBHUIjQGFhFsPtQhazMQrUVmcAqLVDBkQKVXSb3MPq92QFhPaPmVyEgsMNF
#ZJ_p1B-QyQ-tHMQKFJB_recu1qG9nDDpfdDbwMAomoktviOFca
clf = Pipeline([('anova', transform), ('svc', svm.SVC(C=1.0))])

###############################################################################
# Plot the cross-validation score as a function of percentile of features
score_means = list()
score_stds = list()
percentiles = (1, 3, 6, 10, 15, 20, 30, 40, 60, 80, 100)

for percentile in percentiles:
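    # clf.set_params: sets parameters of this estimator; the "anova__" prefix
    # routes the value to the pipeline step named 'anova'.
    # This loop is a manual grid search: each candidate percentile (the share of
    # top-scoring features that is kept) is scored by cross-validation.
    # The goal is to see how the number of retained features affects accuracy;
    # it is not a multiple-comparison test.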
    clf.set_params(anova__percentile=percentile)
    # Compute cross-validation score using 1 CPU
    #http://scikit-learn.org/dev/modules/generated/sklearn.model_selection.
    #cross_val_score.html#sklearn.model_selection.cross_val_score
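    # cross_val_score: the simplest way to run cross-validation; cv sets the
    # number of folds (default is 5-fold since scikit-learn 0.22, 3-fold before)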
    this_scores = cross_val_score(clf, X, y, n_jobs=1)
    score_means.append(this_scores.mean())
    score_stds.append(this_scores.std())
# plt.errorbar draws the mean scores as a line, with the standard deviation as error bars
plt.errorbar(percentiles, score_means, np.array(score_stds))

plt.title(
    'Performance of the SVM-Anova varying the percentile of features selected')
plt.xlabel('Percentile')
plt.ylabel('Prediction rate')

plt.axis('tight')
plt.show()
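
The loop above tunes `anova__percentile` by hand. As a rough sketch, the same search can be handed to `GridSearchCV`, which refits the pipeline on the best percentile and lets us check which features were kept. The fixed random seed and `cv=5` are my own assumptions for repeatability; they are not part of the original example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.RandomState(0)  # assumption: fixed seed so runs are repeatable
digits = load_digits()
X, y = digits.data[:200], digits.target[:200]
n_pixels = X.shape[1]  # 64 real pixel features
# Append 200 non-informative features, as in the example above.
X = np.hstack((X, 2 * rng.random_sample((len(y), 200))))

pipeline = Pipeline([("anova", SelectPercentile(f_classif)),
                     ("svc", SVC(C=1.0))])
param_grid = {"anova__percentile": [1, 3, 6, 10, 15, 20, 30, 40, 60, 80, 100]}

# Every candidate percentile is scored with 5-fold cross-validation (assumption).
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

# Boolean mask over all 264 columns: True where the refit selector kept a feature.
mask = search.best_estimator_.named_steps["anova"].get_support()

print("best percentile:", search.best_params_["anova__percentile"])
print("best mean CV accuracy: %.3f" % search.best_score_)
print("kept pixel features:", int(mask[:n_pixels].sum()))
print("kept noise features:", int(mask[n_pixels:].sum()))
```

scikit-learn may warn about constant features here, because some border pixels of the digits images are always zero; the original example triggers the same warning. Comparing the counts of kept pixel and noise features is a quick check that the ANOVA ranking really favors the informative columns.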