scikit-learn notes: Concatenating multiple feature extraction methods

This post walks through scikit-learn's Pipeline, FeatureUnion and GridSearchCV tools for combining multiple feature extraction methods and tuning their parameters: principal component analysis (PCA) is combined with univariate feature selection to build the feature set, and a support vector machine (SVM) is then trained and tuned on top of it.


Notes on the multiple-feature-extraction example from the scikit-learn gallery:
http://scikit-learn.org/stable/auto_examples/feature_stacker.html
Below is the example's source code with my understanding and annotations.
-------------------------------------------------------------
# Author: Andreas Mueller <amueller@ais.uni-bonn.de>
#
# License: BSD 3 clause

from sklearn.pipeline import Pipeline, FeatureUnion
# load Pipeline and FeatureUnion from the pipeline module
from sklearn.grid_search import GridSearchCV
# load GridSearchCV from grid_search (in newer scikit-learn it lives in sklearn.model_selection)
from sklearn.svm import SVC
# support vector machine classifier
from sklearn.datasets import load_iris
# the dataset used for this classification example
from sklearn.decomposition import PCA
# principal component analysis
from sklearn.feature_selection import SelectKBest
# univariate feature selection

iris = load_iris()
# load the data
X, y = iris.data, iris.target
# X holds the features, y the class labels

# This dataset is way too high-dimensional. Better do PCA:
pca = PCA(n_components=2)
# create the PCA object

# Maybe some original features were good, too?
selection = SelectKBest(k=1)
# create the univariate feature-selection object

# Build estimator from PCA and Univariate selection:
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
# concatenate the PCA components with the univariately selected feature

# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)
# the result is the combined feature matrix

svm = SVC(kernel="linear")
# support vector classifier with a linear kernel

# Do grid search over k, n_components and C:
pipeline = Pipeline([("features", combined_features), ("svm", svm)])
# build a pipeline so that every step can be given different parameters
# during cross-validation
param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__univ_select__k=[1, 2],
                  svm__C=[0.1, 1, 10])
# dictionary of candidate parameter values
grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=10)
# cross-validation turns the parameter grid into a grid of scores
grid_search.fit(X, y)
# find the best-scoring point on that grid
print(grid_search.best_estimator_)
# print the parameters of that point
 -------------------------------------------------------------------------
Output:
Pipeline(steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('pca', PCA(copy=True, n_components=2, whiten=False)), 
('univ_select', SelectKBest(k=2, score_func=<function f_classif at 0x045510B0>))],transformer_weights=None)), 
('svm', SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
--------------------------------------------------------------------------
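
Besides best_estimator_, the fitted grid_search above also exposes the best cross-validation score and the winning parameter combination; a minimal follow-up (best_score_ and best_params_ are the standard GridSearchCV attributes):

print(grid_search.best_score_)   # best mean cross-validated score found on the grid
print(grid_search.best_params_)  # the parameter combination that produced it
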
Translated notes on the relevant scikit-learn functions:
PCA(n_components=None, copy=True, whiten=False)
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA
Principal component analysis: linear dimensionality reduction that projects the data onto the directions of largest variance.
n_components : int, None, or 'mle'

Number of components to keep. If n_components is not set, all components are kept:

n_components == min(n_samples, n_features)

If n_components == 'mle', Minka's MLE is used to guess the dimension. If 0 < n_components < 1, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components.

copy : bool

If False, data passed to fit is overwritten, and running fit(X).transform(X) will not yield the expected results; use fit_transform(X) instead.


Methods

fit(X[, y]): Fit the model with X.
fit_transform(X[, y]): Fit the model with X and apply the dimensionality reduction on X.
get_covariance(): Compute data covariance with the generative model.
get_params([deep]): Get parameters for this estimator.
get_precision(): Compute data precision matrix with the generative model.
inverse_transform(X): Transform data back to its original space.
score(X[, y]): Return the average log-likelihood of all samples.
score_samples(X): Return the log-likelihood of each sample.
set_params(**params): Set the parameters of this estimator.
transform(X): Apply the dimensionality reduction on X.
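
As a quick illustration of n_components and the fit/transform methods, here is a minimal sketch on the iris data used in this post (explained_variance_ratio_ is the attribute holding the fraction of variance kept by each component):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X = iris.data                          # 150 samples, 4 features

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)           # project onto the 2 leading components
print(X_pca.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)   # variance explained by each component
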
SelectKBest(score_func=<function f_classif at 0x7f49246ca048>, k=10)
http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest
Select the k features with the highest scores.
Parameters:

score_func : callable

Function taking two arrays X and y, and returning a pair of arrays (scores, pvalues).

k : int or "all", optional, default=10

Number of top features to select. The "all" option bypasses selection, for use in a parameter search.

Methods

fit(X, y): Run the score function on (X, y) and get the appropriate features.
fit_transform(X[, y]): Fit to data, then transform it.
get_params([deep]): Get parameters for this estimator.
get_support([indices]): Get a mask, or integer index, of the features selected.
inverse_transform(X): Reverse the transformation operation.
set_params(**params): Set the parameters of this estimator.
transform(X): Reduce X to the selected features.
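
Similarly, a minimal SelectKBest sketch on the same data (f_classif is the default score_func; get_support returns a boolean mask over the original columns):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X, y = iris.data, iris.target

selection = SelectKBest(score_func=f_classif, k=2)
X_sel = selection.fit_transform(X, y)  # keep the 2 highest-scoring features
print(X_sel.shape)                     # (150, 2)
print(selection.get_support())         # mask of the selected columns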

FeatureUnion(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin)

Concatenates results of multiple transformer objects.

This estimator applies a list of transformer objects in parallel to the input data, then concatenates the results. This is useful to combine several feature extraction mechanisms into a single transformer.

Read more in the User Guide <feature_union>.

Parameters
----------
transformer_list : list of (string, transformer) tuples
    List of transformer objects to be applied to the data. The first half of each tuple is the name of the transformer.

n_jobs : int, optional
    Number of jobs to run in parallel (default 1).

transformer_weights : dict, optional
    Multiplicative weights for features per transformer. Keys are transformer names, values the weights.

Method resolution order:
    FeatureUnion
    sklearn.base.BaseEstimator
    sklearn.base.TransformerMixin
    __builtin__.object
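
A small FeatureUnion sketch combining the two transformers from this post; the transformer_weights values are arbitrary here and only show how per-transformer scaling is specified:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

iris = load_iris()
X, y = iris.data, iris.target

union = FeatureUnion(
    [("pca", PCA(n_components=2)), ("univ_select", SelectKBest(k=1))],
    transformer_weights={"pca": 1.0, "univ_select": 0.5},  # optional scaling per transformer
)
X_union = union.fit_transform(X, y)
print(X_union.shape)  # (150, 3): 2 PCA components + 1 selected feature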


SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, probability=False,
    tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1,
    decision_function_shape=None, random_state=None)

The support vector machine classifier.
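
A minimal sketch of fitting this classifier with the linear kernel used in the example (score returns the mean accuracy on the data it is given):

from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

svm = SVC(kernel="linear", C=1.0)
svm.fit(X, y)
print(svm.score(X, y))  # mean accuracy on the training data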

Pipeline(steps)

Pipeline of transforms with a final estimator.

Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be 'transforms', that is, they must implement fit and transform methods. The final estimator only needs to implement fit.

The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a '__', as in the example below.

Methods

decision_function(X): Applies transforms to the data, and the decision_function method of the final estimator.
fit(X[, y]): Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.
fit_predict(X[, y]): Applies fit_predict of the last step in the pipeline after transforms.
fit_transform(X[, y]): Fit all the transforms one after the other and transform the data, then use fit_transform on the transformed data using the final estimator.
get_params([deep]): Get parameters for this estimator.
inverse_transform(X): Applies the inverse transform to the data.
predict(X): Applies transforms to the data, and the predict method of the final estimator.
predict_log_proba(X): Applies transforms to the data, and the predict_log_proba method of the final estimator.
predict_proba(X): Applies transforms to the data, and the predict_proba method of the final estimator.
score(X[, y]): Applies transforms to the data, and the score method of the final estimator.
set_params(**params): Set the parameters of this estimator.
transform(X): Applies transforms to the data, and the transform method of the final estimator.
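
The step-name__parameter-name convention is what makes the param_grid keys in the example work; a small sketch of setting nested parameters by hand (set_params and get_params are the standard estimator methods):

from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([("pca", PCA(n_components=2)), ("svm", SVC(kernel="linear"))])
# <step name>__<parameter name> reaches into the named step:
pipe.set_params(pca__n_components=3, svm__C=10)
print(pipe.get_params()["svm__C"])  # 10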

GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True,
    cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise')
http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html#sklearn.grid_search.GridSearchCV

Exhaustive search over specified parameter values for an estimator.

Important members are fit, predict.

GridSearchCV implements a "fit" and a "score" method. It also implements "predict", "predict_proba", "decision_function", "transform" and "inverse_transform" if they are implemented in the estimator used.

The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.
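
Beyond best_estimator_, the fitted search object keeps the per-candidate scores. A minimal sketch, assuming a newer scikit-learn where GridSearchCV lives in sklearn.model_selection and exposes cv_results_ (the older sklearn.grid_search class used in this post stores a similar summary in grid_scores_ instead):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

search = GridSearchCV(SVC(kernel="linear"), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params, score)  # one line per candidate parameter setting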
-----------------------------------------------------------------------------------------------------------------------------------------
My English and technical knowledge are fairly limited, so please point out anything here that is wrong.
