sklearn Study Notes 3: Pipeline
To make data processing convenient, sklearn's pipeline module offers two composition modes: sequential and parallel.

1. Sequential composition, via the Pipeline class

The processing flow is configured through the steps parameter: a list of ('name', estimator) tuples, where name is a label you choose for that step and estimator is the processing object it wraps. Every step except the last must implement a transform method (i.e. the intermediate steps are transformers); the last step has no such restriction and is usually a model. The resulting pipeline object exposes all the methods of the final estimator.
In [42]: from sklearn.pipeline import Pipeline
    ...: from sklearn.svm import SVC
    ...: from sklearn.decomposition import PCA
    ...: pipe = Pipeline(steps=[('pca', PCA()), ('svc', SVC())])
    ...:
    ...: from sklearn.datasets import load_iris
    ...: iris = load_iris()
    ...: pipe.fit(iris.data, iris.target)
Out[42]:
Pipeline(steps=[('pca', PCA(copy=True, iterated_power='auto', n_components=None,
    random_state=None, svd_solver='auto', tol=0.0, whiten=False)),
  ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False))])
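To make the transform requirement concrete, here is a minimal sketch (my own illustration, not from the original session) of a custom intermediate step: anything that implements fit and transform can occupy one of the first n-1 slots, and inheriting from TransformerMixin supplies fit_transform for free.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

class ClipOutliers(BaseEstimator, TransformerMixin):
    # Hypothetical transformer: clip each feature to the [1st, 99th]
    # percentile range learned during fit.
    def fit(self, X, y=None):
        self.lo_ = np.percentile(X, 1, axis=0)
        self.hi_ = np.percentile(X, 99, axis=0)
        return self  # fit must return self so calls can be chained
    def transform(self, X):
        return np.clip(X, self.lo_, self.hi_)

# Any fit/transform object works as an intermediate step; only the
# final step (here SVC) may be a plain estimator without transform.
pipe2 = Pipeline(steps=[('clip', ClipOutliers()), ('svc', SVC())])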
Fitting produces a model that can be used directly for prediction. At prediction time the input is passed through the transforms starting from step 1, so no extra code is needed to preprocess the data being predicted. You can also call pipe.score(X, y) to get the model's accuracy on the set X (an example follows the transcripts below).
In [46]: pipe.predict(iris.data)
Out[46]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
In [47]: iris.target
Out[47]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
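As a quick check of the score method mentioned above (a sketch, assuming the pipe fitted in In [42] is still in scope): the two arrays just printed disagree in exactly two of 150 positions, so the training accuracy should come out to 148/150.

# Mean accuracy on the training set, delegated to the final SVC step.
# Expected to print roughly 0.98667 (148/150), matching the two
# mismatches visible in the transcripts above.
print(pipe.score(iris.data, iris.target))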
The make_pipeline function is a shorthand for the Pipeline class: just pass in the estimator instance for each step, with no need to name them; the lowercased class name is used automatically as each step's name.
In [49]: from sklearn.pipeline import make_pipeline
    ...: from sklearn.preprocessing import StandardScaler
    ...: from sklearn.naive_bayes import GaussianNB

In [50]: make_pipeline(StandardScaler(), GaussianNB())
Out[50]:
Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True,
    with_std=True)), ('gaussiannb', GaussianNB(priors=None))])

In [51]: p = make_pipeline(StandardScaler(), GaussianNB())

In [52]: p.steps
Out[52]:
[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)),
 ('gaussiannb', GaussianNB(priors=None))]
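A side note: besides the steps list, individual steps can also be fetched by name through the named_steps attribute, which is handy for inspecting a fitted transformer. A small sketch (assuming p is fitted first):

# Fit, then inspect the fitted scaler by its auto-generated name.
p.fit(iris.data, iris.target)
scaler = p.named_steps['standardscaler']
print(scaler.mean_)  # per-feature means learned by StandardScaler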
The parameters inside each step's estimator can also be reset afterwards with set_params, using the naming scheme stepname__paramname=value.
In [59]: p.set_params(standardscaler__with_mean=False)
Out[59]:
Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=False,
    with_std=True)), ('gaussiannb', GaussianNB(priors=None))])
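The same stepname__paramname convention is what lets a whole pipeline be tuned as a single estimator, e.g. with GridSearchCV. A sketch reusing the pca/svc pipe from In [42] (the grid values are purely illustrative):

from sklearn.model_selection import GridSearchCV

# Nested parameters are addressed with the same step__param syntax
# that set_params uses.
param_grid = {'pca__n_components': [2, 3],  # illustrative values
              'svc__C': [0.1, 1, 10]}
grid = GridSearchCV(pipe, param_grid=param_grid)
grid.fit(iris.data, iris.target)
print(grid.best_params_)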
2. Parallel composition, via FeatureUnion

FeatureUnion is likewise configured with (key, value) pairs, and its parameters can be set with set_params. The difference is that every step is computed independently on the same input, and FeatureUnion concatenates their results into a single array; the returned object does not expose the methods of a final estimator. This is very handy when, for example, some features need standardization and others a log transform or one-hot encoding, producing several feature blocks from which the important features are then selected (a sketch of that selection step follows the transcript below).
In [60]: from sklearn.pipeline import FeatureUnion

In [61]: from sklearn.preprocessing import StandardScaler

In [63]: from sklearn.preprocessing import FunctionTransformer

In [64]: from numpy import log1p

In [65]: step1 = ('Standar', StandardScaler())

In [66]: step2 = ('ToLog', FunctionTransformer(log1p))

In [67]: steps = FeatureUnion(transformer_list=[step1, step2])

In [68]: steps.fit_transform(iris.data)
Out[68]:
array([[-0.90068117,  1.03205722, -1.3412724 , ...,  1.5040774 ,
         0.87546874,  0.18232156],
       [-1.14301691, -0.1249576 , -1.3412724 , ...,  1.38629436,
         0.87546874,  0.18232156],
       [-1.38535265,  0.33784833, -1.39813811, ...,  1.43508453,
         0.83290912,  0.18232156],
       ...,
       [ 0.79566902, -0.1249576 ,  0.81962435, ...,  1.38629436,
         1.82454929,  1.09861229],
       [ 0.4321654 ,  0.80065426,  0.93335575, ...,  1.48160454,
         1.85629799,  1.19392247],
       [ 0.06866179, -0.1249576 ,  0.76275864, ...,  1.38629436,
         1.80828877,  1.02961942]])
In [69]: data = steps.fit_transform(iris.data)

In [70]: data.shape   # 8 features in total: 4 standardized + 4 log-transformed
Out[70]: (150, 8)

In [71]: iris.data.shape
Out[71]: (150, 4)
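To round off the use case described above (build several feature blocks, then select the important ones), here is a minimal sketch of chaining the FeatureUnion into a Pipeline with SelectKBest; the step names and k=4 are my own choices:

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline

# The union produces 8 columns; SelectKBest keeps the 4 most
# informative ones (by ANOVA F-score) before any model sees them.
union_pipe = Pipeline(steps=[('union', steps),
                             ('select', SelectKBest(f_classif, k=4))])
reduced = union_pipe.fit_transform(iris.data, iris.target)
print(reduced.shape)  # (150, 4)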