grid_search_error

I’m trying to get to grips with scikit-learn for some simple machine learning projects, but I’m coming unstuck with Pipelines and wonder what I’ve done wrong.

I’m working through a tutorial on Kaggle.

Here’s my code:

import pandas as pd

train = pd.read_csv(local path to training data)
train_labels = pd.read_csv(local path to labels)


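import numpy as np

from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC
# In later scikit-learn releases this lives in sklearn.model_selection
from sklearn.grid_search import GridSearchCV

pca = PCA()
clf = LinearSVC()

# Candidate values for the grid search
n_components = np.arange(1, 39)
loss = ['l1', 'l2']
penalty = ['l1', 'l2']
C = np.arange(0, 1, .1)
whiten = [True, False]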

from sklearn.pipeline import Pipeline

# Set up the pipeline
pipe = Pipeline(steps=[('pca', pca), ('clf', clf)])

# Set up GridSearchCV
estimator = GridSearchCV(pipe, dict(pca__n_components=n_components,
                                    pca__whiten=whiten,
                                    clf__loss=loss,
                                    clf__penalty=penalty,
                                    clf__C=C))

> estimator

Returns:

GridSearchCV(cv=None,
       estimator=Pipeline(steps=[('pca', PCA(copy=True, n_components=None, whiten=False)), ('clf', LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2',
     random_state=None, tol=0.0001, verbose=0))]),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'clf__penalty': ['l1', 'l2'], 'clf__loss': ['l1', 'l2'], 'clf__C': array([ 0. ,  0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9]), 'pca__n_components': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38]), 'pca__whiten': [True, False]},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)

But when I try to fit the training data:

estimator.fit(train, train_labels)

The error is:

    428         for test_fold_idx, per_label_splits in enumerate(zip(*per_label_cvs)):
    429             for label, (_, test_split) in zip(unique_labels, per_label_splits):
--> 430                 label_test_folds = test_folds[y == label]
    431                 # the test split can be too big because we used
    432                 # KFold(max(c, self.n_folds), self.n_folds) instead of

IndexError: too many indices for array

Can anyone point me in the right direction?


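It turns out that the Pandas DataFrame of labels is the wrong shape: it is two-dimensional, (n_samples, 1), whereas the splitting code in the traceback expects a one-dimensional array of labels.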
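
To make the shape problem concrete, here is a minimal sketch that mimics the failing line from the traceback (test_folds[y == label]) with plain NumPy. The six toy labels are made up for illustration, and the sketch assumes the labels file has a single column and no header row, so that pandas names the column 0.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the labels file: one column, no header row,
# so pandas names the column 0 (which is why train_labels[0] selects it).
train_labels = pd.DataFrame([0, 1, 0, 1, 1, 0])

print(train_labels.shape)            # (6, 1): two-dimensional
print(train_labels[0].values.shape)  # (6,): one-dimensional

# The traceback's failing line indexes a 1-D array with `y == label`.
test_folds = np.zeros(len(train_labels), dtype=int)

# With 2-D labels the boolean mask is 2-D, which a 1-D array rejects.
error_message = None
y_2d = train_labels.values
try:
    test_folds[y_2d == 1]
except IndexError as exc:
    error_message = str(exc)
    print("2-D labels raise IndexError:", error_message)

# With 1-D labels the mask has the same shape as test_folds, so it works.
y_1d = train_labels[0].values
selected = test_folds[y_1d == 1]
print("1-D labels select", selected.shape[0], "entries")
```

Selecting a single column and calling .values on it is what turns the (n_samples, 1) table into the flat array that the splitter expects.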

estimator.fit(train.values, train_labels[0].values)

works, although I also had to drop the penalty term (clf__penalty) from the parameter grid.
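
The post does not say why the penalty term had to go. One plausible reason is that LinearSVC rejects some penalty/loss combinations: an l1 penalty cannot be combined with the hinge loss, and an l1 penalty with the squared hinge loss needs dual=False. A possible way to keep penalty in the search is to pass a list of grids, one per valid combination. The sketch below is written against a newer scikit-learn than the one in the question, so it assumes sklearn.model_selection and the loss names 'hinge' and 'squared_hinge' (the newer names for 'l1' and 'l2'). The data is synthetic and the grid values are cut down for speed; both are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Synthetic stand-in for the Kaggle data (assumption: 40 features, 2 classes).
X, y = make_classification(n_samples=200, n_features=40, random_state=0)

pipe = Pipeline(steps=[("pca", PCA()), ("clf", LinearSVC(max_iter=10000))])

# Settings shared by every sub-grid. C must be strictly positive,
# so the grid starts above 0.
shared = {
    "pca__n_components": [5, 20],
    "pca__whiten": [True, False],
    "clf__C": [0.1, 0.5, 1.0],
}

# One sub-grid per penalty/loss/dual combination that LinearSVC accepts.
param_grid = [
    dict(shared, clf__penalty=["l2"], clf__loss=["hinge"], clf__dual=[True]),
    dict(shared, clf__penalty=["l2"], clf__loss=["squared_hinge"], clf__dual=[True]),
    dict(shared, clf__penalty=["l1"], clf__loss=["squared_hinge"], clf__dual=[False]),
]

estimator = GridSearchCV(pipe, param_grid, cv=3)
estimator.fit(X, y)  # y is a 1-D array, as the splitter requires
print(estimator.best_params_)
```

GridSearchCV tries each dictionary in the list as its own grid, so invalid combinations are never constructed.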
