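I am using Python with the scikit-learn library to solve a binary classification problem on text documents. I would like to try different models and compare their results: mainly a Naive Bayes classifier, an SVM with K-fold CV, and CV=5. I am having difficulty combining all of these methods into one pipeline, given that the latter two models use GridSearchCV(). Because of concurrency issues I cannot have multiple pipelines running in a single implementation, so I need to implement all of the different models using one pipeline.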
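This is what I have so far:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# pipeline for naive bayes
naive_bayes_pipeline = Pipeline([
    # note: scikit-learn ignores stop_words when analyzer is a callable
    ('bow_transformer', CountVectorizer(analyzer=split_into_lemmas, stop_words='english')),
    ('tf_idf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])
# accessing and using the pipelines
naive_bayes = naive_bayes_pipeline.fit(train_data['data'], train_data['gender'])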
# pipeline for SVM
from sklearn.svm import SVC

svm_pipeline = Pipeline([
    ('bow_transformer', CountVectorizer(analyzer=split_into_lemmas, stop_words='english')),
    ('tf_idf', TfidfTransformer()),
    ('classifier', SVC())
])
# parameter grids: one for the linear kernel, one for the RBF kernel
param_svm = [
    {'classifier__C': [1, 10], 'classifier__kernel': ['linear']},
    {'classifier__C': [1, 10], 'classifier__gamma': [0.001, 0.0001], 'classifier__kernel': ['rbf']},
]
from sklearn.model_selection import GridSearchCV, StratifiedKFold  # scikit-learn >= 0.18

grid_svm_skf = GridSearchCV(
    svm_pipeline,  # pipeline from above
    param_grid=param_svm,  # parameters to tune via cross validation
    refit=True,  # fit using all data, on the best detected classifier
    n_jobs=-1,  # number of cores to use for parallelization; -1 uses "all cores"
    scoring='accuracy',
    cv=StratifiedKFold(n_splits=5),  # stratified 5-fold CV; the labels are passed to fit(), not to the splitter
)
svm_skf = grid_svm_skf.fit(train_data['data'], train_data['gender'])
predictions_svm_skf = svm_skf.predict(test_data['data'])
Edit 1:
The second pipeline is the only one that uses GridSearchCV(), and it never seems to get executed.
Edit 2:
Added more code to show how GridSearchCV() is used.
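To make the "one pipeline for all models" goal concrete, here is a minimal sketch of one way it could look. It treats the 'classifier' step itself as a grid parameter, so a single GridSearchCV run covers both Naive Bayes and the SVM. The toy texts and labels are hypothetical stand-ins for train_data['data'] and train_data['gender']. The default CountVectorizer analyzer replaces split_into_lemmas, and n_jobs=1 is an assumption chosen only to keep parallelism out of the picture.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data standing in for train_data['data'] / train_data['gender'].
texts = [
    "great movie loved it", "wonderful acting great plot",
    "loved the wonderful story", "great fun loved it",
    "wonderful great film", "loved this great cast",
    "terrible movie hated it", "awful acting terrible plot",
    "hated the awful story", "terrible boring hated it",
    "awful terrible film", "hated this awful cast",
]
labels = ["a"] * 6 + ["b"] * 6

# One pipeline; the final step is only a placeholder that the grid replaces.
pipeline = Pipeline([
    ("bow_transformer", CountVectorizer()),
    ("tf_idf", TfidfTransformer()),
    ("classifier", MultinomialNB()),
])

# Each dict is a separate sub-grid; the 'classifier' key swaps the estimator.
param_grid = [
    {"classifier": [MultinomialNB()]},
    {"classifier": [SVC()], "classifier__C": [1, 10],
     "classifier__kernel": ["linear"]},
    {"classifier": [SVC()], "classifier__C": [1, 10],
     "classifier__gamma": [0.001, 0.0001], "classifier__kernel": ["rbf"]},
]

grid = GridSearchCV(
    pipeline,
    param_grid=param_grid,
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=5),
    refit=True,   # refit the best candidate on all of the data
    n_jobs=1,     # assumption: run serially to sidestep concurrency issues
)
grid.fit(texts, labels)

# Name of the classifier class that achieved the best mean CV accuracy.
best_classifier = type(grid.best_estimator_.named_steps["classifier"]).__name__
```

After fitting, grid.best_params_ shows which classifier and settings scored best, and grid.cv_results_ holds the mean cross-validated accuracy of every candidate, which gives a side-by-side comparison of the models.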