Saving a model and its parameters in Python: GridSearch for the best model, saving and loading the parameters

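This article shows how to use GridSearchCV to find the best TfidfVectorizer parameters for a text classification task, with LogisticRegression as the baseline classifier. After training, cross-validation, and parameter tuning, the best parameters are saved on their own and loaded again for later use. Loading them fails, however, because the parameters are applied to the TfidfVectorizer incorrectly: they are passed directly to the vectorizer although their names refer to a pipeline step. The fix is to set the parameters on the Pipeline, using the same step name as in the search.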

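I like to run the following workflow:

Choose a model for text vectorization

Define a list of parameters

Apply a pipeline with GridSearchCV over the parameters, using LogisticRegression() as a baseline, to find the best model parameters

Save the best model (parameters)

Load the best model parameters so that a range of other classifiers can be applied on top of the model defined this way.

Here is code you can reproduce:

GridSearch: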

%%time
import joblib  # sklearn.externals.joblib was removed from scikit-learn; use the joblib package directly
import numpy as np
import pandas as pd
from gensim.utils import simple_preprocess
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline

np.random.seed(0)

data = pd.read_csv('https://pastebin.com/raw/dqKFZ12m')

# Each document is tokenized up front; the vectorizer joins the tokens back with ' '.join
X_train, X_test, y_train, y_test = train_test_split(
    [simple_preprocess(doc) for doc in data.text],
    data.label, random_state=0)

# Find best Tfidf model using LR
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(preprocessor=' '.join, tokenizer=None)),
    ('clf', LogisticRegression())
])

# Keys use the '<step name>__<parameter>' convention of Pipeline
parameters = {
    'tfidf__max_df': [0.25, 0.5, 0.75, 1.0],
    'tfidf__smooth_idf': (True, False),
    'tfidf__norm': ('l1', 'l2', None),
}

grid = GridSearchCV(pipeline, parameters, cv=2, verbose=1)
grid.fit(X_train, y_train)
print(grid.best_params_)

# Save model
#joblib.dump(grid.best_estimator_, 'best_tfidf.pkl', compress = 1) # this unfortunately includes the LogReg
joblib.dump(grid.best_params_, 'best_tfidf.pkl', compress = 1) # Only best parameters

Fitting 2 folds for each of 24 candidates, totalling 48 fits

{'tfidf__smooth_idf': True, 'tfidf__norm': 'l2', 'tfidf__max_df': 0.25}
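The count of 24 candidates follows directly from the grid: 4 values of max_df times 2 of smooth_idf times 3 of norm, each evaluated on cv=2 folds. A minimal sketch using only the standard library reproduces that arithmetic. It enumerates the combinations itself rather than calling scikit-learn, and the variable names `candidates`, `n_candidates`, and `n_fits` are illustrative only:

```python
from itertools import product

# Same grid as in the search above.
parameters = {
    'tfidf__max_df': [0.25, 0.5, 0.75, 1.0],
    'tfidf__smooth_idf': (True, False),
    'tfidf__norm': ('l1', 'l2', None),
}
cv = 2  # number of cross-validation folds used in the search

# One candidate per combination of values, keyed by parameter name.
names = list(parameters)
candidates = [dict(zip(names, values))
              for values in product(*parameters.values())]

n_candidates = len(candidates)
n_fits = n_candidates * cv  # cross-validation fits only
print(n_candidates, n_fits)
```

Adding one more value to any of the three lists multiplies the number of fits accordingly, which is worth keeping in mind before widening the grid.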

Loading the model with the best parameters:

from sklearn.model_selection import GridSearchCV

# Load best parameters
tfidf_params = joblib.load('best_tfidf.pkl')

pipeline = Pipeline([
    ('vec', TfidfVectorizer(preprocessor=' '.join, tokenizer=None).set_params(**tfidf_params)), # here is the issue?
    ('clf', LogisticRegression())
])

cval = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=5)
print("Cross-Validation Score: %s" % (np.mean(cval)))

ValueError: Invalid parameter tfidf for estimator TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), norm='l2',
preprocessor=,
smooth_idf=True, stop_words=None, strip_accents=None,
sublinear_tf=False, token_pattern='(?u)\b\w\w+\b',
tokenizer=None, use_idf=True, vocabulary=None). Check the list of available parameters with `estimator.get_params().keys()`.

Question:

How do I load the best parameters for the Tfidf model?

Answer

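This line:

joblib.dump(grid.best_params_, 'best_tfidf.pkl', compress = 1) # Only best parameters

saves the parameters of the pipeline, not those of the TfidfVectorizer. Each key carries the step name as a prefix (for example tfidf__max_df), so the dictionary is only valid for a pipeline that has a step named 'tfidf'. Do this instead: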

pipeline = Pipeline([
    # Change the name to be same as before
    ('tfidf', TfidfVectorizer(preprocessor=' '.join, tokenizer=None)),
    ('clf', LogisticRegression())
])

pipeline.set_params(**tfidf_params)
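If the goal is to reuse only the vectorizer settings with other classifiers, another option is to strip the step prefix from the saved keys so that they apply to a bare TfidfVectorizer. The sketch below is an illustration under stated assumptions, not part of the original answer: the helper `strip_step_prefix` and the file name `best_tfidf.json` are hypothetical, and it stores the dictionary as JSON instead of with joblib, which works here only because the values are plain Python types.

```python
import json
import os
import tempfile

def strip_step_prefix(params, step):
    """Return only the entries for `step`, with the 'step__' prefix removed."""
    prefix = step + '__'
    return {key[len(prefix):]: value
            for key, value in params.items()
            if key.startswith(prefix)}

# Best parameters as printed by the grid search above.
best_params = {'tfidf__smooth_idf': True, 'tfidf__norm': 'l2', 'tfidf__max_df': 0.25}

# Keys become 'smooth_idf', 'norm', 'max_df': names the vectorizer itself accepts.
vec_params = strip_step_prefix(best_params, 'tfidf')

# Save and reload the vectorizer-only parameters (hypothetical file name).
path = os.path.join(tempfile.gettempdir(), 'best_tfidf.json')
with open(path, 'w') as fh:
    json.dump(vec_params, fh)
with open(path) as fh:
    loaded = json.load(fh)

print(loaded)
```

The reloaded dictionary could then be applied with TfidfVectorizer(preprocessor=' '.join, tokenizer=None).set_params(**loaded), independently of the step name chosen in any later pipeline.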
