Python pickle: Updating a Python Pickle Object

Latest recommended article published on 2022-05-13 15:17:37

老饭骨

Tags: python pickle

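This article explains how to use Python's pickle module to save and update a machine learning classifier: load an existing pickle file, continue training with a new data set, and save the updated classifier again. It includes a complete code example covering how to set up the feature sets, load or create the classifier, and save it.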

I am working on a machine learning project that uses Python's pickle module.

My data set is too large to parse in a single execution, so I need to save the classifier object and update it in the next execution.

My question: when I run the program again with a new data set, will the already created pickle object be modified (or updated)? If not, how can I update the same pickle object every time I run the program?

save_classifier = open("naivebayes.pickle","wb")
pickle.dump(classifier,save_classifier)
save_classifier.close()

Solution

Unpickling your classifier object will re-create it in the same state that it was when you pickled it, so you can proceed to update it with fresh data from your data set. And at the end of the program run, you pickle the classifier again and save it to a file again. It's a Good Idea to not overwrite the same file but to keep a backup (or even better, a series of backups), in case you mess something up. That way, you can easily go back to a known good state of your classifier.
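The "series of backups" idea can be sketched as follows. This is only one possible approach: the helper name backup_with_timestamp, the ".bak" naming scheme, and the keep parameter are hypothetical and not part of the original answer.

```python
import os
import shutil
import time


def backup_with_timestamp(path, keep=5):
    """Copy `path` to a timestamped sibling file and prune old backups.

    Hypothetical helper: returns the backup's path, or None if `path`
    does not exist yet (for example on the very first run).
    """
    if not os.path.exists(path):
        return None

    # Example backup name: naivebayes.pickle.20240101-120000.bak
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = "{}.{}.bak".format(path, stamp)
    shutil.copy2(path, backup)

    # Collect all backups of this file; the timestamp format sorts
    # lexicographically, so sorted() puts the oldest first.
    folder = os.path.dirname(os.path.abspath(path))
    prefix = os.path.basename(path) + "."
    backups = sorted(
        name for name in os.listdir(folder)
        if name.startswith(prefix) and name.endswith(".bak")
    )

    # Delete everything except the newest `keep` backups.
    for old in backups[:max(0, len(backups) - keep)]:
        os.remove(os.path.join(folder, old))
    return backup
```

Calling backup_with_timestamp("naivebayes.pickle") just before saving would leave the last few known-good states on disk. Two backups taken within the same second share a name, so the later one overwrites the earlier.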

You should experiment with pickling, using a simple program and a simple object to pickle and unpickle, until you're totally confident with how this all works.
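As a starting point for that kind of experiment, here is a minimal sketch that uses a plain dictionary in place of a classifier. The file name and the data are made up for illustration. It shows that a pickle file is a snapshot: changing the unpickled object does not touch the file until you dump it again.

```python
import os
import pickle
import tempfile

# Throwaway location so the experiment does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "counts.pickle")

# Pickle a simple object.
counts = {"pos": 3, "neg": 1}
with open(path, "wb") as f:
    pickle.dump(counts, f)

# Unpickling re-creates the object in the state it was saved in.
with open(path, "rb") as f:
    restored = pickle.load(f)

# Modify the in-memory copy only.
restored["pos"] += 10

# The file still holds the old state ...
with open(path, "rb") as f:
    on_disk = pickle.load(f)

# ... until the object is pickled again.
with open(path, "wb") as f:
    pickle.dump(restored, f)

with open(path, "rb") as f:
    updated = pickle.load(f)

print(on_disk, updated)
```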

Here's a rough sketch of how to update the pickled classifier data.

import pickle
import os
from os.path import exists
# other imports required for nltk ...

picklename = "naivebayes.pickle"

# stuff to set up featuresets ...
featuresets = [(find_features(rev), category) for (rev, category) in documents]
numtrain = int(len(documents) * 90 / 100)
training_set = featuresets[:numtrain]
testing_set = featuresets[numtrain:]

# Load or create a classifier and apply training set to it
if exists(picklename):
    # Update existing classifier
    with open(picklename, "rb") as f:
        classifier = pickle.load(f)
    classifier.train(training_set)
else:
    # Create a brand new classifier
    classifier = nltk.NaiveBayesClassifier.train(training_set)

# Create backup
if exists(picklename):
    backupname = picklename + '.bak'
    if exists(backupname):
        os.remove(backupname)
    os.rename(picklename, backupname)

# Save
with open(picklename, "wb") as f:
    pickle.dump(classifier, f)

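The first time you run this program, it creates a new classifier, trains it on the data in training_set, and pickles it to "naivebayes.pickle". On each subsequent run, it loads the saved classifier and applies more training data to it. Note that this sketch assumes classifier.train() updates the loaded object in place; in NLTK, NaiveBayesClassifier.train is a classmethod that returns a new classifier, so verify that your classifier actually supports incremental training before relying on this pattern.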
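To try the load, update, back up, save cycle end to end without NLTK, here is a minimal sketch that uses a collections.Counter as a stand-in for a model that can be updated in place. The run_once function, the file name, and the sample data are hypothetical, and os.replace assumes Python 3.3 or later.

```python
import os
import pickle
import tempfile
from collections import Counter


def run_once(path, labeled_samples):
    """Simulate one program run: load or create, update, back up, save."""
    if os.path.exists(path):
        # Later runs: restore the state saved by the previous run.
        with open(path, "rb") as f:
            model = pickle.load(f)
    else:
        # First run: start from an empty model.
        model = Counter()

    # "Train" by counting labels; Counter.update mutates in place.
    model.update(label for _features, label in labeled_samples)

    # Keep the previous file as a backup before overwriting it.
    if os.path.exists(path):
        os.replace(path, path + ".bak")

    with open(path, "wb") as f:
        pickle.dump(model, f)
    return model


path = os.path.join(tempfile.mkdtemp(), "model.pickle")
first = run_once(path, [({"w": 1}, "pos"), ({"w": 2}, "neg")])
second = run_once(path, [({"w": 3}, "pos")])
print(second)
```

After the second call, the saved model reflects both batches, and the ".bak" file holds the state from the end of the first run.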

BTW, if you are doing this in Python 2 you should use the much faster cPickle module; you can do that by replacing

import pickle

with

import cPickle as pickle
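If the same script has to run under both Python 2 and Python 3, a common pattern is to try the cPickle import and fall back to pickle. In Python 3 there is no cPickle module; the standard pickle module already uses its C implementation when available. A small sketch with an illustrative round trip:

```python
try:
    import cPickle as pickle  # Python 2: faster C implementation
except ImportError:
    import pickle  # Python 3: C accelerator is used automatically

# Round trip through bytes to confirm the chosen module works.
data = pickle.loads(pickle.dumps({"a": 1}, protocol=pickle.HIGHEST_PROTOCOL))
print(data)
```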