Machine learning methods included in the scikit-learn library
Official site: scikit-learn: machine learning in Python - scikit-learn 0.21.3 documentation
https://scikit-learn.org/stable/
The methods used in this example are:
classifiers = {'NB':naive_bayes_classifier,
'KNN':knn_classifier,
'LR':logistic_regression_classifier,
'RF':random_forest_classifier,
'DT':decision_tree_classifier,
'SVM':svm_classifier,
'SVMCV':svm_cross_validation,
'GBDT':gradient_boosting_classifier
}
1 Loading the training and test data
Annotated core code
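# Open the file with gzip in "rb" (read bytes) mode
f = gzip.open(data_file, "rb")
# Use pickle to deserialize the byte stream into in-memory objects
train, val, test = pickle.load(f, encoding='bytes')
# Get the number of training samples and the feature dimension
num_train, num_feat = train_x.shape
# Deduplicate the training labels to count the classes; exactly 2 means a binary classification problem
is_binary_class = (len(np.unique(train_y)) == 2)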
Python data storage: using the pickle module - 风-fmgao - cnblogs
https://www.cnblogs.com/fmgao-technology/p/9078918.html
The complete code is as follows:
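# pickle serializes Python objects to a byte stream and restores them
import pickle
# gzip reads and writes gzip-compressed files (named after the Unix gzip tool)
import gzip
import numpy as np
# Path to the dataset, which is left compressed
data_file = "mnist.pkl.gz"
# Start reading the data
print('reading training and testing data...')
# Open the data file with gzip
f = gzip.open(data_file, "rb")
# Decode Python 2 strings as raw bytes rather than ASCII or UTF-8; this single call loads the training, validation, and test sets
train, val, test = pickle.load(f, encoding='bytes')
f.close()
# The first element holds the features
train_x = train[0]
# The second element holds the labels
train_y = train[1]
test_x = test[0]
test_y = test[1]
# The shape is (50000, 784): 50000 rows and 784 columns, where 784 is the feature dimension
num_train, num_feat = train_x.shape
# The shape is (10000, 784)
num_test, num_feat = test_x.shape
# np.unique(train_y) returns the distinct labels in the training set, which are the digits 0-9
# If there were only 2 distinct labels, this would be a binary classification problem
is_binary_class = (len(np.unique(train_y)) == 2)
print('******************** Data Info *********************')
print('#training data: %d, #testing_data: %d, dimension: %d' % (num_train, num_test, num_feat))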
Output:
reading training and testing data...
******************** Data Info *********************
#training data: 50000, #testing_data: 10000, dimension: 784
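To make the on-disk layout that the loading code expects more concrete, here is a minimal round-trip sketch. It writes a small synthetic (train, val, test) tuple to a gzip-compressed pickle and reads it back. The helper names save_splits, load_splits, and make_split, the file name toy.pkl.gz, and the random data are all assumptions for illustration; none of them come from the original code or from the MNIST file.

```python
import gzip
import os
import pickle
import tempfile

import numpy as np


def save_splits(path, splits):
    """Pickle a (train, val, test) tuple into a gzip-compressed file."""
    with gzip.open(path, "wb") as f:
        pickle.dump(splits, f)


def load_splits(path):
    """Read the tuple back; encoding='bytes' mirrors the article's call."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f, encoding="bytes")


def make_split(rng, n_samples, n_features=784):
    """Build one hypothetical (features, labels) pair shaped like MNIST."""
    x = rng.random((n_samples, n_features), dtype=np.float32)
    y = rng.integers(0, 10, size=n_samples)
    return x, y


rng = np.random.default_rng(0)
splits = (make_split(rng, 50), make_split(rng, 10), make_split(rng, 10))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl.gz")
    save_splits(path, splits)
    train, val, test = load_splits(path)

# Same unpacking as in the article: element 0 is features, element 1 is labels
train_x, train_y = train
num_train, num_feat = train_x.shape
print('#training data: %d, dimension: %d' % (num_train, num_feat))
```

Using `with` closes the file even if unpickling fails, which the explicit f.close() in the article's code does not guarantee.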
2 Model training, prediction, and performance report
Key code walkthrough
# Measure the elapsed time
time_cost = time.time() - start_time
# Choose the model
model = MultinomialNB(alpha=0.01)
# Train the model (on the training set)
model.fit(train_x, train_y)
# Predict (on the test set)
predict = model.predict(test_x)
# Report model performance
accuracy = metrics.accuracy_score(test_y, predict)
metrics is the scikit-learn module that provides performance metrics for machine learning methods.
Summary of common machine learning performance metrics - Prince1994 - cnblogs
https://www.cnblogs.com/princecoding/p/6714216.html
Notes on using scikit-learn's naive Bayes classes - 刘建平Pinard - cnblogs
https://www.cnblogs.com/pinard/p/6074222.html
Taking naive Bayes as an example, the code is as follows:
import time
from sklearn import metrics
# Start timing
start_time = time.time()
# Train a model with a naive Bayes classifier
# Import the multinomial naive Bayes model from scikit-learn
from sklearn.naive_bayes import MultinomialNB
# Create the model; alpha is the additive (Laplace/Lidstone) smoothing parameter, not a learning rate
model = MultinomialNB(alpha=0.01)
# Train
model.fit(train_x, train_y)
# Print the time spent training the model
print('training took %fs!' % (time.time() - start_time))
# Predict on the test set
predict = model.predict(test_x)
# For a binary problem, also compute precision and recall
if is_binary_class:
    precision = metrics.precision_score(test_y, predict)
    recall = metrics.recall_score(test_y, predict)
    print('precision: %.2f%%, recall: %.2f%%' % (100 * precision, 100 * recall))
# Compute the model's accuracy on the test set
accuracy = metrics.accuracy_score(test_y, predict)
print('accuracy: %.2f%%' % (100 * accuracy))
Output:
training took 0.244317s!
accuracy: 83.69%
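The code above reports precision and recall only for binary problems, because precision_score and recall_score default to average='binary'. For a multiclass task such as MNIST, one option is to pass an averaging mode. The sketch below uses made-up labels (not MNIST results), and the choice of average='macro' is an assumption about what a reader might want; the original code does not do this.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted labels for a 3-class problem
test_y = np.array([0, 0, 1, 1, 2, 2])
predict = np.array([0, 1, 1, 1, 2, 0])

# Accuracy works the same way for binary and multiclass labels
accuracy = metrics.accuracy_score(test_y, predict)
# 'macro' computes the metric per class and takes the unweighted mean
precision = metrics.precision_score(test_y, predict, average="macro")
recall = metrics.recall_score(test_y, predict, average="macro")

print('accuracy: %.2f%%' % (100 * accuracy))
print('precision: %.2f%%, recall: %.2f%%' % (100 * precision, 100 * recall))
```

Macro averaging treats every class equally regardless of how many samples it has; average='weighted' would instead weight each class by its support.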
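Things to watch for when moving from Python 2 to Python 3
The original reference code was written for Python 2, and I ran into quite a few pitfalls when running it under Python 3. They are summarized below:
print must be changed to print()
import cPickle must be changed to import pickle
reload(sys) must be changed to importlib.reload(sys), which means adding import importlib alongside import sys
sys.setdefaultencoding('utf8') should simply be removed, because UTF-8 is already the default encoding in Python 3 and the function no longer exists there (once it is gone, the reload(sys) call is not needed either)
train, val, test = pickle.load(f) must be changed to train, val, test = pickle.load(f, encoding='bytes')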
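Why pickle.load needs the encoding argument can be shown without the MNIST file. The sketch below uses a hand-written protocol 0 payload that mimics how Python 2 pickled an 8-bit str containing a non-ASCII byte. The payload and the variable names are illustrative assumptions.

```python
import pickle

# Hand-written protocol 0 pickle mimicking the Python 2 str 'caf\xe9'
# (opcode S = quoted 8-bit string, '.' = STOP)
PY2_STR_PICKLE = b"S'caf\\xe9'\n."

# The default encoding is ASCII, so the non-ASCII byte cannot be decoded
try:
    pickle.loads(PY2_STR_PICKLE)
    default_ok = True
except UnicodeDecodeError:
    default_ok = False

# encoding='bytes' keeps Python 2 str objects as raw bytes
as_bytes = pickle.loads(PY2_STR_PICKLE, encoding="bytes")
# encoding='latin1' maps each byte to one code point instead
as_text = pickle.loads(PY2_STR_PICKLE, encoding="latin1")

print(default_ok, as_bytes, as_text)
```

One consequence of encoding='bytes' is that any Python 2 str used as a dict key also comes back as bytes, so lookups must use keys like b'name' rather than 'name'.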
Appendix: complete code and run results
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 imports
# import sys
# import cPickle as pickle
# reload(sys)
# sys.setdefaultencoding('utf8')
import os
# Timing
import time
# Performance metrics
from sklearn import metrics
import numpy as np
# Python object serialization
import pickle
import importlib, sys
importlib.reload(sys)
# Multinomial Naive Bayes Classifier
def naive_bayes_classifier(train_x, train_y):
    from sklearn.naive_bayes import MultinomialNB
    model = MultinomialNB(alpha=0.01)
    model.fit(train_x, train_y)
    return model
# KNN Classifier
def knn_classifier(train_x, train_y):
    from sklearn.neighbors import KNeighborsClassifier
    model = KNeighborsClassifier()
    model.fit(train_x, train_y)
    return model
# Logistic Regression Classifier
def logistic_regression_classifier(train_x, train_y):
    from sklearn.linear_model import LogisticRegression
    model = LogisticRegression(penalty='l2')
    model.fit(train_x, train_y)
    return model
# Random Forest Classifier
def random_forest_classifier(train_x, train_y):
    # Ensemble learning
    from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier(n_estimators=8)
    model.fit(train_x, train_y)
    return model
# Decision Tree Classifier
def decision_tree_classifier(train_x, train_y):
    from sklearn import tree
    model = tree.DecisionTreeClassifier()
    model.fit(train_x, train_y)
    return model
# GBDT(Gradient Boosting Decision Tree) Classifier
def gradient_boosting_classifier(train_x, train_y):
    from sklearn.ensemble import GradientBoostingClassifier
    model = GradientBoostingClassifier(n_estimators=200)
    model.fit(train_x, train_y)
    return model
# SVM Classifier
def svm_classifier(train_x, train_y):
    from sklearn.svm import SVC
    model = SVC(kernel='rbf', probability=True)
    model.fit(train_x, train_y)
    return model
# SVM Classifier using cross validation
def svm_cross_validation(train_x, train_y):
    # sklearn.grid_search was removed in scikit-learn 0.20; GridSearchCV now lives in sklearn.model_selection
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC
    model = SVC(kernel='rbf', probability=True)
    param_grid = {'C': [1e-3, 1e-2, 1e-1, 1, 10, 100, 1000], 'gamma': [0.001, 0.0001]}
    grid_search = GridSearchCV(model, param_grid, n_jobs=1, verbose=1)
    grid_search.fit(train_x, train_y)
    best_parameters = grid_search.best_estimator_.get_params()
    for para, val in best_parameters.items():
        print(para, val)
    # Retrain on the full training set with the best parameters found
    model = SVC(kernel='rbf', C=best_parameters['C'], gamma=best_parameters['gamma'], probability=True)
    model.fit(train_x, train_y)
    return model
def read_data(data_file):
    # gzip can read the compressed file directly
    import gzip
    # Open the data file with gzip
    f = gzip.open(data_file, "rb")
    # train, val, test = pickle.load(f)
    # Decode Python 2 strings as bytes; this single call loads the training, validation, and test sets
    train, val, test = pickle.load(f, encoding='bytes')
    f.close()
    train_x = train[0]
    train_y = train[1]
    test_x = test[0]
    test_y = test[1]
    # Return the training and test sets
    return train_x, train_y, test_x, test_y
if __name__ == '__main__':
    # Load the dataset; note that the file is left compressed
    data_file = "mnist.pkl.gz"
    thresh = 0.5
    model_save_file = None
    model_save = {}
    test_classifiers = ['NB', 'KNN', 'LR', 'RF', 'DT', 'SVM', 'GBDT']
    classifiers = {'NB': naive_bayes_classifier,
                   'KNN': knn_classifier,
                   'LR': logistic_regression_classifier,
                   'RF': random_forest_classifier,
                   'DT': decision_tree_classifier,
                   'SVM': svm_classifier,
                   'SVMCV': svm_cross_validation,
                   'GBDT': gradient_boosting_classifier
                   }
    print('reading training and testing data...')
    train_x, train_y, test_x, test_y = read_data(data_file)
    num_train, num_feat = train_x.shape
    num_test, num_feat = test_x.shape
    is_binary_class = (len(np.unique(train_y)) == 2)
    print('******************** Data Info *********************')
    print('#training data: %d, #testing_data: %d, dimension: %d' % (num_train, num_test, num_feat))
    # Loop over each classifier
    for classifier in test_classifiers:
        print('******************* %s ********************' % classifier)
        # Start timing
        start_time = time.time()
        # Train the model by looking up the classifier's function and calling it
        model = classifiers[classifier](train_x, train_y)
        # Print the time spent training the model
        print('training took %fs!' % (time.time() - start_time))
        # Predict on the test set
        predict = model.predict(test_x)
        if model_save_file is not None:
            model_save[classifier] = model
        # For a binary problem, also compute precision and recall
        if is_binary_class:
            precision = metrics.precision_score(test_y, predict)
            recall = metrics.recall_score(test_y, predict)
            print('precision: %.2f%%, recall: %.2f%%' % (100 * precision, 100 * recall))
        # Compute the accuracy
        accuracy = metrics.accuracy_score(test_y, predict)
        print('accuracy: %.2f%%' % (100 * accuracy))
    if model_save_file is not None:
        with open(model_save_file, 'wb') as f:
            pickle.dump(model_save, f)
This experiment uses the MNIST handwritten digit dataset: http://deeplearning.net/data/mnist/mnist.pkl.gz, with 50,000 training samples and 10,000 test samples.
Run results:
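The svm_cross_validation function is defined in the listing above but never runs, since 'SVMCV' is not in test_classifiers. As a rough sketch of what such a search does, the following uses GridSearchCV from sklearn.model_selection on scikit-learn's small built-in digits dataset rather than MNIST. The dataset, the 600-sample subset, the reduced grid, and cv=3 are all assumptions chosen to keep the run short.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Small 8x8 digits dataset bundled with scikit-learn (not MNIST)
digits = load_digits()
train_x, test_x, train_y, test_y = train_test_split(
    digits.data[:600], digits.target[:600], test_size=0.25, random_state=0)

# Reduced version of the article's grid to keep the search quick
param_grid = {'C': [1, 10], 'gamma': [0.001, 0.0001]}
grid_search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=3, n_jobs=1)
grid_search.fit(train_x, train_y)

# With the default refit=True, best_estimator_ is already retrained on all of train_x
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(test_x, test_y)
print(grid_search.best_params_)
print('accuracy: %.2f%%' % (100 * test_accuracy))
```

Because refit defaults to True, best_estimator_ can be used directly, so a separate retraining step with the best parameters is optional.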
reading training and testing data...
******************** Data Info *********************
#training data: 50000, #testing_data: 10000, dimension: 784
******************* NB ********************
training took 0.295175s!
accuracy: 83.69%
******************* KNN ********************
training took 15.932032s!
accuracy: 96.64%
******************* LR ********************
training took 52.029990s!
accuracy: 91.99%
******************* RF ********************
training took 2.521217s!
accuracy: 93.85%
******************* DT ********************
training took 16.277198s!
accuracy: 87.45%
******************* SVM ********************
training took 2627.654293s!
accuracy: 94.35%
******************* GBDT ********************
training took 4673.062296s!
accuracy: 96.23%
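The later models take an extremely long time to train: SVM needed about 44 minutes (2627 s) and GBDT about 78 minutes (4673 s). The machine has an i5-9400 CPU at 2.9 GHz (about 50% utilization) and 16 GB of RAM (about 50% in use).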
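Given run times like these, it can help to time the same loop on a small dataset before committing to the full run. The sketch below is one way to do that, under stated assumptions: it uses scikit-learn's built-in 8x8 digits dataset as a stand-in for MNIST, only two of the classifiers, and the hypothetical names quick_classifiers and timings. It also times prediction separately, which the article's "training took" figure does not include.

```python
import time

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

# Small stand-in dataset (an assumption); a slice of MNIST would work the same way
digits = load_digits()
train_x, test_x, train_y, test_y = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0)

# Same name-to-constructor idea as the article's classifiers dict
quick_classifiers = {
    'NB': lambda: MultinomialNB(alpha=0.01),
    'KNN': lambda: KNeighborsClassifier(),
}

timings = {}
for name, make_model in quick_classifiers.items():
    start = time.perf_counter()
    model = make_model().fit(train_x, train_y)
    train_seconds = time.perf_counter() - start

    start = time.perf_counter()
    accuracy = model.score(test_x, test_y)
    predict_seconds = time.perf_counter() - start

    timings[name] = (train_seconds, predict_seconds, accuracy)
    print('%s: train %fs, predict %fs, accuracy %.2f%%'
          % (name, train_seconds, predict_seconds, 100 * accuracy))
```

Timing prediction separately matters for KNN in particular, since most of its work happens when it searches for neighbors at prediction time rather than during fit.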
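Reference blogs and sample download
Hands-on with scikit-learn, the Python machine learning library - zouxy09's column - CSDN Blog
Quick start with sklearn, Python's machine learning library - Jianshu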
https://www.jianshu.com/p/b5eb165ac2c2
scikit-learn: machine learning in Python — scikit-learn 0.21.3 documentation
https://scikit-learn.org/stable/
MNIST handwritten digit database, Yann LeCun, Corinna Cortes and Chris Burges
http://yann.lecun.com/exdb/mnist/
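Python data storage: using the pickle module - 风-fmgao - cnblogs
https://www.cnblogs.com/fmgao-technology/p/9078918.html
Summary of common machine learning performance metrics - Prince1994 - cnblogs
https://www.cnblogs.com/princecoding/p/6714216.html
Notes on using scikit-learn's naive Bayes classes - 刘建平Pinard - cnblogs
https://www.cnblogs.com/pinard/p/6074222.html
Sample download
This experiment uses the MNIST handwritten digit dataset: http://deeplearning.net/data/mnist/mnist.pkl.gz. It has 50,000 training samples and 10,000 test samples.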