Machine Learning Notes, Part 3: K-Nearest Neighbors and Hyperparameter Tuning [Part 2]
Hyperparameter Grid Search
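The previous chapter walked through searching for each hyperparameter by hand. In practice, sklearn already provides a GridSearchCV class that makes it easy to search for the best hyperparameters.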
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
digits = datasets.load_digits()
X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
np.__version__
'1.14.5'
- Put the ranges of the hyperparameters to search into a list of dictionaries; the syntax resembles Sublime's settings files
from sklearn.neighbors import KNeighborsClassifier
param_grid=[
{
'weights':['uniform'],
'n_neighbors':[i for i in range(1,11)]
},
{
'weights':['distance'],
'n_neighbors':[i for i in range(1,11)],
'p':[i for i in range(1,6)]
}
]
knn_clf = KNeighborsClassifier()
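To see how many hyperparameter combinations this grid expands to, one option is scikit-learn's `ParameterGrid` helper, which enumerates the combinations of such a grid. The sketch below rebuilds the same `param_grid`; the names `candidates` and `n_candidates` are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above: the two sub-grids are expanded independently
param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': list(range(1, 11)),
    },
    {
        'weights': ['distance'],
        'n_neighbors': list(range(1, 11)),
        'p': list(range(1, 6)),
    },
]

candidates = list(ParameterGrid(param_grid))  # one dict per combination
n_candidates = len(candidates)

# 10 combinations from the first sub-grid + 10 * 5 from the second
print(n_candidates)  # 60
```

Splitting the grid into two dictionaries keeps `p` out of the `uniform` sub-grid, so the search does not spend time on combinations the author chose not to explore.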
- Create the grid search object, passing in the model and the hyperparameter search ranges defined above
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(knn_clf,param_grid)
%%time
grid_search.fit(X_train,y_train)
Wall time: 1min 1s
GridSearchCV(cv=None, error_score=nan,
estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
metric='minkowski',
metric_params=None, n_jobs=None,
n_neighbors=5, p=2,
weights='uniform'),
iid='deprecated', n_jobs=None,
param_grid=[{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'weights': ['uniform']},
{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'p': [1, 2, 3, 4, 5], 'weights': ['distance']}],
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring=None, verbose=0)
- Inspect the best classifier (estimator) found by the search
grid_search.best_estimator_
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=1, p=2,
weights='uniform')
grid_search.best_score_ # mean cross-validated accuracy of the best hyperparameter combination
0.9860820751064653
grid_search.best_params_ # best hyperparameters
{'n_neighbors': 1, 'weights': 'uniform'}
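Beyond the single best result, the fitted search object also stores the score of every candidate in its `cv_results_` attribute. The sketch below uses a deliberately smaller grid (an assumption made only to keep the run short) and ranks the candidates by score; `small_grid`, `search`, and `ranked` are illustrative names, not part of the original notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=666)

# Smaller grid than in the article, chosen only so the sketch runs quickly
small_grid = {'weights': ['uniform', 'distance'], 'n_neighbors': [1, 3, 5]}

search = GridSearchCV(KNeighborsClassifier(), small_grid)
search.fit(X_train, y_train)

# cv_results_ holds one entry per candidate; mean_test_score is the
# cross-validated accuracy averaged over the folds
results = search.cv_results_
order = np.argsort(results['mean_test_score'])[::-1]
ranked = [(results['params'][i], results['mean_test_score'][i]) for i in order]

for params, score in ranked:
    print(params, round(score, 4))
```

Looking at the full ranking shows whether the best candidate wins by a clear margin or whether several settings perform almost identically.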
- Retrieve the classifier model that corresponds to these best parameters
knn_clf = grid_search.best_estimator_
knn_clf.predict(X_test)
array([8, 1, 3, 4, 4, 0, 7, 0, 8, 0, 4, 6, 1, 1, 2, 0, 1, 6, 7, 3, 3, 6,
3, 2, 9, 4, 0, 2, 0, 3, 0, 8, 7, 2, 3, 5, 1, 3, 1, 5, 8, 6, 2, 6,
3, 1, 3, 0, 0, 4, 9, 9, 2, 8, 7, 0, 5, 4, 0, 9, 5, 5, 9, 3, 4, 2,
8, 8, 7, 1, 4, 3, 0, 2, 7, 2, 1, 2, 4, 0, 9, 0, 6, 6, 2, 0, 0, 5,
4, 4, 3, 1, 3, 8, 6, 4, 4, 7, 5, 6, 8, 4, 8, 4, 6, 9, 7, 7, 0, 8,
8, 3, 9, 7, 1, 8, 4, 2, 7, 0, 0, 4, 9, 6, 7, 3, 4, 6, 4, 8, 4, 7,
2, 6, 5, 5, 8, 7, 2, 5, 5, 9, 7, 9, 3, 1, 9, 4, 4, 1, 5, 1, 6, 4,
4, 8, 1, 6, 2, 5, 2, 1, 4, 4, 3, 9, 4, 0, 6, 0, 8, 3, 8, 7, 3, 0,
3, 0, 5, 9, 2, 7, 1, 8, 1, 4, 3, 3, 7, 8, 2, 7, 2, 2, 8, 0, 5, 7,
6, 7, 3, 4, 7, 1, 7, 0, 9, 2, 8, 9, 3, 8, 9, 1, 1, 1, 9, 8, 8, 0,
3, 7, 3, 3, 4, 8, 2, 1, 8, 6, 0, 1, 7, 7, 5, 8, 3, 8, 7, 6, 8, 4,
2, 6, 2, 3, 7, 4, 9, 3, 5, 0, 6, 3, 8, 3, 3, 1, 4, 5, 3, 2, 5, 6,
8, 6, 9, 5, 5, 3, 6, 5, 9, 3, 7, 7, 0, 2, 4, 9, 9, 9, 2, 5, 6, 1,
9, 6, 9, 7, 7, 4, 5, 0, 0, 5, 3, 8, 4, 4, 3, 2, 5, 3, 2, 2, 3, 0,
9, 8, 2, 1, 4, 0, 6, 2, 8, 0, 6, 4, 9, 9, 8, 3, 9, 8, 6, 3, 2, 7,
9, 4, 2, 7, 5, 1, 1, 6, 1, 0, 4, 9, 2, 9, 0, 3, 3, 0, 7, 4, 8, 5,
9, 5, 9, 5, 0, 7, 9, 8])
knn_clf.score(X_test,y_test)
0.9833333333333333
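Because `refit=True` is the default (it appears in the printed `GridSearchCV` representation above), the best estimator is refitted on the whole training set, and the search object itself can be used for prediction and scoring. A minimal sketch with a reduced grid (an assumption to keep the run short; `search` and the two score variables are illustrative names):

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=666)

search = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 3, 5]})
search.fit(X_train, y_train)

# With refit=True, predict/score on the search object delegate to
# best_estimator_, which was refitted on all of X_train
score_via_search = search.score(X_test, y_test)
score_via_estimator = search.best_estimator_.score(X_test, y_test)
print(score_via_search, score_via_estimator)
```

The test-set score is computed on data the search never saw, so it can differ slightly from `best_score_`, which is an average over cross-validation folds of the training set.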
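- Other constructor parameters of GridSearchCV:
- n_jobs: number of CPU cores (processes) used to run the hyperparameter search in parallel (int)
- verbose: controls how much of the search progress is printed; larger values print more detail (int)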
%%time
grid_search = GridSearchCV(knn_clf, param_grid, n_jobs=-1, verbose=4) # -1 means use all CPU cores of the current machine
grid_search.fit(X_train,y_train)
Fitting 5 folds for each of 60 candidates, totalling 300 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 9 tasks | elapsed: 0.9s
[Parallel(n_jobs=-1)]: Done 108 tasks | elapsed: 3.5s
Wall time: 13.2 s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 13.0s finished
GridSearchCV(cv=None, error_score=nan,
estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
metric='minkowski',
metric_params=None, n_jobs=None,
n_neighbors=1, p=2,
weights='uniform'),
iid='deprecated', n_jobs=-1,
param_grid=[{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'weights': ['uniform']},
{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'p': [1, 2, 3, 4, 5], 'weights': ['distance']}],
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring=None, verbose=4)
Other choices for metric (distance)
identifier | class name | args | distance function |
---|---|---|---|
“euclidean” | EuclideanDistance | | sqrt(sum((x - y)^2)) |
“manhattan” | ManhattanDistance | | sum(\|x - y\|) |
“chebyshev” | ChebyshevDistance | | max(\|x - y\|) |
“minkowski” | MinkowskiDistance |