Machine Learning Series Notes 3: The k-Nearest Neighbors Algorithm and Hyperparameter Tuning [Part 2]

This article is the third installment of the machine learning series notes. It focuses on hyperparameter tuning for the k-nearest neighbors algorithm, covering hyperparameter search with Grid Search and two approaches to feature scaling: min-max normalization and mean-variance (standard score) normalization. It stresses the importance of data preprocessing and shows how to standardize data with sklearn's StandardScaler. It also discusses the strengths and weaknesses of k-NN: it handles classification and regression problems effectively, but is sensitive to dataset size and dimensionality, and the article outlines optimization strategies and remedies.



Grid Search for Hyperparameters

The previous chapter briefly walked through searching for each hyperparameter by hand. In fact, sklearn already provides a GridSearchCV class that makes it convenient to search for the best hyperparameters.

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X = digits.data
y = digits.target
# Hold out 20% of the samples as a test set; fix random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
np.__version__
'1.14.5'
  • Put the ranges of the hyperparameters to search into a list of dictionaries; the syntax is similar to Sublime Text's settings files.
from sklearn.neighbors import KNeighborsClassifier

# Two sub-grids: the first searches only n_neighbors with uniform weights;
# the second additionally searches the Minkowski exponent p together with distance weighting
param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 11)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)],
        'p': [i for i in range(1, 6)]
    }
]

knn_clf = KNeighborsClassifier()
  • Create the grid-search object, passing in the model and the hyperparameter search space defined above.
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(knn_clf, param_grid)
%%time
grid_search.fit(X_train, y_train)
Wall time: 1min 1s

GridSearchCV(cv=None, error_score=nan,
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid='deprecated', n_jobs=None,
             param_grid=[{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                          'weights': ['uniform']},
                         {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                          'p': [1, 2, 3, 4, 5], 'weights': ['distance']}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
  • Inspect the best classifier (estimator) found by the search.
grid_search.best_estimator_ 
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')
grid_search.best_score_ # mean cross-validation accuracy of the best hyperparameter combination
0.9860820751064653
grid_search.best_params_ # the best hyperparameter combination found
{'n_neighbors': 1, 'weights': 'uniform'}
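Beyond best_score_ and best_params_, it can be useful to look at the score of every candidate. The snippet below is a small sketch added here for illustration (it is not part of the original notebook); it assumes pandas is available, and cv_results_ is a standard attribute of the fitted GridSearchCV object.

import pandas as pd  # assumption: pandas is installed in the environment

# cv_results_ records every candidate tried by the grid search;
# sort by rank_test_score to see the top combinations and their mean CV accuracy
results = pd.DataFrame(grid_search.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']]
      .sort_values('rank_test_score')
      .head())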
  • Retrieve the classifier model configured with these best hyperparameters.
knn_clf = grid_search.best_estimator_
knn_clf.predict(X_test)
array([8, 1, 3, 4, 4, 0, 7, 0, 8, 0, 4, 6, 1, 1, 2, 0, 1, 6, 7, 3, 3, 6,
       3, 2, 9, 4, 0, 2, 0, 3, 0, 8, 7, 2, 3, 5, 1, 3, 1, 5, 8, 6, 2, 6,
       3, 1, 3, 0, 0, 4, 9, 9, 2, 8, 7, 0, 5, 4, 0, 9, 5, 5, 9, 3, 4, 2,
       8, 8, 7, 1, 4, 3, 0, 2, 7, 2, 1, 2, 4, 0, 9, 0, 6, 6, 2, 0, 0, 5,
       4, 4, 3, 1, 3, 8, 6, 4, 4, 7, 5, 6, 8, 4, 8, 4, 6, 9, 7, 7, 0, 8,
       8, 3, 9, 7, 1, 8, 4, 2, 7, 0, 0, 4, 9, 6, 7, 3, 4, 6, 4, 8, 4, 7,
       2, 6, 5, 5, 8, 7, 2, 5, 5, 9, 7, 9, 3, 1, 9, 4, 4, 1, 5, 1, 6, 4,
       4, 8, 1, 6, 2, 5, 2, 1, 4, 4, 3, 9, 4, 0, 6, 0, 8, 3, 8, 7, 3, 0,
       3, 0, 5, 9, 2, 7, 1, 8, 1, 4, 3, 3, 7, 8, 2, 7, 2, 2, 8, 0, 5, 7,
       6, 7, 3, 4, 7, 1, 7, 0, 9, 2, 8, 9, 3, 8, 9, 1, 1, 1, 9, 8, 8, 0,
       3, 7, 3, 3, 4, 8, 2, 1, 8, 6, 0, 1, 7, 7, 5, 8, 3, 8, 7, 6, 8, 4,
       2, 6, 2, 3, 7, 4, 9, 3, 5, 0, 6, 3, 8, 3, 3, 1, 4, 5, 3, 2, 5, 6,
       8, 6, 9, 5, 5, 3, 6, 5, 9, 3, 7, 7, 0, 2, 4, 9, 9, 9, 2, 5, 6, 1,
       9, 6, 9, 7, 7, 4, 5, 0, 0, 5, 3, 8, 4, 4, 3, 2, 5, 3, 2, 2, 3, 0,
       9, 8, 2, 1, 4, 0, 6, 2, 8, 0, 6, 4, 9, 9, 8, 3, 9, 8, 6, 3, 2, 7,
       9, 4, 2, 7, 5, 1, 1, 6, 1, 0, 4, 9, 2, 9, 0, 3, 3, 0, 7, 4, 8, 5,
       9, 5, 9, 5, 0, 7, 9, 8])
knn_clf.score(X_test, y_test)
0.9833333333333333
  • Other constructor parameters of GridSearchCV:
    • n_jobs: the number of CPU cores (processes) of the machine to use when searching for the best hyperparameters (int); -1 means all available cores
    • verbose: how much progress information to print during the search (int)

As the verbose output below shows, the grid contains 60 candidates in total (10 from the first sub-grid plus 10 × 5 = 50 from the second), and with the default 5-fold cross-validation that amounts to 300 fits.
%%time
grid_search = GridSearchCV(knn_clf, param_grid, n_jobs=-1, verbose=4) # n_jobs=-1 uses all cores on the current machine
grid_search.fit(X_train, y_train)
Fitting 5 folds for each of 60 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done 108 tasks      | elapsed:    3.5s


Wall time: 13.2 s


[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:   13.0s finished

GridSearchCV(cv=None, error_score=nan,
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=1, p=2,
                                            weights='uniform'),
             iid='deprecated', n_jobs=-1,
             param_grid=[{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                          'weights': ['uniform']},
                         {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                          'p': [1, 2, 3, 4, 5], 'weights': ['distance']}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=4)

Other Choices for metric (the Distance Function)

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html#sklearn.neighbors.DistanceMetric

identifier    class name          args   distance function
"euclidean"   EuclideanDistance          sqrt(sum((x - y)^2))
"manhattan"   ManhattanDistance          sum(|x - y|)
"chebyshev"   ChebyshevDistance          max(|x - y|)
"minkowski"   MinkowskiDistance   p      sum(|x - y|^p)^(1/p)
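Any of these metrics can be passed to KNeighborsClassifier through its metric parameter, or searched over with GridSearchCV. The snippet below is a minimal sketch added for illustration rather than part of the original notebook; it reuses the X_train/X_test split of the digits data created earlier.

from sklearn.neighbors import KNeighborsClassifier

# Use Manhattan distance explicitly instead of the default minkowski metric with p=2
knn_manhattan = KNeighborsClassifier(n_neighbors=3, metric='manhattan')
knn_manhattan.fit(X_train, y_train)
print(knn_manhattan.score(X_test, y_test))

# The metric itself can also be made part of the grid search, e.g.:
# param_grid = [{'n_neighbors': [1, 3, 5],
#                'metric': ['euclidean', 'manhattan', 'chebyshev']}]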