DataCamp Notes & Code: Supervised Learning with scikit-learn, Chapter 1: Classification


JinnyR


Category: datacamp. Tags: datacamp, sklearn, data science, python, machine learning

Article link: https://blog.youkuaiyun.com/u011292816/article/details/97013030


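This post shows how to perform k-nearest neighbors (k-NN) classification with the scikit-learn library. It first creates a k-NN classifier with 6 neighbors and applies it to the Congressional voting records dataset. It then discusses train/test splitting and the concepts of overfitting and underfitting. Finally, it uses the MNIST handwritten digit recognition dataset to illustrate a multi-class classification problem, practicing train/test splitting, model fitting, and accuracy evaluation.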


More raw data files and Jupyter Notebooks:
Github: https://github.com/JinnyR/Datacamp_DataScienceTrack_Python

Datacamp track: Data Scientist with Python - Course 21 (1)

Exercise

k-Nearest Neighbors: Fit

Having explored the Congressional voting records dataset, it is time now to build your first classifier. In this exercise, you will fit a k-Nearest Neighbors classifier to the voting dataset, which has once again been pre-loaded for you into a DataFrame df.

In the video, Hugo discussed the importance of ensuring your data adheres to the format required by the scikit-learn API. The features need to be in an array where each column is a feature and each row a different observation or data point - in this case, a Congressman’s voting record. The target needs to be a single column with the same number of observations as the feature data. We have done this for you in this exercise. Notice we named the feature array X and response variable y: This is in accordance with the common scikit-learn practice.
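As a rough illustration of that layout, the sketch below builds a tiny made-up DataFrame and checks the shapes of the resulting arrays. The names toy, X_toy, y_toy, vote_a and vote_b, and all the values, are hypothetical and are not taken from the voting dataset.

```python
import pandas as pd

# Hypothetical toy data: two vote columns plus the target column 'party'.
toy = pd.DataFrame({
    'party': ['democrat', 'republican', 'democrat', 'republican'],
    'vote_a': [1, 0, 1, 0],
    'vote_b': [0, 1, 1, 0],
})

# Features: every column except the target; one row per observation.
X_toy = toy.drop('party', axis=1).values
# Target: a single column with one entry per observation.
y_toy = toy['party'].values

print(X_toy.shape)  # (4, 2)
print(y_toy.shape)  # (4,)
```

The first dimension of both arrays is the number of observations, which is what lets the two be paired row by row when fitting.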

Your job is to create an instance of a k-NN classifier with 6 neighbors (by specifying the n_neighbors parameter) and then fit it to the data. The data has been pre-loaded into a DataFrame called df.

Instruction

Import KNeighborsClassifier from sklearn.neighbors.
Create arrays X and y for the features and the target variable. Here this has been done for you. Note the use of .drop() to drop the target variable 'party' from the feature array X as well as the use of the .values attribute to ensure X and y are NumPy arrays. Without using .values, X and y are a DataFrame and Series respectively; the scikit-learn API will accept them in this form also as long as they are of the right shape.
Instantiate a KNeighborsClassifier called knn with 6 neighbors by specifying the n_neighbors parameter.
Fit the classifier to the data using the .fit() method.

import pandas as pd

df = pd.read_csv('https://s3.amazonaws.com/assets.datacamp.com/production/course_1939/datasets/votes-ch1.csv')

# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier

# Create arrays for the features and the response variable
y = df['party'].values
X = df.drop('party', axis=1).values

# Create a k-NN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the data
knn.fit(X,y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=6, p=2,
           weights='uniform')
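The printed parameters include metric='minkowski' with p=2 (Euclidean distance) and weights='uniform' (every neighbor gets an equal vote). To make that idea concrete, here is a minimal NumPy sketch of a k-NN prediction for a single point. It is an illustration only, not scikit-learn's implementation: the function knn_predict_one and the arrays X_demo and y_demo are hypothetical, and ties are resolved however Counter happens to order them, which may differ from scikit-learn.

```python
from collections import Counter

import numpy as np


def knn_predict_one(X_train, y_train, x_new, k=3):
    """Predict a label for x_new by majority vote among its k nearest training points."""
    # Euclidean distance (Minkowski with p=2) from x_new to each training row.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Uniform weights: each neighbor contributes exactly one vote.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters labeled 'a' and 'b'.
X_demo = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_demo = np.array(['a', 'a', 'a', 'b', 'b', 'b'])

print(knn_predict_one(X_demo, y_demo, np.array([0.5, 0.5])))  # a
print(knn_predict_one(X_demo, y_demo, np.array([5.5, 5.5])))  # b
```

With n_neighbors=6, as in the exercise, the same vote would simply be taken over the six closest training points instead of three.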

Exercise

k-Nearest Neighbors: Predict

Having fit a k-NN classifier, you can now use it to predict the label of a new data point. However, there is no unlabeled data available since all of it was used to fit the model! You can still use the .predict() method on the X that was used to fit the model, but it
