More raw data files and Jupyter Notebooks
Github: https://github.com/JinnyR/Datacamp_DataScienceTrack_Python
Datacamp track: Data Scientist with Python - Course 21 (3)
Exercise
Metrics for classification
In Chapter 1, you evaluated the performance of your k-NN classifier based on its accuracy. However, as Andy discussed, accuracy is not always an informative metric. In this exercise, you will dive more deeply into evaluating the performance of binary classifiers by computing a confusion matrix and generating a classification report.
You may have noticed in the video that the classification report consisted of three rows, and an additional support column. The support gives the number of samples of the true response that lie in that class - so in the video example, the support was the number of Republicans or Democrats in the test set on which the classification report was computed. The precision, recall, and f1-score columns, then, gave the respective metrics for that particular class.
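To see why accuracy alone can mislead, here is a minimal sketch in plain Python. The labels are hypothetical toy data (95 negatives, 5 positives), not the diabetes dataset, and the helper names `accuracy`, `recall`, and `support` are made up for illustration. A classifier that always predicts the majority class scores high on accuracy yet never finds a positive sample.

```python
# Hypothetical toy labels (an assumption for illustration): 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A "classifier" that always predicts the majority class 0.
y_pred = [0] * 100


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def support(y_true, label):
    """Number of samples whose true class is `label`."""
    return sum(t == label for t in y_true)


def recall(y_true, y_pred, label):
    """Fraction of samples of true class `label` that were predicted as `label`."""
    n_true = support(y_true, label)
    hits = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    return hits / n_true if n_true else 0.0


print("accuracy:", accuracy(y_true, y_pred))
print("recall for class 1:", recall(y_true, y_pred, 1))
print("support:", {label: support(y_true, label) for label in (0, 1)})
```

The accuracy here is 0.95 even though recall for class 1 is 0, which is the kind of gap a per-class report with a support column makes visible.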
Here, you’ll work with the PIMA Indians dataset obtained from the UCI Machine Learning Repository. The goal is to predict whether or not a given female patient will contract diabetes based on features such as BMI, age, and number of pregnancies. Therefore, it is a binary classification problem. A target value of 0 indicates that the patient does not have diabetes, while a value of 1 indicates that the patient does have diabetes. As in Chapters 1 and 2, the dataset has been preprocessed to deal with missing values.
The dataset has been loaded into a DataFrame df, and the feature and target variable arrays X and y have been created for you. In addition, sklearn.model_selection.train_test_split and sklearn.neighbors.KNeighborsClassifier have already been imported.
Your job is to train a k-NN classifier on the data and evaluate its performance by generating a confusion matrix and classification report.
Instruction
- Import classification_report and confusion_matrix from sklearn.metrics.
- Create training and testing sets with 40% of the data used for testing. Use a random state of 42.
- Instantiate a k-NN classifier with 6 neighbors, fit it to the training data, and predict the labels of the test set.
- Compute and print the confusion matrix and classification report using the confusion_matrix() and classification_report() functions.
# modified/added by Jinny
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
df = pd.read_csv('https://s3.amazonaws.com/assets.datacamp.com/production/course_1939/datasets/diabetes.csv')
# Treat zeros in these columns as missing values (assign back instead of chained inplace replace)
for col in ['insulin', 'triceps', 'bmi']:
    df[col] = df[col].replace(0, np.nan)
df.iloc[:, 1:] = df.iloc[:, 1:].apply(lambda x: x.fillna(x.mean()))
y = df['diabetes']
X = df.drop('diabetes', axis=1)
# Import necessary modules
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
# Create training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
# Instantiate a k-NN classifier: knn
knn = KNeighborsClassifier(n_neighbors=6)
# Fit the classifier to the training data
knn.fit(X_train, y_train)
# Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)
# Generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
[[176  30]
 [ 52  50]]
              precision    recall  f1-score   support
           0       0.77      0.85      0.81       206
           1       0.62      0.49      0.55       102
   micro avg       0.73      0.73      0.73       308
   macro avg       0.70      0.67      0.68       308
weighted avg       0.72      0.73      0.72       308
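The per-class rows of the report can be recomputed by hand from the confusion matrix printed above. The sketch below assumes scikit-learn's convention that rows are true classes and columns are predicted classes; the helper name `per_class_metrics` is hypothetical.

```python
# Confusion matrix from the k-NN output above.
# Rows are true classes, columns are predicted classes.
cm = [[176, 30],
      [52, 50]]


def per_class_metrics(cm, k):
    """Return (precision, recall, f1, support) for class k."""
    n = len(cm)
    tp = cm[k][k]                                  # correctly predicted as k
    predicted_k = sum(cm[i][k] for i in range(n))  # column sum: all predicted as k
    support = sum(cm[k])                           # row sum: all truly k
    precision = tp / predicted_k if predicted_k else 0.0
    recall = tp / support if support else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1, support


for k in range(len(cm)):
    p, r, f1, s = per_class_metrics(cm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={s}")

# Overall accuracy: correct predictions (the diagonal) over all samples.
total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / total
print(f"accuracy={accuracy:.2f}")
```

For class 1, for example, precision is 50 / (30 + 50) and recall is 50 / (52 + 50), so the classifier misses roughly half of the patients who have diabetes.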
Exercise
Building a logistic regression model
Time to build your first logistic regression model! As Hugo showed in the video, scikit-learn makes it very easy to try different models, since the Train-Test-Split/Instantiate/Fit/Predict paradigm applies to all classifiers and regressors - which are known in scikit-learn as ‘estimators’. You’ll see this now for yourself as you train a logistic regression model on exactly the same data as in the previous exercise. Will it outperform k-NN? There’s only one way to find out!
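As a rough illustration of that shared paradigm, the sketch below runs two different estimators through the same split, fit, and predict steps. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the diabetes dataset), and the variable names `X_demo`, `y_demo`, and `scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Train-test split: 40% of the samples are held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42
)

# Two different models, handled by identical code below.
estimators = [
    KNeighborsClassifier(n_neighbors=6),
    LogisticRegression(solver="liblinear"),
]

scores = {}
for est in estimators:
    est.fit(X_tr, y_tr)          # fit on the training set
    y_hat = est.predict(X_te)    # predict labels for the test set
    scores[type(est).__name__] = float((y_hat == y_te).mean())

for name, acc in scores.items():
    print(f"{name}: test accuracy = {acc:.2f}")
```

Swapping in another classifier only means adding it to the `estimators` list; the loop body stays the same.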
The feature and target variable arrays X and y have been pre-loaded, and train_test_split has been imported for you from sklearn.model_selection.
Instruction
- Import:
  - LogisticRegression from sklearn.linear_model.
  - confusion_matrix and classification_report from sklearn.metrics.
- Create training and test sets with 40% (or 0.4) of the data used for testing. Use a random state of 42. This has been done for you.
- Instantiate a LogisticRegression classifier called logreg.
- Fit the classifier to the training data and predict the labels of the test set.
- Compute and print the confusion matrix and classification report. This has been done for you, so hit ‘Submit Answer’ to see how logistic regression compares to k-NN!
# Modified/Added by Jinny
import pandas as pd
import numpy as np
df = pd.read_csv('https://s3.amazonaws.com/assets.datacamp.com/production/course_1939/datasets/diabetes.csv')
# Treat zeros in these columns as missing values (assign back instead of chained inplace replace)
for col in ['insulin', 'triceps', 'bmi']:
    df[col] = df[col].replace(0, np.nan)
df.iloc[:, 1:] = df.iloc[:, 1:].apply(lambda x: x.fillna(x.mean()))
y = df['diabetes']
X = df.drop('diabetes', axis=1)
from sklearn.model_selection import train_test_split
# Import the necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state=42)
# Create the classifier: logreg
logreg = LogisticRegression(solver="liblinear")
# Fit the classifier to the training data
logreg.fit(X_train, y_train)
# Predict the labels of the test set: y_pred
y_pred = logreg.predict(X_test)
# Compute and print the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
[[176  30]
 [ 35  67]]
              precision    recall  f1-score   support
           0       0.83      0.85      0.84       206
           1       0.69      0.66      0.67       102
   micro avg       0.79      0.79      0.79       308
   macro avg       0.76      0.76      0.76       308
weighted avg       0.79      0.79      0.79       308
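To answer the question of whether logistic regression outperforms k-NN on this split, the two printed confusion matrices can be compared directly. This is a small sketch in plain Python; the helper name `summarize` and the dictionary keys are hypothetical, and the matrices are copied from the outputs of the two exercises.

```python
# Confusion matrices as printed in the two exercises
# (rows: true class, columns: predicted class).
results = {
    "k-NN (6 neighbors)": [[176, 30], [52, 50]],
    "logistic regression": [[176, 30], [35, 67]],
}


def summarize(cm):
    """Return overall accuracy and recall for the positive class (label 1)."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {"accuracy": (tn + tp) / total, "recall_1": tp / (tp + fn)}


summary = {name: summarize(cm) for name, cm in results.items()}
for name, s in summary.items():
    print(f"{name}: accuracy={s['accuracy']:.2f}, recall (class 1)={s['recall_1']:.2f}")
```

On this particular split, both models produce 30 false positives, but logistic regression identifies 67 of the 102 patients with diabetes compared with 50 for k-NN. A different random state or train/test split could change these numbers.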
Exercise