This is an assignment from the Stanford course (CS231n). Following the instructions for assignment1 in the course outline, download the dataset the experiment needs.
Contents
Choosing hyperparameters via cross-validation
Preface:
1. The usual first preprocessing step for kNN is to normalize the data; for an image, each pixel can be treated as one feature. Because pixel values are homogeneous (they all live on the same 0-255 scale) and do not follow widely different distributions, data normalization is not needed here.
2. The second preprocessing step when applying kNN in practice is dimensionality reduction, because kNN votes according to some distance metric, and distances become counter-intuitive in high dimensions; there is also a mathematical explanation for why kNN performs poorly there, see section 6 of the referenced article. The experiment here is only for learning kNN; in practice kNN is never used for image classification, see point 3 for why.
3. kNN's strength is that it is simple and intuitive. Its weaknesses: first, it spends very little time on training (see below: training merely caches the training set) but a lot of time on prediction (extended algorithms such as Approximate Nearest Neighbor (ANN) can reduce prediction time at the cost of accuracy; see FLANN), which is the opposite of what practice demands. Second, kNN does not work well on high-dimensional data.
4. You can start from a very simple example; see the minimal kNN sketch below.
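As a stand-in for that simple example, here is a minimal, self-contained kNN sketch on toy 2-D points; all names in it (knn_predict, X_train_toy, ...) are illustrative and not part of the assignment code:

import numpy as np

# Toy training set: four 2-D points with two class labels.
X_train_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_train_toy = np.array([0, 0, 1, 1])

def knn_predict(x, X, y, k=3):
    """Predict the label of a single point x by majority vote among
    its k nearest training points under Euclidean (L2) distance."""
    dists = np.sqrt(np.sum((X - x) ** 2, axis=1))  # distance to every training point
    nearest = np.argsort(dists)[:k]                # indices of the k closest points
    return np.bincount(y[nearest]).argmax()        # majority vote

print(knn_predict(np.array([0.9, 0.2]), X_train_toy, y_train_toy))  # -> 1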
Main code:
part1 Loading and preprocessing the data
from __future__ import print_function  # a __future__ import must precede all other statements
import random
import numpy as np
import matplotlib.pyplot as plt
from cs231n.data_utils import load_CIFAR10
"""
这一步只是进行配置
"""
# Run some setup code for this notebook.
# This is a bit of magic to make matplotlib figures appear inline in the notebook
# rather than in a new window.
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2
# Load the raw CIFAR-10 data.
cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'
# Cleaning up variables to prevent loading data multiple times (which may cause memory issue)
try:
    del X_train, y_train
    del X_test, y_test
    print('Clear previously loaded data.')
except NameError:
    pass
# a sketch of load_CIFAR10 from cs231n.data_utils is given at the end of part1
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)
# As a sanity check, we print out the size of the training and test data.
# print('Training data shape: ', X_train.shape) >> Training data shape: (50000, 32, 32, 3)
# print('Training labels shape: ', y_train.shape) >> Training labels shape: (50000,)
# print('Test data shape: ', X_test.shape) >> Test data shape: (10000, 32, 32, 3)
# print('Test labels shape: ', y_test.shape) >> Test labels shape: (10000,)
# Subsample the data for more efficient code execution in this exercise
num_training = 5000
mask = range(num_training)
# Either of the two indexing styles below selects the first 5000 rows. Note that this
# also works for higher-dimensional data: e.g. for data of shape (10000, 32, 32, 3),
# the code below yields shape (5000, 32, 32, 3)
#X_train = X_train[mask]
X_train = X_train[np.arange(num_training)]
y_train = y_train[mask]
#print(X_train.shape) >> (5000, 32, 32, 3)
num_test = 500
mask = list(range(num_test))
X_test = X_test[mask]
y_test = y_test[mask]
# Reshape the image data into rows
X_train = np.reshape(X_train, (X_train.shape[0], -1))
X_test = np.reshape(X_test, (X_test.shape[0], -1))
# print(X_train.shape, X_test.shape) >> (5000, 3072) (500, 3072)
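For reference, the load_CIFAR10 mentioned above can be sketched as follows. This assumes the standard CIFAR-10 python-batch layout (five pickled training batches plus one test batch, each a dict with 'data' and 'labels' keys); the course's actual cs231n/data_utils.py may differ in details:

import os
import pickle
import numpy as np

def load_CIFAR_batch(filename):
    """Load one CIFAR-10 batch as (10000, 32, 32, 3) floats plus a label vector."""
    with open(filename, 'rb') as f:
        datadict = pickle.load(f, encoding='latin1')
    X = datadict['data'].reshape(10000, 3, 32, 32).transpose(0, 2, 3, 1).astype('float')
    y = np.array(datadict['labels'])
    return X, y

def load_CIFAR10(root):
    """Concatenate the five training batches and load the test batch."""
    xs, ys = [], []
    for b in range(1, 6):
        X, y = load_CIFAR_batch(os.path.join(root, 'data_batch_%d' % b))
        xs.append(X)
        ys.append(y)
    X_test, y_test = load_CIFAR_batch(os.path.join(root, 'test_batch'))
    return np.concatenate(xs), np.concatenate(ys), X_test, y_test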
part2 kNN training and prediction
from cs231n.classifiers import KNearestNeighbor
# the Classifier simply remembers the data and does no further processing
# a sketch of the implementation is given after this block
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
# Test your implementation:
dists = classifier.compute_distances_two_loops(X_test)
print(dists.shape)
# We can visualize the distance matrix: each row is a single test example and
# its distances to training examples
#plt.imshow(dists, interpolation='none')
#plt.show()
# Now implement the function predict_labels and run the code below:
# We use k = 1 (which is Nearest Neighbor).
y_test_pred = classifier.predict_labels(dists, k=1)
# Compute and print the fraction of correctly predicted examples
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))
# Now repeat the prediction with k = 5:
y_test_pred = classifier.predict_labels(dists, k=5)
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))
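The KNearestNeighbor pieces used above can be sketched as follows, consistent with the calls in this section: train only memorizes the data, compute_distances_two_loops fills a (num_test, num_train) matrix of L2 distances, and predict_labels takes a majority vote among the k nearest neighbors. The course's actual cs231n/classifiers implementation may differ in details:

import numpy as np

class KNearestNeighbor(object):
    def train(self, X, y):
        # "Training" just caches the data -- this is why kNN training is so cheap.
        self.X_train = X
        self.y_train = y

    def compute_distances_two_loops(self, X):
        num_test, num_train = X.shape[0], self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            for j in range(num_train):
                # L2 distance between test point i and training point j
                dists[i, j] = np.sqrt(np.sum((X[i] - self.X_train[j]) ** 2))
        return dists

    def predict_labels(self, dists, k=1):
        num_test = dists.shape[0]
        y_pred = np.zeros(num_test, dtype=self.y_train.dtype)
        for i in range(num_test):
            # labels of the k nearest training points, then majority vote
            closest_y = self.y_train[np.argsort(dists[i])[:k]]
            y_pred[i] = np.bincount(closest_y).argmax()
        return y_pred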
"""
接下来是用不同的方式计算dist,并进行比较,可以不看
"""
# Now lets speed up distance matrix computation by using partial vectorization
# with one loop. Implement the function compute_distances_one_loop and run the
# code below:
dists_one = classifier.compute_distances_one_loop(X_test)
# To ensure that our vectorized implementation is correct, we make sure that it
# agrees with the naive implementation. There are many ways to decide whether
# two matrices are similar; one of the simplest is the Frobenius norm. In case
# you haven't seen it before, the Frobenius norm of two matrices is the square
# root of the squared sum of differences of all elements; in other words, reshape
# the matrices into vectors and compute the Euclidean distance between them.
# The Frobenius norm (ord='fro') of a matrix A is the square root of the sum of the
# squared absolute values of its entries
difference = np.linalg.norm(dists - dists_one, ord='fro')
print('Difference was: %f' % (difference, ))
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')
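For completeness, the partially and fully vectorized versions being compared can be sketched as two more methods of the same class. The no-loop variant uses the expansion ||a - b||^2 = ||a||^2 - 2*a.b + ||b||^2; again a sketch, not necessarily the course's exact code:

    def compute_distances_one_loop(self, X):
        # one row of the distance matrix at a time, via broadcasting
        num_test, num_train = X.shape[0], self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            dists[i, :] = np.sqrt(np.sum((self.X_train - X[i]) ** 2, axis=1))
        return dists

    def compute_distances_no_loops(self, X):
        # fully vectorized: ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2
        test_sq = np.sum(X ** 2, axis=1, keepdims=True)   # (num_test, 1)
        train_sq = np.sum(self.X_train ** 2, axis=1)      # (num_train,)
        cross = X.dot(self.X_train.T)                     # (num_test, num_train)
        # clamp tiny negative values from floating-point error before the sqrt
        return np.sqrt(np.maximum(test_sq - 2 * cross + train_sq, 0))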