python numpy
1. np.flatnonzero()
(1) Purpose: flattens the input array and returns the indices of its nonzero elements; in practice it is often called as np.flatnonzero(a == y) to find the positions where the array equals a given value y.
(2) Usage:
a = np.array([1,2,3,4,4,3,5,3,6])
b = np.flatnonzero(a == 3)
print (b)
[2 5 7]
2. np.random.choice()
(1) Purpose: randomly samples elements from a given array to build a new array;
(2) Usage:
arr = np.array(['a', 'b', 'c', 'd'])
np.random.choice(arr, 3) # replace defaults to True, so elements may repeat
np.random.choice(arr, 3, replace=False) # sampling without replacement: no repeated elements
np.random.choice(arr, 3, p=[0.3, 0.2, 0.1, 0.4]) # p gives the selection probability of each element
Sample output (results are random): ['a', 'a', 'c']
['a', 'b', 'd']
['a', 'd', 'd']
3. np.argsort()
(1) Purpose: returns the indices that would sort the array in ascending order;
(2) Usage:
For a 2-D array:
x = np.array([[0, 3], [2, 2]])
np.argsort(x, axis=0) # sort along each column
np.argsort(x, axis=1) # sort along each row
Output: array([[0, 1], [1, 0]])
        array([[0, 1], [0, 1]])
4. np.bincount()
(1) Purpose: if the maximum value in the input array is x, the function returns how many times each value from 0 to x occurs (often followed by np.argmax, as in the K-NN code below);
(2) Usage:
x = np.array([0, 1, 1, 3, 2, 1, 7])
np.bincount(x)
Output: array([1, 3, 1, 1, 0, 0, 0, 1])
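Combined with np.argmax, this picks the most frequent value, which is exactly how it is used in the K-NN prediction code further down; a small illustrative sketch:
votes = np.array([2, 2, 1, 3, 2]) # e.g. the labels of the k nearest neighbors
np.argmax(np.bincount(votes)) # -> 2, the label with the most votes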
5. np.linalg.norm()
(1) Purpose: linalg = linear + algebra; norm computes a vector or matrix norm
(2) Usage:
x_norm = np.linalg.norm(x, ord=None, axis=None, keepdims=False)
ord selects the norm, e.g. 1, 2, or np.inf (infinity); the default (ord=None) is the 2-norm for vectors and the Frobenius norm for matrices;
axis: 0 computes the norm of each column, 1 computes the norm of each row; by default (axis=None) a matrix norm is computed over the whole array;
keepdims: True keeps the reduced axes so the result stays 2-D, False does not
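A small illustrative sketch (the matrix x below is made up for the example):
x = np.array([[3.0, 4.0], [6.0, 8.0]])
np.linalg.norm(x) # Frobenius norm of the whole matrix: sqrt(125) ≈ 11.18
np.linalg.norm(x, axis=1) # L2 norm of each row: array([ 5., 10.])
np.linalg.norm(x, ord=1, axis=1) # L1 norm of each row: array([ 7., 14.])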
6. np.array_split()
(1) Purpose: splits an array into sub-arrays; unlike np.split, the pieces do not have to be equal in size;
(2) Usage:
x = np.arange(8.0)
np.array_split(x, 3)
Output: [array([ 0., 1., 2.]), array([ 3., 4., 5.]), array([ 6., 7.])]
7. np.concatenate()
(1) Purpose: joins arrays along an existing axis; for plain Python lists one can use extend()/append(), but joining multiple NumPy arrays is usually done with concatenate;
(2) Usage:
a=np.array([[1,2,3],[4,5,6]])
b=np.array([[11,21,31],[7,8,9]])
np.concatenate((a,b),axis=0)
Output: array([[ 1, 2, 3],
[ 4, 5, 6],
[11, 21, 31],
[ 7, 8, 9]])
np.concatenate((a,b),axis=1) # axis=1 joins the corresponding rows
Output: array([[ 1, 2, 3, 11, 21, 31],
[ 4, 5, 6, 7, 8, 9]])
The K-NN algorithm
1. Main idea
Compute the distances between a new sample and all training samples, then take the K nearest neighbors to perform classification or regression (a supervised learning method).
A concrete implementation:
(1) Distance computation:
Manhattan distance, Euclidean distance, etc. can be used; see the vectorized sketch after the prediction code below.
(2) Predicting labels:
def predict_labels(self, dists, k=1):
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in range(num_test):
        # take the labels of the k training samples closest to test sample i
        closest_y = self.y_train[np.argsort(dists[i, :])[:k]]
        # majority vote: pick the label that occurs most often among the k neighbors
        y_pred[i] = np.argmax(np.bincount(closest_y))
    return y_pred
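The cross-validation code below calls model.compute_distances_no_loops(X_val), whose implementation is not shown here; a minimal sketch of the fully vectorized Euclidean distance it is meant to compute, assuming X and self.X_train store flattened samples as rows, could look like this:
def compute_distances_no_loops(self, X):
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, expanded for every test/train pair
    test_sq = np.sum(X ** 2, axis=1, keepdims=True)   # shape (num_test, 1)
    train_sq = np.sum(self.X_train ** 2, axis=1)      # shape (num_train,)
    cross = X.dot(self.X_train.T)                     # shape (num_test, num_train)
    return np.sqrt(np.maximum(test_sq + train_sq - 2 * cross, 0))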
2. Choosing K
The choice of K strongly affects the prediction: when K is small, the training error is small but the test error is large, i.e. the model overfits;
a common way to pick K is cross-validation on the training data;
Code:
num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

# split the training data into num_folds folds
X_train_folds = np.array_split(X_train, num_folds, axis=0)
y_train_folds = np.array_split(y_train, num_folds, axis=0)

k_to_accuracies = {}
for i in range(num_folds):
    # fold 0 is always the validation fold; the remaining folds form the training set
    X_val = X_train_folds[0]
    y_val = y_train_folds[0]
    X_tra = np.concatenate(X_train_folds[1:num_folds])
    y_tra = np.concatenate(y_train_folds[1:num_folds])
    # swap fold 0 with fold i+1 so a different fold is used for validation next time
    if i < num_folds - 1:
        X_train_folds[0], X_train_folds[i + 1] = X_train_folds[i + 1], X_train_folds[0]
        y_train_folds[0], y_train_folds[i + 1] = y_train_folds[i + 1], y_train_folds[0]
    model = KNearestNeighbor()
    model.train(X_tra, y_tra)
    dists = model.compute_distances_no_loops(X_val)
    for k in k_choices:
        y_val_pred = model.predict_labels(dists, k=k)
        num_correct = np.sum(y_val_pred == y_val)
        accuracy = float(num_correct) / y_val.shape[0]
        if i == 0:
            k_to_accuracies[k] = []
        k_to_accuracies[k].append(accuracy)

# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))
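After the accuracies are collected, one would normally average them over the folds and keep the k with the best mean accuracy; a short follow-up sketch using the k_to_accuracies dictionary above (picking by mean accuracy is an assumption, not part of the original code):
mean_acc = {k: np.mean(v) for k, v in k_to_accuracies.items()}  # average accuracy per k
best_k = max(mean_acc, key=mean_acc.get)                        # k with the highest mean accuracy
print('best k = %d, mean accuracy = %f' % (best_k, mean_acc[best_k]))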