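In cluster analysis, the simplest and most fundamental approach is partitioning, which organizes the objects into several mutually exclusive groups, or clusters.
The clusters are formed to optimize an objective partitioning criterion, so that objects in the same cluster are similar to one another and objects in different clusters are dissimilar.
The most commonly used partitioning methods are k-means and k-medoids.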
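I. The k-means algorithm
k-means is a very common clustering algorithm. Its basic idea is to search iteratively for a partition of the data into k clusters such that the total error incurred by representing the samples of each cluster by that cluster's mean is as small as possible.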
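In commonly used notation, this total error is written as a sum of squared distances between each sample and the mean of its cluster. The symbols below ($C_j$ for the j-th cluster, $\mu_j$ for its mean, $x$ for a sample) are notation assumed here for illustration and do not come from the original text:

```latex
J = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x
```

A partition with a smaller $J$ has samples that lie closer, on average, to the mean that represents them.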
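Algorithm
Algorithm: k-means. The k-means algorithm for partitioning, in which each cluster's center is represented by the mean of all objects in the cluster.
Input:
k: the number of clusters
D: a data set containing n objects
Output:
A set of k clusters
Method: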
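1. Randomly choose k cluster centroids.
2. Repeat the following until convergence:
   For each sample i, compute the cluster it should belong to: the one whose centroid is nearest to it.
   For each cluster j, recompute its centroid as the mean of the samples currently assigned to it.
3. Stopping conditions:
   1. The maximum number of iterations is exceeded.
   2. The decrease in the total error J between two consecutive iterations falls below a threshold.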
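The two steps inside the loop can be sketched on a toy data set. This is a minimal illustration, not the implementation used later in the article: the helper name `kmeans_step`, the four one-dimensional points, and the starting centroids are all made up for the example, and J is assumed to be the sum of squared distances.

```python
import numpy as np

def kmeans_step(X, centers):
    """One k-means iteration: assign samples, then recompute centroids."""
    # squared distance from every sample to every centroid, shape (n, k)
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dist.argmin(axis=1)                     # assignment step
    new_centers = centers.copy()
    for j in range(len(centers)):
        members = X[labels == j]
        if len(members) > 0:                         # an empty cluster keeps its old centroid
            new_centers[j] = members.mean(axis=0)    # update step
    J = ((X - new_centers[labels]) ** 2).sum()       # total squared error
    return labels, new_centers, J

# toy data: two well-separated groups on a line (made up for illustration)
X = np.array([[0.0], [1.0], [10.0], [11.0]])
centers = np.array([[0.0], [1.0]])   # deliberately poor starting centroids
history = []
for _ in range(5):
    labels, centers, J = kmeans_step(X, centers)
    history.append(J)
```

With these inputs the labels settle at `[0, 0, 1, 1]` and the centroids at 0.5 and 10.5 after the second iteration, and `history` never increases from one iteration to the next, which is why a small decrease in J is a usable stopping test.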
II. Implementation
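The implementation computes all sample-to-center squared distances in one matrix expression instead of looping over pairs. It relies on the expansion ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x·y. The sketch below checks that expansion against a direct loop; the helper name `sq_dists` and the random test arrays are assumptions made for illustration.

```python
import numpy as np

def sq_dists(X, Y):
    """Squared Euclidean distances between rows of X (n, d) and rows of Y (m, d)."""
    xx = np.sum(X * X, axis=1)[:, None]   # (n, 1): ||x||^2 for each row of X
    yy = np.sum(Y * Y, axis=1)[None, :]   # (1, m): ||y||^2 for each row of Y
    return xx + yy - 2 * X @ Y.T          # broadcasts to (n, m)

rng = np.random.default_rng(0)            # fixed seed so the check is repeatable
X = rng.normal(size=(5, 2))
Y = rng.normal(size=(3, 2))

# brute-force reference: one pair at a time
ref = np.array([[np.sum((x - y) ** 2) for y in Y] for x in X])
max_err = np.abs(sq_dists(X, Y) - ref).max()
```

Because of floating-point rounding, the expanded form can come out very slightly negative for nearly identical points. That does not affect `argmin`, but it matters if a square root is taken afterwards.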
# Requires Python 3, NumPy and matplotlib.
import os
import pickle

import numpy as np
from matplotlib import pyplot


def Kmeans(X, k, observer=None, threshold=1e-15, maxiter=300):
    N = len(X)
    labels = np.zeros(N, dtype=int)
    # pick k distinct samples as the initial centers
    centers = X[np.random.choice(N, k, replace=False)].astype(float)
    n_iter = 0

    def calc_J():
        # total error: sum of squared distances from each sample to its center
        return np.sum((X - centers[labels]) ** 2)

    def distmat(X, Y):
        # squared distances between all rows of X and all rows of Y, using
        # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y
        xx = np.sum(X * X, axis=1)  # axis=1 sums along each row, shape (n,)
        yy = np.sum(Y * Y, axis=1)  # shape (m,)
        xy = np.dot(X, Y.T)         # matrix product, shape (n, m)
        return xx[:, None] + yy[None, :] - 2 * xy  # broadcast to (n, m)

    Jprev = calc_J()
    while True:
        # notify the observer
        if observer is not None:
            observer(n_iter, labels, centers)

        # calculate the distance from each x to each center
        # (scipy.spatial.distance_matrix(X, centers) is an alternative,
        # available in scipy newer than 0.7)
        dist = distmat(X, centers)
        # assign x to the nearest center
        labels = dist.argmin(axis=1)
        # re-calculate each center
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)

        J = calc_J()
        n_iter += 1

        # stop when J no longer decreases by more than the threshold
        if Jprev - J < threshold:
            break
        Jprev = J
        if n_iter >= maxiter:
            break

    # final notification
    if observer is not None:
        observer(n_iter, labels, centers)
    return labels, centers


if __name__ == '__main__':
    # load previously generated points
    with open('cluster.pkl', 'rb') as inf:
        samples = pickle.load(inf)

    # stack the per-cluster (xs, ys) pairs into one (N, 2) array
    N = 0
    for smp in samples:
        N += len(smp[0])
    X = np.zeros((N, 2))
    idxfrm = 0
    for i in range(len(samples)):
        idxto = idxfrm + len(samples[i][0])
        X[idxfrm:idxto, 0] = samples[i][0]
        X[idxfrm:idxto, 1] = samples[i][1]
        idxfrm = idxto

    os.makedirs('kmeans', exist_ok=True)  # savefig does not create directories

    def observer(n_iter, labels, centers):
        print("iter %d." % n_iter)
        colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
        pyplot.clf()  # clear previous plot
        # draw points
        data_colors = colors[labels]
        pyplot.scatter(X[:, 0], X[:, 1], c=data_colors, alpha=0.5)
        # draw centers
        pyplot.scatter(centers[:, 0], centers[:, 1], s=200, c=colors)
        pyplot.savefig('kmeans/iter_%02d.png' % n_iter, format='png')

    Kmeans(X, 3, observer=observer)
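The script loads `cluster.pkl` but does not show how that file was produced. Judging from how the loader indexes it (`samples[i][0]` for the x coordinates and `samples[i][1]` for the y coordinates of one group), a file with a compatible layout could be generated roughly as follows under Python 3. The blob centers, spread, sample counts, and the helper names `make_samples` and `save_samples` are all assumptions made for illustration, not the data used in the article.

```python
import pickle
import numpy as np

def make_samples(centers, n_per_cluster=100, scale=0.5, seed=0):
    """Return a list of (xs, ys) pairs, one pair per Gaussian blob."""
    rng = np.random.default_rng(seed)
    samples = []
    for cx, cy in centers:
        xs = rng.normal(cx, scale, n_per_cluster)   # x coordinates of this blob
        ys = rng.normal(cy, scale, n_per_cluster)   # y coordinates of this blob
        samples.append((xs, ys))
    return samples

def save_samples(samples, path='cluster.pkl'):
    # pickle needs the file opened in binary mode under Python 3
    with open(path, 'wb') as outf:
        pickle.dump(samples, outf)

# three hypothetical blob centers, matching the k=3 used in the script
samples = make_samples([(0, 0), (3, 3), (0, 4)])
# save_samples(samples)  # uncomment to write cluster.pkl in the working directory
```

Three blobs are used because the script calls `Kmeans` with k=3 and defines three plot colors. With overlapping blobs or a different k, the recovered clusters need not match the generating groups.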