K-means algorithm (Python) - CSDN Blog

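This article presents a Python implementation of K-means clustering, a classic machine learning algorithm. The algorithm iteratively partitions a data set into a number of clusters so that samples within a cluster are similar to one another while samples in different clusters differ more. The article explains how the algorithm works and provides the complete source code.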
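To make the iterative idea concrete, here is a minimal sketch of a single k-means iteration: assign every point to its nearest centroid, then move each centroid to the mean of the points assigned to it. The function name `assign_and_update` and the use of plain tuples are assumptions made for illustration and are not part of the original code; `math.dist` requires Python 3.8 or later.

```python
import math

def assign_and_update(points, centroids):
    """Run one k-means iteration: assign points, then recompute centroids."""
    groups = [[] for _ in centroids]
    for p in points:
        # Index of the centroid closest to p (Euclidean distance).
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
        groups[nearest].append(p)
    new_centroids = []
    for group, old in zip(groups, centroids):
        if group:
            # Coordinate-wise mean of the group's members.
            new_centroids.append(
                tuple(sum(axis) / len(group) for axis in zip(*group)))
        else:
            # Assumption: an empty group keeps its previous centroid.
            new_centroids.append(old)
    return groups, new_centroids

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
groups, centroids = assign_and_update(points, [(0.0, 0.0), (10.0, 10.0)])
print(groups)     # two groups of two points each
print(centroids)  # [(0.0, 1.0), (10.0, 11.0)]
```

Repeating this step until the centroids stop moving is, in essence, what the full listing does with its `Point` and `Cluster` classes.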

Reposted from: http://www.daniweb.com/forums/thread31449.html
Without further ado, let's go straight to the code.
The comments should be fairly detailed.

# liukaiyi
# Note: k-means requires numeric coordinates in every dimension (e.g. 199 or 23.13)
import math
import random
import sys

# -- A data point
# -- in n-dimensional space
class Point:
    def __init__(self, coords, reference=None):
        self.coords = coords
        self.n = len(coords)
        self.reference = reference

    def __repr__(self):
        return str(self.coords)

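# -- A cluster: a set of points plus their centroid (the mean of the member points)
# -- in n-dimensional space
# -- This is the core class of k-means
# -- In every iteration the data points are gathered around the centroid
# -- The class recomputes the centroid from the gathered points and stores it,
# -- so that it serves as the centre for the next iteration.
class Cluster:
    def __init__(self, points):
        if len(points) == 0:
            raise Exception("ILLEGAL: EMPTY CLUSTER")
        self.points = points
        self.n = points[0].n
        for p in points:
            if p.n != self.n:
                raise Exception("ILLEGAL: MULTISPACE CLUSTER")
        # Compute the mean of the gathered points
        self.centroid = self.calculateCentroid()

    def __repr__(self):
        return str(self.points)

    # Replace the member points, recompute the centroid, and return
    # the distance between the old centroid and the new one
    def update(self, points):
        old_centroid = self.centroid
        self.points = points
        # Guard: if no point was assigned to this cluster, keep the old centroid
        # (otherwise calculateCentroid would divide by zero)
        if len(points) == 0:
            return 0.0
        self.centroid = self.calculateCentroid()
        return getDistance(old_centroid, self.centroid)

    # Compute the mean point of the points gathered by this cluster
    # (those nearest to its centroid); the result becomes the new centroid
    def calculateCentroid(self):
        centroid_coords = []
        # Iterate over the dimensions
        for i in range(self.n):
            centroid_coords.append(0.0)
            # Iterate over the member points
            for p in self.points:
                centroid_coords[i] = centroid_coords[i] + p.coords[i]
            centroid_coords[i] = centroid_coords[i] / len(self.points)
        return Point(centroid_coords)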

# -- Return the clusters produced by k-means
def kmeans(points, k, cutoff):
    # Randomly sample k Points from the points list, build Clusters around them
    initial = random.sample(points, k)
    clusters = []
    for p in initial:
        clusters.append(Cluster([p]))
    # Iterate until the largest centroid shift in an iteration falls below cutoff
    while True:
        # One list of gathered points per cluster (k lists)
        lists = []
        for c in clusters:
            lists.append([])
        # For every data point, compute its distance to each centroid
        # and add the point to the list of the nearest centroid.
        # smallest_distance holds the shortest distance found so far for the current point.
        for p in points:
            smallest_distance = getDistance(p, clusters[0].centroid)
            index = 0
            for i in range(len(clusters[1:])):
                distance = getDistance(p, clusters[i + 1].centroid)
                if distance < smallest_distance:
                    smallest_distance = distance
                    index = i + 1
            # Add the point to the list of its nearest centroid
            lists[index].append(p)

        # Gathering is done: compute the new centroids.
        # Each cluster stores its new centroid (the centre for the next iteration)
        # and reports how far it moved; once the largest shift is below cutoff
        # (0.5 in main), leave the loop and return the final clusters.
        biggest_shift = 0.0
        for i in range(len(clusters)):
            shift = clusters[i].update(lists[i])
            biggest_shift = max(biggest_shift, shift)
        if biggest_shift < cutoff:
            break
    return clusters

# -- Return the Euclidean distance between two points
def getDistance(a, b):
    # Forbid measurements between Points in different spaces
    if a.n != b.n:
        raise Exception("ILLEGAL: NON-COMPARABLE POINTS")
    # Euclidean distance between a and b is sqrt(sum((a[i]-b[i])^2) for all i)
    ret = 0.0
    for i in range(a.n):
        ret = ret + pow((a.coords[i] - b.coords[i]), 2)
    return math.sqrt(ret)

# -- Create a random point in n-dimensional space
# -- (used to generate random test data)
def makeRandomPoint(n, lower, upper):
    coords = []
    for i in range(n):
        coords.append(random.uniform(lower, upper))
    return Point(coords)

# main
def main(args):
    # Parameters:
    #   num_points: number of random data points
    #   n:          number of dimensions
    #   k:          number of clusters
    #   cutoff:     centroid shift below which the iteration stops
    #   lower:      minimum coordinate value
    #   upper:      maximum coordinate value
    num_points, n, k, cutoff, lower, upper = 10, 2, 3, 0.5, -200, 200

    # Create num_points random points in n-dimensional space
    # (test data generation)
    points = []
    for i in range(num_points):
        points.append(makeRandomPoint(n, lower, upper))

    # Cluster the data points with k-means (entry point of the algorithm)
    clusters = kmeans(points, k, cutoff)

    print("\nPOINTS:")
    for p in points:
        print("P:", p)
    print("\nCLUSTERS:")
    for c in clusters:
        print("C:", c)

if __name__ == "__main__":
    main(sys.argv)
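
The stopping rule in `kmeans` (stop once the largest centroid shift in an iteration drops below `cutoff`) can be watched in isolation. The sketch below is a compact stand-in for the listing, not the listing itself: it uses plain tuples instead of `Point` objects, takes fixed starting centroids instead of `random.sample` so that the run is repeatable, and records the largest shift of every iteration. The name `kmeans_shift_trace` is hypothetical; `math.dist` requires Python 3.8 or later.

```python
import math

def kmeans_shift_trace(points, centroids, cutoff):
    """Iterate k-means from the given centroids.

    Returns the final centroids and the largest centroid shift
    observed in each iteration.
    """
    shifts = []
    while True:
        # Assignment step: gather each point under its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Update step: mean of each group (an empty group keeps its centroid).
        new_centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        biggest_shift = max(math.dist(a, b)
                            for a, b in zip(centroids, new_centroids))
        shifts.append(biggest_shift)
        centroids = new_centroids
        if biggest_shift < cutoff:
            return centroids, shifts

points = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,)]
final, shifts = kmeans_shift_trace(points, [(1.0,), (2.0,)], cutoff=0.5)
print(final)        # [(2.0,), (11.0,)]
print(len(shifts))  # 3 iterations; the last shift is 0.0
```

Even with a poor start (both centroids inside the left group), the shifts shrink from iteration to iteration until the last one falls under the cutoff. A smaller `cutoff` demands tighter convergence at the cost of more iterations.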

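This article is reposted from Liu Kaiyi's blog on cnblogs (博客园); original post: k-means (Python) algorithm. To repost it, please contact the original author.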