A Python Implementation of K-Means Clustering

This post introduces k-means clustering, a classic unsupervised learning method, and implements a simple k-means model in Python. The model clusters randomly generated data, and the clustering result is shown in a scatter plot.


Reference: 《机器学习实战》 (Machine Learning in Action)

The k-means algorithm is given first:


Input: sample set D = \{x_1, x_2, ..., x_m\};
       number of clusters k.

Process:

  1. Randomly select k samples from D as the initial mean vectors \{\mu_1, \mu_2, ..., \mu_k\}
  2. repeat
  3.     let C_i = \varnothing (1 \leq i \leq k)
  4.     for j = 1, 2, ..., m do
  5.         compute the distance between sample x_j and each mean vector \mu_i (1 \leq i \leq k): d_{ji} = \left\| x_j - \mu_i \right\|_2
  6.         determine the cluster label of x_j from its nearest mean vector: \lambda_j = \arg\min_{i \in \{1,2,...,k\}} d_{ji}
  7.         add x_j to the corresponding cluster: C_{\lambda_j} = C_{\lambda_j} \cup \{x_j\}
  8.     end for
  9.     for i = 1, 2, ..., k do
  10.         compute the new mean vector: \mu'_i = \frac{1}{|C_i|} \sum_{x \in C_i} x
  11.         if \mu'_i \neq \mu_i then
  12.             update the current mean vector \mu_i to \mu'_i
  13.         else
  14.             keep the current mean vector unchanged
  15.         end if
  16.     end for
  17. until none of the mean vectors have been updated

Output: cluster partition C = \{C_1, C_2, ..., C_k\}
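Before the full implementation, the assignment and update steps (steps 5-7 and 10 above) can also be written compactly with NumPy broadcasting. This is only an illustrative sketch, separate from the reference implementation below; X and centroids are placeholder arrays of shape (m, n) and (k, n):

import numpy as np

def kmeans_step(X, centroids):
    """One k-means iteration: assign each sample, then recompute the cluster means."""
    # d_ji: Euclidean distance between every sample x_j and every mean vector mu_i
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # shape (m, k)
    labels = np.argmin(dists, axis=1)                                      # lambda_j = argmin_i d_ji
    new_centroids = np.array([
        X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]  # keep empty clusters fixed
        for i in range(len(centroids))
    ])
    return labels, new_centroids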


# -*- coding: utf-8 -*-
"""
Created on Mon Jul 16 21:54:20 2018

@author: Li Qingquan
"""

import numpy as np
from sklearn.datasets import make_blobs
import time
import matplotlib.pyplot as plt

seed = 2018
np.random.seed(seed)  # fix the global NumPy random state for reproducibility

class KMeans(object):

    def __init__(self, n_clusters):
        '''
        :param n_clusters: number of clusters k
        '''
        self.n_clusters = n_clusters


    def fit(self, X):
        '''
        Compute K-Means clustering on X.

        :param X: array of shape (n_samples, n_features)
        '''
        # initialize k centroids uniformly at random within the bounding box of the data
        n = X.shape[1]
        centroids = np.zeros((self.n_clusters, n)) # one row per centroid
        for i in range(n):
            minNum = np.min(X[:, i])
            maxNum = np.max(X[:, i])
            centroids[:, i] = minNum + (maxNum - minNum) * np.random.rand(self.n_clusters)
        self.centroids = centroids
        
        # update centroids to find the optimum solution
        m = X.shape[0]
        Assment = np.zeros((m, 2)) # per sample: [assigned cluster index, squared distance to its centroid]
        clusterChanged = True
        while clusterChanged:
            clusterChanged = False
            for i in range(m): # assign each sample to its closest centroid
                minDist = np.inf
                minIndex = -1
                for j in range(self.n_clusters):
                    # Euclidean distance between sample i and centroid j
                    distBetw = np.sqrt(np.sum(np.power(centroids[j, :] - X[i, :], 2)))
                    if distBetw < minDist:
                        minDist = distBetw
                        minIndex = j
                if Assment[i, 0] != minIndex:
                    clusterChanged = True
                    Assment[i, :] = minIndex, minDist**2
            for i in range(self.n_clusters): # recompute each centroid as the mean of its assigned points
                indexAll = Assment[:, 0]
                valueI = np.nonzero(indexAll == i)
                pointsIn = X[valueI[0]]
                if len(pointsIn) > 0: # guard against empty clusters (the mean of an empty slice is NaN)
                    centroids[i, :] = np.mean(pointsIn, axis=0)
        self.centroids = centroids
        self.Assment = Assment


    def predict(self, X):
        '''
        Predict the closest cluster each sample in X belongs to.

        :param X: array of shape (n_samples, n_features)
        :return: array of cluster indices, shape (n_samples,)
        '''
        m = X.shape[0]
        y_pred = np.empty((m,))
        for i in range(m): # assign each sample to its nearest centroid
            minDist = np.inf
            for j in range(self.n_clusters):
                distBetw = np.sqrt(np.sum(np.power(self.centroids[j, :] - X[i, :], 2)))
                if distBetw < minDist:
                    minDist = distBetw
                    y_pred[i] = j
        return y_pred


    def fit_predict(self, X):
        '''
        Fit the model to X and return the predicted cluster index for each sample.

        :param X: array of shape (n_samples, n_features)
        :return: array of cluster indices, shape (n_samples,)
        '''
        self.fit(X)
        return self.predict(X)

if __name__ == '__main__':

    ################################# 1. define params ###################################
    # number of samples
    n_samples = 1000
    # number of clusters; you can choose the value yourself, the recommended value is 3
    # (make_blobs generates 3 blobs by default)
    n_clusters = 4

    # generate synthetic data
    X, y = make_blobs(n_samples=n_samples, random_state=seed)

    ################################# 2. initialize the model ###################################
    kmeans_model = KMeans(n_clusters=n_clusters)



    ################################# 3. training and predicting #######################################
    start_time = time.time()
    y_pred = kmeans_model.fit_predict(X)
    end_time = time.time()
    print('Training and predicting Time: ', end_time - start_time)



    ################################ 4. plot #######################################################
    plt.scatter(X[:, 0], X[:, 1], c=y_pred)
    plt.title("Unevenly Sized Blobs")

    plt.show()
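As an optional sanity check (not part of the original script), the custom model's labels can be compared with scikit-learn's own KMeans on the same data; the adjusted Rand index ignores label permutations, so identical partitions score 1.0. A minimal sketch to append inside the __main__ block, reusing X, y_pred, n_clusters and seed from above:

    # Optional: compare against scikit-learn's KMeans (import under an alias to avoid
    # shadowing the custom KMeans class defined above).
    from sklearn.cluster import KMeans as SKKMeans
    from sklearn.metrics import adjusted_rand_score

    sk_pred = SKKMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    print('Adjusted Rand index vs sklearn KMeans: ', adjusted_rand_score(y_pred, sk_pred))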

 
