Machine Learning: K-Means Clustering (Elbow Method and Silhouette Coefficient)

License: CC 4.0 BY-SA

Santorinisu's blog. Reproduction without permission is prohibited.

Tags: #python #clustering #machine-learning

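This article introduces the basic principles of the K-Means clustering algorithm, including how the initial centroids are chosen and how the iteration proceeds. To choose the number of clusters, it evaluates model performance with the elbow method and silhouette analysis. The elbow method looks for the point at which the distortion stops dropping steeply, while the silhouette coefficient measures how tightly a sample sits within its own cluster and how well separated it is from the other clusters. The results show that clustering works best with three clusters: the average silhouette coefficient is close to 0.72, which is consistent with the elbow method.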

Section I: Brief Introduction to K-Means Clustering

The K-Means algorithm belongs to the category of prototype-based clustering. Prototype-based clustering means that each cluster is represented by a prototype, which can be either the centroid (average) of similar points with continuous features, or the medoid (the most frequently occurring point) in the case of categorical features. While K-Means is very good at identifying clusters with a spherical shape, one drawback of this clustering algorithm is that the number of clusters, k, must be specified in advance. An inappropriate choice of k can result in poor clustering performance, so two measures of model performance, the elbow method and the silhouette coefficient, are useful for evaluating the quality of a clustering and determining the optimal number of clusters.
The K-Means algorithm can be summarized in the following four steps:

Step 1: Randomly pick k centroids from the sample points as the initial cluster centers.
Step 2: Assign each sample to its nearest centroid.
Step 3: Move each centroid to the center (mean) of the samples assigned to it.
Step 4: Repeat steps 2 and 3 until the cluster assignments no longer change, or until a user-defined tolerance or maximum number of iterations is reached.
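
The steps above can be sketched in plain NumPy. This is only a minimal illustration, not scikit-learn's implementation: the function name `kmeans`, its parameters, the toy data `X_demo`, and the use of an absolute centroid shift as the tolerance test are all assumptions made for this example.

```python
import numpy as np

def kmeans(X, k, max_iter=300, tol=1e-4, seed=0):
    """Minimal K-Means sketch (hypothetical helper, not the sklearn API)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct samples as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Assignment: squared Euclidean distance from every sample to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Update: move each centroid to the mean of its samples
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[new_labels == j].mean(axis=0) if np.any(new_labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when assignments are unchanged or the centroids barely move
        # (assumption: tol is an absolute shift, unlike sklearn's relative tol)
        shift = np.linalg.norm(new_centroids - centroids)
        unchanged = labels is not None and np.array_equal(new_labels, labels)
        centroids, labels = new_centroids, new_labels
        if unchanged or shift <= tol:
            break
    # Within-cluster sum of squared errors (SSE, also called distortion)
    sse = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, sse

# Toy data: three tight groups of 20 points each (made up for this sketch)
demo_rng = np.random.default_rng(42)
X_demo = np.vstack([
    demo_rng.normal(loc=c, scale=0.1, size=(20, 2))
    for c in [(0, 0), (5, 5), (0, 5)]
])
labels, centroids, sse = kmeans(X_demo, k=3)
print(centroids.shape, sse)
```

Because the starting centroids are random, a single run can settle in a poor local optimum; running the algorithm several times and keeping the result with the lowest SSE is the usual remedy.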

Source:
Sebastian Raschka, Vahid Mirjalili. Python Machine Learning, 2nd edition (Chinese edition: Python机器学习第二版). Nanjing: Southeast University Press, 2018.
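
To make the two measures from Section I concrete, here is a minimal sketch that computes the distortion (scikit-learn's `inertia_`) and the average silhouette coefficient for several values of k. It uses synthetic blobs from `make_blobs`; the range of k values (1 to 6) and the variable names `inertias`, `silhouettes`, and `best_k` are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: 150 two-dimensional points around 3 centers
X, _ = make_blobs(n_samples=150, n_features=2, centers=3,
                  cluster_std=0.5, shuffle=True, random_state=0)

inertias = {}     # k -> within-cluster SSE (distortion), used by the elbow method
silhouettes = {}  # k -> average silhouette coefficient

for k in range(1, 7):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10,
                max_iter=300, random_state=0)
    labels = km.fit_predict(X)
    inertias[k] = km.inertia_
    # The silhouette coefficient is only defined for at least 2 clusters
    if k >= 2:
        silhouettes[k] = silhouette_score(X, labels, metric='euclidean')

for k in inertias:
    sil = silhouettes.get(k)
    sil_text = f"{sil:.3f}" if sil is not None else "n/a"
    print(f"k={k}  distortion={inertias[k]:.2f}  silhouette={sil_text}")

# Pick the k with the highest average silhouette coefficient
best_k = max(silhouettes, key=silhouettes.get)
print("k with the highest average silhouette:", best_k)
```

Reading the printed distortions from top to bottom, the elbow is the k after which the values stop falling sharply; the silhouette column gives a second, independent check on the same choice.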

Part 1: Basic K-Means Clustering Algorithm
Code

from sklearn import datasets

import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

plt.rcParams['figure.dpi']=200
plt.rcParams['savefig.dpi']=200
font = {'weight': 'light'}
plt.rc("font", **font)

#Section 1: Load Blobs from datasets and visualize it
X,y=datasets.make_blobs(n_samples=150,
                        n_features=2,
                        centers=3,
                        cluster_std=0.5,
                        shuffle=True,
                        random_state=0)
plt.scatter(X[:,0],X[:,1],c='white',marker='o',edgecolors='black',s=50)
plt.grid()
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.savefig('./fig1.png')
plt.show()

#Section 2: Use KMeans algorithm to visualize data points and centroids
from sklearn.cluster import KMeans

#Set n_init=10 to run the k-means algorithm 10 times independently with
#different random initial centroids and keep the run with the lowest SSE
km=KMeans(n_clusters=3,
          init='random',
          n_init=10,
          max_iter=300,
          tol=1e-4,
          random_state=0)

y_km=km.fit_predict(X)
plt.scatter(X[y_km==0,0],
            X[y_km==0,1],
            s=50,
            c='lightgreen',
            marker='s',
            edgecolor='black',
            label='Cluster 1')
plt.scatter(X[y_km==1,0],
            X[y_km==1,1],
            s=50,
            c='orange',
            marker='o',
            edgecolor='black',
            label='Cluster 2')
plt.scatter(X[y_km==2,0],
            X[y_km==2,1],
            s=50,
            c='lightblue',
            marker='v',
            edgecolor='black',
            label='Cluster 3')
plt.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1],
            s=250,
            marker='*',
            c='red',
            edgecolor='black',
            label='Centroids')
plt.xlabel('Featur