Clustering Algorithms: Estimating K and Evaluating Clustering Results

Link to this post: https://blog.youkuaiyun.com/buracag_mc/article/details/75727895

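Last week at my internship I needed cluster analysis, so I went back over the clustering algorithms and noticed that my earlier posts were missing two fairly important steps:

Using the elbow method to estimate the number of clusters
Using the silhouette coefficient to evaluate the quality of a clustering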

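Python already has ready-made tools for both, so you only need to swap in your own dataset.
All of my data, code, and results are on my company-issued computer, so only the Python implementation is included below.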

This post draws on the blog post: http://blog.youkuaiyun.com/u013719780/article/details/51755124

Cost (Objective) Function

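As already mentioned in the post http://blog.youkuaiyun.com/buracag_mc/article/details/74139768, the clustering we are looking for is the one that minimizes the within-cluster sum of squared deviations, that is:
$\underset{S}{\arg\min}\sum_{i=1}^{k}\sum_{x\in S_i}\lVert x-\mu_i\rVert^2$
where $S_i$ is the $i$-th cluster and $\mu_i$ is its centroid.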

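The wording used here, minimizing the within-cluster sum of squared deviations, is typical statistics terminology; you can tell at a glance that I studied statistics =_=;

It is the same thing that data mining and statistical machine learning usually call the cost (objective) function: minimizing the within-cluster sum of squared deviations is minimizing the cost function.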
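
To make the cost function concrete, here is a minimal sketch that computes the within-cluster sum of squares by hand and compares it with the `inertia_` attribute that scikit-learn's `KMeans` reports. The helper name `within_cluster_ss` and the toy array `X_demo` are made up for illustration and are not part of the original post.

```python
import numpy as np
from sklearn.cluster import KMeans


def within_cluster_ss(X, labels, centers):
    """Sum, over all samples, of the squared Euclidean distance to the sample's own centroid."""
    diffs = X - centers[labels]  # row j is sample j minus the centroid of its cluster
    return float(np.sum(diffs ** 2))


# Toy data (invented for this sketch): two well-separated groups in 2-D.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)
cost = within_cluster_ss(X_demo, model.labels_, model.cluster_centers_)

# The two numbers should agree up to floating-point error.
print(cost, model.inertia_)
```

In other words, the quantity K-Means minimizes is the one written in the formula above, and `inertia_` is a convenient way to read it off a fitted model.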

Estimating the Number of Clusters with the Elbow Method

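If the problem does not specify the number of clusters K, the elbow method can be used to estimate it.

The elbow method plots the value of the cost function for different values of K. As K increases, the average distortion decreases: each cluster contains fewer samples, so the samples lie closer to their centroid. As K keeps increasing, however, the improvement in average distortion keeps shrinking. The value of K at which the improvement drops off most sharply is the elbow.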

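One point needs explaining:
The distortion of a cluster equals the sum of squared distances between the cluster's centroid and its members, which is the within-cluster sum of squared deviations discussed earlier. The more compact the members of a cluster are, the smaller its distortion; the more spread out they are, the larger its distortion.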

Of course, Python already has ready-made tools for this. The key commands are:

# coding: utf-8

import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist  # distances between samples and centroids
import matplotlib.pyplot as plt

# X is the dataset to cluster, an array of shape (n_samples, n_features); supply your own.
K = range(1, 10)
meandistortions = []

for k in K:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    # Mean distortion here: average Euclidean distance from each sample to its nearest centroid
    meandistortions.append(sum(np.min(cdist(X, kmeans.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])

plt.plot(K, meandistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Mean distortion')
plt.title('Choosing the best K with the elbow method')
plt.show()

The resulting plot looks roughly like this:
[Figure: mean distortion plotted against k]
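
Reading the elbow off a plot is a judgment call. As a rough sketch of the rule stated earlier (the elbow is the K after which the improvement drops the most), one can compare consecutive improvements numerically. Everything below is an assumption for illustration: the synthetic data from `make_blobs`, the names `X_demo`, `improvement`, `drop` and `elbow_k`, and the second-difference heuristic itself, which is only one of several ways to locate an elbow.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (invented for this sketch): three tight blobs
# centred on the corners of a triangle.
centers = [[0.0, 0.0], [10.0, 0.0], [5.0, 8.66]]
X_demo, _ = make_blobs(n_samples=150, centers=centers,
                       cluster_std=0.5, random_state=0)

ks = list(range(1, 10))
mean_distortions = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    # Same measure as in the post: mean distance to the nearest centroid.
    nearest = np.min(cdist(X_demo, km.cluster_centers_, 'euclidean'), axis=1)
    mean_distortions.append(nearest.mean())

# improvement[j]: how much the distortion falls going from ks[j] to ks[j + 1].
improvement = -np.diff(mean_distortions)
# drop[j]: how much that improvement shrinks right after ks[j + 1].
drop = improvement[:-1] - improvement[1:]
elbow_k = ks[int(np.argmax(drop)) + 1]

print("estimated elbow:", elbow_k)
```

On real data the curve is usually much smoother than on these well-separated blobs, so a numeric estimate like this is best treated as a cross-check on the plot rather than a replacement for it.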

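Evaluating Clustering Quality

K-Means is a typical unsupervised learning method: there are no labels or other ground-truth information to compare the clustering results against. Even so, there are metrics we can use to evaluate the algorithm's performance.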

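==== The following explanation is from Baidu Baike ====
The silhouette coefficient is one way of evaluating how good a clustering is. It was first proposed by Peter J. Rousseeuw in 1986. It combines two factors, cohesion and separation, and can be used on the same underlying data to compare the effect of different algorithms, or of different runs of the same algorithm, on the clustering result.

Suppose a clustering algorithm has already partitioned the data into k clusters. A silhouette coefficient is computed for every vector in every cluster.
For a single point i: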

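Compute a(i) = the average distance from i to all other points in its own cluster
Compute b(i) = the minimum, over all clusters that i does not belong to, of the average distance from i to the points in that cluster
The silhouette coefficient of i is then:
$S(i) = \frac{b(i) - a(i)}{\max\{a(i),\,b(i)\}}$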

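Here is the figure from Baidu Baike, which makes this more intuitive:
[Figure: silhouette coefficient illustration from Baidu Baike]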

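As you can see, the silhouette coefficient lies in [-1, 1]; the closer it is to 1, the better both cohesion and separation are.

Finally, averaging the silhouette coefficients of all samples gives the overall silhouette coefficient of the clustering.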
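
The definition above can be checked by computing a(i), b(i) and S(i) directly and comparing against scikit-learn. This is only a sketch: the function name `silhouette_by_hand`, the toy array `X_demo` and the labels `labels_demo` are invented for illustration, and a point that sits alone in its cluster is given a score of 0 by convention.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import silhouette_samples, silhouette_score


def silhouette_by_hand(X, labels):
    """Per-sample silhouette values computed straight from a(i) and b(i)."""
    D = cdist(X, X, 'euclidean')  # D[i, j] is the distance between samples i and j
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton cluster: score left at 0 by convention
        # D[i, i] is 0, so dividing by n_own - 1 averages over the *other* members.
        a = D[i, own].sum() / (n_own - 1)
        # Smallest mean distance to the members of any other cluster.
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s


# Toy data and labels (invented for this sketch).
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels_demo = np.array([0, 0, 0, 1, 1, 1])

s_hand = silhouette_by_hand(X_demo, labels_demo)
print(s_hand.mean(), silhouette_score(X_demo, labels_demo))
```

The mean of the per-sample values is the overall silhouette coefficient, which is what `silhouette_score` returns; `silhouette_samples` gives the per-sample values.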

Again, Python has ready-made tools for this. The key commands are:

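import numpy as np
from sklearn.cluster import KMeans
from sklearn import metrics

# X is the dataset to cluster, an array of shape (n_samples, n_features).
# np.array() with no argument raises a TypeError, so pass your own data in.
X = np.array(data)   # `data` is a placeholder for your own samples
k = 3                # number of clusters; placeholder value, must satisfy 2 <= k <= n_samples - 1
kmeans_model = KMeans(n_clusters=k).fit(X)  # K-Means clustering with k clusters

s = metrics.silhouette_score(X, kmeans_model.labels_, metric='euclidean')  # silhouette score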
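
Because the score only needs the data and the labels, it can also be used to compare several candidate values of k. The sketch below does this on synthetic data; the data from `make_blobs` and the names `X_demo`, `scores` and `best_k` are assumptions for illustration, not something from my original work.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data (invented for this sketch): three tight, well-separated blobs.
centers = [[0.0, 0.0], [10.0, 0.0], [5.0, 8.66]]
X_demo, _ = make_blobs(n_samples=150, centers=centers,
                       cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 7):  # the score is undefined for k = 1
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels, metric='euclidean')

# Pick the k with the highest mean silhouette coefficient.
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(k, round(score, 3))
print("best k by silhouette:", best_k)
```

Unlike the elbow plot, this gives a single number per k, which is handy when the elbow is ambiguous; the two methods do not always agree on real data.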

Here is the help text for metrics.silhouette_score:

Compute the mean Silhouette Coefficient of all samples.

The Silhouette Coefficient is calculated using the mean intra-cluster
distance (``a``) and the mean nearest-cluster distance (``b``) for each
sample.  The Silhouette Coefficient for a sample is ``(b - a) / max(a,
b)``.  To clarify, ``b`` is the distance between a sample and the nearest
cluster that the sample is not a part of.
Note that Silhouette Coefficent is only defined if number of labels
is 2 <= n_labels <= n_samples - 1.

This function returns the mean Silhouette Coefficient over all samples.
To obtain the values for each sample, use :func:`silhouette_samples`.

The best value is 1 and the worst value is -1. Values near 0 indicate
overlapping clusters. Negative values generally indicate that a sample has
been assigned to the wrong cluster, as a different cluster is more similar.

Parameters
----------
X : array [n_samples_a, n_samples_a] if metric == "precomputed", or, \
         [n_samples_a, n_features] otherwise
    Array of pairwise distances between samples, or a feature array.

labels : array, shape = [n_samples]
     Predicted labels for each sample.

metric : string, or callable
    The metric to use when calculating distance between instances in a
    feature array. If metric is a string, it must be one of the options
    allowed by :func:`metrics.pairwise.pairwise_distances
    <sklearn.metrics.pairwise.pairwise_distances>`. If X is the distance
    array itself, use ``metric="precomputed"``.

sample_size : int or None
    The size of the sample to use when computing the Silhouette Coefficient
    on a random subset of the data.
    If ``sample_size is None``, no sampling is used.

random_state : integer or numpy.RandomState, optional
    The generator used to randomly select a subset of samples if
    ``sample_size is not None``. If an integer is given, it fixes the seed.
    Defaults to the global numpy random number generator.

`**kwds` : optional keyword parameters
    Any further parameters are passed directly to the distance function.
    If using a scipy.spatial.distance metric, the parameters are still
    metric dependent. See the scipy docs for usage examples.

Returns
-------
silhouette : float
    Mean Silhouette Coefficient for all samples.

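A rough reading of what metrics.silhouette_score does and what its parameters mean:

 Computes the mean silhouette coefficient of all samples.

 The coefficient is computed with the formula given earlier, from each sample's mean intra-cluster distance (a) and mean nearest-cluster distance (b).

 Note also that the Silhouette Coefficient is only defined when the number of labels
 satisfies 2 <= n_labels <= n_samples - 1.

 The function returns the mean silhouette coefficient over all samples.
 To get the value for each sample, use :func:`silhouette_samples`.

 The best value is 1 and the worst value is -1. Values near 0 indicate that clusters may overlap; negative values generally indicate that a sample has been assigned to the wrong cluster, because a different cluster is more similar.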


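 Parameters
 -------
 X: a feature array, or an array of pairwise distances between samples

 labels: the predicted label of each sample

 metric: a string (or callable) naming the distance metric to use; if X is itself the distance array, use metric="precomputed".

 sample_size: int or None
 If an int, the size of the random subset of the data on which the silhouette coefficient is computed; if None, no sampling is used.

 random_state: integer or numpy.RandomState, optional; the generator used to select the random subset of samples. Defaults to the global numpy random number generator.


 Returns
 -------
 The mean silhouette coefficient over all samples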
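
As a small illustration of the parameters described above, the sketch below calls `silhouette_score` three ways: on the feature array, on a precomputed distance matrix, and on a random subset via `sample_size` and `random_state`. The synthetic data and the variable names (`X_demo`, `full`, `pre`, `sub_a`, `sub_b`) are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_score

# Synthetic data (invented for this sketch).
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

# 1) Feature array plus a named metric.
full = silhouette_score(X_demo, labels, metric='euclidean')

# 2) The same score from a precomputed pairwise distance matrix.
D = pairwise_distances(X_demo, metric='euclidean')
pre = silhouette_score(D, labels, metric='precomputed')

# 3) An estimate on a random subset of 100 samples; fixing random_state
#    makes the subset, and therefore the score, reproducible.
sub_a = silhouette_score(X_demo, labels, sample_size=100, random_state=42)
sub_b = silhouette_score(X_demo, labels, sample_size=100, random_state=42)

print(full, pre, sub_a, sub_b)
```

Sampling is mainly useful on large datasets, since the full computation needs all pairwise distances; the subsampled value is an estimate and will generally differ slightly from the full score.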

So, the documentation matters!!! The documentation matters!!! It really matters!!!