Clustering Performance Metrics in Python (sklearn)
I. sklearn clustering evaluation functions:
metrics.adjusted_mutual_info_score(…[, …])
metrics.adjusted_rand_score(labels_true, …)
metrics.calinski_harabasz_score(X, labels)
metrics.davies_bouldin_score(X, labels)
metrics.completeness_score(labels_true, …)
metrics.cluster.contingency_matrix(…[, …])
metrics.fowlkes_mallows_score(labels_true, …)
metrics.homogeneity_completeness_v_measure(…)
metrics.homogeneity_score(labels_true, …)
metrics.mutual_info_score(labels_true, …)
metrics.normalized_mutual_info_score(…[, …])
metrics.silhouette_score(X, labels[, …])
metrics.silhouette_samples(X, labels[, metric])
metrics.v_measure_score(labels_true, labels_pred)
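Most of the functions above (everything except silhouette, Calinski-Harabasz, and Davies-Bouldin) score a predicted clustering against ground-truth class labels. The complete example at the end of this article only demonstrates the three internal metrics, so here is a minimal sketch of the external ones, using toy labels invented purely for illustration:
from sklearn import metrics

# Toy labels for illustration only; these metrics are invariant
# to how the cluster ids are numbered.
labels_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
labels_pred = [1, 1, 1, 0, 0, 2, 2, 2, 2]
print(metrics.adjusted_rand_score(labels_true, labels_pred))            # ARI
print(metrics.adjusted_mutual_info_score(labels_true, labels_pred))     # AMI
print(metrics.normalized_mutual_info_score(labels_true, labels_pred))   # NMI
print(metrics.fowlkes_mallows_score(labels_true, labels_pred))          # FMI
print(metrics.homogeneity_completeness_v_measure(labels_true, labels_pred))
print(metrics.cluster.contingency_matrix(labels_true, labels_pred))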
II. Notes on the evaluation functions:
1. Silhouette Coefficient
Function:
def silhouette_score(X, labels, metric='euclidean', sample_size=None,
                     random_state=None, **kwds):
Interpretation:
The mean of the per-sample silhouette values s_i over all samples is called the silhouette coefficient of the clustering result, denoted S; it measures whether the clustering is reasonable and effective. The silhouette coefficient takes values in [-1, 1]: the larger the value, the closer together samples of the same cluster are and the farther apart samples of different clusters are, i.e., the better the clustering.
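The complete example below applies this metric to the Iris data; as a quick standalone illustration, here is a minimal sketch on synthetic make_blobs data (not part of the original example):
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)
# Overall silhouette coefficient S: the mean of the per-sample values s_i
print(metrics.silhouette_score(X, labels, metric='euclidean'))
# Per-sample values s_i; points with values near -1 are likely misassigned
print(metrics.silhouette_samples(X, labels)[:5])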
2. Calinski-Harabasz Score (CH)
Function:
def calinski_harabasz_score(X, labels):
Interpretation:
The smaller the covariance of the data within each cluster and the larger the covariance between clusters, the higher the Calinski-Harabasz score. Summed up in one sentence: the larger the CH index, the better.
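A minimal sketch (again on synthetic make_blobs data, assumed here for illustration) showing that the CH score peaks at the correct number of clusters:
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    print(k, metrics.calinski_harabasz_score(X, labels))
# With 3 well-separated blobs, k=3 should yield the highest CH score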
3. Davies-Bouldin Index (DBI): davies_bouldin_score
Function:
def davies_bouldin_score(X, labels):
Interpretation:
Note: the minimum DBI value is 0; the smaller the value, the better the clustering.
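A minimal sketch of the same comparison for DBI, using AgglomerativeClustering as in the complete example below (the make_blobs data is assumed for illustration):
from sklearn import metrics
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
for k in (2, 3, 4):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    print(k, metrics.davies_bouldin_score(X, labels))
# With 3 well-separated blobs, k=3 should give the smallest (best) DBI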
Complete example
#!/usr/bin/env python
# encoding: utf-8
'''
@Author : pentiumCM
@Email : 842679178@qq.com
@Software: PyCharm
@File : iris_hierarchical_cluster.py
@Time : 2020/4/15 23:55
@desc    : Hierarchical clustering of the Iris dataset
'''
from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
from sklearn import metrics
# Define constants
cluster_num = 3
# 1. Load the dataset
iris = datasets.load_iris()
iris_data = iris.data
# 2. Preprocessing: standardize the features
data = np.array(iris_data)
std_scaler = preprocessing.StandardScaler()
data_M = std_scaler.fit_transform(data)
# 3. Plot the dendrogram of the hierarchical merge tree
plt.figure()
Z = linkage(data_M, method='ward', metric='euclidean')
dendrogram(Z)
plt.show()
# 4. Fit the agglomerative clustering model and get each sample's cluster label
ac = AgglomerativeClustering(n_clusters=cluster_num, affinity='euclidean', linkage='ward')
label_list = ac.fit_predict(data_M)
# Print the cluster labels, starting a new row every 50 samples
for i in range(len(label_list)):
    if i % 50 == 0:
        print()
    else:
        print(label_list[i], end=" ")
print()
# Collect the members of each flat cluster
reslist = [[] for _ in range(cluster_num)]
# Group the samples by their cluster label
for i in range(len(label_list)):
    label = label_list[i]
    reslist[label].append(data_M[i, :])
# Visualize the clustering result: project the data onto 3 principal components
pca = PCA(n_components=3)
pca_data = pca.fit_transform(data_M)
# Set up a 3D coordinate system
fig = plt.figure()
ax1 = plt.axes(projection='3d')
# Coordinates for the scatter plot: the three principal components
zd = pca_data[:, 0]
xd = pca_data[:, 1]
yd = pca_data[:, 2]
# Map each cluster label to a color
colors = []
for label in label_list:
    if label == 0:
        colors.append('r')
    elif label == 1:
        colors.append('y')
    elif label == 2:
        colors.append('g')
    elif label == 3:
        colors.append('violet')
ax1.scatter3D(xd, yd, zd, c=colors)
plt.show()
# Evaluate the clustering performance
# Silhouette coefficient: metrics.silhouette_score(X, labels[, …])
cluster_score_si = metrics.silhouette_score(data_M, label_list)
print("cluster_score_si", cluster_score_si)
cluster_score_ch = metrics.calinski_harabasz_score(data_M, label_list)
print("cluster_score_ch:", cluster_score_ch)
# The minimum DBI value is 0; the smaller, the better the clustering
cluster_score_DBI = metrics.davies_bouldin_score(data_M, label_list)
print("cluster_score_DBI :", cluster_score_DBI)
Output:
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
0 0 2 0 2 0 2 0 2 2 0 2 0 2 0 2 2 2 2 0 0 0 0 0 0 0 0 0 2 2 2 2 0 2 0 0 2 2 2 2 0 2 2 2 2 2 0 2 2
0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cluster_score_si 0.4466890410285909
cluster_score_ch: 222.71916382215363
cluster_score_DBI : 0.8034665302876753
References
https://blog.youkuaiyun.com/qq_27825451/article/details/94436488