size-constrained-clustering: Clustering with Constraints on Cluster Sizes
Preface
This post walks through installing size-constrained-clustering 0.1.2, a package developed by Jing Wang (MIT). It implements clustering algorithms with constraints on cluster sizes, chiefly: strictly balanced K-Means (solvable either as a minimum-cost-flow problem or heuristically), K-Means with lower and upper bounds on cluster sizes (minimum-cost-flow), prescribed cluster-size proportions via Deterministic Annealing, and Shrinkage Clustering.
I'll cover the underlying ideas another day; for now, consider this a placeholder, hehe.
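To make "strictly balanced" concrete before the install walkthrough: with n samples and k clusters, each cluster must receive either ⌊n/k⌋ or ⌈n/k⌉ points. A minimal sketch of the target sizes (my own illustration, not code from the package):

```python
def balanced_sizes(n_samples, n_clusters):
    """Target sizes for strictly balanced clustering: n_samples % n_clusters
    clusters get one extra point so the sizes sum to n_samples."""
    base, extra = divmod(n_samples, n_clusters)
    return [base + 1] * extra + [base] * (n_clusters - extra)

print(balanced_sizes(2000, 3))  # -> [667, 667, 666]
```

The constrained solvers below decide *which* points fill these quotas; the quotas themselves are fixed by n and k.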
1. What is size-constrained-clustering 0.1.2?
This is a package developed by Wang. It has an official page, but its GitHub repository holds the latest code (do not download directly from the official page).
2. Installation steps
Method 1
The official install command, which in practice usually fails; if it does, switch to Method 2:
pip install size-constrained-clustering
Method 2
Download the zip file from GitHub (size_constrained_clustering-master.zip), unzip it into a folder of your choice, and change into the directory containing size_constrained_clustering-master. You then need to set up the environment, compile the .pyx files into .pyd extensions, and run setup.
Create a Python 3.7 environment (it must be 3.7):
conda create -n XXX python=3.7
Enter the project root and install the dependencies:
pip install -r requirements.txt
Compile the .pyx files
Because the source mixes C and Python, some functions live in .pyx files; without compiling them you will keep getting errors about missing functions (I realized this far too late; see the referenced post). To compile, create a setup1.py with the following code:
from setuptools import setup
from Cython.Build import cythonize
from Cython.Distutils import build_ext
import numpy as np
setup(
    cmdclass={'build_ext': build_ext},
    ext_modules=cythonize("size_constrained_clustering\\k_means_constrained\\mincostflow_vectorized_.pyx"),
    include_dirs=[np.get_include()],
)
Then, with the new XXX environment activated and from the project root, run
python setup1.py build_ext --inplace
This produces the .pyd files compiled from the .pyx sources (the output path seemed slightly off for me; you can move the generated files into the matching folders afterwards). There are three .pyx files in total: size_constrained_clustering\sklearn_import\cluster\_k_means.pyx, size_constrained_clustering\sklearn_import\metrics\pairwise_fast.pyx, and size_constrained_clustering\sklearn_import\utils\sparsefuncs_fast.pyx. Once they are generated, compilation is complete.
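Since my build dropped the compiled extensions in the wrong place, I moved them by hand. A small sketch that automates the move is below; the directory layout and file naming are assumptions based on my checkout, so adjust the paths to yours:

```python
import shutil
from pathlib import Path

def move_compiled_exts(build_root, package_root, suffix=".pyd"):
    """Move compiled extension files (.pyd on Windows, .so elsewhere)
    next to the .pyx source with the matching module name."""
    # map module name -> directory of its .pyx source
    sources = {p.stem: p.parent for p in Path(package_root).rglob("*.pyx")}
    for ext in Path(build_root).rglob(f"*{suffix}"):
        # compiled names look like 'pairwise_fast.cp37-win_amd64.pyd'
        module = ext.name.split(".")[0]
        if module in sources and ext.parent != sources[module]:
            shutil.move(str(ext), str(sources[module] / ext.name))
```

For example, `move_compiled_exts("build", "size_constrained_clustering")` would relocate anything `build_ext` left under build/ next to its .pyx source.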
Main script
The API is already introduced in detail, with result plots, in the post linked above, so I won't repeat that here.
# setup
from size_constrained_clustering import fcm, equal, minmax, shrinkage, da
# by default it is euclidean distance, but can select others
from sklearn.metrics.pairwise import haversine_distances
import numpy as np
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# Plain FCM (fuzzy C-means)
n_samples = 2000
n_clusters = 4
centers = [(-5, -5), (0, 0), (5, 5), (7, 10)]
X, _ = make_blobs(n_samples=n_samples, n_features=2, cluster_std=1.0,
                  centers=centers, shuffle=False, random_state=42)
model = fcm.FCM(n_clusters)
# use other distance function: e.g. haversine distance
# model = fcm.FCM(n_clusters, distance_func=haversine_distances)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_
# show FCM results
plt.figure(figsize=(10,10))
colors=['red','green','blue','yellow']
for i, color in enumerate(colors):
    color_tmp = np.where(labels == i)[0]
    plt.scatter(X[color_tmp, 0], X[color_tmp, 1], c=color, label=i)
plt.legend()
plt.scatter(centers[:,0],centers[:,1],s=1000,c='black')
plt.show()
# Equal-size constrained K-Means
n_samples = 2000
n_clusters = 3
X = np.random.rand(n_samples, 2)
# use minimum cost flow framework to solve
# model = equal.SameSizeKMeansMinCostFlow(n_clusters)
# use heuristics method to solve
model = equal.SameSizeKMeansHeuristics(n_clusters)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_
for i, color in enumerate(colors):
    color_tmp = np.where(labels == i)[0]
    plt.scatter(X[color_tmp, 0], X[color_tmp, 1], c=color, label=i)
# plt.legend()
plt.scatter(centers[:,0],centers[:,1],s=1000,c='black')
plt.show()
# Minimum and Maximum Size Constraint
n_samples = 2000
n_clusters = 2
X = np.random.rand(n_samples, 2)
model = minmax.MinMaxKMeansMinCostFlow(n_clusters, size_min=400, size_max=1500)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_
for i, color in enumerate(colors):
    color_tmp = np.where(labels == i)[0]
    plt.scatter(X[color_tmp, 0], X[color_tmp, 1], c=color, label=i)
# plt.legend()
plt.scatter(centers[:,0],centers[:,1],s=1000,c='black')
plt.show()
# Deterministic Annealing
n_samples = 2000
n_clusters = 3
X = np.random.rand(n_samples, 2)
# distribution is the distribution of cluster sizes
model = da.DeterministicAnnealing(n_clusters, distribution=[0.1, 0.3, 0.6])
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_
for i, color in enumerate(colors):
    color_tmp = np.where(labels == i)[0]
    plt.scatter(X[color_tmp, 0], X[color_tmp, 1], c=color, label=i)
# plt.legend()
plt.scatter(centers[:,0],centers[:,1],s=1000,c='black')
plt.show()
# Shrinkage Clustering
n_samples = 1500
n_clusters = 3
# centers = [(-5, -5), (0, 0), (5, 5), (7, 10)]
centers = [(-5, -5), (0, 0), (5, 5)]
X, _ = make_blobs(n_samples=n_samples, n_features=2, cluster_std=1.0, centers=centers, shuffle=False, random_state=42)
model = shrinkage.Shrinkage(n_clusters, size_min=100)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_
print(labels)
for i, color in enumerate(colors):
    color_tmp = np.where(labels == i)[0]
    plt.scatter(X[color_tmp, 0], X[color_tmp, 1], c=color, label=i)
# plt.legend()
plt.scatter(centers[:,0],centers[:,1],s=1000,c='black')
plt.show()
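After fitting any of the constrained models above, it is worth checking that the size constraints actually hold. A sketch using only the labels array (shown on synthetic labels so it runs standalone; with the package you would pass model.labels_ instead):

```python
import numpy as np

def cluster_sizes(labels, n_clusters):
    """Number of points assigned to each cluster."""
    return np.bincount(labels, minlength=n_clusters)

# synthetic labels standing in for model.labels_ from an equal-size fit
labels = np.array([0] * 667 + [1] * 667 + [2] * 666)
sizes = cluster_sizes(labels, 3)
print(sizes)                           # [667 667 666]
assert sizes.max() - sizes.min() <= 1  # equal-size constraint holds
assert sizes.sum() == len(labels)
# for minmax.MinMaxKMeansMinCostFlow you would instead check
# (sizes >= size_min).all() and (sizes <= size_max).all()
```

For the Deterministic Annealing model, comparing `sizes / sizes.sum()` against the `distribution` argument gives the same sanity check in proportion form.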
Summary
Along the way you may hit an error that I still haven't resolved; it doesn't stop the code from running, but it did cost me quite a bit of time, and the blog posts I consulted were frankly useless. The real sticking point is compiling the .pyx files. Only the right line of attack solves the problem. See you next time for the underlying principles.