size-constrained-clustering: clustering with cluster-size constraints


Preface

This post walks through installing size-constrained-clustering 0.1.2, a package developed by Jing Wang of MIT. The package implements clustering algorithms with constraints on cluster sizes, most notably: strictly equal-size K-Means (solved either via minimum cost flow or heuristically), K-Means with lower and upper bounds on cluster sizes (solved via minimum cost flow), fixed cluster-size proportions via Deterministic Annealing, and Shrinkage Clustering.
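For orientation, the algorithms above map onto the package's modules as follows. This is just a constructor sketch; the full working demos appear later in this post:

from size_constrained_clustering import equal, minmax, da, shrinkage

# strictly equal-size K-Means: minimum-cost-flow or heuristic solver
model = equal.SameSizeKMeansMinCostFlow(4)
model = equal.SameSizeKMeansHeuristics(4)
# K-Means with a lower and an upper bound on every cluster's size (min cost flow)
model = minmax.MinMaxKMeansMinCostFlow(4, size_min=400, size_max=1500)
# fixed cluster-size proportions via Deterministic Annealing
model = da.DeterministicAnnealing(4, distribution=[0.25, 0.25, 0.25, 0.25])
# Shrinkage Clustering with a minimum cluster size
model = shrinkage.Shrinkage(4, size_min=100)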


I'll save the underlying ideas of these algorithms for a future post; consider this a placeholder for now.

1. What is size-constrained-clustering 0.1.2?

This is a package developed by Wang. It has an official page, and its GitHub repository holds the latest code (do not download it from the official page).

Do not download it there; go to GitHub!!!

2. Installation steps

Method 1

The official instruction, which in practice usually fails; if it does for you, move on to Method 2:

pip install size-constrained-clustering

Method 2

Download the zip from GitHub (size_constrained_clustering-master.zip), extract it to a folder, and change into the size_constrained_clustering-master directory. From there you need to set up an environment, compile the .pyx files into .pyd files, and run the setup.

Create a new Python 3.7 environment (it must be 3.7):

conda create -n XXX python=3.7
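Then activate it (XXX is whatever name you chose above):

conda activate XXX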

Change into the project root and install the dependencies:

pip install -r requirements.txt

Compile the .pyx files

The source mixes C and Python, and some functions live in .pyx files; without compiling them, you will keep getting errors about missing functions (I realized this far too late; see the referenced blog post). To compile, create a new setup1.py with the following code:

from setuptools import setup
from Cython.Build import cythonize
from Cython.Distutils import build_ext
import numpy as np

setup(
    # use Cython's build_ext command so the .pyx source gets compiled
    cmdclass={'build_ext': build_ext},
    # the .pyx file to compile (Windows-style path; adjust for each file)
    ext_modules=cythonize("size_constrained_clustering\\k_means_constrained\\mincostflow_vectorized_.pyx"),
    # NumPy headers are needed to build the generated C code
    include_dirs=[np.get_include()],
)

Then, with the new environment XXX activated and from inside the project root, run:

python setup1.py build_ext --inplace

Running this yields the .pyd compiled from the .pyx (my output path seemed slightly off; you can move the generated files into the matching folders afterwards). There are three further .pyx files under sklearn_import to compile the same way:

size_constrained_clustering\sklearn_import\cluster\_k_means.pyx
size_constrained_clustering\sklearn_import\metrics\pairwise_fast.pyx
size_constrained_clustering\sklearn_import\utils\sparsefuncs_fast.pyx

Once these are generated, compilation is complete.
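Rather than editing setup1.py once per file, note that cythonize also accepts glob patterns, so a single script can sweep up every .pyx in the package in one pass. A minimal sketch (setup_all.py is a hypothetical name, and the pattern assumes the repository layout shown above):

from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    # '**' recurses into subpackages and matches all of the .pyx files
    ext_modules=cythonize("size_constrained_clustering/**/*.pyx"),
    include_dirs=[np.get_include()],
)

Run it the same way (python setup_all.py build_ext --inplace); --inplace places each compiled module next to its source file, which also sidesteps the output-path issue above. A quick check such as python -c "from size_constrained_clustering import equal" confirms the compiled modules import cleanly.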

Main code

The main code is explained in detail (with result plots) in the blog post referenced above, so I will not repeat that here.

# setup
from size_constrained_clustering import fcm, equal, minmax, shrinkage, da
# by default it is euclidean distance, but can select others
from sklearn.metrics.pairwise import haversine_distances
import numpy as np
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt


# Normal FCM (fuzzy c-means), no size constraints
n_samples = 2000
n_clusters = 4
centers = [(-5, -5), (0, 0), (5, 5), (7, 10)]
X, _ = make_blobs(n_samples=n_samples, n_features=2, cluster_std=1.0,
                    centers=centers, shuffle=False, random_state=42)
model = fcm.FCM(n_clusters)
# use other distance function: e.g. haversine distance
# model = fcm.FCM(n_clusters, distance_func=haversine_distances)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_
# show FCM results
plt.figure(figsize=(10,10))
colors=['red','green','blue','yellow']
 
for i,color in enumerate(colors):
    color_tmp=np.where(labels==i)[0]
    plt.scatter(X[color_tmp,0],X[color_tmp,1],c=color,label=i) 
plt.legend()
plt.scatter(centers[:,0],centers[:,1],s=1000,c='black')
plt.show()

# Equal-size constrained K-Means
n_samples = 2000
n_clusters = 3
X = np.random.rand(n_samples, 2)
# use minimum cost flow framework to solve
# model = equal.SameSizeKMeansMinCostFlow(n_clusters)
# use heuristics method to solve
model = equal.SameSizeKMeansHeuristics(n_clusters)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_
for i,color in enumerate(colors):
    color_tmp=np.where(labels==i)[0]
    plt.scatter(X[color_tmp,0],X[color_tmp,1],c=color,label=i) 
# plt.legend()
plt.scatter(centers[:,0],centers[:,1],s=1000,c='black')
plt.show()

# Minimum and Maximum Size Constraint
n_samples = 2000
n_clusters = 2
X = np.random.rand(n_samples, 2)
model = minmax.MinMaxKMeansMinCostFlow(n_clusters, size_min=400, size_max=1500)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_
for i,color in enumerate(colors):
    color_tmp=np.where(labels==i)[0]
    plt.scatter(X[color_tmp,0],X[color_tmp,1],c=color,label=i) 
# plt.legend()
plt.scatter(centers[:,0],centers[:,1],s=1000,c='black')
plt.show()

# Deterministic Annealing
n_samples = 2000
n_clusters = 3
X = np.random.rand(n_samples, 2)
# distribution gives each cluster's target share of the points (here they sum to 1)
model = da.DeterministicAnnealing(n_clusters, distribution=[0.1, 0.3, 0.6])
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_
for i,color in enumerate(colors):
    color_tmp=np.where(labels==i)[0]
    plt.scatter(X[color_tmp,0],X[color_tmp,1],c=color,label=i) 
# plt.legend()
plt.scatter(centers[:,0],centers[:,1],s=1000,c='black')
plt.show()

# Shrinkage Clustering
n_samples = 1500
n_clusters = 3
# centers = [(-5, -5), (0, 0), (5, 5), (7, 10)]
centers = [(-5, -5), (0, 0), (5, 5)]
X, _ = make_blobs(n_samples=n_samples, n_features=2, cluster_std=1.0, centers=centers, shuffle=False, random_state=42)
model = shrinkage.Shrinkage(n_clusters, size_min=100)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_
print(labels)
for i,color in enumerate(colors):
    color_tmp=np.where(labels==i)[0]
    plt.scatter(X[color_tmp,0],X[color_tmp,1],c=color,label=i) 
# plt.legend()
plt.scatter(centers[:,0],centers[:,1],s=1000,c='black')
plt.show()

Summary

You may hit an error along the way that I still have not resolved; it does not affect running the code, but I genuinely spent a long time on it, and the blog posts I consulted were of no help at all. The main sticking point was compiling the .pyx files. Problems only yield once you find the right angle. Next time, the theory.
[Screenshot of the error]
