size-constrained-clustering: clustering with cluster-size constraints


Preface

This post walks through installing size-constrained-clustering 0.1.2, a package developed by Jing Wang of MIT. The package implements clustering algorithms with constraints on cluster sizes, most notably: strictly equal-size K-Means (solved either via minimum cost flow or heuristically), K-Means with lower and upper bounds on cluster sizes (solved via minimum cost flow), fixed cluster-size proportions via Deterministic Annealing, and Shrinkage Clustering.
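For orientation, the algorithms above map onto the package's modules as follows. This is just a constructor sketch; the full working demos appear later in this post:

from size_constrained_clustering import equal, minmax, da, shrinkage

# strictly equal-size K-Means: minimum-cost-flow or heuristic solver
model = equal.SameSizeKMeansMinCostFlow(4)
model = equal.SameSizeKMeansHeuristics(4)
# K-Means with a lower and an upper bound on every cluster's size (min cost flow)
model = minmax.MinMaxKMeansMinCostFlow(4, size_min=400, size_max=1500)
# fixed cluster-size proportions via Deterministic Annealing
model = da.DeterministicAnnealing(4, distribution=[0.25, 0.25, 0.25, 0.25])
# Shrinkage Clustering with a minimum cluster size
model = shrinkage.Shrinkage(4, size_min=100)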


I'll save the underlying ideas of these algorithms for a future post; consider this a placeholder for now.

1. What is size-constrained-clustering 0.1.2?

This is a package developed by Wang. It has an official page, and its GitHub repository holds the latest code (do not download it from the official page).

Do not download it there; go to GitHub!!!

2. Installation steps

Method 1

The official instruction, which in practice usually fails; if it does for you, move on to Method 2:

pip install size-constrained-clustering

Method 2

Download the zip from GitHub (size_constrained_clustering-master.zip), extract it to a folder, and change into the size_constrained_clustering-master directory. From there you need to set up an environment, compile the .pyx files into .pyd files, and run the setup.

Create a new Python 3.7 environment (it must be 3.7):

conda create -n XXX python=3.7
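Then activate it (XXX is whatever name you chose above):

conda activate XXX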

Change into the project root and install the dependencies:

pip install -r requirements.txt

Compile the .pyx files

The source mixes C and Python, and some functions live in .pyx files; without compiling them, you will keep getting errors about missing functions (I realized this far too late; see the referenced blog post). To compile, create a new setup1.py with the following code:

from setuptools import setup
from Cython.Build import cythonize
from Cython.Distutils import build_ext
import numpy as np

setup(
    # use Cython's build_ext command so the .pyx source gets compiled
    cmdclass={'build_ext': build_ext},
    # the .pyx file to compile (Windows-style path; adjust for each file)
    ext_modules=cythonize("size_constrained_clustering\\k_means_constrained\\mincostflow_vectorized_.pyx"),
    # NumPy headers are needed to build the generated C code
    include_dirs=[np.get_include()],
)

Then, with the new environment XXX activated and from inside the project root, run:

python setup1.py build_ext --inplace

Running this yields the .pyd compiled from the .pyx (my output path seemed slightly off; you can move the generated files into the matching folders afterwards). There are three further .pyx files under sklearn_import to compile the same way:

size_constrained_clustering\sklearn_import\cluster\_k_means.pyx
size_constrained_clustering\sklearn_import\metrics\pairwise_fast.pyx
size_constrained_clustering\sklearn_import\utils\sparsefuncs_fast.pyx

Once these are generated, compilation is complete.
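Rather than editing setup1.py once per file, note that cythonize also accepts glob patterns, so a single script can sweep up every .pyx in the package in one pass. A minimal sketch (setup_all.py is a hypothetical name, and the pattern assumes the repository layout shown above):

from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    # '**' recurses into subpackages and matches all of the .pyx files
    ext_modules=cythonize("size_constrained_clustering/**/*.pyx"),
    include_dirs=[np.get_include()],
)

Run it the same way (python setup_all.py build_ext --inplace); --inplace places each compiled module next to its source file, which also sidesteps the output-path issue above. A quick check such as python -c "from size_constrained_clustering import equal" confirms the compiled modules import cleanly.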

Main code

The main code is explained in detail (with result plots) in the blog post referenced above, so I will not repeat that here.

# setup
from size_constrained_clustering import fcm, equal, minmax, shrinkage, da
# by default it is euclidean distance, but can select others
from sklearn.metrics.pairwise import haversine_distances
import numpy as np
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt


# Normal FCM (fuzzy c-means), no size constraints
n_samples = 2000
n_clusters = 4
centers = [(-5, -5), (0, 0), (5, 5), (7, 10)]
X, _ = make_blobs(n_samples=n_samples, n_features=2, cluster_std=1.0,
                    centers=centers, shuffle=False, random_state=42)
model = fcm.FCM(n_clusters)
# use other distance function: e.g. haversine distance
# model = fcm.FCM(n_clusters, distance_func=haversine_distances)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_
# show FCM results
plt.figure(figsize=(10,10))
colors=['red','green','blue','yellow']
 
for i,color in enumerate(colors):
    color_tmp=np.where(labels==i)[0]
    plt.scatter(X[color_tmp,0],X[color_tmp,1],c=color,label=i) 
plt.legend()
plt.scatter(centers[:,0],centers[:,1],s=1000,c='black')
plt.show()

# Equal-size constrained K-Means
n_samples = 2000
n_clusters = 3
X = np.random.rand(n_samples, 2)
# use minimum cost flow framework to solve
# model = equal.SameSizeKMeansMinCostFlow(n_clusters)
# use heuristics method to solve
model = equal.SameSizeKMeansHeuristics(n_clusters)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_
for i,color in enumerate(colors):
    color_tmp=np.where(labels==i)[0]
    plt.scatter(X[color_tmp,0],X[color_tmp,1],c=color,label=i) 
# plt.legend()
plt.scatter(centers[:,0],centers[:,1],s=1000,c='black')
plt.show()

# Minimum and Maximum Size Constraint
n_samples = 2000
n_clusters = 2
X = np.random.rand(n_samples, 2)
model = minmax.MinMaxKMeansMinCostFlow(n_clusters, size_min=400, size_max=1500)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_
for i,color in enumerate(colors):
    color_tmp=np.where(labels==i)[0]
    plt.scatter(X[color_tmp,0],X[color_tmp,1],c=color,label=i) 
# plt.legend()
plt.scatter(centers[:,0],centers[:,1],s=1000,c='black')
plt.show()

# Deterministic Annealing
n_samples = 2000
n_clusters = 3
X = np.random.rand(n_samples, 2)
# distribution gives each cluster's target share of the points (here they sum to 1)
model = da.DeterministicAnnealing(n_clusters, distribution=[0.1, 0.3, 0.6])
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_
for i,color in enumerate(colors):
    color_tmp=np.where(labels==i)[0]
    plt.scatter(X[color_tmp,0],X[color_tmp,1],c=color,label=i) 
# plt.legend()
plt.scatter(centers[:,0],centers[:,1],s=1000,c='black')
plt.show()

# Shrinkage Clustering
n_samples = 1500
n_clusters = 3
# centers = [(-5, -5), (0, 0), (5, 5), (7, 10)]
centers = [(-5, -5), (0, 0), (5, 5)]
X, _ = make_blobs(n_samples=n_samples, n_features=2, cluster_std=1.0, centers=centers, shuffle=False, random_state=42)
model = shrinkage.Shrinkage(n_clusters, size_min=100)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_
print(labels)
for i,color in enumerate(colors):
    color_tmp=np.where(labels==i)[0]
    plt.scatter(X[color_tmp,0],X[color_tmp,1],c=color,label=i) 
# plt.legend()
plt.scatter(centers[:,0],centers[:,1],s=1000,c='black')
plt.show()

Summary

You may hit an error along the way that I still have not resolved; it does not affect running the code, but I genuinely spent a long time on it, and the blog posts I consulted were of no help at all. The main sticking point was compiling the .pyx files. Problems only yield once you find the right angle. Next time, the theory.
[Screenshot of the error]
