Reference for this article: *Python Data Science Handbook*.
The source code for this article has been uploaded to Gitee.
Packages used in this article:
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
from sklearn.datasets import make_blobs, make_moons, load_sample_image
from sklearn.cluster import KMeans, SpectralClustering, MiniBatchKMeans
sns.set()
plt.rc('font', family='SimHei')
plt.rc('axes', unicode_minus=False)
The K-means Algorithm
The K-means algorithm looks for a preset number of clusters in unlabeled data. The center of each cluster is the arithmetic mean of all points belonging to it, and every point is closer to its own cluster center than to any other cluster center.
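Stated more formally (a standard formulation added here for clarity, not taken from the original text), K-means with $k$ clusters seeks centers $\mu_1, \dots, \mu_k$ that minimize the within-cluster sum of squared distances:
$$\min_{\mu_1, \dots, \mu_k} \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - \mu_j \rVert^2$$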
A Simple Example
In sklearn, the K-means algorithm is implemented by the KMeans class:
x, y_true = make_blobs(
n_samples=150,
n_features=2,
centers=4,
random_state=233,
cluster_std=1.5,
)
fig, axs = plt.subplots(1, 2, figsize=(16, 8)) # type: plt.Figure, list
ax_raw = axs[0] # type: plt.Axes
ax_k_means = axs[1] # type: plt.Axes
ax_raw.scatter(x=x[:, 0], y=x[:, 1], c=y_true, cmap=plt.cm.get_cmap('rainbow', lut=4), alpha=0.6, edgecolors='k')
ax_raw.set_title('Test data for clustering')
model = KMeans(n_clusters=4)
model.fit(x)
y_predict = model.predict(x)
ax_k_means.scatter(x=x[:, 0], y=x[:, 1], c=y_predict, cmap=plt.cm.get_cmap('rainbow', lut=4), alpha=0.6, edgecolors='k')
ax_k_means.scatter(
x=model.cluster_centers_[:, 0],
y=model.cluster_centers_[:, 1],
c='blue',
s=600,
alpha=0.3,
edgecolors='k',
label='cluster centers',
)
ax_k_means.legend(loc='lower right')
ax_k_means.set_title('Clustering result')
fig.suptitle('A simple K-means example')
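Incidentally, fit followed by predict on the same data can be collapsed into one call; fit_predict is an existing KMeans method (it is also used later in this article), so this one-liner is equivalent to the fit/predict pair above:
y_predict = KMeans(n_clusters=4).fit_predict(x)  # fit the model and return the training labels in one call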
Expectation-Maximization
K-means uses the expectation-maximization (E-M) algorithm to find the best-fitting cluster centers. Roughly:
- Randomly choose some initial cluster centers
- Repeat the following two steps until the result converges:
  - E-step (expectation): assign each point to its nearest cluster center, completing one round of assignment
  - M-step (maximization): set each cluster center to the mean of the coordinates of all points assigned to that cluster
Of course, the expectation-maximization algorithm has its own inherent drawbacks:
- The final result depends on the randomly chosen initial centers; with an unlucky initialization, E-M cannot reach the best result (which is why K-means is usually run several times and the best run kept; see the sketch after this list)
- The number of clusters must be fixed in advance (the elbow sweep in the sketch below is one common heuristic for choosing it)
- Its complexity is relatively high
- It cannot handle clusters with non-linear boundaries
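The first two drawbacks can be softened in practice. The sketch below is my addition (reusing x from the simple example above): n_init is a real KMeans parameter that runs the algorithm from several random initializations and keeps the run with the lowest inertia_, and sweeping n_clusters while recording inertia_ gives the classic "elbow" plot for choosing the cluster count; the candidate range is an illustrative assumption.
inertias = []
ks = list(range(1, 9))  # candidate cluster counts -- an illustrative choice
for k in ks:
    km = KMeans(n_clusters=k, n_init=10)  # keep the best of 10 random initializations
    km.fit(x)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances
plt.figure(figsize=(6, 4))
plt.plot(ks, inertias, marker='o')
plt.xlabel('n_clusters')
plt.ylabel('inertia')
plt.title('Elbow heuristic: look for the bend in the curve')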
A manual implementation of the expectation-maximization loop:
def my_k_means_using_em(x: np.ndarray, n_cluster: int, seed: int) -> tuple:
    # bounding box of the data; initial centers are drawn uniformly inside it
    x_min, x_max = x[:, 0].min(), x[:, 0].max()
    y_min, y_max = x[:, 1].min(), x[:, 1].max()
    rng = np.random.RandomState(seed=seed)
    centers = np.hstack((
        (rng.rand(n_cluster) * (x_max - x_min) + x_min)[:, np.newaxis],
        (rng.rand(n_cluster) * (y_max - y_min) + y_min)[:, np.newaxis],
    ))
    centers_new = np.zeros(centers.shape)
    dis = np.zeros(shape=(n_cluster, x.shape[0]))  # dis[j, i]: distance from point i to center j
    labels = np.zeros(x.shape[0], dtype=int)
    while True:
        # E-step, part 1: distance from every point to every current center
        for i in range(x.shape[0]):
            for j in range(n_cluster):
                dx = x[i, 0] - centers[j, 0]
                dy = x[i, 1] - centers[j, 1]
                dis[j, i] = np.sqrt(dx ** 2 + dy ** 2)
        # E-step, part 2: assign each point to its nearest center
        for i in range(x.shape[0]):
            idx = 0
            for j in range(n_cluster):
                if dis[j, i] < dis[idx, i]:
                    idx = j
            labels[i] = idx
        # M-step: move each center to the mean of its assigned points;
        # an empty cluster keeps its previous center
        for i in range(n_cluster):
            tmp = x[labels == i]  # type: np.ndarray
            if tmp.size > 0:
                centers_new[i] = tmp.mean(axis=0)
            else:
                centers_new[i] = centers[i]
        if np.all(centers == centers_new):
            break
        centers = centers_new.copy()  # copy() matters: plain assignment would alias the two arrays
    return centers, labels  # labels stays 1-D so it can be passed directly as scatter's c=
x_train, y_true = make_blobs(
n_samples=200,
centers=4,
n_features=2,
random_state=233,
cluster_std=1.2,
)
cluster_center, y_predict = my_k_means_using_em(x=x_train, n_cluster=4, seed=int(datetime.now().timestamp()))
fig, axs = plt.subplots(1, 2, figsize=(20, 10)) # type: plt.Figure, list
ax_raw, ax_model = axs[0], axs[1] # type: plt.Axes, plt.Axes
cm = plt.cm.get_cmap('rainbow', lut=4)
ax_raw.scatter(x=x_train[:, 0], y=x_train[:, 1], c=y_true, edgecolors='k', cmap=cm, alpha=0.6)
ax_raw.set_title('Test data')
ax_model.scatter(x=x_train[:, 0], y=x_train[:, 1], c=y_predict, edgecolors='k', cmap=cm, alpha=0.6)
ax_model.scatter(x=cluster_center[:, 0], y=cluster_center[:, 1], c='blue', edgecolors='k', s=600, alpha=0.5, label='cluster centers')
ax_model.legend(loc='lower right')
ax_model.set_title('Cluster centers found with E-M')
fig.suptitle('Clustering with expectation-maximization')
A fairly ideal run looks like this:
A much less reasonable run (caused by unlucky initial centers) looks like this:
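The double Python loops above make the E- and M-steps explicit but are slow for larger data. Below is a minimal vectorized sketch of the same loop (my rewrite; the helper name my_k_means_vectorized is made up, and initializing centers by sampling data points is a deliberate deviation from the uniform-box initialization used above):
def my_k_means_vectorized(x: np.ndarray, n_cluster: int, seed: int) -> tuple:
    rng = np.random.RandomState(seed=seed)
    # initialize centers by sampling actual data points without replacement
    centers = x[rng.choice(x.shape[0], size=n_cluster, replace=False)]
    while True:
        # E-step: pairwise point-to-center distances via broadcasting, shape (n_points, n_cluster)
        dis = np.linalg.norm(x[:, np.newaxis, :] - centers[np.newaxis, :, :], axis=2)
        labels = dis.argmin(axis=1)  # nearest center per point
        # M-step: move each center to the mean of its assigned points (empty clusters stay put)
        centers_new = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(n_cluster)
        ])
        if np.allclose(centers, centers_new):
            return centers_new, labels
        centers = centers_new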
Using K-means on Non-linear Data
Non-linear data can be projected into a higher-dimensional space through a kernel transform and then clustered with K-means; sklearn's SpectralClustering implements this kind of algorithm:
x_train, y_true = make_moons(n_samples=200, noise=0.08, random_state=233)
model1 = KMeans(n_clusters=2)
model2 = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', assign_labels='kmeans')
y_p1 = model1.fit_predict(x_train)
y_p2 = model2.fit_predict(x_train)
fig, axs = plt.subplots(1, 3, figsize=(15, 5)) # type: plt.Figure, list
ax_raw = axs[0] # type: plt.Axes
ax_k_means = axs[1] # type: plt.Axes
ax_sp_cluster = axs[2] # type: plt.Axes
cm = plt.cm.get_cmap('rainbow', lut=2)
ax_raw.scatter(x=x_train[:, 0], y=x_train[:, 1], c=y_true, cmap=cm, edgecolors='k', alpha=0.6)
ax_raw.set_title('Raw data (non-linear boundary)')
ax_k_means.scatter(x=x_train[:, 0], y=x_train[:, 1], c=y_p1, cmap=cm, edgecolors='k', alpha=0.6)
ax_k_means.scatter(
x=model1.cluster_centers_[:, 0], y=model1.cluster_centers_[:, 1], c='blue',
s=300, alpha=0.8, edgecolors='k', label='cluster centers',
)
ax_k_means.set_title('Clustering directly with K-means')
ax_k_means.legend(loc='upper right')
ax_sp_cluster.scatter(x=x_train[:, 0], y=x_train[:, 1], c=y_p2, cmap=cm, edgecolors='k', alpha=0.6)
ax_sp_cluster.set_title('Clustering with SpectralClustering')
fig.suptitle('Handling non-linear data')
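For reference, 'nearest_neighbors' is only one choice of affinity; SpectralClustering also accepts an RBF kernel (affinity='rbf' is in fact sklearn's default). A hedged variant, where the gamma value is an untuned assumption chosen purely for illustration:
model3 = SpectralClustering(n_clusters=2, affinity='rbf', gamma=15.0, assign_labels='kmeans')
y_p3 = model3.fit_predict(x_train)  # labels from the RBF-affinity spectral embedding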
Case Study: Applying K-means to Color Compression
An image with 8 bits of color depth per channel can use up to $256 \times 256 \times 256 \approx 16.7$ million colors. Here we use clustering to compress the palette down to 16 colors and see what the image looks like afterwards.
We cluster with the mini-batch variant of K-means (the MiniBatchKMeans class), which is considerably faster than ordinary K-means.
As the result shows, even though the number of available colors shrinks to roughly one millionth of the original, the overall features of the image are preserved.
img = load_sample_image('china.jpg')
plt.figure(figsize=(12, 12))
plt.imshow(img)
plt.xticks([])
plt.yticks([])
plt.title('china.jpg from sklearn dataset')
img_data = (img / 255).reshape((img.shape[0] * img.shape[1], img.shape[2]))
model = MiniBatchKMeans(n_clusters=16)
model.fit(img_data)
img_new = model.cluster_centers_[model.predict(img_data)]  # build the new image: replace each pixel with its cluster center
img_new = img_new.reshape(img.shape)
plt.figure(figsize=(12, 12))
plt.imshow(img_new)
plt.xticks([])
plt.yticks([])
plt.title('Image reduced to 16 colors by K-means clustering')
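As a quick sanity check (my addition, not in the original), we can count the distinct colors before and after quantization by treating pixels as rows and deduplicating them with np.unique:
n_before = np.unique(img.reshape(-1, 3), axis=0).shape[0]     # distinct colors in the original image
n_after = np.unique(img_new.reshape(-1, 3), axis=0).shape[0]  # should be at most 16
print(f'distinct colors: {n_before} before, {n_after} after')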
Complete Code (Jupyter Notebook)
#%% md
# K-means
The K-means algorithm looks for a preset number of clusters in unlabeled data. The center of each cluster is the arithmetic mean of all points belonging to it, and every point is closer to its own cluster center than to any other cluster center.
## A Simple K-means Example
#%%
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
from sklearn.datasets import make_blobs, make_moons, load_sample_image
from sklearn.cluster import KMeans, SpectralClustering, MiniBatchKMeans
sns.set()
plt.rc('font', family='SimHei')
plt.rc('axes', unicode_minus=False)
#%%
x, y_true = make_blobs(
n_samples=150,
n_features=2,
centers=4,
random_state=233,
cluster_std=1.5,
)
fig, axs = plt.subplots(1, 2, figsize=(16, 8)) # type: plt.Figure, list
ax_raw = axs[0] # type: plt.Axes
ax_k_means = axs[1] # type: plt.Axes
ax_raw.scatter(x=x[:, 0], y=x[:, 1], c=y_true, cmap=plt.cm.get_cmap('rainbow', lut=4), alpha=0.6, edgecolors='k')
ax_raw.set_title('Test data for clustering')
model = KMeans(n_clusters=4)
model.fit(x)
y_predict = model.predict(x)
ax_k_means.scatter(x=x[:, 0], y=x[:, 1], c=y_predict, cmap=plt.cm.get_cmap('rainbow', lut=4), alpha=0.6, edgecolors='k')
ax_k_means.scatter(
x=model.cluster_centers_[:, 0],
y=model.cluster_centers_[:, 1],
c='blue',
s=600,
alpha=0.3,
edgecolors='k',
label='cluster centers',
)
ax_k_means.legend(loc='lower right')
ax_k_means.set_title('Clustering result')
fig.suptitle('A simple K-means example')
#%% md
## Expectation-Maximization
K-means uses the **expectation-maximization** (E-M) algorithm to find the best-fitting cluster centers. Roughly:
- Randomly choose some initial cluster centers
- Repeat the following two steps until the result converges:
  - E-step (expectation): assign each point to its nearest cluster center, completing one round of assignment
  - M-step (maximization): set each cluster center to the mean of the coordinates of all points assigned to that cluster
The expectation-maximization algorithm also has its own inherent drawbacks:
- The final result depends on the randomly chosen initial centers; with an unlucky initialization, E-M cannot reach the best result
- The number of clusters must be fixed in advance
- Its complexity is relatively high
- It cannot handle clusters with non-linear boundaries
#%%
def my_k_means_using_em(x: np.ndarray, n_cluster: int, seed: int) -> tuple:
    # bounding box of the data; initial centers are drawn uniformly inside it
    x_min, x_max = x[:, 0].min(), x[:, 0].max()
    y_min, y_max = x[:, 1].min(), x[:, 1].max()
    rng = np.random.RandomState(seed=seed)
    centers = np.hstack((
        (rng.rand(n_cluster) * (x_max - x_min) + x_min)[:, np.newaxis],
        (rng.rand(n_cluster) * (y_max - y_min) + y_min)[:, np.newaxis],
    ))
    centers_new = np.zeros(centers.shape)
    dis = np.zeros(shape=(n_cluster, x.shape[0]))  # dis[j, i]: distance from point i to center j
    labels = np.zeros(x.shape[0], dtype=int)
    while True:
        # E-step, part 1: distance from every point to every current center
        for i in range(x.shape[0]):
            for j in range(n_cluster):
                dx = x[i, 0] - centers[j, 0]
                dy = x[i, 1] - centers[j, 1]
                dis[j, i] = np.sqrt(dx ** 2 + dy ** 2)
        # E-step, part 2: assign each point to its nearest center
        for i in range(x.shape[0]):
            idx = 0
            for j in range(n_cluster):
                if dis[j, i] < dis[idx, i]:
                    idx = j
            labels[i] = idx
        # M-step: move each center to the mean of its assigned points;
        # an empty cluster keeps its previous center
        for i in range(n_cluster):
            tmp = x[labels == i]  # type: np.ndarray
            if tmp.size > 0:
                centers_new[i] = tmp.mean(axis=0)
            else:
                centers_new[i] = centers[i]
        if np.all(centers == centers_new):
            break
        centers = centers_new.copy()  # copy() matters: plain assignment would alias the two arrays
    return centers, labels  # labels stays 1-D so it can be passed directly as scatter's c=
x_train, y_true = make_blobs(
n_samples=200,
centers=4,
n_features=2,
random_state=233,
cluster_std=1.2,
)
cluster_center, y_predict = my_k_means_using_em(x=x_train, n_cluster=4, seed=int(datetime.now().timestamp()))
fig, axs = plt.subplots(1, 2, figsize=(20, 10)) # type: plt.Figure, list
ax_raw, ax_model = axs[0], axs[1] # type: plt.Axes, plt.Axes
cm = plt.cm.get_cmap('rainbow', lut=4)
ax_raw.scatter(x=x_train[:, 0], y=x_train[:, 1], c=y_true, edgecolors='k', cmap=cm, alpha=0.6)
ax_raw.set_title('Test data')
ax_model.scatter(x=x_train[:, 0], y=x_train[:, 1], c=y_predict, edgecolors='k', cmap=cm, alpha=0.6)
ax_model.scatter(x=cluster_center[:, 0], y=cluster_center[:, 1], c='blue', edgecolors='k', s=600, alpha=0.5, label='cluster centers')
ax_model.legend(loc='lower right')
ax_model.set_title('Cluster centers found with E-M')
fig.suptitle('Clustering with expectation-maximization')
#%% md
## Using K-means on Non-linear Data
Non-linear data can be projected into a higher-dimensional space through a kernel transform and then clustered with K-means; sklearn's SpectralClustering implements this kind of algorithm:
#%%
x_train, y_true = make_moons(n_samples=200, noise=0.08, random_state=233)
model1 = KMeans(n_clusters=2)
model2 = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', assign_labels='kmeans')
y_p1 = model1.fit_predict(x_train)
y_p2 = model2.fit_predict(x_train)
fig, axs = plt.subplots(1, 3, figsize=(15, 5)) # type: plt.Figure, list
ax_raw = axs[0] # type: plt.Axes
ax_k_means = axs[1] # type: plt.Axes
ax_sp_cluster = axs[2] # type: plt.Axes
cm = plt.cm.get_cmap('rainbow', lut=2)
ax_raw.scatter(x=x_train[:, 0], y=x_train[:, 1], c=y_true, cmap=cm, edgecolors='k', alpha=0.6)
ax_raw.set_title('Raw data (non-linear boundary)')
ax_k_means.scatter(x=x_train[:, 0], y=x_train[:, 1], c=y_p1, cmap=cm, edgecolors='k', alpha=0.6)
ax_k_means.scatter(
x=model1.cluster_centers_[:, 0], y=model1.cluster_centers_[:, 1], c='blue',
s=300, alpha=0.8, edgecolors='k', label='cluster centers',
)
ax_k_means.set_title('Clustering directly with K-means')
ax_k_means.legend(loc='upper right')
ax_sp_cluster.scatter(x=x_train[:, 0], y=x_train[:, 1], c=y_p2, cmap=cm, edgecolors='k', alpha=0.6)
ax_sp_cluster.set_title('Clustering with SpectralClustering')
fig.suptitle('Handling non-linear data')
#%% md
# Case Study: Applying K-means to Color Compression
An image with 8 bits of color depth per channel can use up to $256 \times 256 \times 256 \approx 16.7$ million colors. Here we use clustering to compress the palette down to 16 colors and see what the image looks like afterwards.
We cluster with the mini-batch variant of K-means (MiniBatchKMeans), which is considerably faster than ordinary K-means.
As the result shows, even though the number of available colors shrinks to roughly one millionth of the original, the overall features of the image are preserved.
#%%
img = load_sample_image('china.jpg')
plt.figure(figsize=(12, 12))
plt.imshow(img)
plt.xticks([])
plt.yticks([])
plt.title('china.jpg from sklearn dataset')
img_data = (img / 255).reshape((img.shape[0] * img.shape[1], img.shape[2]))
model = MiniBatchKMeans(n_clusters=16)
model.fit(img_data)
img_new = model.cluster_centers_[model.predict(img_data)]  # build the new image: replace each pixel with its cluster center
img_new = img_new.reshape(img.shape)
plt.figure(figsize=(12, 12))
plt.imshow(img_new)
plt.xticks([])
plt.yticks([])
plt.title('Image reduced to 16 colors by K-means clustering')