Reference for this article: *Python Data Science Handbook*.
The source code for this article has been uploaded to Gitee.
Packages used in this article:
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
from sklearn.datasets import make_blobs, make_moons, load_sample_image
from sklearn.cluster import KMeans, SpectralClustering, MiniBatchKMeans
sns.set()
plt.rc('font', family='SimHei')
plt.rc('axes', unicode_minus=False)
The K-means Algorithm
The K-means algorithm looks for a preset number of clusters in unlabeled data. The center of each cluster is the arithmetic mean of all points belonging to it, and every point is closer to its own cluster center than to any other cluster center.
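Stated more formally (a standard formulation added here for clarity, not taken from the original text), K-means with $k$ clusters seeks centers $\mu_1, \dots, \mu_k$ that minimize the within-cluster sum of squared distances:
$$\min_{\mu_1, \dots, \mu_k} \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - \mu_j \rVert^2$$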
A Simple Example
In sklearn, the K-means algorithm is implemented by the KMeans class:
x, y_true = make_blobs(
n_samples=150,
n_features=2,
centers=4,
random_state=233,
cluster_std=1.5,
)
fig, axs = plt.subplots(1, 2, figsize=(16, 8)) # type: plt.Figure, list
ax_raw = axs[0] # type: plt.Axes
ax_k_means = axs[1] # type: plt.Axes
ax_raw.scatter(x=x[:, 0], y=x[:, 1], c=y_true, cmap=plt.cm.get_cmap('rainbow', lut=4), alpha=0.6, edgecolors='k')
ax_raw.set_title('Test data for clustering')
model = KMeans(n_clusters=4)
model.fit(x)
y_predict = model.predict(x)
ax_k_means.scatter(x=x[:, 0], y=x[:, 1], c=y_predict, cmap=plt.cm.get_cmap('rainbow', lut=4), alpha=0.6, edgecolors='k')
ax_k_means.scatter(
x=model.cluster_centers_[:, 0],
y=model.cluster_centers_[:, 1],
c='blue',
s=600,
alpha=0.3,
edgecolors='k',
label='cluster centers',
)
ax_k_means.legend(loc='lower right')
ax_k_means.set_title('Clustering result')
fig.suptitle('A simple K-means example')
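Incidentally, fit followed by predict on the same data can be collapsed into one call; fit_predict is an existing KMeans method (it is also used later in this article), so this one-liner is equivalent to the fit/predict pair above:
y_predict = KMeans(n_clusters=4).fit_predict(x)  # fit the model and return the training labels in one call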
Expectation-Maximization
K-means uses the expectation-maximization (E-M) algorithm to find the best-fitting cluster centers. Roughly:
- Randomly choose some initial cluster centers
- Repeat the following two steps until the result converges:
  - E-step (expectation): assign each point to its nearest cluster center, completing one round of assignment
  - M-step (maximization): set each cluster center to the mean of the coordinates of all points assigned to that cluster
Of course, the expectation-maximization algorithm has its own inherent drawbacks:
- The final result depends on the randomly chosen initial centers; with an unlucky initialization, E-M cannot reach the best result (which is why K-means is usually run several times and the best run kept; see the sketch after this list)
- The number of clusters must be fixed in advance (the elbow sweep in the sketch below is one common heuristic for choosing it)
- Its complexity is relatively high
- It cannot handle clusters with non-linear boundaries
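The first two drawbacks can be softened in practice. The sketch below is my addition (reusing x from the simple example above): n_init is a real KMeans parameter that runs the algorithm from several random initializations and keeps the run with the lowest inertia_, and sweeping n_clusters while recording inertia_ gives the classic "elbow" plot for choosing the cluster count; the candidate range is an illustrative assumption.
inertias = []
ks = list(range(1, 9))  # candidate cluster counts -- an illustrative choice
for k in ks:
    km = KMeans(n_clusters=k, n_init=10)  # keep the best of 10 random initializations
    km.fit(x)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances
plt.figure(figsize=(6, 4))
plt.plot(ks, inertias, marker='o')
plt.xlabel('n_clusters')
plt.ylabel('inertia')
plt.title('Elbow heuristic: look for the bend in the curve')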
A manual implementation of the expectation-maximization loop:
def my_k_means_using_em(x: np.ndarray, n_cluster: int, seed: int) -> tuple:
    # bounding box of the data; initial centers are drawn uniformly inside it
    x_min, x_max = x[:, 0].min(), x[:, 0].max()
    y_min, y_max = x[:, 1].min(), x[:, 1].max()
    rng = np.random.RandomState(seed=seed)
    centers = np.hstack((
        (rng.rand(n_cluster) * (x_max - x_min) + x_min)[:, np.newaxis],
        (rng.rand(n_cluster) * (y_max - y_min) + y_min)[:, np.newaxis],
    ))
    centers_new = np.zeros(centers.shape)
    dis = np.zeros(shape=(n_cluster, x.shape[0]))  # dis[j, i]: distance from point i to center j
    labels = np.zeros(x.shape[0], dtype=int)
    while True:
        # E-step, part 1: distance from every point to every current center
        for i in range(x.shape[0]):
            for j in range(n_cluster):
                dx = x[i, 0] - centers[j, 0]
                dy = x[i, 1] - centers[j, 1]
                dis[j, i] = np.sqrt(dx ** 2 + dy ** 2)
        # E-step, part 2: assign each point to its nearest center
        for i in range(x.shape[0]):
            idx = 0
            for j in range(n_cluster):
                if dis[j, i] < dis[idx, i]:
                    idx = j
            labels[i] = idx
        # M-step: move each center to the mean of its assigned points;
        # an empty cluster keeps its previous center
        for i in range(n_cluster):
            tmp = x[labels == i]  # type: np.ndarray
            if tmp.size > 0:
                centers_new[i] = tmp.mean(axis=0)
            else:
                centers_new[i] = centers[i]
        if np.all(centers == centers_new):
            break
        centers = centers_new.copy()  # copy() matters: plain assignment would alias the two arrays
    return centers, labels  # labels stays 1-D so it can be passed directly as scatter's c=
x_train, y_true = make_blobs(
n_samples=200,
centers=4,
n_features=2,
random_state=233,
cluster_std=1.2,
)
cluster_center, y_predict = my_k_means_using_em(x=x_train, n_cluster=4, seed=int(datetime.now().timestamp()))
fig, axs = plt.subplots(1, 2, figsize=(20, 10)) # type: plt.Figure, list
ax_raw, ax_model = axs[0], axs[1] # type: plt.Axes, plt.Axes
cm = plt.cm.get_cmap('rainbow', lut=4)
ax_raw.scatter(x=x_train[:, 0], y=x_train[:, 1], c=y_true, edgecolors='k', cmap=cm, alpha=0.6)
ax_raw.set_title('Test data')
ax_model.scatter(x=x_train[:, 0], y=x_train[:, 1], c=y_predict, edgecolors='k', cmap=cm, alpha=0.6)
ax_model.scatter(x=cluster_center[:, 0], y=cluster_center[:, 1], c='blue', edgecolors='k', s=600, alpha=0.5, label='cluster centers')
ax_model.legend(loc='lower right')
ax_model.set_title('Cluster centers found with E-M')
fig.suptitle('Clustering with expectation-maximization')
A fairly ideal run looks like this:
A much less reasonable run (caused by unlucky initial centers) looks like this:
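The double Python loops above make the E- and M-steps explicit but are slow for larger data. Below is a minimal vectorized sketch of the same loop (my rewrite; the helper name my_k_means_vectorized is made up, and initializing centers by sampling data points is a deliberate deviation from the uniform-box initialization used above):
def my_k_means_vectorized(x: np.ndarray, n_cluster: int, seed: int) -> tuple:
    rng = np.random.RandomState(seed=seed)
    # initialize centers by sampling actual data points without replacement
    centers = x[rng.choice(x.shape[0], size=n_cluster, replace=False)]
    while True:
        # E-step: pairwise point-to-center distances via broadcasting, shape (n_points, n_cluster)
        dis = np.linalg.norm(x[:, np.newaxis, :] - centers[np.newaxis, :, :], axis=2)
        labels = dis.argmin(axis=1)  # nearest center per point
        # M-step: move each center to the mean of its assigned points (empty clusters stay put)
        centers_new = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(n_cluster)
        ])
        if np.allclose(centers, centers_new):
            return centers_new, labels
        centers = centers_new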
Using K-means on Non-linear Data
Non-linear data can be projected into a higher-dimensional space through a kernel transform and then clustered with K-means; sklearn's SpectralClustering implements this kind of algorithm:
x_train, y_true = make_moons(n_samples=200, noise=0.08, random_state=233)
model1 = KMeans(n_clusters=2)
model2 = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', assign_labels='kmeans')
y_p1 = model1.fit_predict(x_train)
y_p2 = model2.fit_predict(x_train)
fig, axs = plt.subplots(1, 3, figsize=(15, 5)) # type: plt.Figure, list
ax_raw = axs[0] # type: plt.Axes
ax_k_means = axs[1] # type: plt.Axes
ax_sp_cluster = axs[2] # type: plt.Axes
cm = plt.cm.get_cmap('rainbow', lut=2)
ax_raw.scatter(x=x_train[:, 0], y=x_train[:, 1], c=y_true, cmap=cm, edgecolors='k', alpha=0.6)
ax_raw.set_title('Raw data (non-linear boundary)')
ax_k_means.scatter(x=x_train[:, 0], y=x_train[:, 1], c=y_p1, cmap=cm, edgecolors='k', alpha=0.6)
ax_k_means.scatter(
x=model1.cluster_centers_[:, 0], y=model1.cluster_centers_[:, 1], c='blue',
s=300, alpha=0.8, edgecolors='k', label='cluster centers',
)
ax_k_means.set_title('Clustering directly with K-means')
ax_k_means.legend(loc='upper right')
ax_sp_cluster.scatter(x=x_train[:, 0], y=x_train[:, 1], c=y_p2, cmap=cm, edgecolors='k', alpha=0.6)
ax_sp_cluster.set_title('Clustering with SpectralClustering')
fig.suptitle('Handling non-linear data')
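For reference, 'nearest_neighbors' is only one choice of affinity; SpectralClustering also accepts an RBF kernel (affinity='rbf' is in fact sklearn's default). A hedged variant, where the gamma value is an untuned assumption chosen purely for illustration:
model3 = SpectralClustering(n_clusters=2, affinity='rbf', gamma=15.0, assign_labels='kmeans')
y_p3 = model3.fit_predict(x_train)  # labels from the RBF-affinity spectral embedding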
Case Study: Applying K-means to Color Compression
An image with 8 bits of color depth per channel can use up to $256 \times 256 \times 256 \approx 16.7$ million colors. Here we use clustering to compress the palette down to 16 colors and see what the image looks like afterwards.
We cluster with the mini-batch variant of K-means (the MiniBatchKMeans class), which is considerably faster than ordinary K-means.
As the result shows, even though the number of available colors shrinks to roughly one millionth of the original, the overall features of the image are preserved.
img = load_sample_image('china.jpg')
plt.figure(figsize=(12, 12))
plt.imshow(img)
plt.xticks([])
plt.yticks([])
plt.title('china.jpg from sklearn dataset')
img_data = (img / 255).reshape((img.shape[0] * img.shape[1], img.shape[2]))
model = MiniBatchKMeans(n_clusters=16)
model.fit(img_data)
img_new = model.cluster_centers_[model.predict(img_data)]  # build the new image: replace each pixel with its cluster center
img_new = img_new.reshape(img.shape)
plt.figure(figsize=(12, 12))
plt.imshow(img_new)
plt.xticks([])
plt.yticks([])
plt.title('Image reduced to 16 colors by K-means clustering')
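As a quick sanity check (my addition, not in the original), we can count the distinct colors before and after quantization by treating pixels as rows and deduplicating them with np.unique:
n_before = np.unique(img.reshape(-1, 3), axis=0).shape[0]     # distinct colors in the original image
n_after = np.unique(img_new.reshape(-1, 3), axis=0).shape[0]  # should be at most 16
print(f'distinct colors: {n_before} before, {n_after} after')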
Complete Code (Jupyter Notebook)
#%% md
# K-means
The K-means algorithm looks for a preset number of clusters in unlabeled data. The center of each cluster is the arithmetic mean of all points belonging to it, and every point is closer to its own cluster center than to any other cluster center.
## A Simple K-means Example
#%%
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
from sklearn.datasets import make_blobs, make_moons, load_sample_image
from sklearn.cluster import KMeans, SpectralClustering, MiniBatchKMeans
sns.set()
plt.rc('font', family='SimHei')
plt.rc('axes', unicode_minus=False)
#%%
x, y_true = make_blobs(
n_samples=150,
n_features=2,
centers=4,
random_state=233,
cluster_std=1.5,
)
fig, axs = plt.subplots(1, 2, figsize=(16, 8)) # type: plt.Figure, list
ax_raw = axs[0] # type: plt.Axes
ax_k_means = axs[1] # type: plt.Axes
ax_raw.scatter(x=x[:, 0], y=x[:, 1], c=y_true, cmap=plt.cm.get_cmap('rainbow', lut=4), alpha=0.6, edgecolors='k')
ax_raw.set_title('Test data for clustering')
model = KMeans(n_clusters=4)
model.fit(x)
y_predict = model.predict(x)
ax_k_means.scatter(x=x[:, 0], y=x[:, 1], c=y_predict, cmap=plt.cm.get_cmap('rainbow', lut=4), alpha=0.6, edgecolors='k')
ax_k_means.scatter(
x=model.cluster_centers_[:, 0],
y=model.cluster_centers_[:, 1],
c='blue',
s=600,
alpha=0.3,
edgecolors='k',
label='cluster centers',
)
ax_k_means.legend(loc='lower right')
ax_k_means.set_title('Clustering result')
fig.suptitle('A simple K-means example')
#%% md
## Expectation-Maximization
K-means uses the **expectation-maximization** (E-M) algorithm to find the best-fitting cluster centers. Roughly:
- Randomly choose some initial cluster centers
- Repeat the following two steps until the result converges:
  - E-step (expectation): assign each point to its nearest cluster center, completing one round of assignment
  - M-step (maximization): set each cluster center to the mean of the coordinates of all points assigned to that cluster
The expectation-maximization algorithm also has its own inherent drawbacks:
- The final result depends on the randomly chosen initial centers; with an unlucky initialization, E-M cannot reach the best result
- The number of clusters must be fixed in advance
- Its complexity is relatively high
- It cannot handle clusters with non-linear boundaries
#%%
def my_k_means_using_em(x: np.ndarray, n_cluster: int, seed: int) -> tuple:
    # bounding box of the data; initial centers are drawn uniformly inside it
    x_min, x_max = x[:, 0].min(), x[:, 0].max()
    y_min, y_max = x[:, 1].min(), x[:, 1].max()
    rng = np.random.RandomState(seed=seed)
    centers = np.hstack((
        (rng.rand(n_cluster) * (x_max - x_min) + x_min)[:, np.newaxis],
        (rng.rand(n_cluster) * (y_max - y_min) + y_min)[:, np.newaxis],
    ))
    centers_new = np.zeros(centers.shape)
    dis = np.zeros(shape=(n_cluster, x.shape[0]))  # dis[j, i]: distance from point i to center j
    labels = np.zeros(x.shape[0], dtype=int)
    while True:
        # E-step, part 1: distance from every point to every current center
        for i in range(x.shape[0]):
            for j in range(n_cluster):
                dx = x[i, 0] - centers[j, 0]
                dy = x[i, 1] - centers[j, 1]
                dis[j, i] = np.sqrt(dx ** 2 + dy ** 2)
        # E-step, part 2: assign each point to its nearest center
        for i in range(x.shape[0]):
            idx = 0
            for j in range(n_cluster):
                if dis[j, i] < dis[idx, i]:
                    idx = j
            labels[i] = idx
        # M-step: move each center to the mean of its assigned points;
        # an empty cluster keeps its previous center
        for i in range(n_cluster):
            tmp = x[labels == i]  # type: np.ndarray
            if tmp.size > 0:
                centers_new[i] = tmp.mean(axis=0)
            else:
                centers_new[i] = centers[i]
        if np.all(centers == centers_new):
            break
        centers = centers_new.copy()  # copy() matters: plain assignment would alias the two arrays
    return centers, labels  # labels stays 1-D so it can be passed directly as scatter's c=
x_train, y_true = make_blobs(
n_samples=200,
centers=4,
n_features=2,
random_state=233,
cluster_std=1.2,
)
cluster_center, y_predict = my_k_means_using_em(x=x_train, n_cluster=4, seed=int(datetime.now().timestamp()))
fig, axs = plt.subplots(1, 2, figsize=(20, 10)) # type: plt.Figure, list
ax_raw, ax_model = axs[0], axs[1] # type: plt.Axes, plt.Axes
cm = plt.cm.get_cmap('rainbow', lut=4)
ax_raw.scatter(x=x_train[:, 0], y=x_train[:, 1], c=y_true, edgecolors='k', cmap=cm, alpha=0.6)
ax_raw.set_title('Test data')
ax_model.scatter(x=x_train[:, 0], y=x_train[:, 1], c=y_predict, edgecolors='k', cmap=cm, alpha=0.6)
ax_model.scatter(x=cluster_center[:, 0], y=cluster_center[:, 1], c='blue', edgecolors='k', s=600, alpha=0.5, label='cluster centers')
ax_model.legend(loc='lower right')
ax_model.set_title('Cluster centers found with E-M')
fig.suptitle('Clustering with expectation-maximization')
#%% md
## Using K-means on Non-linear Data
Non-linear data can be projected into a higher-dimensional space through a kernel transform and then clustered with K-means; sklearn's SpectralClustering implements this kind of algorithm:
#%%
x_train, y_true = make_moons(n_samples=200, noise=0.08, random_state=233)
model1 = KMeans(n_clusters=2)
model2 = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', assign_labels='kmeans')
y_p1 = model1.fit_predict(x_train)
y_p2 = model2.fit_predict(x_train)
fig, axs = plt.subplots(1, 3, figsize=(15, 5)) # type: plt.Figure, list
ax_raw = axs[0] # type: plt.Axes
ax_k_means = axs[1] # type: plt.Axes
ax_sp_cluster = axs[2] # type: plt.Axes
cm = plt.cm.get_cmap('rainbow', lut=2)
ax_raw.scatter(x=x_train[:, 0], y=x_train[:, 1], c=y_true, cmap=cm, edgecolors='k', alpha=0.6)
ax_raw.set_title('Raw data (non-linear boundary)')
ax_k_means.scatter(x=x_train[:, 0], y=x_train[:, 1], c=y_p1, cmap=cm, edgecolors='k', alpha=0.6)
ax_k_means.scatter(
x=model1.cluster_centers_[:, 0], y=model1.cluster_centers_[:, 1], c='blue',
s=300, alpha=0.8, edgecolors='k', label='cluster centers',
)
ax_k_means.set_title('Clustering directly with K-means')
ax_k_means.legend(loc='upper right')
ax_sp_cluster.scatter(x=x_train[:, 0], y=x_train[:, 1], c=y_p2, cmap=cm, edgecolors='k', alpha=0.6)
ax_sp_cluster.set_title('Clustering with SpectralClustering')
fig.suptitle('Handling non-linear data')
#%% md
# Case Study: Applying K-means to Color Compression
An image with 8 bits of color depth per channel can use up to $256 \times 256 \times 256 \approx 16.7$ million colors. Here we use clustering to compress the palette down to 16 colors and see what the image looks like afterwards.
We cluster with the mini-batch variant of K-means (MiniBatchKMeans), which is considerably faster than ordinary K-means.
As the result shows, even though the number of available colors shrinks to roughly one millionth of the original, the overall features of the image are preserved.
#%%
img = load_sample_image('china.jpg')
plt.figure(figsize=(12, 12))
plt.imshow(img)
plt.xticks([])
plt.yticks([])
plt.title('china.jpg from sklearn dataset')
img_data = (img / 255).reshape((img.shape[0] * img.shape[1], img.shape[2]))
model = MiniBatchKMeans(n_clusters=16)
model.fit(img_data)
img_new = model.cluster_centers_[model.predict(img_data)]  # build the new image: replace each pixel with its cluster center
img_new = img_new.reshape(img.shape)
plt.figure(figsize=(12, 12))
plt.imshow(img_new)
plt.xticks([])
plt.yticks([])
plt.title('Image reduced to 16 colors by K-means clustering')