Multi-View Clustering Metrics

This article covers how to compute matching-based metrics for multi-view clustering: finding the optimal matching with the Munkres algorithm or linear_sum_assignment, building the cost matrix, and reordering the cluster assignment. It also covers clustering and classification metrics such as purity, NMI, AMI, and ARI.

Code for several multi-view clustering metrics: see [1-3] for introductions and [4-6] for implementations.

Matching / Assignment

Because clustering imposes no ordering on its clusters, metrics that use the ground-truth labels (such as accuracy and the other classification metrics) first require finding the correspondence between predicted clusters and ground-truth classes.

In short, run a matching algorithm to find the optimal matching. Denote the ground-truth labels $y$ (y_true), the cluster assignment output by the clustering model $y'$ (y_assign), and the assignment after matching and reordering $y''$ (y_adjust); all three are one-dimensional vectors in $\mathbb{N}^n$, where n is the number of instances.

cost matrix

Before matching, we need a cost matrix $W$. To build it, first compute a co-occurrence matrix (an ad-hoc name) $C \in \mathbb{N}^{d \times c}$:

$$C_{ij} = \left|\{k \mid y'_k = i \wedge y_k = j,\ k = 1, \dots, n\}\right|$$

where $c$ is the number of ground-truth classes and $d$ is the number of clusters (in MVC usually $c = d$). Then

$$W_{ij} = m - C_{ij}, \quad m = \max_{r,c} C_{rc}$$

The class IDs in y_true do not start from 0 in every dataset; here we assume both class IDs and cluster IDs start from 0.

# import numpy as np

def calc_cost_matrix(y_true, y_assign, n_classes, n_clusters):
    """calculate cost matrix W
    Input:
        y_true: [n], in {0, ..., n_classes - 1}
        y_assign: [n], in {0, ..., n_clusters - 1}
        n_classes: int, provide in case y_true.max() + 1 != n_classes
        n_clusters: int, provide in case y_assign.max() + 1 != n_clusters
    Output:
        W: [n_clusters, n_classes]
    """
    y_true = y_true.astype(np.int64)
    y_assign = y_assign.astype(np.int64)
    assert y_assign.size == y_true.size # n
    # C = np.zeros((y_assign.max() + 1, y_true.max() + 1), dtype=np.int64)
    C = np.zeros((n_clusters, n_classes), dtype=np.int64)
    for i in range(y_assign.size):
        C[y_assign[i], y_true[i]] += 1
    W = C.max() - C
    return W
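As a sanity check of the definitions above, here is a tiny worked example (toy labels invented for illustration): C counts cluster/class co-occurrences, and W turns frequent co-occurrences into cheap matches.

```python
import numpy as np

# toy data: n = 6 samples, c = d = 2 (labels invented for illustration)
y_true   = np.array([0, 0, 0, 1, 1, 1])
y_assign = np.array([1, 1, 0, 0, 0, 0])

# co-occurrence matrix: C[i, j] = |{k : y_assign[k] == i and y_true[k] == j}|
C = np.zeros((2, 2), dtype=np.int64)
for a, t in zip(y_assign, y_true):
    C[a, t] += 1
print(C)  # [[1 3]
          #  [2 0]]

# cost matrix: W = max(C) - C, so frequent co-occurrences become cheap matches
W = C.max() - C
print(W)  # [[2 0]
          #  [1 3]]
```

The optimal matching here is cluster 0 → class 1 and cluster 1 → class 0, with total cost 0 + 1 = 1.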

munkres

[4] uses the munkres package; the code is in get_y_preds.

# import numpy as np
# from munkres import Munkres

def reorder_assignment(y_true, y_assign, n_classes, n_clusters):
    """(munkres) re-order y_assign to be y_adjust so that it has the same order as y_true
    Input:
        y_true: [n], in {0, ..., c - 1}
        y_assign: [n], in {0, ..., d - 1}
        n_classes: int, provide in case y_true.max() + 1 != n_classes
        n_clusters: int, provide in case y_assign.max() + 1 != n_clusters
    Output:
        y_adjust: [n], in {0, ..., c - 1}, in same order as y_true
    """
    W = calc_cost_matrix(y_true, y_assign, n_classes, n_clusters)
    indices = Munkres().compute(W.tolist())  # munkres expects a list-of-lists cost matrix
    map_a2t = np.zeros(n_clusters, dtype=np.int64)
    for i in range(n_clusters):
        map_a2t[i] = indices[i][1]
    y_adjust = map_a2t[y_assign]
    return y_adjust

linear_sum_assignment

[5,6] use scipy.optimize.linear_sum_assignment; see cluster_acc and ordered_cmat respectively. [7] also has an example.

# import numpy as np
# from scipy.optimize import linear_sum_assignment

def reorder_assignment(y_true, y_assign, n_classes, n_clusters):
    """(linear_sum_assignment) re-order y_assign to be y_adjust so that it has the same order as y_true
    Input:
        y_true: [n], in {0, ..., c - 1}
        y_assign: [n], in {0, ..., d - 1}
        n_classes: int, provide in case y_true.max() + 1 != n_classes
        n_clusters: int, provide in case y_assign.max() + 1 != n_clusters
    Output:
        y_adjust: [n], in {0, ..., c - 1}, in same order as y_true
    """
    W = calc_cost_matrix(y_true, y_assign, n_classes, n_clusters)
    row_idx, col_idx = linear_sum_assignment(W)
    map_a2t = np.zeros(n_clusters, dtype=np.int64)
    for i, j in zip(row_idx, col_idx):
        map_a2t[i] = j
    y_adjust = map_a2t[y_assign]
    return y_adjust
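A minimal end-to-end check of this scipy version on invented toy data (the counting and mapping steps are inlined so the snippet runs standalone):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# toy data: clusters are a pure permutation of the classes
y_true   = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_assign = np.array([2, 2, 2, 0, 0, 1, 1, 1])

# same logic as calc_cost_matrix + reorder_assignment above, inlined
C = np.zeros((3, 3), dtype=np.int64)
for a, t in zip(y_assign, y_true):
    C[a, t] += 1
row_idx, col_idx = linear_sum_assignment(C.max() - C)
map_a2t = np.zeros(3, dtype=np.int64)
map_a2t[row_idx] = col_idx
y_adjust = map_a2t[y_assign]

print(map_a2t)                      # [1 2 0]: cluster 0 -> class 1, etc.
print((y_adjust == y_true).mean())  # 1.0: the permutation is fully recovered
```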

Clustering Metrics

Here we distinguish between "clustering" metrics and "classification" metrics. The clustering metrics in this section are permutation-invariant: they are computed on the raw cluster assignment, without reordering it to align with the ground-truth label order.

purity

See [1,2] for an introduction; range [0, 1], higher is better. [5] has an implementation: purity.

# import numpy as np
# from sklearn.metrics import accuracy_score

def purity(y_true, y_assign):
    """purity in [0, 1]; higher is better"""
    y_true = y_true.copy()  # the relabeling below would otherwise mutate the caller's array
    y_voted_labels = np.zeros(y_true.shape)
    # remap class labels to {0, ..., c - 1}
    labels = np.unique(y_true)
    ordered_labels = np.arange(labels.shape[0])
    for k in range(labels.shape[0]):
        y_true[y_true == labels[k]] = ordered_labels[k]
    labels = np.unique(y_true)
    bins = np.concatenate((labels, [np.max(labels) + 1]), axis=0)

    # give every sample the majority class of its cluster, then score with accuracy
    for cluster in np.unique(y_assign):
        hist, _ = np.histogram(y_true[y_assign == cluster], bins=bins)
        winner = np.argmax(hist)
        y_voted_labels[y_assign == cluster] = winner

    return accuracy_score(y_true, y_voted_labels)
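Equivalently, purity is the sum of per-cluster majority-class counts divided by n; a self-contained sketch on invented toy data:

```python
import numpy as np

# toy data: cluster 0 is pure class 0; cluster 1 holds two 1s and one stray 0
y_true   = np.array([0, 0, 0, 1, 1, 0])
y_assign = np.array([0, 0, 0, 1, 1, 1])

# purity = (sum over clusters of the majority-class count) / n
majority_total = 0
for c in np.unique(y_assign):
    members = y_true[y_assign == c]
    majority_total += np.bincount(members).max()
p = majority_total / y_true.size
print(p)  # 5/6: cluster 0 contributes 3, cluster 1 contributes 2, out of 6
```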

NMI

Normalized mutual information; see [1,3] for an introduction; range [0, 1], higher is better. [4,6] use sklearn.metrics.normalized_mutual_info_score (in clustering_metric and calc_metrics), while [5] uses sklearn.metrics.v_measure_score (in evaluate).

  • Regarding v_measure_score, scikit-learn notes: "This score is identical to normalized_mutual_info_score with the 'arithmetic' option for averaging."

# from sklearn.metrics import normalized_mutual_info_score, v_measure_score

def nmi(y_true, y_assign):
    # return v_measure_score(y_true, y_assign)
    return normalized_mutual_info_score(y_true, y_assign)
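Since mutual information compares partitions only, NMI is invariant to how the clusters are numbered, which is why no reordering step is needed; a quick toy check:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score, v_measure_score

y_true   = np.array([0, 0, 1, 1, 2, 2])
y_assign = np.array([2, 2, 0, 0, 1, 1])  # identical partition, clusters renumbered

# NMI ignores the labeling, so the renumbered partition still scores perfectly
nmi_val = normalized_mutual_info_score(y_true, y_assign)
v_val = v_measure_score(y_true, y_assign)  # arithmetic-mean NMI
print(nmi_val, v_val)  # both ≈ 1.0 (perfect agreement up to relabeling)
```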

AMI

Adjusted mutual information; see [3] for an introduction; range [-1, 1], higher is better. [4] uses sklearn.metrics.adjusted_mutual_info_score; see clustering_metric.

# from sklearn.metrics import adjusted_mutual_info_score

def ami(y_true, y_assign):
    return adjusted_mutual_info_score(y_true, y_assign)

ARI

Adjusted Rand index; see [1-3] for an introduction; range [-1, 1], higher is better. [4,6] use sklearn.metrics.adjusted_rand_score; see clustering_metric and calc_metrics.

# from sklearn.metrics import adjusted_rand_score

def ari(y_true, y_assign):
    return adjusted_rand_score(y_true, y_assign)
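Both AMI and ARI are "adjusted for chance": a perfect partition scores 1, while random assignments land near 0 rather than at a positive chance baseline, which is why the range dips below 0. A toy check (labels invented for illustration):

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

y_true = np.repeat(np.arange(3), 50)  # 150 samples, 3 balanced classes
perm = (y_true + 1) % 3               # the same partition with classes renumbered

# a perfect partition scores 1 under both metrics, even after relabeling
ari_perfect = adjusted_rand_score(y_true, perm)
ami_perfect = adjusted_mutual_info_score(y_true, perm)
print(ari_perfect, ami_perfect)  # both ≈ 1.0

# random assignments hover near 0 ("chance level"), and can go slightly negative
rng = np.random.default_rng(0)
ari_random = adjusted_rand_score(y_true, rng.integers(0, 3, y_true.size))
print(ari_random)
```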

Classification Metrics

The classification metrics in this section are computed on y_adjust, the assignment after reordering.

ACC

Accuracy; range [0, 1], higher is better. [4] uses sklearn.metrics.accuracy_score, while [5,6] implement it by hand; see classification_metric, cluster_acc, and ordered_cmat respectively.

# from sklearn.metrics import accuracy_score

def acc(y_true, y_adjust):
    return accuracy_score(y_true, y_adjust)

precision

Range [0, 1], higher is better. [4] uses sklearn.metrics.precision_score; see classification_metric.

# from sklearn.metrics import precision_score

def precision(y_true, y_adjust, average='macro'):
    return precision_score(y_true, y_adjust, average=average)

recall

Range [0, 1], higher is better. [4] uses sklearn.metrics.recall_score; see classification_metric.

# from sklearn.metrics import recall_score

def recall(y_true, y_adjust, average='macro'):
    return recall_score(y_true, y_adjust, average=average)

f1-score

Range [0, 1], higher is better. [4] uses sklearn.metrics.f1_score; see classification_metric.

# from sklearn.metrics import f1_score

def f1(y_true, y_adjust, average='macro'):
    # don't name this wrapper f1_score: that would shadow the sklearn import and recurse
    return f1_score(y_true, y_adjust, average=average)

Combination

Everything above collected into one file for convenient reuse.

# evaluate.py
import numpy as np
from scipy.optimize import linear_sum_assignment
import sklearn.metrics as metrics


def calc_cost_matrix(y_true, y_assign, n_classes, n_clusters):
    """calculate cost matrix W
    Input:
        y_true: [n], in {0, ..., n_classes - 1}
        y_assign: [n], in {0, ..., n_clusters - 1}
        n_classes: int, provide in case y_true.max() + 1 != n_classes
        n_clusters: int, provide in case y_assign.max() + 1 != n_clusters
    Output:
        W: [n_clusters, n_classes]
    """
    y_true = y_true.astype(np.int64)
    y_assign = y_assign.astype(np.int64)
    assert y_assign.size == y_true.size # n
    # C = np.zeros((y_assign.max() + 1, y_true.max() + 1), dtype=np.int64)
    C = np.zeros((n_clusters, n_classes), dtype=np.int64)
    for i in range(y_assign.size):
        C[y_assign[i], y_true[i]] += 1
    W = C.max() - C
    return W


def reorder_assignment(y_true, y_assign, n_classes, n_clusters):
    """(linear_sum_assignment) re-order y_assign to be y_adjust so that it has the same order as y_true
    Input:
        y_true: [n], in {0, ..., c - 1}
        y_assign: [n], in {0, ..., d - 1}
        n_classes: int, provide in case y_true.max() + 1 != n_classes
        n_clusters: int, provide in case y_assign.max() + 1 != n_clusters
    Output:
        y_adjust: [n], in {0, ..., c - 1}, in same order as y_true
    """
    W = calc_cost_matrix(y_true, y_assign, n_classes, n_clusters)
    row_idx, col_idx = linear_sum_assignment(W)
    map_a2t = np.zeros(n_clusters, dtype=np.int64)
    for i, j in zip(row_idx, col_idx):
        map_a2t[i] = j
    y_adjust = map_a2t[y_assign]
    return y_adjust


def purity(y_true, y_assign):
    """purity in [0, 1]; higher is better"""
    y_true = y_true.copy()  # the relabeling below would otherwise mutate the caller's array
    y_voted_labels = np.zeros(y_true.shape)
    # remap class labels to {0, ..., c - 1}
    labels = np.unique(y_true)
    ordered_labels = np.arange(labels.shape[0])
    for k in range(labels.shape[0]):
        y_true[y_true == labels[k]] = ordered_labels[k]
    labels = np.unique(y_true)
    bins = np.concatenate((labels, [np.max(labels) + 1]), axis=0)

    # give every sample the majority class of its cluster, then score with accuracy
    for cluster in np.unique(y_assign):
        hist, _ = np.histogram(y_true[y_assign == cluster], bins=bins)
        winner = np.argmax(hist)
        y_voted_labels[y_assign == cluster] = winner

    return metrics.accuracy_score(y_true, y_voted_labels)


def evaluate(y_true, y_assign, n_classes, n_clusters, average='macro'):
    y_adjust = reorder_assignment(y_true, y_assign, n_classes, n_clusters)
    return {
        # clustering
        'purity': purity(y_true, y_assign),
        'nmi': metrics.normalized_mutual_info_score(y_true, y_assign),
        'ami': metrics.adjusted_mutual_info_score(y_true, y_assign),
        'ari': metrics.adjusted_rand_score(y_true, y_assign),
        # classification
        'acc': metrics.accuracy_score(y_true, y_adjust),
        'precision': metrics.precision_score(y_true, y_adjust, average=average),
        'recall': metrics.recall_score(y_true, y_adjust, average=average),
        'f1-score': metrics.f1_score(y_true, y_adjust, average=average)
    }
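A hypothetical usage sketch of the pipeline in evaluate(), with the matching steps inlined so the snippet runs standalone (toy labels invented for illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
import sklearn.metrics as metrics

# toy labels: cluster numbering is permuted and one sample is mis-clustered
y_true   = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_assign = np.array([1, 1, 2, 2, 2, 2, 0, 0, 0])

# the same steps evaluate() performs: count co-occurrences, match, remap, score
C = np.zeros((3, 3), dtype=np.int64)
for a, t in zip(y_assign, y_true):
    C[a, t] += 1
row_idx, col_idx = linear_sum_assignment(C.max() - C)
map_a2t = np.zeros(3, dtype=np.int64)
map_a2t[row_idx] = col_idx
y_adjust = map_a2t[y_assign]

acc = metrics.accuracy_score(y_true, y_adjust)
print(acc)  # 8/9: only the one mis-clustered sample is wrong
```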

References

  1. [ML] 聚类评价指标
  2. 几种常见的聚类外部评价指标
  3. 十分钟掌握聚类算法的评估指标
  4. XLearning-SCU/2021-CVPR-Completer
  5. Gasteinh/DSIMVC
  6. DanielTrosten/DeepMVC
  7. 利用python解决指派问题(匈牙利算法)