Precision and Recall of Clustering in Python

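This article explains the concepts of true positives, false positives, true negatives, and false negatives in clustering evaluation and walks through a concrete example of how to compute them. It also provides Python code to help illustrate how precision, recall, and the F1 score are calculated when evaluating a clustering.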


A true positive (TP) decision assigns two similar documents to the same cluster; a true negative (TN) decision assigns two dissimilar documents to different clusters. There are two types of errors we can commit. A false positive (FP) decision assigns two dissimilar documents to the same cluster. A false negative (FN) decision assigns two similar documents to different clusters.

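For example (the original figure is not reproduced here; the cluster contents below are taken from the Python list later in this post):

    Cluster 1: x x x x x o
    Cluster 2: x o o o o ◇
    Cluster 3: x x ◇ ◇ ◇
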
Here we have three actual groups: x’s, o’s, and diamonds, which we have tried to cluster into cluster 1, cluster 2, and cluster 3. Some mistakes were made. For example, cluster 2 contains an x, four o’s, and a diamond. Now we quantify the TP’s, FP’s, TN’s, and FN’s.

We will consider all pairs of documents, of which there are N(N−1)/2=136, since we have N=17 documents.

Now for TP+FP (all positives), we need to count all pairs of documents (not necessarily of matching type) that sit in the same cluster. Cluster 1 holds 6 documents, which gives $\binom{6}{2}=15$ pairs, and so on for the other clusters. This gives us

$$TP + FP = \binom{6}{2} + \binom{6}{2} + \binom{5}{2} = 40 \text{ total positives}$$

True positives are only the pairs that are of the same type. For example, the number of pairs of x’s in cluster 1 is $\binom{5}{2}$. This gives us,

$$TP = \binom{5}{2} + \binom{4}{2} + \binom{3}{2} + \binom{2}{2} = 20$$
This leaves 40−20=20 FP’s.
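The positive counts above can be checked with a short sketch using `math.comb` (Python 3.8+). The variable names are illustrative only, and the labels 'a', 'b', 'c' are assumed to stand for x, o, and diamond, as in the listing later in this post.

```python
from collections import Counter
from math import comb

# Assumed encoding: 'a' = x, 'b' = o, 'c' = diamond
clusters = [
    list("aaaaab"),  # cluster 1: five x's, one o
    list("abbbbc"),  # cluster 2: one x, four o's, one diamond
    list("aaccc"),   # cluster 3: two x's, three diamonds
]

# Every pair inside a cluster is a positive decision.
positives = sum(comb(len(c), 2) for c in clusters)

# Pairs with the same label inside a cluster are true positives.
tp = sum(comb(v, 2) for c in clusters for v in Counter(c).values())

fp = positives - tp
print(positives, tp, fp)  # 40 20 20
```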

Now for the total number of negatives, which is not covered in the link I provided. The total negatives plus total positives must equal the number of pairs, N(N−1)/2=136, and thus totalNegatives = 136 − totalPositives. So there are 136−40=96 negatives in total.

The number of FN’s can be found by counting pairs that should be grouped together but are not. Start with the x’s. Cluster 1 has 5 x’s, each paired with the three x’s outside cluster 1 (3*5=15); cluster 2 has 1 x, which is paired with the two x’s in cluster 3 that have not yet been counted (2*1=2). The o’s are handled the same way: cluster 1 has one o, which is paired with the 4 o’s in cluster 2 (1*4=4). Finally, the diamonds: cluster 2 has one diamond, which is paired with the 3 diamonds in cluster 3 (1*3=3). Adding these up,

$$FN = 3 \cdot 5 + 2 \cdot 1 + 1 \cdot 4 + 1 \cdot 3 = 24$$
Since the total number of negatives is 96, there must be 96−24=72 TN’s.
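As a cross-check on the manual count, FN can also be derived from the class totals: every pair of same-type documents is either a TP or an FN, so TP+FN equals the sum of "n choose 2" over the class sizes. The sketch below assumes the class sizes read off the example (8 x's, 5 o's, 4 diamonds) and reuses the totals computed above; the names are illustrative.

```python
from math import comb

# Class sizes read off the example (assumption: 8 x's, 5 o's, 4 diamonds)
class_sizes = {"x": 8, "o": 5, "diamond": 4}

total_pairs = comb(sum(class_sizes.values()), 2)  # 136 pairs over 17 documents
positives = 40                                    # TP + FP from the text
tp = 20                                           # TP from the text

# Same-type pairs, whether or not they share a cluster: TP + FN
same_class_pairs = sum(comb(n, 2) for n in class_sizes.values())

fn = same_class_pairs - tp
tn = (total_pairs - positives) - fn
print(fn, tn)  # 24 72
```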

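The final confusion matrix is (as per the website; rebuilt here from the counts derived above, since the image is not reproduced):

                         Same cluster    Different clusters
    Same class           TP = 20         FN = 24
    Different classes    FP = 20         TN = 72
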
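The four counts can also be confirmed by brute force, enumerating all 136 pairs and classifying each one directly from the definitions. This is a minimal sketch rather than an efficient method; the labels 'a', 'b', 'c' are assumed to stand for x, o, and diamond.

```python
from itertools import combinations

# Assumed encoding: 'a' = x, 'b' = o, 'c' = diamond
clusters = [list("aaaaab"), list("abbbbc"), list("aaccc")]

# Flatten into (cluster index, true label) records, one per document.
docs = [(i, label) for i, cluster in enumerate(clusters) for label in cluster]

tp = fp = fn = tn = 0
for (c1, l1), (c2, l2) in combinations(docs, 2):
    same_cluster = c1 == c2
    same_label = l1 == l2
    if same_cluster and same_label:
        tp += 1   # similar documents, same cluster
    elif same_cluster:
        fp += 1   # dissimilar documents, same cluster
    elif same_label:
        fn += 1   # similar documents, different clusters
    else:
        tn += 1   # dissimilar documents, different clusters

print(tp, fp, fn, tn)  # 20 20 24 72
```

Enumerating pairs is quadratic in the number of documents, so it is mainly useful for checking the counting approach on small examples.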
And as Karl said, precision and recall are:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

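Plugging the counts from this example into the standard definitions gives the scores directly. The helper name `pair_scores` is hypothetical, and the sketch assumes none of the denominators is zero.

```python
def pair_scores(tp, fp, fn):
    """Return (precision, recall, F1) from pair counts; assumes non-zero denominators."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the worked example: TP=20, FP=20, FN=24
precision, recall, f1 = pair_scores(20, 20, 24)
print(f"{precision:.2f} {recall:.2f} {f1:.2f}")  # 0.50 0.45 0.48
```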
Following the example above, we encode the x’s, o’s, and diamonds as the labels a, b, and c, respectively. The Python implementation is as follows:

    from collections import Counter

    # 'a', 'b', 'c' stand for x, o, and diamond
    cluster_list = [['a','a','a','a','a','b'],['a','b','b','b','b','c'],['a','a','c','c','c']]
    num_doc = 0
    positives = 0
    TP = 0
    FN = 0
    c_list = []
    for cluster in cluster_list:
        # calculating num_doc count...
        num_doc += len(cluster)

        # calculating positives: every pair inside one cluster
        positives += len(cluster) * (len(cluster) - 1) // 2

        # calculating TP: same-label pairs inside one cluster
        counts = Counter(cluster)
        c_list.append(counts)
        for v in counts.values():
            TP += v * (v - 1) // 2

    FP = positives - TP
    negatives = num_doc * (num_doc - 1) // 2 - positives

    # Add all the clusters together: label totals not yet processed
    remaining = Counter()
    for counts in c_list:
        remaining += counts

    # calculating FN: same-label pairs split across clusters.
    # Each cluster is removed from `remaining` after use so that
    # no cross-cluster pair is counted twice.
    for counts in c_list:
        for k, v in counts.items():
            FN += v * (remaining[k] - v)
        remaining -= counts
    TN = negatives - FN

    print("num_doc is %d " % num_doc)
    print("positives is %d " % positives)
    print("TP is %d " % TP)
    print("FP is %d " % FP)
    print("FN is %d " % FN)
    print("TN is %d " % TN)

    Precision = TP / (TP + FP)
    print("Precision is %.2f " % Precision)

    Recall = TP / (TP + FN)
    print("Recall is %.2f " % Recall)

    F1 = (2 * Recall * Precision) / (Recall + Precision)
    print("F1 is %.2f " % F1)

Criticism is welcome.

Reference link:
Precision and recall for clustering?
