A true positive (TP) decision assigns two similar documents to the same cluster; a true negative (TN) decision assigns two dissimilar documents to different clusters. There are two types of errors we can commit. A false positive (FP) decision assigns two dissimilar documents to the same cluster. A false negative (FN) decision assigns two similar documents to different clusters.
For example, suppose we have three actual groups (x’s, o’s, and diamonds) that we have tried to cluster into cluster 1, cluster 2, and cluster 3. Some mistakes were made: cluster 1 contains five x’s and one o; cluster 2 contains one x, four o’s, and one diamond; cluster 3 contains two x’s and three diamonds. Now we quantify the TPs, FPs, TNs, and FNs.
We will consider all pairs of documents, of which there are N(N−1)/2=136, since we have N=17 documents.
Now for TP+FP (all positives), we need to count all pairs of documents (not necessarily of matching types) that sit in the same cluster. Cluster 1 holds 6 documents, so it contributes $\binom{6}{2}$ pairs, and so on. This gives us
$TP + FP = \binom{6}{2} + \binom{6}{2} + \binom{5}{2} = 15 + 15 + 10 = 40$ total positives.
True positives are only the pairs that are of the same type. For example, the number of pairs of x’s in cluster 1 is $\binom{5}{2}$. This gives us
$TP = \binom{5}{2} + \binom{4}{2} + \binom{3}{2} + \binom{2}{2} = 10 + 6 + 3 + 1 = 20$
This leaves 40−20=20 FP’s.
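The pair counts above can be checked with a few lines of Python. This is only a sketch: the names `cluster_sizes` and `class_counts` are illustrative, and the per-cluster counts are read off the example.

```python
from math import comb  # requires Python 3.8+

# Cluster sizes from the example: 6, 6, and 5 documents.
cluster_sizes = [6, 6, 5]
# Number of documents of each type inside each cluster.
class_counts = [
    {"x": 5, "o": 1},
    {"x": 1, "o": 4, "diamond": 1},
    {"x": 2, "diamond": 3},
]

# Every pair inside a cluster is a positive decision.
positives = sum(comb(n, 2) for n in cluster_sizes)
# Same-type pairs inside a cluster are true positives (comb(1, 2) is 0).
tp = sum(comb(v, 2) for counts in class_counts for v in counts.values())
fp = positives - tp
print(positives, tp, fp)  # 40 20 20
```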
Now for the total number of negatives, which is not in the link I provided. The total negatives plus the total positives must equal the total number of pairs, N(N−1)/2=136, and thus totalNegatives = 136 − totalPositives. So there are 136−40=96 negatives in total.
The number of FN’s can be found by looking at pairs that should be grouped together but are not. I will do the x’s first. Cluster 1 has 5 x’s, each paired with the three x’s placed in other clusters (3*5=15), and cluster 2 has 1 x, which is paired with the two x’s in cluster 3 that have not yet been accounted for (2*1=2). The o’s are handled the same way: cluster 1 has one o, which is paired with the 4 o’s in cluster 2 (1*4=4). Now for the diamonds: cluster 2 has one diamond, which is paired with the 3 diamonds in cluster 3 (1*3=3). Adding them up,
FN=3∗5+2∗1+1∗4+1∗3=24
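One way to double-check this count, as a sketch: every pair of same-type documents is either a TP or an FN, so FN equals the number of same-type pairs minus TP. The totals of 8 x’s, 5 o’s, and 4 diamonds are read off the example, and the variable names are illustrative.

```python
from math import comb  # requires Python 3.8+

# Total number of documents of each type across all clusters.
class_totals = {"x": 8, "o": 5, "diamond": 4}
tp = 20  # the TP count derived earlier

# All pairs of documents that share a type, regardless of cluster.
same_type_pairs = sum(comb(n, 2) for n in class_totals.values())
fn = same_type_pairs - tp
print(same_type_pairs, fn)  # 44 24
```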
Since total negatives is 96, there must be 96-24=72 TN’s.
The final confusion matrix is (as per the website):

                     Same cluster    Different clusters
    Same type        TP = 20         FN = 24
    Different types  FP = 20         TN = 72
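And as Karl said, precision and recall are:
$P = \frac{TP}{TP + FP} = \frac{20}{40} = 0.5$ and $R = \frac{TP}{TP + FN} = \frac{20}{44} \approx 0.45$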
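As a sanity check on all four counts, the pairs can also be enumerated by brute force. The sketch below assumes each document is represented as a (cluster id, type) tuple, which is my own encoding of the example, and classifies every one of the 136 pairs directly from the definitions.

```python
from itertools import combinations

# (cluster id, type) for each of the 17 documents in the example.
docs = (
    [(1, "x")] * 5 + [(1, "o")]
    + [(2, "x")] + [(2, "o")] * 4 + [(2, "diamond")]
    + [(3, "x")] * 2 + [(3, "diamond")] * 3
)

counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
for (cluster_a, type_a), (cluster_b, type_b) in combinations(docs, 2):
    same_cluster = cluster_a == cluster_b
    same_type = type_a == type_b
    if same_cluster:
        counts["TP" if same_type else "FP"] += 1
    else:
        counts["FN" if same_type else "TN"] += 1

precision = counts["TP"] / (counts["TP"] + counts["FP"])
recall = counts["TP"] / (counts["TP"] + counts["FN"])
print(counts)
print(round(precision, 2), round(recall, 2))
```

This is quadratic in the number of documents, so it is only practical for small examples, but it makes no use of the counting shortcuts and so serves as an independent check.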
For the example above, let the labels a, b, and c stand for x, o, and diamond respectively. A Python implementation is as follows:
from collections import Counter

# One inner list per cluster; 'a', 'b', 'c' stand for x, o, diamond.
cluster_list = [['a','a','a','a','a','b'],
                ['a','b','b','b','b','c'],
                ['a','a','c','c','c']]

num_doc = 0
positives = 0
TP = 0
c_list = []  # per-cluster label counts

for cluster in cluster_list:
    n = len(cluster)
    # calculating num_doc count...
    num_doc += n
    # calculating positives: every pair inside a cluster (TP or FP)
    positives += n * (n - 1) // 2
    # calculating TP: pairs inside a cluster that share a label
    counts = Counter(cluster)
    c_list.append(counts)
    for v in counts.values():
        TP += v * (v - 1) // 2

FP = positives - TP
# All remaining pairs are negatives (TN or FN).
negatives = num_doc * (num_doc - 1) // 2 - positives

# Add all the clusters together: total documents per label.
remaining = Counter()
for counts in c_list:
    remaining += counts

# calculating FN: same-label pairs split across clusters.
# Removing each cluster once it is processed avoids counting a pair twice.
FN = 0
for counts in c_list:
    for k, v in counts.items():
        FN += v * (remaining[k] - v)
    remaining -= counts

TN = negatives - FN

print("num_doc is %d " % num_doc)
print("positives is %d " % positives)
print("TP is %d " % TP)
print("FP is %d " % FP)
print("FN is %d " % FN)
print("TN is %d " % TN)
Precision = TP / (TP + FP)
print("Precision is %.2f " % Precision)
Recall = TP / (TP + FN)
print("Recall is %.2f " % Recall)
F1 = (2 * Recall * Precision) / (Recall + Precision)
print("F1 is %.2f " % F1)
Criticism is welcome.
Reference link:
Precision and recall for clustering?