Agglomerative Clustering

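This article takes a close look at the two ways of implementing hierarchical clustering: bottom-up agglomerative clustering and top-down divisive clustering. Using concrete examples and pseudocode, it explains how the two methods work and what their strengths and weaknesses are, and compares them in terms of computational complexity and cluster quality.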


 

The tree of a hierarchical clustering can be produced either bottom-up, by starting with the individual objects and grouping the most similar ones, or top-down, by starting with all the objects and dividing them into groups. The bottom-up algorithm is also called agglomerative clustering, while the top-down one is called divisive clustering. The two ideas are quite similar, except that the former merges similar clusters while the latter splits dissimilar ones.

 

Agglomerative clustering is a greedy algorithm that starts with a separate cluster for each object. In each step, the two most similar clusters are determined and merged into a new cluster. The algorithm stops when a certain stop criterion is met; usually, it stops when one large cluster containing all objects has been formed. Figure 1 gives the pseudocode of agglomerative clustering.

 

 

            

 

Figure 1: Agglomerative hierarchical clustering
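The greedy loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reproduction of the pseudocode in Figure 1: the names `agglomerate`, `closest_pair`, and `neg_distance` are made up for this example, the stop criterion is assumed to be a target number of clusters (the default of 1 matches the usual case), and the similarity of two numbers is assumed to be their negated distance.

```python
from itertools import combinations


def agglomerate(objects, sim, group_sim, num_clusters=1):
    """Greedy bottom-up clustering (illustrative sketch).

    sim(x, y)            -> similarity of two objects (higher = more similar)
    group_sim(a, b, sim) -> similarity of two clusters (lists of objects)
    num_clusters         -> assumed stop criterion: stop at this many clusters
    """
    target = max(num_clusters, 1)          # never try to go below one cluster
    clusters = [[obj] for obj in objects]  # start: one cluster per object
    merges = []                            # merge history, i.e. the tree bottom-up
    while len(clusters) > target:
        # Determine the two most similar clusters.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda pair: group_sim(clusters[pair[0]], clusters[pair[1]], sim),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # Replace the two clusters with the merged one.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges


def closest_pair(a, b, sim):
    """Group similarity taken from the most similar cross-cluster pair."""
    return max(sim(x, y) for x in a for y in b)


def neg_distance(x, y):
    """Hypothetical similarity for numbers: closer values score higher."""
    return -abs(x - y)


points = [1.0, 1.5, 5.0, 5.2, 9.0]
clusters, merges = agglomerate(points, neg_distance, closest_pair, num_clusters=2)
print(clusters)
```

The sketch recomputes every pairwise group similarity on each pass, so it is meant to show the control flow rather than to be efficient.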

 

Besides the algorithm itself, there is the question of how to compute group similarity from the similarity of individual objects. Three similarity functions are commonly used, as shown in Table 1.

 
 

Function         Definition
single link      similarity of the closest pair
complete link    similarity of the farthest pair
average link     average similarity between all pairs

 

Table 1: Similarity functions used in clustering
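The three definitions in Table 1 translate almost directly into code. The sketch below assumes clusters are lists of objects and that `sim(x, y)` returns a larger value for more similar objects; the function names are illustrative. "All pairs" in the average link definition is interpreted here as all cross-cluster pairs, which is an assumption, since the table does not say whether pairs inside each cluster are included.

```python
def single_link(a, b, sim):
    """Similarity of the closest (most similar) cross-cluster pair."""
    return max(sim(x, y) for x in a for y in b)


def complete_link(a, b, sim):
    """Similarity of the farthest (least similar) cross-cluster pair."""
    return min(sim(x, y) for x in a for y in b)


def average_link(a, b, sim):
    """Average similarity over all cross-cluster pairs (assumed reading)."""
    values = [sim(x, y) for x in a for y in b]
    return sum(values) / len(values)


def neg_distance(x, y):
    """Hypothetical similarity for numbers: closer values score higher."""
    return -abs(x - y)


a, b = [1.0, 2.0], [4.0, 7.0]
print(single_link(a, b, neg_distance))    # pair (2.0, 4.0)
print(complete_link(a, b, neg_distance))  # pair (1.0, 7.0)
print(average_link(a, b, neg_distance))   # mean over the four pairs
```

Because single link and complete link each reduce to one pair of objects, a single unusual object can decide the result, while the average spreads the decision over every pair.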

 

In single link clustering, the similarity between two clusters is the similarity of the two closest objects in the clusters. Clusters based on this function have good local coherence, since the similarity function is locally defined. However, it can result in poor global quality, for example elongated, chain-like clusters, since it has no way to take the global context into account. In contrast to the locally coherent clusters of single link clustering, complete link clustering has a similarity function that focuses on global cluster quality. Its result can be viewed as "tight" clusters, since the similarity comes from the two most dissimilar members.

Both of these functions depend on a single pair of objects and are thus sensitive to outliers. The average link function is much less sensitive, since it is based on a group decision over all pairs. As we see, these functions have their own advantages and suit different applications. Nevertheless, in most patient clustering applications, global coherence is preferable to local coherence.

In terms of computational complexity, single link clustering is O(n^2), but complete link is O(n^3). The average link similarity can be computed efficiently in some cases, so that the complexity of the algorithm is only O(n^2). Therefore, the average link function can be an efficient alternative to the complete link function while avoiding the poor global coherence of the single link function.
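The text does not say in which cases the average link similarity can be computed efficiently. One case where it clearly can, offered here as an assumption rather than as the author's intended setting, is when objects are vectors and similarity is the dot product: the sum of all cross-cluster similarities then equals the dot product of the two clusters' sum vectors, so the average needs only the cached sums and sizes. A minimal sketch with illustrative names:

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(u, v))


def vector_sum(cluster):
    """Component-wise sum of all vectors in a cluster."""
    return [sum(column) for column in zip(*cluster)]


def average_link_brute(a, b):
    """Average dot product over all cross-cluster pairs: |a| * |b| products."""
    return sum(dot(x, y) for x in a for y in b) / (len(a) * len(b))


def average_link_from_sums(sum_a, size_a, sum_b, size_b):
    """Same value from cached sum vectors: a single dot product."""
    return dot(sum_a, sum_b) / (size_a * size_b)


a = [(1, 0), (0, 1), (1, 1)]
b = [(2, 1), (0, 3)]
slow = average_link_brute(a, b)
fast = average_link_from_sums(vector_sum(a), len(a), vector_sum(b), len(b))
print(slow, fast)
```

When two clusters are merged, the new sum vector is just the sum of the two old ones, so the cached values can be kept up to date without revisiting individual objects.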
