The tree of a hierarchical clustering can be produced either bottom-up, by starting with the individual objects and grouping the most similar ones, or top-down, by starting with all the objects and dividing them into groups. The bottom-up algorithm is also called agglomerative clustering, while the top-down one is called divisive clustering. The ideas of the two are quite similar, except that the former merges similar clusters while the latter splits dissimilar clusters.
Agglomerative clustering is a greedy algorithm that starts with a separate cluster for each object. In each step, the two most similar clusters are determined and merged into a new cluster. The algorithm stops when a certain stopping criterion is met. Usually, it stops when one large cluster containing all objects is formed. Figure 1 is the pseudo code of the agglomerative clustering.
Figure 1: agglomerative hierarchical clustering
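The pseudo code referenced above is not reproduced here, but the algorithm it describes can be sketched in Python as follows. The similarity function below is a hypothetical stand-in (average pairwise similarity over 1-D values, using negated absolute difference); a real application would substitute its own object-similarity measure.

```python
def similarity(c1, c2):
    """Average-link cluster similarity: mean of pairwise object similarities.
    Pairwise similarity here is the negated absolute difference (a hypothetical
    stand-in for a domain-specific measure)."""
    return sum(-abs(a - b) for a in c1 for b in c2) / (len(c1) * len(c2))

def agglomerative(objects):
    """Greedy bottom-up clustering: repeatedly merge the two most similar
    clusters until one cluster remains. Returns the merge history, which
    encodes the tree as an ordered list of merges."""
    clusters = [[x] for x in objects]   # start with one cluster per object
    history = []
    while len(clusters) > 1:
        # find the indices of the most similar pair of clusters
        i, j = max(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda p: similarity(clusters[p[0]], clusters[p[1]]),
        )
        history.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history
```

Running `agglomerative([1, 2, 10])` first merges the two closest objects, 1 and 2, and then merges the result with 10, yielding the full tree as a two-step history.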
Besides the algorithm itself, there is a subtlety in how to compute group similarity based on individual object similarity. Three similarity functions are commonly used, as shown in Table 1.
Function | Definition
single link | similarity of the closest pair
complete link | similarity of the farthest pair
average link | average similarity between all pairs
Table 1: similarity functions used in clustering
In single link clustering, the similarity between two clusters is the similarity of the two closest objects in the clusters. Clusters based on this function have good local coherence, since the similarity function is locally defined. However, it can result in bad global quality, since it has no way to take the global context into account. As opposed to the locally coherent clusters of single-link clustering, complete-link clustering has a similarity function that focuses on global cluster quality. The result can also be viewed as "tight" clusters, since the similarity comes from the two most dissimilar members. Both of these functions are based on individual objects, and thus are sensitive to outliers. The average link function, however, is immune to such sensitivity, since it is based on a group decision. As we see, these functions have their own advantages and are good for different applications. Nevertheless, in most patient clustering applications, global coherence is preferable to local coherence. In terms of computational complexity, single link clustering is O(n²), but complete link is O(n³). The average link similarity can be computed efficiently in some cases, so that the complexity of the algorithm is only O(n²). Therefore, the average link function can be an efficient alternative to the complete link function while avoiding the bad global coherence of the single link function.
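The three similarity functions from Table 1 can be sketched directly. As above, the pairwise object similarity used here (negated absolute difference on numbers) is a hypothetical stand-in for whatever measure a real application would use; only the max/min/mean aggregation differs between the three functions.

```python
def pair_sim(a, b):
    # hypothetical object similarity: negated absolute difference
    return -abs(a - b)

def single_link(c1, c2):
    # similarity of the closest pair (largest similarity)
    return max(pair_sim(a, b) for a in c1 for b in c2)

def complete_link(c1, c2):
    # similarity of the farthest pair (smallest similarity)
    return min(pair_sim(a, b) for a in c1 for b in c2)

def average_link(c1, c2):
    # average similarity between all pairs
    sims = [pair_sim(a, b) for a in c1 for b in c2]
    return sum(sims) / len(sims)
```

For example, with clusters [0, 1] and [2, 5], single link reports the closest pair (1 and 2), complete link the farthest pair (0 and 5), and average link the mean over all four pairs, which illustrates why single link ignores global context while complete link is dominated by the most dissimilar members.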