MST Review
Input: Undirected graph G = (V, E), edge costs c_e.
Output: Min-cost spanning tree (no cycles, connected).
Assumptions: G is connected, distinct edge costs.
Cut Property: If e is the cheapest edge crossing some cut (A, B), then e belongs to the MST.
Kruskal’s MST Algorithm
Sort edges in order of increasing cost.
[Rename edges 1, 2, …, m so that c_1 < c_2 < … < c_m.]
− T = ∅
− For i = 1 to m:  [O(m) iterations]
−− If T ∪ {i} has no cycles:  [O(n) time to check for a cycle]
[Use BFS or DFS in the graph (V, T), which contains ≤ n−1 edges.]
−−− Add i to T.
− Return T.
Running time of straightforward implementation:
[m = # of edges, n = # of vertices]  O(m log n) + O(mn) = O(mn).
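To make the O(mn) bound concrete, here is a minimal Python sketch of the straightforward implementation (the (cost, u, v) edge representation and function names are my own, not from the lecture):

```python
def kruskal_slow(vertices, edges):
    T = []
    adj = {v: [] for v in vertices}   # adjacency lists of the graph (V, T)
    for cost, u, v in sorted(edges):  # O(m log n) for sorting
        if not connected(adj, u, v):  # O(n) DFS cycle check per edge => O(mn)
            T.append((u, v))
            adj[u].append(v)
            adj[v].append(u)
    return T

def connected(adj, source, target):
    # DFS in (V, T); T never has more than n-1 edges, so this is O(n) time.
    stack, seen = [source], {source}
    while stack:
        x = stack.pop()
        if x == target:
            return True
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False
```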
Plan: Data structure for O(1)-time cycle checks → O(m log n) time.
The Union-Find Data Structure
Maintain partition of a set of objects.
FIND(X): Return name of group that X belongs to.
UNION(C_i, C_j): Fuse groups C_i and C_j into a single one.
− Objects = vertices
− Groups = connected components w.r.t. the chosen edges T
− Adding a new edge (u, v) to T fuses the components of u and v into one.
Motivation: O(1)-time cycle checks in Kruskal’s algorithm.
Idea #1: Maintain one linked structure per connected component of (V, T).
−− Each component has an arbitrary leader vertex.
Invariant: Each vertex points to the leader of its component.
Key point: Given edge (u, v), can check whether u & v are already in the same component in O(1) time.
[They are if and only if the leader pointers of u, v match, i.e., FIND(u) = FIND(v).] → O(1)-time cycle checks!
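A Python sketch of Idea #1 (class and field names are my own; this is the simple "quick-find" variant, assuming objects are hashable labels):

```python
class QuickFindUnionFind:
    """Maintains a partition; FIND is one pointer lookup, UNION relabels a group."""
    def __init__(self, objects):
        self.leader = {x: x for x in objects}     # each object starts as its own leader
        self.members = {x: [x] for x in objects}  # group contents, keyed by leader

    def find(self, x):
        return self.leader[x]                     # O(1): follow the leader pointer

    def union(self, ci, cj):
        # Relabel the smaller group; this keeps the total work for
        # leader pointer updates at O(n log n) (see below).
        if len(self.members[ci]) < len(self.members[cj]):
            ci, cj = cj, ci
        for x in self.members[cj]:
            self.leader[x] = ci
        self.members[ci].extend(self.members[cj])
        del self.members[cj]
```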
Running Time of Fast Implementation
Scorecard:
− O(m log n) time for sorting
− O(m) time for cycle checks [O(1) per iteration]
− O(n log n) time overall for leader pointer updates
[When two components merge, relabel the smaller one's pointers; each vertex's leader then changes at most log₂ n times, since its component at least doubles with each relabeling.]
→ O(m log n) total (matching Prim’s algorithm)
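Putting the pieces together, a sketch of the fast implementation (reusing the hypothetical QuickFindUnionFind class from the sketch above):

```python
def kruskal_fast(vertices, edges):
    uf = QuickFindUnionFind(vertices)
    T = []
    for cost, u, v in sorted(edges):          # O(m log n) for sorting
        if uf.find(u) != uf.find(v):          # O(1) cycle check
            T.append((u, v))
            uf.union(uf.find(u), uf.find(v))  # O(n log n) total across all merges
            if len(T) == len(vertices) - 1:   # tree complete; stop early
                break
    return T
```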
Clustering [unsupervised learning]
Informal goal: Given n “points” [Web pages, images, genome fragments, etc.], classify them into “coherent groups”.
Assumptions:
(1) As input, given a (dis)similarity measure:
a distance d(p, q) between each point pair.
(2) Symmetric [d(p, q) = d(q, p)]
Examples: Euclidean distance, genome similarity, etc.
Goal: Same cluster ⇔ “nearby”
Max-Spacing k-Clusterings
Assume: We know k := # of clusters desired.
[In practice, can experiment with a range of values]
Call points p & q separated if they’re assigned to different clusters.
Definition: The spacing of a k-clustering is min_{separated p, q} d(p, q).
Problem statement: Given a distance measure d and k, compute the k-clustering with maximum spacing.
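The spacing definition translates directly into code. A small Python sketch (names are mine; `clusters` is a list of point collections, `d` any symmetric distance function):

```python
from itertools import combinations

def spacing(clusters, d):
    # Minimum distance over all separated pairs, i.e., pairs in different clusters.
    return min(d(p, q)
               for A, B in combinations(clusters, 2)
               for p in A for q in B)
```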
A Greedy Algorithm
− Initially, each point in a separate cluster.
− Repeat until only k clusters remain:
−− Let p, q = closest pair of separated points.
(This pair determines the current spacing.)
−− Merge the clusters containing p & q into a single cluster.
Note: Just like Kruskal’s MST algorithm, but stopped early.
Points ↔ vertices, distances ↔ edge costs, point pairs ↔ edges.
⇒ Called single-link clustering.
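A Python sketch of single-link clustering along these lines, reusing the hypothetical QuickFindUnionFind class from above (points play the role of vertices, sorted point pairs the role of sorted edges):

```python
from itertools import combinations

def single_link_clustering(points, d, k):
    uf = QuickFindUnionFind(points)
    num_clusters = len(points)
    # Process point pairs as Kruskal processes edges: cheapest first.
    for p, q in sorted(combinations(points, 2), key=lambda pq: d(*pq)):
        if num_clusters == k:                 # stopped early, unlike full Kruskal
            break
        if uf.find(p) != uf.find(q):          # p, q currently separated
            uf.union(uf.find(p), uf.find(q))  # merge their clusters
            num_clusters -= 1
    return list(uf.members.values())          # the k clusters
```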