MST Review
Input: Undirected graph G = (V, E), edge costs c_e.
Output: Min-cost spanning tree (no cycles, connected).
Assumptions: G is connected, distinct edge costs.
Cut Property: If e is the cheapest edge crossing some cut (A, B), then e belongs to the MST.
Kruskal’s MST Algorithm
Sort edges in order of increasing cost.
[Rename edges 1, 2, …, m so that c_1 < c_2 < … < c_m.]
− T = ∅
− For i = 1 to m:  [O(m) iterations]
−− If T ∪ {i} has no cycles:  [O(n) time to check for a cycle]
[Use BFS or DFS in the graph (V, T), which contains ≤ n−1 edges.]
−−− Add i to T.
− Return T.
Running time of straightforward implementation:
[m = # of edges, n = # of vertices]  O(m log n) + O(mn) = O(mn).
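To make the O(mn) bound concrete, here is a minimal Python sketch of the straightforward implementation (the (cost, u, v) edge representation and function names are my own, not from the lecture):

```python
def kruskal_slow(vertices, edges):
    T = []
    adj = {v: [] for v in vertices}   # adjacency lists of the graph (V, T)
    for cost, u, v in sorted(edges):  # O(m log n) for sorting
        if not connected(adj, u, v):  # O(n) DFS cycle check per edge => O(mn)
            T.append((u, v))
            adj[u].append(v)
            adj[v].append(u)
    return T

def connected(adj, source, target):
    # DFS in (V, T); T never has more than n-1 edges, so this is O(n) time.
    stack, seen = [source], {source}
    while stack:
        x = stack.pop()
        if x == target:
            return True
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False
```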
Plan: Data structure for O(1)-time cycle checks → O(m log n) time.
The Union-Find Data Structure
Maintain partition of a set of objects.
FIND(X): Return name of group that X belongs to.
UNION(C_i, C_j): Fuse groups C_i and C_j into a single one.
− Objects = vertices
− Groups = connected components w.r.t. the chosen edges T
− Adding a new edge (u, v) to T fuses the components of u and v into one.
Motivation: O(1)-time cycle checks in Kruskal’s algorithm.
Idea #1: Maintain one linked structure per connected component of (V, T).
−− Each component has an arbitrary leader vertex.
Invariant: Each vertex points to the leader of its component.
Key point: Given edge (u, v), can check whether u & v are already in the same component in O(1) time.
[They are if and only if the leader pointers of u, v match, i.e., FIND(u) = FIND(v).] → O(1)-time cycle checks!
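A Python sketch of Idea #1 (class and field names are my own; this is the simple "quick-find" variant, assuming objects are hashable labels):

```python
class QuickFindUnionFind:
    """Maintains a partition; FIND is one pointer lookup, UNION relabels a group."""
    def __init__(self, objects):
        self.leader = {x: x for x in objects}     # each object starts as its own leader
        self.members = {x: [x] for x in objects}  # group contents, keyed by leader

    def find(self, x):
        return self.leader[x]                     # O(1): follow the leader pointer

    def union(self, ci, cj):
        # Relabel the smaller group; this keeps the total work for
        # leader pointer updates at O(n log n) (see below).
        if len(self.members[ci]) < len(self.members[cj]):
            ci, cj = cj, ci
        for x in self.members[cj]:
            self.leader[x] = ci
        self.members[ci].extend(self.members[cj])
        del self.members[cj]
```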
Running Time of Fast Implementation
Scorecard:
− O(m log n) time for sorting
− O(m) time for cycle checks [O(1) per iteration]
− O(n log n) time overall for leader pointer updates
[When two components merge, relabel the smaller one's pointers; each vertex's leader then changes at most log₂ n times, since its component at least doubles with each relabeling.]
→ O(m log n) total (matching Prim’s algorithm)
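Putting the pieces together, a sketch of the fast implementation (reusing the hypothetical QuickFindUnionFind class from the sketch above):

```python
def kruskal_fast(vertices, edges):
    uf = QuickFindUnionFind(vertices)
    T = []
    for cost, u, v in sorted(edges):          # O(m log n) for sorting
        if uf.find(u) != uf.find(v):          # O(1) cycle check
            T.append((u, v))
            uf.union(uf.find(u), uf.find(v))  # O(n log n) total across all merges
            if len(T) == len(vertices) - 1:   # tree complete; stop early
                break
    return T
```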
Clustering [unsupervised learning]
Informal goal: Given n “points” [Web pages, images, genome fragments, etc.], classify them into “coherent groups”.
Assumptions:
(1) As input, given a (dis)similarity measure:
a distance d(p, q) between each point pair.
(2) Symmetric [d(p, q) = d(q, p)]
Examples: Euclidean distance, genome similarity, etc.
Goal: Same cluster ⇔ “nearby”
Max-Spacing k-Clusterings
Assume: We know k := # of clusters desired.
[In practice, can experiment with a range of values]
Call points p & q separated if they’re assigned to different clusters.
Definition: The spacing of a k-clustering is min_{separated p, q} d(p, q).
Problem statement: Given a distance measure d and k, compute the k-clustering with maximum spacing.
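The spacing definition translates directly into code. A small Python sketch (names are mine; `clusters` is a list of point collections, `d` any symmetric distance function):

```python
from itertools import combinations

def spacing(clusters, d):
    # Minimum distance over all separated pairs, i.e., pairs in different clusters.
    return min(d(p, q)
               for A, B in combinations(clusters, 2)
               for p in A for q in B)
```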
A Greedy Algorithm
− Initially, each point in a separate cluster.
− Repeat until only k clusters remain:
−− Let p, q = closest pair of separated points.
(This pair determines the current spacing.)
−− Merge the clusters containing p & q into a single cluster.
Note: Just like Kruskal’s MST algorithm, but stopped early.
Points ↔ vertices, distances ↔ edge costs, point pairs ↔ edges.
⇒ Called single-link clustering.
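A Python sketch of single-link clustering along these lines, reusing the hypothetical QuickFindUnionFind class from above (points play the role of vertices, sorted point pairs the role of sorted edges):

```python
from itertools import combinations

def single_link_clustering(points, d, k):
    uf = QuickFindUnionFind(points)
    num_clusters = len(points)
    # Process point pairs as Kruskal processes edges: cheapest first.
    for p, q in sorted(combinations(points, 2), key=lambda pq: d(*pq)):
        if num_clusters == k:                 # stopped early, unlike full Kruskal
            break
        if uf.find(p) != uf.find(q):          # p, q currently separated
            uf.union(uf.find(p), uf.find(q))  # merge their clusters
            num_clusters -= 1
    return list(uf.members.values())          # the k clusters
```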