|
Foundations and Trends® in Machine Learning, Vol. 2, No. 3 (2009) 235–274. © 2010 U. von Luxburg. DOI: 10.1561/2200000008
|
Clustering Stability: An Overview |
|
By Ulrike von Luxburg |
|
Contents |
|
1 Introduction |
|
2 Clustering Stability: Definition and Implementation |
|
3 Stability Analysis of the K-Means Algorithm |
|
3.1 The Idealized K-Means Algorithm

3.2 The Actual K-Means Algorithm

3.3 Relationships between the results
|
4 Beyond K-Means |
|
5 Outlook |
|
References |
|
|
Clustering Stability: An Overview |
|
Ulrike von Luxburg |
|
Max Planck Institute for Biological Cybernetics, Tübingen, Germany, ulrike.luxburg@tuebingen.mpg.de
|
Abstract |
|
A popular method for selecting the number of clusters is based on stability arguments: one chooses the number of clusters such that the corresponding clustering results are "most stable". In recent years, a series of papers has analyzed the behavior of this method from a theoretical point of view. However, the results are very technical and difficult to interpret for non-experts. In this monograph we give a high-level overview of the existing literature on clustering stability. In addition to presenting the results in a slightly informal but accessible way, we relate them to each other and discuss their different implications.
|
1 Introduction
|
Model selection is a difficult problem in non-parametric clustering. The obvious reason is that, as opposed to supervised classification, there is no ground truth against which we could "test" our clustering results. One of the most pressing questions in practice is how to determine the number of clusters. Various ad hoc methods have been suggested in the literature, but none of them is entirely convincing. These methods usually suffer from the fact that they implicitly have to define "what a clustering is" before they can assign different scores to different numbers of clusters. In recent years a new method has become increasingly popular: selecting the number of clusters based on clustering stability. Instead of defining "what is a clustering", the basic philosophy is simply that a clustering should be a structure on the data set that is "stable". That is, if applied to several data sets from the same underlying model or the same data-generating process, a clustering algorithm should obtain similar results. In this philosophy it is not so important how the clusters look (this is taken care of by the clustering algorithm), but that they can be constructed in a stable manner.

The basic intuition of why people believe that this is a good principle can be described by Figure 1.1. Shown is a data distribution with four underlying clusters (depicted by the black circles), and different samples from this distribution (depicted by red diamonds). If we cluster this data set into K = 2 clusters, there are two reasonable solutions: a horizontal and a vertical split. If a clustering algorithm is applied repeatedly to different samples from this distribution, it might sometimes construct the horizontal and sometimes the vertical solution. Obviously, these two solutions are very different from each other, hence the clustering results are instable. Similar effects take place if we start with K = 5. In this case, we necessarily have to split an existing cluster into two clusters, and depending on the sample this could happen to any of the four clusters. Again the clustering solution is instable. Finally, if we apply the algorithm with the correct number K = 4, we observe stable results (not shown in the figure): the clustering algorithm always discovers the correct clusters (maybe up to a few outlier points). In this example, the stability principle detects the correct number of clusters.

Fig. 1.1 Idea of clustering stability. Instable clustering solutions if the number of clusters is too small (first row) or too large (second row). See text for details.

At first glance, using stability-based principles for model selection appears to be very attractive. It is elegant as it avoids having to define what a good clustering is. It is a meta-principle that can be applied to any basic clustering algorithm and does not require a particular clustering model. Finally, it sounds "very fundamental" from a philosophy of inference point of view.
|
|
However, the longer one thinks about this principle, the less obvious it becomes that model selection based on clustering stability "always works". What is clear is that solutions that are completely instable should not be considered at all. However, if there are several stable solutions, is it always the best choice to select the one corresponding to the most stable results? One could conjecture that the most stable parameter always corresponds to the simplest solution, but clearly there exist situations where the simplest solution is not what we are looking for. To find out how model selection based on clustering stability works we need theoretical results.

In this monograph we discuss a series of theoretical results on clustering stability that have been obtained in recent years. In Section 2 we review different protocols for how clustering stability is computed and used for model selection. In Section 3 we concentrate on theoretical results for the K-means algorithm and discuss their various relations. This is the main section of the paper. Results for more general clustering algorithms are presented in Section 4.
|
2 Clustering Stability: Definition and Implementation
|
A clustering C_K of a data set S = {X_1, . . . , X_n} is a function that assigns labels to all points of S, that is C_K : S → {1, . . . , K}. Here K denotes the number of clusters. A clustering algorithm is a procedure that takes a set S of points as input and outputs a clustering of S. The clustering algorithms considered in this monograph take an additional parameter as input, namely the number K of clusters they are supposed to construct.

We analyze clustering stability in a statistical setup. The data set S is assumed to consist of n data points X_1, . . . , X_n that have been drawn independently from some unknown underlying distribution P on some space X. The final goal is to use these sample points to construct a good partition of the underlying space X. For some theoretical results it will be easier to ignore sampling effects and directly work on the underlying space X endowed with the probability distribution P. This can be considered as the case of having "infinitely many" data points. We sometimes call this the limit case for n → ∞.

Assume we agree on a way to compute distances d(C, C') between different clusterings C and C' (see below for details). Then, for a fixed probability distribution P, a fixed number K of clusters and a fixed sample size n, the instability of a clustering algorithm is defined as the expected distance between two clusterings C_K(S_n), C_K(S'_n) on different data sets S_n, S'_n of size n, that is:

$$\mathrm{Instab}(K, n) := \mathbb{E}\, d\big(C_K(S_n), C_K(S'_n)\big). \tag{2.1}$$
|
The expectation is taken with respect to the drawing of the two samples. In practice, a large variety of methods has been devised to compute stability scores and use them for model selection. On a very general level they work as follows:
|
Given: a set S of data points, a clustering algorithm A that takes the number k of clusters as input |
|
(1) For k = 2, . . . , k_max:

    (a) Generate perturbed versions S_b (b = 1, . . . , b_max) of the original data set (for example by subsampling or adding noise, see below).
    (b) For b = 1, . . . , b_max: cluster the data set S_b with algorithm A into k clusters to obtain the clustering C_b.
    (c) For b, b' = 1, . . . , b_max: compute the pairwise distances d(C_b, C_b') between these clusterings (using one of the distance functions described below).
    (d) Compute the instability as the mean distance between the clusterings C_b:

$$\mathrm{Instab}(k, n) = \frac{1}{b_{\max}^2} \sum_{b, b' = 1}^{b_{\max}} d(C_b, C_{b'}).$$
|
(2) Choose the parameter k that gives the best stability, in the simplest case as follows:

$$K := \operatorname*{argmin}_{k} \mathrm{Instab}(k, n)$$

(see below for more options).
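To make the protocol concrete, the following is a minimal Python sketch (an illustration, not code from any of the papers discussed here). It assumes scikit-learn's KMeans as the base algorithm A, subsampling without replacement as the perturbation scheme, and the minimal matching distance computed on the overlap of each pair of subsamples as the distance d; the helper names minimal_matching_distance, instability and select_k are our own.

```python
# Sketch of the generic stability protocol above (illustrative assumptions only).
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans


def minimal_matching_distance(labels_a, labels_b, k):
    """Fraction of points whose labels disagree under the best label permutation."""
    n = len(labels_a)
    # confusion[i, j] = number of points with label i in clustering A and j in B
    confusion = np.zeros((k, k))
    for a, b in zip(labels_a, labels_b):
        confusion[a, b] += 1
    # Hungarian algorithm: find the label permutation maximizing agreement
    row, col = linear_sum_assignment(-confusion)
    return 1.0 - confusion[row, col].sum() / n


def instability(data, k, b_max=20, subsample_frac=0.8, rng=None):
    """Mean pairwise distance between clusterings of perturbed data sets (step 1)."""
    rng = np.random.default_rng(rng)
    n = len(data)
    m = int(subsample_frac * n)
    subsets, labelings = [], []
    for _ in range(b_max):
        idx = rng.choice(n, size=m, replace=False)       # step (a): subsample
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(data[idx])  # step (b)
        subsets.append(idx)
        labelings.append(dict(zip(idx, labels)))
    dists = []
    for i in range(b_max):                               # steps (c)-(d)
        for j in range(i + 1, b_max):
            common = np.intersect1d(subsets[i], subsets[j])
            la = np.array([labelings[i][p] for p in common])
            lb = np.array([labelings[j][p] for p in common])
            dists.append(minimal_matching_distance(la, lb, k))
    return float(np.mean(dists))


def select_k(data, k_max=10):
    """Step (2): choose the k with the smallest (unnormalized) instability."""
    scores = {k: instability(data, k) for k in range(2, k_max + 1)}
    return min(scores, key=scores.get), scores
```

For example, select_k(data) returns the chosen number of clusters together with the raw instability scores; the normalization issues discussed later in this section are deliberately ignored in this sketch.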
|
This scheme gives a very rough overview of how clustering stability can be used for model selection. In practice, many details have to be taken into account, and they will be discussed in the next section. Finally, we want to mention an approach that is vaguely related to clustering stability, namely the ensemble method [26]. Here, an ensemble of algorithms is applied to one fixed data set. Then a final clustering is built from the results of the individual algorithms. We are not going to discuss this approach in our monograph.
|
Generating perturbed versions of the data set. To be able to evaluate the stability of a fixed clustering algorithm we need to run the clustering algorithm several times on slightly different data sets. To this end we need to generate perturbed versions of the original data set. In practice, the following schemes have been used: |
|
• Draw a random subsample of the original data set without replacement [5, 12, 15, 17].
• Add random noise to the original data points [8, 19].
• If the original data set is high-dimensional, use different random projections in low-dimensional spaces, and then cluster the low-dimensional data sets [25].
• If we work in a model-based framework, sample data from the model [14].
• Draw a random sample of the original data with replacement. This approach has not been reported in the literature yet, but it avoids the problem of setting the size of the subsample. For good reasons, this kind of sampling is the standard in the bootstrap literature [11] and might also have advantages in the stability setting. This scheme requires that the algorithm can deal with weighted data points, because some data points will occur several times in the sample; a sketch of how this can be handled follows below.
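As an illustration of the last scheme, here is a minimal sketch (our own, not code from the literature) that collapses a bootstrap sample into unique points with multiplicities and passes these as weights to scikit-learn's KMeans, whose fit method accepts a sample_weight argument.

```python
# Bootstrap perturbation with duplicates converted into point weights (a sketch).
import numpy as np
from sklearn.cluster import KMeans

def bootstrap_kmeans(data, k, rng=None):
    """Cluster one bootstrap replicate, treating duplicates as point weights."""
    rng = np.random.default_rng(rng)
    n = len(data)
    # Draw n points with replacement and count how often each point occurs
    idx = rng.choice(n, size=n, replace=True)
    unique_idx, counts = np.unique(idx, return_counts=True)
    km = KMeans(n_clusters=k, n_init=10)
    # Repeated points enter the objective through their multiplicities
    km.fit(data[unique_idx], sample_weight=counts)
    return km.cluster_centers_, unique_idx, km.labels_
```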
|
In all cases, there is a trade-off that has to be treated carefully. If we change the data set too much (for example, the subsample is too small, or the noise too large), then we might destroy the structure we want to discover by clustering. If we change the data set too little, then the clustering algorithm will always obtain the same results, and we will observe trivial stability. It is hard to quantify this trade-off in practice. |
|
Which clusterings to compare? Different protocols are used to compare the clusterings on the different data sets S_b.
|
• Compare the clustering of the original data set with the clusterings obtained on subsamples [17].
|
|
• Compare clusterings of overlapping subsamples on the data points where both clusterings are defined [5].
• Compare clusterings of disjoint subsamples [12, 15]. Here we first need to apply an extension operator to extend each clustering to the domain of the other clustering.
|
Distances between clusterings. If two clusterings are defined on the same data points, then it is straightforward to compute a distance score between these clusterings based on any of the well-known clustering distances such as the Rand index, Jaccard index, Hamming distance, minimal matching distance, and Variation of Information distance [18]. All these distances count, in some way or the other, points or pairs of points on which the two clusterings agree or disagree. The most convenient choice from a theoretical point of view is the minimal matching distance. For two clusterings C, C' of the same data set of n points it is defined as:

$$d_{\mathrm{MM}}(C, C') := \min_{\pi} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{C(X_i) \neq \pi(C'(X_i))\}}, \tag{2.2}$$
|
where the minimum is taken over all permutations π of the K labels. Intuitively, the minimal matching distance measures the same quantity as the 0–1-loss used in supervised classification. For a stability study involving the adjusted Rand index or an adjusted mutual information index see Vinh and Epps [27].

If two clusterings are defined on different data sets one has two choices. If the two data sets have a big overlap one can use a restriction operator to restrict the clusterings to the points that are contained in both data sets. On this restricted set one can then compute a standard distance between the two clusterings. The other possibility is to use an extension operator to extend both clusterings from their domain to the domain of the other clustering. Then one can compute a standard distance between the two clusterings as they are now both defined on the joint domain. For center-based clusterings, as constructed by the K-means algorithm, a natural extension operator exists. Namely, to a new data point we simply assign the label of the closest cluster center. A more general scheme to extend an existing clustering to new data points is to train a classifier on the old data points and use its predictions as labels on the new data points. However, in the context of clustering stability it is not obvious what kind of bias we introduce with this approach.
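The nearest-center extension operator just described is easy to state in code; the following minimal sketch (our own naming, assuming NumPy arrays of centers and points) illustrates it.

```python
# Nearest-center extension operator for center-based clusterings (a sketch).
import numpy as np

def extend_clustering(centers, new_points):
    """Assign each new point the label of the closest cluster center."""
    # distances[i, k] = squared Euclidean distance from point i to center k
    distances = ((new_points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return distances.argmin(axis=1)
```

With such an operator, two center-based clusterings constructed on different samples can both be extended to the union of the samples and then compared with any of the distances above.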
|
Stability scores and their normalization. The stability protocol outlined above results in a set of distance values (d(C_b, C_b'))_{b,b'=1,...,b_max}. In most approaches, one summarizes these values by taking their mean:

$$\mathrm{Instab}(k, n) = \frac{1}{b_{\max}^2} \sum_{b, b' = 1}^{b_{\max}} d(C_b, C_{b'}).$$
|
Note that the mean is the simplest summary statistic one can compute based on the distance values d(C_b, C_b'). A different approach is to use the area under the cumulative distribution function of the distance values as the stability score, see Ben-Hur et al. [5] or Bertoni and Valentini [6] for details. In principle one could also come up with more elaborate statistics based on the distance values. To the best of our knowledge, such concepts have not been used so far. The simplest way to select the number K of clusters is to minimize the instability:

$$K = \operatorname*{argmin}_{k=2,\ldots,k_{\max}} \mathrm{Instab}(k, n).$$
|
This approach has been suggested in Levine and Domany [17]. However, an important fact to note is that Instab(k, n) trivially scales with k, regardless of what the underlying data structure is. For example, in the top left plot in Figure 2.1 we can see that even for a completely unclustered data set, Instab(k, n) increases with k. When using stability for model selection, one should correct for this trivial scaling of Instab, otherwise it might be meaningless to take the minimum afterwards. There exist several different normalization protocols:
|
• Normalization using a reference null distribution [6, 12]. One repeatedly samples data sets from some reference null distribution. Such a distribution is defined on the same domain as the data points, but does not possess any cluster structure. In simple cases one can use the uniform distribution on the data domain as null distribution. A more practical approach is to scramble the individual dimensions of the existing data points and use the "scrambled points" as null distribution (see [6, 12] for details). Once we have drawn several data sets from the null distribution, we cluster them using our clustering algorithm and compute the corresponding stability score Instab_null as above. The normalized stability is then defined as Instab_norm := Instab/Instab_null.
• Normalization by random labels [15]. First, we cluster each of the data sets S_b as in the protocol above to obtain the clusterings C_b. Then, we randomly permute these labels. That is, we assign to data point X_i the label that belonged to X_{π(i)}, where π is a permutation of {1, . . . , n}. This leads to a permuted clustering C_{b,perm}. We then compute the stability score Instab as above, and similarly we compute Instab_perm for the permuted clusterings. The normalized stability is then defined as Instab_norm := Instab/Instab_perm.

Fig. 2.1 Normalized stability scores. Left plots: data points from a uniform density on [0, 1]^2. Right plots: data points from a mixture of four well-separated Gaussians in R^2. The first row always shows the unnormalized instability Instab for K = 2, . . . , 15. The second row shows the instability Instab_null obtained on a reference distribution (uniform distribution). The third row shows the normalized stability Instab_norm.
|
Once we have computed the normalized stability scores Instab_norm we can choose the number of clusters that has the smallest normalized instability, that is:

$$K = \operatorname*{argmin}_{k=2,\ldots,k_{\max}} \mathrm{Instab}_{\mathrm{norm}}(k, n).$$
|
This approach has been taken for example in Ben-Hur et al. [5] and Lange et al. [15]. |
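As a minimal illustration of the first normalization protocol (scrambling the dimensions of the data as the reference null distribution), one might proceed as follows. The sketch reuses the hypothetical instability() helper from the protocol sketch above and is not the exact procedure of [6, 12].

```python
# Normalizing stability scores by a scrambled-dimensions null model (a sketch).
import numpy as np

def scrambled_null(data, rng=None):
    """Destroy the cluster structure by permuting each dimension independently."""
    rng = np.random.default_rng(rng)
    null = data.copy()
    for dim in range(null.shape[1]):
        null[:, dim] = rng.permutation(null[:, dim])
    return null

def normalized_instability(data, k, n_null=5):
    """Instab_norm := Instab / Instab_null, averaging over several null data sets."""
    instab = instability(data, k)
    instab_null = np.mean([instability(scrambled_null(data), k)
                           for _ in range(n_null)])
    return instab / instab_null

def select_k_normalized(data, k_max=10):
    scores = {k: normalized_instability(data, k) for k in range(2, k_max + 1)}
    return min(scores, key=scores.get), scores
```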
|
Selecting K based on statistical tests. A second approach to select the final number of clusters is to use a statistical test. Similarly to the normalization considered above, the idea is to compute stability scores not only on the actual data set, but also on "null data sets" drawn from some reference null distribution. Then one tests whether, for a given parameter k, the stability on the actual data is significantly larger than the one computed on the null data. If there are several values k for which this is the case, then one selects the one that is most significant. The most well-known implementation of such a procedure uses bootstrap methods [12]. Other authors use a χ²-test [6] or a test based on the Bernstein inequality [7]. A simplified sketch of this idea is given at the end of this section.

To summarize, there are many different implementations for selecting the number K of clusters based on stability scores. Until now, there does not exist any convincing empirical study that thoroughly compares all these approaches on a variety of data sets. In my opinion, even fundamental issues such as the normalization have not been investigated in enough detail. For example, in my experience normalization often has no effect whatsoever (but I did not conduct a thorough study either). To put stability-based model selection on a firm ground it would be crucial to compare the different approaches with each other in an extensive case study.
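The following is a simplified Monte Carlo sketch in the spirit of such test-based selection. It is neither the bootstrap procedure of [12] nor the χ²-test of [6], and it reuses the hypothetical instability() and scrambled_null() helpers from the sketches above.

```python
# A simplified Monte Carlo version of test-based selection (a sketch only).
import numpy as np

def stability_p_value(data, k, n_null=50):
    """Fraction of null data sets that are at least as stable as the real data."""
    instab_real = instability(data, k)
    instab_null = np.array([instability(scrambled_null(data), k)
                            for _ in range(n_null)])
    # A small value indicates the real data are much more stable than data
    # without any cluster structure.
    return float((instab_null <= instab_real).mean())

def select_k_by_test(data, k_max=10):
    """Choose the k whose stability is most significant relative to the null model."""
    p_values = {k: stability_p_value(data, k) for k in range(2, k_max + 1)}
    return min(p_values, key=p_values.get), p_values
```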
|
3 Stability Analysis of the K-Means Algorithm
|
The vast majority of papers about clustering stability use the K-means algorithm as the basic clustering algorithm. In this section we discuss the stability results for the K-means algorithm in depth. Later, in Section 4 we will see how these results can be extended to other clustering algorithms. For simpler reference we briefly recapitulate the K-means algorithm (details can be found in many text books, for example [13]). Given a set of n data points X_1, . . . , X_n ∈ R^d and a fixed number K of clusters to construct, the K-means algorithm attempts to minimize the clustering objective function:
|
$$Q_K^{(n)}(c_1, \ldots, c_K) = \frac{1}{n} \sum_{i=1}^{n} \min_{k=1,\ldots,K} \|X_i - c_k\|^2, \tag{3.1}$$

where c_1, . . . , c_K denote the centers of the K clusters. In the limit n → ∞, the K-means clustering is the one that minimizes the limit objective function:

$$Q_K^{(\infty)}(c_1, \ldots, c_K) = \int \min_{k=1,\ldots,K} \|X - c_k\|^2 \, dP(X), \tag{3.2}$$
|
where P is the underlying probability distribution. |
|
|
Given an initial set c^{(0)} = {c_1^{(0)}, . . . , c_K^{(0)}} of centers, the K-means algorithm iterates the following two steps until convergence:

(1) Assign data points to closest cluster centers:

$$\forall i = 1, \ldots, n: \quad C^{(t)}(X_i) := \operatorname*{argmin}_{k=1,\ldots,K} \|X_i - c_k^{(t)}\|.$$

(2) Re-adjust cluster means:

$$\forall k = 1, \ldots, K: \quad c_k^{(t+1)} := \frac{1}{N_k} \sum_{\{i \,|\, C^{(t)}(X_i) = k\}} X_i,$$

where N_k denotes the number of points in cluster k.
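For concreteness, here is a minimal NumPy sketch of these two steps (Lloyd's algorithm). It is an illustration under our own naming and does not handle corner cases such as empty clusters.

```python
# Lloyd's algorithm: alternate assignment and mean-update steps (a sketch).
import numpy as np

def kmeans(data, initial_centers, max_iter=100):
    """Iterate the two steps above until the centers stop moving."""
    centers = initial_centers.copy()
    for _ in range(max_iter):
        # Step (1): assign each point to its closest cluster center
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step (2): re-adjust each center to the mean of its assigned points
        new_centers = np.array([data[labels == k].mean(axis=0)
                                for k in range(len(centers))])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```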
|
It is well known that, in general, the K-means algorithm terminates in a local optimum of Q_K^{(n)} and does not necessarily find the global optimum. We study the K-means algorithm in two different scenarios:
|
The idealized scenario: Here we assume an idealized algorithm that always finds the global optimum of the K-means objective function Q_K^{(n)}. For simplicity, we call this algorithm the idealized K-means algorithm.
|
The realistic scenario: Here we analyze the actual K-means algorithm as described above. In particular, we take into account its property of getting stuck in local optima. We also take into account the initialization of the algorithm.

In both scenarios, our theoretical investigations are based on the following simple protocol to compute the stability of the K-means algorithm:
|
(1) We assume we have access to as many independent samples of size n of the underlying distribution as we want. That is, we ignore artifacts introduced by the fact that in practice we draw subsamples of one fixed, given sample and thus might introduce a bias.
(2) As distance between two K-means clusterings of two samples S, S' we use the minimal matching distance between the extended clusterings on the domain S ∪ S'.
(3) We work with the expected minimal matching distance as in Equation (2.1), that is we analyze the expectation Instab rather than the empirical estimate used in practice. This does not do much harm as instability scores are highly concentrated around their means anyway.
|
3.1 The Idealized K-Means Algorithm
|
In this section we focus on the idealized K-means algorithm, that is the algorithm that always finds the global optimum c^{(n)} of the K-means objective function:

$$c^{(n)} := (c_1^{(n)}, \ldots, c_K^{(n)}) := \operatorname*{argmin}_{c} Q_K^{(n)}(c).$$
|
3.1.1 First Convergence Result and the Role of Symmetry
|
The starting point for the results in this section is the following observation [4]. Consider the situation in Figure 3.1a. Here the data contains three clusters, but two of them are closer to each other than to the third cluster. Assume we run the idealized K-means algorithm with K = 2 on such a data set. Separating the left two clusters from the right cluster (solid line) leads to a much better value of Q_K^{(n)} than, say, separating the top two clusters from the bottom one (dashed line). Hence, as soon as we have a reasonable amount of data, idealized (!) K-means with K = 2 always constructs the first solution (solid line). Consequently, it is stable in spite of the fact that K = 2 is the wrong number of clusters. Note that this would not happen if the data set was symmetric, as depicted in Figure 3.1b. Here neither the solution depicted by the dashed line nor the one with the solid line is clearly superior, which leads to instability if the idealized K-means algorithm is applied to different samples. Similar examples can be constructed to detect that K is too large, see Figure 3.1c and d. With K = 3 it is clearly the best solution to split the big cluster in Figure 3.1c, thus clustering this data set is stable. In Figure 3.1d, however, due to symmetry reasons neither splitting the top nor the bottom cluster leads to a clear advantage. Again this leads to instability.
|
Fig. 3.1 If data sets are not symmetric, idealized K-means is stable even if the number K of clusters is too small (a) or too large (c). Instability of the wrong number of clusters only occurs in symmetric data sets (b and d).
|
These informal observations suggest that unless the data set contains perfect symmetries, the idealized K-means algorithm is stable even if K is wrong. This can be formalized with the following theorem.
|
Theorem 3.1 (Stability and global optima of the objective function). Let P be a probability distribution on R^d and Q_K^{(∞)} the limit K-means objective function as defined in Equation (3.2), for some fixed value K > 1.

(1) If Q_K^{(∞)} has a unique global minimum, then the idealized K-means algorithm is perfectly stable when n → ∞, that is:

$$\lim_{n \to \infty} \mathrm{Instab}(K, n) = 0.$$

(2) If Q_K^{(∞)} has several global minima (for example, because the probability distribution is symmetric), then the idealized K-means algorithm is instable, that is:

$$\lim_{n \to \infty} \mathrm{Instab}(K, n) > 0.$$
|
This theorem has been proved (in a slightly more general setting) in references [2, 4].

Proof sketch, Part 1. It is well known that if the objective function Q_K^{(∞)} has a unique global minimum, then the centers c^{(n)} constructed by the idealized K-means algorithm on a sample of n points almost surely converge to the true population centers c^{(*)} as n → ∞ [20]. This means that given some ε > 0 we can find some large n such that c^{(n)} is ε-close to c^{(*)} with high probability. As a consequence, if we compare two clusterings on different samples of size n, the centers of the two clusterings are at most 2ε-close to each other. Finally, one can show that if the cluster centers of two clusterings are ε-close, then their minimal matching distance is small as well. Thus, the expected distance between the clusterings constructed on two samples of size n becomes arbitrarily small and eventually converges to 0 as n → ∞.

Part 2. For simplicity, consider the symmetric situation in Figure 3.1b. Here the probability distribution has three axes of symmetry. For K = 2 the objective function Q_2^{(∞)} has three global minima c^{(*1)}, c^{(*2)}, c^{(*3)} corresponding to the three symmetric solutions. In such a situation, the idealized K-means algorithm on a sample of n points gets arbitrarily close to one of the global optima, that is min_{i=1,...,3} d(c^{(n)}, c^{(*i)}) → 0 [16]. In particular, the sequence (c^{(n)})_n of empirical centers has three convergent subsequences, each of which converges to one of the global solutions. One can easily conclude that if we compare two clusterings on random samples, then with probability 1/3 they belong to "the same subsequence" and thus their distance will become arbitrarily small. With probability 2/3 they "belong to different subsequences", and thus their distance remains larger than a constant a > 0. From the latter we can conclude that Instab(K, n) is always larger than 2a/3.
|
The interpretation of this theorem is distressing. The stability or instability of parameter K does not depend on whether K is "correct" or "wrong", but only on whether the K-means objective function for this particular value K has one or several global minima. However, the number of global minima is usually not related to the number of clusters, but rather to the fact that the underlying probability distribution has symmetries. In particular, if we consider "natural" data distributions, such distributions are rarely perfectly symmetric. Consequently, the corresponding functions Q_K^{(∞)} usually only have one global minimum, for any value of K. In practice this means that for a large sample size n, the idealized K-means algorithm is stable for any value of K. This seems to suggest that model selection based on clustering stability does not work. However, we will see later in Section 3.3 that this result is essentially an artifact of the idealized clustering setting and does not carry over to the realistic setting.
|
3.1.2 Refined Convergence Results for the Case of a Unique Global Minimum
|
Above we have seen that if, for a particular distribution P and a particular value K, the objective function Q_K^{(∞)} has a unique global minimum, then the idealized K-means algorithm is stable in the sense that lim_{n→∞} Instab(K, n) = 0. At first glance, this seems to suggest that stability cannot distinguish between different values k_1 and k_2 (at least for large n). However, this point of view is too simplistic. It can happen that even though both Instab(k_1, n) and Instab(k_2, n) converge to 0 as n → ∞, this happens "faster" for k_1 than for k_2. If measured relative to the absolute values of Instab(k_1, n) and Instab(k_2, n), the difference between Instab(k_1, n) and Instab(k_2, n) can still be large enough to be "significant". The key in verifying this intuition is to study the limit process more closely. This line of work has been established by Shamir and Tishby in a series of papers [22, 23, 24]. The main idea is that instead of studying the convergence of Instab(k, n) one needs to consider the rescaled instability √n · Instab(k, n). One can prove that the rescaled instability converges in distribution, and the limit distribution depends on k. In particular, the means of the limit distributions are different for different values of k. This can be formalized as follows.
|
Theorem 3.2 (Convergence of rescaled stability). Assume that the probability distribution P has a density p. Consider a fixed parameter K, and assume that the corresponding limit objective function Q_K^{(∞)} has a unique global minimum c^{(*)} = (c_1^{(*)}, . . . , c_K^{(*)}). The boundary between clusters i and j is denoted by B_ij. Let m ∈ N, and S_{n,1}, . . . , S_{n,2m} be samples of size n drawn independently from P. Let C_K(S_{n,i}) be the result of the idealized K-means clustering on sample S_{n,i}. Compute the instability as the mean distance between clusterings of disjoint pairs of samples, that is:

$$\mathrm{Instab}(K, n) := \frac{1}{m} \sum_{i=1}^{m} d_{\mathrm{MM}}\big(C_K(S_{n,2i-1}), C_K(S_{n,2i})\big). \tag{3.3}$$

Then, as n → ∞ and m → ∞, the rescaled instability √n · Instab(K, n) converges in probability to

$$\mathrm{RInstab}(K) := \sum_{1 \le i < j \le K} \int_{B_{ij}} \frac{V_{ij}}{\|c_i^{(*)} - c_j^{(*)}\|}\, p(x)\, dx, \tag{3.4}$$

where V_ij stands for a term describing the asymptotics of the random fluctuations of the cluster boundary between cluster i and cluster j (exact formula given in [23, 24]).
|
Note that even though the definition of instability in Equation (3.3) differs slightly from the definition in Equation (2.1), intuitively it measures the same quantity. The definition in Equation (3.3) just has the technical advantage that all pairs of samples are independent from one another.

Proof sketch. It is well known that if Q_K^{(∞)} has a unique global minimum, then the centers constructed by the idealized K-means algorithm on a finite sample satisfy a central limit theorem [21]. That is, if we rescale the distances between the sample-based centers and the true centers with the factor √n, these rescaled distances converge to a normal distribution as n → ∞. When the cluster centers converge, the same can be said about the cluster boundaries. In this case, instability essentially counts how many points change side when the cluster boundaries move by some small amount. The points that potentially change side are the points close to the boundary of the true limit clustering. Counting these points is what the integrals ∫_{B_ij} · · · p(x) dx in the definition of RInstab take care of. The exact characterization of how the cluster boundaries "jitter" can be derived from the central limit theorem. This leads to the term V_ij / ‖c_i^{(*)} − c_j^{(*)}‖ in the integral. V_ij characterizes how the cluster centers themselves "jitter". The normalization ‖c_i^{(*)} − c_j^{(*)}‖ is needed to transform jittering of cluster centers to jittering of cluster boundaries: if two cluster centers are very far apart from each other, the cluster boundary only jitters by a small amount if the centers move by ε, say. However, if the centers are very close to each other (say, they have distance 3ε), then moving the centers by ε has a large impact on the cluster boundary. The details of this proof are very technical; we refer the interested reader to references [23, 24].
|
Let us briefly explain how the result in Theorem 3.2 is compatible with the result in Theorem 3.1. On a high level, the difference between both results resembles the difference between the law of large numbers and the central limit theorem in probability theory. The LLN studies the convergence of the mean of a sum of random variables to its expectation (note that Instab has the form of a sum of random variables). The CLT is concerned with the same expression, but rescaled with a factor √n. For the rescaled sum, the CLT then gives results on the convergence in distribution. Note that in the particular case of instability, the distribution of distances lives on the non-negative numbers only. This is why the rescaled instability in Theorem 3.2 is positive and not 0 as in the limit of Instab in Theorem 3.1. A toy figure explaining the different convergence processes can be seen in Figure 3.2.

Theorem 3.2 tells us that different parameters k usually lead to different rescaled stabilities in the limit for n → ∞. Thus we can hope that if the sample size n is large enough we can distinguish between different values of k based on the stability of the corresponding clusterings. An important question is now which values of k lead to stable and which ones lead to instable results, for a given distribution P.
|
3.1.3 Characterizing Stable Clusterings
|
It is a straightforward consequence of Theorem 3.2 that if we consider different values k_1 and k_2 and the clustering objective functions Q_{k_1}^{(∞)} and Q_{k_2}^{(∞)} have unique global minima, then the rescaled stability values RInstab(k_1) and RInstab(k_2) can differ from each other. Now we want to investigate which values of k lead to high stability and which ones lead to low stability.

Fig. 3.2 Different convergence processes. The left column shows the convergence studied in Theorem 3.1. As the sample size n → ∞, the distribution of distances d_MM(C, C') is degenerate, all mass is concentrated on 0. The right column shows the convergence studied in Theorem 3.2. The rescaled distances converge to a non-trivial distribution, and its mean (depicted by the cross) is positive. To go from the left to the right side one has to rescale by √n.

Conclusion 3.3 (Instable clusterings). Assume that Q_K^{(∞)} has a unique global optimum. If Instab(K, n) is large, the idealized K-means clustering tends to have cluster boundaries in high-density regions of the space.
|
There exist two different derivations of this conclusion, which have been obtained independently from each other by completely different methods [3, 22]. On a high level, the reason why the conclusion tends to hold is that if cluster boundaries jitter in a region of high density, then more points "change side" than if the boundaries jitter in a region of low density.

First derivation, informal, based on references [22, 24]. Assume that n is large enough such that we are already in the asymptotic regime (that is, the solution c^{(n)} constructed on the finite sample is close to the true population solution c^{(*)}). Then the rescaled instability computed on the sample is close to the expression given in Equation (3.4). If the cluster boundaries B_ij lie in a high-density region of the space, then the integral in Equation (3.4) is large, compared to a situation where the cluster boundaries lie in low-density regions of the space. From a high level point of view, this justifies the conclusion above. However, note that it is difficult to identify how exactly the quantities p, B_ij, and V_ij influence RInstab, as they are not independent of each other.

Second derivation, more formal, based on Ben-David and von Luxburg [3]. A formal way to prove the conclusion is as follows. We introduce a new distance d_boundary between two clusterings. This distance measures how far the cluster boundaries of two clusterings are apart from each other. One can prove that the K-means quality function Q_K^{(∞)} is continuous with respect to this distance function. This means that if two clusterings C, C' are close with respect to d_boundary, then they have similar quality values. Moreover, if Q_K^{(∞)} has a unique global optimum, we can invert this argument and show that if a clustering C is close to the optimal limit clustering C^*, then the distance d_boundary(C, C^*) is small. Now consider the clustering C^{(n)} based on a sample of size n. One can prove the following key statement. If C^{(n)} converges uniformly (over the space of all probability distributions) in the sense that with probability at least 1 − δ we have d_boundary(C^{(n)}, C) ≤ γ, then:

$$\mathrm{Instab}(K, n) \le 2\delta + P(T_\gamma(B)). \tag{3.5}$$

Here P(T_γ(B)) denotes the probability mass of a tube of width γ around the cluster boundaries B of C. Results in [1] establish the uniform convergence of the idealized K-means algorithm. This proves the conjecture: Equation (3.5) shows that if Instab is high, then there is a lot of mass around the cluster boundaries, namely the cluster boundaries are in a region of high density.
|
For stable clusterings, the situation is not as simple. It is tempting to make the following conjecture. |
|
Conjecture 3.4 (Stable clusterings). Assume that Q_K^{(∞)} has a unique global optimum. If Instab(K, n) is "small", the idealized K-means clustering tends to have cluster boundaries in low-density regions of the space.

Argument in favor of the conjecture: As in the first approach above, considering the limit expression of RInstab reveals that if the cluster boundary lies in a low-density area of the space, then the integral in RInstab tends to have a low value. In the extreme case where the cluster boundaries go through a region of zero density, the rescaled instability is even 0.

Argument against the conjecture: counter-examples! One can construct artificial examples where clusterings are stable although their decision boundary lies in a high-density region of the space [3]. The way to construct such examples is to ensure that the variations of the cluster centers happen in parallel to cluster boundaries and not orthogonal to cluster boundaries. In this case, the sampling variation does not lead to jittering of the cluster boundary, hence the result is rather stable.

These counter-examples show that Conjecture 3.4 cannot be true in general. However, my personal opinion is that the counter-examples are rather artificial, and that similar situations will rarely be encountered in practice. I believe that the conjecture "tends to hold" in practice. It might be possible to formalize this intuition by proving that the statement of the conjecture holds on a subset of "nice" and "natural" probability distributions. The important consequence of Conclusion 3.3 and Conjecture 3.4 (if true) is the following.
|
Conclusion 3.5 (Stability of idealized K-means detects whether K is too large). Assume that the underlying distribution P has K well-separated clusters, and assume that these clusters can be represented by a center-based clustering model. Then the following statements tend to hold for the idealized K-means algorithm. |
|
(1) If K is too large, then the clusterings obtained by the idealized K-means algorithm tend to be instable.
(2) If K is correct or too small, then the clusterings obtained by the idealized K-means algorithm tend to be stable (unless the objective function has several global minima, for example due to symmetries).
|
|
Given Conclusion 3.3 and Conjecture 3.4 it is easy to see why Conclusion 3.5 is true. If K is larger than the correct number of clusters, one necessarily has to split a true cluster into several smaller clusters. The corresponding boundary goes through a region of high density (the cluster which is being split). According to Conclusion 3.3 this leads to instability. If K is correct, then the idealized (!) K-means algorithm discovers the correct clustering and thus has decision boundaries between the true clusters, that is in low-density regions of the space. If K is too small, then the K-means algorithm has to group clusters together. In this situation, the cluster boundaries are still between true clusters, hence in a low-density region of the space.



