|
Foundations and Trends® in Machine Learning, Vol. 2, No. 3 (2009) 235–274. © 2010 U. von Luxburg. DOI: 10.1561/2200000008
|
Clustering Stability: An Overview |
|
By Ulrike von Luxburg |
|
Contents |
|
1 Introduction |
|
2 Clustering Stability: Definition and Implementation |
|
3 Stability Analysis of the K-Means Algorithm |
|
3.1 The Idealized K-Means Algorithm

3.2 The Actual K-Means Algorithm

3.3 Relationships between the results
|
4 Beyond K-Means |
|
5 Outlook |
|
References |
|
|
Clustering Stability: An Overview |
|
Ulrike von Luxburg |
|
Max Planck Institute for Biological Cybernetics, Tübingen, Germany, ulrike.luxburg@tuebingen.mpg.de
|
Abstract |
|
A popular method for selecting the number of clusters is based on stability arguments: one chooses the number of clusters such that the corresponding clustering results are "most stable". In recent years, a series of papers has analyzed the behavior of this method from a theoretical point of view. However, the results are very technical and difficult to interpret for non-experts. In this monograph we give a high-level overview of the existing literature on clustering stability. In addition to presenting the results in a slightly informal but accessible way, we relate them to each other and discuss their different implications.
|
1 Introduction
|
Model selection is a difficult problem in non-parametric clustering. The obvious reason is that, as opposed to supervised classification, there is no ground truth against which we could "test" our clustering results. One of the most pressing questions in practice is how to determine the number of clusters. Various ad hoc methods have been suggested in the literature, but none of them is entirely convincing. These methods usually suffer from the fact that they implicitly have to define "what a clustering is" before they can assign different scores to different numbers of clusters. In recent years a new method has become increasingly popular: selecting the number of clusters based on clustering stability. Instead of defining "what is a clustering", the basic philosophy is simply that a clustering should be a structure on the data set that is "stable". That is, if applied to several data sets from the same underlying model or the same data-generating process, a clustering algorithm should obtain similar results. In this philosophy it is not so important how the clusters look (this is taken care of by the clustering algorithm), but that they can be constructed in a stable manner.

The basic intuition of why people believe that this is a good principle can be described by Figure 1.1. Shown is a data distribution with four underlying clusters (depicted by the black circles), and different samples from this distribution (depicted by red diamonds). If we cluster this data set into K = 2 clusters, there are two reasonable solutions: a horizontal and a vertical split. If a clustering algorithm is applied repeatedly to different samples from this distribution, it might sometimes construct the horizontal and sometimes the vertical solution. Obviously, these two solutions are very different from each other, hence the clustering results are instable. Similar effects take place if we start with K = 5. In this case, we necessarily have to split an existing cluster into two clusters, and depending on the sample this could happen to any of the four clusters. Again the clustering solution is instable. Finally, if we apply the algorithm with the correct number K = 4, we observe stable results (not shown in the figure): the clustering algorithm always discovers the correct clusters (maybe up to a few outlier points). In this example, the stability principle detects the correct number of clusters.

Fig. 1.1 Idea of clustering stability. Instable clustering solutions if the number of clusters is too small (first row) or too large (second row). See text for details.

At first glance, using stability-based principles for model selection appears to be very attractive. It is elegant as it avoids having to define what a good clustering is. It is a meta-principle that can be applied to any basic clustering algorithm and does not require a particular clustering model. Finally, it sounds "very fundamental" from a philosophy of inference point of view.
|
|
However, the longer one thinks about this principle, the less obvious it becomes that model selection based on clustering stability "always works". What is clear is that solutions that are completely instable should not be considered at all. However, if there are several stable solutions, is it always the best choice to select the one corresponding to the most stable results? One could conjecture that the most stable parameter always corresponds to the simplest solution, but clearly there exist situations where the simplest solution is not what we are looking for. To find out how model selection based on clustering stability works we need theoretical results.

In this monograph we discuss a series of theoretical results on clustering stability that have been obtained in recent years. In Section 2 we review different protocols for how clustering stability is computed and used for model selection. In Section 3 we concentrate on theoretical results for the K-means algorithm and discuss their various relations. This is the main section of the paper. Results for more general clustering algorithms are presented in Section 4.
|
2 Clustering Stability: Definition and Implementation
|
A clustering C_K of a data set S = {X_1, . . . , X_n} is a function that assigns labels to all points of S, that is C_K : S → {1, . . . , K}. Here K denotes the number of clusters. A clustering algorithm is a procedure that takes a set S of points as input and outputs a clustering of S. The clustering algorithms considered in this monograph take an additional parameter as input, namely the number K of clusters they are supposed to construct.

We analyze clustering stability in a statistical setup. The data set S is assumed to consist of n data points X_1, . . . , X_n that have been drawn independently from some unknown underlying distribution P on some space X. The final goal is to use these sample points to construct a good partition of the underlying space X. For some theoretical results it will be easier to ignore sampling effects and directly work on the underlying space X endowed with the probability distribution P. This can be considered as the case of having "infinitely many" data points. We sometimes call this the limit case for n → ∞.

Assume we agree on a way to compute distances d(C, C') between different clusterings C and C' (see below for details). Then, for a fixed probability distribution P, a fixed number K of clusters and a fixed sample size n, the instability of a clustering algorithm is defined as the expected distance between two clusterings C_K(S_n), C_K(S'_n) on different data sets S_n, S'_n of size n, that is:

$$\mathrm{Instab}(K, n) := \mathbb{E}\, d\big(C_K(S_n), C_K(S'_n)\big). \tag{2.1}$$
|
The expectation is taken with respect to the drawing of the two samples. In practice, a large variety of methods has been devised to compute stability scores and use them for model selection. On a very general level they work as follows:
|
Given: a set S of data points, a clustering algorithm A that takes the number k of clusters as input |
|
(1) For k = 2, . . . , k_max:

    (a) Generate perturbed versions S_b (b = 1, . . . , b_max) of the original data set (for example by subsampling or adding noise, see below).
    (b) For b = 1, . . . , b_max: cluster the data set S_b with algorithm A into k clusters to obtain the clustering C_b.
    (c) For b, b' = 1, . . . , b_max: compute the pairwise distances d(C_b, C_b') between these clusterings (using one of the distance functions described below).
    (d) Compute the instability as the mean distance between the clusterings C_b:

$$\mathrm{Instab}(k, n) = \frac{1}{b_{\max}^2} \sum_{b, b' = 1}^{b_{\max}} d(C_b, C_{b'}).$$
|
(2) Choose the parameter k that gives the best stability, in the simplest case as follows:

$$K := \operatorname*{argmin}_{k} \mathrm{Instab}(k, n)$$

(see below for more options).
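To make the protocol concrete, the following is a minimal Python sketch (an illustration, not code from any of the papers discussed here). It assumes scikit-learn's KMeans as the base algorithm A, subsampling without replacement as the perturbation scheme, and the minimal matching distance computed on the overlap of each pair of subsamples as the distance d; the helper names minimal_matching_distance, instability and select_k are our own.

```python
# Sketch of the generic stability protocol above (illustrative assumptions only).
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans


def minimal_matching_distance(labels_a, labels_b, k):
    """Fraction of points whose labels disagree under the best label permutation."""
    n = len(labels_a)
    # confusion[i, j] = number of points with label i in clustering A and j in B
    confusion = np.zeros((k, k))
    for a, b in zip(labels_a, labels_b):
        confusion[a, b] += 1
    # Hungarian algorithm: find the label permutation maximizing agreement
    row, col = linear_sum_assignment(-confusion)
    return 1.0 - confusion[row, col].sum() / n


def instability(data, k, b_max=20, subsample_frac=0.8, rng=None):
    """Mean pairwise distance between clusterings of perturbed data sets (step 1)."""
    rng = np.random.default_rng(rng)
    n = len(data)
    m = int(subsample_frac * n)
    subsets, labelings = [], []
    for _ in range(b_max):
        idx = rng.choice(n, size=m, replace=False)       # step (a): subsample
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(data[idx])  # step (b)
        subsets.append(idx)
        labelings.append(dict(zip(idx, labels)))
    dists = []
    for i in range(b_max):                               # steps (c)-(d)
        for j in range(i + 1, b_max):
            common = np.intersect1d(subsets[i], subsets[j])
            la = np.array([labelings[i][p] for p in common])
            lb = np.array([labelings[j][p] for p in common])
            dists.append(minimal_matching_distance(la, lb, k))
    return float(np.mean(dists))


def select_k(data, k_max=10):
    """Step (2): choose the k with the smallest (unnormalized) instability."""
    scores = {k: instability(data, k) for k in range(2, k_max + 1)}
    return min(scores, key=scores.get), scores
```

For example, select_k(data) returns the chosen number of clusters together with the raw instability scores; the normalization issues discussed later in this section are deliberately ignored in this sketch.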
|
This scheme gives a very rough overview of how clustering stability can be used for model selection. In practice, many details have to be taken into account, and they will be discussed in the next section. Finally, we want to mention an approach that is vaguely related to clustering stability, namely the ensemble method [26]. Here, an ensemble of algorithms is applied to one fixed data set. Then a final clustering is built from the results of the individual algorithms. We are not going to discuss this approach in our monograph.
|
Generating perturbed versions of the data set. To be able to evaluate the stability of a fixed clustering algorithm we need to run the clustering algorithm several times on slightly different data sets. To this end we need to generate perturbed versions of the original data set. In practice, the following schemes have been used: |
|
• Draw a random subsample of the original data set without replacement [5, 12, 15, 17].
• Add random noise to the original data points [8, 19].
• If the original data set is high-dimensional, use different random projections in low-dimensional spaces, and then cluster the low-dimensional data sets [25].
• If we work in a model-based framework, sample data from the model [14].
• Draw a random sample of the original data with replacement. This approach has not been reported in the literature yet, but it avoids the problem of setting the size of the subsample. For good reasons, this kind of sampling is the standard in the bootstrap literature [11] and might also have advantages in the stability setting. This scheme requires that the algorithm can deal with weighted data points, because some data points will occur several times in the sample; a sketch of how this can be handled follows below.
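As an illustration of the last scheme, here is a minimal sketch (our own, not code from the literature) that collapses a bootstrap sample into unique points with multiplicities and passes these as weights to scikit-learn's KMeans, whose fit method accepts a sample_weight argument.

```python
# Bootstrap perturbation with duplicates converted into point weights (a sketch).
import numpy as np
from sklearn.cluster import KMeans

def bootstrap_kmeans(data, k, rng=None):
    """Cluster one bootstrap replicate, treating duplicates as point weights."""
    rng = np.random.default_rng(rng)
    n = len(data)
    # Draw n points with replacement and count how often each point occurs
    idx = rng.choice(n, size=n, replace=True)
    unique_idx, counts = np.unique(idx, return_counts=True)
    km = KMeans(n_clusters=k, n_init=10)
    # Repeated points enter the objective through their multiplicities
    km.fit(data[unique_idx], sample_weight=counts)
    return km.cluster_centers_, unique_idx, km.labels_
```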
|
In all cases, there is a trade-off that has to be treated carefully. If we change the data set too much (for example, the subsample is too small, or the noise too large), then we might destroy the structure we want to discover by clustering. If we change the data set too little, then the clustering algorithm will always obtain the same results, and we will observe trivial stability. It is hard to quantify this trade-off in practice. |
|
Which clusterings to compare? Different protocols are used to compare the clusterings on the different data sets S_b.
|
• Compare the clustering of the original data set with the clusterings obtained on subsamples [17].
|
|
• Compare clusterings of overlapping subsamples on the data points where both clusterings are defined [5].
• Compare clusterings of disjoint subsamples [12, 15]. Here we first need to apply an extension operator to extend each clustering to the domain of the other clustering.
|
Distances between clusterings. If two clusterings are defined on the same data points, then it is straightforward to compute a distance score between these clusterings based on any of the well-known clustering distances such as the Rand index, Jaccard index, Hamming distance, minimal matching distance, and Variation of Information distance [18]. All these distances count, in some way or the other, points or pairs of points on which the two clusterings agree or disagree. The most convenient choice from a theoretical point of view is the minimal matching distance. For two clusterings C, C' of the same data set of n points it is defined as:

$$d_{\mathrm{MM}}(C, C') := \min_{\pi} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{C(X_i) \neq \pi(C'(X_i))\}}, \tag{2.2}$$
|
where the minimum is taken over all permutations π of the K labels. Intuitively, the minimal matching distance measures the same quantity as the 0–1-loss used in supervised classification. For a stability study involving the adjusted Rand index or an adjusted mutual information index see Vinh and Epps [27].

If two clusterings are defined on different data sets one has two choices. If the two data sets have a big overlap one can use a restriction operator to restrict the clusterings to the points that are contained in both data sets. On this restricted set one can then compute a standard distance between the two clusterings. The other possibility is to use an extension operator to extend both clusterings from their domain to the domain of the other clustering. Then one can compute a standard distance between the two clusterings as they are now both defined on the joint domain. For center-based clusterings, as constructed by the K-means algorithm, a natural extension operator exists. Namely, to a new data point we simply assign the label of the closest cluster center. A more general scheme to extend an existing clustering to new data points is to train a classifier on the old data points and use its predictions as labels on the new data points. However, in the context of clustering stability it is not obvious what kind of bias we introduce with this approach.
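The nearest-center extension operator just described is easy to state in code; the following minimal sketch (our own naming, assuming NumPy arrays of centers and points) illustrates it.

```python
# Nearest-center extension operator for center-based clusterings (a sketch).
import numpy as np

def extend_clustering(centers, new_points):
    """Assign each new point the label of the closest cluster center."""
    # distances[i, k] = squared Euclidean distance from point i to center k
    distances = ((new_points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return distances.argmin(axis=1)
```

With such an operator, two center-based clusterings constructed on different samples can both be extended to the union of the samples and then compared with any of the distances above.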
|
Stability scores and their normalization. The stability protocol outlined above results in a set of distance values (d(C_b, C_b'))_{b,b'=1,...,b_max}. In most approaches, one summarizes these values by taking their mean:

$$\mathrm{Instab}(k, n) = \frac{1}{b_{\max}^2} \sum_{b, b' = 1}^{b_{\max}} d(C_b, C_{b'}).$$
|
Note that the mean is the simplest summary statistic one can compute based on the distance values d(C_b, C_b'). A different approach is to use the area under the cumulative distribution function of the distance values as the stability score, see Ben-Hur et al. [5] or Bertoni and Valentini [6] for details. In principle one could also come up with more elaborate statistics based on the distance values. To the best of our knowledge, such concepts have not been used so far. The simplest way to select the number K of clusters is to minimize the instability:

$$K = \operatorname*{argmin}_{k=2,\ldots,k_{\max}} \mathrm{Instab}(k, n).$$
|
This approach has been suggested in Levine and Domany [17]. However, an important fact to note is that Instab(k, n) trivially scales with k, regardless of what the underlying data structure is. For example, in the top left plot in Figure 2.1 we can see that even for a completely unclustered data set, Instab(k, n) increases with k. When using stability for model selection, one should correct for this trivial scaling of Instab, otherwise it might be meaningless to take the minimum afterwards. There exist several different normalization protocols:
|
• Normalization using a reference null distribution [6, 12]. One repeatedly samples data sets from some reference null distribution. Such a distribution is defined on the same domain as the data points, but does not possess any cluster structure. In simple cases one can use the uniform distribution on the data domain as null distribution. A more practical approach is to scramble the individual dimensions of the existing data points and use the "scrambled points" as null distribution (see [6, 12] for details). Once we have drawn several data sets from the null distribution, we cluster them using our clustering algorithm and compute the corresponding stability score Instab_null as above. The normalized stability is then defined as Instab_norm := Instab/Instab_null.
• Normalization by random labels [15]. First, we cluster each of the data sets S_b as in the protocol above to obtain the clusterings C_b. Then, we randomly permute these labels. That is, we assign to data point X_i the label that belonged to X_{π(i)}, where π is a permutation of {1, . . . , n}. This leads to a permuted clustering C_{b,perm}. We then compute the stability score Instab as above, and similarly we compute Instab_perm for the permuted clusterings. The normalized stability is then defined as Instab_norm := Instab/Instab_perm.

Fig. 2.1 Normalized stability scores. Left plots: data points from a uniform density on [0, 1]^2. Right plots: data points from a mixture of four well-separated Gaussians in R^2. The first row always shows the unnormalized instability Instab for K = 2, . . . , 15. The second row shows the instability Instab_null obtained on a reference distribution (uniform distribution). The third row shows the normalized stability Instab_norm.
|
Once we have computed the normalized stability scores Instab_norm we can choose the number of clusters that has the smallest normalized instability, that is:

$$K = \operatorname*{argmin}_{k=2,\ldots,k_{\max}} \mathrm{Instab}_{\mathrm{norm}}(k, n).$$
|
This approach has been taken for example in Ben-Hur et al. [5] and Lange et al. [15]. |
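As a minimal illustration of the first normalization protocol (scrambling the dimensions of the data as the reference null distribution), one might proceed as follows. The sketch reuses the hypothetical instability() helper from the protocol sketch above and is not the exact procedure of [6, 12].

```python
# Normalizing stability scores by a scrambled-dimensions null model (a sketch).
import numpy as np

def scrambled_null(data, rng=None):
    """Destroy the cluster structure by permuting each dimension independently."""
    rng = np.random.default_rng(rng)
    null = data.copy()
    for dim in range(null.shape[1]):
        null[:, dim] = rng.permutation(null[:, dim])
    return null

def normalized_instability(data, k, n_null=5):
    """Instab_norm := Instab / Instab_null, averaging over several null data sets."""
    instab = instability(data, k)
    instab_null = np.mean([instability(scrambled_null(data), k)
                           for _ in range(n_null)])
    return instab / instab_null

def select_k_normalized(data, k_max=10):
    scores = {k: normalized_instability(data, k) for k in range(2, k_max + 1)}
    return min(scores, key=scores.get), scores
```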
|
Selecting K based on statistical tests. A second approach to select the final number of clusters is to use a statistical test. Similarly to the normalization considered above, the idea is to compute stability scores not only on the actual data set, but also on "null data sets" drawn from some reference null distribution. Then one tests whether, for a given parameter k, the stability on the actual data is significantly larger than the one computed on the null data. If there are several values k for which this is the case, then one selects the one that is most significant. The most well-known implementation of such a procedure uses bootstrap methods [12]. Other authors use a χ²-test [6] or a test based on the Bernstein inequality [7]. A simplified sketch of this idea is given at the end of this section.

To summarize, there are many different implementations for selecting the number K of clusters based on stability scores. Until now, there does not exist any convincing empirical study that thoroughly compares all these approaches on a variety of data sets. In my opinion, even fundamental issues such as the normalization have not been investigated in enough detail. For example, in my experience normalization often has no effect whatsoever (but I did not conduct a thorough study either). To put stability-based model selection on a firm ground it would be crucial to compare the different approaches with each other in an extensive case study.
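The following is a simplified Monte Carlo sketch in the spirit of such test-based selection. It is neither the bootstrap procedure of [12] nor the χ²-test of [6], and it reuses the hypothetical instability() and scrambled_null() helpers from the sketches above.

```python
# A simplified Monte Carlo version of test-based selection (a sketch only).
import numpy as np

def stability_p_value(data, k, n_null=50):
    """Fraction of null data sets that are at least as stable as the real data."""
    instab_real = instability(data, k)
    instab_null = np.array([instability(scrambled_null(data), k)
                            for _ in range(n_null)])
    # A small value indicates the real data are much more stable than data
    # without any cluster structure.
    return float((instab_null <= instab_real).mean())

def select_k_by_test(data, k_max=10):
    """Choose the k whose stability is most significant relative to the null model."""
    p_values = {k: stability_p_value(data, k) for k in range(2, k_max + 1)}
    return min(p_values, key=p_values.get), p_values
```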
|
3 Stability Analysis of the K-Means Algorithm
|
The vast majority of papers about clustering stability use the K-means algorithm as the basic clustering algorithm. In this section we discuss the stability results for the K-means algorithm in depth. Later, in Section 4 we will see how these results can be extended to other clustering algorithms. For simpler reference we briefly recapitulate the K-means algorithm (details can be found in many text books, for example [13]). Given a set of n data points X_1, . . . , X_n ∈ R^d and a fixed number K of clusters to construct, the K-means algorithm attempts to minimize the clustering objective function:
|
$$Q_K^{(n)}(c_1, \ldots, c_K) = \frac{1}{n} \sum_{i=1}^{n} \min_{k=1,\ldots,K} \|X_i - c_k\|^2, \tag{3.1}$$

where c_1, . . . , c_K denote the centers of the K clusters. In the limit n → ∞, the K-means clustering is the one that minimizes the limit objective function:

$$Q_K^{(\infty)}(c_1, \ldots, c_K) = \int \min_{k=1,\ldots,K} \|X - c_k\|^2 \, dP(X), \tag{3.2}$$
|
where P is the underlying probability distribution. |
|
|
Given an initial set c^{(0)} = {c_1^{(0)}, . . . , c_K^{(0)}} of centers, the K-means algorithm iterates the following two steps until convergence:

(1) Assign data points to closest cluster centers:

$$\forall i = 1, \ldots, n: \quad C^{(t)}(X_i) := \operatorname*{argmin}_{k=1,\ldots,K} \|X_i - c_k^{(t)}\|.$$

(2) Re-adjust cluster means:

$$\forall k = 1, \ldots, K: \quad c_k^{(t+1)} := \frac{1}{N_k} \sum_{\{i \,|\, C^{(t)}(X_i) = k\}} X_i,$$

where N_k denotes the number of points in cluster k.
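For concreteness, here is a minimal NumPy sketch of these two steps (Lloyd's algorithm). It is an illustration under our own naming and does not handle corner cases such as empty clusters.

```python
# Lloyd's algorithm: alternate assignment and mean-update steps (a sketch).
import numpy as np

def kmeans(data, initial_centers, max_iter=100):
    """Iterate the two steps above until the centers stop moving."""
    centers = initial_centers.copy()
    for _ in range(max_iter):
        # Step (1): assign each point to its closest cluster center
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step (2): re-adjust each center to the mean of its assigned points
        new_centers = np.array([data[labels == k].mean(axis=0)
                                for k in range(len(centers))])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```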
|
It is well known that, in general, the K-means algorithm terminates in a local optimum of Q_K^{(n)} and does not necessarily find the global optimum. We study the K-means algorithm in two different scenarios:
|
The idealized scenario: Here we assume an idealized algorithm that always finds the global optimum of the K-means objective function Q_K^{(n)}. For simplicity, we call this algorithm the idealized K-means algorithm.
|
The realistic scenario: Here we analyze the actual K-means algorithm as described above. In particular, we take into account its property of getting stuck in local optima. We also take into account the initialization of the algorithm.

In both scenarios, our theoretical investigations are based on the following simple protocol to compute the stability of the K-means algorithm:
|
(1) We assume we have access to as many independent samples of size n of the underlying distribution as we want. That is, we ignore artifacts introduced by the fact that in practice we draw subsamples of one fixed, given sample and thus might introduce a bias.
(2) As distance between two K-means clusterings of two samples S, S' we use the minimal matching distance between the extended clusterings on the domain S ∪ S'.
(3) We work with the expected minimal matching distance as in Equation (2.1), that is we analyze the expectation Instab rather than the empirical estimate used in practice. This does not do much harm as instability scores are highly concentrated around their means anyway.
|
3.1 The Idealized K-Means Algorithm
|
In this section we focus on the idealized K-means algorithm, that is the algorithm that always finds the global optimum c^{(n)} of the K-means objective function:

$$c^{(n)} := (c_1^{(n)}, \ldots, c_K^{(n)}) := \operatorname*{argmin}_{c} Q_K^{(n)}(c).$$
|
3.1.1 First Convergence Result and the Role of Symmetry
|
The starting point for the results in this section is the following observation [4]. Consider the situation in Figure 3.1a. Here the data contains three clusters, but two of them are closer to each other than to the third cluster. Assume we run the idealized K-means algorithm with K = 2 on such a data set. Separating the left two clusters from the right cluster (solid line) leads to a much better value of Q_K^{(n)} than, say, separating the top two clusters from the bottom one (dashed line). Hence, as soon as we have a reasonable amount of data, idealized (!) K-means with K = 2 always constructs the first solution (solid line). Consequently, it is stable in spite of the fact that K = 2 is the wrong number of clusters. Note that this would not happen if the data set was symmetric, as depicted in Figure 3.1b. Here neither the solution depicted by the dashed line nor the one with the solid line is clearly superior, which leads to instability if the idealized K-means algorithm is applied to different samples. Similar examples can be constructed to detect that K is too large, see Figure 3.1c and d. With K = 3 it is clearly the best solution to split the big cluster in Figure 3.1c, thus clustering this data set is stable. In Figure 3.1d, however, due to symmetry reasons neither splitting the top nor the bottom cluster leads to a clear advantage. Again this leads to instability.
|
Fig. 3.1 If data sets are not symmetric, idealized K-means is stable even if the number K of clusters is too small (a) or too large (c). Instability of the wrong number of clusters only occurs in symmetric data sets (b and d).
|
These informal observations suggest that unless the data set contains perfect symmetries, the idealized K-means algorithm is stable even if K is wrong. This can be formalized with the following theorem.
|
Theorem 3.1 (Stability and global optima of the objective function). Let P be a probability distribution on R^d and Q_K^{(∞)} the limit K-means objective function as defined in Equation (3.2), for some fixed value K > 1.

(1) If Q_K^{(∞)} has a unique global minimum, then the idealized K-means algorithm is perfectly stable when n → ∞, that is:

$$\lim_{n \to \infty} \mathrm{Instab}(K, n) = 0.$$

(2) If Q_K^{(∞)} has several global minima (for example, because the probability distribution is symmetric), then the idealized K-means algorithm is instable, that is:

$$\lim_{n \to \infty} \mathrm{Instab}(K, n) > 0.$$
|
This theorem has been proved (in a slightly more general setting) in references [2, 4].

Proof sketch, Part 1. It is well known that if the objective function Q_K^{(∞)} has a unique global minimum, then the centers c^{(n)} constructed by the idealized K-means algorithm on a sample of n points almost surely converge to the true population centers c^{(*)} as n → ∞ [20]. This means that given some ε > 0 we can find some large n such that c^{(n)} is ε-close to c^{(*)} with high probability. As a consequence, if we compare two clusterings on different samples of size n, the centers of the two clusterings are at most 2ε-close to each other. Finally, one can show that if the cluster centers of two clusterings are ε-close, then their minimal matching distance is small as well. Thus, the expected distance between the clusterings constructed on two samples of size n becomes arbitrarily small and eventually converges to 0 as n → ∞.

Part 2. For simplicity, consider the symmetric situation in Figure 3.1b. Here the probability distribution has three axes of symmetry. For K = 2 the objective function Q_2^{(∞)} has three global minima c^{(*1)}, c^{(*2)}, c^{(*3)} corresponding to the three symmetric solutions. In such a situation, the idealized K-means algorithm on a sample of n points gets arbitrarily close to one of the global optima, that is min_{i=1,...,3} d(c^{(n)}, c^{(*i)}) → 0 [16]. In particular, the sequence (c^{(n)})_n of empirical centers has three convergent subsequences, each of which converges to one of the global solutions. One can easily conclude that if we compare two clusterings on random samples, then with probability 1/3 they belong to "the same subsequence" and thus their distance will become arbitrarily small. With probability 2/3 they "belong to different subsequences", and thus their distance remains larger than a constant a > 0. From the latter we can conclude that Instab(K, n) is always larger than 2a/3.
|
The interpretation of this theorem is distressing. The stability or instability of parameter K does not depend on whether K is "correct" or "wrong", but only on whether the K-means objective function for this particular value K has one or several global minima. However, the number of global minima is usually not related to the number of clusters, but rather to the fact that the underlying probability distribution has symmetries. In particular, if we consider "natural" data distributions, such distributions are rarely perfectly symmetric. Consequently, the corresponding functions Q_K^{(∞)} usually only have one global minimum, for any value of K. In practice this means that for a large sample size n, the idealized K-means algorithm is stable for any value of K. This seems to suggest that model selection based on clustering stability does not work. However, we will see later in Section 3.3 that this result is essentially an artifact of the idealized clustering setting and does not carry over to the realistic setting.
|
3.1.2 Refined Convergence Results for the Case of a Unique Global Minimum
|
Above we have seen that if, for a particular distribution P and a particular value K, the objective function Q_K^{(∞)} has a unique global minimum, then the idealized K-means algorithm is stable in the sense that lim_{n→∞} Instab(K, n) = 0. At first glance, this seems to suggest that stability cannot distinguish between different values k_1 and k_2 (at least for large n). However, this point of view is too simplistic. It can happen that even though both Instab(k_1, n) and Instab(k_2, n) converge to 0 as n → ∞, this happens "faster" for k_1 than for k_2. If measured relative to the absolute values of Instab(k_1, n) and Instab(k_2, n), the difference between Instab(k_1, n) and Instab(k_2, n) can still be large enough to be "significant". The key in verifying this intuition is to study the limit process more closely. This line of work has been established by Shamir and Tishby in a series of papers [22, 23, 24]. The main idea is that instead of studying the convergence of Instab(k, n) one needs to consider the rescaled instability √n · Instab(k, n). One can prove that the rescaled instability converges in distribution, and the limit distribution depends on k. In particular, the means of the limit distributions are different for different values of k. This can be formalized as follows.
|
Theorem 3.2 (Convergence of rescaled stability). Assume that the probability distribution P has a density p. Consider a fixed parameter K, and assume that the corresponding limit objective function Q_K^{(∞)} has a unique global minimum c^{(*)} = (c_1^{(*)}, . . . , c_K^{(*)}). The boundary between clusters i and j is denoted by B_ij. Let m ∈ N, and S_{n,1}, . . . , S_{n,2m} be samples of size n drawn independently from P. Let C_K(S_{n,i}) be the result of the idealized K-means clustering on sample S_{n,i}. Compute the instability as the mean distance between clusterings of disjoint pairs of samples, that is:

$$\mathrm{Instab}(K, n) := \frac{1}{m} \sum_{i=1}^{m} d_{\mathrm{MM}}\big(C_K(S_{n,2i-1}), C_K(S_{n,2i})\big). \tag{3.3}$$

Then, as n → ∞ and m → ∞, the rescaled instability √n · Instab(K, n) converges in probability to

$$\mathrm{RInstab}(K) := \sum_{1 \le i < j \le K} \int_{B_{ij}} \frac{V_{ij}}{\|c_i^{(*)} - c_j^{(*)}\|}\, p(x)\, dx, \tag{3.4}$$

where V_ij stands for a term describing the asymptotics of the random fluctuations of the cluster boundary between cluster i and cluster j (exact formula given in [23, 24]).
|
Note that even though the definition of instability in Equation (3.3) differs slightly from the definition in Equation (2.1), intuitively it measures the same quantity. The definition in Equation (3.3) just has the technical advantage that all pairs of samples are independent from one another.

Proof sketch. It is well known that if Q_K^{(∞)} has a unique global minimum, then the centers constructed by the idealized K-means algorithm on a finite sample satisfy a central limit theorem [21]. That is, if we rescale the distances between the sample-based centers and the true centers with the factor √n, these rescaled distances converge to a normal distribution as n → ∞. When the cluster centers converge, the same can be said about the cluster boundaries. In this case, instability essentially counts how many points change side when the cluster boundaries move by some small amount. The points that potentially change side are the points close to the boundary of the true limit clustering. Counting these points is what the integrals ∫_{B_ij} · · · p(x) dx in the definition of RInstab take care of. The exact characterization of how the cluster boundaries "jitter" can be derived from the central limit theorem. This leads to the term V_ij / ‖c_i^{(*)} − c_j^{(*)}‖ in the integral. V_ij characterizes how the cluster centers themselves "jitter". The normalization ‖c_i^{(*)} − c_j^{(*)}‖ is needed to transform jittering of cluster centers to jittering of cluster boundaries: if two cluster centers are very far apart from each other, the cluster boundary only jitters by a small amount if the centers move by ε, say. However, if the centers are very close to each other (say, they have distance 3ε), then moving the centers by ε has a large impact on the cluster boundary. The details of this proof are very technical; we refer the interested reader to references [23, 24].
|
Let us briefly explain how the result in Theorem 3.2 is compatible with the result in Theorem 3.1. On a high level, the difference between both results resembles the difference between the law of large numbers and the central limit theorem in probability theory. The LLN studies the convergence of the mean of a sum of random variables to its expectation (note that Instab has the form of a sum of random variables). The CLT is concerned with the same expression, but rescaled with a factor √n. For the rescaled sum, the CLT then gives results on the convergence in distribution. Note that in the particular case of instability, the distribution of distances lives on the non-negative numbers only. This is why the rescaled instability in Theorem 3.2 is positive and not 0 as in the limit of Instab in Theorem 3.1. A toy figure explaining the different convergence processes can be seen in Figure 3.2.

Theorem 3.2 tells us that different parameters k usually lead to different rescaled stabilities in the limit for n → ∞. Thus we can hope that if the sample size n is large enough we can distinguish between different values of k based on the stability of the corresponding clusterings. An important question is now which values of k lead to stable and which ones lead to instable results, for a given distribution P.
|
3.1.3 Characterizing Stable Clusterings
|
It is a straightforward consequence of Theorem 3.2 that if we consider different values k_1 and k_2 and the clustering objective functions Q_{k_1}^{(∞)} and Q_{k_2}^{(∞)} have unique global minima, then the rescaled stability values RInstab(k_1) and RInstab(k_2) can differ from each other. Now we want to investigate which values of k lead to high stability and which ones lead to low stability.

Fig. 3.2 Different convergence processes. The left column shows the convergence studied in Theorem 3.1. As the sample size n → ∞, the distribution of distances d_MM(C, C') is degenerate, all mass is concentrated on 0. The right column shows the convergence studied in Theorem 3.2. The rescaled distances converge to a non-trivial distribution, and its mean (depicted by the cross) is positive. To go from the left to the right side one has to rescale by √n.

Conclusion 3.3 (Instable clusterings). Assume that Q_K^{(∞)} has a unique global optimum. If Instab(K, n) is large, the idealized K-means clustering tends to have cluster boundaries in high-density regions of the space.
|
There exist two different derivations of this conclusion, which have been obtained independently from each other by completely different methods [3, 22]. On a high level, the reason why the conclusion tends to hold is that if cluster boundaries jitter in a region of high density, then more points "change side" than if the boundaries jitter in a region of low density.

First derivation, informal, based on references [22, 24]. Assume that n is large enough such that we are already in the asymptotic regime (that is, the solution c^{(n)} constructed on the finite sample is close to the true population solution c^{(*)}). Then the rescaled instability computed on the sample is close to the expression given in Equation (3.4). If the cluster boundaries B_ij lie in a high-density region of the space, then the integral in Equation (3.4) is large, compared to a situation where the cluster boundaries lie in low-density regions of the space. From a high level point of view, this justifies the conclusion above. However, note that it is difficult to identify how exactly the quantities p, B_ij, and V_ij influence RInstab, as they are not independent of each other.

Second derivation, more formal, based on Ben-David and von Luxburg [3]. A formal way to prove the conclusion is as follows. We introduce a new distance d_boundary between two clusterings. This distance measures how far the cluster boundaries of two clusterings are apart from each other. One can prove that the K-means quality function Q_K^{(∞)} is continuous with respect to this distance function. This means that if two clusterings C, C' are close with respect to d_boundary, then they have similar quality values. Moreover, if Q_K^{(∞)} has a unique global optimum, we can invert this argument and show that if a clustering C is close to the optimal limit clustering C^*, then the distance d_boundary(C, C^*) is small. Now consider the clustering C^{(n)} based on a sample of size n. One can prove the following key statement. If C^{(n)} converges uniformly (over the space of all probability distributions) in the sense that with probability at least 1 − δ we have d_boundary(C^{(n)}, C) ≤ γ, then:

$$\mathrm{Instab}(K, n) \le 2\delta + P(T_\gamma(B)). \tag{3.5}$$

Here P(T_γ(B)) denotes the probability mass of a tube of width γ around the cluster boundaries B of C. Results in [1] establish the uniform convergence of the idealized K-means algorithm. This proves the conjecture: Equation (3.5) shows that if Instab is high, then there is a lot of mass around the cluster boundaries, namely the cluster boundaries are in a region of high density.
|
For stable clusterings, the situation is not as simple. It is tempting to make the following conjecture. |
|
Conjecture 3.4 (Stable clusterings). Assume that Q_K^{(∞)} has a unique global optimum. If Instab(K, n) is "small", the idealized K-means clustering tends to have cluster boundaries in low-density regions of the space.

Argument in favor of the conjecture: As in the first approach above, considering the limit expression of RInstab reveals that if the cluster boundary lies in a low-density area of the space, then the integral in RInstab tends to have a low value. In the extreme case where the cluster boundaries go through a region of zero density, the rescaled instability is even 0.

Argument against the conjecture: counter-examples! One can construct artificial examples where clusterings are stable although their decision boundary lies in a high-density region of the space [3]. The way to construct such examples is to ensure that the variations of the cluster centers happen in parallel to cluster boundaries and not orthogonal to cluster boundaries. In this case, the sampling variation does not lead to jittering of the cluster boundary, hence the result is rather stable.

These counter-examples show that Conjecture 3.4 cannot be true in general. However, my personal opinion is that the counter-examples are rather artificial, and that similar situations will rarely be encountered in practice. I believe that the conjecture "tends to hold" in practice. It might be possible to formalize this intuition by proving that the statement of the conjecture holds on a subset of "nice" and "natural" probability distributions. The important consequence of Conclusion 3.3 and Conjecture 3.4 (if true) is the following.
|
Conclusion 3.5 (Stability of idealized K-means detects whether K is too large). Assume that the underlying distribution P has K well-separated clusters, and assume that these clusters can be represented by a center-based clustering model. Then the following statements tend to hold for the idealized K-means algorithm. |
|
(1) If K is too large, then the clusterings obtained by the idealized K-means algorithm tend to be instable.
(2) If K is correct or too small, then the clusterings obtained by the idealized K-means algorithm tend to be stable (unless the objective function has several global minima, for example due to symmetries).
|
|
Given Conclusion 3.3 and Conjecture 3.4 it is easy to see why Conclusion 3.5 is true. If K is larger than the correct number of clusters, one necessarily has to split a true cluster into several smaller clusters. The corresponding boundary goes through a region of high density (the cluster which is being split). According to Conclusion 3.3 this leads to instability. If K is correct, then the idealized (!) K-means algorithm discovers the correct clustering and thus has decision boundaries between the true clusters, that is in low-density regions of the space. If K is too small, then the K-means algorithm has to group clusters together. In this situation, the cluster boundaries are still between true clusters, hence in a low-density region of the space.



