The clustering method used is built around k-means (see, for example, Duda et al., 2001) and is aimed at smaller data sets. An overview of the algorithm is as follows (a minimal code sketch follows the list):
1. initialize the first 2 clusters to run k-means with k = 2
2. execute a single run of k-means
3. measure the quality of every cluster individually and remove the clusters that pass the test from the data
4. retain the centers of all bad clusters and split one of them in two by adding a new center
5. repeat from step 2 with the new value of k, until every cluster has passed the test.
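As a rough illustration of this loop, here is a minimal sketch in Python, assuming scikit-learn's KMeans for step 2; `cluster_until_normal` and `max_k` are illustrative names, `cluster_is_normal` stands for the quality test described further below, and random restarts (n_init=10) stand in for the deterministic center reuse discussed next:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_until_normal(data, cluster_is_normal, max_k=50):
    """Grow k until every cluster passes the per-cluster quality test.

    `cluster_is_normal` is the Anderson-Darling-based test sketched further
    below; accepted clusters are set aside and k-means continues on the rest.
    """
    accepted = []
    k = 2                                               # step 1
    while len(data) > 0 and k <= max_k:
        km = KMeans(n_clusters=k, n_init=10).fit(data)  # step 2
        bad = []
        for j in range(k):
            members = data[km.labels_ == j]
            if cluster_is_normal(members):              # step 3: accept and remove
                accepted.append(members)
            else:
                bad.append(members)
        if not bad:
            break                                       # every cluster passed
        data = np.vstack(bad)                           # keep unexplained data only
        k = len(bad) + 1                                # step 4: one cluster splits in two
    return accepted
```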
The initialization of k-means can be done either randomly (which means k-means has to be run multiple times) or by taking two centers with maximum separation. The latter option avoids random outcomes of k-means and gives fixed, reproducible results. For consecutive runs of k-means, the centers of the previously found (and not accepted) clusters are reused, and one more center is added inside one of those clusters, again as far away from that cluster's center as possible.
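A sketch of both deterministic choices, under the assumption that the full pairwise distance matrix is affordable (the method targets smaller data sets); the function names are illustrative:

```python
import numpy as np

def two_most_separated(data):
    """Deterministic initialization: the two points with maximum separation."""
    d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    return np.stack([data[i], data[j]])

def split_center(members, center):
    """New center for a split: the member farthest from its current center."""
    return members[np.argmax(np.linalg.norm(members - center, axis=1))]
```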
The quality of a cluster is measured using a normality test on the data projected onto the direction of its principal component (following the g-means method suggested in Hamerly and Elkan (2003)). Using the principal component has the advantage that the test can be performed quickly (it is only one-dimensional) and that the data does not have to resemble a perfect normal distribution in all dimensions, which would require a larger number of samples. This method also allows most ellipsoidal clusters to be found. The normality test used is the Anderson-Darling statistic (Anderson and Darling, 1952; D'Agostino and Stephens, 1986; Romeu, 2003; Stephens, 1974), given by

A^2 = -n - \frac{1}{n} \sum_{i=1}^{n} (2i - 1) \left[ \ln z_i + \ln(1 - z_{n+1-i}) \right]
with z_i = F(x_i), where F is the cumulative distribution function of the N(0,1) normal distribution and x_i the standardized data in ascending order. A small modification is necessary because the sample mean and variance are used (D'Agostino and Stephens, 1986):

A_*^2 = A^2 \left( 1 + \frac{4}{n} - \frac{25}{n^2} \right)
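Translated directly into Python (NumPy/SciPy), for instance; the clipping of z is a numerical guard against log(0) and not part of the formula:

```python
import numpy as np
from scipy.stats import norm

def anderson_darling(x):
    """Corrected Anderson-Darling statistic A^2_* for a 1-d sample."""
    n = len(x)
    x = np.sort(x)
    z = norm.cdf((x - x.mean()) / x.std(ddof=1))  # z_i = F(x_i), data standardized
    z = np.clip(z, 1e-12, 1 - 1e-12)              # numerical guard against log(0)
    i = np.arange(1, n + 1)
    a2 = -n - np.mean((2 * i - 1) * (np.log(z) + np.log(1 - z[::-1])))
    return a2 * (1 + 4.0 / n - 25.0 / n ** 2)     # small-sample correction
```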
The critical values need to be looked up in a table, since the p-values are not trivial to evaluate. The program uses a default critical value of 0.9, which corresponds to an alpha value of around 0.03. The exact value is not truly important, since the goal is not to fit normal distributions but to find acceptable clustering results (of course, the value used should always be documented with the results).
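Combining the projection and the test, a sketch of the per-cluster check (reusing `anderson_darling` from the sketch above; cv = 0.9 mirrors the program's default):

```python
import numpy as np

def cluster_is_normal(members, cv=0.9):
    """Quality test: project onto the principal component, then test normality."""
    centered = members - members.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)  # principal axes
    projected = centered @ vt[0]   # 1-d projection onto the first component
    return anderson_darling(projected) <= cv
```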
Citations:
Anderson, T.W. and D.A. Darling (1952). Asymptotic theory of certain "goodness of fit" criteria based on stochastic processes. The Annals of Mathematical Statistics, 23(2):193-212.
D'Agostino, R.B. and M.A. Stephens (1986). Goodness-Of-Fit Techniques (Statistics, a Series of Textbooks and Monographs). Marcel Dekker.
Duda, R.O., P.E. Hart, and D.G. Stork (2001). Pattern Classification (2nd ed.). Wiley-Interscience.
Hamerly, G. and C. Elkan (2003). Learning the k in k-means. In Advances in Neural Information Processing Systems.
Romeu, J.L. (2003). Anderson-Darling: A goodness of fit test for small samples assumptions. Technical Report 3, RAC START.
Stephens, M.A. (1974). EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association, 69(347):730-737.