Estimating the number of clusters in multivariate data by various fittings of the L-curve

Rida Moustafa, Ali S. Hadi
DOI: https://doi.org/10.1007/s40314-024-02839-8
2024-11-27
Computational and Applied Mathematics
Abstract: The goal of this paper is to estimate the true but unknown number of clusters K in multivariate data. The contributions are twofold. The first is to narrow the search space for the estimates to . We propose a new method for finding , which is better than the existing ones. The second is to propose three indices for computing within the range : the R-Index, the FB Index, and the CSum Index. All three indices are based on the L-curve (the plot of vs. k), where is the total within-cluster-similarity (withinness), for values of k in the above range. We give the rationale for each method. We investigate the performance of these three indices and compare them with six of the most commonly used indices using both real benchmark datasets and challenging synthetic data of varying sample sizes ( to ) and varying numbers of true clusters K ranging from to . We use both the hierarchical clustering and the k-Means clustering algorithms, but the approach can also be used with other clustering methods. The three indices are shown to outperform the existing ones. An additional advantage of our indices is computational complexity: they are shown to take much less time to compute than the existing ones.
mathematics, applied
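To make the L-curve mentioned in the abstract concrete, the following is a minimal sketch of how such a curve could be computed. It is not the authors' method: the R-Index, FB Index, and CSum Index are not reproduced here, and the withinness measure is assumed to be the within-cluster sum of squared Euclidean distances from a plain k-means (Lloyd) run. The function names `kmeans_wss` and `l_curve`, the toy data, and all parameter values are hypothetical choices made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_wss(points, k, n_init=5, n_iter=50, seed=0):
    """Smallest within-cluster sum of squares over n_init k-means restarts.

    Assumption: withinness is measured as the sum of squared distances
    from each point to its nearest cluster center.
    """
    rng = random.Random(seed)
    best = float("inf")
    for _ in range(n_init):
        # Initialize centers with k distinct data points.
        centers = rng.sample(points, k)
        for _ in range(n_iter):
            # Assignment step: attach each point to its nearest center.
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
                clusters[j].append(p)
            # Update step: move each center to the mean of its cluster;
            # an empty cluster keeps its previous center.
            new_centers = [
                tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
                for j, cl in enumerate(clusters)
            ]
            if new_centers == centers:
                break
            centers = new_centers
        wss = sum(min(sq_dist(p, c) for c in centers) for p in points)
        best = min(best, wss)
    return best


def l_curve(points, k_max):
    """Return [(k, withinness)] for k = 1..k_max."""
    return [(k, kmeans_wss(points, k)) for k in range(1, k_max + 1)]


# Toy data (hypothetical): three well-separated 2-D blobs of 30 points each.
data_rng = random.Random(42)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [
    (cx + data_rng.gauss(0, 0.5), cy + data_rng.gauss(0, 0.5))
    for cx, cy in blob_centers
    for _ in range(30)
]

curve = l_curve(points, 6)
for k, w in curve:
    print(k, round(w, 2))
```

On data like this the withinness typically drops sharply up to the true number of clusters and flattens afterwards, which gives the curve its "L" shape. Indices such as those described in the abstract would then be computed from these (k, withinness) pairs.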