DISCERN: Diversity-based Selection of Centroids for k-Estimation and Rapid Non-stochastic Clustering

Ali Hassani, Amir Iranmanesh, Mahdi Eftekhari, Abbas Salemi
DOI: https://doi.org/10.1007/s13042-020-01193-5
2020-09-22
Abstract: One of the applications of center-based clustering algorithms such as K-Means is partitioning data points into K clusters. In some examples, the feature space relates to the underlying problem we are trying to solve, and sometimes we can obtain a suitable feature space. Nevertheless, while K-Means is one of the most efficient offline clustering algorithms, it is not equipped to estimate the number of clusters, which is useful in some practical cases. Other practical methods which do are simply too complex, as they require at least one run of K-Means for each possible K. In order to address this issue, we propose a K-Means initialization similar to K-Means++, which would be able to estimate K based on the feature space while finding suitable initial centroids for K-Means in a deterministic manner. Then we compare the proposed method, DISCERN, with a few of the most practical K estimation methods, while also comparing clustering results of K-Means when initialized randomly, using K-Means++ and using DISCERN. The results show improvement in both the estimation and final clustering performance.
Machine Learning
What problem does this paper attempt to address?
The paper addresses two key challenges of the K-Means clustering algorithm in practical applications:

1. **Estimating the number of clusters (K)**: The traditional K-Means algorithm requires the number of clusters K to be specified in advance, but in many practical situations this parameter is unknown. Existing methods such as the Elbow Method, the Silhouette Method and X-Means can estimate K, but they usually need to run K-Means multiple times, which leads to high computational complexity and time cost.
2. **Selecting the initial centroids**: K-Means is very sensitive to the choice of initial centroids, and different initial values may lead to different clustering results. K-Means++ improves results through a better choice of initial centroids, but it still requires K to be known and is randomized, so it may give slightly different results each time it is run.

To solve these problems, the paper proposes a new diversity-based centroid selection method, DISCERN (Diversity-based Selection of Centroids for k-Estimation and Rapid Non-stochastic Clustering). The main features of DISCERN are as follows:

- **Deterministic**: DISCERN is a deterministic method and will produce the same result regardless of the order of data points.
- **Automatic K-value estimation**: DISCERN can estimate the appropriate number of clusters K while selecting the initial centroids, without prior knowledge of K.
- **Efficient**: Compared with existing K-estimation methods, DISCERN does not need to run K-Means multiple times, thus reducing computational complexity and time cost.

Specifically, DISCERN achieves these goals through the following steps:

1. **Similarity pre-calculation**: First, calculate the similarity matrix S between data points, using cosine similarity as the metric.
2. **Diversity-based selection**: Select the least similar data points as the initial centroids, and iteratively select new centroids that are most different from the existing centroids.
3. **Estimating the number of clusters**: Analyze the trend of the vector pℓ generated during the selection process, calculate its curvature κ(R), and find the point with the smallest curvature as the estimated K value.

In this way, DISCERN can not only initialize the centroids of K-Means effectively, but also estimate the number of clusters, thereby improving the overall performance and stability of clustering.
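
The first two steps above (similarity pre-calculation and diversity-based selection) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `cosine_similarity_matrix` and `select_diverse_centroids` are hypothetical, and the scoring rule used here (pick the point whose highest similarity to any already chosen centroid is lowest) is an assumption standing in for the paper's exact criterion. The sketch also takes `k` as an input and does not attempt the curvature-based estimation of K.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.clip(norms, 1e-12, None)  # guard against zero-length rows
    return unit @ unit.T

def select_diverse_centroids(X, k):
    """Greedy, non-random selection of k row indices of X.

    Assumed rule (not taken from the paper): start from the least
    similar pair, then repeatedly add the point whose highest
    similarity to any chosen centroid is lowest.
    """
    n = X.shape[0]
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= number of points")
    S = cosine_similarity_matrix(X)

    # Step 1: the least similar pair seeds the centroid set.
    i, j = np.unravel_index(np.argmin(S), S.shape)
    chosen = [int(i), int(j)]

    # closest[p] = highest similarity of point p to any chosen centroid.
    closest = np.maximum(S[i], S[j])
    closest[chosen] = np.inf  # never pick the same point twice

    # Step 2: keep adding the point most different from the chosen set.
    while len(chosen) < k:
        nxt = int(np.argmin(closest))
        chosen.append(nxt)
        closest = np.maximum(closest, S[nxt])
        closest[nxt] = np.inf
    return chosen[:k]
```

The selected rows, `X[select_diverse_centroids(X, k)]`, could then be passed as explicit initial centroids to a K-Means implementation that accepts them. Because no random numbers are drawn, repeated calls on the same array return the same indices.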