Abstract: One of the applications of center-based clustering algorithms such as K-Means is partitioning data points into K clusters. In some examples, the feature space relates to the underlying problem we are trying to solve, and sometimes we can obtain a suitable feature space. Nevertheless, while K-Means is one of the most efficient offline clustering algorithms, it is not equipped to estimate the number of clusters, which is useful in some practical cases. Other practical methods which do are simply too complex, as they require at least one run of K-Means for each possible K. In order to address this issue, we propose a K-Means initialization similar to K-Means++, which would be able to estimate K based on the feature space while finding suitable initial centroids for K-Means in a deterministic manner. Then we compare the proposed method, DISCERN, with a few of the most practical K estimation methods, while also comparing clustering results of K-Means when initialized randomly, using K-Means++ and using DISCERN. The results show improvement in both the estimation and final clustering performance.

What problem does this paper attempt to address?

The main problems that this paper attempts to solve are two key challenges of the K-Means clustering algorithm in practical applications:

1. **The problem of estimating the number of clusters (K)**: The traditional K-Means algorithm requires the number of clusters K to be specified in advance, but in many practical situations, this parameter is unknown. Some existing methods such as the Elbow Method, the Silhouette Method and X-Means can be used to estimate K, but they usually need to run the K-Means algorithm multiple times, which leads to high computational complexity and time cost.
2. **The problem of selecting the initial centroids**: K-Means is very sensitive to the selection of initial centroids, and different initial values may lead to different clustering results. K-Means++ improves performance by improving the selection of initial centroids, but it still depends on the known K value and is random, and may get slightly different results each time it is run.

To solve the above problems, the paper proposes a new diversity-based centroid selection method, DISCERN (Diversity-based Selection of Centroids for k-Estimation and Rapid Non-stochastic Clustering). The main features of DISCERN are as follows:

- **Deterministic**: DISCERN is a deterministic method and will produce the same result regardless of the order of data points.
- **Automatic K-value estimation**: DISCERN can estimate the appropriate number of clusters K while selecting the initial centroids without prior knowledge of the specific value of K.
- **Efficient**: Compared with existing K-estimation methods, DISCERN does not need to run K-Means multiple times, thus reducing computational complexity and time cost.

Specifically, DISCERN achieves these goals through the following steps:

1. **Similarity pre-calculation**: First, calculate the similarity matrix S between data points, using cosine similarity as a metric.
2. **Diversity-based selection**: Select the least similar data points as the initial centroids, and iteratively select new centroids that are most different from the existing centroids.
3. **Estimating the number of clusters**: Analyze the change trend of the vector pℓ generated during the selection process, calculate its curvature κ(R), and find the point with the smallest curvature as the estimated K value.

Through this method, DISCERN can not only effectively initialize the centroids of K-Means, but also accurately estimate the number of clusters, thereby improving the overall performance and stability of clustering.
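
The similarity pre-calculation and diversity-based selection steps described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `cosine_similarity_matrix` and `select_diverse_centroids` are hypothetical, the rule "pick the point whose highest similarity to any chosen centroid is lowest" is one plausible reading of "most different from the existing centroids", and the number of centroids `k` is passed in by hand because the curvature-based estimation of K is not specified in enough detail here to reproduce.

```python
import numpy as np


def cosine_similarity_matrix(X):
    """Return the n x n matrix of pairwise cosine similarities."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    Xn = X / norms
    return Xn @ Xn.T


def select_diverse_centroids(X, k):
    """Pick k row indices of X by a greedy diversity rule (assumed reading).

    Starts from the least similar pair of points, then repeatedly adds
    the point whose highest similarity to any already chosen centroid
    is lowest. No random numbers are used; ties are broken by index.
    """
    S = cosine_similarity_matrix(X)
    n = S.shape[0]
    if k < 2 or k > n:
        raise ValueError("k must be between 2 and the number of points")

    # Least similar pair of points becomes the first two centroids.
    i, j = np.unravel_index(np.argmin(S), S.shape)
    chosen = [int(i), int(j)]

    while len(chosen) < k:
        # For every point: similarity to its most similar chosen centroid.
        closest = S[:, chosen].max(axis=1)
        closest[chosen] = np.inf  # never re-select a centroid
        chosen.append(int(np.argmin(closest)))
    return chosen


if __name__ == "__main__":
    X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0]])
    print(select_diverse_centroids(X, 3))
```

The selected rows, `X[chosen]`, could then be handed to a K-Means implementation as explicit starting centroids, for example through the `init` parameter of scikit-learn's `KMeans` with `n_init=1`. Note that this sketch breaks ties by index, so unlike the order-independence claimed for DISCERN, its output can depend on row order when similarities tie.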

DISCERN: Diversity-based Selection of Centroids for k-Estimation and Rapid Non-stochastic Clustering

Subspace Clustering by Directly Solving Discriminative K-means

An Initial Seed Selection Algorithm for K-means Clustering of Georeferenced Data to Improve Replicability of Cluster Assignments for Mapping Application

Determinantal consensus clustering

$k$-Median Clustering via Metric Embedding: Towards Better Initialization with Differential Privacy

Fuzzy K-Means Clustering With Discriminative Embedding

initKmix -- A Novel Initial Partition Generation Algorithm for Clustering Mixed Data using k-means-based Clustering

K and starting means for k-means algorithm

Careful seeding for the k-medoids algorithm with incremental k++ cluster construction

Stable Initialization Scheme for K-means Clustering

K-means Clustering Algorithm with Improved Initial Center

Randomized Dimensionality Reduction for k-means Clustering

ModEx and Seed-Detective: Two novel techniques for high quality clustering by using good initial seeds in K-Means

K-expectiles clustering

Simultaneous Estimation of Number of Clusters and Feature Sparsity in Clustering High-Dimensional Data

Neighborhood Density Method for Selecting Initial Cluster Centers in K-Means Clustering

Faster K-Means Cluster Estimation

A Novel Effective Distance Measure and a Relevant Algorithm for Optimizing the Initial Cluster Centroids of K-means

A Scalable Algorithm for Individually Fair K-means Clustering

Rapid Clustering with Semi-Supervised Ensemble Density Centers

An Efficient K-Means Clustering Initialization Using Optimization Algorithm