Abstract: Recent advances in scientific data collection techniques in the era of big data allow large quantities of data to be accumulated systematically at various data-capturing sites. Similarly, exponential growth in the development of data analysis approaches has been reported in the literature, among which the K-means algorithm remains the most popular and straightforward clustering algorithm. The broad applicability of the algorithm in many clustering application areas can be attributed to its implementation simplicity and low computational complexity. However, the K-means algorithm has many challenges that negatively affect its clustering performance. In the algorithm's initialization process, users must specify the number of clusters in a given dataset a priori, while the initial cluster centers are randomly selected. Furthermore, the algorithm's performance is sensitive to the selection of these initial cluster centers, and for large datasets, determining the optimal number of clusters to start with is a very challenging task. Moreover, because of the algorithm's greedy nature, the random selection of the initial cluster centers sometimes results in convergence to a local minimum. A further limitation is that the similarity between data objects is determined from their features using the Euclidean distance metric, which limits the algorithm's robustness in detecting clusters of other shapes and poses a great challenge in detecting overlapping clusters. Many research efforts to improve the K-means algorithm's performance and robustness have been reported in the literature. The current work presents an overview and taxonomy of the K-means clustering algorithm and its variants. The history of K-means, current trends, open issues and challenges, and recommended future research perspectives are also discussed.
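
The sensitivity to initial cluster centers described in the abstract can be illustrated with a minimal sketch of the standard K-means iteration (assign each point to its nearest center by Euclidean distance, then recompute each center as the mean of its points). This is not code from the surveyed work; the function names `sq_euclidean` and `kmeans`, the toy dataset `POINTS`, and the two hand-picked initializations are assumptions chosen only for illustration.

```python
def sq_euclidean(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_euclidean(p, centers[j]))
            for p in points]


def kmeans(points, init_centers, max_iter=100):
    """Plain K-means; k is fixed in advance by len(init_centers)."""
    centers = [tuple(float(x) for x in c) for c in init_centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # New center is the coordinate-wise mean of its members.
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                # Keep the old center if a cluster ends up empty.
                new_centers.append(centers[j])
        if new_centers == centers:
            break  # no center moved: the greedy iteration has converged
        centers = new_centers
    labels = assign(points, centers)
    # Sum of squared errors: the objective K-means greedily decreases.
    sse = sum(sq_euclidean(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, sse


# Hypothetical toy data: four corners of a 4 x 1 rectangle, k = 2.
POINTS = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Initial centers on opposite short sides: converges to the left/right split.
_, _, sse_good = kmeans(POINTS, [(0, 0), (4, 0)])
# Initial centers on the same short side: stuck in the top/bottom split.
_, _, sse_bad = kmeans(POINTS, [(0, 0), (0, 1)])

print(sse_good)  # 1.0
print(sse_bad)   # 16.0
```

Both runs converge, but the second stops at a local minimum whose error is sixteen times larger, which is why practical variants either restart from several random initializations and keep the best result or replace the random selection with a more careful seeding scheme.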

Performance Evaluation of Threshold-Based and k-means Clustering Algorithms Using Iris Dataset

Performance evaluation of K-means clustering algorithm with various distance metrics

An Analytical Study on Behavior of Clusters Using K Means, EM and K* Means Algorithm

Performance evaluation of some clustering algorithms and validity indices

Performance analysis of Kmeans with modified initial centroid selection algorithms and developed Kmeans9+ model

The k-means Algorithm: A Comprehensive Survey and Performance Evaluation

K-means Clustering Algorithms: A Comprehensive Review, Variants Analysis, and Advances in the Era of Big Data

Research on K-Value Selection Method of K-Means Clustering Algorithm

Performance Evaluation of Simple K-Mean and Parallel K-Mean Clustering Algorithms: Big Data Business Process Management Concept

Comparison of Spectral Clustering, K-clustering and Hierarchical Clustering on E-Nose Datasets: Application to the Recognition of Material Freshness, Adulteration Levels and Pretreatment Approaches for Tomato Juices

Data clustering with modified K-means algorithm

Unique Metric for Health Analysis with Optimization of Clustering Activity and Cross Comparison of Results from Different Approach

Performance Analysis of Clustering Algorithms for Gene Expression Data

Effects of similarity/distance metrics on k-means algorithm with respect to its applications in IoT and multimedia: a review

Evaluating and Validating Cluster Results

Min max kurtosis distance based improved initial centroid selection approach of K-means clustering for big data mining on gene expression data

Clustering Large Datasets by Merging K-Means Solutions

Accuracy Evaluation of Overlapping and Multi-resolution Clustering Algorithms on Large Datasets

Research issues on K-means Algorithm : An Experimental Trial Using Matlab

Silhouette Analysis for Performance Evaluation in Machine Learning with Applications to Clustering