K-means clustering versus validation measures

Hui Xiong,Junjie Wu,Jian Chen

2009-01-01

K-means Clustering Versus Validation Measures: a Data-Distribution Perspective

Hui Xiong,Junjie Wu,Jian Chen

DOI: https://doi.org/10.1109/tsmcb.2008.2004559

2008-01-01

IEEE Transactions on Systems Man and Cybernetics Part B (Cybernetics)

Abstract:K-means is a well-known and widely used partitional clustering method. While there are considerable research efforts to characterize the key features of the K-means clustering algorithm, further investigation is needed to understand how data distributions can have impact on the performance of K-means clustering. To that end, in this paper, we provide a formal and organized study of the effect of skewed data distributions on K-means clustering. Along this line, we first formally illustrate that K-means tends to produce clusters of relatively uniform size, even if input data have varied "true" cluster sizes. In addition, we show that some clustering validation measures, such as the entropy measure, may not capture this uniform effect and provide misleading information on the clustering performance. Viewed in this light, we provide the coefficient of variation (CV) as a necessary criterion to validate the clustering results. Our findings reveal that K-means tends to produce clusters in which the variations of cluster sizes, as measured by CV, are in a range of about 0.3-1.0. Specifically, for data sets with large variation in "true" cluster sizes (e.g., CV > 1.0), K-means reduces variation in resultant cluster sizes to less than 1.0. In contrast, for data sets with small variation in "true" cluster sizes (e.g., CV < 0.3), K-means increases variation in resultant cluster sizes to greater than 0.3. In other words, for the earlier two cases, K-means produces the clustering results which are away from the "true" cluster distributions.
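
The CV criterion described in this abstract can be illustrated with a minimal Python sketch. It assumes CV is the sample standard deviation of the cluster sizes divided by their mean; the abstract does not state the formula, so this definition is an assumption. The function name `cluster_size_cv` and both label lists are hypothetical, chosen only to mimic a skewed "true" distribution and a more uniform K-means-like result.

```python
from collections import Counter
from statistics import mean, stdev


def cluster_size_cv(labels):
    """Coefficient of variation of cluster sizes (assumed: sample stdev / mean)."""
    sizes = list(Counter(labels).values())
    if len(sizes) < 2:
        raise ValueError("need at least two clusters to measure size variation")
    return stdev(sizes) / mean(sizes)


# Hypothetical "true" classes with highly skewed sizes: 900, 80, 20.
true_labels = ["a"] * 900 + ["b"] * 80 + ["c"] * 20
# Hypothetical clustering output with more uniform sizes: 450, 300, 250.
found_labels = [0] * 450 + [1] * 300 + [2] * 250

cv_true = cluster_size_cv(true_labels)
cv_found = cluster_size_cv(found_labels)
print(f"CV of true class sizes:  {cv_true:.2f}")
print(f"CV of found cluster sizes: {cv_found:.2f}")
```

With these made-up sizes the CV drops from about 1.47 to about 0.31, which is the kind of shift toward the 0.3-1.0 range that the abstract attributes to K-means. Comparing the two values side by side is the check the abstract proposes in addition to measures such as entropy.
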
Adapting the Right Measures for K-means Clustering

Junjie Wu,Hui Xiong,Jian Chen

DOI: https://doi.org/10.1145/1557019.1557115

2009-01-01

Abstract:Clustering validation is a long standing challenge in the clustering literature. While many validation measures have been developed for evaluating the performance of clustering algorithms, these measures often provide inconsistent information about the clustering performance and the best suitable measures to use in practice remain unknown. This paper thus fills this crucial void by giving an organized study of 16 external validation measures for K-means clustering. Specifically, we first introduce the importance of measure normalization in the evaluation of the clustering performance on data with imbalanced class distributions. We also provide normalization solutions for several measures. In addition, we summarize the major properties of these external measures. These properties can serve as the guidance for the selection of validation measures in different application scenarios. Finally, we reveal the interrelationships among these external measures. By mathematical transformation, we show that some validation measures are equivalent. Also, some measures have consistent validation performances. Most importantly, we provide a guide line to select the most suitable validation measures for K-means clustering.
Combining multiple clusterings via k-modes algorithm

Huilan Luo,Fansheng Kong,Yixiao Li

DOI: https://doi.org/10.1007/11811305_34

2006-01-01

Abstract:Clustering ensembles have emerged as a powerful method for improving both the robustness and the stability of unsupervised classification solutions. However, finding a consensus clustering from multiple partitions is a difficult problem that can be approached from graph-based, combinatorial or statistical perspectives. A consensus scheme via the k-modes algorithm is proposed in this paper. A combined partition is found as a solution to the corresponding categorical data clustering problem using the k-modes algorithm. This study compares the performance of the k-modes consensus algorithm with other fusion approaches for clustering ensembles. Experimental results demonstrate the effectiveness of the proposed method.
External Validation Measures for K-means Clustering: A Data Distribution Perspective

Junjie Wu,Jian Chen,Hui Xiong,Ming Xie

DOI: https://doi.org/10.1016/j.eswa.2008.06.093

IF: 8.5

2009-01-01

Expert Systems with Applications

Abstract:Cluster validation is an important part of any cluster analysis. External measures such as entropy, purity and mutual information are often used to evaluate K-means clustering. However, whether these measures are indeed suitable for K-means clustering remains unknown. Along this line, in this paper, we show that a data distribution view is of great use to selecting the right measures for K-means clustering. Specifically, we first introduce the data distribution view of K-means, and the resultant uniform effect on highly imbalanced data sets. Eight external measures widely used in recent data mining tasks are also collected as candidates for K-means evaluation. Then, we demonstrate that only three measures, namely the variation of information (VI), the van Dongen criterion (VD) and the Mirkin metric (M), can detect the negative uniform effect of K-means in the clustering results. We also provide new normalization schemes for these three measures, i.e., VInorm′, VDnorm′ and Mnorm′, which enables the cross-data comparisons of clustering qualities. Finally, we explore some properties such as the consistency and sensitivity of the three measures, and give some advice on how to use them in K-means practice.
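
As a rough illustration of the three measures named in this abstract, the sketch below computes them from the class-by-cluster contingency counts. The abstract gives no formulas, so the code assumes the commonly cited unnormalized forms: VI as H(C) + H(K) - 2I(C, K), the van Dongen criterion as 2n minus the sums of row-wise and column-wise maxima, and the Mirkin metric as the sum of squared row totals plus squared column totals minus twice the sum of squared cell counts. Scaling conventions differ between papers, and the normalized variants mentioned in the abstract are not reproduced here. The function name `vi_vd_mirkin` is hypothetical.

```python
from collections import Counter
from math import log2


def vi_vd_mirkin(classes, clusters):
    """Return (VI, van Dongen, Mirkin) in their assumed unnormalized forms."""
    n = len(classes)
    joint = Counter(zip(classes, clusters))  # contingency counts n_ij
    row = Counter(classes)                   # class sizes n_i.
    col = Counter(clusters)                  # cluster sizes n_.j

    # Variation of information from the two entropies and the mutual information.
    h_row = -sum(c / n * log2(c / n) for c in row.values())
    h_col = -sum(c / n * log2(c / n) for c in col.values())
    mi = sum(c / n * log2(c * n / (row[i] * col[j])) for (i, j), c in joint.items())
    vi = h_row + h_col - 2 * mi

    # van Dongen criterion: best overlap per class and per cluster.
    best_row, best_col = Counter(), Counter()
    for (i, j), c in joint.items():
        best_row[i] = max(best_row[i], c)
        best_col[j] = max(best_col[j], c)
    vd = 2 * n - sum(best_row.values()) - sum(best_col.values())

    # Mirkin metric from squared marginals and squared cell counts.
    mirkin = (sum(c * c for c in row.values())
              + sum(c * c for c in col.values())
              - 2 * sum(c * c for c in joint.values()))
    return vi, vd, mirkin


# Identical partitions score zero on all three; crossed partitions do not.
same = vi_vd_mirkin(["a", "a", "b", "b"], [0, 0, 1, 1])
crossed = vi_vd_mirkin(["a", "a", "b", "b"], [0, 1, 0, 1])
print(same, crossed)
```

All three are distances between partitions, so lower values mean closer agreement with the reference classes.
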
CVAP: Validation for Cluster Analyses

Kaijun Wang,Baijie Wang,Liuqing Peng

DOI: https://doi.org/10.2481/dsj.007-020

2009-01-01

Data Science Journal

Abstract:Evaluation of clustering results (or cluster validation) is an important and necessary step in cluster analysis, but it is often time-consuming and complicated work. We present a visual cluster validation tool, the Cluster Validity Analysis Platform (CVAP), to facilitate cluster validation. The CVAP provides necessary methods (e.g., many validity indices, several clustering algorithms and procedures) and an analysis environment for clustering, evaluation of clustering results, estimation of the number of clusters, and performance comparison among different clustering algorithms. It can help users accomplish their clustering tasks faster and easier and help achieve good clustering quality when there is little prior knowledge about the cluster structure of a data set.
Research on K-Value Selection Method of K-Means Clustering Algorithm

Haitao Yang,Chunhui Yuan

DOI: https://doi.org/10.3390/J2020016

2019-06-18

Abstract:Among many clustering algorithms, the K-means clustering algorithm is widely used because of its simple algorithm and fast convergence. However, the K-value of clustering needs to be given in advance and the choice of K-value directly affects the convergence result. To solve this problem, we mainly analyze four K-value selection algorithms, namely Elbow Method, Gap Statistic, Silhouette Coefficient, and Canopy; give the pseudo code of the algorithm; and use the standard data set Iris for experimental verification. Finally, the verification results are evaluated, the advantages and disadvantages of the above four algorithms in a K-value selection are given, and the clustering range of the data set is pointed out.

Mathematics,Computer Science
Fuzzy C-Means Clustering Validity Function Based on Multiple Clustering Performance Evaluation Components

Guan Wang,Jie-Sheng Wang,Hong-Yu Wang

DOI: https://doi.org/10.1007/s40815-021-01243-2

IF: 4.085

2022-02-21

International Journal of Fuzzy Systems

Abstract:Clustering is the process of grouping a set of physical or abstract objects into multiple similar objects. Fuzzy C-means (FCM) clustering is one of the most widely used clustering methods, whose main research goal is to find the optimal clustering number of data sets, which is related to whether the data can be effectively divided. The study of clustering validity function is the process of evaluating the clustering quality and determining the optimal clustering number. Based on the idea of components, six cluster performance evaluation components are proposed to define compactness, variation, similarity, overlap and separation of data sets, respectively. Then a new validity function based on FCM clustering algorithm is synthesized by these six components. Finally, the proposed validity function and eight typical validity functions are compared on five artificial data sets and eight UCI data sets. The simulation results show that the proposed clustering validity function can evaluate the clustering results more effectively and determine the optimal clustering number of different data sets.

computer science, information systems,automation & control systems, artificial intelligence
Validation of Overlapping Clustering: A Random Clustering Perspective

Junjie Wu,Hua Yuan,Hui Xiong,Guoqing Chen

DOI: https://doi.org/10.1016/j.ins.2010.07.028

IF: 8.1

2010-01-01

Information Sciences

Abstract:As a widely used clustering validation measure, the F-measure has received increased attention in the field of information retrieval. In this paper, we reveal that the F-measure can lead to biased views as to results of overlapped clusters when it is used for validating the data with different cluster numbers (incremental effect) or different prior probabilities of relevant documents (prior-probability effect). We propose a new “IMplication Intensity” (IMI) measure which is based on the F-measure and is developed from a random clustering perspective. In addition, we carefully investigate the properties of IMI. Finally, experimental results on real-world data sets show that IMI significantly alleviates biased incremental and prior-probability effects which are inherent to the F-measure.
Kernel k'-means algorithm for clustering analysis

Yue Zhao,Shuyi Zhang,Jinwen Ma

DOI: https://doi.org/10.1007/978-3-642-39482-9_27

2013-01-01

Abstract:k'-means algorithm is a new improvement of k-means algorithm. It implements a rewarding and penalizing competitive learning mechanism into the k-means paradigm such that the number of clusters can be automatically determined for a given dataset. This paper further proposes the kernelized versions of k'-means algorithms with four different discrepancy metrics. It is demonstrated by the experiments on both synthetic and real-world datasets that these kernel k'-means algorithms can automatically detect the number of actual clusters in a dataset, with a classification accuracy rate being considerably better than those of the corresponding k'-means algorithms.
A simple and fast method to determine the parameters for fuzzy c-means cluster validation

Veit Schwämmle,Ole N. Jensen

DOI: https://doi.org/10.48550/arXiv.1004.1307

2010-04-08

Abstract:Fuzzy c-means clustering is widely used to identify cluster structures in high-dimensional data sets, such as those obtained in DNA microarray and quantitative proteomics experiments. One of its main limitations is the lack of a computationally fast method to determine the two parameters fuzzifier and cluster number. Wrong parameter values may either lead to the inclusion of purely random fluctuations in the results or ignore potentially important data. The optimal solution has parameter values for which the clustering does not yield any results for a purely random data set but which detects cluster formation with maximum resolution on the edge of randomness. Estimation of the optimal parameter values is achieved by evaluation of the results of the clustering procedure applied to randomized data sets. In this case, the optimal value of the fuzzifier follows common rules that depend only on the main properties of the data set. Taking the dimension of the set and the number of objects as input values instead of evaluating the entire data set allows us to propose a functional relationship determining its value directly. This result speaks strongly against setting the fuzzifier equal to 2 as typically done in many previous studies. Validation indices are generally used for the estimation of the optimal number of clusters. A comparison shows that the minimum distance between the centroids provides results that are at least equivalent or better than those obtained by other computationally more expensive indices.

Quantitative Methods,Genomics
Cross-Validation Approach to Evaluate Clustering Algorithms: An Experimental Study Using Multi-Label Datasets

Adane Nega Tarekegn,Krzysztof Michalak,Mario Giacobini

DOI: https://doi.org/10.1007/s42979-020-00283-z

2020-08-11

SN Computer Science

Abstract:Clustering validation is one of the most important and challenging parts of clustering analysis, as there is no ground truth knowledge to compare the results with. Up till now, the evaluation methods for clustering algorithms have been used for determining the optimal number of clusters in the data, assessing the quality of clustering results through various validity criteria, comparison of results with other clustering schemes, etc. It is also often practically important to build a model on a large amount of training data and then apply the model repeatedly to smaller amounts of new data. This is similar to assigning new data points to existing clusters which are constructed on the training set. However, very little practical guidance is available to measure the prediction strength of the constructed model to predict cluster labels for new samples. In this study, we proposed an extension of the cross-validation procedure to evaluate the quality of the clustering model in predicting cluster membership for new data points. The performance score was measured in terms of the root mean squared error based on the information from multiple labels of the training and testing samples. The principal component analysis (PCA) followed by k-means clustering algorithm was used to evaluate the proposed method. The clustering model was tested using three benchmark multi-label datasets and has shown promising results with overall RMSE of less than 0.075 and MAPE of less than 12.5% in three datasets.
Comparison of Spectral Clustering, K-clustering and Hierarchical Clustering on E-Nose Datasets: Application to the Recognition of Material Freshness, Adulteration Levels and Pretreatment Approaches for Tomato Juices

Xuezhen Hong,Jun Wang,Guande Qi

DOI: https://doi.org/10.1016/j.chemolab.2014.01.017

IF: 4.175

2014-01-01

Chemometrics and Intelligent Laboratory Systems

Abstract:Various clustering algorithms have been developed since conventional hierarchical cluster analysis (HCA) and partitioning clustering algorithms have their own limitations and scopes of applications. However, in the area of e-nose where clustering is applied, the conventional algorithms (mostly HCA) still play a dominant role. In addition, comparison among different clustering methods or validation of clustering results was seldom mentioned. In this paper, we present a state-of-the-art clustering method – spectral clustering – and compare it with six conventional clustering methods: K-clustering (ISODATA, FCM and k-means) and HCA (single linkage, complete linkage and Ward's). Three external validation criteria – mutual information criteria (MI), precision and rand index (RI) – were used to evaluate clustering performances on three independent e-nose datasets. The spectral clustering outperforms with statistical significance (alpha=0.05) the performance of other methods, and the single linkage presents the worst (unacceptable) clustering result. In addition, the proposed approach – cluster validation criteria in combination with majority voting – in a way makes clustering a semi-supervised classification technique. Using this approach it is possible to compare clustering based semi-supervised methods with classification methods to find which method is better for discrimination of a certain e-nose dataset.
On the Efficiency of K-Means Clustering: Evaluation, Optimization, and Algorithm Selection

Sheng Wang,Yuan Sun,Zhifeng Bao

DOI: https://doi.org/10.48550/arXiv.2010.06654

2020-10-13

Databases

Abstract:This paper presents a thorough evaluation of the existing methods that accelerate Lloyd's algorithm for fast k-means clustering. To do so, we analyze the pruning mechanisms of existing methods, and summarize their common pipeline into a unified evaluation framework UniK. UniK embraces a class of well-known methods and enables a fine-grained performance breakdown. Within UniK, we thoroughly evaluate the pros and cons of existing methods using multiple performance metrics on a number of datasets. Furthermore, we derive an optimized algorithm over UniK, which effectively hybridizes multiple existing methods for more aggressive pruning. To take this further, we investigate whether the most efficient method for a given clustering task can be automatically selected by machine learning, to benefit practitioners and researchers.
A Novel Effective Distance Measure and a Relevant Algorithm for Optimizing the Initial Cluster Centroids of K-means

Yang Liu,Shuaifeng Ma,Xinxin Du

DOI: https://doi.org/10.1109/access.2020.3044069

IF: 3.9

2021-01-01

IEEE Access

Abstract:The traditional K-means algorithm is very sensitive to the selection of the initial clustering point and the calculation of the distance measure, which is likely to result in the convergence of only partly optimal solutions. An improved k-means algorithm is proposed to solve the problem of unbalanced clustering effect caused by the fact that the first initial clustering centre falls in the non-dense region of the boundary in the initial clustering centre optimisation process. An improved k-means algorithm for initial clustering centres is proposed, namely, the optimal matching algorithm for K-means clustering, and related experimental analysis of the algorithm is carried out. The improved algorithm first selects the initial points of the traditional K-means clustering algorithm and analyses the clustering results. Then, the initial clustering centre selection and distance determination were tested and the clustering effect was evaluated by introducing the contour coefficient. Experiments on both artificial data sets and UCI data sets show that the algorithm can achieve better clustering results. The experimental results indicate that the improved algorithm has a much higher clustering quality than the traditional K-means algorithm and other improved algorithms.

computer science, information systems,telecommunications,engineering, electrical & electronic
Deep Clustering Evaluation: How to Validate Internal Clustering Validation Measures

Zeya Wang,Chenglong Ye

2024-03-22

Abstract:Deep clustering, a method for partitioning complex, high-dimensional data using deep neural networks, presents unique evaluation challenges. Traditional clustering validation measures, designed for low-dimensional spaces, are problematic for deep clustering, which involves projecting data into lower-dimensional embeddings before partitioning. Two key issues are identified: 1) the curse of dimensionality when applying these measures to raw data, and 2) the unreliable comparison of clustering results across different embedding spaces stemming from variations in training procedures and parameter settings in different clustering models. This paper addresses these challenges in evaluating clustering quality in deep learning. We present a theoretical framework to highlight ineffectiveness arising from using internal validation measures on raw and embedded data and propose a systematic approach to applying clustering validity indices in deep clustering contexts. Experiments show that this framework aligns better with external validation measures, effectively reducing the misguidance from the improper use of clustering validity indices in deep learning.

Machine Learning
Stable Initialization Scheme for K-means Clustering

Junling Xu,Baowen Xu,Weifeng Zhang,Wei Zhang,Jun Hou

DOI: https://doi.org/10.1007/s11859-009-0106-z

2009-01-01

Wuhan University Journal of Natural Sciences

Abstract:Though K-means is very popular for general clustering, its performance which generally converges to numerous local minima depends highly on initial cluster centers. In this paper a novel initialization scheme to select initial cluster centers for K-means clustering is proposed. This algorithm is based on reverse nearest neighbor (RNN) search which retrieves all points in a given data set whose nearest neighbor is a given query point. The initial cluster centers computed using this methodology are found to be very close to the desired cluster centers for iterative clustering algorithms. This procedure is applicable to clustering algorithms for continuous data. The application of proposed algorithm to K-means clustering algorithm is demonstrated. Experiment is carried out on several popular datasets and the results show the advantages of the proposed method.
Centerless Clustering: An Efficient Variant of K-means Based on K-NN Graph

Shenfei Pei,Huimin Chen,Feiping Nie,Rong Wang,Xuelong Li

DOI: https://doi.org/10.1109/tpami.2022.3150981

IF: 23.6

2022-01-01

IEEE Transactions on Pattern Analysis and Machine Intelligence

Abstract:Although lots of clustering models have been proposed recently, k-means and the family of spectral clustering methods are both still drawing a lot of attention due to their simplicity and efficacy. We first reviewed the unified framework of k-means and graph cut models, and then proposed a clustering method called k-sums where a k-nearest neighbor (k-NN) graph is adopted. The main idea of k-sums is to minimize directly the sum of the distances between points in the same cluster. To deal with the situation where the graph is unavailable, we proposed k-sums-x that takes features as input. The computational and memory overhead of k-sums are both O(nk), indicating that it can scale linearly w.r.t. the number of objects to group. Moreover, the computational and memory costs are irrelevant to the product of the number of points and clusters. The computational and memory complexity of k-sums-x are both linear w.r.t. the number of points. To validate the advantage of k-sums and k-sums-x on facial datasets, extensive experiments have been conducted on 10 synthetic datasets and 17 benchmark datasets. While having a low time complexity, the performance of k-sums is comparable with several state-of-the-art clustering methods.

computer science, artificial intelligence,engineering, electrical & electronic
An Initial Seed Selection Algorithm for K-means Clustering of Georeferenced Data to Improve Replicability of Cluster Assignments for Mapping Application

Fouad Khan

DOI: https://doi.org/10.1016/j.asoc.2012.07.021

2016-04-18

Abstract:K-means is one of the most widely used clustering algorithms in various disciplines, especially for large datasets. However, the method is known to be highly sensitive to initial seed selection of cluster centers. K-means++ has been proposed to overcome this problem and has been shown to have better accuracy and computational efficiency than k-means. In many clustering problems though, such as when classifying georeferenced data for mapping applications, standardization of clustering methodology, specifically, the ability to arrive at the same cluster assignment for every run of the method, i.e., replicability of the methodology, may be of greater significance than any perceived measure of accuracy, especially when the solution is known to be non-unique, as in the case of k-means clustering. Here we propose a simple initial seed selection algorithm for k-means clustering along one attribute that draws initial cluster boundaries along the 'deepest valleys' or greatest gaps in the dataset. Thus, it incorporates a measure to maximize distance between consecutive cluster centers which augments the conventional k-means optimization for minimum distance between cluster center and cluster members. Unlike existing initialization methods, no additional parameters or degrees of freedom are introduced to the clustering algorithm. This improves the replicability of cluster assignments by as much as 100% over k-means and k-means++, virtually reducing the variance over different runs to zero, without introducing any additional parameters to the clustering process. Further, the proposed method is more computationally efficient than k-means++ and in some cases, more accurate.

Machine Learning,Data Structures and Algorithms
From A-to-Z Review of Clustering Validation Indices

Bryar A. Hassan,Noor Bahjat Tayfor,Alla A. Hassan,Aram M. Ahmed,Tarik A. Rashid,Naz N. Abdalla

DOI: https://doi.org/10.1016/j.neucom.2024.128198

2024-07-18

Abstract:Data clustering involves identifying latent similarities within a dataset and organizing them into clusters or groups. The outcomes of various clustering algorithms differ as they are susceptible to the intrinsic characteristics of the original dataset, including noise and dimensionality. The effectiveness of such clustering procedures directly impacts the homogeneity of clusters, underscoring the significance of evaluating algorithmic outcomes. Consequently, the assessment of clustering quality presents a significant and complex endeavor. A pivotal aspect affecting clustering validation is the cluster validity metric, which aids in determining the optimal number of clusters. The main goal of this study is to comprehensively review and explain the mathematical operation of internal and external cluster validity indices, but not all, to categorize these indices and to brainstorm suggestions for future advancement of clustering validation research. In addition, we review and evaluate the performance of internal and external clustering validation indices on the most common clustering algorithms, such as the evolutionary clustering algorithm star (ECA*). Finally, we suggest a classification framework for examining the functionality of both internal and external clustering validation measures regarding their ideal values, user-friendliness, responsiveness to input data, and appropriateness across various fields. This classification aids researchers in selecting the appropriate clustering validation measure to suit their specific requirements.

Machine Learning
