Abstract: Cluster analysis (Jain & Dubes, 1988) provides insight into data by dividing objects into groups (clusters) such that objects in a cluster are more similar to each other than to objects in other clusters. Cluster analysis has long played an important role in a wide variety of fields, such as psychology, bioinformatics, pattern recognition, information retrieval, machine learning, and data mining. Many clustering algorithms, such as K-means and the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), are well established. A recent research focus in cluster analysis is to understand the strengths and weaknesses of various clustering algorithms with respect to data factors. Indeed, researchers have identified several data characteristics that may strongly affect cluster analysis, including high dimensionality and sparseness, large size, noise, types of attributes and data sets, and scales of attributes (Tan, Steinbach, & Kumar, 2005). However, further investigation is needed to reveal whether and how data distributions affect the performance of clustering algorithms. Along this line, we study clustering algorithms by answering three questions:
1. What are the systematic differences between the distributions of the clusters produced by different clustering algorithms?
2. How does the distribution of the “true” cluster sizes affect the performance of clustering algorithms?
3. How should an appropriate clustering algorithm be chosen in practice?
The answers to these questions can guide a better understanding and use of clustering methods. This is noteworthy because (1) in theory, the strong relationships between clustering algorithms and cluster size distributions have seldom been recognized, and (2) in practice, choosing an appropriate clustering algorithm is still a challenging task, especially after the boom of algorithms in the data mining area. This chapter takes an initial step toward filling this void.
To this end, we select two widely used categories of clustering algorithms, K-means and Agglomerative Hierarchical Clustering (AHC), as representative algorithms for illustration. In the chapter, we first show that K-means tends to generate clusters with a relatively uniform distribution of cluster sizes. We then demonstrate that UPGMA, one of the robust AHC methods, behaves in the opposite way: it tends to generate clusters with high variation in cluster sizes. Indeed, the experimental results indicate that the variation of the resulting cluster sizes, measured by the Coefficient of Variation (CV), falls within specific intervals, roughly [0.3, 1.0] for K-means and [1.0, 2.5] for UPGMA. Finally, we compare K-means and UPGMA directly and propose some rules for choosing a clustering scheme from the data distribution point of view.
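As a rough illustration of the measure used above, the CV of a clustering's cluster sizes can be computed from its label assignments. The sketch below is minimal and makes one assumption the abstract does not state: that CV is the sample standard deviation (n - 1 denominator) of the cluster sizes divided by their mean. The function names `cluster_sizes` and `coefficient_of_variation` are hypothetical helpers, not part of the chapter.

```python
from collections import Counter
from statistics import mean, stdev


def cluster_sizes(labels):
    """Return the number of objects in each cluster, largest first."""
    return sorted(Counter(labels).values(), reverse=True)


def coefficient_of_variation(sizes):
    """CV of cluster sizes: sample standard deviation divided by the mean.

    Assumption: the sample (n - 1) standard deviation is used; the
    abstract does not specify which variant the chapter adopts.
    """
    if len(sizes) < 2:
        raise ValueError("CV needs at least two clusters")
    return stdev(sizes) / mean(sizes)


# Hypothetical label vectors, for illustration only (not data from the chapter).
balanced = [0] * 30 + [1] * 35 + [2] * 35   # near-uniform cluster sizes
skewed = [0] * 90 + [1] * 7 + [2] * 3       # one dominant cluster

cv_balanced = coefficient_of_variation(cluster_sizes(balanced))
cv_skewed = coefficient_of_variation(cluster_sizes(skewed))
print(f"balanced CV = {cv_balanced:.3f}, skewed CV = {cv_skewed:.3f}")
```

A CV near 0 means near-equal cluster sizes, and larger values mean more skewed sizes. Comparing the CV of a clustering result with the CV of the "true" class sizes is one way to see whether an algorithm has pushed the sizes toward uniformity or toward skew.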

An Analytical Study on Behavior of Clusters Using K Means, EM and K* Means Algorithm

K – Means Algorithm

Performance Analysis for Clustering Algorithms

Hybridization of Expectation-Maximization and K-Means Algorithms for Better Clustering Performance

Performance Evaluation of Simple K-Mean and Parallel K-Mean Clustering Algorithms: Big Data Business Process Management Concept

An Abnormal Behavior Clustering Algorithm Based on K-means.

Research and Application of Clustering Algorithm for Text Big Data

Data clustering with modified K-means algorithm

Canonical PSO Based k-Means Clustering Approach for Real Datasets

An Efficient K-Means Clustering Initialization Using Optimization Algorithm

Performance analysis of Kmeans with modified initial centroid selection algorithms and developed Kmeans9+ model

K-means Clustering Algorithms: A Comprehensive Review, Variants Analysis, and Advances in the Era of Big Data

Data Clustering: Integrating Different Distance Measures with Modified k-Means Algorithm

Performance evaluation of K-means clustering algorithm with various distance metrics

K*-Means: an Effective and Efficient K-Means Clustering Algorithm

A Data Distribution View of Clustering Algorithms

An Information-Theoretic Analysis of Hard and Soft Assignment Methods for Clustering

Performance Evaluation of Threshold-Based and k-means Clustering Algorithms Using Iris Dataset

The k-means Algorithm: A Comprehensive Survey and Performance Evaluation

Improvement Study and Application Based on K-Means Clustering Algorithm

Parametric entropy based Cluster Centriod Initialization for k-means clustering of various Image datasets