Abstract: K-means is a popular partition-based clustering algorithm. However, it has several shortcomings: the user must specify the number of clusters in advance, the result is sensitive to initial conditions, and the algorithm is easily trapped in a local optimum. The global K-means algorithm proposed by Likas et al. is an incremental approach to clustering that dynamically adds one cluster center at a time through a deterministic global search procedure consisting of N runs of the K-means algorithm from suitable initial positions, where N is the size of the data set. It does not depend on any initial conditions or parameters and considerably outperforms the K-means algorithm, but it carries a heavy computational load. In this paper, we propose an efficient version of the global K-means clustering algorithm. Its outstanding feature is its execution time: it runs faster than the available global K-means algorithms. We modify the way the optimal initial center of the next cluster is found by defining a new function that serves as the criterion for selecting the best candidate center. The idea was inspired by Park and Jun's K-medoids clustering algorithm. We choose the best candidate initial center for the next cluster by evaluating this function, which uses information about the natural distribution of the data, so that the chosen initial center is the point that not only has the highest density but also lies far from the existing cluster centers. Experiments on fourteen well-known data sets from the UCI machine learning repository show that our new algorithm significantly reduces computation time without degrading the performance of the global K-means algorithms.
Further experiments demonstrate that our improved global K-means algorithm greatly outperforms the global K-means algorithm and is suitable for clustering large data sets. Experiments on a colon cancer tissue data set show that our new global K-means algorithm can efficiently handle high-dimensional gene expression data. Experimental results on synthetic data sets with different proportions of noisy data points show that our global K-means algorithm effectively avoids the influence of noisy data on the clustering results.
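
The contrast the abstract draws can be illustrated with a minimal NumPy sketch. `global_kmeans` follows the incremental scheme described above (for each new center, one K-means run per data point). `pick_candidate` is only a hypothetical stand-in for the paper's selection function, which the abstract does not define: it assumes a Park-and-Jun-style density term `v` divided into the distance to the nearest existing center. All function names here are illustrative assumptions, not the authors' code.

```python
import numpy as np

def sq_dists(A, B):
    """Pairwise squared Euclidean distances between rows of A and rows of B."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)

def kmeans(X, centers, n_iter=100):
    """Plain Lloyd iterations from the given initial centers; returns (centers, SSE)."""
    centers = centers.copy()
    for _ in range(n_iter):
        labels = sq_dists(X, centers).argmin(axis=1)
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, sq_dists(X, centers).min(axis=1).sum()

def global_kmeans(X, K):
    """Incremental scheme: add one center at a time, trying every point as its start."""
    centers = X.mean(axis=0, keepdims=True)  # the k = 1 solution is the data mean
    for _ in range(2, K + 1):
        best = None
        for x in X:  # N runs of K-means per added center
            c, sse = kmeans(X, np.vstack([centers, x]))
            if best is None or sse < best[1]:
                best = (c, sse)
        centers = best[0]
    return centers

def pick_candidate(X, centers):
    """HYPOTHETICAL criterion (assumption): prefer dense points far from current centers."""
    D = np.sqrt(sq_dists(X, X))
    # Park-and-Jun-style value v_j = sum_i d_ij / sum_l d_il; small v means a dense region.
    v = (D / D.sum(axis=1, keepdims=True)).sum(axis=0)
    far = np.sqrt(sq_dists(X, centers)).min(axis=1)
    return X[(far / v).argmax()]

def fast_global_kmeans(X, K):
    """One K-means run per added center instead of N, using the assumed criterion."""
    centers = X.mean(axis=0, keepdims=True)
    for _ in range(2, K + 1):
        centers, _ = kmeans(X, np.vstack([centers, pick_candidate(X, centers)]))
    return centers
```

Under these assumptions, the exhaustive version performs N K-means runs for each added center, while the candidate-based version performs one, at the cost of computing the pairwise distance matrix. That matrix needs O(N²) memory, so a practical implementation for large data sets would need a cheaper density estimate.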

An Efficient High Dimensional Cluster Method and Its Application in Global Climate Sets.

Hierarchical Spatial Clustering in Multi-Hop Wireless Sensor Networks

A Grid-Based Density Peaks Clustering Algorithm

Application of a grid-based spatial clustering method on regional division

DBSTC: an Effective Method for Discovering Cluster Features with Different Spatiotemporal Densities

Enhanced Locality Sensitive Clustering in High Dimensional Space

Subspace Clustering by Directly Solving Discriminative K-means

HiSpatialCluster: A Novel High-Performance Software Tool for Clustering Massive Spatial Points.

Novel clustering framework using k-means (S k-means) for mining spatiotemporal structured climate data

An Efficient Global K-means Clustering Algorithm.

SSCG: Spatial Subcluster Clustering Method by Grid-Connection.

Interactive Local Clustering Operations for High Dimensional Data in Parallel Coordinates

A Study of Performance Optimization Method for Massive Spatio-temporal Data Based on Spatio-temporal Partition Clustering

A Novel Spatio-temporal Clustering Approach by Process Similarity.

Discovering the Skyline of Subspace Clusters in High-Dimensional Data

Adaptive Spatial Clustering for Multi-Dimensional Data and Its Cloud Model Representation

Deep Spatiotemporal Clustering: A Temporal Clustering Approach for Multi-dimensional Climate Data

A spatial data partition algorithm based on statistical cluster

Spatiotemporal Cluster Analysis of Gridded Temperature Data -- A Comparison Between K-means and MiSTIC

SCADDA: Spatio-temporal cluster analysis with density-based distance augmentation and its application to fire carbon emissions

Spatial-Temporal Distribution Analysis of Industrial Heat Sources in the US with Geocoded, Tree-Based, Large-Scale Clustering