Abstract: Scalability of clustering algorithms is a critical issue in real-world clustering applications. Data sampling and parallelization are the two most common ways to address it. Although both are widely used across clustering algorithms, they suffer from major drawbacks. Data sampling often leads to biased solutions, because a sample may fail to capture the distribution of the entire data set accurately. The performance of parallelization, on the other hand, depends heavily on the original clustering routines, which are not parallel algorithms by nature, so customizing each algorithm to run in parallel may hurt clustering performance. To alleviate these problems, we propose a general two-step framework for scalable clustering, where the first step obtains a skeleton structure of the data and the second step obtains the final clustering. Concretely, the data are first partitioned and placed across a two-dimensional grid, and local clustering algorithms are then iteratively applied to the cells of the grid, each providing a set of intermediate core points. These core points represent the dense or central regions of the data, and can be centers, modes, and means for centroid-based, density-based, and probability-based clustering, respectively. Finally, the core points are used to obtain the final clustering. The proposed framework offers several benefits: (1) the local clustering on partitioned cells is conducted in parallel and can therefore yield a high speed-up; (2) clustering on the representative core points can be more robust; (3) the framework can easily be applied to other basic clustering methods and thus provides a general scalable solution. We provide a theoretical analysis, and extensive experimental results demonstrate the effectiveness and efficiency of the proposed framework.
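To make the two-step idea concrete, the following is a minimal sketch for the centroid-based case only, not the authors' implementation. It rests on several assumptions that the abstract does not specify: points are assigned to grid cells by shuffling and dealing them out, each cell is clustered once (the iterative application over cells is omitted), plain Lloyd's k-means serves as both the local and the final routine, and the names `two_step_clustering`, `rows`, `cols`, and `local_k` are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns at most k centers as tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, min(k, len(points)))
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers


def two_step_clustering(points, k, rows=2, cols=2, local_k=4, seed=0):
    """Step 1: local clustering per grid cell. Step 2: cluster the core points."""
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)

    # Assumption: cells of a rows x cols grid are filled by dealing out shuffled points.
    n_cells = rows * cols
    cells = [shuffled[i::n_cells] for i in range(n_cells)]
    cells = [cell for cell in cells if cell]

    # Step 1: cells are independent, so local clustering can run concurrently.
    with ThreadPoolExecutor() as pool:
        local_centers = list(pool.map(lambda cell: kmeans(cell, local_k, seed=seed), cells))
    core_points = [c for centers in local_centers for c in centers]

    # Step 2: the final clustering runs only on the small set of core points.
    final_centers = kmeans(core_points, k, seed=seed)
    labels = [
        min(range(len(final_centers)), key=lambda i: sq_dist(p, final_centers[i]))
        for p in points
    ]
    return final_centers, labels, core_points
```

Calling `two_step_clustering(points, k=2)` on a list of coordinate tuples returns the final centers, a label per input point, and the intermediate core points. Threads are used here only to show that the cells are independent; in CPython an actual speed-up would likely need processes or distributed workers.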
