Parallel Boosted Clustering
Yazhou Ren, Uday Kamath, Carlotta Domeniconi, Zenglin Xu
DOI: https://doi.org/10.1016/j.neucom.2019.04.003
IF: 6
2019-01-01
Neurocomputing
Abstract: Scalability of clustering algorithms is a critical issue in real-world clustering applications. Data sampling and parallelization are two common ways to address it. Despite their wide use in many clustering algorithms, both suffer from major drawbacks. Data sampling often leads to biased solutions because a sample may fail to capture the distribution of the entire data set accurately. The performance of parallelization, in turn, depends heavily on the original clustering routines, which are not parallel by nature, so customizing each algorithm to run in parallel may hurt clustering performance. To alleviate these problems, we propose a general two-step framework for scalable clustering: the first step obtains a skeleton structure of the data, and the second step obtains the final clustering. Concretely, the data are first partitioned and distributed across a two-dimensional grid, and local clustering algorithms are then applied iteratively on the cells of the grid, each providing a set of intermediate core points. These core points represent the dense or central regions of the data; they can be centers, modes, and means for centroid-based, density-based, and probability-based clustering, respectively. Finally, the core points are used to obtain the final clustering. The proposed framework offers several benefits: (1) the local clustering on the partitioned cells is conducted in parallel and can therefore yield high speed-up; (2) clustering on the representative core points can be more robust; (3) the framework can be easily applied to other basic clustering methods and thus provides a general scalable solution. Theoretical analysis is provided, and extensive experimental results demonstrate the effectiveness and efficiency of the proposed framework.
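The two-step idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: the data are partitioned at random across the grid cells, plain k-means (with a farthest-first initialization) stands in for the local clustering algorithm, each cell is clustered once rather than iteratively, and threads stand in for parallel workers. All names (`two_step_clustering`, `grid_shape`, `local_k`, `kmeans`, `farthest_first`, `nearest`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def nearest(points, centers):
    """Index of the nearest center for every point."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


def farthest_first(points, k, rng):
    """Pick k initial centers: one random point, then repeatedly the farthest point."""
    centers = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        dists = np.linalg.norm(
            points[:, None, :] - np.array(centers)[None, :, :], axis=2
        ).min(axis=1)
        centers.append(points[dists.argmax()])
    return np.array(centers)


def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means (assumed local algorithm); returns the k centers."""
    rng = np.random.default_rng(seed)
    centers = farthest_first(points, k, rng)
    for _ in range(n_iter):
        labels = nearest(points, centers)
        # Recompute each center; keep the old one if its cluster is empty.
        centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return centers


def two_step_clustering(data, k, grid_shape=(3, 3), local_k=None, seed=0):
    """Sketch: local clustering per grid cell, then clustering of the core points."""
    rng = np.random.default_rng(seed)
    n_cells = grid_shape[0] * grid_shape[1]
    local_k = local_k or 2 * k  # assumption: more local centers than final clusters

    # Partition the data at random across the cells of the grid (assumption).
    cells = np.array_split(rng.permutation(len(data)), n_cells)

    # Step 1: cluster every cell independently and in parallel;
    # the local centers play the role of the intermediate core points.
    def local_cores(indices):
        return kmeans(data[indices], local_k, seed=seed)

    with ThreadPoolExecutor() as pool:
        core_points = np.vstack(list(pool.map(local_cores, cells)))

    # Step 2: cluster the core points, then label each original point
    # by its nearest final center.
    final_centers = kmeans(core_points, k, seed=seed)
    labels = nearest(data, final_centers)
    return core_points, final_centers, labels
```

Calling `two_step_clustering(data, k=3)` on an `(n, d)` NumPy array returns the pooled core points, the final centers, and one label per input row. Threads are used only to keep the sketch short; a real speed-up would more likely need separate processes or distributed workers, and the iterative scheme applied to the cells in the paper is not reproduced here.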