Consensus Clustering on Big Data

Hongfu Liu, Gong Cheng, Junjie Wu
DOI: https://doi.org/10.1109/icsssm.2015.7170344
2015-01-01
Abstract: Big data clustering has become a hot topic with the rise of user-generated content. Although many clustering algorithms have been proposed and cloud computing resources are widely available, obtaining a good-quality partition efficiently remains an open problem. In this paper, we use consensus clustering to handle big data clustering. Specifically, we adopt a divide-and-conquer strategy that splits the full data set into small subsets, generates basic partitions from these subsets, and then applies consensus clustering to obtain the final result. For the consensus step, we apply K-means-based Consensus Clustering (KCC), which transforms the consensus clustering problem into an equivalent K-means-like optimization problem for high efficiency. Further, the sampling is extended to two-sided sampling, which randomly samples instances and features simultaneously. Extensive experiments on eight real-world data sets demonstrate the advantages of KCC over some widely used methods. More importantly, its ability to handle incomplete basic partitions and its natural suitability for distributed computing make KCC a promising candidate for big data clustering.
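The pipeline described in the abstract (sample subsets, cluster each subset into a basic partition, then combine the partitions with a K-means-like step) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names (`kmeans`, `basic_partition`, `consensus`), the sampling fractions, and the one-hot encoding of basic partitions are all assumptions made for illustration; the abstract does not specify the utility function or how KCC treats instances that are missing from a basic partition, so here a missing instance is simply encoded as an all-zero block.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: an empty cluster keeps its previous center.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def basic_partition(data, k, rng, row_frac=0.5, col_frac=0.5):
    """Cluster a random subset of instances on a random subset of features.

    Instances that were not sampled get the label None, so the resulting
    basic partition is incomplete (hypothetical fractions, for illustration).
    """
    n, d = len(data), len(data[0])
    rows = rng.sample(range(n), max(k, int(n * row_frac)))
    cols = rng.sample(range(d), max(1, int(d * col_frac)))
    subset = [[data[i][j] for j in cols] for i in rows]
    partition = [None] * n
    for i, lab in zip(rows, kmeans(subset, k, rng)):
        partition[i] = lab
    return partition


def consensus(partitions, k, rng):
    """Combine basic partitions by running k-means on their one-hot encoding.

    Assumption: a missing label becomes an all-zero block, which is a
    simplification and not necessarily how KCC handles incompleteness.
    """
    n = len(partitions[0])
    binary = []
    for i in range(n):
        row = []
        for part in partitions:
            block = [0.0] * k
            if part[i] is not None:
                block[part[i]] = 1.0
            row.extend(block)
        binary.append(row)
    return kmeans(binary, k, rng)
```

A typical call would build several basic partitions, for example `parts = [basic_partition(data, 2, rng) for _ in range(8)]` with `rng = random.Random(0)`, and then pass them to `consensus(parts, 2, rng)`. Because each call to `basic_partition` is independent of the others, the basic partitions could in principle be generated on separate workers, which is the property the abstract points to when it mentions distributed computing.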