Scalable bootstrap clustering for massive data

Haocheng Wang, Fuzhen Zhuang, Xiang Ao, Qing He, Zhongzhi Shi
DOI: https://doi.org/10.1109/SNPD.2014.6888693
2014-01-01
Abstract: The bootstrap provides a simple and powerful means of improving the accuracy of clustering. However, for today's increasingly large datasets, the computation of bootstrap-based quantities can be prohibitively demanding. In this paper we introduce Bag of Little Bootstraps Clustering (BLBC), a new procedure that uses the Bag of Little Bootstraps technique to obtain a robust, computationally efficient means of clustering massive data. Moreover, BLBC is well suited to implementation on the modern parallel and distributed computing architectures that are often used to process large datasets. We empirically investigate the performance characteristics of BLBC and compare them with those of existing methods through experiments on simulated and real data. The results show that BLBC has a significantly more favorable computational profile than bootstrap-based clustering while maintaining good statistical correctness.