A Highly Scalable Clustering Scheme Using Boundary Information

Qiuhui Tong, Xiu Li, Bo Yuan
DOI: https://doi.org/10.1016/j.patrec.2017.01.016
IF: 4.757
2017-01-01
Pattern Recognition Letters
Abstract: Many advanced clustering techniques are effective in dealing with datasets in complicated situations. However, on large datasets, which are increasingly common in the era of big data, the running time of most existing techniques can quickly become intolerable. To tackle this challenge, we propose Scalable Clustering Using Boundary Information (SCUBI), a highly flexible and scalable clustering scheme. The idea of SCUBI is to first identify the boundary points of the original dataset and then group those boundary points into suitable clusters using an existing clustering technique. Finally, each remaining point is assigned to the same cluster as its nearest boundary point. To demonstrate the effectiveness and scalability of SCUBI, we plug the well-known DBSCAN algorithm into SCUBI. Comprehensive experiments are conducted on datasets with up to two million data points to compare the clustering results and time efficiency of DBSCAN and SCUBI-DBSCAN. Experimental results show that our method obtains clustering results almost identical to those of standard DBSCAN while achieving orders-of-magnitude speedup, especially on large datasets, which confirms the scalability of SCUBI. Experiments are also performed with other clustering algorithms of high time complexity to verify the flexibility of SCUBI. © 2017 Elsevier B.V. All rights reserved.
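The three steps named in the abstract (find boundary points, cluster only those, assign every other point to the cluster of its nearest boundary point) can be illustrated with a minimal pure-Python sketch. The abstract does not say how boundary points are detected, so `boundary_scores` below is a hypothetical heuristic chosen for illustration and is not the paper's method. Likewise, `cluster_by_eps` is a simple stand-in for the base clusterer rather than a full DBSCAN, and the names `scubi_sketch`, `k`, `threshold`, and `eps` are assumptions. The sketch uses brute-force distance computations and makes no attempt at the scalability the paper reports.

```python
import math


def boundary_scores(points, k):
    """Hypothetical boundary score (an assumption, not the paper's detector):
    length of the mean offset to the k nearest neighbours, divided by the mean
    neighbour distance. Interior points have neighbours on all sides, so their
    offsets tend to cancel and the score stays near zero."""
    scores = []
    for i, p in enumerate(points):
        nbrs = sorted((q for j, q in enumerate(points) if j != i),
                      key=lambda q: math.dist(p, q))[:k]
        if not nbrs:
            scores.append(0.0)
            continue
        m = len(nbrs)
        mean_offset = [sum(q[d] - p[d] for q in nbrs) / m for d in range(len(p))]
        mean_dist = sum(math.dist(p, q) for q in nbrs) / m
        scores.append(math.hypot(*mean_offset) / mean_dist if mean_dist > 0 else 0.0)
    return scores


def cluster_by_eps(points, eps):
    """Stand-in base clusterer: connected components of the graph that links
    points at distance <= eps (DBSCAN behaves this way when every point
    counts as a core point)."""
    labels = [-1] * len(points)
    current = 0
    for start in range(len(points)):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(points)):
                if labels[j] == -1 and math.dist(points[i], points[j]) <= eps:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


def scubi_sketch(points, k=4, threshold=0.3, eps=1.5):
    # Step 1: pick boundary points (hypothetical score and threshold).
    scores = boundary_scores(points, k)
    boundary = [i for i, s in enumerate(scores) if s > threshold]
    if not boundary:  # fall back to clustering everything
        boundary = list(range(len(points)))
    # Step 2: cluster only the boundary points with the base clusterer.
    b_labels = cluster_by_eps([points[i] for i in boundary], eps)
    labels = [None] * len(points)
    for i, lab in zip(boundary, b_labels):
        labels[i] = lab
    # Step 3: every remaining point inherits the label of its nearest boundary point.
    for i, p in enumerate(points):
        if labels[i] is None:
            nearest = min(range(len(boundary)),
                          key=lambda b: math.dist(p, points[boundary[b]]))
            labels[i] = b_labels[nearest]
    return labels


# Two well-separated 5x5 grids of points as toy data.
blob_a = [(x, y) for x in range(5) for y in range(5)]
blob_b = [(x + 20, y) for x in range(5) for y in range(5)]
labels = scubi_sketch(blob_a + blob_b)
```

On this toy input only the 16 edge points of each grid pass the threshold, so the base clusterer sees 32 of the 50 points and the 18 interior points are labelled by nearest-boundary lookup. The potential saving comes from running the expensive clustering step on a subset of the data. How large that saving is in practice depends on the boundary detector and the dataset, which this sketch does not model.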