A Sampling-Based Density Peaks Clustering Algorithm for Large-Scale Data

Shifei Ding, Chao Li, Xiao Xu, Ling Ding, Jian Zhang, Lili Guo, Tianhao Shi
DOI: https://doi.org/10.1016/j.patcog.2022.109238
IF: 8
2022-12-08
Pattern Recognition
Abstract: With the rapid development of information technology, massive amounts of data are generated. How to discover useful information to support decision-making has become one of the focuses of scholars' research, and clustering is considered one of the main means of dealing with large-scale data. Density peaks clustering (DPC) is an effective density-based clustering algorithm that is widely applied in numerous fields because of its satisfactory performance. However, the computational complexity of DPC is O(N^2), where N is the number of data points, which makes it poorly suited to large-scale data. To solve this issue, a sampling-based density peaks clustering algorithm for large-scale data (SDPC) is proposed. Firstly, a sampling method is used to reduce the number of distance calculations. Secondly, approximate representatives are identified by an improved TI search strategy, which further accelerates the clustering process. Afterwards, the approximate representatives are clustered by DPC. Finally, each remaining point is allocated to the same cluster as its nearest representative. Experimental results on both synthetic and real-world datasets show that SDPC is more efficient than DPC while maintaining the same level of clustering performance.
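The pipeline described in the abstract (sample, cluster the representatives with DPC, then assign every remaining point to its nearest representative) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: uniform random sampling stands in for the paper's sampling method, a brute-force nearest-representative search stands in for the improved TI search strategy, and the names `dpc`, `sdpc_sketch`, `dc`, and `n_reps` are hypothetical choices made for this example.

```python
import numpy as np

def dpc(points, n_clusters, dc):
    """Plain density peaks clustering; quadratic in the number of points."""
    m = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # Local density: neighbours within cutoff distance dc (self excluded).
    rho = (dist < dc).sum(axis=1) - 1
    order = np.argsort(-rho, kind="stable")  # densest first
    delta = np.zeros(m)
    parent = np.full(m, -1)
    delta[order[0]] = dist[order[0]].max()
    for rank in range(1, m):
        i = order[rank]
        denser = order[:rank]
        j = denser[np.argmin(dist[i, denser])]
        delta[i] = dist[i, j]  # distance to nearest denser point
        parent[i] = j
    gamma = rho * delta
    gamma[order[0]] = np.inf  # assumption: the densest point is always a center
    centers = np.argsort(-gamma, kind="stable")[:n_clusters]
    labels = np.full(m, -1)
    labels[centers] = np.arange(n_clusters)
    for i in order:  # parents are always labelled before their children
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels

def sdpc_sketch(data, n_clusters, dc, n_reps, seed=0, chunk=4096):
    """Cluster a sample with DPC, then label all points by nearest representative."""
    rng = np.random.default_rng(seed)
    # Assumption: uniform sampling replaces the paper's sampling method.
    reps = data[rng.choice(len(data), size=n_reps, replace=False)]
    rep_labels = dpc(reps, n_clusters, dc)
    labels = np.empty(len(data), dtype=int)
    # Assumption: brute-force search replaces the improved TI search strategy.
    for start in range(0, len(data), chunk):
        block = data[start:start + chunk]
        d = np.linalg.norm(block[:, None, :] - reps[None, :, :], axis=2)
        labels[start:start + chunk] = rep_labels[d.argmin(axis=1)]
    return labels
```

In this sketch the quadratic distance matrix is built only over the `n_reps` representatives, while the assignment step costs on the order of N times `n_reps` distance evaluations, which is where the saving over running DPC on all N points would come from.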
computer science, artificial intelligence; engineering, electrical & electronic
What problem does this paper attempt to address?