HCDC: A novel hierarchical clustering algorithm based on density-distance cores for data sets with varying density
Qi-Fen Yang, Wan-Yi Gao, Gang Han, Zi-Yang Li, Meng Tian, Shu-Hua Zhu, Yu-hui Deng
DOI: https://doi.org/10.1016/j.is.2022.102159
IF: 3.18
2022-12-19
Information Systems
Abstract: Cluster analysis is a crucial data mining technology widely used in image segmentation, language processing, and pattern recognition. Most existing clustering algorithms cannot identify complex shapes in manifold data sets or in data sets with varying density, especially when clusters with significantly different densities lie close to each other. Hierarchical clustering algorithms can identify clusters of arbitrary shape. However, they cannot cluster data sets with significant density variations, and they have a high time cost. In this paper, we therefore propose a novel hierarchical clustering algorithm based on density-distance cores, called HCDC. It first selects a density-distance representative for each point from the set of candidate representative points. It then selects density-distance cores from all density-distance representatives. Finally, it replaces the whole data set with the density-distance cores and applies hierarchical clustering to them using a new distance between cores. To keep noise points from affecting the search for density-distance cores, we also propose a noise point detection method and verify its feasibility. We compare the proposed algorithm with existing classical and recent algorithms on synthetic and real data sets. Experiments show that our algorithm clusters better than existing algorithms on complex-shaped data sets and on data sets with different densities. On data sets in which sparse and dense clusters lie close to each other, the ARI score of HCDC is more than 0.1 higher than that of LDP-MST. In particular, on the grid data set, HCDC's ARI score is 0.997 higher than that of LDP-MST. On DS3 and DS8, HCDC's ARI score is more than 0.14 higher than that of the second-best algorithm, RNN-DBSCAN. Moreover, on the zoo data set, HCDC's ARI score is 0.15 and 0.6 higher than those of RNN-DBSCAN and LDP-MST, respectively. On the Olivetti face data set, HCDC is the only algorithm with an NMI score above 0.9 on the photo1 and photo2 data sets.
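The three-stage pipeline described in the abstract (representatives, then cores, then hierarchical clustering on the cores only) can be illustrated with a minimal sketch. This is not the authors' HCDC: the abstract does not define the density estimate, the candidate set, the noise detection step, or the new inter-core distance, so everything below is an assumption made for illustration. The function name `cluster_via_cores`, the k-nearest-neighbour density, the "densest neighbour" rule for picking representatives, and the use of plain Euclidean single linkage between cores are all hypothetical stand-ins.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import cdist


def cluster_via_cores(points, k=5, n_clusters=2):
    """Illustrative core-based hierarchical clustering (NOT the paper's HCDC)."""
    n = len(points)
    dists = cdist(points, points)
    # Indices of each point's nearest neighbours; column 0 is normally the point itself.
    order = np.argsort(dists, axis=1)[:, : k + 1]
    # Assumed density estimate: inverse of the mean distance to the k neighbours.
    knn_dists = np.take_along_axis(dists, order, axis=1)[:, 1:]
    density = 1.0 / (knn_dists.mean(axis=1) + 1e-12)
    # Assumed representative rule: the densest point in the neighbourhood.
    rep = order[np.arange(n), np.argmax(density[order], axis=1)]
    # Follow representative chains (pointer jumping) until they stop changing;
    # the fixed points play the role of "cores" in this sketch.
    for _ in range(n):
        nxt = rep[rep]
        if np.array_equal(nxt, rep):
            break
        rep = nxt
    cores, core_of_point = np.unique(rep, return_inverse=True)
    if len(cores) < 2:
        return np.zeros(n, dtype=int)
    # Hierarchical clustering on the cores only. Euclidean single linkage is a
    # placeholder for the paper's new distance, which the abstract does not define.
    tree = linkage(points[cores], method="single")
    core_labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    # Every point inherits the label of the core its chain ends at.
    return core_labels[core_of_point]
```

Because the linkage step runs on the cores rather than on all points, its cost depends on the number of cores, which is the motivation the abstract gives for replacing the data set with cores. The sketch still builds a full pairwise distance matrix for the neighbour search and omits noise handling entirely.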
computer science, information systems