Abstract: Cluster analysis is a crucial data mining technique widely used in image segmentation, language processing, and pattern recognition. Most existing clustering algorithms cannot identify complex shapes in manifold data sets or in data sets with varying density, especially when clusters with significantly different densities lie close to each other. Hierarchical clustering algorithms can identify clusters of arbitrary shape, but they cannot cluster data sets with significant density variations and they have a high time cost. In this paper, we propose a novel hierarchical clustering algorithm based on density-distance cores, called HCDC. It first selects the density-distance representative points for each point from the set of candidate representative points. It then selects density-distance cores from all density-distance representatives. Finally, it replaces the whole data set with the density-distance cores and applies hierarchical clustering to them using a new distance between cores. To keep noise points from influencing the search for density-distance cores, we also propose a noise point detection method and verify its feasibility. We compare the proposed algorithm with existing classical and recent algorithms on synthetic and real data sets. Experiments show that HCDC clusters better than existing algorithms on complex-shaped data sets and on data sets with different densities. On data sets in which sparse and dense clusters lie close to each other, the ARI score of HCDC is more than 0.1 higher than that of LDP-MST. In particular, on the grid data set, HCDC's ARI score is 0.997 higher than that of LDP-MST. On DS3 and DS8, HCDC's ARI score is more than 0.14 higher than that of the second-best algorithm, RNN-DBSCAN. Moreover, on the zoo data set, HCDC's ARI score is 0.15 and 0.6 higher than those of RNN-DBSCAN and LDP-MST, respectively. On the Olivetti face data set, HCDC is the only algorithm with an NMI score above 0.9 on the photo1 and photo2 data sets.
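
For readers unfamiliar with the ARI (adjusted Rand index) scores quoted above, the following is a minimal sketch of the standard ARI formula in plain Python. It is not taken from the paper: the function name `adjusted_rand_index` and the toy label lists are illustrative assumptions, and none of the values correspond to the data sets in the experiments.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Standard adjusted Rand index between two flat labelings.

    Hypothetical helper for illustration only; not the paper's code.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same points")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        # Fewer than two points: the labelings trivially agree.
        return 1.0

    # Contingency counts: how many points share each (true, predicted) pair.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of point pairs grouped together in both, in truth, in prediction.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2         # best achievable agreement

    if max_index == expected:
        # Degenerate case, e.g. both labelings put every point in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    print(adjusted_rand_index(truth, [1, 1, 0, 0]))  # same partition, renamed labels
    print(adjusted_rand_index(truth, [0, 1, 0, 1]))  # partition that splits every true cluster
```

ARI is 1 for identical partitions regardless of how the labels are named, is close to 0 for partitions that agree only by chance, and can be negative. On that scale, a gap of 0.1 or more between two algorithms is a substantial difference in agreement with the ground truth.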

Clustering Categorical Data Based on Distance Vectors

A Statistical Information-Based Clustering Approach in Distance Space

Estimation of number of clusters in categorical data via distance-based likelihood function

Clustering High-Dimensional Noisy Categorical Data

Categorical data clustering: 25 years beyond K-modes

Categorizing Flight Paths using Data Visualization and Clustering Methodologies

Information-based Projection Method for Categorical Clustering and Outlier Detection

Clustering of high-dimensional observations

A k-mean clustering algorithm for mixed numeric and categorical data

Subspace Clustering by Directly Solving Discriminative K-means

Discriminative Similarity for Data Clustering

EDMD: An Entropy based Dissimilarity measure to cluster Mixed-categorical Data

Categorical Clustering by Converting Associated Information

Order Is All You Need for Categorical Data Clustering

Clustering by measuring local direction centrality for data with heterogeneous density and weak connectivity

Cluster Algorithm Based on Edge Density Distance

Efficient Clustering with Limited Distance Information

A Hash-based Co-Clustering Algorithm for Categorical Data

Clustering ensemble algorithm for categorical data

HCDC: A novel hierarchical clustering algorithm based on density-distance cores for data sets with varying density

Spectral Clustering for Discrete Distributions