Clustering by Mining Density Distributions and Splitting Manifold Structure

Zhichang Xu, Zhiguo Long, Hua Meng
2024-08-20
Abstract: Spectral clustering requires the time-consuming decomposition of the Laplacian matrix of the similarity graph, thus limiting its applicability to large datasets. To improve the efficiency of spectral clustering, a top-down approach was recently proposed, which first divides the data into several micro-clusters (granular-balls), then splits these micro-clusters when they are not "compact", and finally uses these micro-clusters as nodes to construct a similarity graph for more efficient spectral clustering. However, this top-down approach is challenging to adapt to unevenly distributed or structurally complex data. This is because constructing micro-clusters as a rough ball struggles to capture the shape and structure of data in a local range, and the simplistic splitting rule that solely targets "compactness" is susceptible to noise and variations in data density and leads to micro-clusters with varying shapes, making it challenging to accurately measure the similarity between them. To resolve these issues, this paper first proposes to start from local structures to obtain micro-clusters, such that the complex structural information inside local neighborhoods is well captured by them. Moreover, by noting that Euclidean distance is more suitable for convex sets, this paper further proposes a data splitting rule that couples local density and data manifold structures, so that the similarities of the obtained micro-clusters can be easily characterized. A novel similarity measure between micro-clusters is then proposed for the final spectral clustering. A series of experiments based on synthetic and real-world datasets demonstrate that the proposed method has better adaptability to structurally complex data than granular-ball-based methods.
Machine Learning
What problem does this paper attempt to address?
This paper addresses the inefficiency of spectral clustering on large-scale datasets, and in particular the limitations of existing accelerated methods on data with complex structure or uneven distribution. Specifically:

1. **High computational complexity of spectral clustering**: Traditional spectral clustering requires a spectral decomposition of the Laplacian matrix of the similarity graph, with time complexity \(O(n^3)\), which makes it very time-consuming on large-scale datasets.
2. **Deficiencies of the granular-ball-based method**: A recently proposed top-down method makes spectral clustering more efficient by dividing the data into multiple micro-clusters ("granular-balls") and then constructing a similarity graph over these micro-clusters. However, it performs poorly on data with complex structure or uneven density, mainly because:
   - Constructing each granular-ball as a rough sphere makes it difficult to capture the shape and structure of the data within a local range.
   - The simple splitting rule targets only "compactness" and is easily affected by noise and changes in data density, producing micro-clusters of varying shapes whose similarity is difficult to measure accurately.

To overcome these problems, the paper proposes a new method that improves on existing spectral clustering in the following ways:

- **Obtaining micro-clusters from local structures**: ensures that complex structural information within local neighborhoods is well captured.
- **Introducing a splitting rule that couples local density and data manifold structure**: makes the similarity between micro-clusters easier to characterize.
- **Proposing a new similarity measure between micro-clusters**: used for the final spectral clustering.
In a series of experiments on synthetic and real-world datasets, the method shows better adaptability and performance than granular-ball-based methods, especially on data with complex structures.
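
The general micro-cluster idea summarized above can be illustrated with a minimal Python sketch. It is not the paper's algorithm: plain k-means stands in for micro-cluster construction, and a Gaussian kernel on micro-cluster centers stands in for the paper's similarity measure. The function names (`sq_dists`, `kmeans`, `micro_cluster_spectral`) and the parameters `n_micro` and `sigma` are hypothetical, chosen only for illustration. The sketch shows only where the saving comes from: the eigendecomposition runs on a Laplacian over the micro-clusters rather than over all data points.

```python
import numpy as np


def sq_dists(A, B):
    """Pairwise squared Euclidean distances between rows of A and rows of B."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)


def kmeans(X, k, n_iter=50):
    """Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        nearest = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[nearest.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        labels = sq_dists(X, centers).argmin(1)
        for j in range(k):
            if np.any(labels == j):  # leave an empty cluster's center unchanged
                centers[j] = X[labels == j].mean(0)
    return centers, sq_dists(X, centers).argmin(1)


def micro_cluster_spectral(X, n_micro, n_clusters, sigma):
    """Spectral clustering on micro-cluster centers instead of raw points."""
    # Step 1 (stand-in, not the paper's rule): group points into micro-clusters.
    centers, micro_of_point = kmeans(X, n_micro)
    # Step 2 (stand-in): Gaussian similarity between micro-cluster centers.
    W = np.exp(-sq_dists(centers, centers) / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Step 3: symmetric normalized Laplacian of size n_micro x n_micro.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(1))
    L = np.eye(n_micro) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # Step 4: eigendecomposition, cubic in n_micro rather than in len(X).
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :n_clusters]
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    # Step 5: cluster the embedded micro-clusters, then map labels to points.
    _, micro_labels = kmeans(U, n_clusters)
    return micro_labels[micro_of_point]


# Usage: two well-separated Gaussian blobs of 200 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(200, 2)),
               rng.normal([6.0, 0.0], 0.5, size=(200, 2))])
labels = micro_cluster_spectral(X, n_micro=20, n_clusters=2, sigma=1.0)
print(np.bincount(labels))
```

With 400 points and 20 micro-clusters, the matrix passed to `np.linalg.eigh` is 20 x 20 instead of 400 x 400. Centers and a fixed-width Gaussian kernel are a reasonable stand-in only for roughly convex, evenly dense groups, which is the limitation the paper's splitting rule and similarity measure are designed to address.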