Block-Diagonal Guided DBSCAN Clustering

Weibing Zhao
2024-04-27
Abstract: Cluster analysis plays a crucial role in database mining, and one of the most widely used algorithms in this field is DBSCAN. However, DBSCAN has several limitations, such as difficulty in handling high-dimensional large-scale data, sensitivity to input parameters, and lack of robustness in producing clustering results. This paper introduces an improved version of DBSCAN that leverages the block-diagonal property of the similarity graph to guide the clustering procedure of DBSCAN. The key idea is to construct a graph that measures the similarity between high-dimensional large-scale data points and has the potential to be transformed into a block-diagonal form through an unknown permutation, followed by a cluster-ordering procedure to generate the desired permutation. The clustering structure can be easily determined by identifying the diagonal blocks in the permuted graph. We propose a gradient descent-based method to solve the proposed problem. Additionally, we develop a DBSCAN-based points traversal algorithm that identifies clusters with high densities in the graph and generates an augmented ordering of clusters. The block-diagonal structure of the graph is then achieved through permutation based on the traversal order, providing a flexible foundation for both automatic and interactive cluster analysis. We introduce a split-and-refine algorithm to automatically search for all diagonal blocks in the permuted graph with theoretically optimal guarantees under specific cases. We extensively evaluate our proposed approach on twelve challenging real-world benchmark clustering datasets and demonstrate its superior performance compared to the state-of-the-art clustering method on every dataset.
Machine Learning, Artificial Intelligence, Data Structures and Algorithms
What problem does this paper attempt to address?
The paper attempts to address the limitations of the DBSCAN algorithm when dealing with high-dimensional large-scale data, including sensitivity to input parameters and insufficient robustness in generating clustering results. Specifically, the paper proposes an improved DBSCAN algorithm that utilizes the block-diagonal property of the similarity graph to guide the clustering process, thereby enhancing the algorithm's performance in handling high-dimensional large-scale data.

### Main Issues:

1. **Handling high-dimensional large-scale data**: The traditional DBSCAN algorithm performs poorly when dealing with high-dimensional large-scale data because constructing a graph for high-dimensional data is very expensive, and distance measurement methods (such as Euclidean distance, Chebyshev distance, cosine similarity, etc.) can only describe local relationships in the data space, making it difficult to represent relationships between high-dimensional data.
2. **Sensitivity to input parameters**: The DBSCAN algorithm is very sensitive to parameters (such as neighborhood radius ϵ and minimum points δ). Improper parameter settings can lead to inaccurate clustering results, especially in clusters with different densities.
3. **Insufficient robustness in generating clustering results**: Connectivity-based methods are not robust enough when dealing with noisy datasets, which may result in all points being connected through noise points, forming a single cluster.

### Solution:

The paper proposes an improved algorithm named BD-DBSCAN, which mainly includes three key stages:

1. **Graph construction**: Construct a similarity graph with a potential block-diagonal form.
2. **Graph arrangement**: Use a traversal algorithm to find an arrangement that transforms the graph into a block-diagonal form.
3. **Graph partitioning**: Automatically identify the diagonal blocks in the arranged graph to determine the clustering structure.

Through these steps, the BD-DBSCAN algorithm can better handle high-dimensional large-scale data, reduce dependency on input parameters, and improve the robustness of the generated clustering results.
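
The three stages can be illustrated with a toy sketch. It is not the paper's BD-DBSCAN: as simplifying assumptions, the learned similarity graph is replaced by a plain ϵ-neighborhood graph, the DBSCAN-based traversal by a breadth-first traversal seeded at high-degree points, and the split-and-refine search by cutting wherever no edge crosses the cut. All function names and the sample data are hypothetical; the sketch only shows how a traversal order exposes diagonal blocks.

```python
import numpy as np
from collections import deque


def similarity_graph(X, eps):
    """Stage 1 (toy stand-in): binary eps-neighborhood graph on Euclidean distance."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = (d <= eps).astype(float)
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W


def traversal_order(W):
    """Stage 2 (toy stand-in): breadth-first traversal, seeding each new
    component at the unvisited point with the highest degree."""
    n = W.shape[0]
    degree = W.sum(axis=1)
    visited = np.zeros(n, dtype=bool)
    order = []
    for seed in np.argsort(-degree, kind="stable"):
        if visited[seed]:
            continue
        visited[seed] = True
        queue = deque([seed])
        while queue:
            i = queue.popleft()
            order.append(int(i))
            for j in np.flatnonzero(W[i]):
                if not visited[j]:
                    visited[j] = True
                    queue.append(j)
    return np.array(order)


def diagonal_blocks(P):
    """Stage 3 (toy stand-in): cut the permuted matrix at every position k
    where no edge links rows start..k-1 to columns k..n-1."""
    n = P.shape[0]
    blocks, start = [], 0
    for k in range(1, n + 1):
        if k == n or not P[start:k, k:].any():
            blocks.append((start, k))
            start = k
    return blocks


# Two tight groups of points, deliberately interleaved in the input order.
X = np.array([
    [0.0, 0.0], [5.0, 5.0], [0.1, 0.0], [5.1, 5.0], [0.0, 0.1],
    [5.0, 5.1], [0.1, 0.1], [0.2, 0.0], [5.1, 5.1],
])

W = similarity_graph(X, eps=0.5)
order = traversal_order(W)
P = W[np.ix_(order, order)]       # permuted graph, now block-diagonal
blocks = diagonal_blocks(P)

labels = np.empty(len(X), dtype=int)
for c, (a, b) in enumerate(blocks):
    labels[order[a:b]] = c        # map block membership back to input order

print(blocks)   # [(0, 5), (5, 9)]
print(labels)   # [0 1 0 1 0 1 0 0 1]
```

Because this stand-in relies purely on connectivity, a single noise point bridging the two groups would merge them into one block, which is the robustness issue listed above.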