Abstract: Cluster analysis plays a crucial role in database mining, and one of the most widely used algorithms in this field is DBSCAN. However, DBSCAN has several limitations, such as difficulty in handling high-dimensional large-scale data, sensitivity to input parameters, and lack of robustness in producing clustering results. This paper introduces an improved version of DBSCAN that leverages the block-diagonal property of the similarity graph to guide the clustering procedure of DBSCAN. The key idea is to construct a graph that measures the similarity between high-dimensional large-scale data points and has the potential to be transformed into a block-diagonal form through an unknown permutation, followed by a cluster-ordering procedure to generate the desired permutation. The clustering structure can be easily determined by identifying the diagonal blocks in the permuted graph. We propose a gradient descent-based method to solve the proposed problem. Additionally, we develop a DBSCAN-based points traversal algorithm that identifies clusters with high densities in the graph and generates an augmented ordering of clusters. The block-diagonal structure of the graph is then achieved through permutation based on the traversal order, providing a flexible foundation for both automatic and interactive cluster analysis. We introduce a split-and-refine algorithm to automatically search for all diagonal blocks in the permuted graph with theoretically optimal guarantees under specific cases. We extensively evaluate our proposed approach on twelve challenging real-world benchmark clustering datasets and demonstrate its superior performance compared to the state-of-the-art clustering method on every dataset.

What problem does this paper attempt to address?

The paper attempts to address the limitations of the DBSCAN algorithm when dealing with high-dimensional large-scale data, including sensitivity to input parameters and insufficient robustness in generating clustering results. Specifically, the paper proposes an improved DBSCAN algorithm that uses the block-diagonal property of the similarity graph to guide the clustering process, thereby improving performance on high-dimensional large-scale data.

### Main Issues

1. **Handling high-dimensional large-scale data**: The traditional DBSCAN algorithm performs poorly on high-dimensional large-scale data because constructing a graph for such data is very expensive, and distance measures (such as Euclidean distance, Chebyshev distance, and cosine similarity) can only describe local relationships in the data space, making it difficult to represent relationships between high-dimensional data points.
2. **Sensitivity to input parameters**: The DBSCAN algorithm is very sensitive to its parameters (the neighborhood radius ϵ and the minimum number of points δ). Improper parameter settings can lead to inaccurate clustering results, especially when clusters have different densities.
3. **Insufficient robustness in generating clustering results**: Connectivity-based methods are not robust enough on noisy datasets, where all points may end up connected through noise points and merged into a single cluster.

### Solution

The paper proposes an improved algorithm named BD-DBSCAN, which consists of three key stages:

1. **Graph construction**: Construct a similarity graph that can potentially be brought into block-diagonal form.
2. **Graph permutation**: Use a traversal algorithm to find a permutation that transforms the graph into block-diagonal form.
3. **Graph partitioning**: Automatically identify the diagonal blocks in the permuted graph to determine the clustering structure.

Through these steps, the BD-DBSCAN algorithm can better handle high-dimensional large-scale data, reduce dependency on input parameters, and improve the robustness of the generated clustering results.
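The three stages can be illustrated with a minimal sketch. It is not the paper's implementation and makes simplifying assumptions throughout: the graph is a plain thresholded neighbourhood graph rather than one learned by gradient descent, the traversal is a breadth-first expansion seeded at high-degree points rather than the DBSCAN-based traversal, and the blocks are found with a simple zero-cut scan rather than the split-and-refine algorithm. All function names (`similarity_graph`, `traversal_order`, `diagonal_blocks`) are hypothetical.

```python
import numpy as np
from collections import deque

def similarity_graph(X, eps):
    # Stage 1 (assumption): binary eps-neighbourhood graph standing in
    # for the learned similarity graph described in the paper.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return (d <= eps).astype(float)

def traversal_order(W):
    # Stage 2 (assumption): breadth-first expansion, seeded at the
    # highest-degree unvisited point, so each connected group of points
    # is visited contiguously.
    n = W.shape[0]
    visited = np.zeros(n, dtype=bool)
    order = []
    for seed in np.argsort(-W.sum(axis=1), kind="stable"):
        if visited[seed]:
            continue
        visited[seed] = True
        queue = deque([seed])
        while queue:
            i = queue.popleft()
            order.append(int(i))
            for j in np.flatnonzero(W[i]):
                if not visited[j]:
                    visited[j] = True
                    queue.append(j)
    return np.array(order)

def diagonal_blocks(P):
    # Stage 3 (assumption): close a block at position k as soon as no
    # edge links rows start..k-1 to columns k..n-1.
    n = P.shape[0]
    blocks, start = [], 0
    for k in range(1, n + 1):
        if k == n or P[start:k, k:].sum() == 0:
            blocks.append((start, k))
            start = k
    return blocks

# Toy data: two integer grids far apart, then shuffled so that the raw
# similarity matrix shows no visible block structure.
grid_a = np.array([(x, y) for x in range(4) for y in range(5)], dtype=float)
grid_b = np.array([(x + 100, y) for x in range(3) for y in range(5)], dtype=float)
X = np.vstack([grid_a, grid_b])
truth = np.array([0] * len(grid_a) + [1] * len(grid_b))
shuffle = np.random.default_rng(0).permutation(len(X))
X, truth = X[shuffle], truth[shuffle]

W = similarity_graph(X, eps=1.0)
order = traversal_order(W)
P = W[np.ix_(order, order)]          # permuted graph, block-diagonal
blocks = diagonal_blocks(P)

# Map each diagonal block back to a cluster label on the original points.
labels = np.empty(len(X), dtype=int)
for c, (s, e) in enumerate(blocks):
    labels[order[s:e]] = c

print("block sizes:", sorted(e - s for s, e in blocks))
```

On this toy input the permuted matrix `P` has two diagonal blocks, one per grid. Real data with noise would not give exactly zero off-diagonal entries, which is the case the paper's learned graph and split-and-refine step are meant to handle.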

Block-Diagonal Guided DBSCAN Clustering
