Abstract: In the era of big data, it is increasingly common that large amounts of data are generated across multiple distributed sites and cannot be gathered at a centralized site for further analysis, which invalidates the assumption behind traditional clustering techniques based on centralized models. The major challenge is that these distributed datasets cannot be trivially merged because of privacy concerns, limited network bandwidth among sites, and the limited computational capacity of a single site. To tackle this challenge, we propose an efficient distributed clustering scheme using boundary information (DCUBI), which is both flexible and scalable. The main procedure of DCUBI consists of three steps: local, global, local. First, each local site extracts the boundary points from its own data and applies a traditional clustering algorithm to the boundary points only. Second, the labeled boundary points from each site are sent to the central site as local representatives, where boundary and cluster fusion is performed to form the global clustering model. Finally, the global boundary and cluster information is sent back to each local site for refined local clustering. To demonstrate the effectiveness of DCUBI, we plug the well-known DBSCAN algorithm into DCUBI and conduct comprehensive experiments on datasets with different properties. The experimental results verify the clustering quality of DCUBI as well as its superior time efficiency when the volume of data at each site is large. Furthermore, other popular clustering techniques, especially those with high time complexity such as spectral clustering and affinity propagation, are also plugged into DCUBI to demonstrate the flexibility of the proposed scheme.
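The local, global, local procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how boundary points are detected or how fusion is done, so the neighbor-count boundary test, the radius-graph clusterer used in place of DBSCAN, the nearest-boundary-point refinement, and every function and parameter name (`boundary_points`, `link_clusters`, `fuse`, `refine`, `eps`, `max_nbrs`) are assumptions made for illustration only.

```python
from math import dist

def boundary_points(points, eps, max_nbrs):
    """Assumed boundary test: keep points with at most max_nbrs others within eps."""
    kept = []
    for i, p in enumerate(points):
        n = sum(1 for j, q in enumerate(points) if j != i and dist(p, q) <= eps)
        if n <= max_nbrs:
            kept.append(p)
    return kept

def link_clusters(points, radius):
    """Stand-in for DBSCAN: connected components of the radius graph (union-find)."""
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= radius:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]

def fuse(site_boundaries, radius):
    """Central site: merge local clusters whose boundary points lie within radius."""
    pooled = [(p, (site, lab))
              for site, (pts, labs) in enumerate(site_boundaries)
              for p, lab in zip(pts, labs)]
    parent = {key: key for _, key in pooled}
    def find(k):
        while parent[k] != k:
            k = parent[k]
        return k
    for i, (p, kp) in enumerate(pooled):
        for q, kq in pooled[i + 1:]:
            # Only cross-site pairs need checking; same-site pairs are already labeled.
            if kp[0] != kq[0] and dist(p, q) <= radius:
                parent[find(kp)] = find(kq)
    roots = sorted({find(k) for k in parent})
    global_id = {r: n for n, r in enumerate(roots)}
    return [(p, global_id[find(k)]) for p, k in pooled]

def refine(points, global_boundary):
    """Local site: each point takes the label of its nearest global boundary point."""
    return [min(global_boundary, key=lambda b: dist(p, b[0]))[1] for p in points]

def dcubi_sketch(sites, eps=1.1, max_nbrs=3):
    # Step 1 (local): extract and cluster boundary points at each site.
    local = []
    for pts in sites:
        b = boundary_points(pts, eps, max_nbrs)
        local.append((b, link_clusters(b, eps)))
    # Step 2 (global): fuse the labeled boundary points at the central site.
    global_boundary = fuse(local, eps)
    # Step 3 (local): send the global model back and relabel all local points.
    return [refine(pts, global_boundary) for pts in sites]

def grid(x0, y0):
    """Toy 3x3 unit grid with its lower-left corner at (x0, y0)."""
    return [(x0 + i, y0 + j) for i in range(3) for j in range(3)]

site_a = grid(0, 0)                  # one cluster
site_b = grid(3, 0) + grid(20, 20)   # one cluster adjacent to site A, one far away
labels_a, labels_b = dcubi_sketch([site_a, site_b])
```

In this toy setup only the perimeter points of each grid leave their site, yet the grid at site A and the adjacent grid at site B end up with the same global label, while the distant grid keeps its own. The quadratic neighbor searches keep the sketch short; they are not meant to reflect the efficiency claims made in the abstract.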

Towards a Compact and Effective Representation for Datasets with Inhomogeneous Clusters

Using Visualization to Improve Clustering Analysis on Heterogeneous Information Network

Efficient Distributed Clustering Using Boundary Information

Mining Representative Subspace Clusters in High-dimensional Data

An Efficient Method for Boundary Points Detection Based on Data Expression

A Highly Scalable Clustering Scheme Using Boundary Information

Surface Extraction and Boundary Detection Based on DBSCAN Clustering in 3D Point Clouds

Towards effective and efficient mining of arbitrary shaped clusters

A Unified Framework for Representation-Based Subspace Clustering of Out-of-Sample and Large-Scale Data

Efficient Approaches for Summarizing Subspace Clusters into K Representatives

ROCM: A Rolling Iteration Clustering Model Via Extracting Data Features

Clustering by measuring local direction centrality for data with heterogeneous density and weak connectivity

A Statistical Information-Based Clustering Approach in Distance Space

3D Surface Segmentation from Point Clouds Via Quadric Fits Based on DBSCAN Clustering

A Robust and Efficient Boundary Point Detection Method by Measuring Local Direction Dispersion

A Parallel Varied Density-Based Clustering Algorithm with Optimized Data Partition

An Effective Clustering Algorithm Using Adaptive Neighborhood and Border Peeling Method

A Novel Type of Boundary Extraction Method and Its Statistical Improvement for Unorganized Point Clouds Based on Concurrent Delaunay Triangular Meshes

Non-iterative Border-Peeling Clustering Algorithm based on Swap Strategy

CLINCH: clustering incomplete high-dimensional data for data mining application

On Saving Outliers for Better Clustering over Noisy Data