Abstract: Density-based clustering for big data is critical for many modern applications, ranging from Internet data processing to massive-scale moving object management. This paper proposes Cludoop, an efficient distributed density-based clustering algorithm for big data using Hadoop. First, we propose a serial clustering algorithm, CluC, which leverages cell partition optimization and c-clusters to find clusters quickly. CluC classifies points using the relationships of the connected cells around them instead of expensive complete neighbor queries, which significantly reduces the number of distance calculations. Second, we propose Cludoop, which efficiently clusters very-large-scale data in parallel using the already existing data partitions on a Map/Reduce platform. It employs the proposed serial algorithm CluC as a plugged-in clustering step on the parallel mappers, and transmits cell descriptions instead of complete cells to reduce both network and I/O costs. Guided by the proposed cell-based principles, we also design a three-step Merging-Refinement-Merging framework that merges c-clusters on the overlay of the assigned preclustering results on the reducer. Finally, our comprehensive experimental evaluation on 10 network-connected commercial PCs, using both huge-volume real and synthetic data, demonstrates (1) the effectiveness of our algorithm in finding correct clusters of arbitrary shape and (2) that our proposed algorithm exhibits better scalability and efficiency than the state-of-the-art method.
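The abstract does not spell out how CluC partitions cells, so the following is only a generic sketch of the cell-partition idea, not the authors' algorithm. It assumes 2-D points, a DBSCAN-style radius `eps` and threshold `min_pts` (counting the point itself), and a cell side of `eps / sqrt(2)`, so that any two points in the same cell are at most `eps` apart. The function names `assign_cells` and `dense_cells` are hypothetical.

```python
import math
from collections import defaultdict


def assign_cells(points, eps):
    """Group 2-D points into square cells whose diagonal equals eps.

    Assumption for this sketch: side = eps / sqrt(2), so two points that
    share a cell can never be farther apart than eps.
    """
    side = eps / math.sqrt(2)
    cells = defaultdict(list)
    for x, y in points:
        key = (math.floor(x / side), math.floor(y / side))
        cells[key].append((x, y))
    return cells


def dense_cells(cells, min_pts):
    """Return keys of cells holding at least min_pts points.

    Every point in such a cell already has min_pts points (itself included)
    within eps, so it can be marked as a core point without computing any
    pairwise distance.
    """
    return {key for key, pts in cells.items() if len(pts) >= min_pts}


if __name__ == "__main__":
    sample = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2), (5.0, 5.0)]
    grid = assign_cells(sample, eps=1.0)
    print(dense_cells(grid, min_pts=3))
```

Points in sparse cells would still need distance checks against neighboring cells; the sketch only illustrates why counting points per cell can replace many neighbor queries.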

Partition Affinity Propagation for Clustering Large Scale of Data in Digital Library

Distributed Affinity Propagation Clustering Based on MapReduce

Affinity Propagation Clustering Algorithm Based on Large-Scale Data-Set

A Fast Algorithm for Density-Based Clustering in Large Database

Local and Global Approaches of Affinity Propagation Clustering for Large Scale Data

A boosted clustering algorithm for distributed homogeneous data mining

Privacy-Preserving Affinity Propagation Clustering over Vertically Partitioned Data

A Parallel Varied Density-Based Clustering Algorithm with Optimized Data Partition

A Hybrid Approach to Clustering in Very Large Databases

Adjustable Preference Affinity Propagation Clustering

Online Stream Clustering Using Density and Affinity Propagation Algorithm

An Improved Affinity Propagation Clustering Algorithm Based on Entropy Weight Method and Principal Component Analysis

A Study of Performance Optimization Method for Massive Spaito-temporal Data Based on Spatio-temporal Partition Clustering

Extended Affinity Propagation: Global Discovery and Local Insights

An Improved Integrated Clustering Learning Strategy Based on Three-Stage Affinity Propagation Algorithm with Density Peak Optimization Theory

Improved Hierarchical Clustering on Massive Datasets with Broad Guarantees

Cludoop: an efficient distributed density-based clustering for big data using hadoop

Scalable Co-Clustering for Large-Scale Data through Dynamic Partitioning and Hierarchical Merging

A Sampling-Based Density Peaks Clustering Algorithm for Large-Scale Data

Enhanced Locality Sensitive Clustering in High Dimensional Space

Online Clustering of Evolution Data Stream Based on Affinity Propagation Clustering