Abstract: Clustering is an important technique in data mining and knowledge discovery. Affinity propagation clustering (AP) and density peaks and distance-based clustering (DDC) are two influential clustering algorithms, proposed in 2007 and 2014 respectively. Both have simple, clear designs, are effective at finding meaningful clustering solutions, and have been applied successfully in a wide range of applications. However, a key disadvantage of AP is its high time complexity, which becomes a bottleneck when AP is applied to large-scale problems. The core idea of DDC is to construct a decision graph from the local density and the distance of each data point and then select the cluster centers from it; this selection is relatively subjective, and it is sometimes difficult to determine a suitable number of cluster centers. Here, we propose a two-stage clustering algorithm, called DDAP, to overcome these shortcomings. First, we select a small number of potential exemplars based on the two DDC quantities of each data point, which greatly reduces the size of the similarity matrix. Then we perform message passing on the resulting incomplete similarity matrix. In experiments, two synthetic datasets, nine publicly available datasets, and a real-world electronic medical records (EMR) dataset are used to evaluate the proposed method. The results demonstrate that DDAP achieves clustering performance comparable to that of the original AP algorithm while noticeably improving computational efficiency.
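The two-stage idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the Gaussian density kernel and cutoff `dc`, ranking candidates by the product `rho * delta`, the candidate count `m`, negative squared Euclidean distance as similarity, the median similarity as preference, and the function names themselves. The abstract does not specify any of these details.

```python
import numpy as np

def density_and_distance(X, dc):
    """DDC quantities: local density rho and distance delta to the nearest denser point."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Assumption: Gaussian kernel density; subtract 1.0 to exclude the point itself.
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0
    delta = np.empty(len(X))
    for i in range(len(X)):
        denser = rho > rho[i]
        # The densest point has no denser neighbour; use its largest distance instead.
        delta[i] = d[i, denser].min() if denser.any() else d[i].max()
    return d, rho, delta

def select_candidates(rho, delta, m):
    """Assumed rule: keep the m points with the largest rho * delta as potential exemplars."""
    return np.argsort(rho * delta)[::-1][:m]

def ap_on_candidates(S, cand, damping=0.7, iters=200):
    """Affinity propagation in which only the points in `cand` may become exemplars.

    S has shape (n, m) with m >= 2; column k refers to point cand[k], and
    S[cand[k], k] holds the preference of that candidate.
    """
    n, m = S.shape
    R = np.zeros((n, m))
    A = np.zeros((n, m))
    rows, cols = np.arange(n), np.arange(m)
    for _ in range(iters):
        # Responsibilities: r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        best = AS.argmax(axis=1)
        first = AS[rows, best]
        AS[rows, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, best] = S[rows, best] - second
        R = damping * R + (1 - damping) * R_new
        # Availabilities: positive responsibilities summed per column,
        # with the candidate's self-responsibility left unclipped.
        Rp = np.maximum(R, 0)
        Rp[cand, cols] = R[cand, cols]
        A_new = Rp.sum(axis=0)[None, :] - Rp
        self_avail = A_new[cand, cols].copy()
        A_new = np.minimum(A_new, 0)
        A_new[cand, cols] = self_avail
        A = damping * A + (1 - damping) * A_new
    # Each point is assigned to the candidate maximising a(i,k) + r(i,k).
    return cand[(A + R).argmax(axis=1)]

# Toy data: two well-separated Gaussian blobs of 30 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 2)), rng.normal(20.0, 1.0, (30, 2))])

d, rho, delta = density_and_distance(X, dc=2.0)
cand = select_candidates(rho, delta, m=8)     # stage 1: compress to 8 columns
S = -d[:, cand] ** 2                          # incomplete (60 x 8) similarity matrix
S[cand, np.arange(len(cand))] = np.median(S)  # assumed preference value
labels = ap_on_candidates(S, cand)            # stage 2: message passing
print(len(np.unique(labels)), "exemplars selected from", len(cand), "candidates")
```

In this sketch the messages live in an n x m array rather than an n x n one, which is where the saving comes from when m is much smaller than n. The full pairwise distance matrix is still computed here to obtain the DDC quantities, so the sketch shows only the reduced message-passing stage, not an end-to-end complexity claim.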

Clustering Large Scale Data Set Based on Distributed Local Affinity Propagation on Spark

Distributed Affinity Propagation Clustering Based on MapReduce

Distributed High-Dimension Matrix Operation Optimization on Spark

Local and Global Approaches of Affinity Propagation Clustering for Large Scale Data

Affinity Propagation Clustering Algorithm Based on Large-Scale Data-Set

Distributed affinity propagation clustering algorithm based on GraphLab

An Improved K-means Distributed Clustering Algorithm Based on Spark Parallel Computing Framework

Single Large-Scale Graph Frequent Subgraph Algorithm Based on Spark

Fast Clustering by Affinity Propagation Based on Density Peaks

K-AP Clustering Algorithm for Large Scale Dataset

A Local Approach of Adaptive Affinity Propagation Clustering for Large Scale Data

A Joint Grid Segmentation Based Affinity Propagation Clustering Method for Big Data

DACA: Distributed Adaptive Grid Decision Graph Based Clustering Algorithm

Shuffle-Efficient Distributed Locality Sensitive Hashing On Spark

Adjustable Preference Affinity Propagation Clustering

A Density-Adaptive Affinity Propagation Clustering Algorithm Based on Spectral Dimension Reduction

Efficient Distributed Density Peaks for Clustering Large Data Sets in MapReduce

Parallel Division Clustering Algorithm Based on Spark Framework and ASPSO

Distributed structural clustering on large graph

PACk: an Efficient Partition-based Distributed Agglomerative Hierarchical Clustering Algorithm for Deduplication

Efficient Distributed Data Clustering on Spark