Abstract: Cognitive computing involves discovering hidden rules and patterns in massive volumes of data. Density peaks clustering (DPC) is a powerful data mining tool that identifies density peaks in a decision graph and assigns labels to points without requiring iterations, so it can detect clusters of arbitrary shape simply and efficiently. However, DPC has two weaknesses. First, density measured with the ϵ-neighborhood or a Gaussian kernel reflects only the global structure of the data, so the correct density peaks may not be found and performance on manifold datasets is weakened. Second, the one-step allocation strategy can cause a chain reaction: once a high-density point is misallocated, a series of subsequent points is also assigned incorrectly. To address these problems, this paper uses the Jaccard coefficient to measure the similarity between points. The proposed Jaccard-based density of a point depends only on the k points most similar to it, which reflects the local structure of manifold datasets and allows the density peaks to be identified accurately. To counter the chain reaction caused by the assignment strategy of DPC, we develop a two-step allocation strategy based on label propagation and the proposed similarity measure. The first step assigns labels to points close to the cluster centers; these points play the role of the labeled points in the label propagation algorithm. The second step assigns each remaining point a label according to the labeled point nearest to it. We compared the proposed algorithm with four other algorithms on synthetic and real-world datasets. On three evaluation metrics, the proposed algorithm outperforms the other algorithms.
The clustering results on synthetic datasets verify the effectiveness of the proposed method on manifold datasets, and three metrics on the UCI datasets and the Olivetti Faces dataset show that it can reveal the patterns and associations in real-world datasets.
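
The density measurement described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so the details here are assumptions: each point is represented by the set containing itself and its k nearest neighbors under Euclidean distance, the Jaccard coefficient is computed between these sets, and the density of a point is taken as the sum of its k largest similarities. The function names `knn_sets`, `jaccard`, and `jaccard_density` are hypothetical and are not taken from the paper.

```python
import math


def knn_sets(points, k):
    """For each point, return the index set of itself plus its k nearest neighbors.

    Assumption: neighborhoods use Euclidean distance and include the point itself.
    """
    n = len(points)
    sets = []
    for i in range(n):
        # The point itself has distance 0, so it always appears among the first k + 1.
        order = sorted(range(n), key=lambda j: math.dist(points[i], points[j]))
        sets.append(set(order[: k + 1]))
    return sets


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_density(points, k):
    """Assumed density: sum of the k largest Jaccard similarities to other points."""
    nbrs = knn_sets(points, k)
    n = len(points)
    density = []
    for i in range(n):
        sims = sorted(
            (jaccard(nbrs[i], nbrs[j]) for j in range(n) if j != i),
            reverse=True,
        )
        density.append(sum(sims[:k]))
    return density


if __name__ == "__main__":
    # Three nearby points and one isolated point on a line.
    pts = [(0.0,), (1.0,), (2.0,), (100.0,)]
    print(jaccard_density(pts, k=2))
```

In this toy example the three nearby points share identical neighborhoods and receive a higher density than the isolated point, which is the kind of local behavior the abstract attributes to the Jaccard-based measure. The brute-force neighbor search is quadratic in the number of points and is meant only for illustration.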

Parallel Massive Clustering of Discrete Distributions

Distributed Affinity Propagation Clustering Based on MapReduce

Parallel Topic Model and Its Application on Document Clustering

Parallel spectral clustering algorithm

A boosted clustering algorithm for distributed homogeneous data mining

A Parallel Varied Density-Based Clustering Algorithm with Optimized Data Partition

Faithful Density-Peaks Clustering via Matrix Computations on MPI Parallelization System

Distributed Information Theoretic Clustering

Spectral Clustering for Discrete Distributions

Faster Parallel Exact Density Peaks Clustering

Paralinear Distance and Its Algorithm for Hierarchical Clustering of High-dimensional Discrete Variables

UP-DPC: Ultra-scalable Parallel Density Peak Clustering

ParSymG: a Parallel Clustering Approach for Unsupervised Classification of Remotely Sensed Imagery

A Survey and Experimental Review on Data Distribution Strategies for Parallel Spatial Clustering Algorithms

A Domain Adaptive Density Clustering Algorithm for Data with Varying Density Distribution

A Sampling-Based Density Peaks Clustering Algorithm for Large-Scale Data

PECANN: Parallel Efficient Clustering with Graph-Based Approximate Nearest Neighbor Search

Density Peaks Clustering Based on Jaccard Similarity and Label Propagation

Distributed Bayesian Matrix Decomposition for Big Data Mining and Clustering

Dual-disentangled Deep Multiple Clustering

Improved Hierarchical Clustering on Massive Datasets with Broad Guarantees