Abstract: In a wide variety of emerging data-intensive applications, such as social network analysis, Web document clustering, entity resolution, and detection of consistently co-expressed genes in systems biology, the detection of dense subgraphs (cliques and approximate cliques, or quasi-cliques) is an essential component. Unfortunately, these problems are NP-complete and thus computationally intensive at scale. Hence there is a need for techniques that distribute the computation across multiple machines, so that a computation too time-consuming for a single machine can be performed efficiently on a sufficiently large machine cluster. In this paper, we first propose a new approach for maximal clique and quasi-clique enumeration, which identifies dense subgraphs by recursive graph partitioning: it recursively divides a graph until each task is sufficiently small to be processed in parallel. Given a connected graph G = (V, E), it has a space complexity of O(|E|) and a time complexity of O(|E|μ(G)), where μ(G) is the number of distinct cliques (quasi-cliques) in G. We then develop parallel solutions and demonstrate how graph partitioning enables effective load balancing. Finally, we evaluate the performance of the proposed approach on real and synthetic graph data and show that it performs considerably better than existing approaches in both centralized and parallel settings. In the parallel setting, it can achieve speedups of up to 10x over existing approaches on large graphs. Our parallel algorithms are implemented and evaluated on MapReduce, a popular shared-nothing parallel framework, but can easily be generalized to other shared-nothing or shared-memory parallel frameworks.
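The general idea of splitting clique enumeration into independent subtasks can be illustrated with a minimal sketch. This is not the algorithm proposed in the paper, whose details the abstract does not give. It is a standard Bron-Kerbosch-style enumeration in which each vertex defines one subtask: finding the maximal cliques in which that vertex comes first under a fixed ordering. The names `partition_tasks`, `expand`, and `maximal_cliques` are hypothetical, the graph is assumed to be an undirected adjacency map with sortable vertex labels, and quasi-cliques are not covered.

```python
from typing import Dict, FrozenSet, Iterator, Set, Tuple

Graph = Dict[int, Set[int]]  # assumed format: vertex -> set of neighbours


def partition_tasks(graph: Graph) -> Iterator[Tuple[int, Set[int], Set[int]]]:
    """Split the enumeration into one independent subtask per vertex.

    The subtask for v covers exactly the maximal cliques whose
    lowest-ranked vertex is v, so no clique is reported twice.
    """
    rank = {v: i for i, v in enumerate(sorted(graph))}
    for v in sorted(graph):
        later = {u for u in graph[v] if rank[u] > rank[v]}    # candidates
        earlier = {u for u in graph[v] if rank[u] < rank[v]}  # already covered
        yield v, later, earlier


def expand(graph: Graph, clique: Set[int], candidates: Set[int],
           excluded: Set[int]) -> Iterator[FrozenSet[int]]:
    """Recursively grow `clique`; report it when nothing can extend it."""
    if not candidates and not excluded:
        yield frozenset(clique)
        return
    for v in list(candidates):
        # Recurse on the smaller subgraph induced by the neighbours of v.
        yield from expand(graph, clique | {v},
                          candidates & graph[v], excluded & graph[v])
        candidates = candidates - {v}
        excluded = excluded | {v}


def maximal_cliques(graph: Graph) -> Set[FrozenSet[int]]:
    """Run every subtask sequentially; each could run on a separate worker."""
    result: Set[FrozenSet[int]] = set()
    for v, later, earlier in partition_tasks(graph):
        result.update(expand(graph, {v}, later, earlier))
    return result


if __name__ == "__main__":
    # A triangle 1-2-3 with a pendant edge 3-4.
    g = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
    print(sorted(sorted(c) for c in maximal_cliques(g)))
```

Because the subtasks share no state, each one could in principle be handed to a separate worker, for example as one map task per vertex. Balancing the load across such tasks is the harder problem that the abstract says the paper addresses.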

A Study on Parallel Algorithm of the Gene Expression Data Clustering Analysis

Efficient Parallel Clustering Algorithm Based on Density

Parallel spectral clustering algorithm

A Parallel Clustering Algorithm Using Mapping and Sampling-Partitioning on the Cluster Computing Systems

A Parallel Algorithm for Gene Expressing Data Biclustering

Accelerating Gene Clustering on Heterogeneous Clusters

Application of New Clustering Algorithms in Gene Expression Data

A Parallel Varied Density-Based Clustering Algorithm with Optimized Data Partition

A Parallel K-Means Clustering Algorithm with MPI

Gen-Cluster: an Efficient Gene Expression Data High Dimensional Clustering Algorithm

Data Clustering Algorithm for DNA Microarray Based on Graph Theory

Triclustering of Gene Expression Microarray Data Using Coarse-Grained Parallel Genetic Algorithm

PGMCLU: A Novel Parallel Grid-Based Clustering Algorithm for Multi-Density Datasets

Parallelizing Clique and Quasi-Clique Detection over Graph Data

Faithful Density-Peaks Clustering via Matrix Computations on MPI Parallelization System

Study of Fast Parallel Clustering Partition Algorithm for Large Data Sets

Parallel Information Fusion Method for Microarray Data Analysis

The Study of Parallel Clustering Algorithm for Cluster System

Application of New Algorithm in Gene Expression Profile Clustering

A Novel Density Based Clustering Algorithm and Its Parallelization

Parallel Clustering Methods for Data Mining