Abstract: We study distributed algorithms for large-scale graphs, focusing on the fundamental problems of connectivity and minimum spanning tree (MST). We consider the k-machine model, a well-studied model for distributed computing for large-scale graph computations, where k ≥ 2 machines jointly perform computations on graphs with n nodes (typically, n ≫ k). The input graph is assumed to be initially randomly partitioned among the k machines, a common implementation in many real-world systems. Communication is point-to-point, and the goal is to minimize the number of communication rounds (denoted Tc) of the computation. While communication is a significant factor that affects the time needed for large-scale computations, the computation cost incurred by the individual machines also contributes to the overall time complexity of the distributed algorithm. We posit a complexity measure called the local computation cost (denoted Tℓ) that measures the worst-case local computation cost among the machines. A lower bound for Tℓ in our model is Ω((m + n)/k + Δ + k), while a lower bound on Tc is Ω(n/k²) [Klauck et al., SODA 2015], where m is the number of edges and Δ is the maximum degree. Prior algorithms for connectivity and MST in the k-machine model [Klauck et al., SODA 2015; Pandurangan et al., SPAA 2016] do not take into account local computation; a straightforward local implementation of these algorithms is not optimal with respect to local computation. In this paper, we study several distributed algorithms for connectivity and MST and analyze their performance with respect to both the computation and communication cost. In particular, we analyze a well-studied flooding algorithm for connectivity and connected components that takes rounds and local computation time. We then present a deterministic filtering algorithm that has an improved round complexity of but local computation complexity of . Next, we present two deterministic algorithms which are increasingly sophisticated implementations of the classical Borůvka’s algorithm, the last of which has round complexity and local computation complexity . We finally present a randomized algorithm to find connected components with round complexity and local computation complexity that are both essentially optimal (up to polylogarithmic factors).
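For readers unfamiliar with the classical Borůvka's algorithm mentioned in the abstract, the following is a minimal sequential sketch in Python. It is not the paper's distributed k-machine implementation and says nothing about its round or local computation complexity. The function name `boruvka_mst`, the edge-list representation, and the tie-breaking rule by edge index are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int, float]  # (u, v, weight); hypothetical representation


def boruvka_mst(n: int, edges: List[Edge]) -> Tuple[List[Edge], int]:
    """Sequential Boruvka sketch.

    Returns (forest_edges, num_components) for an undirected weighted graph
    on vertices 0..n-1. On a connected graph the forest is an MST; otherwise
    it is a minimum spanning forest, one tree per connected component.
    """
    parent = list(range(n))  # union-find parent pointers

    def find(x: int) -> int:
        # Find the component representative, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest: List[Edge] = []
    components = n
    while True:
        # Phase step 1: each component picks its lightest outgoing edge.
        cheapest: Dict[int, int] = {}  # component root -> edge index
        for i, (u, v, w) in enumerate(edges):
            ru, rv = find(u), find(v)
            if ru == rv:
                continue  # edge lies inside one component
            for r in (ru, rv):
                # Ties are broken by edge index so choices are consistent.
                if r not in cheapest or (w, i) < (edges[cheapest[r]][2], cheapest[r]):
                    cheapest[r] = i
        if not cheapest:
            break  # no edge leaves any component: done

        # Phase step 2: merge components along the selected edges.
        for i in sorted(set(cheapest.values())):
            u, v, w = edges[i]
            ru, rv = find(u), find(v)
            if ru != rv:  # guard against merging the same pair twice
                parent[ru] = rv
                forest.append((u, v, w))
                components -= 1
    return forest, components


if __name__ == "__main__":
    # Small example: a triangle, a separate edge, and an isolated vertex.
    example = [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 3.0), (3, 4, 1.0)]
    tree, k = boruvka_mst(6, example)
    print(sorted(tree), k)
```

Because the number of components at least halves in every phase in which some merge happens, the loop runs for O(log n) phases. The same routine also answers connectivity: the returned component count equals the number of connected components of the input graph.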


Distributed Algorithms for Connectivity and MST in Large Graphs with Efficient Local Computation
