Faster Streaming and Scalable Algorithms for Finding Directed Dense Subgraphs in Large Graphs

Slobodan Mitrović, Theodore Pan
2023-11-18
Abstract: Finding dense subgraphs is a fundamental algorithmic tool in data mining, community detection, and clustering. In this problem, one aims to find an induced subgraph whose edge-to-vertex ratio is maximized. We study the directed case of this question in the context of semi-streaming and massively parallel algorithms. In particular, we show that it is possible to find a $(2+\epsilon)$ approximation on randomized streams even in a single pass by using $O(n \cdot {\rm poly} \log n)$ memory on $n$-vertex graphs. Our result improves over prior works, which were designed for arbitrary-ordered streams: the algorithm by Bahmani et al. (VLDB 2012) which uses $O(\log n)$ passes, and the work by Esfandiari et al. (2015) which makes one pass but uses $O(n^{3/2})$ memory. Moreover, our techniques extend to the Massively Parallel Computation model yielding $O(1)$ rounds in the super-linear and $O(\sqrt{\log n})$ rounds in the nearly-linear memory regime. This constitutes a quadratic improvement over state-of-the-art bounds by Bahmani et al. (VLDB 2012 and WAW 2014), which require $O(\log n)$ rounds even in the super-linear memory regime. Finally, we empirically evaluate our single-pass semi-streaming algorithm on $6$ benchmarks and show that, even on non-randomly ordered streams, the quality of its output is essentially the same as that of Bahmani et al. (VLDB 2012) while it is $2$ times faster on large graphs.
Data Structures and Algorithms
What problem does this paper attempt to address?
This paper addresses the problem of efficiently finding directed dense subgraphs in large-scale graphs. Specifically, it studies how to find dense subgraphs of directed graphs quickly in the semi-streaming model and in the Massively Parallel Computation (MPC) model.

#### Background and Motivation

1. **Importance of Dense Subgraphs**:
   - Dense subgraph discovery is a fundamental tool in applications such as data mining, community detection, spam detection, fraud discovery, clustering, and graph compression.
   - Many effective algorithms exist for undirected graphs, but existing methods for directed graphs are less efficient, especially on large-scale graphs.
2. **Limitations of Existing Methods**:
   - Existing semi-streaming algorithms either make multiple passes over the stream ($O(\log n)$ passes in Bahmani et al. [VLDB 2012]) or use a large amount of memory ($O(n^{3/2})$ in Esfandiari et al. [2015]).
   - In the MPC model, existing algorithms for directed dense subgraphs require $O(\log n)$ rounds, even in the super-linear memory regime.

#### Main Contributions of the Paper

1. **Single-Pass Semi-Streaming Algorithm**:
   - The paper proposes a single-pass semi-streaming algorithm that, on a randomly ordered stream, outputs a $(2+\epsilon)$-approximate directed densest subgraph with high probability while using only $O(n \cdot {\rm poly} \log n)$ memory.
   - The algorithm also performs well on non-randomly ordered streams: it is about twice as fast as Bahmani et al. [VLDB 2012] on large graphs, with comparable or even better output quality.
2. **Improvements in the MPC Model**:
   - In the super-linear memory regime, the paper gives an $O(1)$-round MPC algorithm.
   - In the nearly-linear memory regime, it gives an $O(\sqrt{\log n})$-round MPC algorithm, a quadratic improvement in round complexity over the previous $O(\log n)$ bound.
3. **Removal of the Assumption of the Optimal Ratio $c$**:
   - Instead of assuming that the optimal ratio $c$ is known, the algorithm is run for guessed values of $c$, which yields a $2(1+\epsilon)\sqrt{\delta}$-approximate solution.

#### Summary

By proposing new semi-streaming and MPC algorithms, this paper improves the efficiency of finding directed dense subgraphs in large-scale graphs, particularly in the single-pass, low-memory setting. These improvements matter for processing modern large-scale directed graphs such as social networks and email networks.
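The idea of guessing the ratio $c$ can be illustrated with a small offline sketch. It is not the paper's streaming or MPC algorithm: it swaps in a classical Charikar-style greedy peeling as the subroutine, and it assumes the standard directed density $\rho(S,T) = |E(S,T)| / \sqrt{|S|\,|T|}$, where $c$ stands for the ratio $|S|/|T|$. The function names (`directed_density`, `peel_for_ratio`, `densest_by_guessing`) and the reading of $\delta$ as the multiplicative step between consecutive guesses are assumptions made for illustration; the summary above does not define $\delta$.

```python
import math
from typing import List, Set, Tuple

Edge = Tuple[int, int]  # directed edge (u, v), vertices numbered 0..n-1


def directed_density(edges: List[Edge], S: Set[int], T: Set[int]) -> float:
    """Assumed definition: |E(S, T)| / sqrt(|S| * |T|)."""
    if not S or not T:
        return 0.0
    e_st = sum(1 for u, v in edges if u in S and v in T)
    return e_st / math.sqrt(len(S) * len(T))


def peel_for_ratio(n: int, edges: List[Edge], c: float) -> Tuple[float, Set[int], Set[int]]:
    """Greedy peeling for one guessed ratio c (illustrative, O(n * m) time)."""
    S, T = set(range(n)), set(range(n))
    best = (0.0, set(), set())
    while S and T:
        rho = directed_density(edges, S, T)
        if rho > best[0]:
            best = (rho, set(S), set(T))
        # Degrees counted only over edges going from S to T.
        out_deg = {u: 0 for u in S}
        in_deg = {v: 0 for v in T}
        for u, v in edges:
            if u in S and v in T:
                out_deg[u] += 1
                in_deg[v] += 1
        u_min = min(S, key=lambda u: (out_deg[u], u))
        v_min = min(T, key=lambda v: (in_deg[v], v))
        # Remove from the side whose weakest vertex contributes less,
        # with the two sides reweighted by the guessed ratio c.
        if math.sqrt(c) * out_deg[u_min] <= in_deg[v_min] / math.sqrt(c):
            S.remove(u_min)
        else:
            T.remove(v_min)
    return best


def densest_by_guessing(n: int, edges: List[Edge], delta: float = 2.0) -> Tuple[float, Set[int], Set[int]]:
    """Try c = 1/n, delta/n, delta^2/n, ... up to n and keep the best pair."""
    best = (0.0, set(), set())
    c = 1.0 / n
    while c <= n:
        candidate = peel_for_ratio(n, edges, c)
        if candidate[0] > best[0]:
            best = candidate
        c *= delta
    return best
```

As a usage example, `densest_by_guessing(4, [(0, 2), (0, 3), (1, 2), (1, 3)])` peels the complete bipartite pattern down to `S = {0, 1}` and `T = {2, 3}`. A coarser grid (larger `delta`) means fewer runs of the subroutine but a guess that may sit further from the true ratio, which is the trade-off the $\sqrt{\delta}$ factor in the stated approximation appears to capture.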