Abstract: As networks continue to increase in size, current methods must be capable of handling large numbers of nodes and edges in order to be practically relevant. Instead of working directly with the entire (large) network, analyzing sub-networks has become a popular approach. Due to a network's inherent inter-connectedness, sub-sampling is not a trivial task. While this problem has gained attention in recent years, it has not received sufficient attention from the statistics community. In this work, we provide a thorough comparison of seven graph sub-sampling algorithms by applying them to divide-and-conquer algorithms for community structure and core-periphery (CP) structure. After discussing the various algorithms and sub-sampling routines, we derive theoretical results for the mis-classification rate of the divide-and-conquer algorithm for CP structure under various sub-sampling schemes. We then perform extensive experiments on both simulated and real-world data to compare the various methods. For the community detection task, we found that sampling nodes uniformly at random yields the best performance. For CP structure, on the other hand, there was no single winner, but algorithms which sampled core nodes at a higher rate consistently outperformed other sampling routines, e.g., random edge sampling and random walk sampling. The varying performance of the sampling algorithms on different tasks demonstrates the importance of carefully selecting a sub-sampling routine for the specific application.


### What problem does this paper attempt to solve?

This paper addresses the problem of graph sub-sampling in large-scale network analysis. As networks continue to grow, existing methods must handle large numbers of nodes and edges to be of practical value. Directly analyzing an entire large-scale network is usually impractical, so analyzing sub-networks has become a popular approach. However, because of a network's inherent inter-connectedness, sub-sampling is not a simple problem. Specifically, the paper focuses on the following points:

1. **Importance of sub-sampling**: Sub-sampling is crucial in large-scale network analysis because it breaks a large data set into multiple smaller data sets for processing, which significantly improves computational efficiency. Choosing an appropriate sub-sampling method matters, however, both for preserving the structural characteristics of the network and for keeping the analysis results valid.
2. **Comparison of sub-sampling methods**: The author compares seven sub-sampling algorithms and applies them to divide-and-conquer algorithms for identifying community structure and core-periphery (CP) structure. The performance of the different sub-sampling methods on these tasks is evaluated through theoretical analysis and experimental verification.
3. **Theoretical analysis**: The author derives theoretical results for the mis-classification rate of the divide-and-conquer algorithm for CP structure under different sub-sampling schemes. This helps to explain how the choice of sub-sampling method affects algorithm performance.
4. **Empirical research**: The various sub-sampling methods are compared through extensive experiments on simulated and real-world data. The results show that for the community detection task, random node sampling performs best, while for the CP structure identification task, algorithms that sample core nodes with a higher probability (such as edge sampling and random walk sampling) usually perform better.

In conclusion, by systematically comparing different sub-sampling methods, this paper attempts to provide guidance on selecting the most appropriate sub-sampling strategy for large-scale network analysis, so that both efficiency and accuracy are maintained across different tasks.
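The three sampling routines named above (random node, random edge, and random walk sampling) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the toy core-periphery graph, and the stopping rules are all hypothetical, and the paper's other sub-sampling algorithms are not shown.

```python
import random
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def random_node_sample(adj, m, rng):
    """Pick m distinct nodes uniformly at random."""
    return set(rng.sample(sorted(adj), m))


def random_edge_sample(edges, m, rng):
    """Visit edges in random order and keep endpoints until m nodes are held.

    A node with many edges is more likely to be picked up early, so
    high-degree nodes are favored.
    """
    shuffled = list(edges)
    rng.shuffle(shuffled)
    nodes = set()
    for u, v in shuffled:
        for w in (u, v):
            if len(nodes) < m:
                nodes.add(w)
        if len(nodes) >= m:
            break
    return nodes


def random_walk_sample(adj, m, rng, max_steps=100_000):
    """Walk from a random start node, collecting visited nodes.

    max_steps is a hypothetical guard so the walk stops even if fewer
    than m nodes are reachable from the start node.
    """
    current = rng.choice(sorted(adj))
    nodes = {current}
    for _ in range(max_steps):
        if len(nodes) >= m:
            break
        current = rng.choice(sorted(adj[current]))
        nodes.add(current)
    return nodes


# Toy core-periphery graph (an assumption for illustration only):
# nodes 0-4 form a fully connected core, and each of the periphery
# nodes 5-19 attaches to a single core node.
rng = random.Random(0)
edges = [(i, j) for i in range(5) for j in range(5) if i < j]
edges += [(p, p % 5) for p in range(5, 20)]
adj = build_adjacency(edges)

samples = {
    "node": random_node_sample(adj, 8, rng),
    "edge": random_edge_sample(edges, 8, rng),
    "walk": random_walk_sample(adj, 8, rng),
}
for name, nodes in samples.items():
    core_share = sum(n < 5 for n in nodes) / len(nodes)
    print(name, sorted(nodes), round(core_share, 2))
```

In a divide-and-conquer setting, each sampled node set would define an induced sub-network on which the community or CP detection method is run, with the results then combined across sub-networks. The printed core share gives a rough sense of how strongly each routine favors core nodes on this toy graph.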

Graph sub-sampling for divide-and-conquer algorithms in large networks

A new algorithm for extracting a small representative subgraph from a very large graph

Understanding Graph Sampling Algorithms for Social Network Analysis

Efficient Algorithms for Summarizing Graph Patterns

GraphSDH: A General Graph Sampling Framework with Distribution and Hierarchy

Sampling unknown large networks restricted by low sampling rates

Estimating the Number of Connected Components in a Graph via Subgraph Sampling

Large Graph Sampling Algorithm for Frequent Subgraph Mining

Preserving the topological properties of complex networks in network sampling

Empirical comparison of network sampling techniques

Subnetwork enumeration algorithms for multilayer networks

Sampling Subgraph Network with Application to Graph Classification

Efficiently Estimating Motif Statistics of Large Networks

Graph Sampling Approach for Reducing Computational Complexity of Large-Scale Social Network

Scalable and Robust Community Detection with Randomized Sketching

Towards Cost-efficient Sampling Methods

Albatross sampling: robust and effective hybrid vertex sampling for social graphs

Two provably consistent divide and conquer clustering algorithms for large networks

Graph Sampling for Scalable and Expressive Graph Neural Networks on Homophilic Graphs

Cluster-preserving Sampling Algorithm for Large-Scale Graphs

Efficient k-Clique Counting on Large Graphs: The Power of Color-Based Sampling Approaches