Graph sub-sampling for divide-and-conquer algorithms in large networks

Eric Yanchenko
2024-09-11
Abstract: As networks continue to increase in size, current methods must be capable of handling large numbers of nodes and edges in order to be practically relevant. Instead of working directly with the entire (large) network, analyzing sub-networks has become a popular approach. Due to a network's inherent inter-connectedness, sub-sampling is not a trivial task. While this problem has gained attention in recent years, it has not received sufficient attention from the statistics community. In this work, we provide a thorough comparison of seven graph sub-sampling algorithms by applying them to divide-and-conquer algorithms for community structure and core-periphery (CP) structure. After discussing the various algorithms and sub-sampling routines, we derive theoretical results for the mis-classification rate of the divide-and-conquer algorithm for CP structure under various sub-sampling schemes. We then perform extensive experiments on both simulated and real-world data to compare the various methods. For the community detection task, we found that sampling nodes uniformly at random yields the best performance. For CP structure on the other hand, there was no single winner, but algorithms which sampled core nodes at a higher rate consistently outperformed other sampling routines, e.g., random edge sampling and random walk sampling. The varying performance of the sampling algorithms on different tasks demonstrates the importance of carefully selecting a sub-sampling routine for the specific application.
Social and Information Networks, Computation
What problem does this paper attempt to address?
This paper aims to solve the problem of graph sub-sampling in large-scale network analysis. As the scale of networks continues to increase, existing methods need to be able to handle a large number of nodes and edges in order to have practical application value. It is usually impractical to directly analyze the entire large-scale network, so analyzing sub-networks has become a popular method. However, due to the complex inter-connectivity of the network itself, sub-sampling is not a simple problem. Specifically, this paper focuses on the following points:

1. **Importance of sub-sampling**: Sub-sampling is crucial in large-scale network analysis because it can break down large-scale data sets into multiple smaller data sets for processing, thereby significantly improving computational efficiency. However, choosing an appropriate sub-sampling method is very important for maintaining network structure characteristics and ensuring the effectiveness of analysis results.
2. **Comparison of sub-sampling methods**: The author compares seven sub-sampling algorithms and applies them to divide-and-conquer algorithms to identify community structure and core-periphery (CP) structure. Through theoretical analysis and experimental verification, the performance of different sub-sampling methods in these tasks is evaluated.
3. **Theoretical analysis**: The author derives theoretical results of the mis-classification rate of the divide-and-conquer algorithm for CP structure under different sub-sampling schemes. This helps to understand the impact of different sub-sampling methods on algorithm performance.
4. **Empirical research**: Through extensive experiments on simulated data and real-world data, the effects of various sub-sampling methods are compared.
The results show that for the community detection task, sampling nodes uniformly at random performs best. For the CP structure identification task there is no single winner, but algorithms that sample core nodes at a higher rate (such as edge sampling and random walk sampling) consistently perform better.

In conclusion, this paper attempts to provide guidance on how to select the most appropriate sub-sampling strategy in large-scale network analysis by systematically comparing different sub-sampling methods, so that both efficiency and accuracy are maintained across different tasks.
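
To make the difference between the sampling routines named above more concrete, the following is a minimal Python sketch of three of them: uniform node sampling, edge sampling, and random walk sampling. It is an illustration under stated assumptions, not the paper's implementation. The function names, the toy graph builder, the adjacency-dictionary representation, and the `restart` parameter of the random walk are all hypothetical choices made for this example.

```python
import random


def toy_core_periphery_graph(n_core=5, n_periphery=20):
    """Hypothetical toy graph: a fully connected core, with each
    periphery node attached to exactly one core node."""
    n = n_core + n_periphery
    adj = {u: set() for u in range(n)}
    for u in range(n_core):
        for v in range(u + 1, n_core):
            adj[u].add(v)
            adj[v].add(u)
    for p in range(n_core, n):
        c = p % n_core
        adj[p].add(c)
        adj[c].add(p)
    return adj


def uniform_node_sample(adj, k, rng):
    """Pick k distinct nodes uniformly at random."""
    return set(rng.sample(sorted(adj), k))


def edge_sample(adj, k, rng):
    """Draw edges uniformly at random and keep their endpoints until
    k distinct nodes are collected. High-degree nodes appear in more
    edges, so they are picked more often."""
    edges = sorted((u, v) for u in adj for v in adj[u] if u < v)
    reachable = {node for edge in edges for node in edge}
    if k > len(reachable):
        raise ValueError("k exceeds the number of nodes that have an edge")
    chosen = set()
    while len(chosen) < k:
        for node in rng.choice(edges):
            if len(chosen) < k:
                chosen.add(node)
    return chosen


def random_walk_sample(adj, k, rng, restart=0.1):
    """Walk along edges and keep visited nodes until k distinct nodes
    are collected. The restart probability (an assumption of this
    sketch) jumps to a uniformly chosen node so the walk cannot get
    trapped in a small component."""
    nodes = sorted(adj)
    if not 1 <= k <= len(nodes):
        raise ValueError("k must be between 1 and the number of nodes")
    current = rng.choice(nodes)
    chosen = {current}
    while len(chosen) < k:
        neighbors = sorted(adj[current])
        if not neighbors or rng.random() < restart:
            current = rng.choice(nodes)
        else:
            current = rng.choice(neighbors)
        chosen.add(current)
    return chosen


def induced_subgraph(adj, nodes):
    """Keep only the sampled nodes and the edges between them."""
    return {u: adj[u] & nodes for u in nodes}
```

On the toy graph, calling `edge_sample(adj, 10, random.Random(0))` and passing the result to `induced_subgraph` gives a sub-network that a divide-and-conquer routine could analyze on its own. Because core nodes have many more edges than periphery nodes here, edge and random walk sampling tend to include core nodes more often than uniform node sampling does, which is the property the summary above links to better CP performance.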