Abstract: Community detection is the problem of identifying tightly connected clusters of nodes within a network. Efficient parallel algorithms for this play a crucial role in various applications, especially as datasets expand to significant sizes. The Label Propagation Algorithm (LPA) is commonly employed for this purpose due to its ease of parallelization, rapid execution, and scalability. However, it may yield internally disconnected communities. This technical report introduces GSL-LPA, derived from our parallelization of LPA, namely GVE-LPA. Our experiments on a system with two 16-core Intel Xeon Gold 6226R processors show that GSL-LPA not only mitigates this issue but also surpasses FLPA, igraph LPA, and NetworKit LPA by 55x, 10,300x, and 5.8x, respectively, achieving a processing rate of 844 M edges/s on a 3.8 B edge graph. Additionally, GSL-LPA scales at a rate of 1.6x for every doubling of threads.
What problem does this paper attempt to address?
### Problems the paper attempts to solve
This paper attempts to solve an important problem in community detection: **internally disconnected communities**. Specifically:
- **Community detection** refers to identifying clusters of closely connected nodes in complex networks. This task has important applications in multiple fields, such as topic discovery, protein annotation, recommendation systems, and targeted advertising.
- **Label Propagation Algorithm (LPA)** is a widely used community detection method, favored for its ease of parallelization, fast execution, and strong scalability. However, a major problem with LPA is that it may generate **internally disconnected communities**: communities in which some pairs of member nodes are not joined by any path that stays inside the community.
- **Internally disconnected communities** reduce the accuracy and robustness of community detection, so a method is needed to detect and split them.
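To make the notion of an internally disconnected community concrete, the following is a minimal sequential sketch, not taken from the paper. It assumes the graph is an adjacency dictionary and the community assignment is a `labels` dictionary; the function name `disconnected_communities` is hypothetical.

```python
from collections import defaultdict, deque


def disconnected_communities(adj, labels):
    """Return labels of communities whose members are not all reachable
    from one another via paths that stay inside the community."""
    members = defaultdict(list)
    for node, label in labels.items():
        members[label].append(node)

    bad = set()
    for label, nodes in members.items():
        # BFS from one member, following only edges to same-label neighbors.
        seen = {nodes[0]}
        queue = deque([nodes[0]])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if labels[w] == label and w not in seen:
                    seen.add(w)
                    queue.append(w)
        # If some member was never reached, the community is disconnected.
        if len(seen) < len(nodes):
            bad.add(label)
    return bad


# Path graph 0-1-2-3-4: nodes 0 and 4 share label "a",
# but every path between them passes through community "b".
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
labels = {0: "a", 1: "b", 2: "b", 3: "b", 4: "a"}
print(disconnected_communities(adj, labels))  # {'a'}
```

In this toy example community "a" is reported because its two members are connected only through nodes of another community.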
### Solutions
To address this problem, the paper proposes **GSL-LPA**, an improved parallel label propagation algorithm derived from the authors' parallel LPA implementation, GVE-LPA. The main contributions of GSL-LPA include:
1. **Eliminating internally disconnected communities**: GSL-LPA splits internally disconnected communities in a post-processing step (Split Last, SL). Specific techniques include:
- **Minimum Label Propagation (LP)**: Reassigns community labels through minimum-label propagation.
- **Minimum Label Propagation with Pruning (LPP)**: Adds a pruning optimization to LP to skip unnecessary computation.
- **Breadth-First Search (BFS)**: Picks a node from each community and traverses the nodes reachable from it with BFS, so that every resulting community is internally connected.
2. **Performance improvement**: In the reported experiments, GSL-LPA not only resolves internally disconnected communities but also outperforms FLPA, igraph LPA, and NetworKit LPA by 55x, 10,300x, and 5.8x, respectively. Specifically:
- Processing speed: On a graph with 3.8 billion edges, GSL-LPA reaches a processing rate of 844 M edges/s.
- Parallel scalability: Performance improves by 1.6x for every doubling of the number of threads.
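The Split Last step described above can be sketched as follows. This is a simplified, sequential illustration under stated assumptions, not the paper's parallel implementation: the graph is an adjacency dictionary with integer node ids, the function names are hypothetical, and the pruning optimization of LPP is omitted.

```python
from collections import deque


def split_last_bfs(adj, labels):
    """Give each internally connected piece of a community its own label,
    using BFS restricted to same-community edges."""
    new_labels = {}
    for start in adj:
        if start in new_labels:
            continue
        new_labels[start] = start  # assumption: reuse the BFS root id as label
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in new_labels and labels[w] == labels[u]:
                    new_labels[w] = start
                    queue.append(w)
    return new_labels


def split_last_min_label(adj, labels):
    """Same split via minimum-label propagation: each node repeatedly adopts
    the smallest label among itself and its same-community neighbors."""
    new_labels = {v: v for v in adj}
    changed = True
    while changed:
        changed = False
        for u in adj:
            candidates = [new_labels[w] for w in adj[u] if labels[w] == labels[u]]
            best = min(candidates + [new_labels[u]])
            if best < new_labels[u]:
                new_labels[u] = best
                changed = True
    return new_labels


# Path graph 0-1-2-3-4 where community "a" = {0, 4} is internally disconnected.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
labels = {0: "a", 1: "b", 2: "b", 3: "b", 4: "a"}
print(split_last_bfs(adj, labels))        # {0: 0, 1: 1, 2: 1, 3: 1, 4: 4}
print(split_last_min_label(adj, labels))  # {0: 0, 1: 1, 2: 1, 3: 1, 4: 4}
```

On this example both variants split community "a" into two singleton communities and leave "b" intact. The BFS variant visits each node once, while the propagation variant may need several sweeps before labels stop changing.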
### Experimental verification
The paper evaluates the effectiveness and performance of GSL-LPA through experiments on a server equipped with two 16-core Intel Xeon Gold 6226R processors. The experimental data comes from the SuiteSparse Matrix Collection, covering graphs from 3.07 million to 214 million nodes and from 25.4 million to 3.8 billion edges.
### Summary
In conclusion, GSL-LPA addresses the label propagation algorithm's tendency to produce internally disconnected communities by splitting them in a post-processing step, while running faster than FLPA, igraph LPA, and NetworKit LPA in the reported experiments.