Scalable Triangle Discovery Algorithm for Large Scale-Free Network with Limited Internal Memory

Xiaoping Zhou,Xun Liang,Zhi Tang
DOI: https://doi.org/10.1109/tbdata.2018.2889120
2020-01-01
IEEE Transactions on Big Data
Abstract:In complex network, the links among the nodes fabricate thousands of polygons, where triangles are the most foundamental ones. As more links are added to the network, more polygons are split to be triangles. Discovering the triangles can help understand the current status of the network. Due to the limitations of internal memory, discovering all the triangles in a large-scale real-world network, e.g., online social network (OSN), is challenging and has intrigued researchers for years. Currently, parallel computation frameworks, e.g., MapReduce, have been employed to address this issue. The last reducer is required to load all the neighbors of each node into internal memory, which causes “the curse of the last reducer.” The scale-free characteristic in real-world networks further exacerbate the applicability of the parallel triangle discovery algorithms. The current efforts on this issue mainly focus on dividing a large network into several small subnetworks and later employ internal memory algorithms. In this paper, we address this issue from a different perspective by employing the power-law degree distribution in scale-free networks. We theoretically and empirically prove that one node in scale-free network has a limited number of neighbors with a higher degree. Employing this finding, an novel parallel triangle discovery algorithm called degree partition (DePart) is designed. Since only the portion of neighbors with higher degrees are loaded into internal memory in the reducers, DePart solves “the curse of the last reducer” problem and works efficiently in clusters with limited internal memory. We develop DePart using MapReduce framework and examine the performance of DePart on five large real-world networks. The empirical testing results demonstrate that DePart has a favorable tradeoff between internal memory and total disk space and outperforms the current state-of-the-art MapReduce-based triangle discovery algorithms.
What problem does this paper attempt to address?