Efficient Maximal Biclique Enumeration on Large Uncertain Bipartite Graphs

Jianhua Wang, Jianye Yang, Ziyi Ma, Chengyuan Zhang, Shiyu Yang, Wenjie Zhang
DOI: https://doi.org/10.1109/TKDE.2023.3272110
IF: 9.235
2023-01-01
IEEE Transactions on Knowledge and Data Engineering
Abstract: In this article, we study the problem of maximal biclique enumeration on large uncertain bipartite graphs. Given an uncertain bipartite graph G = (U, V, E, p), a probability threshold tau, and two size constraints alpha and beta, we aim to efficiently enumerate all maximal tau-bicliques in G, where a maximal tau-biclique B(L, R) is a complete subgraph of G such that (1) the probability of B is no less than tau, (2) |L| >= alpha and |R| >= beta, and (3) B is maximal among the complete subgraphs satisfying (1) and (2). This problem has many applications, such as biclustering of gene expression data, fraud detection, and similar group identification. Despite this wide range of applications, to the best of our knowledge there are no efficient and scalable solutions to this problem in the literature. The problem is computationally challenging due to its #P-completeness. In this article, we propose a competitive branch-and-bound method, namely MBEN, which explores the search space in a depth-first manner with a variety of pruning techniques. To improve the performance of MBEN, we propose several novel and efficient search optimizations. First, we always select the side with fewer candidates to expand the search space; this strategy gives us a chance to prune fruitless branches early. Second, we devise an advanced pruning technique that considers size pruning and probability pruning at the same time to boost the pruning capacity. Last, we implement MBEN with pre-allocated arrays and pointer-maintenance techniques, so that the frequent creation of working sets can be replaced by array element swapping operations. In addition, we introduce useful graph reduction techniques to further accelerate the computation. Comprehensive performance studies on 10 real datasets demonstrate that our proposals can outperform the baseline methods by more than two orders of magnitude.
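To make the three conditions in the definition concrete, the following is a minimal brute-force sketch, not the MBEN algorithm from the paper. It assumes that the probability of a biclique is the product of its edge probabilities (that is, edges exist independently), that tau > 0, and that alpha, beta >= 1; the abstract does not state these details. All function and variable names are hypothetical, and the exhaustive search is only feasible on toy inputs.

```python
from itertools import combinations
from math import prod


def biclique_probability(L, R, p):
    """Product of edge probabilities over L x R (assumed independent edges).

    A missing edge contributes 0.0, so an incomplete subgraph has probability 0.
    """
    return prod(p.get((u, v), 0.0) for u in L for v in R)


def is_tau_biclique(L, R, p, tau, alpha, beta):
    """Conditions (1) and (2): probability and size constraints."""
    return (len(L) >= alpha and len(R) >= beta
            and biclique_probability(L, R, p) >= tau)


def enumerate_maximal_tau_bicliques(U, V, p, tau, alpha, beta):
    """Exhaustive enumeration over all subset pairs (toy graphs only)."""
    found = []
    for i in range(alpha, len(U) + 1):
        for L in combinations(sorted(U), i):
            for j in range(beta, len(V) + 1):
                for R in combinations(sorted(V), j):
                    if is_tau_biclique(L, R, p, tau, alpha, beta):
                        found.append((frozenset(L), frozenset(R)))
    # Condition (3): keep only bicliques not contained in a different
    # biclique that also satisfies (1) and (2).
    return [
        (L, R) for (L, R) in found
        if not any((L, R) != (L2, R2) and L <= L2 and R <= R2
                   for (L2, R2) in found)
    ]


if __name__ == "__main__":
    # Hypothetical toy graph: edge -> existence probability.
    p = {("u1", "v1"): 0.9, ("u1", "v2"): 0.8,
         ("u2", "v1"): 0.9, ("u2", "v2"): 0.55}
    result = enumerate_maximal_tau_bicliques(
        {"u1", "u2"}, {"v1", "v2"}, p, tau=0.5, alpha=1, beta=1)
    for L, R in result:
        print(sorted(L), sorted(R))
```

On this toy graph the full 2x2 biclique has probability below 0.5, so it is rejected and three smaller bicliques remain maximal. A branch-and-bound method such as the one described in the abstract avoids this exponential enumeration by pruning branches on size and probability during a depth-first search.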