Efficient k-Clique Count Estimation with Accuracy Guarantee

Lijun Chang, Rashmika Gamage, Jeffrey Xu Yu
DOI: https://doi.org/10.14778/3681954.3682032
IF: 2.5
2024-07-01
Proceedings of the VLDB Endowment
Abstract: Counting and enumerating all occurrences of k-cliques, i.e., complete subgraphs with k vertices, in a large graph G is a fundamental problem with many applications. However, exact solutions are often infeasible because the number of k-cliques grows exponentially as k increases. Thus, a more practical approach is to approximately count and uniformly sample k-cliques. Turán-Shadow and DPColorPath are two state-of-the-art algorithms for approximately counting k-cliques. The general idea is to first construct a sample space that is a superset of all k-cliques in G, and then sample t elements uniformly at random (u.a.r.) from the sample space for a pre-determined t; the k-clique count is estimated as the sample space size multiplied by the ratio of k-cliques among the t samples. Although techniques have been proposed in Turán-Shadow for setting t to ensure the estimation accuracy, the theoretically chosen t is often too large to be practical. As a result, both of the existing algorithms use a fixed t in their implementations and thus do not offer an accuracy guarantee. In this paper, we propose the first randomized algorithm that achieves the theoretical estimation accuracy and practical efficiency at the same time. Unlike the existing algorithms, we pre-determine the number s of k-clique samples that are required to achieve the estimation accuracy. Consequently, we can estimate the running time of the sampling stage (i.e., the time taken to sample s k-cliques) for a given sample space. Then, we propose to balance the time of constructing/refining the sample space and the time of the sampling stage, by stopping the refinement of the sample space once the elapsed time is comparable to the estimated time of the sampling stage. Extensive empirical studies on large real graphs show that our algorithm SR-kCCE provides an accurate k-clique count estimation and also runs efficiently.
As a by-product, our algorithm can also be used for efficiently sampling a certain number of k-cliques u.a.r. from G.
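The two sampling regimes contrasted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' SR-kCCE implementation, nor Turán-Shadow or DPColorPath: as an assumption for illustration, the sample space is simply all k-subsets of the vertices (a valid but very loose superset of the k-cliques), and the function names `estimate_fixed_t`, `estimate_fixed_s`, and `toy_graph` are hypothetical. The fixed-s variant assumes the estimate is formed as sample space size times s divided by the number of draws; the paper's exact estimator may differ.

```python
import itertools
import math
import random


def toy_graph():
    """Hypothetical test graph: a 6-clique on vertices 0..5 plus a path 5-6-...-11."""
    adj = {v: set() for v in range(12)}
    edges = list(itertools.combinations(range(6), 2))
    edges += [(i, i + 1) for i in range(5, 11)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def is_clique(adj, vertices):
    """Return True if every pair of the given vertices is adjacent."""
    return all(v in adj[u] for u, v in itertools.combinations(vertices, 2))


def exact_count(adj, k):
    """Brute-force k-clique count; only feasible for tiny graphs."""
    return sum(is_clique(adj, c) for c in itertools.combinations(sorted(adj), k))


def estimate_fixed_t(adj, k, t, rng):
    """Fixed-t regime: draw t elements u.a.r., then scale the hit ratio."""
    nodes = sorted(adj)
    space_size = math.comb(len(nodes), k)  # naive sample space: all k-subsets
    hits = sum(is_clique(adj, rng.sample(nodes, k)) for _ in range(t))
    return space_size * hits / t


def estimate_fixed_s(adj, k, s, rng, max_draws=1_000_000):
    """Fixed-s regime: keep drawing until s k-cliques are sampled.

    Returns (estimate, draws). max_draws is a safety cap added for this
    sketch; the estimate uses hits / draws (an assumed estimator).
    """
    nodes = sorted(adj)
    space_size = math.comb(len(nodes), k)
    hits = draws = 0
    while hits < s and draws < max_draws:
        draws += 1
        hits += is_clique(adj, rng.sample(nodes, k))
    return space_size * hits / draws, draws
```

In the fixed-s regime the number of draws adapts to how dense the k-cliques are in the sample space: a tighter sample space needs fewer draws to collect s cliques, which is the trade-off between refinement time and sampling time that the abstract describes.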
computer science, information systems, theory & methods