A Compact and Accurate Sketch for Estimating a Large Range of Set Difference Cardinalities

Peng Jia, Pinghui Wang, Rundong Li, Junzhou Zhao, Junlan Feng, Xidian Wang, Xiaohong Guan
DOI: https://doi.org/10.1109/icde60146.2024.00110
2024-01-01
Abstract: Computing set difference cardinalities is a critical task in database optimization, network management, and anomaly detection. Because computational and memory resources are limited, computing set difference cardinalities exactly is often impractical in real-world applications. To address this issue, sketch methods such as Odd sketch, Tug-of-War sketch, and HyperLogLog sketch can be extended to provide approximate estimates of set difference cardinalities. They use a family of hash functions to compress all elements of a set into a compact data structure. Unfortunately, Odd sketch suffers from a limited estimation range, while Tug-of-War sketch and HyperLogLog sketch suffer from large estimation errors and high computational costs. In this paper, we design GXBits, a novel bit-array data structure that estimates set difference cardinalities quickly and accurately over a large range. In GXBits, the probability that a bit records its corresponding elements follows a variant of the geometric distribution and varies across bits. We conduct extensive experiments on synthetic and real-world datasets. Experimental results demonstrate that GXBits is more computationally and memory efficient than existing methods, and improves on their estimation accuracy by up to 221.3 times.
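To make the sketch-based approach concrete, the following is a minimal Python illustration of the Odd sketch baseline named in the abstract. It is not GXBits, whose construction the abstract does not specify. The function names, the choice of BLAKE2b as the hash, and the sketch size `m` are assumptions made for illustration only. Each element flips one hashed bit, so XORing two sketches cancels shared elements and leaves a parity sketch of the symmetric difference, whose size is estimated from the number of set bits.

```python
import hashlib
import math


def odd_sketch(items, m):
    """Build an m-bit parity sketch, stored as a Python int.

    Each distinct item flips the bit at its hashed position, so a bit
    ends up 1 exactly when an odd number of items hashed to it.
    """
    sketch = 0
    for item in set(items):
        # Hypothetical hash choice: 64-bit BLAKE2b digest reduced modulo m.
        digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
        position = int.from_bytes(digest, "big") % m
        sketch ^= 1 << position
    return sketch


def estimate_symmetric_difference(sketch_a, sketch_b, m):
    """Estimate |A symmetric-difference B| from two m-bit sketches (m > 2)."""
    # Shared elements flip the same bit in both sketches and cancel here.
    ones = bin(sketch_a ^ sketch_b).count("1")
    if 2 * ones >= m:
        # About half the bits are set: the sketch is saturated, which is
        # the limited estimation range mentioned in the abstract.
        return math.inf
    return math.log(1 - 2 * ones / m) / math.log(1 - 2 / m)


if __name__ == "__main__":
    m = 4096
    a = odd_sketch(range(0, 1000), m)
    b = odd_sketch(range(100, 1100), m)
    # The true symmetric difference has 200 elements.
    print(estimate_symmetric_difference(a, b, m))
```

As the difference grows relative to `m`, the fraction of set bits approaches one half and the estimator stops being informative, which illustrates the range limitation that motivates a design with per-bit recording probabilities.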