Half-Xor: A Fully-Dynamic Sketch for Estimating the Number of Distinct Values in Big Tables
Pinghui Wang,Dongdong Xie,Junzhou Zhao,Jinsong Li,Zhicheng Li,Rundong Li,Yang Ren,Jia Di
DOI: https://doi.org/10.1109/tkde.2024.3359710
IF: 9.235
2024-01-01
IEEE Transactions on Knowledge and Data Engineering
Abstract:Calculating the number of distinct values (i.e., NDV) in a column of a big table is costly yet fundamental to a variety of database applications such as data compression and profiling. To reduce the high time and space cost, a number of sketch methods (e.g., HyperLogLog) have been proposed, which estimate the NDV from a constructed compact data summary of distinct values. However, these methods fail or are costly to manage fully-dynamic scenarios where data is often inserted into and deleted from the table. To solve this issue, we propose a novel sketch method, Half-Xor. Our Half-Xor sketch consists of a compact bit matrix and a small counter array, and it needs to set a few bits and update a counter when handling a data insertion/deletion. Compared with the state-of-the-art mergeable method, our experimental results demonstrate that our method Half-Xor is up to 6.6 times more accurate under the same memory usage and reduces the memory usage by up to 16 times to achieve the same estimation accuracy.
computer science, information systems, artificial intelligence,engineering, electrical & electronic