Variable-length Encoding Framework: A Generic Framework for Enhancing the Accuracy of Approximate Membership Queries

Haipeng Dai, Hancheng Wang, Zhipeng Chen, Jiaqi Zheng, Meng Li, Rong Gu, Chen Tian, Wanchun Dou
DOI: https://doi.org/10.1109/icdm58522.2023.00015
2023-01-01
Abstract: Approximate membership query (AMQ) data structures can efficiently indicate whether an element exists in a data set. They are therefore widely used in data mining applications such as IoT streaming data mining, anomaly detection, duplicate detection, record linkage, and community discovery. The amount of data to be processed in real-world applications often changes frequently and dynamically. Thus, before an AMQ data structure is used, its capacity must be configured to the maximum number of elements that will be stored during runtime. We observe that when the number of elements stored in an AMQ data structure is lower than its capacity, a significant amount of space is wasted, making the false positive rate much higher than expected. To tackle this problem, we propose the variable-length encoding framework, which dynamically adjusts the encoding length of each element according to the number of elements stored in the AMQ data structure. Based on this design, the framework can make full use of the memory space allocated to AMQ data structures, thereby improving space efficiency and reducing the false positive rate. In addition, as a general encoding scheme, the framework can be applied to different types of AMQ data structures. Theoretical analysis and evaluation results show that AMQ data structures using the variable-length encoding framework have significantly lower false positive rates than state-of-the-art AMQ data structures. For example, when the load factor is 25%, the framework reduces the false positive rate of AMQ data structures by 88.15% on average (up to 99.40%).
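
The space argument in the abstract can be illustrated with simple arithmetic. The sketch below is not the paper's construction and does not reproduce its measured results. It assumes a hypothetical toy model: a fingerprint table in which each query probes exactly one slot, every slot holds a fixed number of fingerprint bits, and the load factor is the fraction of occupied slots. All function names are illustrative.

```python
def wasted_fraction(load_factor: float) -> float:
    """Fraction of allocated slots that hold no element (toy model assumption)."""
    return 1.0 - load_factor


def bits_per_element_budget(load_factor: float, fingerprint_bits: int) -> float:
    """Bits available per stored element if the whole allocation were shared
    among only the elements actually stored."""
    return fingerprint_bits / load_factor


def fixed_length_fpr(load_factor: float, fingerprint_bits: int) -> float:
    """Toy false positive rate with fixed-length fingerprints: the probed slot
    must be occupied AND its fingerprint must match by chance."""
    return load_factor * 2.0 ** (-fingerprint_bits)


def idealized_variable_length_fpr(load_factor: float, fingerprint_bits: int) -> float:
    """Idealized bound for the toy model: every stored element uses the full
    per-element bit budget, with no metadata or alignment overhead."""
    return 2.0 ** (-bits_per_element_budget(load_factor, fingerprint_bits))


if __name__ == "__main__":
    load, bits = 0.25, 8  # 25% load factor, 8-bit fingerprints (assumed values)
    print("wasted space:", wasted_fraction(load))
    print("bit budget per element:", bits_per_element_budget(load, bits))
    print("fixed-length FPR (toy):", fixed_length_fpr(load, bits))
    print("idealized variable-length FPR (toy):", idealized_variable_length_fpr(load, bits))
```

In this toy model, a table at 25% load leaves three quarters of its slots empty, so each stored element could in principle use four times as many bits. A real scheme pays for bookkeeping and cannot reach this idealized bound, which is consistent with the more modest reductions the abstract reports.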